Appendix D. Tokenizer plugin

1. Introduction
2. Installation and command line use
3. Mac OS X specifics
4. Troubleshooting

1. Introduction

Tokenizers (or stemmers) improve the quality of matches by recognizing inflected words in source and translation memory data. They also improve glossary matching.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish". This is especially useful for languages that inflect stem words with prefixes and suffixes. To borrow an example from Slovenian, here is the adjective "lep" ("beautiful") in several of its grammatically correct forms:

  • lep, lepa, lepo - singular: masculine, feminine, neuter

  • lepši, lepša, lepše - comparative, nominative: masculine, feminine, neuter, or the plural form of the adjective

  • najlepših - superlative, plural, genitive for masculine, feminine and neuter
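
The idea of reducing inflected forms to a shared root can be illustrated with a deliberately naive sketch. The function name toy_stem and its four-suffix rule are invented for this illustration; this is not the algorithm used by the Lucene tokenizers, which are language-specific and far more careful.

```shell
# Toy suffix stripper, for illustration only (hypothetical helper,
# NOT the Lucene algorithm): removes one English suffix from the
# end of a word, if one of the listed suffixes is present.
toy_stem() {
  printf '%s\n' "$1" | sed -E 's/(ing|ed|er|s)$//'
}

# All four forms from the text collapse to the same root.
for word in fishing fished fish fisher cats; do
  printf '%s -> %s\n' "$word" "$(toy_stem "$word")"
done
# fishing -> fish
# fished -> fish
# fish -> fish
# fisher -> fish
# cats -> cat
```

A rule this crude breaks quickly: it turns "stemming" into "stemm" rather than "stem", and it cannot handle prefixes such as the "naj" in "najlepših". That is why a dedicated tokenizer for each language is worth installing.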

2. Installation and command line use

A tokenizer package adapted from the Lucene project is distributed as an OmegaT plug-in at http://sourceforge.net/projects/omegat-plugins/files/. Download the most recent files (OmegaT-tokenizers_0.4_2-2.1.zip at the time of this writing).

To install the tokenizer, create a folder named "plugins" in the folder where OmegaT.jar is found and unpack the downloaded archive into that folder.
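
On systems with a command line, the installation step can be sketched as follows. The function install_tokenizers is a hypothetical helper written for this example, and it assumes that the Info-ZIP "unzip" program is available; both paths are supplied by you.

```shell
# Hypothetical helper, not part of OmegaT.
#   $1 = folder that contains OmegaT.jar
#   $2 = path to the downloaded tokenizer archive (.zip)
# Assumes the "unzip" program is installed.
install_tokenizers() {
  if [ ! -f "$1/OmegaT.jar" ]; then
    echo "OmegaT.jar not found in $1" >&2
    return 1
  fi
  # Create the plugins folder next to OmegaT.jar, then unpack into it.
  mkdir -p "$1/plugins" &&
  unzip -o "$2" -d "$1/plugins"   # -o: overwrite files from an earlier install
}
```

It would be called with your own locations, for example: install_tokenizers /path/to/OmegaT /path/to/OmegaT-tokenizers_0.4_2-2.1.zip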

To run OmegaT with the tokenizer you need to specify which tokenizer you'll use for the source language and which tokenizer you'll use for the target language. The syntax is as follows:

java -jar OmegaT.jar --ITokenizer=[source language tokenizer name] --ITokenizerTarget=[target language tokenizer name]

The tokenizer names are given in the Readme.txt file distributed with the tokenizer files. For example, if you wish to use the Lucene CJK tokenizer in source and the Lucene French tokenizer in target, your command will look like this:

java -jar OmegaT.jar --ITokenizer=org.omegat.plugins.tokenizer.LuceneCJKTokenizer --ITokenizerTarget=org.omegat.plugins.tokenizer.LuceneFrenchTokenizer
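
Since the command is long, it may be convenient to keep it in a small wrapper script. The following is only a sketch: the function name run_omegat and the two variable names are invented here, and the script assumes it is run from the folder that contains OmegaT.jar.

```shell
#!/bin/sh
# Tokenizer class names, taken from the example above
# (see Readme.txt in the plug-in package for the full list).
SOURCE_TOKENIZER=org.omegat.plugins.tokenizer.LuceneCJKTokenizer
TARGET_TOKENIZER=org.omegat.plugins.tokenizer.LuceneFrenchTokenizer

# Hypothetical helper: start OmegaT with both tokenizers.
# Assumes the current folder contains OmegaT.jar.
run_omegat() {
  java -jar OmegaT.jar \
    --ITokenizer="$SOURCE_TOKENIZER" \
    --ITokenizerTarget="$TARGET_TOKENIZER"
}
```

Adding a final line that reads run_omegat turns this into a complete launcher. To switch language pairs, only the two variables need to change.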

3. Mac OS X specifics

If you wish to use the tokenizers with the Mac OS X OmegaT.app package, the tokenizer installation described above applies (right-click on OmegaT.app to find the location of OmegaT.jar), but you need to specify the tokenizer names in the info.plist file, which contains the Java launch options. Follow the instructions above to access the info.plist file and edit it so that it looks as follows for the example just given:


<key>VMOptions</key>
<string>-Xmx1024M</string>

<key>Arguments</key>
<array>
  <string>--ITokenizer=org.omegat.plugins.tokenizer.LuceneCJKTokenizer</string>
  <string>--ITokenizerTarget=org.omegat.plugins.tokenizer.LuceneFrenchTokenizer</string>
</array>
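
After editing, a quick text search can confirm that both options made it into the file. This is a rough sketch: plist_has_tokenizers is a hypothetical helper, the path to info.plist is passed in by you, and the check only looks for the two option strings without validating the rest of the file.

```shell
# Hypothetical helper, not part of OmegaT.
#   $1 = path to the edited info.plist
# Succeeds only if both tokenizer options appear as <string> entries.
plist_has_tokenizers() {
  grep -q -e '<string>--ITokenizer=' "$1" &&
  grep -q -e '<string>--ITokenizerTarget=' "$1"
}
```

For example: plist_has_tokenizers /path/to/info.plist && echo "both tokenizer options found"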

4. Troubleshooting

To make sure that the tokenizers are being used, open a project and check the log information in the console. With the example above, it should look like this:


84528: Info: Source tokenizer: org.omegat.plugins.tokenizer.LuceneCJKTokenizer 
84528: Info: Target tokenizer: org.omegat.plugins.tokenizer.LuceneFrenchTokenizer 

The numbers on the left are likely to be different on your system. What matters is that the source and target tokenizer names specified in the start-up options match those shown in the log. If the tokenizers are not properly launched, the log will look like this:


12719: Info: Source tokenizer: org.omegat.core.matching.Tokenizer 
12719: Info: Target tokenizer: org.omegat.core.matching.Tokenizer 
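
If the console output has been saved or piped, the comparison can be automated. The sketch below is an illustration under assumptions: tokenizers_loaded is a hypothetical helper that reads log text on standard input, and it relies only on the two line formats shown above.

```shell
# Hypothetical helper, not part of OmegaT.
# Reads console log text on standard input and prints the tokenizer lines.
# Succeeds only if at least two tokenizer lines are present and none of
# them names the built-in default class org.omegat.core.matching.Tokenizer.
tokenizers_loaded() {
  awk '
    /Info: (Source|Target) tokenizer: / {
      seen++
      if ($NF == "org.omegat.core.matching.Tokenizer") fallback++
      print
    }
    END { exit !(seen >= 2 && fallback == 0) }
  '
}
```

For example, with the console output saved to a file of your choosing: tokenizers_loaded < console-output.txt || echo "plug-in tokenizers not in use"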

With the Mac OS X OmegaT.app package, double-click on the JavaApplicationStub located in /OmegaT.app/Contents/MacOS/ (see above to access it) to launch OmegaT from the console and get immediate access to the log.