1.2. Tokenizers and Analyzers
Most of the linguistic work is performed by Tokenizers and Analyzers. Tokenizers identify the words in text. Analyzers determine their linguistic attributes such as lemmas and parts of speech. Factories are used to configure and create these objects.
RBL-JE integrates easily with multi-threaded architectures, but to avoid performance penalties it does not make promiscuous use of locks. Instead, most of the objects and interfaces in RBL-JE are either read-only, re-entrant objects or read-write, per-thread objects. TokenizerFactory and AnalyzerFactory objects are hybrids: they have both thread-safe and per-thread methods. The create methods of these factories are thread-safe because they do not alter any data within the factories, but the setOption and addUserDefinedDictionary methods are not thread-safe because they do alter data within the factory. The Tokenizers and Analyzers these factories create are always meant to be used on a per-thread basis because they are not re-entrant and do alter their own internal data. You can use a factory across multiple threads to create objects as long as calls to the factory methods for setting options [86] or adding user dictionaries are synchronized appropriately. Each object created by a factory must be created and used by only one thread, which need not be the thread that initialized the factory.
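The contract above can be sketched schematically. Factory and Worker below are illustrative placeholders, not the real TokenizerFactory and Tokenizer classes: configuration happens before the factory is shared, and each thread creates and uses its own worker.

```java
import java.util.HashMap;
import java.util.Map;

// Schematic stand-ins for the pattern described above: Factory and Worker
// are placeholders, not the real RBL-JE classes.
public class ThreadingSketch {

    /** Stand-in for a factory: create() only reads state, setOption() mutates it. */
    static class Factory {
        private final Map<String, String> options = new HashMap<>();

        // Not thread-safe: call before sharing the factory, or synchronize externally.
        void setOption(String key, String value) {
            options.put(key, value);
        }

        // Thread-safe by construction: reads the configuration, never mutates it.
        Worker create() {
            return new Worker(Map.copyOf(options));
        }
    }

    /** Stand-in for a Tokenizer/Analyzer: holds mutable state, so one per thread. */
    static class Worker {
        private final Map<String, String> options;
        private long tokensSeen; // mutable per-instance state

        Worker(Map<String, String> options) {
            this.options = options;
        }

        int process(String text) {
            int n = text.trim().split("\\s+").length;
            tokensSeen += n;
            return n;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Factory factory = new Factory();
        factory.setOption("nfkcNormalize", "true"); // configure before sharing

        // The factory is shared across threads; each thread makes its own Worker.
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                Worker w = factory.create(); // safe: create() only reads
                w.process("one document per thread");
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("done");
    }
}
```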
1.2.1. Tokenizers
TokenizerFactory produces a language-specific Tokenizer that processes documents, producing a stream of tokens.
For the European languages, the Tokenizer implements Unicode Standard Annex #29¹ guidelines for determining boundaries between sentences and for breaking each sentence into individual tokens.
For Chinese, Japanese, and Thai, the Tokenizer determines sentence boundaries, and then uses statistical models to segment each sentence into individual tokens. If Latin-script or other non-Chinese, non-Japanese, or non-Thai fragments greater than a certain length (defined by TokenizerOption.minNonHanRegionLength; the default is 10) are embedded in the Chinese, Japanese, or Thai text, then the Tokenizer applies default Unicode tokenization to those fragments.
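The length test implied by minNonHanRegionLength can be sketched with the JDK's script property. This is an illustration of the idea only, not the actual RBL-JE segmentation code, and the set of scripts treated as "Han-like" here (Han, Hiragana, Katakana, Thai) is an assumption for the demo.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: collect maximal runs of characters whose script is not Han,
// Hiragana, Katakana, or Thai, keeping only runs of at least minLength
// code points. Such runs would get default Unicode tokenization.
public class NonHanRegions {

    static boolean isHanLike(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.THAI;
    }

    static List<String> nonHanRegions(String text, int minLength) {
        List<String> regions = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isHanLike(cp)) {
                if (run.codePointCount(0, run.length()) >= minLength) {
                    regions.add(run.toString());
                }
                run.setLength(0);
            } else {
                run.appendCodePoint(cp);
            }
        });
        if (run.codePointCount(0, run.length()) >= minLength) {
            regions.add(run.toString());
        }
        return regions;
    }

    public static void main(String[] args) {
        // "Rosette" is only 7 characters, below the default threshold of 10,
        // so it would not be treated as a separate non-Han region.
        System.out.println(nonHanRegions("これはRosetteです", 10));
        // A 24-character Latin fragment qualifies.
        System.out.println(nonHanRegions("これはRosette Base Linguisticsです", 10));
    }
}
```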
For Chinese and Japanese, in addition to the statistical model described above, RBL-JE includes the Chinese Language Analyzer (CLA) and Japanese Language Analyzer (JLA) modules, which are optimized for search and are compatible with the Chinese and Japanese language processors in the native (C++) product. They are activated with the alternativeTokenization enum in BaseLinguisticsOption or TokenizerOption, and parameters are passed with the corresponding alternativeTokenizationOptions enum.
For Hebrew, the Tokenizer also generates a lemma and a Semitic root for each token.
For all languages (except Hebrew and Arabic), the RBL-JE Tokenizer can apply Normalization Form KC (NFKC), as specified in Unicode Standard Annex #15², to normalize the tokens. NFKC normalization is turned off by default; use the TokenizerOption.nfkcNormalize enum to turn it on.
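The effect of NFKC folding is easy to see with the JDK alone. The snippet below uses java.text.Normalizer, not RBL-JE, to show what NFKC normalization does to compatibility characters such as ligatures and fullwidth forms:

```java
import java.text.Normalizer;

// NFKC folds compatibility characters into canonical equivalents:
// the ligature "ﬁ" becomes "fi", fullwidth "ＡＢＣ１２３" becomes "ABC123".
public class NfkcDemo {
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(nfkc("ﬁle"));        // prints "file"
        System.out.println(nfkc("ＡＢＣ１２３")); // prints "ABC123"
    }
}
```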
See also Arabic, Persian, and Urdu Token Analysis [103], CLA and JLA User Dictionaries [164], and Japanese Lemma Normalization [108].
1. http://www.unicode.org/reports/tr29/
2. http://unicode.org/reports/tr15/
1.2.2. Analyzers
AnalyzerFactory produces a language-specific Analyzer [19] that uses dictionaries and statistical analysis to add analysis objects to tokens.
Analyzers can be used for all languages except Hebrew. For Hebrew, the Tokenizer returns tokens with analyses already generated. Lemmas do not apply to Chinese or Thai (apart from numbers), and there is no Chinese or Thai lemma dictionary.
For each token and normalized form in the token stream, the Analyzer performs a dictionary lookup starting with the user dictionaries, if any, followed by the RBL-JE dictionary. During lookup, RBL-JE ignores the context in which the token or normalized form appears.
Once the Analyzer has found one or more lemmas in a dictionary, it does not consult additional dictionaries. In other words, if two user dictionaries are specified and the Analyzer finds a lemma in the first dictionary, it does not consult the second user dictionary or the RBL-JE dictionary.
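This first-match-wins lookup across an ordered dictionary chain can be sketched as follows. The Map-based dictionaries here are stand-ins for the real dictionary objects:

```java
import java.util.List;
import java.util.Map;

// Sketch of the lookup order described above: user dictionaries first,
// then the built-in dictionary, stopping at the first dictionary that
// yields any lemmas for the token.
public class DictionaryChain {

    static List<String> lookup(String token, List<Map<String, List<String>>> chain) {
        for (Map<String, List<String>> dict : chain) {
            List<String> lemmas = dict.get(token);
            if (lemmas != null && !lemmas.isEmpty()) {
                return lemmas; // later dictionaries are never consulted
            }
        }
        return List.of(); // nothing found; a guesser would take over here
    }

    public static void main(String[] args) {
        Map<String, List<String>> userDict =
            Map.of("IBM", List.of("International Business Machines"));
        Map<String, List<String>> builtIn =
            Map.of("IBM", List.of("IBM"), "mice", List.of("mouse"));

        // The user entry wins; the built-in entry for "IBM" is never seen.
        System.out.println(lookup("IBM", List.of(userDict, builtIn)));
        // "mice" falls through to the built-in dictionary.
        System.out.println(lookup("mice", List.of(userDict, builtIn)));
    }
}
```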
Guessing. No dictionary can ever be complete: new words enter languages, and languages change and borrow. So, in general, the analysis for each language includes some sort of guessing capability. The job of a guesser is to take a word not found in any dictionary and come up with a plausible analysis for it. Whatever facts RBL-JE generates for a language are all possible outputs of its guesser.
In European languages, guessers deliver lemmas and parts of speech. In Korean, guessers provide morphemes, morpheme tags, compound components, and parts of speech (for the algorithm for deriving a lemma, see Korean analysis [62]). RBL-JE does not yet have guessers for Arabic, Chinese, Japanese, Thai, and Turkish.
Whitespace in Lemmas. By default, the Analyzer returns any lemma that contains whitespace as multiple lemmas (each with no whitespace). To allow lemmas with whitespace (such as International Business Machines as a lemma for the token IBM) to be placed as such in the token stream, you can create a user lemma dictionary [59] with an entry that defines such a lemma.
For example:

    IBM International[^_]Business[^_]Machines[+PROP]
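A small helper can build a dictionary line in the shape of this example, escaping spaces in the lemma as [^_]. The single-space field separator and bracketed tag are taken from the printed example; consult the user lemma dictionary chapter [59] for the authoritative file format.

```java
// Builds a user-dictionary line like the example above, replacing
// each space in the lemma with the [^_] escape.
public class LemmaEntry {
    static String entry(String surface, String lemma, String tag) {
        return surface + " " + lemma.replace(" ", "[^_]") + "[" + tag + "]";
    }

    public static void main(String[] args) {
        // prints: IBM International[^_]Business[^_]Machines[+PROP]
        System.out.println(entry("IBM", "International Business Machines", "+PROP"));
    }
}
```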
Compounds. The Analyzer decomposes Danish, Dutch, German, Hungarian, Norwegian, and Swedish compounds, returning the lemma of each component.
The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting 's' is not present in the component list. For this input token, the RBL-JE Tokenizer and RBL-JE Analyzer return the following entries:
- Original form: Eingangstüren (type = <ALNUM>)
- Lemma for the compound: Eingangstür (type = <LEMMA>)
- Component lemmas: Eingang, Tür (type = <COMP>)
Other German examples include letter removal (Rennrad ⇒ rennen + Rad), vowel changes (Mängelliste ⇒ Mangel + Liste), and capitalization changes (Blaugrünalge ⇒ blau + grün + Alge).
User Dictionaries. To extend the coverage that RBL-JE provides for each supported language (except Arabic, Hebrew, Romanian, and Turkish), you can create user dictionaries. See User-Defined Dictionaries [59].