Chapter 7. User-Defined Dictionaries
Users can create supplementary dictionaries to change default linguistic analyses. The available dictionary types are described below.
Lemma Dictionary. A lemma dictionary allows users to modify the default lemmatization of an analysis by defining new lemmas for any word. As described below, the analysis associated with a lemma includes a part-of-speech tag and, for some languages, additional characteristics. These dictionaries are not supported for Arabic, Hebrew, Persian, Romanian, Turkish, and Urdu.
Segmentation Dictionary. A segmentation dictionary allows users to specify strings that are to be segmented as tokens for Chinese, Japanese, or Thai.
CSC Dictionary. A CSC dictionary allows users to specify conversions for use with the Chinese Script Converter (CSC). See CSC User Dictionaries [78].
CLA/JLA Dictionaries. The Chinese Language Analyzer (CLA) and the Japanese Language Analyzer (JLA) have their own user dictionaries. See CLA and JLA User Dictionaries [164].
Many-to-one Normalization Dictionaries. Users can implement a many-to-one dictionary to map multiple spelling variants to a single normalized form, as in the sketch following this list.
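The following is a conceptual sketch of the many-to-one idea in Java. The class, entries, and lookup logic are purely illustrative and do not reflect RBL-JE's dictionary file format or API.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: models the many-to-one mapping concept,
    // not RBL-JE's dictionary format or API.
    public class ManyToOneSketch {
        public static void main(String[] args) {
            // Several spelling variants all map to one normalized form.
            Map<String, String> variants = new HashMap<>();
            variants.put("colour", "color");
            variants.put("COLOR", "color");
            variants.put("colr", "color");

            String token = "colour";
            // Fall back to the surface form when no entry matches.
            String normalized = variants.getOrDefault(token, token);
            System.out.println(token + " -> " + normalized); // colour -> color
        }
    }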
In all dictionaries, entries should be normalized to Unicode Normalization Form KC (NFKC). Japanese Katakana characters, for example, should be full width, and Latin characters, numerals, and punctuation should be half width. Lemma dictionaries can contain characters of any script, while for the most consistent performance segmentation dictionaries should contain only characters in the Hanzi (Kanji), Hiragana, Katakana, and Thai scripts.¹ Chinese and Japanese segmentation user dictionary entries may not contain the ideographic full stop (。).
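Entries can be brought into NFKC form programmatically before a dictionary file is prepared. The sketch below uses the standard java.text.Normalizer class from the JDK; the sample strings are illustrative.

    import java.text.Normalizer;

    public class NfkcSketch {
        public static void main(String[] args) {
            String halfWidthKatakana = "ｶﾀｶﾅ";      // half-width Katakana
            String fullWidthLatin = "ＡＢＣ１２３"; // full-width Latin and digits

            // NFKC folds half-width Katakana to full width, and full-width
            // Latin letters and digits to half width, as required above.
            System.out.println(
                Normalizer.normalize(halfWidthKatakana, Normalizer.Form.NFKC)); // カタカナ
            System.out.println(
                Normalizer.normalize(fullWidthLatin, Normalizer.Form.NFKC));    // ABC123
        }
    }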
For any given dictionary type, user dictionaries are consulted in the order they were loaded. Once a token is found in a user dictionary, RBL-JE stops: it consults neither the remaining user dictionaries nor the built-in RBL-JE dictionary.
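This first-match behavior can be pictured as follows. This is a conceptual sketch, not RBL-JE code; the Dictionary interface and its lookup method are hypothetical.

    import java.util.List;
    import java.util.Optional;

    public class LookupOrderSketch {
        // Hypothetical stand-in for a loaded dictionary.
        interface Dictionary {
            Optional<String> lookup(String token);
        }

        // User dictionaries are tried in load order; the built-in
        // dictionary is consulted only when no user dictionary matches.
        static Optional<String> resolve(String token,
                                        List<Dictionary> userDictionaries,
                                        Dictionary builtIn) {
            for (Dictionary dict : userDictionaries) {
                Optional<String> hit = dict.lookup(token);
                if (hit.isPresent()) {
                    return hit; // stop at the first match
                }
            }
            return builtIn.lookup(token);
        }
    }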
1. A run of text in another script (such as Latin) whose length equals or exceeds the value of TokenizerOption.minNonHanRegionLength (the default is 10) is passed to the standard Tokenizer and is not seen by a user segmentation dictionary.
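To see which spans of input such a threshold would affect, one can look for runs of non-Han text of at least that length. The following regex-based sketch approximates the rule for Latin script only; it is an illustration, not RBL-JE code.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NonHanRegionSketch {
        public static void main(String[] args) {
            int minNonHanRegionLength = 10; // the documented default
            Pattern latinRun = Pattern.compile(
                "\\p{IsLatin}{" + minNonHanRegionLength + ",}");

            String input = "これはInternationalizationの例です";
            Matcher m = latinRun.matcher(input);
            while (m.find()) {
                // A run like this goes to the standard Tokenizer and
                // bypasses any user segmentation dictionary.
                System.out.println("Non-Han region: " + m.group());
            }
        }
    }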