Appendix I. CLA and JLA User Dictionaries
The Chinese Language Analyzer (CLA) and the Japanese Language Analyzer (JLA) both include the capability to create and use one or more segmentation (tokenization) user dictionaries for vocabulary specific to an industry or application. A common use for both languages is to add new nouns, such as organization and product names. For these and for existing nouns you can specify a compounding scheme if, for example, you wish to prevent a product name that would otherwise be treated as a compound from being segmented into its components. Japanese also supports adding further classes of words, as enumerated in the sections below.
When the language is Japanese, you can also create user reading dictionaries with transcriptions rendered in Hiragana. These readings can override the readings returned by the JLA reading dictionary as well as readings that are otherwise guessed from segmentation (tokenization) user dictionaries.
Segmentation (tokenization) user dictionaries and reading user dictionaries are compiled into
separate binary forms with big-endian or little-endian byte order to match the platform. Both
dictionary types can be compiled from the same source file.
Procedure for Creating and Using a Chinese or Japanese User Dictionary
I.1. Creating the Source File
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries [167]). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL followed by a Tab and an arbitrary string that sets the dictionary's name; the name is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where:
- word is the word to add (for example, a new noun).
- POS is one of the user-dictionary part-of-speech tags listed below.
- DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition).
- Reading1,Reading2,... is a comma-delimited list of one or more transcriptions rendered in Hiragana or Katakana (applicable only to Japanese).
The decomposition pattern and readings are optional, but you must include a decomposition pattern if you include readings. In other words, an entry must include all of these elements to be included in a reading user dictionary, even though the reading user dictionary does not use the POS tag or the decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you need only the word, a POS tag, and an optional decomposition pattern. Keep in mind that entries that include all of these elements can be included in both a segmentation (tokenization) user dictionary and a reading user dictionary.
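To make the entry format concrete, the following minimal Python sketch assembles entry lines and writes a UTF-8 source file. It is illustrative only and not part of CLA or JLA; the helper name make_entry and the output file name are placeholders.

# Illustrative only: assemble user dictionary entry lines in the
# word<TAB>POS<TAB>DecompPattern<TAB>Readings format described above.

def make_entry(word, pos, decomp=None, readings=None):
    fields = [word, pos]
    if readings and decomp is None:
        # Readings require a decomposition pattern ("0" means no decomposition).
        raise ValueError("a decomposition pattern is required when readings are given")
    if decomp is not None:
        fields.append(",".join(str(n) for n in decomp))
    if readings:
        fields.append(",".join(readings))
    return "\t".join(fields)

entries = [
    make_entry("デジカメ", "NOUN", decomp=[0]),
    make_entry("東京証券取引所", "ORGANIZATION", decomp=[2, 2, 3]),
    make_entry("安倍晋三", "PERSON", decomp=[2, 2], readings=["あべしんぞう"]),
]

# Empty lines and lines beginning with # are ignored by the compiler;
# the optional first line sets the (currently unused) dictionary name.
with open("user_dict_source.txt", "w", encoding="utf-8") as f:
    f.write("!DICT_LABEL\tNew Words 2014\n")
    f.write("\n".join(entries) + "\n")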
Chinese User Dictionary POS Tags
- ABBREVIATION
- ADJECTIVE
- ADVERB
- AFFIX
- CONJUNCTION
- CONSTRUCTION
- DERIVATIONAL_AFFIX
- DIRECTION_WORD
- FOREIGN_PERSON
- IDIOM
- INTERJECTION
- MEASURE_WORD
- NON_DERIVATIONAL_AFFIX
- NOUN
- NUMERAL
- ONOMATOPE
- ORGANIZATION
- PARTICLE
- PERSON
- PLACE
- PREFIX
- PREPOSITION
- PRONOUN
- PROPER_NOUN
- PUNCTUATION
- SUFFIX
- TEMPORAL_NOUN
- VERB
- VERB_ELEMENT
Japanese User Dictionary POS Tags
- NOUN
- PROPER_NOUN
- PLACE
- PERSON
- ORGANIZATION
- GIVEN_NAME
- SURNAME
- FOREIGN_PLACE_NAME
- FOREIGN_GIVEN_NAME
- FOREIGN_SURNAME
- AJ (adjective)
- AN (adjectival noun)
- HS (honorific suffix)
- V1 (vowel-stem verb)
- VN (verbal noun)
- VS (suru-verb)
- VX (irregular verb)
Note: For examples of standard (non-user-dictionary) use of the one- and two-letter POS tags in the preceding list, see Japanese POS Tags [138].
Examples (the last three entries include readings)
!DICT_LABEL New Words 2014
デジタルカメラ NOUN
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
安倍晋三 PERSON 2,2 あべしんぞう
麻垣康三 PERSON 2,2 あさがきこうぞう
商人 NOUN 0 しょうにん, あきんど
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 organization 2,2,3
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into
東京
証券
取引所
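The way a decomposition pattern maps onto an entry's characters can be expressed in a few lines of Python. The sketch below is illustrative only; split_by_pattern is a hypothetical helper, not a CLA/JLA function.

# Illustrative only: split an entry into components whose lengths are
# given by its decomposition pattern, as the tokenizer does for compounds.

def split_by_pattern(word, pattern):
    if pattern == [0]:
        return [word]  # 0 means no decomposition
    assert sum(pattern) == len(word), "pattern must cover every character of the entry"
    components, start = [], 0
    for length in pattern:
        components.append(word[start:start + length])
        start += length
    return components

print(split_by_pattern("東京証券取引所", [2, 2, 3]))   # ['東京', '証券', '取引所']
print(split_by_pattern("デジカメ", [0]))               # ['デジカメ']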
I.2. Valid Characters for Chinese and Japanese User Dictionary Entries
An entry in a Chinese or Japanese user dictionary must contain characters corresponding to the following Unicode code points, to valid surrogate pairs, or to letters or decimal digits in Latin script. In this listing, .. indicates an inclusive range of valid code points:
0025..0039, 0040..005A, 005F..007A, 007E, 00B7, 0370..03FF, 0400..04FF, 2010..206F, 2160..217B, 2200..22FF, 2460..24FF, 25A0..25FF, 2600..26FF, 3003..3007, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FA, 30FC..30FE, 3200..32FF, 3300..33FF, 4E00..9FFF, D800..DBFF, DC00..DFFF, E000..F8FF, F900..FA2D, FF00, FF02..FFEF
For example, the full stop 。 (3002) indicates a sentence break and must not be included in a dictionary entry. The Katakana middle dot ・ (30FB) must not appear in a dictionary entry; input strings with this character match the corresponding dictionary entries without the character.
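One way to pre-check entries against these ranges is sketched below in Python. This is not the compiler's own validation; the range table is transcribed from the listing above, and characters outside the Basic Multilingual Plane (which become surrogate pairs in UTF-16) are accepted by a separate check.

# Illustrative only: flag characters that fall outside the valid ranges
# listed above. Latin letters and decimal digits are also accepted, and
# code points above U+FFFF stand in for valid surrogate pairs.

VALID_RANGES = [
    (0x0025, 0x0039), (0x0040, 0x005A), (0x005F, 0x007A), (0x007E, 0x007E),
    (0x00B7, 0x00B7), (0x0370, 0x03FF), (0x0400, 0x04FF), (0x2010, 0x206F),
    (0x2160, 0x217B), (0x2200, 0x22FF), (0x2460, 0x24FF), (0x25A0, 0x25FF),
    (0x2600, 0x26FF), (0x3003, 0x3007), (0x3012, 0x3012), (0x3020, 0x3020),
    (0x3031, 0x3037), (0x3041, 0x3094), (0x3099, 0x309E), (0x30A1, 0x30FA),
    (0x30FC, 0x30FE), (0x3200, 0x32FF), (0x3300, 0x33FF), (0x4E00, 0x9FFF),
    (0xE000, 0xF8FF), (0xF900, 0xFA2D), (0xFF00, 0xFF00), (0xFF02, 0xFFEF),
]

def invalid_chars(word):
    bad = []
    for ch in word:
        cp = ord(ch)
        if ch.isascii() and ch.isalnum():
            continue                       # Latin letters and decimal digits
        if cp > 0xFFFF:
            continue                       # encoded as a surrogate pair in UTF-16
        if not any(lo <= cp <= hi for lo, hi in VALID_RANGES):
            bad.append(ch)
    return bad

print(invalid_chars("デジタルカメラ"))   # []
print(invalid_chars("東京・証券"))       # ['・']  (30FB, the Katakana middle dot)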
I.3. Compiling the User Dictionary
Use the dictionary compiler described in Segmentation Dictionaries [64]. To compile a Chinese segmentation dictionary, use the -type cla option. To compile a Japanese segmentation dictionary, use the -type jla option. To compile a Japanese reading dictionary, use the -type jla-reading option.
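As a sketch of how compilation might be scripted, the Python below runs the compiler once per dictionary type. Only the -type values come from this appendix; the executable name dictionary-compiler and the argument order are placeholders, so substitute the actual invocation documented in Segmentation Dictionaries [64].

# Illustrative only: run the dictionary compiler for each dictionary type.
# "dictionary-compiler" and the input/output argument order are placeholders.

import subprocess

SOURCE = "user_dict_source.txt"

jobs = [
    ("cla", "user_dict_zh.bin"),                   # Chinese segmentation dictionary
    ("jla", "user_dict_ja.bin"),                   # Japanese segmentation dictionary
    ("jla-reading", "user_dict_ja_readings.bin"),  # Japanese reading dictionary
]

for dict_type, output in jobs:
    subprocess.run(
        ["dictionary-compiler", "-type", dict_type, SOURCE, output],
        check=True,
    )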
I.4. Activating the User Dictionary
From an API perspective, the JLA and CLA user dictionaries function like segmentation dictionaries. See Activating User Dictionaries [66] in the User Dictionary chapter for details.