Appendix I. CLA and JLA User Dictionaries
The Chinese Language Analyzer (CLA) and the Japanese Language Analyzer (JLA) both include the capability to create and use one or more segmentation (tokenization) user dictionaries for vocabulary specific to an industry or application. A common use for both languages is to add new nouns, such as organization and product names. For these and for existing nouns you can specify a compounding scheme if, for example, you wish to prevent a product name that would otherwise be treated as a compound from being segmented into its components. Japanese also supports adding further classes of words, as enumerated in the sections below.
When the language is Japanese, you can also create user reading dictionaries with transcriptions rendered in Hiragana. These readings can override the readings returned by the JLA reading dictionary as well as readings that are otherwise guessed from segmentation (tokenization) user dictionaries.
Segmentation (tokenization) user dictionaries and reading user dictionaries are compiled into
separate binary forms with big-endian or little-endian byte order to match the platform. Both
dictionary types can be compiled from the same source file.
Procedure for Creating and Using a Chinese or Japanese User Dictionary
I.1. Creating the Source File
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries [167]). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL followed by a Tab and an arbitrary string that sets the dictionary's name; the name is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where:
- word is the word to add (for example, a new noun).
- POS is one of the user-dictionary part-of-speech tags listed below.
- DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition).
- Reading1,Reading2,... is a comma-delimited list of one or more transcriptions rendered in Hiragana or Katakana (applicable only to Japanese).
The decomposition pattern and readings are optional, but you must include a decomposition pattern if you include readings. In other words, an entry must include all of these elements to be included in a reading user dictionary, even though the reading user dictionary does not use the POS tag or the decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you need only the word, a POS tag, and an optional decomposition pattern. Keep in mind that entries that include all of these elements can be included in both a segmentation (tokenization) user dictionary and a reading user dictionary.
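To make the entry format concrete, the following minimal Python sketch assembles entry lines and writes a UTF-8 source file. It is illustrative only and not part of CLA or JLA; the helper name make_entry and the output file name are placeholders.

# Illustrative only: assemble user dictionary entry lines in the
# word<TAB>POS<TAB>DecompPattern<TAB>Readings format described above.

def make_entry(word, pos, decomp=None, readings=None):
    fields = [word, pos]
    if readings and decomp is None:
        # Readings require a decomposition pattern ("0" means no decomposition).
        raise ValueError("a decomposition pattern is required when readings are given")
    if decomp is not None:
        fields.append(",".join(str(n) for n in decomp))
    if readings:
        fields.append(",".join(readings))
    return "\t".join(fields)

entries = [
    make_entry("デジカメ", "NOUN", decomp=[0]),
    make_entry("東京証券取引所", "ORGANIZATION", decomp=[2, 2, 3]),
    make_entry("安倍晋三", "PERSON", decomp=[2, 2], readings=["あべしんぞう"]),
]

# Empty lines and lines beginning with # are ignored by the compiler;
# the optional first line sets the (currently unused) dictionary name.
with open("user_dict_source.txt", "w", encoding="utf-8") as f:
    f.write("!DICT_LABEL\tNew Words 2014\n")
    f.write("\n".join(entries) + "\n")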
Chinese User Dictionary POS Tags
- ABBREVIATION
- ADJECTIVE
- ADVERB
- AFFIX
- CONJUNCTION
- CONSTRUCTION
- DERIVATIONAL_AFFIX
- DIRECTION_WORD
- FOREIGN_PERSON
- IDIOM
- INTERJECTION
- MEASURE_WORD
- NON_DERIVATIONAL_AFFIX
- NOUN
- NUMERAL
- ONOMATOPE
- ORGANIZATION
- PARTICLE
- PERSON
- PLACE
- PREFIX
- PREPOSITION
- PRONOUN
- PROPER_NOUN
- PUNCTUATION
- SUFFIX
- TEMPORAL_NOUN
- VERB
- VERB_ELEMENT
Japanese User Dictionary POS Tags
- NOUN
- PROPER_NOUN
- PLACE
- PERSON
- ORGANIZATION
- GIVEN_NAME
- SURNAME
- FOREIGN_PLACE_NAME
- FOREIGN_GIVEN_NAME
- FOREIGN_SURNAME
- AJ (adjective)
- AN (adjectival noun)
- HS (honorific suffix)
- V1 (vowel-stem verb)
- VN (verbal noun)
- VS (suru-verb)
- VX (irregular verb)
Note: For examples of standard (non-user-dictionary) use of the one- and two-letter POS tags in the preceding list, see Japanese POS Tags [138].
Examples (the last three entries include readings)
!DICT_LABEL New Words 2014
デジタルカメラ NOUN
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
安倍晋三 PERSON 2,2 あべしんぞう
麻垣康三 PERSON 2,2 あさがきこうぞう
商人 NOUN 0 しょうにん, あきんど
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 organization 2,2,3
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into
東京
証券
取引所
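The way a decomposition pattern maps onto an entry's characters can be expressed in a few lines of Python. The sketch below is illustrative only; split_by_pattern is a hypothetical helper, not a CLA/JLA function.

# Illustrative only: split an entry into components whose lengths are
# given by its decomposition pattern, as the tokenizer does for compounds.

def split_by_pattern(word, pattern):
    if pattern == [0]:
        return [word]  # 0 means no decomposition
    assert sum(pattern) == len(word), "pattern must cover every character of the entry"
    components, start = [], 0
    for length in pattern:
        components.append(word[start:start + length])
        start += length
    return components

print(split_by_pattern("東京証券取引所", [2, 2, 3]))   # ['東京', '証券', '取引所']
print(split_by_pattern("デジカメ", [0]))               # ['デジカメ']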
I.2. Valid Characters for Chinese and Japanese User Dictionary Entries
An entry in a Chinese or Japanese user dictionary must contain characters corresponding to the following Unicode code points, to valid surrogate pairs, or to letters or decimal digits in Latin script. In this listing, .. indicates an inclusive range of valid code points:
0025..0039, 0040..005A, 005F..007A, 007E, 00B7, 0370..03FF, 0400..04FF, 2010..206F, 2160..217B, 2200..22FF, 2460..24FF, 25A0..25FF, 2600..26FF, 3003..3007, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FA, 30FC..30FE, 3200..32FF, 3300..33FF, 4E00..9FFF, D800..DBFF, DC00..DFFF, E000..F8FF, F900..FA2D, FF00, FF02..FFEF
For example, the full stop 。 (3002) indicates a sentence break and must not be included in a dictionary entry. The Katakana middle dot ・ (30FB) must not appear in a dictionary entry; input strings with this character match the corresponding dictionary entries without the character.
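One way to pre-check entries against these ranges is sketched below in Python. This is not the compiler's own validation; the range table is transcribed from the listing above, and characters outside the Basic Multilingual Plane (which become surrogate pairs in UTF-16) are accepted by a separate check.

# Illustrative only: flag characters that fall outside the valid ranges
# listed above. Latin letters and decimal digits are also accepted, and
# code points above U+FFFF stand in for valid surrogate pairs.

VALID_RANGES = [
    (0x0025, 0x0039), (0x0040, 0x005A), (0x005F, 0x007A), (0x007E, 0x007E),
    (0x00B7, 0x00B7), (0x0370, 0x03FF), (0x0400, 0x04FF), (0x2010, 0x206F),
    (0x2160, 0x217B), (0x2200, 0x22FF), (0x2460, 0x24FF), (0x25A0, 0x25FF),
    (0x2600, 0x26FF), (0x3003, 0x3007), (0x3012, 0x3012), (0x3020, 0x3020),
    (0x3031, 0x3037), (0x3041, 0x3094), (0x3099, 0x309E), (0x30A1, 0x30FA),
    (0x30FC, 0x30FE), (0x3200, 0x32FF), (0x3300, 0x33FF), (0x4E00, 0x9FFF),
    (0xE000, 0xF8FF), (0xF900, 0xFA2D), (0xFF00, 0xFF00), (0xFF02, 0xFFEF),
]

def invalid_chars(word):
    bad = []
    for ch in word:
        cp = ord(ch)
        if ch.isascii() and ch.isalnum():
            continue                       # Latin letters and decimal digits
        if cp > 0xFFFF:
            continue                       # encoded as a surrogate pair in UTF-16
        if not any(lo <= cp <= hi for lo, hi in VALID_RANGES):
            bad.append(ch)
    return bad

print(invalid_chars("デジタルカメラ"))   # []
print(invalid_chars("東京・証券"))       # ['・']  (30FB, the Katakana middle dot)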
I.3. Compiling the User Dictionary
Use the dictionary compiler described in Segmentation Dictionaries [64]. To compile a Chinese segmentation dictionary, use the -type cla option. To compile a Japanese segmentation dictionary, use the -type jla option. To compile a Japanese reading dictionary, use the -type jla-reading option.
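As a sketch of how compilation might be scripted, the Python below runs the compiler once per dictionary type. Only the -type values come from this appendix; the executable name dictionary-compiler and the argument order are placeholders, so substitute the actual invocation documented in Segmentation Dictionaries [64].

# Illustrative only: run the dictionary compiler for each dictionary type.
# "dictionary-compiler" and the input/output argument order are placeholders.

import subprocess

SOURCE = "user_dict_source.txt"

jobs = [
    ("cla", "user_dict_zh.bin"),                   # Chinese segmentation dictionary
    ("jla", "user_dict_ja.bin"),                   # Japanese segmentation dictionary
    ("jla-reading", "user_dict_ja_readings.bin"),  # Japanese reading dictionary
]

for dict_type, output in jobs:
    subprocess.run(
        ["dictionary-compiler", "-type", dict_type, SOURCE, output],
        check=True,
    )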
I.4. Activating the User Dictionary
From an API perspective, the JLA and CLA user dictionaries function like segmentation dictionaries. See Activating User Dictionaries [66] in the User Dictionary chapter for details.