7.1. Creating User Dictionaries

7.1.1. Preparing Source

The source file for a user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM). Each entry is a single line. Empty lines are ignored. The source file must be compiled into a binary format, as described below.

7.1.1.1. Lemma Dictionaries

Each entry is a word, followed by a tab and an analysis. The analysis must end with a lemma and a part-of-speech (POS) tag.

word lemma[+POS]

For POS tags, see Part-of-Speech Tags [112]. For those languages for which RBL-JE does not return POS tags, use DUMMY.

Case. User dictionary lookups are case sensitive. RBL-JE provides an option, AnalyzerOption.caseSensitive, that controls whether or not the analysis phase is case sensitive.

If this option is "true", which is the default, the token itself is used to query the dictionary. If it is "false", the token is lowercased before the dictionary is consulted, so the words in a user dictionary intended for use in a case-insensitive analysis must be in lowercase. See Activating User Dictionaries [66] to learn how to associate a dictionary with the appropriate analysis. For Danish, Norwegian, and Swedish, the dictionaries we provide are lowercase, and AnalyzerOption.caseSensitive is automatically set to "false" for these languages.
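The effect of the option can be illustrated with a minimal sketch (plain Python, not RBL-JE code; the dictionary contents are invented): when case sensitivity is off, the token is lowercased before the lowercase user dictionary is consulted.

```python
# Illustrative sketch only -- not RBL-JE code; dictionary contents invented.
# Entries intended for case-insensitive analysis must be lowercase.
user_dict = {
    "dog": ["dog[+NOUN]"],
    "telephone": ["telephone[+NOUN]", "telephone[+VI]"],
}

def lookup(token, case_sensitive=True):
    # caseSensitive=true: query with the token as-is.
    # caseSensitive=false: lowercase the token before consulting the dictionary.
    key = token if case_sensitive else token.lower()
    return user_dict.get(key, [])

print(lookup("Dog", case_sensitive=True))   # no exact-case entry: []
print(lookup("Dog", case_sensitive=False))  # ['dog[+NOUN]']
```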

Variations. You may want to provide more than one analysis for a word, or more than one form of a word for the same analysis. Note: The shortcut of including lines with an empty analysis or an empty word to repeat a word or analysis is deprecated.

The following example includes two analyses for "telephone" (noun and verb), and two renditions of "dog" for the same analysis (noun).

    telephone telephone[+NOUN]
    telephone telephone[+VI]
    dog dog[+NOUN]
    Dog dog[+NOUN]
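Because a word may repeat with different analyses, the entries naturally load into a multimap. The following sketch (illustrative Python, not RBL-JE code) reads the example entries above, letting one word carry several analyses and several surface forms share one analysis:

```python
# Illustrative loader sketch -- not RBL-JE code. Each source line is
# word TAB analysis; empty lines are ignored.
from collections import defaultdict

source = (
    "telephone\ttelephone[+NOUN]\n"
    "telephone\ttelephone[+VI]\n"
    "\n"
    "dog\tdog[+NOUN]\n"
    "Dog\tdog[+NOUN]\n"
)

entries = defaultdict(list)
for line in source.splitlines():
    if not line.strip():
        continue  # empty lines are ignored
    word, analysis = line.split("\t", 1)
    entries[word].append(analysis)

print(entries["telephone"])  # ['telephone[+NOUN]', 'telephone[+VI]']
```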

For some languages, the analysis may include special tags and additional information.

Contracted Forms. For English, French, Italian, and Portuguese, ^= is the separator for a contraction or elision. English example:

doesn't does[^=]not[+VDPRES]

Multi-Word Analysis. For English, Italian, Spanish, and Dutch, ^_ indicates a space. English example:

IBM International[^_]Business[^_]Machines[+PROP]

Compound Boundary. For Danish, Dutch, Norwegian, German, and Swedish, ^# indicates the boundary between elements in a compound word. For Hungarian, the compound boundary tag is ^CB+. German example:

heimatländern Heimat[^#]Land[+NOUN]

Compound Linking Element. For German, ^/ indicates a compound linking element. For Dutch, use ^//. German example:

arbeitskreis Arbeit[^/]s[^#]Kreis[+NOUN]

Derivation Boundary or Separator for Clitics. For Italian, Portuguese, and Spanish, ^| indicates a derivation boundary or separator for clitics. Spanish example with derivation boundary:

duramente duro[^|][+ADV]

Italian example with separator for clitics:

farti fare[^|]tu[+VINF_CLIT]

Japanese Readings and Normalized Forms. For Japanese, [^r] precedes a reading (there may be
more than one), and [^n] precedes a normalization. For example:

行わ 行う[^r]オコナワ[+V]
tv テレビ[^r]テレビ[^n]テレビ[+NC]
アキュムレータ アキュムレーター[^r]アキュムレータ[^n]アキュムレーター[+NC]
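The reading and normalization tags can be pulled apart mechanically. Here is a hypothetical parse of the last analysis above (plain Python, not RBL-JE code):

```python
import re

# Hypothetical parse of the Japanese tags described above -- not RBL-JE code.
analysis = "アキュムレーター[^r]アキュムレータ[^n]アキュムレーター[+NC]"

lemma = re.match(r"(.*?)\[\^", analysis).group(1)    # text before the first tag
readings = re.findall(r"\[\^r\]([^\[]+)", analysis)  # [^r] precedes each reading
norms = re.findall(r"\[\^n\]([^\[]+)", analysis)     # [^n] precedes a normalization

print(lemma, readings, norms)
```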

Korean Analysis. A Korean analysis uses a different pattern than the analysis for other languages. The pattern for an analysis in our Korean dictionary or a user Korean dictionary is as follows:

Token Mor1[/Tag1][^+]Mor2[/Tag2][^+]Mor3[/Tag3]

Where each MorN is a morpheme consisting of one or more Korean characters, and TagN is the POS tag for that morpheme. [^+] indicates the boundary between morphemes.[2] Here's an example:

유전자이다 유전자[/NPR][^+]이[/CO][^+]다[/ECS]

If the analysis contains one noun morpheme, that morpheme is the lemma, and its POS tag is the POS tag for the entry. If more than one of the morphemes is a noun, the lemma is the concatenation of those nouns (a compound). Example:

정보검색 정보[/NNC][^+]검색[/NNC]

Otherwise, the lemma is the first morpheme, and the POS tag is the POS tag associated with that morpheme.
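The lemma-selection rule just described can be sketched as follows (illustrative Python, not RBL-JE source; the assumption that POS tags beginning with "N", such as NNC and NPR, mark nouns is ours):

```python
import re

# Illustrative sketch of the lemma rule described above -- not RBL-JE source.
# Assumption: POS tags beginning with "N" (e.g. NNC, NPR) mark noun morphemes.
def korean_lemma(analysis):
    # Parse "mor[/TAG][^+]mor[/TAG]..." into (morpheme, tag) pairs,
    # skipping the [^+] boundary markers.
    pairs = re.findall(r"([^\[\]+^]+)\[/([^\]]+)\]", analysis)
    nouns = [mor for mor, tag in pairs if tag.startswith("N")]
    if nouns:
        # One noun: that morpheme. Several nouns: their concatenation.
        return "".join(nouns)
    # Otherwise, the first morpheme is the lemma.
    return pairs[0][0]

print(korean_lemma("정보[/NNC][^+]검색[/NNC]"))            # 정보검색
print(korean_lemma("유전자[/NPR][^+]이[/CO][^+]다[/ECS]"))  # 유전자
```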

You can override this algorithm for identifying the lemma and/or POS tag in a user dictionary entry by placing [^L]lemma and/or [^P][/Tag] at the end of the analysis. The lemma may or may not correspond to one of the morphemes in the analysis. For example (for illustration only):

유전자이다 유전자[/NNC][^+]이[/CO][^+]다[/ECS][^L]유전[^P][/NPR]

The com.basistech.rosette.bl.KoreanAnalysis interface provides access to the morphemes and tags associated with a given token in either the standard Korean dictionary or a user Korean dictionary.

7.1.1.2. Segmentation Dictionaries

The format for a Chinese, Japanese, or Thai segmentation dictionary source file is very simple. Each word is written on its own line, and that word is guaranteed to be segmented as a single token when seen in the input text, regardless of context. Japanese example:

三菱UFJ銀行
酸素ボンベ
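To illustrate the guarantee, here is a toy sketch (plain Python; this is not how RBL-JE segments text internally) that finds the spans a user segmentation dictionary would protect from splitting, longest entries first:

```python
# Toy sketch -- not how RBL-JE segments internally. It only shows which
# spans a user segmentation dictionary would keep as single tokens.
user_words = {"三菱UFJ銀行", "酸素ボンベ"}

def protected_spans(text):
    # Longest entries first, so longer words win over any shorter overlap.
    spans = []
    for w in sorted(user_words, key=len, reverse=True):
        start = text.find(w)  # first occurrence only, for brevity
        if start != -1:
            spans.append((start, start + len(w), w))
    return spans

print(protected_spans("昨日三菱UFJ銀行に行った"))  # [(2, 9, '三菱UFJ銀行')]
```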

7.1.1.3. Many-to-one Normalization Dictionaries

The source file is tab-separated, with each normalization on a separate line. The first value on each line is the normalized form. Subsequent values are variants to be mapped to the normalized form. For example:

norm1 var1 var2
norm1 var3 var4 var5
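Loading this format reduces to building a flat variant-to-norm map. Illustrative sketch (plain Python, not RBL-JE code):

```python
# Illustrative loader -- not RBL-JE code. Each line: norm TAB var TAB var ...
source = "norm1\tvar1\tvar2\nnorm1\tvar3\tvar4\tvar5\n"

normalize = {}
for line in source.splitlines():
    fields = line.split("\t")
    norm, variants = fields[0], fields[1:]
    for variant in variants:
        normalize[variant] = norm  # many variants map to one normalized form

print(normalize["var4"])  # norm1
```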
2. An analysis from our standard dictionary may include a morpheme with a POS tag and a second tag, ^KrName (person name). For example: 유전자이다 유전자[/NPR^KrName][^+]이[/CO][^+]다[/ECS]. ^KrName is not supported in user dictionaries.
