8.5. CSC User Dictionaries

You can create and use user dictionaries for converting from TC to SC and vice versa. A CSC user dictionary always supports orthographic conversion and may also support lexemic conversion. It is not used for codepoint conversion.

8.5.1. Creating a CSC User Dictionary

The source file for a CSC user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM). Each entry is a single line. Empty lines are ignored. The source file must be compiled into a binary format, as described below.

Each entry contains two or three tab-delimited elements:

input_token orthographic_translation [lexemic_translation]

If the input_token is TC, the orthographic_translation and the optional lexemic_translation should be SC, and vice versa.

Sample entries for a TC to SC user dictionary:

電腦 电脑 计算机
宇宙飛船 宇宙飞船
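
Before compiling, it can be useful to check that every entry in the source file has the expected shape. The following Java sketch is not part of RBL-JE; the class name CscSourceCheck and its methods are hypothetical. It assumes that each element must be non-empty, which the format description above does not state explicitly. It strips an optional BOM, skips empty lines, and reports the line numbers of entries that do not have two or three tab-delimited elements.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RBL-JE.
public class CscSourceCheck {

    /** Removes a leading byte order mark (U+FEFF after UTF-8 decoding), if present. */
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /**
     * Returns true for an empty line (ignored by the format) or for a line
     * with two or three tab-delimited elements. Requiring each element to be
     * non-empty is an assumption of this sketch.
     */
    static boolean isValidEntry(String line) {
        if (line.isEmpty()) {
            return true;
        }
        String[] fields = line.split("\t", -1);
        if (fields.length < 2 || fields.length > 3) {
            return false;
        }
        for (String field : fields) {
            if (field.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Returns the 1-based line numbers of malformed entries. */
    static List<Integer> findBadLines(String content) {
        List<Integer> bad = new ArrayList<>();
        String[] lines = stripBom(content).split("\r?\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (!isValidEntry(lines[i])) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CscSourceCheck INPUT_FILE");
            return;
        }
        String content = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<Integer> bad = findBadLines(content);
        System.out.println(bad.isEmpty()
                ? "All entries look well formed."
                : "Malformed entries at lines: " + bad);
    }
}
```

A common mistake this catches is an entry whose elements are separated by spaces rather than tabs, which the check treats as a single element.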

Compiling a CSC User Dictionary. In the tools/bin directory, RBL-JE includes a shell script for Unix:

rbl-build-csc-dictionary

and a .bat file for Windows:

rbl-build-csc-dictionary.bat

The script uses Java to compile the user dictionary. The operation is performed in memory, so you may need more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.

Unix shell:

export JAVA_OPTS=-Xmx8g

Windows command prompt:

set JAVA_OPTS=-Xmx8g

Compile the CSC user dictionary from the RBL-JE root directory:

tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE

INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:

tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
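
If you prefer to run the compile step from a Java build or test harness rather than from a shell, the same invocation can be sketched with java.lang.ProcessBuilder. This is an illustration only: the class CscDictionaryBuild and its prepare method are hypothetical and are not part of RBL-JE. The sketch selects the script by operating system, sets JAVA_OPTS for the child process, and uses the RBL-JE root as the working directory so that relative INPUT_FILE and OUTPUT_FILE paths resolve as in the command above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper, not part of RBL-JE.
public class CscDictionaryBuild {

    /** Prepares, but does not start, the dictionary compile command. */
    static ProcessBuilder prepare(File rblRoot, String inputFile, String outputFile, String maxHeap) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        String scriptName = windows ? "rbl-build-csc-dictionary.bat" : "rbl-build-csc-dictionary";
        File script = new File(new File(rblRoot, "tools/bin"), scriptName);

        List<String> command = new ArrayList<>();
        if (windows) {
            // Run the .bat file through the Windows command interpreter.
            command.add("cmd");
            command.add("/c");
        }
        command.add(script.getAbsolutePath());
        command.add(inputFile);
        command.add(outputFile);

        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(rblRoot);                               // run from the RBL-JE root directory
        pb.environment().put("JAVA_OPTS", "-Xmx" + maxHeap); // for example -Xmx8g
        pb.inheritIO();                                      // show the script's output on the console
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 3) {
            System.err.println("usage: java CscDictionaryBuild RBL_ROOT INPUT_FILE OUTPUT_FILE");
            return;
        }
        int exitCode = prepare(new File(args[0]), args[1], args[2], "8g").start().waitFor();
        System.out.println("Exit code: " + exitCode);
    }
}
```

Setting JAVA_OPTS through the ProcessBuilder environment affects only the child process, so the heap setting does not leak into the calling JVM or shell.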

Byte Order. For information about the byte order of binary dictionaries, see Handling Little-Endian and Big-Endian Dictionaries.

8.5.2. Activating CSC User Dictionaries

Using RBL-JE Core. The com.basistech.rosette.csc.CSCAnalyzerFactory class provides a method for loading a user-defined CSC dictionary:

void addUserDefinedDictionary(LanguageCode language, LanguageCode targetLanguage, String path);

For example:

CSCAnalyzerFactory caf = new CSCAnalyzerFactory();
caf.addUserDefinedDictionary(LanguageCode.TRADITIONAL_CHINESE,
        LanguageCode.SIMPLIFIED_CHINESE,
        "path/to/my_tc2sc.bin");

Using RBL-JE Lucene. You can use the com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory constructor to add user-defined CSC dictionaries:

public BaseLinguisticsCSCTokenFilterFactory(Map<String, String> args);

The args Map may include a userDefinedDictionaryPath.

For example:

Map<String, String> args = new HashMap<String, String>();
args.put("rootDirectory", "path/to/rootDirectory");
args.put("licensePath", "path/to/licenses/rlp-license.xml");
args.put("language", "zht");
args.put("targetLanguage", "zhs");
args.put("CSCAnalyzerOption.conversionLevel", "orthographic");
args.put("userDefinedDictionaryPath", "path/to/my_tc2sc.bin");

BaseLinguisticsCSCTokenFilterFactory cscTokenFilterFactory =
        new BaseLinguisticsCSCTokenFilterFactory(args);
