8.5. CSC User Dictionaries
You may create and use user dictionaries for converting from TC to SC and vice versa. A CSC user dictionary always supports orthographic conversion and, when its entries include a lexemic translation, lexemic conversion as well. It is not used for codepoint conversion.
8.5.1. Creating a CSC User Dictionary
The source file for a CSC user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM). Each entry is a single line. Empty lines are ignored. The source file must be compiled into a binary format, as described below.
Each entry contains two or three tab-delimited elements:
input_token orthographic_translation [lexemic_translation]
If the input_token is TC, the orthographic_translation and the optional lexemic_translation should be SC, and vice versa.
Sample entries for a TC to SC user dictionary:
電腦 电脑 计算机
宇宙飛船 宇宙飞船
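Before compiling, it can help to confirm that every entry in the source file follows the format above. The following is a minimal sketch of such a check, not part of RBL-JE: the class name CscSourceCheck and the method findProblems are hypothetical, and treating an empty field as an error is an assumption of this sketch rather than a documented rule.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of RBL-JE: checks a CSC user dictionary source file. */
final class CscSourceCheck {

    /** Returns one message per malformed entry; an empty list means no problems were found. */
    static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // The file may begin with a byte order mark; drop it from the first line.
            if (i == 0 && line.startsWith("\uFEFF")) {
                line = line.substring(1);
            }
            // Empty lines are ignored.
            if (line.isEmpty()) {
                continue;
            }
            // Limit -1 keeps trailing empty fields, so a dangling tab is reported.
            String[] fields = line.split("\t", -1);
            if (fields.length < 2 || fields.length > 3) {
                problems.add("line " + (i + 1)
                        + ": expected 2 or 3 tab-delimited elements, found " + fields.length);
                continue;
            }
            // Assumption of this sketch: an empty element is treated as an error.
            for (String field : fields) {
                if (field.isEmpty()) {
                    problems.add("line " + (i + 1) + ": empty element");
                    break;
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: pathname of the UTF-8 source file, for example my_tc2sc.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        findProblems(lines).forEach(System.out::println);
    }
}
```

Run against a source file, the sketch prints one line per suspect entry and prints nothing when every entry has two or three tab-delimited elements.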
Compiling a CSC User Dictionary. In the tools/bin directory, RBL-JE includes a shell script for Unix:
rbl-build-csc-dictionary
and a .bat file for Windows:
rbl-build-csc-dictionary.bat
The script uses Java to compile the user dictionary. The operation is performed in memory, so a large dictionary may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Compile the CSC user dictionary from the RBL-JE root directory:
tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE
INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
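If the compile step is part of a Java-based build, the same command can be launched with ProcessBuilder. The following is a rough sketch under stated assumptions: the class CscDictionaryBuild and its buildCommand method are hypothetical names and not part of RBL-JE. It relies only on what is described above: the script location under tools/bin, the two arguments, and the JAVA_OPTS variable.

```java
import java.io.File;
import java.io.IOException;
import java.util.Locale;

/** Hypothetical wrapper, not part of RBL-JE: runs the dictionary build script from Java. */
final class CscDictionaryBuild {

    /** Prepares the build command; heap is a JVM option such as "-Xmx8g". */
    static ProcessBuilder buildCommand(File rblRoot, String inputFile,
                                       String outputFile, String heap) {
        // Pick the .bat file on Windows and the shell script elsewhere.
        boolean windows = System.getProperty("os.name")
                .toLowerCase(Locale.ROOT).startsWith("windows");
        String scriptName = windows
                ? "rbl-build-csc-dictionary.bat"
                : "rbl-build-csc-dictionary";
        File script = new File(new File(new File(rblRoot, "tools"), "bin"), scriptName);

        ProcessBuilder pb = new ProcessBuilder(script.getAbsolutePath(), inputFile, outputFile);
        // Run from the RBL-JE root; relative input and output paths resolve against it.
        pb.directory(rblRoot);
        // Equivalent to setting JAVA_OPTS in the shell before calling the script.
        pb.environment().put("JAVA_OPTS", heap);
        // Show the script's own output and error messages on this process's console.
        pb.inheritIO();
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // args: RBL-JE root directory, source file, compiled output file
        Process process = buildCommand(new File(args[0]), args[1], args[2], "-Xmx8g").start();
        System.exit(process.waitFor());
    }
}
```

Setting JAVA_OPTS on the child process only keeps the larger heap from affecting other Java programs started from the same shell.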
Byte Order. For information about the byte order of binary dictionaries, see Handling Little-Endian and Big-Endian Dictionaries [71].
8.5.2. Activating CSC User Dictionaries
Using RBL-JE Core. The com.basistech.rosette.csc.CSCAnalyzerFactory class provides a method for loading a user-defined CSC dictionary:
void addUserDefinedDictionary(LanguageCode language, LanguageCode targetLanguage, String path);
For example:
CSCAnalyzerFactory caf = new CSCAnalyzerFactory();
caf.addUserDefinedDictionary(LanguageCode.TRADITIONAL_CHINESE,
                             LanguageCode.SIMPLIFIED_CHINESE,
                             "path/to/my_tc2sc.bin");
Using RBL-JE Lucene. You can use the com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory constructor to add user-defined CSC dictionaries:
public BaseLinguisticsCSCTokenFilterFactory(Map<String, String> args);
The args Map may include a userDefinedDictionaryPath entry whose value is the pathname of the compiled dictionary.
For example:
import java.util.HashMap;
import java.util.Map;

// Convert TC (zht) to SC (zhs) at the orthographic level with a user dictionary.
Map<String, String> args = new HashMap<String, String>();
args.put("rootDirectory", "path/to/rootDirectory");
args.put("licensePath", "path/to/licenses/rlp-license.xml");
args.put("language", "zht");
args.put("targetLanguage", "zhs");
args.put("CSCAnalyzerOption.conversionLevel", "orthographic");
args.put("userDefinedDictionaryPath", "path/to/my_tc2sc.bin");
BaseLinguisticsCSCTokenFilterFactory cscTokenFilterFactory =
    new BaseLinguisticsCSCTokenFilterFactory(args);