Chapter 8. Chinese Script Converter (CSC)

8.1. Overview

There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in the People’s Republic of China (PRC), normally employing the GB2312-80 or GBK character set. TC is used in Taiwan, Hong Kong, and Macau, normally employing the Big Five character set.

Conversion from one script to another is a complex matter. The main difficulty in SC-to-TC conversion is that the mapping is one-to-many. For example, the simplified form 发 maps to either of the traditional forms 發 or 髮, depending on the word in which it appears. Conversion must also deal with vocabulary differences and context dependence.

The Chinese Script Converter converts text in simplified script to text in traditional script, or vice versa. The conversion can be on any of three levels:

Codepoint Conversion. Codepoint conversion uses a mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified form 头发 might be converted to a traditional form by first mapping 头 to 頭, and then 发 to either 髮 or 發. Using this approach, however, there is no recognition that 头发 is a word, so the choice could be 發, in which case the end result 頭發 is nonsense. On the other hand, the choice of 髮 leads to errors for other words. So while codepoint mapping is straightforward, it is unreliable.
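The weakness described above can be reproduced with a minimal sketch. The two-entry table and the class name CodepointSketch are hypothetical and stand in for a real mapping table; this is not the table or the API that CSC uses.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hand-made two-entry table, not CSC data.
class CodepointSketch {
    // Each simplified codepoint maps to exactly one traditional codepoint.
    static final Map<Integer, Integer> SC_TO_TC = new HashMap<>();
    static {
        SC_TO_TC.put("头".codePointAt(0), "頭".codePointAt(0));
        // The table must pick one target for 发; choosing 發 loses 髮.
        SC_TO_TC.put("发".codePointAt(0), "發".codePointAt(0));
    }

    // Converts codepoint by codepoint; unmapped codepoints pass through.
    static String convert(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints()
            .map(cp -> SC_TO_TC.getOrDefault(cp, cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("发展")); // 發展, which is correct
        System.out.println(convert("头发")); // 頭發, which should be 頭髮
    }
}
```

Swapping the table entry to 髮 would fix 头发 but break 发展, which is the trade-off the paragraph above describes.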

Orthographic Conversion. The second level of conversion is orthographic. This level relies upon identification of the words in the input text. Within each word, orthographic variants of each character may be reflected in the conversion. In the above example, 头发 is identified as a word and is converted to a traditional variant of the word, 頭髮. There is no basis for converting it to 頭發, because the conversion considers the word as a whole rather than as a collection of individual characters.

Lexemic Conversion. The third level of conversion is lexemic. This level also relies upon identification of words. But rather than converting a word to an orthographic variant, the aim here is to convert it to an entirely different word. For example, "computer" is usually 计算机 in SC but 電腦 in TC. Whereas codepoint conversion is strictly character-by-character and orthographic conversion is character-by-character within a word, lexemic conversion is word-by-word.
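The difference between the two word-level conversions can be sketched with two small dictionaries keyed by whole words. The class name WordLevelSketch, its methods, and the dictionary entries are hypothetical and chosen for illustration; the sketch assumes the input has already been split into words.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: tiny hand-made dictionaries, not CSC data.
class WordLevelSketch {
    static final Map<String, String> ORTHOGRAPHIC = new HashMap<>();
    static final Map<String, String> LEXEMIC = new HashMap<>();
    static {
        // Orthographic: the same word written with traditional characters.
        ORTHOGRAPHIC.put("头发", "頭髮");
        ORTHOGRAPHIC.put("计算机", "計算機");
        // Lexemic: the word that TC readers normally use instead.
        LEXEMIC.put("计算机", "電腦");
    }

    // Both lookups return null when the dictionary has no entry.
    static String orthographic(String word) {
        return ORTHOGRAPHIC.get(word);
    }

    static String lexemic(String word) {
        return LEXEMIC.get(word);
    }

    public static void main(String[] args) {
        System.out.println(orthographic("计算机")); // 計算機
        System.out.println(lexemic("计算机"));      // 電腦
    }
}
```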

Note

If you ask for a lexemic conversion, and none is available for a given token, CSC provides the orthographic conversion unless it is not available, in which case CSC provides a codepoint conversion.
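The fallback order in this note can be pictured as a chain of lookups. The sketch below is one way to express that order under the assumption that each level is a simple table; the class name FallbackSketch and the table contents are hypothetical, and nothing here reflects how CSC is implemented internally.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: models the lexemic, orthographic, codepoint fallback order.
class FallbackSketch {
    static final Map<String, String> LEXEMIC = new HashMap<>();
    static final Map<String, String> ORTHOGRAPHIC = new HashMap<>();
    static final Map<Integer, Integer> CODEPOINT = new HashMap<>();
    static {
        LEXEMIC.put("计算机", "電腦");
        ORTHOGRAPHIC.put("计算机", "計算機");
        ORTHOGRAPHIC.put("头发", "頭髮");
        CODEPOINT.put("发".codePointAt(0), "發".codePointAt(0));
    }

    // Requests a lexemic conversion and falls back level by level.
    static String convert(String token) {
        String lexemic = LEXEMIC.get(token);
        if (lexemic != null) {
            return lexemic;
        }
        String orthographic = ORTHOGRAPHIC.get(token);
        if (orthographic != null) {
            return orthographic;
        }
        // Last resort: map each codepoint independently.
        StringBuilder out = new StringBuilder();
        token.codePoints()
             .map(cp -> CODEPOINT.getOrDefault(cp, cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("计算机")); // lexemic entry: 電腦
        System.out.println(convert("头发"));   // orthographic entry: 頭髮
        System.out.println(convert("发展"));   // codepoint fallback: 發展
    }
}
```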

Options. When you create a script converter, you must define three options: source script, target script, and conversion level.

Output. For each token in the input, the Chinese Script Converter posts the conversion.

Mixed Input. The Chinese input may contain a mixture of TC and SC, and even some non-Chinese text. The Chinese Script Converter converts to the target (SC or TC), leaving any tokens already in the target form and any non-Chinese text unchanged.
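The pass-through behavior for mixed input can be sketched as a lookup that returns the token itself when no conversion applies. This is a simplification that assumes a single word table; the class name MixedInputSketch and its contents are hypothetical, and a real converter decides what to leave unchanged in its own way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: one hand-made SC-to-TC entry, not CSC data.
class MixedInputSketch {
    static final Map<String, String> SC_TO_TC = new HashMap<>();
    static {
        SC_TO_TC.put("头发", "頭髮");
    }

    // Tokens without an entry (already TC, or not Chinese) are returned as-is.
    static List<String> convertAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(SC_TO_TC.getOrDefault(token, token));
        }
        return out;
    }

    public static void main(String[] args) {
        // One SC token, one token already in TC, and one non-Chinese token.
        System.out.println(convertAll(Arrays.asList("头发", "頭髮", "OK")));
    }
}
```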

8.2. Using CSC with RBL-JE Core

In conjunction with the RBL-JE Tokenizer, use a CSCAnalyzer as described below.

  1. Set up a com.basistech.rosette.breaks.TokenizerFactory.

  2. Use the TokenizerFactory to create a com.basistech.rosette.breaks.Tokenizer to tokenize Chinese text.

  3. Set up a com.basistech.rosette.csc.CSCAnalyzerFactory with a conversion level.

  4. Use the CSCAnalyzerFactory to create a com.basistech.rosette.csc.CSCAnalyzer to convert from TC to SC or vice versa.

  5. Use the CSCAnalyzer to analyze each com.basistech.rosette.bl.Token found by the Tokenizer.

  6. Get the conversion (SC or TC) from each Token.

Example: convert the tokens in TC text to SC, using the orthographic conversion level.

import com.basistech.rosette.bl.Token;
import com.basistech.rosette.breaks.Tokenizer;
import com.basistech.rosette.breaks.TokenizerFactory;
import com.basistech.rosette.breaks.TokenizerOption;
import com.basistech.util.LanguageCode;
import com.basistech.rosette.csc.CSCAnalyzerFactory;
import com.basistech.rosette.csc.CSCAnalyzer;
import com.basistech.rosette.csc.CSCAnalyzerOption;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public void translate(String rootDirectory, String tcInput) {
	String licensePath =
		new File(rootDirectory, "licenses/rlp-license.xml").getAbsolutePath();

	// Configure the tokenizer factory with the RBL-JE root directory and license.
	TokenizerFactory tf = new TokenizerFactory();
	tf.setOption(TokenizerOption.rootDirectory, rootDirectory);
	tf.setOption(TokenizerOption.licensePath, licensePath);

	try {
		Tokenizer tokenizer = tf.create(new StringReader(tcInput), LanguageCode.CHINESE);

		// Configure the CSC analyzer factory for orthographic conversion.
		CSCAnalyzerFactory caf = new CSCAnalyzerFactory();
		caf.setOption(CSCAnalyzerOption.rootDirectory, rootDirectory);
		caf.setOption(CSCAnalyzerOption.licensePath, licensePath);
		caf.setOption(CSCAnalyzerOption.conversionLevel, "orthographic");

		// Source script is TC, target script is SC.
		CSCAnalyzer cscAnalyzer =
			caf.create(LanguageCode.TRADITIONAL_CHINESE, LanguageCode.SIMPLIFIED_CHINESE);

		Token token;
		while ((token = tokenizer.next()) != null) {
			// Rebuild the token's surface form from the character buffer.
			String tokenIn = new String(token.getSurfaceChars(),
				token.getSurfaceStart(),
				token.getLength());
			System.out.println("Input: " + tokenIn);
			// After analysis, the conversion is available from the token.
			cscAnalyzer.analyze(token);
			System.out.println("SC translation: " + token.getTranslation());
		}
	} catch (Exception e) {
		e.printStackTrace();
		System.exit(1);
	}
}

RBL-JE Core Distribution Sample. The RBL-JE distribution includes a sample (CSCAnalyze) that you can compile and run with an ant build script.

In a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-7.20.0.c58.3/samples/csc-analyze and call

ant run

The sample reads an input file in SC and prints each token with its TC conversion to standard output.

8.3. Using CSC with the RBL-JE Command Line Driver

You can use com.basistech.rosette.bl.RBLCmd to convert with CSC. See the Javadoc for RBLCmd: apidocs/rbl-je/com/basistech/rosette/bl/RBLCmd.html.

8.4. Using CSC with Lucene

  1. Set up a com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory.

  2. Use the BaseLinguisticsTokenizerFactory to create a com.basistech.rosette.lucene.BaseLinguisticsTokenizer, which contains a Lucene Tokenizer.

  3. Set up a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory.

  4. Use the BaseLinguisticsCSCTokenFilterFactory to create a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilter to convert from TC to SC or vice versa.

  5. Use the BaseLinguisticsCSCTokenFilter to convert each Token found by the Tokenizer.

Example: convert the tokens in TC text to SC, using the orthographic conversion level.

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory;
import com.basistech.rosette.lucene.BaseLinguisticsTokenizer;
import com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory;
import com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilter;
import com.basistech.util.LanguageCode;
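import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public static void translate(String rootDirectory, String tcInput) {
	String licensePath =
		new File(rootDirectory, "licenses/rlp-license.xml").getAbsolutePath();
	// Options shared by the tokenizer factory and the CSC filter factory.
	Map<String, String> options = new HashMap<String, String>();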
	options.put("rootDirectory", rootDirectory);
	options.put("licensePath", licensePath);
	BaseLinguisticsTokenizerFactory blTokenizerFactory =
		new BaseLinguisticsTokenizerFactory(options);

	Tokenizer tokenizer;
	try {
		tokenizer =
			blTokenizerFactory.create(new StringReader(tcInput), LanguageCode.CHINESE);

		// more options for CSC: orthographic TC to SC conversion
		options.put("language", "zht");
		options.put("targetLanguage", "zhs");
		options.put("conversionLevel", "orthographic");

		BaseLinguisticsCSCTokenFilterFactory cscTokenFilterFactory =
			new BaseLinguisticsCSCTokenFilterFactory(options);
		TokenStream tokens = cscTokenFilterFactory.create(tokenizer);
		tokens.reset();
		CharTermAttribute charTerm
			= tokens.getAttribute(CharTermAttribute.class);

		while (tokens.incrementToken()) {
			System.out.println("SC translation: " + charTerm.toString());
		}
		// Complete end-of-stream processing before releasing resources.
		tokens.end();
		tokens.close();
	} catch (Exception e) {
		e.printStackTrace();
		System.exit(1);
	}
}

RBL-JE/Lucene Distribution Sample. For supported versions of Lucene, the RBL-JE distribution includes a sample (CSCCharTermAttributeSample) that you can compile and run with an ant build script.

In a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-7.20.0.c58.3/samples/csc-analyze-4_3, rbl-je-7.20.0.c58.3/samples/csc-analyze-4_9, rbl-je-7.20.0.c58.3/samples/csc-analyze-4_10, or rbl-je-7.20.0.c58.3/samples/csc-analyze-5_0, and call

ant run

The sample reads an input file in SC and prints the TC conversion for each token to standard output.
