5.4. Creating your own RBL-JE Analysis Chain
When creating an analysis chain, you can do the following:
- Use the BaseLinguisticsTokenizerFactory to generate a language-specific tokenizer that applies Rosette Base Linguistics to tokenize text.
- Use the BaseLinguisticsTokenFilterFactory to generate a language-specific token filter that enhances a stream of tokens.
- Add other token filters to the analysis chain.
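For example, the first two items can be wired together as in the following sketch. This is a minimal fragment, not a complete program: it assumes the same imports as the full sample in Section 5.4.1 below, and that rootDirectory and reader are already defined; the option names shown are the ones that sample uses.

// Tokenizer factory: produces a Rosette Base Linguistics tokenizer.
TokenizerFactory tokenizerFactory = new TokenizerFactory();
tokenizerFactory.setOption(TokenizerOption.rootDirectory, rootDirectory);

// Token filter factory: configured with a map of options.
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", rootDirectory);
BaseLinguisticsTokenFilterFactory tokenFilterFactory =
        new BaseLinguisticsTokenFilterFactory(options);

// Chain them: the tokenizer first, then the token filter
// (and then any other token filters you choose to add).
Tokenizer tokenizer = new BaseLinguisticsTokenizer(
        tokenizerFactory.create(null, LanguageCode.JAPANESE));
tokenizer.setReader(reader); // reader: a java.io.Reader over your input text
TokenStream tokens = tokenFilterFactory.create(tokenizer);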
5.4.1. Japanese Tokenizer and Filter Sample
This Lucene sample, JapaneseTokenizerAndFilterSample.java, creates an analysis chain to generate an enriched token stream. The sample does the following:
- Uses a tokenizer factory to set up a language-specific base linguistics tokenizer, which puts tokens in the token stream.
- Uses a base linguistics token filter factory to set up a language-specific base linguistics token filter, which adds lemmas and readings to the tokens in the token stream.
- Includes the LowerCaseFilter [4] and CJKWidthFilter [5] to replicate the behavior of the analyzer in the previous example.
- Writes each element in the token stream with its type attribute to an output file.
package com.basistech.rosette.samples;

import com.basistech.rosette.breaks.TokenizerFactory;
import com.basistech.rosette.breaks.TokenizerOption;
import com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory;
import com.basistech.rosette.lucene.BaseLinguisticsTokenizer;
import com.basistech.util.LanguageCode;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Example program that uses a Japanese tokenizer and a Japanese TokenFilter.
 * This does not set up and run a Lucene index; it just shows the construction
 * of the analysis chain.
 */
public final class JapaneseTokenizerAndFilterSample {
    private TokenizerFactory tokenizerFactory;
    private BaseLinguisticsTokenFilterFactory tokenFilterFactory;
    private String rootDirectory;
    private String inputPathname;
    private String outputPathname;

    private JapaneseTokenizerAndFilterSample() {
        //
    }

    private void initialize() {
        File rootPath = new File(rootDirectory);
        String licensePath = new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath();
        tokenizerFactory = new TokenizerFactory();
        tokenizerFactory.setOption(TokenizerOption.rootDirectory, rootDirectory);
        tokenizerFactory.setOption(TokenizerOption.licensePath, licensePath);
        tokenizerFactory.setOption(TokenizerOption.partOfSpeech, "true");

        Map<String, String> options = new HashMap<>();
        options.put("language", "jpn");
        options.put("rootDirectory", rootDirectory);
        options.put("addReadings", "true");
        tokenFilterFactory = new BaseLinguisticsTokenFilterFactory(options);
    }

    private void run() throws IOException {
        try (BufferedReader input = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputPathname), StandardCharsets.UTF_8))) {
            // Skip a leading byte order mark, if present.
            input.mark(1);
            int bomPerhaps = input.read();
            if (bomPerhaps != 0xfeff) {
                input.reset();
            }
            Tokenizer tokenizer = new BaseLinguisticsTokenizer(
                    tokenizerFactory.create(null, LanguageCode.JAPANESE));
            tokenizer.setReader(input);
            TokenStream tokens = tokenFilterFactory.create(tokenizer);
            // To replicate behavior of JavaAnalyzerSample, include LowerCaseFilter
            // (a Japanese document may contain Roman script) and CJKWidthFilter
            // (to normalize fullwidth Roman script letters and digits to halfwidth,
            // and halfwidth Katakana variants into the equivalent Kana).
            tokens = new LowerCaseFilter(tokens);
            tokens = new CJKWidthFilter(tokens);
            tokens.reset();
            CharTermAttribute charTerm = tokens.getAttribute(CharTermAttribute.class);
            TypeAttribute type = tokens.getAttribute(TypeAttribute.class);
            try (PrintWriter printWriter = new PrintWriter(
                    new OutputStreamWriter(new FileOutputStream(outputPathname), StandardCharsets.UTF_8))) {
                while (tokens.incrementToken()) {
                    printWriter.format("%s\t%s%n", charTerm.toString(), type.type());
                }
            }
        } catch (IOException ie) {
            System.err.printf("Failed to open input file %s%n", inputPathname);
            System.exit(1);
            return;
        }
        System.out.println("See " + outputPathname);
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Usage: rootDirectory input output");
            return;
        }
        JapaneseTokenizerAndFilterSample that = new JapaneseTokenizerAndFilterSample();
        that.rootDirectory = args[0];
        that.inputPathname = args[1];
        that.outputPathname = args[2];
        that.initialize();
        try {
            that.run();
        } catch (IOException e) {
            System.err.println("Exception processing the data.");
            e.printStackTrace();
        }
    }
}
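As the printWriter.format call shows, each line of the output file contains one token and its type attribute, separated by a tab.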
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-7.20.0.c58.3/samples/lucene-5_0 and use the Ant build script:
ant runTokenizerAndFilter
The example reads the same file as the previous sample and writes the output to jpn-analyzedbyTokenizerAndFilter.txt. The output matches the content generated by the previous example.
5.4.2. Using the BaseLinguisticsSegmentationTokenFilter
If you are using your own whitespace tokenizer and processing text that requires segmenting Chinese, Japanese, or Thai, you can use the BaseLinguisticsSegmentationTokenFilterFactory to create a BaseLinguisticsSegmentationTokenFilter, then place the segmentation token filter in the analysis chain after the whitespace tokenizer and before other filters, such as a base linguistics token filter.
The segmentation token filter segments each of the tokens from the whitespace tokenizer into individual tokens where necessary. See the Javadoc for the RBL-JE API for Lucene 4.3-4.8, Lucene 4.9, Lucene 4.10, Lucene 5.0-5.5, Lucene 6.0-6.1, or Lucene 6.2-6.3.
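The fragment below sketches such a chain. WhitespaceTokenizer is Lucene's own whitespace tokenizer (the no-argument Lucene 5.x constructor is shown); constructing BaseLinguisticsSegmentationTokenFilterFactory from a map of options is an assumption here, mirroring BaseLinguisticsTokenFilterFactory above, so verify the constructor and option names against the Javadoc for your Lucene version. As in the earlier sketch, rootDirectory and reader are assumed to be defined.

// The whitespace tokenizer comes first.
Tokenizer whitespace = new org.apache.lucene.analysis.core.WhitespaceTokenizer();
whitespace.setReader(reader); // reader: a java.io.Reader over your input text

// The segmentation token filter follows it, splitting whitespace-delimited
// tokens into individual Chinese, Japanese, or Thai tokens where necessary.
// (Map-based options are an assumption, mirroring BaseLinguisticsTokenFilterFactory.)
Map<String, String> segmentationOptions = new HashMap<>();
segmentationOptions.put("language", "jpn");
segmentationOptions.put("rootDirectory", rootDirectory);
TokenStream tokens =
        new BaseLinguisticsSegmentationTokenFilterFactory(segmentationOptions).create(whitespace);

// Other filters come after the segmentation filter, for example
// a base linguistics token filter.
Map<String, String> filterOptions = new HashMap<>();
filterOptions.put("language", "jpn");
filterOptions.put("rootDirectory", rootDirectory);
tokens = new BaseLinguisticsTokenFilterFactory(filterOptions).create(tokens);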
4. http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html
5. http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html