Chapter 5. Using RBL-JE in Apache Lucene
5.1. Introduction
RBL-JE provides an API for integrating with Apache Lucene 4.3-6.3. See the Javadoc that accompanies this package.
RBL-JE supports two usage patterns for incorporating Rosette Base Linguistics in a Lucene application.
- Use the RBL-JE Lucene base linguistics analyzer to parse an input stream in one of the languages RBL-JE supports and to generate a token stream with tokens and lemmas.
- Create your own analysis chain, using a language-specific RBL-JE tokenizer and token filter.
5.2. Samples
Samples are included in the use cases below. The code comes from the Lucene 5.0 samples found in rbl-je-7.20.0.c58.3/samples/lucene-5_0. Samples for other versions of Lucene are located in the following directories:
| Lucene Version | Sample Directory |
|---|---|
| 4.3 - 4.8 | rbl-je-7.20.0.c58.3/samples/lucene-4_3 |
| 4.9 | rbl-je-7.20.0.c58.3/samples/lucene-4_9 |
| 4.10 | rbl-je-7.20.0.c58.3/samples/lucene-4_10 |
| 5.0 - 5.5 | rbl-je-7.20.0.c58.3/samples/lucene-5_0 |
| 6.0 - 6.1 | rbl-je-7.20.0.c58.3/samples/lucene-6_0 |
| 6.2 - 6.3 | rbl-je-7.20.0.c58.3/samples/lucene-6_2 |
5.3. Using the RBL-JE Lucene Base Linguistics Analyzer
The RBL-JE Lucene base linguistics analyzer (BaseLinguisticsAnalyzer) provides an analysis chain with a language-specific tokenizer and token filter.
You can configure the analyzer with a number of tokenizer and token filter options (see the Javadoc).
The analysis chain also includes the LowerCaseFilter and CJKWidthFilter, and it provides support for including a StopFilter.
Japanese Analyzer Sample. This Lucene sample, JapaneseAnalyzerSample.java, uses the base linguistics analyzer to generate an enriched token stream from Japanese text. The sample does the following:
- Assembles a set of options that include the root directory, the path to the Rosette license, the language to analyze (Japanese), and the generation of part-of-speech tags for the tokens.
- Instantiates a Japanese base linguistics analyzer with these options.
- Reads in a Japanese text file.
- Uses the analyzer to generate a token stream. The token stream contains tokens, lemmas that are not identical to their tokens, and readings. Disambiguation is turned off by default for Japanese, so multiple analyses may be returned for each token. To turn on disambiguation, add options.put("disambiguate", "true"); to the construction of the options Map.
- Writes each element in the token stream with its type attribute to an output file.
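As a minimal sketch of the options map described in the steps above, the following builds the same options the sample uses and adds the disambiguate entry. The class and method names (JapaneseOptionsSketch, buildOptions) and the placeholder paths are hypothetical, introduced only for illustration; the option keys are the ones that appear in the sample. The sketch does not construct an analyzer, so it runs without RBL-JE on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/** Builds the analyzer options used by the sample, with disambiguation turned on. */
final class JapaneseOptionsSketch {

    // Hypothetical helper, not part of RBL-JE. The keys mirror the sample's options.
    static Map<String, String> buildOptions(String rootDirectory, String licensePath) {
        Map<String, String> options = new HashMap<>();
        options.put("language", "jpn");
        options.put("rootDirectory", rootDirectory);
        options.put("licensePath", licensePath);
        options.put("partOfSpeech", "true");
        options.put("addReadings", "true");
        // Off by default for Japanese; requests disambiguated analyses.
        options.put("disambiguate", "true");
        return options;
    }

    public static void main(String[] args) {
        // Placeholder paths; substitute your own installation directory.
        Map<String, String> options =
            buildOptions("/path/to/rbl-je", "/path/to/rbl-je/licenses/rlp-license.xml");
        // In the sample, this map is passed to: new BaseLinguisticsAnalyzer(options)
        options.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```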
    package com.basistech.rosette.samples;

    import com.basistech.rosette.lucene.BaseLinguisticsAnalyzer;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Example program that does Japanese analysis with a Japanese base
     * linguistics analyzer. This does not set up and run a Lucene index;
     * it just shows the construction of the analysis chain.
     */
    public final class JapaneseAnalyzerSample {
        private String rootDirectory;
        private String inputPathname;
        private String outputPathname;
        private Analyzer rblAnalyzer;

        private JapaneseAnalyzerSample() {
            //
        }

        private void initialize() {
            File rootPath = new File(rootDirectory);
            String licensePath = new File(
                rootPath, "licenses/rlp-license.xml").getAbsolutePath();
            Map<String, String> options = new HashMap<>();
            options.put("language", "jpn");
            options.put("rootDirectory", rootDirectory);
            options.put("licensePath", licensePath);
            options.put("partOfSpeech", "true");
            options.put("addReadings", "true");
            rblAnalyzer = new BaseLinguisticsAnalyzer(options);
        }

        private void run() throws IOException {
            try (BufferedReader input = new BufferedReader(new InputStreamReader(
                    new FileInputStream(inputPathname), StandardCharsets.UTF_8))) {
                // Skip a leading byte order mark, if there is one.
                input.mark(1);
                int bomPerhaps = input.read();
                if (bomPerhaps != 0xfeff) {
                    input.reset();
                }
                TokenStream tokens = rblAnalyzer.tokenStream("dummy", input);
                tokens.reset();
                CharTermAttribute charTerm = tokens.getAttribute(CharTermAttribute.class);
                TypeAttribute type = tokens.getAttribute(TypeAttribute.class);
                try (PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(
                        new FileOutputStream(outputPathname), StandardCharsets.UTF_8))) {
                    // Write one line per element: text, tab, type attribute.
                    while (tokens.incrementToken()) {
                        printWriter.format("%s\t%s%n", charTerm.toString(), type.type());
                    }
                }
            } catch (IOException ie) {
                System.err.printf("Failed to open input file %s%n", inputPathname);
                System.exit(1);
                return;
            }
            System.out.println("See " + outputPathname);
        }

        public static void main(String[] args) {
            if (args.length != 3) {
                System.err.println("Usage: rootDirectory input output");
                return;
            }
            JapaneseAnalyzerSample that = new JapaneseAnalyzerSample();
            that.rootDirectory = args[0];
            that.inputPathname = args[1];
            that.outputPathname = args[2];
            that.initialize();
            try {
                that.run();
            } catch (IOException e) {
                System.err.println("Exception processing the data.");
                e.printStackTrace();
            }
        }
    }
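The run() method in the sample skips a leading byte order mark (U+FEFF) before handing the reader to the analyzer. The following sketch isolates that step using only the standard library, so it can be tried without RBL-JE or Lucene. The class and method names (BomSkipSketch, skipBom) are hypothetical and are not part of the sample.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Isolates the byte-order-mark check that the sample performs on its input. */
final class BomSkipSketch {

    // Hypothetical helper: returns a reader positioned after a leading U+FEFF, if any.
    static BufferedReader skipBom(Reader source) throws IOException {
        BufferedReader input = new BufferedReader(source);
        input.mark(1);               // remember the start; one character of read-ahead suffices
        int first = input.read();
        if (first != 0xFEFF) {
            input.reset();           // not a BOM: rewind so the character is not lost
        }
        return input;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader withBom = skipBom(new StringReader("\uFEFFtext with BOM"));
        BufferedReader withoutBom = skipBom(new StringReader("text without BOM"));
        System.out.println(withBom.readLine());
        System.out.println(withoutBom.readLine());
    }
}
```

Decoding the file as UTF-8 turns the three BOM bytes into the single character U+FEFF, which is why the check compares one character rather than three bytes.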
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-7.20.0.c58.3/samples/lucene-5_0 and use the Ant build script:
ant runAnalyzer
The sample reads rbl-je-7.20.0.c58.3/samples/data/jpn-text.txt and writes the output to jpn-Analyzed-byAnalyzer.txt.
The output lists each token, lemma, and reading in the token stream on a separate line, followed by the type attribute for that element: <ALNUM> for tokens, <LEMMA> for lemmas, and <READING> for readings. A token may have more than one analysis, in which case more than one lemma or reading follows it in the output. Lemmas that are identical to the token are not put into the token stream.
For example:
メルボルン <ALNUM>
メルボルン <READING>
で <ALNUM>
行わ <ALNUM>
行う <LEMMA>
オコナワ <READING>
れ <ALNUM>
る <LEMMA>
れる <LEMMA>
レ <READING>
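To work with this output programmatically, the lines can be grouped so that each token carries its own lemmas and readings. The sketch below is one way to do that, under two assumptions: each line is the element text, a tab, and the type (as written by the sample's format string), and the lemmas and readings for a token directly follow it, as in the example above. Any type other than <LEMMA> or <READING> is treated as the start of a new token. The class names (AnalyzedOutputSketch, Entry) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Groups lines of analyzer output into tokens with their lemmas and readings. */
final class AnalyzedOutputSketch {

    /** One token plus the lemmas and readings listed after it. */
    static final class Entry {
        final String token;
        final List<String> lemmas = new ArrayList<>();
        final List<String> readings = new ArrayList<>();

        Entry(String token) {
            this.token = token;
        }

        @Override
        public String toString() {
            return token + " lemmas=" + lemmas + " readings=" + readings;
        }
    }

    static List<Entry> group(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            String text = parts[0];
            String type = parts[1].trim();
            boolean isLemma = type.equals("<LEMMA>");
            boolean isReading = type.equals("<READING>");
            if (!isLemma && !isReading) {
                // Assumption: any other type marks the start of a new token.
                current = new Entry(text);
                entries.add(current);
            } else if (current != null) {
                (isLemma ? current.lemmas : current.readings).add(text);
            }
            // A lemma or reading that appears before any token is ignored.
        }
        return entries;
    }

    public static void main(String[] args) {
        // Lines taken from the example output above, with a tab as separator.
        List<String> lines = Arrays.asList(
            "行わ\t<ALNUM>", "行う\t<LEMMA>", "オコナワ\t<READING>",
            "れ\t<ALNUM>", "る\t<LEMMA>", "れる\t<LEMMA>", "レ\t<READING>");
        for (Entry entry : group(lines)) {
            System.out.println(entry);
        }
    }
}
```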