Chapter 6. Using RBL-JE in Apache Solr
6.1. Introduction
You can incorporate Rosette Base Linguistics in applications built on Solr 4.3 through 6.3.
To index and search documents with RBL-JE in a Solr application, you must add JARs to the Solr classpath and define Solr analysis chains that apply the RBL analysis components to process text at index and query time.
6.2. Adding to the Solr Classpath
Add the following lib elements to the solrconfig.xml file of each Solr collection you are using.
<lib path="[[RBLJE_ROOT]]/lib/btrbl-je-7.20.0.c58.3.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/btcommon-api-37.1.0.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/slf4j-api-1.7.5.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/slf4j-simple-1.7.5.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/btrbl-je-lucene-solr-[[LUCENE_SOLR_VER]]-7.20.0.c58.3.jar"/>
where you replace [[RBLJE_ROOT]] with the path to the root of your RBL-JE installation, and [[LUCENE_SOLR_VER]] with the tag for the version of Solr that you are using: 4_3 for Solr 4.3 through 4.8; 4_9 for Solr 4.9; 4_10 for Solr 4.10; 5_0 for Solr 5.0 through 5.5; 6_0 for Solr 6.0 through 6.1; and 6_2 for Solr 6.2 through 6.3.
For example, if the root of the RBL-JE installation is /opt/local/bt/rbl-je, the lib entry for the lucene-solr JAR under Solr 6.0 is:
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-6_0-7.20.0.c58.3.jar"/>
The slf4j JARs enable logging [16].
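As an illustration, the complete set of lib elements might look like the following for an installation rooted at /opt/local/bt/rbl-je (the path used in the example above) running Solr 6.2 or 6.3, which take the 6_2 tag. The JAR names are the ones listed above; the root path and the Solr version are assumptions, so adjust both to match your installation.

```xml
<!-- Sketch only: assumes RBL-JE is installed in /opt/local/bt/rbl-je
     and that the collection runs on Solr 6.2 or 6.3 (tag 6_2). -->
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-7.20.0.c58.3.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/btcommon-api-37.1.0.jar"/>
<!-- slf4j JARs for logging -->
<lib path="/opt/local/bt/rbl-je/lib/slf4j-api-1.7.5.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/slf4j-simple-1.7.5.jar"/>
<!-- Lucene/Solr integration JAR; the 6_2 tag must match the Solr version -->
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-6_2-7.20.0.c58.3.jar"/>
```

These elements belong inside the top-level config element of solrconfig.xml.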
6.3. Defining a Solr Analysis Chain
In the Solr schema.xml, add a fieldType element and a corresponding field element for the language of the documents processed by the application.
Field Type. The fieldType includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter.
Here, for example, is a fieldType for Japanese:
<fieldtype name="basis-japanese" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="jpn"
               rootDirectory="[[RBLJE_ROOT]]"
    />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="jpn"
            rootDirectory="[[RBLJE_ROOT]]"
    />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="jpn"
               rootDirectory="[[RBLJE_ROOT]]"
    />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="jpn"
            rootDirectory="[[RBLJE_ROOT]]"
            query="true"
    />
  </analyzer>
</fieldtype>
where you replace [[RBLJE_ROOT]] with the path to the root of your RBL-JE installation. The fieldType name indicates the language, and each language attribute is set to jpn, the ISO 639-3 [109] language code for Japanese.
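The same pattern should carry over to other languages by changing the fieldType name and the language code. The following is a minimal sketch for German, assuming your RBL-JE installation and license cover German. The name basis-german is a hypothetical choice, and deu is the ISO 639-3 code for German.

```xml
<!-- Sketch only: "basis-german" is an illustrative name; "deu" is the
     ISO 639-3 code for German. Replace [[RBLJE_ROOT]] as described above. -->
<fieldtype name="basis-german" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="deu"
               rootDirectory="[[RBLJE_ROOT]]"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="deu"
            rootDirectory="[[RBLJE_ROOT]]"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="deu"
               rootDirectory="[[RBLJE_ROOT]]"/>
    <!-- query="true" appears only on the query-time token filter, as in the Japanese example -->
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="deu"
            rootDirectory="[[RBLJE_ROOT]]"
            query="true"/>
  </analyzer>
</fieldtype>
```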
Note: You can add any other Solr filter that you need to the analyzers, such as the Solr lowercase filter, but place it in the chain after the Base Linguistics token filter. If the token stream is modified too radically before RBL processes it, RBL's ability to analyze the text is degraded.
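As a sketch of the ordering described in the note, an index analyzer that also applies Solr's standard solr.LowerCaseFilterFactory could be laid out as follows. Whether lowercasing is useful depends on your content; the point illustrated is only that the additional filter follows the Base Linguistics token filter.

```xml
<!-- Sketch only: the extra Solr filter is placed AFTER the RBL token filter. -->
<analyzer type="index">
  <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
             language="jpn"
             rootDirectory="[[RBLJE_ROOT]]"/>
  <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
          language="jpn"
          rootDirectory="[[RBLJE_ROOT]]"/>
  <!-- Standard Solr filter; runs on the tokens RBL has already analyzed -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

If you add a filter to the index analyzer, you would normally add the same filter in the same position in the query analyzer so that index-time and query-time tokens match.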
Field. The analysis chain requires a field definition with a type attribute that maps to the fieldType. For the Japanese example above, add the following field definition to schema.xml.
<field name="text-japanese" type="basis-japanese" indexed="true" stored="true"/>
In your Solr application, you can now index and query Japanese documents placed in the text-japanese field.
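To make the indexing step concrete, here is a minimal sketch of a Solr XML update message that places Japanese text in the text-japanese field. It assumes the schema also defines a unique key field named id, which is not part of the example above; the identifier doc-1 and the sample sentence are illustrative only.

```xml
<!-- Sketch only: assumes a uniqueKey field named "id" exists in the schema. -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- Text in this field passes through the basis-japanese index analyzer -->
    <field name="text-japanese">東京は日本の首都です。</field>
  </doc>
</add>
```

A message like this is sent to the collection's update handler and becomes searchable after a commit.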
Setting Options. Most API options [86] can be used in an analysis chain. See Using Options in Solr [100] for a detailed discussion of how this works, and see Using Solr [67] for a specific discussion of using user-defined dictionaries.