Chapter 6. Using RBL-JE in Apache Solr

6.1. Introduction

You can incorporate Rosette Base Linguistics in Solr 4.3-6.3 applications.

To index and search documents with RBL-JE in a Solr application, you must add JARs to the Solr classpath and define Solr analysis chains that apply the RBL analysis components to process text at index and query time.

6.2. Adding to the Solr Classpath

Add the following lib elements to the solrconfig.xml for each Solr collection you are using.

<lib path="[[RBLJE_ROOT]]/lib/btrbl-je-7.20.0.c58.3.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/btcommon-api-37.1.0.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/slf4j-api-1.7.5.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/slf4j-simple-1.7.5.jar"/>
<lib path="[[RBLJE_ROOT]]/lib/btrbl-je-lucene-solr-[[LUCENE_SOLR_VER]]-7.20.0.c58.3.jar"/>

where you replace [[RBLJE_ROOT]] with the path to the root of your RBL-JE installation, and [[LUCENE_SOLR_VER]] with the tag for the version of Solr that you are using:

Solr 4.3 through 4.8: use 4_3
Solr 4.9: use 4_9
Solr 4.10: use 4_10
Solr 5.0 through 5.5: use 5_0
Solr 6.0 through 6.1: use 6_0
Solr 6.2 through 6.3: use 6_2

For example, if the root of the RBL-JE installation is /opt/local/bt/rbl-je, the lib entry for the lucene-solr JAR under Solr 6.0 is:

<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-6_0-7.20.0.c58.3.jar"/>

The slf4j JARs enable logging [16].
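
As a fuller illustration, the complete set of lib entries for that same example can be sketched as follows. This assumes the installation root /opt/local/bt/rbl-je and Solr 6.0 (tag 6_0) from the example above; the JAR names are the ones listed earlier in this section, so adjust the path and tag to match your own installation.

```xml
<!-- Sketch only: assumes RBL-JE is installed at /opt/local/bt/rbl-je
     and that the collection runs on Solr 6.0 or 6.1 (tag 6_0). -->
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-7.20.0.c58.3.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/btcommon-api-37.1.0.jar"/>
<!-- slf4j JARs, needed for logging -->
<lib path="/opt/local/bt/rbl-je/lib/slf4j-api-1.7.5.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/slf4j-simple-1.7.5.jar"/>
<!-- Lucene/Solr integration JAR; the 6_0 tag selects the Solr 6.0-6.1 build -->
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-6_0-7.20.0.c58.3.jar"/>
```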

6.3. Defining a Solr Analysis Chain

In the Solr schema.xml, add a fieldType element and a corresponding field element for the language of the documents processed by the application.

Field Type. The fieldType includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter.

Here, for example, is a fieldType for Japanese:

<fieldtype name="basis-japanese" class="solr.TextField">
  <!-- Analyzer applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="jpn"
               rootDirectory="[[RBLJE_ROOT]]"
    />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="jpn"
            rootDirectory="[[RBLJE_ROOT]]"
    />
  </analyzer>
  <!-- Analyzer applied to queries; note query="true" on the token filter -->
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="jpn"
               rootDirectory="[[RBLJE_ROOT]]"
    />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="jpn"
            rootDirectory="[[RBLJE_ROOT]]"
            query="true"
    />
  </analyzer>
</fieldtype>

where you replace [[RBLJE_ROOT]] with the path to the root of your RBL-JE installation. The fieldType name indicates the language, and each language attribute is set to jpn, the ISO 639-3 [109] language code for Japanese.

Note: You can add any Solr filter that you need to the analyzers, such as the Solr lowercase filter, but place it in the chain after the Base Linguistics token filter. If the token stream is modified too radically before RBL processes it, RBL's ability to analyze the text is degraded.
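
To illustrate the ordering described in the note, here is a minimal sketch of the index analyzer from the Japanese fieldType with one extra filter appended. The choice of solr.LowerCaseFilterFactory is only an example; the point is its position after the Base Linguistics token filter. The same placement would apply to the query analyzer.

```xml
<analyzer type="index">
  <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
             language="jpn"
             rootDirectory="[[RBLJE_ROOT]]"
  />
  <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
          language="jpn"
          rootDirectory="[[RBLJE_ROOT]]"
  />
  <!-- Additional Solr filters go here, after the Base Linguistics token filter -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```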

Field. The analysis chain requires a field definition with a type attribute that maps to the fieldType. For the Japanese example above, add the following field definition to schema.xml.


<field name="text-japanese" type="basis-japanese" indexed="true" stored="true"/>

In your Solr application, you can now index and query Japanese documents placed in the text-japanese field.
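
As a rough sketch of what indexing into this field might look like, the following Solr XML update message adds one document. It assumes your schema also defines a unique key field named id, which is not part of the example above; the document ID and the sample text are illustrative only.

```xml
<add>
  <doc>
    <!-- "id" is an assumed unique key field; use whatever your schema defines -->
    <field name="id">doc-1</field>
    <!-- Text in this field is analyzed by the basis-japanese fieldType -->
    <field name="text-japanese">東京は日本の首都です。</field>
  </doc>
</add>
```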

Setting Options. Most API options [86] can be used in an analysis chain. See Using Options in Solr [100] for a detailed discussion of how this works, and see Using Solr [67] for a specific discussion of using user-defined dictionaries.
