7.3. Activating User Dictionaries

7.3.1. Using the Core API

com.basistech.rosette.breaks.TokenizerFactory and com.basistech.rosette.bl.AnalyzerFactory each provide a method, addUserDefinedDictionary, for loading segmentation dictionaries and lemma dictionaries, respectively. Here is an example of loading a segmentation dictionary with TokenizerFactory:

TokenizerFactory factory = new TokenizerFactory();
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");

Here is an example of loading a lemma dictionary in the AnalyzerFactory:

EnumMap<AnalyzerOption, String> options = new EnumMap<AnalyzerOption, String>(AnalyzerOption.class);
options.put(AnalyzerOption.caseSensitive, "true");
AnalyzerFactory factory = new AnalyzerFactory();
factory.addUserDefinedDictionary(LanguageCode.ITALIAN, "/path/to/my/ita-dict.bin", options);
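Both calls above take the path to a compiled dictionary file, so it can help to validate that path before handing it to a factory. The following is a minimal sketch using only the standard Java library. The class DictionaryPathCheck and the method requireReadableDictionary are hypothetical names introduced for illustration and are not part of the Rosette API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Rosette API.
class DictionaryPathCheck {

    // Returns the normalized absolute path as a string if it points to an
    // existing, readable regular file; otherwise throws IllegalArgumentException.
    static String requireReadableDictionary(String dictionaryPath) {
        Path path = Paths.get(dictionaryPath).toAbsolutePath().normalize();
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Dictionary file not found: " + path);
        }
        if (!Files.isReadable(path)) {
            throw new IllegalArgumentException("Dictionary file is not readable: " + path);
        }
        return path.toString();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a compiled dictionary in this demo.
        Path temp = Files.createTempFile("jpn-dict", ".bin");
        try {
            System.out.println("OK: " + requireReadableDictionary(temp.toString()));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

The returned string could then be passed as the path argument of addUserDefinedDictionary, so that a missing file is reported with a clear message before any factory is configured.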

7.3.2. Using Lucene

In the com.basistech.rosette.lucene package, BaseLinguisticsTokenizerFactory and BaseLinguisticsTokenFilterFactory can load segmentation and lemma dictionaries respectively. BaseLinguisticsTokenizerFactory provides the method addUserDefinedDictionary for adding a segmentation dictionary. For example:

Map<String, String> args = new HashMap<String, String>();
args.put(TokenizerOption.language.name(), "jpn"); // ISO 639-3 code; map values must be Strings
args.put(TokenizerOption.nfkcNormalize.name(), "true");
args.put(TokenizerOption.partOfSpeech.name(), "true");
BaseLinguisticsTokenizerFactory factory = new BaseLinguisticsTokenizerFactory(args);
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");

The constructor for BaseLinguisticsTokenFilterFactory takes a Map of options. Use the userDefinedDictionaryPath option to load a lemma dictionary:

Map<String, String> options = new HashMap<String, String>();
options.put("language", "ita"); // ISO 639-3 code; map values must be Strings
options.put("userDefinedDictionaryPath", "/path/to/my/ita-dict.bin");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
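When several filters are configured, the option map can be assembled in one place. The sketch below builds the same three entries used in the example above with plain Java collections. LemmaFilterOptions and build are hypothetical names introduced for illustration, and the use of the ISO 639-3 code "ita" as the language value is an assumption based on the language codes used in the Solr and Elasticsearch examples in this section.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the Rosette API.
class LemmaFilterOptions {

    // Builds the option map for a token filter that loads a lemma dictionary.
    // The keys are the option names shown in this section.
    static Map<String, String> build(String language, String dictionaryPath, boolean caseSensitive) {
        if (language == null || dictionaryPath == null) {
            throw new IllegalArgumentException("language and dictionaryPath are required");
        }
        Map<String, String> options = new HashMap<String, String>();
        options.put("language", language);
        options.put("userDefinedDictionaryPath", dictionaryPath);
        options.put("caseSensitive", Boolean.toString(caseSensitive));
        return options;
    }

    public static void main(String[] args) {
        // "ita" is assumed to be the expected language value (ISO 639-3).
        Map<String, String> options = build("ita", "/path/to/my/ita-dict.bin", true);
        System.out.println(options);
    }
}
```

The resulting map has the same shape as the one passed to the BaseLinguisticsTokenFilterFactory constructor above.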

7.3.3. Using Solr

Use the option userDefinedDictionaryPath as shown in this example:

<fieldType class="solr.TextField" name="basis-japanese">
    <analyzer>
        <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
          alternativeTokenization="true"
          language="jpn"
          rootDirectory="${bt_root}"
          userDefinedDictionaryPath="/path/to/my/jpn-udd.bin"/>
        <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
          alternativeTokenization="true"
          language="jpn"
          rootDirectory="${bt_root}"/>
    </analyzer>
</fieldType>

Here is an example of using a JLA reading dictionary:

<fieldType class="solr.TextField" name="basis-japanese-rd">
    <analyzer>
        <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
            alternativeTokenization="true"
            alternativeTokenizationOptions="{Readings: true}"
            language="jpn"
            rootDirectory="${bt_root}"
            userDefinedReadingDictionaryPath="/path/to/my/readings.bin"/>
        <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            alternativeTokenization="true"
            language="jpn"
            rootDirectory="${bt_root}"/>
    </analyzer>
</fieldType>

7.3.4. Using Elasticsearch

At the analyzer level in a mapping, use either lemDictionaryPath or segDictionaryPath. At the tokenizer or filter level, use userDefinedDictionaryPath. The following example illustrates how to use a lemmatization dictionary at the analyzer level:

{
    "settings":{
        "analysis":{
            "analyzer":{
                "rbl":{
                    "type":"rbl",
                    "language":"eng",
                    "lemDictionaryPath":"/path/to/lemma_dictionary.bin"
                }
            }
        }
    }
}

The following example illustrates how to use a lemmatization dictionary at the filter level:

{
    "settings":{
        "analysis":{
            "analyzer":{
                "rbl":{
                    "type":"custom",
                    "language":"eng",
                    "tokenizer":"rbl_tok",
                    "filter":["rbl_filter"]
                }
            },
            "tokenizer":{
                "rbl_tok":{
                    "type":"rbl",
                    "language":"eng"
                }
            },
            "filter":{
                "rbl_filter":{
                    "type":"rbl",
                    "language":"eng",
                    "userDefinedDictionaryPath":"/path/to/lemma_dictionary.bin"
                }
            }
        }
    }
}

The following examples illustrate how to use a segmentation dictionary at the analyzer level and at the tokenizer level:

{
    "settings":{
        "analysis":{
            "analyzer":{
                "rbl":{
                    "type":"rbl",
                    "language":"jpn",
                    "segDictionaryPath":"/path/to/segmentation_dictionary.bin"
                }
            }
        }
    }
}

{
    "settings":{
        "analysis":{
            "analyzer":{
                "rbl":{
                    "type":"custom",
                    "language":"jpn",
                    "tokenizer":"rbl_tok",
                    "filter":["rbl_filter"]
                }
            },
            "tokenizer":{
                "rbl_tok":{
                    "type":"rbl",
                    "language":"jpn",
                    "userDefinedDictionaryPath":"/path/to/segmentation_dictionary.bin"
                }
            },
            "filter":{
                "rbl_filter":{
                    "type":"rbl",
                    "language":"jpn"
                }
            }
        }
    }
}
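The naming rule used in the Elasticsearch examples (lemDictionaryPath or segDictionaryPath at the analyzer level, userDefinedDictionaryPath at the tokenizer or filter level) can be summarized in a few lines of Java. This is only a sketch of the rule as described in this section. The class DictionaryOptionName and its enums are hypothetical names introduced for illustration, and the method does not check whether a given component actually supports a given dictionary type.

```java
// Hypothetical helper, not part of the Rosette API or the Elasticsearch plugin.
class DictionaryOptionName {

    enum Level { ANALYZER, TOKENIZER, FILTER }

    enum DictionaryType { SEGMENTATION, LEMMA }

    // Returns the settings key for a user dictionary path, following the
    // rule stated in this section. No compatibility validation is done.
    static String optionName(Level level, DictionaryType type) {
        if (level == Level.ANALYZER) {
            return type == DictionaryType.LEMMA ? "lemDictionaryPath" : "segDictionaryPath";
        }
        // Tokenizer and filter definitions share a single option name.
        return "userDefinedDictionaryPath";
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            for (DictionaryType type : DictionaryType.values()) {
                System.out.println(level + " / " + type + " : " + optionName(level, type));
            }
        }
    }
}
```

In the examples above, the tokenizer loads the segmentation dictionary and the filter loads the lemma dictionary, so in practice only those two combinations appear below the analyzer level.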
