7.3. Activating User Dictionaries
7.3.1. Using the Core API
com.basistech.rosette.breaks.TokenizerFactory and com.basistech.rosette.bl.AnalyzerFactory each include a method, addUserDefinedDictionary, for loading segmentation dictionaries and lemma dictionaries, respectively.
Here is an example of loading a segmentation dictionary in the TokenizerFactory:
TokenizerFactory factory = new TokenizerFactory();
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");
Here is an example of loading a lemma dictionary in the AnalyzerFactory:
EnumMap<AnalyzerOption, String> options = new EnumMap<AnalyzerOption, String>(AnalyzerOption.class);
options.put(AnalyzerOption.caseSensitive, "true");
AnalyzerFactory factory = new AnalyzerFactory();
factory.addUserDefinedDictionary(LanguageCode.ITALIAN, "/path/to/my/ita-dict.bin", options);
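Both factories receive the dictionary as a file path. The following is a minimal sketch of validating that path before passing it on, assuming you prefer an explicit error for a missing or unreadable file. The class DictionaryPathCheck and the method requireReadableDictionary are hypothetical helpers, not part of the Rosette API, and the factory call appears only as a comment because the library is not available in this standalone example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DictionaryPathCheck {

    /**
     * Hypothetical helper: returns the absolute, normalized path of a compiled
     * dictionary, or throws if the file does not exist or cannot be read.
     */
    static String requireReadableDictionary(String path) {
        Path p = Paths.get(path).toAbsolutePath().normalize();
        if (!Files.isRegularFile(p) || !Files.isReadable(p)) {
            throw new IllegalArgumentException("Dictionary not found or not readable: " + p);
        }
        return p.toString();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a compiled dictionary such as jpn-dict.bin.
        Path tmp = Files.createTempFile("jpn-dict", ".bin");
        try {
            String checked = requireReadableDictionary(tmp.toString());
            System.out.println("Dictionary path is usable: " + checked);
            // With the library on the classpath, the checked path would be passed on:
            // factory.addUserDefinedDictionary(LanguageCode.JAPANESE, checked);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```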
7.3.2. Using Lucene
In the com.basistech.rosette.lucene package, BaseLinguisticsTokenizerFactory and BaseLinguisticsTokenFilterFactory can load segmentation and lemma dictionaries, respectively.
BaseLinguisticsTokenizerFactory provides the method addUserDefinedDictionary for adding a segmentation dictionary. For example:
Map<String, String> args = new HashMap<String, String>();
// The map holds String values, so the language is given as its code ("jpn"), as in the Solr examples.
args.put(TokenizerOption.language.name(), "jpn");
args.put(TokenizerOption.nfkcNormalize.name(), "true");
args.put(TokenizerOption.partOfSpeech.name(), "true");
BaseLinguisticsTokenizerFactory factory = new BaseLinguisticsTokenizerFactory(args);
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");
The constructor for BaseLinguisticsTokenFilterFactory takes a Map of options. Use the userDefinedDictionaryPath option to load a lemma dictionary:
Map<String, String> options = new HashMap<String, String>();
// The map holds String values, so the language is given as its code ("ita").
options.put("language", "ita");
options.put("userDefinedDictionaryPath", "/path/to/my/ita-dict.bin");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
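Because the constructor takes a Map<String, String>, every option value, including the language and boolean flags, has to be supplied as a string. The following sketch builds such a map with the three option names used above. The class FilterOptions and the method lemmaDictionaryOptions are hypothetical, and the language value "ita" assumes that the factory accepts the same three-letter codes shown in the Solr configuration (for example "jpn"):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FilterOptions {

    /**
     * Hypothetical helper: collects the options for a token filter factory
     * that loads a lemma dictionary. All values are converted to strings.
     */
    static Map<String, String> lemmaDictionaryOptions(String language,
                                                      String dictionaryPath,
                                                      boolean caseSensitive) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("language", language);
        options.put("userDefinedDictionaryPath", dictionaryPath);
        options.put("caseSensitive", Boolean.toString(caseSensitive));
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options =
                lemmaDictionaryOptions("ita", "/path/to/my/ita-dict.bin", true);
        System.out.println(options);
        // With the library on the classpath, the map would be passed to the constructor:
        // TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
    }
}
```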
7.3.3. Using Solr
Use the option userDefinedDictionaryPath as shown in this example:
<fieldType class="solr.TextField" name="basis-japanese">
<analyzer>
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
alternativeTokenization="true"
language="jpn"
rootDirectory="${bt_root}"
userDefinedDictionaryPath="/path/to/my/jpn-udd.bin"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
alternativeTokenization="true"
language="jpn"
rootDirectory="${bt_root}"/>
</analyzer>
</fieldType>
Here is an example of using a JLA reading dictionary:
<fieldType class="solr.TextField" name="basis-japanese-rd">
<analyzer>
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
alternativeTokenization="true"
alternativeTokenizationOptions="{Readings: true}"
language="jpn"
rootDirectory="${bt_root}"
userDefinedReadingDictionaryPath="/path/to/my/readings.bin"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
alternativeTokenization="true"
language="jpn"
rootDirectory="${bt_root}"/>
</analyzer>
</fieldType>
7.3.4. Using Elasticsearch
At the analyzer level in a mapping, use either lemDictionaryPath (for a lemma dictionary) or segDictionaryPath (for a segmentation dictionary). Otherwise, at the tokenizer or filter level, use userDefinedDictionaryPath.
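The naming rule can be summarized in a few lines of code. This is only an illustrative sketch of the rule as stated here: the class DictionaryOptionName, its enums, and the method optionName are hypothetical and are not part of the plugin.

```java
public class DictionaryOptionName {

    /** Where in the analysis settings the dictionary is configured. */
    enum Level { ANALYZER, TOKENIZER, FILTER }

    /** Which kind of user dictionary is being configured. */
    enum Kind { LEMMA, SEGMENTATION }

    /**
     * Hypothetical helper: returns the option name to use for a dictionary.
     * At the analyzer level the name depends on the dictionary kind;
     * at the tokenizer or filter level it is always userDefinedDictionaryPath.
     */
    static String optionName(Level level, Kind kind) {
        if (level == Level.ANALYZER) {
            return kind == Kind.LEMMA ? "lemDictionaryPath" : "segDictionaryPath";
        }
        return "userDefinedDictionaryPath";
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            for (Kind kind : Kind.values()) {
                System.out.println(level + " / " + kind + " -> " + optionName(level, kind));
            }
        }
    }
}
```

In the examples in this section, a lemma dictionary configured below the analyzer level is set on the filter, and a segmentation dictionary is set on the tokenizer.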
The following example illustrates how to use a lemmatization dictionary at the analyzer level:
{
"settings":{
"analysis":{
"analyzer":{
"rbl":{
"type":"rbl",
"language":"eng",
"lemDictionaryPath":"/path/to/lemma_dictionary.bin"
}
}
}
}
}
The following example illustrates how to use a lemmatization dictionary at the filter level:
{
"settings":{
"analysis":{
"analyzer":{
"rbl":{
"type":"custom",
"language":"eng",
"tokenizer":"rbl_tok",
"filter":["rbl_filter"]
}
},
"tokenizer":{
"rbl_tok":{
"type":"rbl",
"language":"eng"
}
},
"filter":{
"rbl_filter":{
"type":"rbl",
"language":"eng",
"userDefinedDictionaryPath":"/path/to/lemma_dictionary.bin"
}
}
}
}
}
The following examples illustrate how to use a segmentation dictionary, first at the analyzer level and then at the tokenizer level of a custom analyzer:
{
"settings":{
"analysis":{
"analyzer":{
"rbl":{
"type":"rbl",
"language":"jpn",
"segDictionaryPath":"/path/to/segmentation_dictionary.bin"
}
}
}
}
}
{
"settings":{
"analysis":{
"analyzer":{
"rbl":{
"type":"custom",
"language":"jpn",
"tokenizer":"rbl_tok",
"filter":["rbl_filter"]
}
},
"tokenizer":{
"rbl_tok":{
"type":"rbl",
"language":"jpn",
"userDefinedDictionaryPath":"/path/to/segmentation_dictionary.bin"
}
},
"filter":{
"rbl_filter":{
"type":"rbl",
"language":"jpn"
}
}
}
}
}