5.6. Case Sensitivity During the Analysis
In some languages, case distinctions are meaningful. For example, in German, a word may be a noun if it begins with an upper-case letter, and not a noun if it does not. As a result, RBL-JE delivers higher accuracy in selecting lemmas and splitting compounds when it can process text with correct casing. On the other hand, users typing in queries may be sloppy with capital letters.
For this reason, the default behavior of the Lucene integration is to perform the following analysis steps:
- tokenize
- determine lemmas
- map to lower case
The result is that the index contains the lowercase form of the most accurately selected lemma.
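The effect of doing the lower-case mapping last can be illustrated with a toy sketch in plain Java. It does not call RBL-JE: the class name DefaultOrderSketch, the two-entry lemma table, and the whitespace tokenizer are hypothetical stand-ins, chosen only to show that the lemma is selected while the original casing is still available.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DefaultOrderSketch {
    // Toy, case-sensitive lemma table (hypothetical data, not an RBL-JE dictionary).
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("Fliegen", "Fliege");   // capitalized: plural noun ("flies")
        LEMMAS.put("fliegen", "fliegen");  // lower case: verb ("to fly")
    }

    public static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        // Step 1: tokenize (a whitespace split stands in for a real tokenizer).
        for (String token : text.trim().split("\\s+")) {
            // Step 2: choose the lemma using the token's original casing.
            String lemma = LEMMAS.getOrDefault(token, token);
            // Step 3: map the chosen lemma to lower case.
            result.add(lemma.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Fliegen fliegen")); // prints [fliege, fliegen]
    }
}
```

If the text were lower-cased before the lookup, both tokens would reach the table as "fliegen" and the noun reading would be lost.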
However, some applications work with text in which case distinctions are not reliably present, even in languages where they are important. These applications need to determine lemmas and compound components even though the spelling is nominally incorrect with respect to case.
To support these applications, RBL-JE provides a 'case-insensitive' mode of operation. In this mode, RBL-JE performs the following analysis steps:
- tokenize, ignoring case when recognizing abbreviations and similar items
- determine lemmas, ignoring case in choosing lemmas and compound components
- map to lower case
The mapping is still required to ensure that the index or query ends up with uniformly lower-case text.
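As a rough illustration of the difference between the two modes, the following toy sketch compares an exact lookup with a lookup that ignores case. It does not use RBL-JE, and how RBL-JE chooses among candidates internally is not shown here. The class CaseInsensitiveSketch, its one-entry table, and the lemma method are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveSketch {
    // Toy lemma table (hypothetical data, not an RBL-JE dictionary).
    private static final Map<String, String> EXACT = new HashMap<>();
    // Same entries, but keys are compared without regard to case.
    private static final Map<String, String> IGNORE_CASE =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    static {
        EXACT.put("Kinder", "Kind"); // plural noun "children" -> "child"
        IGNORE_CASE.putAll(EXACT);
    }

    // Looks up the lemma in the chosen mode, then maps it to lower case.
    // Unknown tokens fall through unchanged (apart from lower-casing).
    public static String lemma(String token, boolean caseSensitive) {
        Map<String, String> table = caseSensitive ? EXACT : IGNORE_CASE;
        String lemma = table.getOrDefault(token, token);
        return lemma.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lemma("KINDER", true));  // prints kinder (no exact match)
        System.out.println(lemma("KINDER", false)); // prints kind
    }
}
```

In both modes the returned value is lower case, which is why the final mapping step remains in the case-insensitive pipeline.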
To specify case sensitivity for the analysis, set com.basistech.rosette.bl.AnalyzerOption.caseSensitive to "true" or "false". The default setting is "true". The exceptions are Danish, Norwegian, and Swedish: our dictionaries for these languages are lower case, so the setting is always "false", irrespective of the user setting.
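The rule for the effective setting can be summarized in a few lines of Java. This is only a sketch of the behavior described above, not RBL-JE code: the class EffectiveCaseSensitivity, the effective method, and the use of ISO 639-3 strings ("dan", "nor", "swe") to identify languages are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EffectiveCaseSensitivity {
    // Languages whose dictionaries are lower case: Danish, Norwegian, Swedish.
    // ISO 639-3 codes are used here as plain strings (an assumption of this sketch).
    private static final Set<String> ALWAYS_INSENSITIVE =
            new HashSet<>(Arrays.asList("dan", "nor", "swe"));

    // requested may be null, meaning the user did not set the option.
    public static boolean effective(String language, Boolean requested) {
        if (ALWAYS_INSENSITIVE.contains(language)) {
            return false; // the user setting is ignored for these languages
        }
        return requested == null ? true : requested; // default is case-sensitive
    }

    public static void main(String[] args) {
        System.out.println(effective("deu", null));  // prints true
        System.out.println(effective("deu", false)); // prints false
        System.out.println(effective("swe", true));  // prints false
    }
}
```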
When you are making this setting in the com.basistech.rosette.lucene package, include the "caseSensitive" option as a string. For example:
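import java.util.HashMap;
import java.util.Map;
// The options map holds strings, so the language is given as its ISO 639-3 code.
Map<String, String> options = new HashMap<String, String>();
options.put("language", "ita");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);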