5.6. Case Sensitivity During the Analysis

In some languages, case distinctions are meaningful. For example, in German, a word may be a noun if it begins with an upper-case letter, and not a noun if it does not. As a result, RBL-JE delivers higher accuracy in selecting lemmas and splitting compounds when it can process text with correct casing. On the other hand, users typing in queries may be sloppy with capital letters.

For this reason, the default behavior of the Lucene integration is to perform the following analysis steps:

1. tokenize
2. determine lemmas
3. map to lower case

The result is that the index contains the lowercase form of the most accurately selected lemma.

However, some applications work with text in which case distinctions are not reliably present, even in languages where they are important. These applications need to determine lemmas and compound components even though the spelling is nominally incorrect with respect to case.

To support these applications, RBL-JE provides a "case-insensitive" mode of operation. In this mode, RBL-JE performs the following analysis steps:

1. tokenize, ignoring the case of abbreviations and similar items
2. determine lemmas, ignoring case in choosing lemmas and compound components
3. map to lower case

The final mapping to lower case is still required to ensure that the index or query ends up with uniformly lower-case text.
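The difference between the two modes can be illustrated with a minimal sketch. It does not use RBL-JE: the class ToyCasePipeline, its one-entry dictionary, and the whitespace tokenizer are all hypothetical stand-ins, chosen only to show why the lemma lookup happens before lower-casing and what changes when the lookup ignores case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ToyCasePipeline {
    // Hypothetical toy dictionary (not RBL-JE data): the German noun
    // "Kinder" is stored with its correct upper-case initial.
    private static final Map<String, String> LEMMAS = Map.of("Kinder", "Kind");

    public static List<String> analyze(String text, boolean caseSensitive) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {     // 1. tokenize (whitespace only)
            if (token.isEmpty()) {
                continue;
            }
            String lemma = lookup(token, caseSensitive);     // 2. determine lemma
            out.add(lemma.toLowerCase(Locale.ROOT));         // 3. map to lower case
        }
        return out;
    }

    // Case-sensitive mode requires an exact match; case-insensitive mode
    // accepts a dictionary entry that differs from the token only in case.
    private static String lookup(String token, boolean caseSensitive) {
        if (caseSensitive) {
            return LEMMAS.getOrDefault(token, token);
        }
        for (Map.Entry<String, String> entry : LEMMAS.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(token)) {
                return entry.getValue();
            }
        }
        return token; // unknown tokens are kept as their own lemma
    }

    public static void main(String[] args) {
        System.out.println(analyze("Die Kinder spielen", true));  // [die, kind, spielen]
        System.out.println(analyze("die kinder spielen", true));  // [die, kinder, spielen]
        System.out.println(analyze("die kinder spielen", false)); // [die, kind, spielen]
    }
}
```

In this sketch, correctly cased text yields the lemma in either mode, while text with unreliable casing is lemmatized only in case-insensitive mode. In both modes the output is uniformly lower case.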

To specify case sensitivity for the analysis, set com.basistech.rosette.bl.AnalyzerOption.caseSensitive to "true" or "false". By default, the setting is "true", except for Danish, Norwegian, and Swedish: the dictionaries for these languages are lower case, so the setting is "false" irrespective of the user setting.

When you make this setting in the com.basistech.rosette.lucene package, include the "caseSensitive" option as a string. For example:

import java.util.HashMap;
import java.util.Map;

// Option values are strings; the language is assumed to be given as an ISO 639-3 code.
Map<String, String> options = new HashMap<>();
options.put("language", "ita");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
