Appendix A. API Options
Options are available for changing default behavior. Most are described in the Table of Options [86] below. For convenience, options that are only valid when alternativeTokenization is true are listed separately in the Table of Alternative Tokenization Options [96]. These tables also specify which of the following enum classes contain the option, if applicable.
| Enum class | Accepting factories |
|---|---|
| AnalyzerOption | AnalyzerFactory, BaseLinguisticsAnalyzer, BaseLinguisticsTokenFilterFactory |
| BaseLinguisticsOption | BaseLinguisticsFactory |
| CSCAnalyzerOption | BaseLinguisticsCSCTokenFilterFactory, CSCAnalyzerFactory |
| FilterOption | BaseLinguisticsAnalyzer, BaseLinguisticsTokenFilterFactory |
| TokenizerOption | BaseLinguisticsAnalyzer, BaseLinguisticsSegmentationTokenFilterFactory, BaseLinguisticsTokenizerFactory, TokenizerFactory |
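Each factory accepts only options drawn from the enum classes listed for it. As a self-contained sketch of this pattern (the stand-in enum and `render` method below are illustrative only, not part of the RBL API), an `EnumMap` keyed by an option enum lets the compiler enforce which options a given factory can receive:

```java
import java.util.EnumMap;
import java.util.Map;

public class OptionEnumSketch {

    // Stand-in for an option enum such as AnalyzerOption; the real
    // enums ship with the library and are not reproduced here.
    enum AnalyzerOption { caseSensitive, deliverExtendedTags }

    // A method typed on Map<AnalyzerOption, String> can only ever
    // receive options from that enum class, mirroring the
    // "accepting factories" restriction in the table above.
    static String render(Map<AnalyzerOption, String> options) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<AnalyzerOption, String> e : options.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<AnalyzerOption, String> options = new EnumMap<>(AnalyzerOption.class);
        options.put(AnalyzerOption.caseSensitive, "false");
        options.put(AnalyzerOption.deliverExtendedTags, "true");
        System.out.println(render(options));
    }
}
```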
A.1. Table of Options
This table displays all of the API options and their containing enum class(es).
| Option | Enum class(es) | Description | Type (default value) | Supported languages |
|---|---|---|---|---|
| addLemmaTokens | FilterOption | Indicates whether the token filter should add the lemmas (or, if there are none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
| addReadings | FilterOption | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
| alternativeJapaneseTokenization | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Deprecated synonym of alternativeTokenization. | Boolean (false) | Chinese, Japanese |
| alternativeTokenization | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Directs the use of the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA) tokenization algorithm. Disables post-tokenization analysis for use with the alternative tokenizer; i.e. an Analyzer created with this option leaves its input tokens unchanged. | Boolean (false) | Chinese, Japanese |
| alternativeJapaneseTokenizationOptions | BaseLinguisticsOption, TokenizerOption | Deprecated synonym of alternativeTokenizationOptions. | String (none) | Chinese, Japanese |
| alternativeTokenizationOptions [96] | BaseLinguisticsOption, TokenizerOption | Deprecated. Supplies additional options to the alternative tokenization algorithm as a string of YAML. | String (none) | Chinese, Japanese |
| analysisCacheSize | BaseLinguisticsOption | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| analyze | BaseLinguisticsOption | Indicates whether to perform analysis. If false, the annotator only performs tokenization. | Boolean (true) | All |
| atMentions | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect @mentions. | Boolean (false) | All |
| cacheSize | AnalyzerOption, CSCAnalyzerOption | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| caseSensitive | AnalyzerOption | Indicates whether analyzers produced by the factory are case-sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| caseSensitive | BaseLinguisticsOption, TokenizerOption | Indicates whether analyzers and tokenizers use case in determining parts of speech, lemmas, and tokens. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| conversionLevel [72] | BaseLinguisticsOption, CSCAnalyzerOption | Indicates the highest conversion level to use. | lexemic (default, highest); orthographic (middle); codepoint (lowest) | Chinese |
| customPosTagsUri | BaseLinguisticsOption | URI of a POS tag map [162] file for use by universalPosTags. | URI (none) | All |
| customTokenizeContractionRulesUri | BaseLinguisticsOption | URI of a contraction rule [158] file for use by tokenizeContractions. | URI (none) | All |
| defaultTokenizationLanguage | BaseLinguisticsOption, TokenizerOption | Specifies the language to use for script regions in a script other than that of the overall language. | Language code (xxx: language unknown) | Chinese, Japanese, Thai |
| deliverExtendedTags | AnalyzerOption, BaseLinguisticsOption | Indicates whether the analyzers produced should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
| dictionaryDirectory | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption | The path of the lemma and compound dictionary, if it exists. Not needed if rootDirectory is set. | Path ({rootDir}/dicts) | All |
| disambiguate | AnalyzerOption, BaseLinguisticsOption | Indicates whether the analyzers should disambiguate the results. When false, all possible analyses are returned. When true, the disambiguator determines the best analysis for each word given the context in which it appears. The disambiguated result is returned either directly or, when ADM results are returned, at the head of the list of all possible analyses. | Boolean (true) | See Feature Set [12], Disambiguation column |
| emailAddresses | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect email addresses. | Boolean (false) | All |
| emoticons | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect emoticons. | Boolean (false) | All |
| fragmentBoundaryDetection | BaseLinguisticsOption, TokenizerOption | Turns on fragment boundary detection. | Boolean (false) | All |
| fstTokenize | BaseLinguisticsOption, TokenizerOption | Turns on FST tokenization. | Boolean (false) | Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish |
| hashtags | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect hashtags. | Boolean (false) | All |
| identifyContractionComponents | FilterOption | Indicates whether the token filter should identify contraction components rather than lemmas. | Boolean (false) | All |
| includeHebrewRoots | BaseLinguisticsOption | Indicates whether to generate Semitic root forms. | Boolean (false) | Hebrew |
| includeRoots | TokenizerOption | Indicates whether to generate Semitic root forms. | Boolean (false) | Hebrew |
| koreanDecompounding | AnalyzerOption, BaseLinguisticsOption | Indicates whether to use experimental Korean decompounding. | Boolean (false) | Korean |
| language | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The language to be processed by analyzers or tokenizers created by the factory. | Language code (none) | All |
| lemDictionaryPath | None. This option is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined lemma dictionaries, separated by semicolons or the OS-specific path separator. | List of paths (none) | See Feature Set [12], Lemma User Dictionary column |
| licensePath | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The path of the RBL-JE license file. | Path ({rootDir}/licenses/rlp-license.xml) | All |
| licenseString | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The XML license content; overrides licensePath. | String (none) | All |
| minNonPrimaryScriptRegionLength | BaseLinguisticsOption, TokenizerOption | Minimum length of a run of sequential characters that are not in the primary script. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai |
| modelDirectory | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | The directory containing model files and data. | Path ({rootDir}/models) | Chinese, Hebrew, Japanese, Thai |
| normalizationDictionaryPaths | AnalyzerOption, BaseLinguisticsOption | A list of paths to user-defined many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths (none) | All |
| nfkcNormalize | BaseLinguisticsOption, TokenizerOption | Turns on Unicode NFKC normalization before tokenization. This normalization includes converting fullwidth numerals to halfwidth numerals, fullwidth Latin letters to halfwidth Latin letters, and halfwidth katakana to fullwidth katakana. | Boolean (false) | Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Korean, Norwegian, Persian, Polish, Portuguese, Pushto, Russian, Spanish, Swedish, Thai, Turkish, Urdu |
| query | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Indicates that the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g. disable disambiguation). | Boolean (false) | All |
| replaceTokensWithLemmas | FilterOption | Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled as well. | Boolean (false) | All |
| rootDirectory | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | Sets the root directory ({rootDir}). Also sets default values for other required options (dictionaryDirectory, licensePath, and modelDirectory). | Path (none) | All |
| segDictionaryPath | None. This option is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined segmentation dictionaries, separated by semicolons or the OS-specific path separator. If the language is Chinese or Japanese and alternativeTokenization is true, this sets the CLA/JLA user dictionaries. | List of paths (none) | See Feature Set [12], Token User Dictionary column |
| targetLanguage | BaseLinguisticsOption, CSCAnalyzerOption | The language to which the CSCAnalyzer is converting. | Language code (none) | All |
| tokenizeContractions | BaseLinguisticsOption | Indicates whether to deliver contractions as multiple tokens. If false, they are delivered as one token. | Boolean (false) | All |
| tokenizeForScript | BaseLinguisticsOption, TokenizerOption | Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for the other scripts. | Boolean (false) | Chinese, Japanese, Thai |
| universalPosTags | BaseLinguisticsOption | Indicates whether the language-specific annotators produced should convert POS tags to the universal versions. If true, the universal tags are returned. If false, the traditional tags are returned. | Boolean (false) | All |
| urls | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect URLs. | Boolean (false) | All |
| userDefinedDictionaryPath | None. This option is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined dictionaries, separated by semicolons or the OS-specific path separator. | String (none) | All |
| userDefinedReadingDictionaryPath | None. This option is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined reading dictionaries, separated by semicolons or the OS-specific path separator. Currently only supported for Japanese when alternativeTokenization is true. | String (none) | Japanese |
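Several of the options above (lemDictionaryPath, segDictionaryPath, userDefinedDictionaryPath, userDefinedReadingDictionaryPath) are not members of any enum class; they are passed to the Lucene connector via an options Map, with multiple paths separated by semicolons or the OS-specific path separator. The sketch below shows one way such a Map might be assembled; the class name, helper method, and dictionary paths are illustrative assumptions, not part of the RBL API:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class OptionsMapSketch {

    // Join several dictionary paths with the OS-specific path
    // separator, as the option descriptions above allow.
    static String joinPaths(String... paths) {
        return String.join(File.pathSeparator, paths);
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        // Hypothetical dictionary paths, for illustration only.
        options.put("lemDictionaryPath",
            joinPaths("/data/lemmas-legal.bin", "/data/lemmas-medical.bin"));
        options.put("language", "eng");
        System.out.println(options.get("lemDictionaryPath"));
    }
}
```

Using `File.pathSeparator` rather than a hard-coded semicolon keeps the joined value valid on any OS, since the separator differs between platforms (`;` on Windows, `:` on Unix-like systems).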