Appendix A. API Options

Options are available for changing default behavior. Most are described in the Table of Options below. For convenience, options that are only valid when alternativeTokenization is true are listed separately in the Table of Alternative Tokenization Options. These tables also specify which of the following enum classes contain the option, if applicable.

| Enum class | Accepting factories |
| --- | --- |
| AnalyzerOption | AnalyzerFactory, BaseLinguisticsAnalyzer, BaseLinguisticsTokenFilterFactory |
| BaseLinguisticsOption | BaseLinguisticsFactory |
| CSCAnalyzerOption | BaseLinguisticsCSCTokenFilterFactory, CSCAnalyzerFactory |
| FilterOption | BaseLinguisticsAnalyzer, BaseLinguisticsTokenFilterFactory |
| TokenizerOption | BaseLinguisticsAnalyzer, BaseLinguisticsSegmentationTokenFilterFactory, BaseLinguisticsTokenizerFactory, TokenizerFactory |
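
Options from these enum classes are set on the corresponding factory before it creates an analyzer, tokenizer, or token filter. The sketch below illustrates the general pattern for BaseLinguisticsFactory; the package names, the no-argument constructor, the setOption(BaseLinguisticsOption, String) signature, and the root path and language values are assumptions for illustration rather than a verbatim excerpt, so consult the RBL-JE Javadoc for the exact API.

```java
// Minimal sketch of enum-based option configuration (assumed package names and API).
import com.basistech.rosette.bl.BaseLinguisticsFactory; // assumed location
import com.basistech.rosette.bl.BaseLinguisticsOption;  // assumed location

public class FactoryOptionsSketch {
    public static void main(String[] args) {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        // rootDirectory also supplies defaults for dictionaryDirectory,
        // licensePath, and modelDirectory (see the rootDirectory row below).
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/path/to/rbl-root"); // hypothetical path
        factory.setOption(BaseLinguisticsOption.language, "eng");
        // Boolean-valued options are supplied as strings.
        factory.setOption(BaseLinguisticsOption.query, "true");
    }
}
```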

A.1. Table of Options

This table displays all of the API options and their containing enum class(es).

| Option | Enum class(es) | Description | Type (default value) | Supported languages |
| --- | --- | --- | --- | --- |
| addLemmaTokens | FilterOption | Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
| addReadings | FilterOption | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
| alternativeJapaneseTokenization | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Deprecated synonym of alternativeTokenization. | Boolean (false) | Chinese, Japanese |
| alternativeTokenization | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Directs the use of the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA) tokenization algorithm. Disables post-tokenization analysis for use with the alternative tokenizer, i.e. an Analyzer created with this option will leave its input tokens unchanged. | Boolean (false) | Chinese, Japanese |
| alternativeJapaneseTokenizationOptions | BaseLinguisticsOption, TokenizerOption | Deprecated synonym of alternativeTokenizationOptions. | String (none) | Chinese, Japanese |
| alternativeTokenizationOptions | BaseLinguisticsOption, TokenizerOption | Deprecated. Supplies additional options to the alternative tokenization algorithm as a string of YAML. | String (none) | Chinese, Japanese |
| analysisCacheSize | BaseLinguisticsOption | Maximum number of entries in the analysis cache. Larger values increase throughput, but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| analyze | BaseLinguisticsOption | Indicates whether to do analysis. If false, the annotator will only do tokenization. | Boolean (true) | All |
| atMentions | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect @mentions. | Boolean (false) | All |
| cacheSize | AnalyzerOption, CSCAnalyzerOption | Maximum number of entries in the analysis cache. Larger values increase throughput, but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| caseSensitive | AnalyzerOption | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| caseSensitive | BaseLinguisticsOption, TokenizerOption | Indicates whether analyzers and tokenizers use case in determining parts of speech, lemmas, and tokens. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| conversionLevel | BaseLinguisticsOption, CSCAnalyzerOption | Indicates the highest conversion level to use. | lexemic (default, highest); orthographic (middle); codepoint (lowest) | Chinese |
| customPosTagsUri | BaseLinguisticsOption | URI of a POS tag map file for use by universalPosTags. | URI (none) | All |
| customTokenizeContractionRulesUri | BaseLinguisticsOption | URI of a contraction rule file for use by tokenizeContractions. | URI (none) | All |
| defaultTokenizationLanguage | BaseLinguisticsOption, TokenizerOption | Specifies the language to use for script regions other than the script of the overall language. | Language code (xxx: language unknown) | Chinese, Japanese, Thai |
| deliverExtendedTags | AnalyzerOption, BaseLinguisticsOption | Indicates whether the analyzers produced should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
| dictionaryDirectory | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption | The path of the lemma and compound dictionary, if it exists. Not needed if rootDirectory is set. | Path ({rootDir}/dicts) | All |
| disambiguate | AnalyzerOption, BaseLinguisticsOption | Indicates whether the analyzers should disambiguate the results. When false, all possible analyses are returned. When true, the disambiguator determines the best analysis for each word given the context in which it appears. The disambiguated result is returned either directly or, when ADM results are returned, at the head of the list of all possible analyses. | Boolean (true) | See Feature Set, Disambiguation column |
| emailAddresses | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect email addresses. | Boolean (false) | All |
| emoticons | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect emoticons. | Boolean (false) | All |
| fragmentBoundaryDetection | BaseLinguisticsOption, TokenizerOption | Turns on fragment boundary detection. | Boolean (false) | All |
| fstTokenize | BaseLinguisticsOption, TokenizerOption | Turns on FST tokenization. | Boolean (false) | Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish |
| hashtags | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect hashtags. | Boolean (false) | All |
| identifyContractionComponents | FilterOption | Indicates whether the token filter should identify contraction components as such, rather than as lemmas. | Boolean (false) | All |
| includeHebrewRoots | BaseLinguisticsOption | Indicates whether to generate Semitic root forms. | Boolean (false) | Hebrew |
| includeRoots | TokenizerOption | Indicates whether to generate Semitic root forms. | Boolean (false) | Hebrew |
| koreanDecompounding | AnalyzerOption, BaseLinguisticsOption | Indicates whether to use experimental Korean decompounding. | Boolean (false) | Korean |
| language | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The language to be processed by analyzers or tokenizers created by the factory. | Language code (none) | All |
| lemDictionaryPath | None. This is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined lemma dictionaries, separated by semicolons or the OS-specific path separator. | List of paths (none) | See Feature Set, Lemma User Dictionary column |
| licensePath | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The path of the RBL-JE license file. | Path ({rootDir}/licenses/rlp-license.xml) | All |
| licenseString | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | The XML license content; overrides licensePath. | String (none) | All |
| minNonPrimaryScriptRegionLength | BaseLinguisticsOption, TokenizerOption | Minimum length of sequential characters that are not in the primary script. If a non-primary script region is less than this length, and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai |
| modelDirectory | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | The directory containing model files and data. | Path ({rootDir}/models) | Chinese, Hebrew, Japanese, Thai |
| normalizationDictionaryPaths | AnalyzerOption, BaseLinguisticsOption | A list of paths to user-defined many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths (none) | All |
| nfkcNormalize | BaseLinguisticsOption, TokenizerOption | Turns on Unicode NFKC normalization before tokenization. This normalization includes converting fullwidth numerals to halfwidth numerals, fullwidth Latin characters to halfwidth Latin characters, and halfwidth katakana to fullwidth katakana. | Boolean (false) | Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Korean, Norwegian, Persian, Polish, Portuguese, Pushto, Russian, Spanish, Swedish, Thai, Turkish, Urdu |
| query | AnalyzerOption, BaseLinguisticsOption, TokenizerOption | Indicates that the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g. disable disambiguation). | Boolean (false) | All |
| replaceTokensWithLemmas | FilterOption | Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled as well. | Boolean (false) | All |
| rootDirectory | AnalyzerOption, BaseLinguisticsOption, CSCAnalyzerOption, TokenizerOption | Sets the root directory. Also sets default values for other required options (dictionaryDirectory, licensePath, licenseString, and modelDirectory) relative to {rootDir}. | Path (none) | All |
| segDictionaryPath | None. This is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined segmentation dictionaries, separated by semicolons or the OS-specific path separator. If the language is Chinese or Japanese and alternativeTokenization is true, then sets dictionaries to CLA/JLA user dictionaries. | List of paths (none) | See Feature Set, Token User Dictionary column |
| targetLanguage | BaseLinguisticsOption, CSCAnalyzerOption | The language to which CSCAnalyzer is converting. | Language code (none) | All |
| tokenizeContractions | BaseLinguisticsOption | Indicates whether to deliver contractions as multiple tokens. If false, they are delivered as one token. | Boolean (false) | All |
| tokenizeForScript | BaseLinguisticsOption, TokenizerOption | Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai |
| universalPosTags | BaseLinguisticsOption | Indicates whether the language-specific annotators produced should convert POS tags to the universal versions. If true, the universal tags are returned. If false, the traditional tags are returned. | Boolean (false) | All |
| urls | BaseLinguisticsOption, TokenizerOption | Indicates whether to detect URLs. | Boolean (false) | All |
| userDefinedDictionaryPath | None. This is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined dictionaries, separated by semicolons or the OS-specific path separator. | String (none) | All |
| userDefinedReadingDictionaryPath | None. This is only used in the Lucene connector API and is passed in via an options Map. | A list of paths to user-defined reading dictionaries, separated by semicolons or the OS-specific path separator. Currently only supported for Japanese when alternativeTokenization is true. | String (none) | Japanese |
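
The dictionary-path entries above (lemDictionaryPath, segDictionaryPath, userDefinedDictionaryPath, and userDefinedReadingDictionaryPath) belong to no enum class; as the table notes, they are passed to the Lucene connector through an options Map. The sketch below only assembles such a map; the dictionary paths and class name are hypothetical, and the connector call that consumes the map is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: keys are the Lucene connector option names from the table
// above; the dictionary paths are hypothetical placeholders.
public class LuceneConnectorOptionsSketch {
    public static Map<String, String> buildOptions() {
        Map<String, String> options = new HashMap<>();
        // Multiple paths are joined with semicolons or the OS-specific path separator.
        options.put("lemDictionaryPath",
                "/data/dicts/lem-custom-1.bin;/data/dicts/lem-custom-2.bin");
        options.put("segDictionaryPath", "/data/dicts/seg-custom.bin");
        // Only supported for Japanese when alternativeTokenization is true.
        options.put("userDefinedReadingDictionaryPath", "/data/dicts/reading-custom.bin");
        return options;
    }
}
```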
