A.2. Table of Alternative Tokenization Options

These options are available when alternativeTokenization is set to true. Alternatively, these options can be set in a YAML string via alternativeTokenizationOptions. In the YAML string, capitalize the first letter of each option name listed below, for example
"{ DecomposeCompounds: false, MinLengthForScriptChange: 20}". (Although this feature was deprecated in version 7.16.0.c58.2, YAML is still used to specify these options in Solr and Elasticsearch.)
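The capitalization rule can be sketched as follows. This is an illustrative helper, not part of the product API; the function name and the flow-style YAML rendering are assumptions.

```python
def to_yaml_key(option: str) -> str:
    """Capitalize the first letter of an option name,
    e.g. 'decomposeCompounds' -> 'DecomposeCompounds'."""
    return option[0].upper() + option[1:] if option else option

# Hypothetical example: render options as the flow-style YAML string
# expected by alternativeTokenizationOptions.
options = {"decomposeCompounds": False, "minLengthForScriptChange": 20}
yaml_string = "{ " + ", ".join(
    f"{to_yaml_key(k)}: {str(v).lower() if isinstance(v, bool) else v}"
    for k, v in options.items()
) + " }"
print(yaml_string)  # { DecomposeCompounds: false, MinLengthForScriptChange: 20 }
```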

breakAtAlphaNumIntraWordPunct
    Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when consistentLatinSegmentation is true.
    Default: false. Languages: Chinese.

consistentLatinSegmentation
    Indicates whether to provide consistent segmentation of embedded text not in the primary script. If false, the setting of segmentNonJapanese is ignored.
    Default: true. Languages: Chinese, Japanese.

decomposeCompounds
    Indicates whether to decompose compounds.
    Default: true. Languages: Chinese, Japanese.

deepCompoundDecomposition
    Indicates whether to recursively decompose each token into smaller tokens when the token is marked in the dictionary as decomposable. If deep decompounding is enabled, decomposable tokens are further decomposed into additional tokens. Has no effect when decomposeCompounds is false.
    Default: false. Languages: Chinese, Japanese.

deliverExtendedAttributes
    Indicates whether to add ADM extended properties to tokens.
    Default: false. Languages: Japanese.

favorUserDictionary
    Indicates whether to favor words in the user dictionary during segmentation.
    Default: false. Languages: Chinese, Japanese.

generateAll
    Indicates whether to return all the readings for a token. For characters with multiple readings, all the readings are returned in brackets and separated by semicolons. Has no effect when readings is false.
    Default: false. Languages: Chinese.

ignoreSeparators
    Indicates whether to ignore whitespace separators when segmenting input text. If false, whitespace separators are treated as morpheme delimiters. Has no effect when whitespaceTokenization is true.
    Default: true. Languages: Japanese.

ignoreStopwords
    Indicates whether to filter stopwords out of the output.
    Default: false. Languages: Chinese, Japanese.

minLengthForScriptChange
    Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. Has no effect when consistentLatinSegmentation is false.
    Default: 10. Languages: Chinese, Japanese.

pos
    Indicates whether to add parts of speech to morphological analyses.
    Default: true. Languages: Chinese, Japanese.

readingByCharacter
    Indicates whether to skip directly to the fallback behavior of readings without considering readings for whole words. Has no effect when readings is false.
    Default: false. Languages: Chinese, Japanese.

readings
    Indicates whether to add readings to morphological analyses. The annotator tries to add readings by whole words; if it cannot, it concatenates the readings of individual characters.
    Default: false. Languages: Chinese, Japanese.

readingsSeparateSyllables
    Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when readings is false.
    Default: false. Languages: Chinese, Japanese.

readingType
    Sets the representation of Chinese readings. Possible values (case-insensitive) are:
    • cjktex: macros for the CJKTeX pinyin.sty style
    • no_tones: pinyin without tones
    • tone_marks: pinyin with diacritics over the appropriate vowels
    • tone_numbers: pinyin with a number from 1 to 4 suffixed to each syllable, or no number for neutral tone
    Default: tone_marks. Languages: Chinese.

segmentNonJapanese
    Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when consistentLatinSegmentation is true.
    Default: true. Languages: Japanese.

separateNumbersFromCounters
    Indicates whether to return numbers and counters as separate tokens.
    Default: true. Languages: Japanese.

separatePlaceNameFromSuffix
    Indicates whether to segment place names from their suffixes.
    Default: true. Languages: Japanese.

useVForUDiaeresis
    Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when readingType is cjktex or tone_marks, which always use 'v' and 'ü' respectively. It is probably most useful when readingType is tone_numbers. Has no effect when readings is false.
    Default: false. Languages: Chinese.

whiteSpaceIsNumberSep
    Indicates whether to treat whitespace as a number separator. Has no effect when consistentLatinSegmentation is true.
    Default: true. Languages: Chinese.

whitespaceTokenization
    Indicates whether to treat whitespace as a morpheme delimiter.
    Default: false. Languages: Chinese, Japanese.
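The useVForUDiaeresis substitution described above can be illustrated with a small sketch. The function name is hypothetical; it only mimics the documented character substitution and is not the product's implementation.

```python
def substitute_v_for_u_diaeresis(reading: str) -> str:
    """Replace 'ü' with 'v' in a pinyin reading, as useVForUDiaeresis
    would for tone_numbers-style readings (hypothetical illustration)."""
    return reading.replace("ü", "v").replace("Ü", "V")

print(substitute_v_for_u_diaeresis("lü4"))  # -> lv4
print(substitute_v_for_u_diaeresis("nü3"))  # -> nv3
```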



A.3. Editing the Stopwords List

The ignoreStopwords option uses a stopwords list to define stopwords. The path to the stopwords list is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.

You may want to add stopwords to these files. When you edit one of these files, you must follow these rules:

• The file must be encoded in UTF-8.
• The file may include blank lines.
• Comment lines begin with #.
• Each non-blank non-comment line represents exactly one lexeme (stopword).
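A file following these rules can be read with a short sketch like the one below. The function name and parsing logic are illustrative, not the product's loader; it simply applies the four rules above.

```python
def load_stopwords(path: str) -> set[str]:
    """Load a stopwords file: UTF-8 encoded, blank lines allowed,
    '#' starts a comment line, one lexeme per remaining line.
    (Hypothetical helper, not part of the product API.)"""
    stopwords = set()
    with open(path, encoding="utf-8") as f:  # rule: file must be UTF-8
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # rules: skip blank lines and comment lines
            stopwords.add(line)  # rule: one lexeme per line
    return stopwords
```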
