A.2. Table of Alternative Tokenization Options
These options are available when alternativeTokenization is set to true. You can also set them in a YAML string via alternativeTokenizationOptions; to do so, capitalize the first letter of each option name below, for example
"{ DecomposeCompounds: false, MinLengthForScriptChange: 20 }". (Although this feature was deprecated in version 7.16.0.c58.2, YAML is still used to specify these options in Solr and Elasticsearch.)
| Option | Description | Default value | Supported languages |
|---|---|---|---|
| breakAtAlphaNumIntraWordPunct | Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when consistentLatinSegmentation is true. | false | Chinese |
| consistentLatinSegmentation | Indicates whether to provide consistent segmentation of embedded text not in the primary script. If false, the setting of segmentNonJapanese is ignored. | true | Chinese, Japanese |
| decomposeCompounds | Indicates whether to decompose compounds. | true | Chinese, Japanese |
| deepCompoundDecomposition | Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as decomposable. When enabled, decomposable tokens are further decomposed into additional tokens. Has no effect when decomposeCompounds is false. | false | Chinese, Japanese |
| deliverExtendedAttributes | Indicates whether to add ADM extended properties to tokens. | false | Japanese |
| favorUserDictionary | Indicates whether to favor words in the user dictionary during segmentation. | false | Chinese, Japanese |
| generateAll | Indicates whether to return all the readings for a token. For characters with multiple readings, all readings are returned in brackets, separated by semicolons. Has no effect when readings is false. | false | Chinese |
| ignoreSeparators | Indicates whether to ignore whitespace separators when segmenting input text. If false, whitespace separators are treated as morpheme delimiters. Has no effect when whitespaceTokenization is true. | true | Japanese |
| ignoreStopwords | Indicates whether to filter stopwords out of the output (see Section A.3). | false | Chinese, Japanese |
| minLengthForScriptChange | Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so this length may influence how a mixed-script string is tokenized. Has no effect when consistentLatinSegmentation is false. | 10 | Chinese, Japanese |
| pos | Indicates whether to add parts of speech to morphological analyses. | true | Chinese, Japanese |
| readingByCharacter | Indicates whether to skip directly to the fallback behavior of readings, without considering readings for whole words. Has no effect when readings is false. | false | Chinese, Japanese |
| readings | Indicates whether to add readings to morphological analyses. The annotator tries to add readings by whole words; if it cannot, it concatenates the readings of individual characters. | false | Chinese, Japanese |
| readingsSeparateSyllables | Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when readings is false. | false | Chinese, Japanese |
| readingType | Sets the representation of Chinese readings. Possible values (case-insensitive) are: • cjktex: macros for the CJKTeX pinyin.sty style • no_tones: pinyin without tones • tone_marks: pinyin with diacritics over the appropriate vowels • tone_numbers: pinyin with a number from 1 to 4 suffixed to each syllable, or no number for neutral tone | tone_marks | Chinese |
| segmentNonJapanese | Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when consistentLatinSegmentation is true. | true | Japanese |
| separateNumbersFromCounters | Indicates whether to return numbers and counters as separate tokens. | true | Japanese |
| separatePlaceNameFromSuffix | Indicates whether to segment place names from their suffixes. | true | Japanese |
| useVForUDiaeresis | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when readingType is cjktex or tone_marks, which always use 'v' and 'ü', respectively. It is most useful when readingType is tone_numbers. Has no effect when readings is false. | false | Chinese |
| whiteSpaceIsNumberSep | Indicates whether to treat whitespace as a number separator. Has no effect when consistentLatinSegmentation is true. | true | Chinese |
| whitespaceTokenization | Indicates whether to treat whitespace as a morpheme delimiter. | false | Chinese, Japanese |
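Several of these options only take effect in combination (note the recurring "Has no effect when … is false" remarks above). As an illustrative sketch, not a recommended configuration, a readings-oriented options string for Chinese could tie the dependent options together like this:

```yaml
# The options after Readings configure readings output,
# so they matter only when Readings is true.
{ Readings: true,             # add pinyin readings to morphological analyses
  ReadingType: tone_numbers,  # a 1-4 suffix per syllable; none for neutral tone
  UseVForUDiaeresis: true,    # write 'v' for 'ü'; most useful with tone_numbers
  GenerateAll: true }         # return all readings for multi-reading characters
```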
A.3. Editing the Stopwords List
The ignoreStopwords option filters tokens listed in a stopwords file. The file's path is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.
You may want to add stopwords to these files. When you edit one of these files, you must follow these rules:
• The file must be encoded in UTF-8.
• The file may include blank lines.
• Comment lines begin with #.
• Each non-blank, non-comment line represents exactly one lexeme (stopword).
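For example, a few lines appended to zh_stop.utf8 in keeping with these rules might look like the following; the entries are illustrative placeholders, not a curated stopword set:

```
# project-specific stopwords
的
了

# blank lines are permitted
吧
```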