A.2. Table of Alternative Tokenization Options
These options are available when alternativeTokenization is set to true. You can also set them in a YAML string via alternativeTokenizationOptions; to do so, capitalize the first letter of each option name below, for example
"{ DecomposeCompounds: false, MinLengthForScriptChange: 20 }". (Although this feature was deprecated in version 7.16.0.c58.2, YAML is still used to specify these options in Solr and Elasticsearch.)
| Option | Description | Default value | Supported languages |
|---|---|---|---|
| breakAtAlphaNumIntraWordPunct | Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when consistentLatinSegmentation is true. | false | Chinese |
| consistentLatinSegmentation | Indicates whether to provide consistent segmentation of embedded text not in the primary script. If false, the setting of segmentNonJapanese is ignored. | true | Chinese, Japanese |
| decomposeCompounds | Indicates whether to decompose compounds. | true | Chinese, Japanese |
| deepCompoundDecomposition | Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as decomposable. When enabled, decomposable tokens are further decomposed into additional tokens. Has no effect when decomposeCompounds is false. | false | Chinese, Japanese |
| deliverExtendedAttributes | Indicates whether to add ADM extended properties to tokens. | false | Japanese |
| favorUserDictionary | Indicates whether to favor words in the user dictionary during segmentation. | false | Chinese, Japanese |
| generateAll | Indicates whether to return all the readings for a token. For characters with multiple readings, all readings are returned in brackets, separated by semicolons. Has no effect when readings is false. | false | Chinese |
| ignoreSeparators | Indicates whether to ignore whitespace separators when segmenting input text. If false, whitespace separators are treated as morpheme delimiters. Has no effect when whitespaceTokenization is true. | true | Japanese |
| ignoreStopwords | Indicates whether to filter stopwords out of the output (see Section A.3). | false | Chinese, Japanese |
| minLengthForScriptChange | Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so this length may influence how a mixed-script string is tokenized. Has no effect when consistentLatinSegmentation is false. | 10 | Chinese, Japanese |
| pos | Indicates whether to add parts of speech to morphological analyses. | true | Chinese, Japanese |
| readingByCharacter | Indicates whether to skip directly to the fallback behavior of readings, without considering readings for whole words. Has no effect when readings is false. | false | Chinese, Japanese |
| readings | Indicates whether to add readings to morphological analyses. The annotator tries to add readings by whole words; if it cannot, it concatenates the readings of individual characters. | false | Chinese, Japanese |
| readingsSeparateSyllables | Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when readings is false. | false | Chinese, Japanese |
| readingType | Sets the representation of Chinese readings. Possible values (case-insensitive) are: • cjktex: macros for the CJKTeX pinyin.sty style • no_tones: pinyin without tones • tone_marks: pinyin with diacritics over the appropriate vowels • tone_numbers: pinyin with a number from 1 to 4 suffixed to each syllable, or no number for neutral tone | tone_marks | Chinese |
| segmentNonJapanese | Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when consistentLatinSegmentation is true. | true | Japanese |
| separateNumbersFromCounters | Indicates whether to return numbers and counters as separate tokens. | true | Japanese |
| separatePlaceNameFromSuffix | Indicates whether to segment place names from their suffixes. | true | Japanese |
| useVForUDiaeresis | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when readingType is cjktex or tone_marks, which always use 'v' and 'ü', respectively. It is most useful when readingType is tone_numbers. Has no effect when readings is false. | false | Chinese |
| whiteSpaceIsNumberSep | Indicates whether to treat whitespace as a number separator. Has no effect when consistentLatinSegmentation is true. | true | Chinese |
| whitespaceTokenization | Indicates whether to treat whitespace as a morpheme delimiter. | false | Chinese, Japanese |
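Several of these options only take effect in combination (note the recurring "Has no effect when … is false" remarks above). As an illustrative sketch, not a recommended configuration, a readings-oriented options string for Chinese could tie the dependent options together like this:

```yaml
# The options after Readings configure readings output,
# so they matter only when Readings is true.
{ Readings: true,             # add pinyin readings to morphological analyses
  ReadingType: tone_numbers,  # a 1-4 suffix per syllable; none for neutral tone
  UseVForUDiaeresis: true,    # write 'v' for 'ü'; most useful with tone_numbers
  GenerateAll: true }         # return all readings for multi-reading characters
```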
A.3. Editing the Stopwords List
The ignoreStopwords option filters tokens listed in a stopwords file. The file's path is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.
You may want to add stopwords to these files. When you edit one of these files, you must follow these rules:
• The file must be encoded in UTF-8.
• The file may include blank lines.
• Comment lines begin with #.
• Each non-blank, non-comment line represents exactly one lexeme (stopword).
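For example, a few lines appended to zh_stop.utf8 in keeping with these rules might look like the following; the entries are illustrative placeholders, not a curated stopword set:

```
# project-specific stopwords
的
了

# blank lines are permitted
吧
```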