Appendix F. Contraction Splitting Rules
The tokenizeContractions option splits contractions into multiple tokens. Contractions are defined by contraction rule files. By default, the tokenizer uses the rules in rootDirectory/contractions/contraction-rules-language.yaml, where language is an ISO 639-3 language code [109]. Use customTokenizeContractionRulesUri to specify custom rules.
F.1. Contraction Splitting Rule File Format
A contraction rule file is a YAML file encoded in UTF-8. It must be a sequence of contraction rules.
A contraction rule is a sequence of two elements: a contraction key and a contraction replacement. Any token which matches the key is replaced with the replacement. Rules are checked in the order they appear in the rules file. A token which matches a rule is not checked against any further rules. A token which matches no rule is not rewritten.
A contraction key is a sequence of a surface form and a POS tag. A token matches a key if and only if its surface form and POS tag match the key's surface form and POS tag.
A surface form is a string. Surface forms are compared case-insensitively.
A POS tag is a string. POS tags are compared case-sensitively.
A contraction replacement is a sequence of replacement tokens.
A replacement token is a sequence of four elements: a replacement surface form, a POS tag, a lemma, and a raw analysis. The first three are strings. The raw analysis is either a string or null.
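The structure and matching rules above can be sketched in Python. This is only an illustration, not the tokenizer's implementation: it assumes the rule file has already been parsed into nested lists (with YAML null mapped to None), the names RULES, validate_rules, and find_replacement are hypothetical, and str.casefold() is an assumed stand-in for the case-insensitive comparison, whose exact definition is not given here.

```python
from typing import List, Optional

# Hypothetical in-memory form of a parsed rule file:
# a list of [key, replacement] pairs, in file order.
RULES = [
    [["ain't", "VBPRES"],
     [["am", "VBPRES", "be", None], ["not", "NOT", "not", None]]],
    [["amn't", "ADJ"],
     [["am", "VBPRES", "be", None], ["not", "NOT", "not", None]]],
]


def validate_rules(rules) -> None:
    """Raise ValueError if rules does not have the documented shape."""
    if not isinstance(rules, list):
        raise ValueError("rule file must be a sequence of rules")
    for i, rule in enumerate(rules):
        if not (isinstance(rule, list) and len(rule) == 2):
            raise ValueError(f"rule {i}: expected [key, replacement]")
        key, replacement = rule
        # Key: [surface form, POS tag], both strings.
        if not (isinstance(key, list) and len(key) == 2
                and all(isinstance(s, str) for s in key)):
            raise ValueError(f"rule {i}: key must be [surface, pos]")
        if not isinstance(replacement, list):
            raise ValueError(f"rule {i}: replacement must be a sequence")
        for tok in replacement:
            # Replacement token: [surface, pos, lemma, raw analysis].
            if not (isinstance(tok, list) and len(tok) == 4):
                raise ValueError(f"rule {i}: token must have 4 elements")
            surface, pos, lemma, raw = tok
            if not all(isinstance(s, str) for s in (surface, pos, lemma)):
                raise ValueError(f"rule {i}: surface, pos, lemma must be strings")
            if raw is not None and not isinstance(raw, str):
                raise ValueError(f"rule {i}: raw analysis must be string or null")


def find_replacement(rules, surface: str, pos: str) -> Optional[List[list]]:
    """Return the replacement of the first matching rule, or None."""
    for (key_surface, key_pos), replacement in rules:
        # Surface forms: case-insensitive (assumed casefold). POS tags: exact.
        if key_surface.casefold() == surface.casefold() and key_pos == pos:
            return replacement  # first match wins; later rules are skipped
    return None  # no rule matched: the token is left as-is
```

For example, find_replacement(RULES, "AIN'T", "VBPRES") returns the two replacement tokens for am and not, while find_replacement(RULES, "ain't", "vbpres") returns None because POS tags are compared case-sensitively.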
F.2. Example
-
  - [ "ain't", "VBPRES" ]
  -
    - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
-
  - [ "amn't", "ADJ" ]
  -
    - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
-
  - [ "amn't", "NOUN" ]
  -
    - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
The first rule matches ain't with POS tag VBPRES and splits it into am and not. The second rule matches amn't tagged ADJ, and the third matches amn't tagged NOUN. Both produce the same replacement.
The replacement surface form uses the same capitalization format as the original surface form. Using the first entry of the above example, ain't becomes am not, Ain't becomes Am not, and AIN'T becomes AM NOT.
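The capitalization behavior described above might be sketched as follows. This is a hypothetical helper (transfer_case is not a name from the product) that covers only the three documented cases: lowercase, initial capital, and all capitals. Leaving the replacement unchanged for any other pattern, and treating a single capital letter as an initial capital, are assumptions.

```python
from typing import List


def transfer_case(original: str, replacements: List[str]) -> List[str]:
    """Apply the capitalization pattern of original to the replacement forms."""
    letters = [c for c in original if c.isalpha()]
    # All capitals (more than one letter): uppercase every replacement token.
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return [r.upper() for r in replacements]
    # Initial capital: capitalize only the first replacement token.
    if original[:1].isupper():
        return [r[:1].upper() + r[1:] if i == 0 else r
                for i, r in enumerate(replacements)]
    # Lowercase or any other pattern (assumption): leave the rule's forms as-is.
    return list(replacements)
```

With the first rule of the example, transfer_case("Ain't", ["am", "not"]) yields ["Am", "not"] and transfer_case("AIN'T", ["am", "not"]) yields ["AM", "NOT"].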