Appendix G. Customizing the ICU Tokenizer

The ICU tokenizer is the default tokenizer for European languages. Its behavior is defined by a rule file. The default rule files that RBL-JE uses can be obtained by contacting Basis support. If the default behavior is not exactly what you need, RBL-JE lets you supply custom rule files that determine the behavior of the tokenizer. This appendix briefly outlines how to make such customizations. Note, however, that user customizations to the tokenizer behavior are not supported.

BaseLinguisticsFactory and TokenizerFactory both have a method, addCustomTokenizerRules, that specifies a custom rule file. RBLCmd also has the -ctr option for specifying a path on the command line. All of these accept a case sensitivity value. This value matters: a rule file is selected only when BaseLinguisticsOption.caseSensitive matches the value given for that rule file. Custom rule files are not cumulative; that is, only one set of rules may be in use at a time for any one combination of case sensitivity and language.

Note that Basis reserves the right to change the version of ICU used in RBL-JE. Any rule file provided by Basis for a particular version of RBL-JE therefore may or may not work with newer versions.

G.1. Tokenization Rule File Format

A tokenization rule file is a list of regular expressions that are used to find word boundaries. The file must be encoded in UTF-8 for use with RBL-JE. The regular expressions should be constructed according to the Unix Extended Regular Expression standard. Each regex in the main file must be a member of one of four sets of rules, and each of the sets must be defined in the main file. For an explanation of these sets, and of the further specialized syntax that may be used to define rules for the tokenizer, see the ICU documentation (note 2).

RBL-JE can also accept an optional subrule file, which splits tokens produced by the rules in the main file. The subrule file is a list of subrules, each of which is a number and a regex separated by a tab character. The number corresponds to the "rule status" (note 1) of the main rule whose tokens the subrule splits. Each capturing group in the subrule regex corresponds to a token that the tokenizer will produce.
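The subrule line format described above can be sketched in Java. This is an illustration only and not RBL-JE code: the class SubruleSketch, its methods, and the sample rule status 200 are hypothetical, java.util.regex stands in for whatever regex engine the tokenizer actually uses, and the choice to leave a token untouched when the regex does not match the whole token is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RBL-JE: models one line of a subrule file.
class SubruleSketch {
    final int ruleStatus;   // the number before the tab
    final Pattern pattern;  // the regex after the tab

    SubruleSketch(int ruleStatus, Pattern pattern) {
        this.ruleStatus = ruleStatus;
        this.pattern = pattern;
    }

    // Parses one subrule line of the form "<rule status><TAB><regex>".
    static SubruleSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected <number><TAB><regex>: " + line);
        }
        int status = Integer.parseInt(line.substring(0, tab).trim());
        return new SubruleSketch(status, Pattern.compile(line.substring(tab + 1)));
    }

    // Emits one sub-token per non-empty capturing group.
    // Assumption: a token the regex does not fully match is returned unchanged.
    List<String> split(String token) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(token);
        if (!m.matches()) {
            out.add(token);
            return out;
        }
        for (int g = 1; g <= m.groupCount(); g++) {
            String part = m.group(g);
            if (part != null && !part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative subrule: split a trailing 's from tokens of a rule with status 200.
        SubruleSketch rule = parse("200\t(\\p{L}+)('s)");
        System.out.println(rule.ruleStatus + " -> " + rule.split("dog's"));
    }
}
```

Here the two capturing groups turn the single token dog's into the two tokens dog and 's, while a token such as cat, which the regex does not match, passes through as is.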

1. Essentially an ID number; see the ICU documentation linked in note 2.
2. http://userguide.icu-project.org/boundaryanalysis#TOC-ICU-BreakIterator-Data-Files

G.2. Example

The ICU tokenizer does not normally treat emoticons as single tokens, but that may be important for your use case. To keep simple emoticons together, you could add the following to the default rule file.

...
$Smiley = [\:=][)}\]];
!!forward;
$Smiley;
...

For the input:

=)

the Basis default rules produce two tokens:

=
)

With the added rule, you would get back one token instead:

=)
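To see which emoticons the two character classes in $Smiley cover, the same classes can be tried out with java.util.regex. This is only a rough stand-in, not the ICU rule engine: the class SmileySketch and its method are hypothetical, and a regex match on an isolated string says nothing about how the rule interacts with the other rules in the file.

```java
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the ICU rule engine here.
class SmileySketch {
    // Same two character classes as $Smiley: "eyes" (: or =) followed by a "mouth" (), } or ]).
    static final Pattern SMILEY = Pattern.compile("[:=][)}\\]]");

    // True when the whole string is one of the emoticons the classes describe.
    static boolean isSmiley(String s) {
        return SMILEY.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"=)", ":]", ":}", ":(", "=D"}) {
            System.out.println(s + " -> " + isSmiley(s));
        }
    }
}
```

The classes accept =), :] and :}, but not :( or =D, so covering frowning or letter-mouthed emoticons would mean extending the second character class.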
