1.3. Feature Set
The following tables indicate the type of support that RBL-JE provides for each supported language. The RBL-JE Tokenizer provides normalization, tokenization, and sentence boundary detection. The RBL-JE Analyzer provides lemma lookup (including orthographic normalization for Japanese), lemma guessing (when the lookup fails), and decompounding, and it supports lemma, segmentation, and many-to-one (n:1) normalization user dictionaries. A hedged usage sketch illustrating the two components follows the footnotes below.
| Feature | Arabic | Chinese^h | Czech | Danish | Dutch | English | Finnish | French | German |
|---|---|---|---|---|---|---|---|---|---|
| Tokenization | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sentence Boundary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Token Normalization^a | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Lemma Lookup^b | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Part-of-Speech Tagging^c | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Disambiguation^d | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Lemma User Dictionary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Segmentation User Dictionary | | ✓ | | | | | | | |
| Decompounding | | | | | ✓ | | | | ✓ |
| Readings | | | | | | | | | |
| Script Conversion | | ✓^i | | | | | | | |
| Stem^e | ✓ | | | | | | ✓ | | |
| Semitic Root^f | ✓ | | | | | | | | |
| n:1 Normalization User Dictionary^g | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Feature | Greek | Hebrew^j | Hungarian | Italian | Japanese | Korean | Norwegian^m | Persian^n | Polish |
|---|---|---|---|---|---|---|---|---|---|
| Tokenization | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sentence Boundary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Token Normalization^a | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Lemma Lookup^b | ✓ | ✓ | ✓ | ✓ | ✓^k | ✓ | ✓ | ✓ | |
| Part-of-Speech Tagging^c | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Disambiguation^d | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Lemma User Dictionary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Segmentation User Dictionary | | | | | ✓ | ✓ | | | |
| Decompounding | | | ✓ | | | ✓ | ✓ | | |
| Readings | | | | | ✓^l | | | | |
| Script Conversion | | | | | | | | | |
| Stem^e | | | | | | | | | |
| Semitic Root^f | | ✓ | | | | | | | |
| n:1 Normalization User Dictionary^g | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Feature | Portuguese | Pushto | Romanian | Russian | Spanish | Swedish | Thai | Turkish | Urdu |
|---|---|---|---|---|---|---|---|---|---|
| Tokenization | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sentence Boundary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Token Normalization^a | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Lemma Lookup^b | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Part-of-Speech Tagging^c | ✓ | ✓ | ✓ | ✓ | | | | | |
| Disambiguation^d | ✓ | ✓ | ✓ | | | | | | |
| Lemma User Dictionary | ✓ | ✓ | ✓ | ✓ | ✓ | | | | |
| Segmentation User Dictionary | | | | | | | ✓ | | |
| Decompounding | | | | | | ✓ | | | |
| Readings | | | | | | | | | |
| Script Conversion | | | | | | | | | |
| Stem^e | | | | | | | | | |
| Semitic Root^f | | | | | | | | | |
| n:1 Normalization User Dictionary^g | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
a. With the exception of Hebrew and Arabic, the tokenizer can apply Normalization Form KC (NFKC) to the tokens. For Arabic, Persian, and Urdu, see Arabic, Persian, and Urdu Token Analysis [103].
b. For Japanese, Japanese Lemma Normalization [108] is also available.
c. See Part-of-Speech Tags [112].
d. With the exception of Japanese, by default the analyzer returns a disambiguated analysis for the supported languages. For performance, Japanese disambiguation is turned off by default.
e. The base form of the token to which affixes may be added. For Finnish, this is the Porter stem.
f. The Semitic root for the token (an empty string if the root cannot be determined).
g. Maps one or more tokens to a single token; the second sketch following these notes illustrates the idea.
h. Simplified and Traditional Chinese.
i. See Chinese Script Converter [72].
j. For Hebrew, the tokenizer generates a lemma and a Semitic root for each token.
k. The base linguistics token filter excludes Japanese lemmas for auxiliary verbs, particles, and adverbs from the token stream.
l. Transcriptions rendered in Hiragana for Japanese tokens.
m. Bokmål and Nynorsk.
n. Western Farsi and Dari.
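
To make the Tokenizer/Analyzer division of labor concrete, here is a minimal usage sketch wiring the two components together. It is illustrative only: the class, method, and option names (`BaseLinguisticsFactory`, `BaseLinguisticsOption`, `Tokenizer`, `Analyzer`, `Token`, `Analysis`) and the import are assumptions modeled on recent RBL-JE releases, and the root directory is a placeholder; consult the API documentation shipped with your release for the exact signatures.

```java
import com.basistech.rosette.bl.*;  // package name assumed; see your release's javadoc
import java.io.StringReader;
import java.util.List;

// Hedged usage sketch: drives the RBL-JE Tokenizer (normalization, tokenization,
// sentence boundary detection) and the Analyzer (lemma lookup with guessing as
// the fallback). Names and signatures are assumptions, not a verified sample.
public class FeatureSetSketch {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/path/to/rbl-je"); // placeholder
        factory.setOption(BaseLinguisticsOption.language, "eng");

        // Tokenizer side of the table: tokens plus sentence boundary information.
        Tokenizer tokenizer = factory.createTokenizer(
                new StringReader("The quick brown foxes jumped."), "doc1");

        // Analyzer side of the table: lemmas, decompounding, user dictionaries.
        Analyzer analyzer = factory.createAnalyzer();

        Token token;
        while ((token = tokenizer.next()) != null) {
            analyzer.analyze(token);                 // attaches analyses to the token
            List<Analysis> analyses = token.getAnalyses();
            String lemma = analyses.isEmpty()
                    ? token.getText()                // lookup and guessing both failed
                    : analyses.get(0).getLemma();
            System.out.println(token.getText() + " -> " + lemma);
        }
    }
}
```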
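
Footnote g's many-to-one (n:1) normalization can also be illustrated. The following self-contained mock shows the contract (a sequence of adjacent tokens collapses to a single normalized token) without reflecting RBL-JE's actual dictionary file format or API; the sample entry ("hard" + "disk" to "harddisk") is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Conceptual mock of an n:1 normalization user dictionary: a sequence of input
// tokens is replaced by one normalized token. RBL-JE's real dictionaries are
// compiled files configured through the factory; this only demonstrates the
// many-to-one behavior described in footnote g.
public class NormalizationMock {
    // Invented sample entry: two tokens normalize to one.
    private static final Map<List<String>, String> ENTRIES =
            Map.of(List.of("hard", "disk"), "harddisk");

    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // Greedily try the two-token entry starting at position i.
            if (i + 1 < tokens.size()
                    && ENTRIES.containsKey(tokens.subList(i, i + 2))) {
                out.add(ENTRIES.get(tokens.subList(i, i + 2)));
                i += 2;                      // consumed two tokens, emitted one
            } else {
                out.add(tokens.get(i));
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize(Arrays.asList("a", "hard", "disk", "crash")));
        // prints: [a, harddisk, crash]
    }
}
```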