1.3. Feature Set

The following table indicates the type of support that RBL-JE provides for each supported language. The RBL-JE Tokenizer provides normalization, tokenization, and sentence boundary detection. The RBL-JE Analyzer provides lemma lookup (including orthographic normalization for Japanese), lemma guessing (when the lookup fails), decompounding, and supports lemma, segmentation, and many-to-one normalization user dictionaries.
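RBL-JE's own sentence boundary detection is proprietary and language-aware; purely as an illustration of the task (this is plain JDK code, not RBL-JE API), the standard library's `java.text.BreakIterator` can split text into sentences for a given locale:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {
    // Split text into sentences using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sentences in, two sentences out.
        System.out.println(sentences("It rained. We stayed inside.", Locale.ENGLISH));
    }
}
```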

[The per-language support marks in this table did not survive extraction; only the recoverable column and row headers are reproduced below. Letters in parentheses refer to the footnotes following the table.]

Languages (columns): Arabic, Czech, Chinese (h), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew (j), Hungarian, Italian, Japanese, Korean, Norwegian (m), Persian (n), Polish, Portuguese, Pushto, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Urdu

Features (rows): Tokenization; Sentence Boundary; Token Normalization (a); Lemma Lookup (b, k); Part-of-Speech Tagging (c); Disambiguation (d); Lemma User Dictionary; Segmentation User Dictionary; Decompounding; Readings (l); Script Conversion (i); Stem (e); Semitic Root (f); n:1 Normalization User Dictionary (g)
a. With the exception of Hebrew and Arabic, the tokenizer can apply Normalization Form KC (NFKC) to the tokens. For Arabic, Persian, and Urdu, see Arabic, Persian, and Urdu Token Analysis [103].
b. For Japanese, Japanese Lemma Normalization [108] is also available.
c. See Part-of-Speech Tags [112].
d. With the exception of Japanese, the analyzer returns a disambiguated analysis for the supported languages by default. For performance, Japanese disambiguation is turned off by default.
e. The base form of the token to which affixes may be added. For Finnish, this is the Porter stem.
f. The Semitic root for the token (an empty string if the root cannot be determined).
g. Maps one or more tokens to a single token.
h. Simplified and Traditional Chinese.
i. See Chinese Script Converter [72].
j. For Hebrew, the tokenizer generates a lemma and a Semitic root for each token.
k. The base linguistics token filter excludes Japanese lemmas for auxiliary verbs, particles, and adverbs from the token stream.
l. Transcriptions rendered in Hiragana for Japanese tokens.
m. Bokmål and Nynorsk.
n. Western Farsi and Dari.
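Footnote a refers to Unicode Normalization Form KC (NFKC), which folds compatibility characters such as full-width letters and ligatures into their canonical forms. The JDK exposes the same transform through `java.text.Normalizer`; the snippet below is plain JDK code for illustration, not RBL-JE API:

```java
import java.text.Normalizer;

public class NfkcDemo {
    public static void main(String[] args) {
        // Full-width "ＡＢＣ" (U+FF21..U+FF23) and the ligature "ﬁ" (U+FB01)
        // both fold to plain ASCII under NFKC.
        String raw = "\uFF21\uFF22\uFF23\uFB01";
        String nfkc = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        System.out.println(nfkc); // prints "ABCfi"
    }
}
```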
