1.3. Feature Set

The following table indicates the type of support that RBL-JE provides for each supported language. The RBL-JE Tokenizer provides normalization, tokenization, and sentence boundary detection. The RBL-JE Analyzer provides lemma lookup (including orthographic normalization for Japanese), lemma guessing (when the lookup fails), decompounding, and supports lemma, segmentation, and many-to-one normalization user dictionaries.
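RBL-JE's own sentence boundary detection is proprietary and language-aware; purely as an illustration of the task (this is plain JDK code, not RBL-JE API), the standard library's `java.text.BreakIterator` can split text into sentences for a given locale:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {
    // Split text into sentences using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sentences in, two sentences out.
        System.out.println(sentences("It rained. We stayed inside.", Locale.ENGLISH));
    }
}
```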

[The per-language support marks in this table did not survive extraction; only the recoverable column and row headers are reproduced below. Letters in parentheses refer to the footnotes following the table.]

Languages (columns): Arabic, Czech, Chinese (h), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew (j), Hungarian, Italian, Japanese, Korean, Norwegian (m), Persian (n), Polish, Portuguese, Pushto, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Urdu

Features (rows): Tokenization; Sentence Boundary; Token Normalization (a); Lemma Lookup (b, k); Part-of-Speech Tagging (c); Disambiguation (d); Lemma User Dictionary; Segmentation User Dictionary; Decompounding; Readings (l); Script Conversion (i); Stem (e); Semitic Root (f); n:1 Normalization User Dictionary (g)
a. With the exception of Hebrew and Arabic, the tokenizer can apply Normalization Form KC (NFKC) to the tokens. For Arabic, Persian, and Urdu, see Arabic, Persian, and Urdu Token Analysis [103].
b. For Japanese, Japanese Lemma Normalization [108] is also available.
c. See Part-of-Speech Tags [112].
d. With the exception of Japanese, the analyzer returns a disambiguated analysis for the supported languages by default. For performance, Japanese disambiguation is turned off by default.
e. The base form of the token to which affixes may be added. For Finnish, this is the Porter stem.
f. The Semitic root for the token (an empty string if the root cannot be determined).
g. Maps one or more tokens to a single token.
h. Simplified and Traditional Chinese.
i. See Chinese Script Converter [72].
j. For Hebrew, the tokenizer generates a lemma and a Semitic root for each token.
k. The base linguistics token filter excludes Japanese lemmas for auxiliary verbs, particles, and adverbs from the token stream.
l. Transcriptions rendered in Hiragana for Japanese tokens.
m. Bokmål and Nynorsk.
n. Western Farsi and Dari.
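Footnote a refers to Unicode Normalization Form KC (NFKC), which folds compatibility characters such as full-width letters and ligatures into their canonical forms. The JDK exposes the same transform through `java.text.Normalizer`; the snippet below is plain JDK code for illustration, not RBL-JE API:

```java
import java.text.Normalizer;

public class NfkcDemo {
    public static void main(String[] args) {
        // Full-width "ＡＢＣ" (U+FF21..U+FF23) and the ligature "ﬁ" (U+FB01)
        // both fold to plain ASCII under NFKC.
        String raw = "\uFF21\uFF22\uFF23\uFB01";
        String nfkc = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        System.out.println(nfkc); // prints "ABCfi"
    }
}
```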
