3.4. Running the Samples

The Tokenize and Analyze samples are in rbl-je-7.20.0.c58.3/samples/tokenize-analyze. In a Bash shell (Unix) or Command Prompt (Windows), navigate to this directory and use the Ant build script to compile and run them. Your license file (rlp-license.xml) must be in the licenses subdirectory of the RBL-JE installation.

To compile and run both samples, call

ant run

By default, Tokenize tokenizes the sample German document, and Analyze provides a disambiguated analysis of each token.

The output appears in two files: deu-tokenized.txt and deu-analyzed.txt. Each line of the first file contains a token, a Tab, and the <ALNUM> tag; a blank line follows the end of each sentence.¹
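The layout of the tokenized file lends itself to simple post-processing. The following is a minimal sketch, not part of the RBL-JE samples, that groups the tokens into sentences. The class name TokenizedOutput is hypothetical, and the sketch assumes the file is UTF-8 encoded and that the only non-token lines are the <LEMMA> and <ROOT> lines written for Hebrew.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with RBL-JE.
final class TokenizedOutput {

    /** Groups token lines into sentences; a blank line ends a sentence. */
    static List<List<String>> sentences(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the sentence collected so far.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            // Each line is expected to be: token, Tab, tag.
            String[] fields = line.split("\t", 2);
            String tag = fields.length == 2 ? fields[1].trim() : "";
            // Assumption: Hebrew lemma and root lines are not tokens, so skip them.
            if (tag.equals("<LEMMA>") || tag.equals("<ROOT>")) {
                continue;
            }
            current.add(fields[0]);
        }
        // Keep a final sentence that is not followed by a blank line.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "deu-tokenized.txt";
        // Files.readAllLines decodes the file as UTF-8 (assumed encoding).
        List<List<String>> result = sentences(Files.readAllLines(Paths.get(file)));
        System.out.println(result.size() + " sentences");
    }
}
```

Run it from the directory that contains the output file, optionally passing a different file name as the first argument.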

The second file contains the token, lemma, part of speech, and compound components (where relevant) on each line. For those languages for which disambiguation is not supported², there may be multiple rows for each token (the token appearing in the first column), one for each analysis.

Here is a fragment with a sentence from deu-analyzed.txt:

TOKEN          LEMMA       POS     COMPOUNDS
-----          -----       ---     ---------
3.11.06        3.11.06     CARD
-              -           PUNCT
Not            not         ADV
und            und         COORD
Elend          Elend       NOUN
in             in          PREP
ihren          ihr         POSDET
Heimatländern  Heimatland  NOUN    [Heimat, Land]
lassen         lassen      VVFIN
immer          immer       ADV
mehr           mehr        INDADJ
Afrikaner      Afrikaner   NOUN
die            die         ART
Reise          Reise       NOUN
nach           nach        PREP
Europa         Europa      NOUN
antreten       antreten    VVINF
.              .           SENT
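To work with the analyzed output programmatically, each row can be split into its columns. The sketch below is illustrative only: the class AnalyzedOutput and its Row type are hypothetical, and it assumes that the columns in the file are separated by Tabs (the fragment above does not show the delimiter), that compound components appear in square brackets separated by commas, and that the two header lines are present in the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with RBL-JE.
final class AnalyzedOutput {

    /** One analysis row: token, lemma, part of speech, compound components. */
    static final class Row {
        final String token;
        final String lemma;
        final String pos;
        final List<String> compounds;

        Row(String token, String lemma, String pos, List<String> compounds) {
            this.token = token;
            this.lemma = lemma;
            this.pos = pos;
            this.compounds = compounds;
        }
    }

    /** Parses one line; returns null for blank lines and the two header lines. */
    static Row parseRow(String line) {
        // Assumption: columns are Tab-separated.
        String[] f = line.split("\t");
        if (f.length < 3) {
            return null;
        }
        String pos = f[2].trim();
        boolean header = f[0].equals("TOKEN") && pos.equals("POS");
        boolean rule = pos.matches("-+"); // the row of dashes under the header
        if (header || rule) {
            return null;
        }
        List<String> compounds = new ArrayList<>();
        if (f.length > 3) {
            String c = f[3].trim();
            // Assumption: components look like "[Heimat, Land]".
            if (c.startsWith("[") && c.endsWith("]")) {
                c = c.substring(1, c.length() - 1);
            }
            for (String part : c.split(",")) {
                if (!part.trim().isEmpty()) {
                    compounds.add(part.trim());
                }
            }
        }
        return new Row(f[0], f[1], pos, compounds);
    }
}
```

Because a token can occupy several rows when disambiguation is off, a caller should treat each Row as one analysis rather than one token.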

To run the samples with sample text in a different language, set the test.language parameter to the ISO 639-3 code for the language. For example, to tokenize and analyze the Spanish sample, call

ant -Dtest.language=spa run

¹. For Hebrew, Tokenize writes up to three lines for each token: (1) token Tab <ALNUM>, (2) lemma Tab <LEMMA>, (3) root Tab <ROOT>.

². Or for Japanese, where disambiguation is turned off by default.
