3.4. Running the Samples
The two samples, Tokenize and Analyze, are in rbl-je-7.20.0.c58.3/samples/tokenize-analyze. In a Bash shell (Unix) or Command Prompt (Windows), navigate to this directory and use the Ant build script to compile and run them. Your license (rlp-license.xml) must be in the licenses subdirectory of the RBL-JE installation.
To compile and run both samples, call
ant run
Tokenize tokenizes the sample German document, and Analyze provides a disambiguated analysis of each token.
The output appears in two files: deu-tokenized.txt and deu-analyzed.txt. Each line of the first file contains a token, a Tab, and an <ALNUM> tag; a blank line follows the end of each sentence.[1]
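The layout of the tokenized file can be read back with a few lines of Python. The following is a minimal sketch, not part of the RBL-JE distribution: the function name read_sentences and the inline sample lines are hypothetical, and the sample is built from tokens in this section rather than copied from an actual deu-tokenized.txt.

```python
from typing import Iterable, List, Tuple

def read_sentences(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group token/tag lines into sentences, splitting on blank lines."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Each line is assumed to be: token, Tab, tag (for example <ALNUM>).
        token, _, tag = line.partition("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

# Illustrative input only; not actual sample output.
sample = "Not\t<ALNUM>\nund\t<ALNUM>\nElend\t<ALNUM>\n\nReise\t<ALNUM>\n"
sentences = read_sentences(sample.splitlines())
for number, sentence in enumerate(sentences, start=1):
    print(number, [token for token, _ in sentence])
```

To process the real file, pass an open file object instead of the sample string, for example open("deu-tokenized.txt", encoding="utf-8"); the UTF-8 encoding is an assumption here. For Hebrew output, the extra lemma and root lines would appear as additional pairs tagged <LEMMA> and <ROOT>.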
Each line of the second file contains the token, lemma, part of speech, and compound components (where relevant). For languages that do not support disambiguation[2], there may be multiple rows for each token (the token appears in the first column), one for each analysis.
Here is a fragment with a sentence from deu-analyzed.txt:
TOKEN          LEMMA       POS     COMPOUNDS
-----          -----       ---     ---------
3.11.06        3.11.06     CARD
-              -           PUNCT
Not            not         ADV
und            und         COORD
Elend          Elend       NOUN
in             in          PREP
ihren          ihr         POSDET
Heimatländern  Heimatland  NOUN    [Heimat, Land]
lassen         lassen      VVFIN
immer          immer       ADV
mehr           mehr        INDADJ
Afrikaner      Afrikaner   NOUN
die            die         ART
Reise          Reise       NOUN
nach           nach        PREP
Europa         Europa      NOUN
antreten       antreten    VVINF
.              .           SENT
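Rows like those in the fragment above can be turned into structured records for further processing. The sketch below assumes that columns are separated by whitespace, that tokens and lemmas contain no spaces, and that compound components appear as a comma-separated list in square brackets. The Analysis type and the parse_rows function are hypothetical names used for illustration; they are not part of RBL-JE.

```python
import re
from typing import List, NamedTuple

class Analysis(NamedTuple):
    token: str
    lemma: str
    pos: str
    compounds: List[str]

# Three whitespace-separated fields, optionally followed by "[a, b, ...]".
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)(?:\s+\[(.*)\])?\s*$")

def parse_rows(text: str) -> List[Analysis]:
    """Parse analysis rows; lines that do not fit the pattern are skipped."""
    rows: List[Analysis] = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            # The two header lines have four bare columns and land here,
            # as do blank lines.
            continue
        token, lemma, pos, compounds = match.groups()
        parts = [c.strip() for c in compounds.split(",")] if compounds else []
        rows.append(Analysis(token, lemma, pos, parts))
    return rows

fragment = """TOKEN LEMMA POS COMPOUNDS
----- ----- --- ---------
ihren ihr POSDET
Heimatländern Heimatland NOUN [Heimat, Land]
. . SENT
"""

rows = parse_rows(fragment)
for row in rows:
    print(row.token, row.lemma, row.pos, row.compounds)
```

For languages where a token has several analyses, the same parser returns one record per row, so consecutive records may share a token.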
To run the samples with sample text in a different language, set the test.language parameter to the ISO 639-3 code for the language. For example, to tokenize and analyze the Spanish sample, call
ant -Dtest.language=esp run
1. For Hebrew, Tokenize writes up to three lines for each token: (1) token Tab <ALNUM>, (2) lemma Tab <LEMMA>, (3) root Tab <ROOT>.
2. Or for Japanese, where disambiguation is turned off by default.