Appendix B. Arabic, Persian, and Urdu Token Analysis

For Arabic, Persian (Western Farsi and Dari), and Urdu, RBL-JE may return multiple analyses for each token. Each analysis contains the normalized form of the token, a part-of-speech tag, and a stem. For Arabic only, the analysis also includes a lemma and a Semitic root.

This appendix provides information on token normalization and the generation of variant tokens. For Arabic, it also provides information on stems and Semitic roots.

Token normalization is performed in two stages: generic Arabic script normalization and language-specific normalization.

B.1. Generic Arabic Script Token Normalization

Generic Arabic script normalization includes the following:

• The following diacritics are removed: kashida, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.

• The following characters are removed: left-to-right marker, right-to-left marker, zero-width joiner, BOM, non-breaking space, soft hyphen, space.

• Alef maksura is converted to yeh unless it is at the end of the word or followed by hamza.

• All digits are converted to Arabic numerals: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.¹ Thousand separators are removed, and the decimal separator is changed to a period (U+002E). The normalizer also handles cases where ر (reh) is incorrectly used as the decimal separator.

• Alef with hamza above: ٵ (U+0675), ٲ (U+0672), or ا (U+0627) combined with hamza above (U+0654) is converted to أ (U+0623).

• Alef with madda above: ا (U+0627) combined with madda above (U+0653) is converted to آ (U+0622).

• Alef with hamza below: ٳ (U+0673) or ا (U+0627) combined with hamza below (U+0655) is converted to إ (U+0625).

• Misra sign to ain: Misra sign (U+060F) is converted to ع (U+0639).

¹. As distinguished from the Arabic-Indic numerals often used in Arabic script (٠, ١, ٢, ٣, ٤, ٥, ٦, ٧, ٨, ٩) or the Eastern Arabic-Indic numerals often used in Persian and Urdu Arabic script (۰, ۱, ۲, ۳, ۴, ۵, ۶, ۷, ۸, ۹).

• Swash kaf to kaf: ڪ (U+06AA) is converted to ک (U+06A9).

• Heh: ە (U+06D5) is converted to ه (U+0647).

• Yeh with hamza above: ی (U+06CC), ى (U+0649), or ي (U+064A) combined with hamza above (U+0654) is converted to ئ (U+0626).

• Waw with hamza above: و (U+0648) combined with hamza above (U+0654), ٷ (U+0677), or ٶ (U+0676) is converted to ؤ (U+0624).
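The generic rules above can be sketched in a few lines of Python. This is an illustrative approximation, not RBL-JE's implementation: the function name, the code points chosen for the characters that the list names without a code point, the order in which the rules run, and the reading of "followed by hamza" as U+0621 are all assumptions. ASCII commas used as thousand separators are not handled.

```python
import re

# Characters deleted outright. Code points are inferred from the names above.
_TABLE = {cp: None for cp in (
    0x0640,                          # kashida (tatweel)
    0x064C, 0x064D,                  # dammatan, kasratan
    0x064E, 0x064F, 0x0650,          # fatha, damma, kasra
    0x0651, 0x0652,                  # shadda, sukun
    0x200E, 0x200F, 0x200D,          # LRM, RLM, zero-width joiner
    0xFEFF, 0x00A0, 0x00AD, 0x0020,  # BOM, no-break space, soft hyphen, space
    0x066C,                          # Arabic thousands separator (assumed)
)}
# Single characters replaced by another character.
_TABLE.update({
    0x0675: "\u0623", 0x0672: "\u0623",  # alef with hamza above
    0x0673: "\u0625",                    # alef with hamza below
    0x060F: "\u0639",                    # misra sign -> ain
    0x06AA: "\u06A9",                    # swash kaf -> kaf
    0x06D5: "\u0647",                    # U+06D5 -> heh
    0x0677: "\u0624", 0x0676: "\u0624",  # waw with hamza above
    0x066B: ".",                         # Arabic decimal separator (assumed)
})
for i in range(10):
    _TABLE[0x0660 + i] = str(i)  # Arabic-Indic digits
    _TABLE[0x06F0 + i] = str(i)  # Eastern Arabic-Indic digits

# Base letter + combining mark -> precomposed letter.
_SEQUENCES = [
    ("\u0627\u0654", "\u0623"),  # alef + hamza above
    ("\u0627\u0653", "\u0622"),  # alef + madda above
    ("\u0627\u0655", "\u0625"),  # alef + hamza below
    ("\u06CC\u0654", "\u0626"),  # Farsi yeh + hamza above
    ("\u0649\u0654", "\u0626"),  # alef maksura + hamza above
    ("\u064A\u0654", "\u0626"),  # yeh + hamza above
    ("\u0648\u0654", "\u0624"),  # waw + hamza above
]


def normalize_generic(token: str) -> str:
    """Apply the generic Arabic script rules to a single token."""
    for sequence, replacement in _SEQUENCES:
        token = token.replace(sequence, replacement)
    token = token.translate(_TABLE)
    # Alef maksura -> yeh unless it is word-final or followed by hamza.
    token = re.sub(r"\u0649(?!\u0621|\Z)", "\u064A", token)
    # Reh wrongly used as a decimal separator between two digits.
    token = re.sub(r"(?<=[0-9])\u0631(?=[0-9])", ".", token)
    return token
```

Under these assumptions, `normalize_generic("\u0661\u066C\u0662\u0663\u0664\u066B\u0665")` (١٬٢٣٤٫٥) yields `"1234.5"`.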

B.1.1. Arabic Token Analysis

B.1.1.1. Token Normalization

For Arabic input, the following language-specific normalizations are performed on the output of the Arabic script normalization:

• Zero-width non-joiner (U+200C) and superscript alef (U+0670) are removed.

• Fathatan (U+064B) is removed.

• Farsi yeh (U+06CC) is normalized to yeh (U+064A) if it is initial or medial; if final, it is normalized to alef maksura (U+0649).

• Farsi kaf ک (U+06A9) is converted to ك (U+0643).

• Heh ہ (U+06C1) or ھ (U+06BE) is converted to ه (U+0647).
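As a rough illustration, the Arabic-specific rules listed above could be written as follows. The function name is hypothetical, and the sketch assumes the removals run before the position-sensitive Farsi yeh rule, so that a trailing mark does not hide a word-final yeh. The text does not state the actual order.

```python
import re

# Removals and one-to-one replacements from the list above.
_ARABIC_TABLE = {
    0x200C: None,      # zero-width non-joiner
    0x0670: None,      # superscript alef
    0x064B: None,      # fathatan
    0x06A9: "\u0643",  # Farsi kaf -> Arabic kaf
    0x06C1: "\u0647",  # heh goal -> heh
    0x06BE: "\u0647",  # heh doachashmee -> heh
}


def normalize_arabic(token: str) -> str:
    """Apply the Arabic-specific rules to a generically normalized token."""
    token = token.translate(_ARABIC_TABLE)
    # Word-final Farsi yeh becomes alef maksura ...
    token = re.sub(r"\u06CC\Z", "\u0649", token)
    # ... and any remaining (initial or medial) Farsi yeh becomes yeh.
    return token.replace("\u06CC", "\u064A")
```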

Following morphological analysis, the normalizer does the following:

• Alef wasla ٱ (U+0671) is replaced with plain alef ا (U+0627).

• If a word starts with the incorrect form of an alef, the normalizer retrieves the correct form: plain alef ا (U+0627), alef with hamza above أ (U+0623), alef with hamza below إ (U+0625), or alef with madda above آ (U+0622).

B.1.1.1.1. Token Variants

The analyzer can generate a number of variant forms for each Arabic token to account for the orthographic irregularity seen in contemporary written Arabic. Each token variant is generated in normalized form.

• If a token contains a word-final hamza preceded by yeh or alef maksura, then a variant is created that replaces these with hamza seated on yeh.

• If a token contains waw followed by hamza on the line, a variant is created that replaces these with hamza seated on waw.

• Variants are created where word-final heh is replaced by teh marbuta, and word-final alef maksura is replaced by yeh.
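The variant rules above can be sketched as a function that returns the alternative spellings of a normalized token. The function name is hypothetical, "hamza on the line" is taken to be U+0621, and each rule is applied independently to the input. Whether the analyzer also combines rules is not stated in the text.

```python
import re


def arabic_variants(token: str) -> list[str]:
    """Return variant spellings of a normalized Arabic token (may be empty)."""
    candidates = [
        # Word-final hamza after yeh or alef maksura -> hamza seated on yeh.
        re.sub(r"[\u064A\u0649]\u0621\Z", "\u0626", token),
        # Waw followed by hamza on the line -> hamza seated on waw.
        token.replace("\u0648\u0621", "\u0624"),
        # Word-final heh -> teh marbuta.
        re.sub(r"\u0647\Z", "\u0629", token),
        # Word-final alef maksura -> yeh.
        re.sub(r"\u0649\Z", "\u064A", token),
    ]
    # Drop duplicates and rules that left the token unchanged, keeping order.
    return [v for v in dict.fromkeys(candidates) if v != token]
```

Indexing the variants alongside the original token is one way to let a query spelled with either convention find the same word.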

B.1.1.1.2. Stems and Semitic Roots

The stem returned is the normalized token with affixes (such as prepositions, conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.

In the process of stripping morphemes (affixes) from a token, the analyzer produces a stem, a lemma, and a Semitic root. Stems and lemmas result from stripping most of the inflectional morphemes, while Semitic roots result from stripping derivational morphemes.

Inflectional morphemes indicate plurality or verb tense. Different forms, such as singular and plural nouns or past and present verb tenses, share the same stem if the forms are regular. If some of the forms are irregular, they do not share the same stem, but do share the same lemma. Because stems and lemmas preserve the meaning of words, they are very useful in text retrieval and in search generally.

Words that have a more distant linguistic relationship share the same Semitic root.

Examples. The singular form الكتابة (al-kitaaba, the writing) and plural form كتابات (kitaabaat, writings) share the same stem: كتاب (kitaab). On the other hand, كُتُب (kutub, books) is an irregular form and does not have the same stem as كِتَاب (kitaab, book). Both forms do, however, share the same lemma, which is the singular form كِتَاب (kitaab).

The words مكتبة (maktaba, library), المَكْتَب (al-maktab, the desk), كُتُب (kutub, books), and الكتابة (al-kitaaba, the writing) are related in the sense that a library contains books and desks, a desk is used to write on, and writings are often found in books. All of these words share the same Semitic root: كَتَبَ (kataba).

B.1.2. Persian Token Analysis

B.1.2.1. Persian Token Normalization

The following Persian-specific normalizations are performed on the output of the Arabic script normalization:

• Fathatan (U+064B) and superscript alef (U+0670) are removed.

• Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).

• Arabic kaf ك (U+0643) is converted to Farsi kaf ک (U+06A9).

• Heh goal (U+06C1) or heh doachashmee (U+06BE) is converted to heh (U+0647).

• Heh with hamza ۂ (U+06C2) is converted to ۀ (U+06C0).

• Arabic yeh ي (U+064A) or ى (U+0649) is converted to Farsi yeh ی (U+06CC).
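Because every rule in this list is either a removal or a one-to-one replacement, the list can be approximated with a single translation table. The sketch below is illustrative only. The names `PERSIAN_TABLE` and `normalize_persian` are hypothetical, and the input is assumed to have already passed through the generic Arabic script normalization.

```python
# Removals and one-to-one replacements from the list above.
PERSIAN_TABLE = str.maketrans({
    "\u064B": None,      # fathatan
    "\u0670": None,      # superscript alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0671": "\u0627",  # alef wasla -> alef
    "\u0643": "\u06A9",  # Arabic kaf -> Farsi kaf
    "\u06C1": "\u0647",  # heh goal -> heh
    "\u06BE": "\u0647",  # heh doachashmee -> heh
    "\u06C2": "\u06C0",  # heh goal with hamza -> heh with yeh above
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0649": "\u06CC",  # alef maksura -> Farsi yeh
})


def normalize_persian(token: str) -> str:
    """Apply the Persian-specific rules to a generically normalized token."""
    return token.translate(PERSIAN_TABLE)
```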

Following morphological analysis:

• Zero-width non-joiner (U+200C) and superscript alef (U+0670) are removed.

B.1.2.2. Token Variants

The analyzer can generate variant forms for some tokens to account for the orthographic irregularity seen in contemporary written Persian. Each variant is generated in normalized form.

• If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Farsi yeh (U+06CC).

• If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).

• If a word contains a zero-width non-joiner (U+200C), a variant is generated without the zero-width non-joiner.

• If a word ends in teh marbuta (U+0629), two variants are generated. The first replaces the teh marbuta with teh (U+062A); the second replaces the teh marbuta with heh (U+0647).
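A minimal sketch of these Persian variant rules follows. The function name is hypothetical, and each rule is applied independently to the input token, which is an assumption. The text does not say whether rules are combined.

```python
import re


def persian_variants(token: str) -> list[str]:
    """Return variant spellings of a normalized Persian token (may be empty)."""
    candidates = [
        token.replace("\u0626", "\u06CC"),     # hamza on yeh -> Farsi yeh
        token.replace("\u0624", "\u0648"),     # hamza on waw -> waw
        token.replace("\u200C", ""),           # drop zero-width non-joiner
        re.sub(r"\u0629\Z", "\u062A", token),  # final teh marbuta -> teh
        re.sub(r"\u0629\Z", "\u0647", token),  # final teh marbuta -> heh
    ]
    # Drop duplicates and rules that left the token unchanged, keeping order.
    return [v for v in dict.fromkeys(candidates) if v != token]
```

For example, می‌روم written with a zero-width non-joiner produces one variant, the same letters without the non-joiner.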

B.1.3. Urdu Token Analysis

B.1.3.1. Token Normalization

The following Urdu-specific normalizations are performed on the output of the Arabic script normalization:

• Fathatan (U+064B), zero-width non-joiner (U+200C), and jazm (U+06E1) are removed.

• Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).

• Kaf ك (U+0643) is converted to ک (U+06A9).

• Heh with hamza ۀ (U+06C0) is converted to ۂ (U+06C2).

• Yeh ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
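The Urdu rules are also all removals or one-to-one replacements, so a translation table is enough for a sketch. The names below are hypothetical, and the input is assumed to be the output of the generic Arabic script normalization. Note that heh with hamza is mapped from U+06C0 to U+06C2 here, the opposite direction from the Persian rule.

```python
# Removals and one-to-one replacements from the list above.
URDU_TABLE = str.maketrans({
    "\u064B": None,      # fathatan
    "\u200C": None,      # zero-width non-joiner
    "\u06E1": None,      # jazm
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0671": "\u0627",  # alef wasla -> alef
    "\u0643": "\u06A9",  # Arabic kaf -> kaf U+06A9
    "\u06C0": "\u06C2",  # heh with yeh above -> heh goal with hamza
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0649": "\u06CC",  # alef maksura -> Farsi yeh
})


def normalize_urdu(token: str) -> str:
    """Apply the Urdu-specific rules to a generically normalized token."""
    return token.translate(URDU_TABLE)
```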

B.1.3.2. Token Variants

The analyzer can generate a number of variant forms for each Urdu token to account for the orthographic irregularity seen in contemporary written Urdu. Each variant is generated in normalized form.

• If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Farsi yeh (U+06CC).

• If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).

• If a word contains heh doachashmee (U+06BE), a variant is generated replacing the heh doachashmee with heh goal (U+06C1).

• If a word ends with teh marbuta (U+0629), a variant is generated replacing the teh marbuta with heh goal (U+06C1).
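The Urdu variant rules can be sketched in the same style. The function name is hypothetical, each rule is applied independently (an assumption), and teh marbuta is taken to be U+0629, the code point Unicode assigns to ARABIC LETTER TEH MARBUTA.

```python
import re

TEH_MARBUTA = "\u0629"  # assumption: Unicode's ARABIC LETTER TEH MARBUTA


def urdu_variants(token: str) -> list[str]:
    """Return variant spellings of a normalized Urdu token (may be empty)."""
    candidates = [
        token.replace("\u0626", "\u06CC"),  # hamza on yeh -> Farsi yeh
        token.replace("\u0624", "\u0648"),  # hamza on waw -> waw
        token.replace("\u06BE", "\u06C1"),  # heh doachashmee -> heh goal
        re.sub(TEH_MARBUTA + r"\Z", "\u06C1", token),  # final teh marbuta -> heh goal
    ]
    # Drop duplicates and rules that left the token unchanged, keeping order.
    return [v for v in dict.fromkeys(candidates) if v != token]
```

A token such as بھائی, which contains both heh doachashmee and hamza on yeh, produces two variants under this sketch, one per rule.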
