Rosette Name Indexer for Elasticsearch

RNI Elasticsearch Plugin

Version 5.3.2.0

Copyright © 2000-2017 Basis Technology Corporation. All rights reserved. This document is property of and is proprietary to Basis Technology Corporation. It is not to be disclosed or reproduced in whole or in part without the express written consent of Basis Technology Corporation.

Basis Technology is a registered trademark of Basis Technology Corporation. All other brand names may be trademarks of their respective owners.

U.S. Government Rights. This software is commercial computer software owned by Basis Technology Corporation. In accordance with DFARS 48 CFR 227-7202-1 and FAR 48 CFR 227.405-3(a), its use, reproduction, and disclosure by the Government is subject to the terms of Basis Technology's standard software license agreement and as may be set forth in the applicable Government Contract. Copyright © 2000-2017 Basis Technology Corporation. All rights reserved. Licensor/Contractor: Basis Technology Corporation, One Alewife Center, Cambridge, MA 02140, USA.

May 2017

1. Introduction

RNI-Elasticsearch is an Elasticsearch¹ plugin for building fuzzy name retrieval and matching applications for persons, locations, and organizations. The plugin was built with Rosette Name Indexer 7.17.1 and is compatible with Elasticsearch 5.3.2.

Supported languages with the ISO 639-3 codes that you use to identify the language are as follows:
English (eng), French (fra), German (deu), Italian (ita), Portuguese (por), Spanish (spa), Korean (kor), Chinese (zho), Japanese (jpn), Russian (rus), Arabic (ara), Western Farsi (pes), Dari (prs), Pushto (pus), and Urdu (urd).²

2. Getting started

To use RNI-Elasticsearch you need the RNI Elasticsearch plugin 5.3.2.0, an RLP License, and Elasticsearch 5.3.2.³

1. If you do not already have it, install Elasticsearch.
Download and unzip elasticsearch-5.3.2.zip.⁴ Make sure you have exactly Elasticsearch 5.3.2 before you install the RNI Elasticsearch plugin. If your version of Elasticsearch does not match exactly, the plugin will not install.

2. Install the RNI-Elasticsearch plugin.
Navigate to the elasticsearch-5.3.2 root directory and run

bin/elasticsearch-plugin install file:///path/to/rni-es- Elasticsearch5.3.2..zip

Use the absolute file path to refer to the plugin zip.
You may be prompted to grant permissions necessary for the plugin to function.
The RNI-Elasticsearch plugin is now in plugins/rni.

3. Copy the RLP License (rlp-license.xml) to plugins/rni/bt_root/rlp/rlp/licenses.
This license must be in place before you can use RNI-Elasticsearch.
To start the Elasticsearch server, run

bin/elasticsearch

3. Usage Pattern

1. Create an index.
2. Define a mapping for fields that will contain person, location, or organization names. The type for each of these fields is "rni_name".
3. Create documents that contain one or more name fields along with other fields of interest. Each name field in a document will contain a name.
4. Test the mapping.
5. Query the index.
The following snippets use the cURL⁵ command-line tool to illustrate the Elasticsearch API for running the RNI-Elasticsearch plugin.

3.1 Creating an Index

The following cURL statement creates an index named rni-test.

curl -XPUT 'http://localhost:9200/rni-test'

3.2 Define Mapping

Specify a document type for the documents you plan to create, and set the "type" for name fields to "rni_name". The following statement maps the "primary_name" and "aka" (also known as) fields in the "record" document to the "rni_name" type in the "rni-test" index.

curl -XPUT 'http://localhost:9200/rni-test/record/_mapping' -d '{
    "record" :{
        "properties" :{
            "primary_name" :{ "type" :"rni_name" },
            "aka" :{ "type" :"rni_name" },
            "occupation" :{ "type" :"string" }
        }
    }
}'

3.3 Creating Documents

You may include document fields other than name fields.

curl -XPUT 'http://localhost:9200/rni-test/record/1' -d '{
  "primary_name" :"Joe Schmoe",
  "aka" :"Bossman",
  "occupation" :"business owner" 
}'

For the name fields, you can include individual properties in place of just a name string. Entity type is particularly useful.

Property	Required	Description
"data"	✓	The name string.
"language"		ISO 639-3 Code for the language of use: the language of the document in which the name was found.
"languageOfOrigin"		ISO 639-3 Code for the language of origin of the name. For example, a name of Spanish origin (spa) may be found in an English (eng) document.
"script"		ISO 15924 code for the script: the script for all languages supported in this release is "Latn".
"entityType"		"PERSON", "LOCATION", or "ORGANIZATION".
"uid"		Unique string identifier for the document.

Example:

curl 'http://localhost:9200/rni-test/record/3' -d '{ 
    "primary_name" :{
        "data" :"Joe Schmoe",
        "language" :"eng",
        "script" :"Latn",
        "entityType" :"PERSON"
    }
}'
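When documents are generated programmatically, it can help to build the name object from the properties in the table above and reject unsupported values before sending the request. The following Python sketch is illustrative only: the helper name make_rni_name is hypothetical and is not part of the plugin, and the checks simply mirror the table.

```python
import json

# Values and property names taken from the property table above.
ENTITY_TYPES = {"PERSON", "LOCATION", "ORGANIZATION"}
OPTIONAL_KEYS = ("language", "languageOfOrigin", "script", "entityType", "uid")


def make_rni_name(data, **props):
    """Build the JSON object for an rni_name field. Only "data" is required."""
    if not data:
        raise ValueError('"data" (the name string) is required')
    unknown = set(props) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError("unsupported properties: %s" % sorted(unknown))
    entity_type = props.get("entityType")
    if entity_type is not None and entity_type not in ENTITY_TYPES:
        raise ValueError("entityType must be one of %s" % sorted(ENTITY_TYPES))
    name = {"data": data}
    name.update(props)
    return name


# Document body equivalent to the curl example above.
doc = {
    "primary_name": make_rni_name(
        "Joe Schmoe", language="eng", script="Latn", entityType="PERSON"
    )
}
print(json.dumps(doc, indent=2))
```

The printed JSON can be used as the -d payload of the curl command shown above.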

3.4 Testing RNI Integration

After you have created a document, you should test that the fields are mapped correctly and that RNI is running.

1. Add a document.

2. Send a test query from Elasticsearch:

curl -XGET 'localhost:9200/rni-test/rni_plugin/_verify_installation?rni-name-field=primary_name'

The value of rni-name-field must be the name of a field of type rni_name that exists in your index (for example, primary_name).

3. Look at the curl response.

If the response says that the mapping and RNI passed, then the integration was successful.
If the response says that the mapping failed, then make sure you have specified the correct field (rni-name-field) and that the specified field is correctly mapped as type rni_name.
If the response says that RNI failed, then make sure that there is a document for Elasticsearch to query and that the RNI plugin is installed.

3.5 Query the Index

The query for a name consists of two parts.

3.5.1 Base Query

The base query is a standard query against a name field:

"query" :{ 
   "match" :{ 
       "primary_name" :"Jo Shmoe" 
  } 
}

Querying supports the same name properties that you may use when indexing documents. Unlike during document creation, you must pass the JSON object containing the name fields as a string.

curl 'http://localhost:9200/rni-test/record/_search' -d '{ 
   "query" :{ 
       "match" :{ 
           "primary_name" : "{\"data\" :\"Jo Shmoe\", \"language\" :\"eng\", \"entityType\" :\"PERSON\"}"
       } 
   } 
}'
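Escaping the inner JSON by hand is error-prone. As a sketch, the same request body can be produced by serializing the name object first and then embedding the resulting string in the match query. The function name name_match_query is made up for this example.

```python
import json


def name_match_query(field, name_props):
    """Return a search body whose match value is the name object serialized to a string."""
    # The name object is serialized once here; json.dumps on the whole body
    # then escapes the inner quotes automatically.
    return {"query": {"match": {field: json.dumps(name_props)}}}


body = name_match_query(
    "primary_name",
    {"data": "Jo Shmoe", "language": "eng", "entityType": "PERSON"},
)
payload = json.dumps(body)  # request body for /rni-test/record/_search
print(payload)
```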

Much like during indexing, RNI creates a set of keys based on the name and then generates a more complex internal query to match against the indexed keys.

3.5.2 Rescoring with the RNI Pairwise Name Match

The second part of the query uses Elasticsearch Rescoring⁶ to ensure that only good candidates are passed to the RNI pairwise matcher, which is a computationally intensive process.

Rescoring includes the following parameters:

• window_size (an integer, defaults to 10) specifies how many documents from the base query should be passed to the RNI pairwise matcher.

Use this parameter to limit the number of compute-intensive name matches that need to be performed, thus decreasing query latency.

• query_weight (a float, defaults to 1.0) specifies the weighting of the score returned by the base query.

In the context of RNI pairwise matching, the base query score has little meaning, so we suggest you set it to 0.0.

• rescore_query_weight (a float, defaults to 1.0) specifies the weighting of the maximum RNI pairwise match score.

If query_weight is 0.0 and rescore_query_weight is 1.0, the score that is returned by rescoring is the RNI pairwise match score.

• score_to_rescore_restriction (a float, defaults to 0.4, cannot be negative) dynamically controls the minimum query score a document needs to be passed to the RNI rescorer.

A value of 0.0 will not cut off any documents from being rescored. Higher values rescore fewer documents, increasing speed at the cost of accuracy.

• window_size_allowance (a float, defaults to 0.5, must be in interval (0, 1]) dynamically controls the window size for rescoring. No more than window_size names will be scored.

A value of 1.0 will not cut off any documents from being rescored. Higher values rescore more documents, increasing accuracy at the cost of speed.
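As a rough illustration of how the two weights interact, the sketch below combines a base query score and an RNI pairwise score the way Elasticsearch rescoring does under its default score_mode of "total". The function name and the sample scores are invented for the example.

```python
def rescored(base_score, rni_score, query_weight=1.0, rescore_query_weight=1.0):
    """Final score of a document inside the rescore window (score_mode "total")."""
    return query_weight * base_score + rescore_query_weight * rni_score


# With the suggested weights, only the RNI pairwise score survives.
print(rescored(7.3, 0.80, query_weight=0.0, rescore_query_weight=1.0))

# With the default weights, the unbounded base query score dominates the result.
print(rescored(7.3, 0.80))
```

This is why setting query_weight to 0.0 yields scores on the 0 to 1 scale used by the pairwise matcher.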

In the following example, pairwise matching is performed on the top 200 names returned by the base query.

"rescore" :{ 
   "window_size" :200, 
   "query" :{ 
     "rescore_query" :{ 
       "function_score" :{ 
         "name_score" :{ 
           "field" :"primary_name", 
           "query_name" :"Jo Shmoe" 
         } 
       } 
     }, 
     "query_weight" :0.0, 
     "rescore_query_weight" :1.0 
   } 
 }

The "name_score" function matches every name in the given field against the query name and returns the maximum score to the rescorer.

The "name_score" function score query must be given at least one object that specifies:

• field: the field of type "rni_name" to match against

• query_name: the query name

It also supports all of the name properties mentioned previously.

This example illustrates the full query incorporating both match and rescore, using RNI query parameters.

curl 'http://localhost:9200/rni-test/record/_search' -d '{   
        "query" :{ 
                "match" :{ 
                        "primary_name" :"Jo Shmoe" 
                } 
        }, 
        "rescore" :{ 
                "window_size" :200, 
                "query" :{ 
                        "rescore_query" :{ 
                                "function_score" :{ 
                                        "name_score" :{ 
                                                "field" : "primary_name", 
                                                "query_name" :"Jo Shmoe", 
                                                "score_to_rescore_restriction":1.0, 
                                                "window_size_allowance":0.5 
                                        } 
                                } 
                        }, 
                        "query_weight" :0.0, 
                        "rescore_query_weight" :1.0 
                } 
        } 
}'

This query returns an RNI match score against "Joe Schmoe" in the "_score" field:

{ 
"_index":"rni-test", 
"_type":"record", 
"_id":"1", 
"_score":0.80217975, 
"_source":{ 
    "primary_name":"Joe Schmoe", 
    "aka":"Bossman", 
    "occupation":"business owner" 
    } 
}
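The full match-plus-rescore body shown above can also be assembled in code. The following is a minimal sketch, not a client supplied with the plugin: the function name rni_name_search is hypothetical, and the range checks restate the parameter limits listed in the rescoring section.

```python
import json


def rni_name_search(field, query_name, window_size=10,
                    score_to_rescore_restriction=None,
                    window_size_allowance=None):
    """Build a base match query plus an RNI name_score rescore block."""
    name_score = {"field": field, "query_name": query_name}
    if score_to_rescore_restriction is not None:
        if score_to_rescore_restriction < 0:
            raise ValueError("score_to_rescore_restriction cannot be negative")
        name_score["score_to_rescore_restriction"] = score_to_rescore_restriction
    if window_size_allowance is not None:
        if not 0 < window_size_allowance <= 1:
            raise ValueError("window_size_allowance must be in (0, 1]")
        name_score["window_size_allowance"] = window_size_allowance
    return {
        "query": {"match": {field: query_name}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {"function_score": {"name_score": name_score}},
                # Ignore the base query score; keep only the pairwise score.
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


body = rni_name_search("primary_name", "Jo Shmoe", window_size=200,
                       score_to_rescore_restriction=1.0,
                       window_size_allowance=0.5)
print(json.dumps(body, indent=2))
```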

3.6 Using Multivalued Name Fields

If the name field in your documents can contain multiple values, we recommend wrapping that field in a nested object. This allows for more accurate Elasticsearch queries.

The mapping should include a field of type "nested" that contains the name field.

"nested_names" :{ 
    "type" :"nested", 
    "properties" :{ 
        "name" :{ "type" :"rni_name" } 
    } 
}

Multiple names can be added to the nested field in an array.

{ 
    "nested_names" :[ 
        { 
            "name" :"Joe Smith" 
        }, 
        { 
            "name" :"Mike Schmoe" 
        }
    ] 
}

The queries also need to be nested. Make sure to set the "score_mode" to be "max".

{ 
    "query" :{ 
        "nested" :{ 
            "path" :"nested_names", 
            "query" :{ 
                "match" :{ 
                    "nested_names.name" :"Jo Shmoe" 
                } 
            } 
        } 
    }, 
    "rescore" :{ 
        "query" :{ 
            "rescore_query" :{ 
                "nested" :{ 
                    "path" :"nested_names", 
                    "score_mode" :"max", 
                    "query" :{ 
                        "function_score" :{ 
                            "name_score" :{ 
                                "field" :"nested_names.name", 
                                "query_name" :"Jo Shmoe" 
                            } 
                        } 
                    } 
                } 
            }, 
            "query_weight" :0.0, 
            "rescore_query_weight" :1.0 
        } 
    } 
}

Please see the Elasticsearch documentation⁷ for more detailed information on nested objects and queries.
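The nested query above repeats the path and field name in several places, so generating it from two inputs can reduce copy-and-paste mistakes. This is a sketch under that assumption. The helper name nested_name_search is not part of the plugin.

```python
import json


def nested_name_search(path, name_field, query_name):
    """Build the nested match query and nested name_score rescore shown above."""
    full_field = "%s.%s" % (path, name_field)
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match": {full_field: query_name}},
            }
        },
        "rescore": {
            "query": {
                "rescore_query": {
                    "nested": {
                        "path": path,
                        # Keep the best-scoring name among the nested values.
                        "score_mode": "max",
                        "query": {
                            "function_score": {
                                "name_score": {
                                    "field": full_field,
                                    "query_name": query_name,
                                }
                            }
                        },
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


body = nested_name_search("nested_names", "name", "Jo Shmoe")
print(json.dumps(body, indent=2))
```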

3.7 Data Fields

You can process fielded names by separating the fields with "|". Fields can be empty. We assign no explicit semantics to each field (such as given name or surname), but we do pay attention to the order of the fields when comparing two names that have fields. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in one name matches the second field in another name).

When scoring a potential match between a name with data fields and a name without data fields, RNI treats the name without data fields as if it were a name with one data field.

RNI treats trailing empty fields as if they were not present. For example "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".

Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.
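To make the trailing-empty-field rule concrete, the sketch below splits a fielded name on "|" and drops empty fields at the end while keeping empty fields in the middle. It only models the behavior described in this section and is not the plugin's own code. The function name split_fields is hypothetical.

```python
def split_fields(name):
    """Split a fielded name on "|" and drop trailing empty fields."""
    fields = [part.strip() for part in name.split("|")]
    # Trailing empty fields are treated as if they were not present.
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()
    return fields


print(split_fields("Rosanne|Taylor Smith|"))  # ['Rosanne', 'Taylor Smith']
print(split_fields("Rosanne||Smith"))         # ['Rosanne', '', 'Smith']
print(split_fields("Rosanne Taylor Smith"))   # ['Rosanne Taylor Smith']
```

The last call shows a name without data fields, which is handled as a name with a single field.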

3.8 Verifying RNI SDK Version

To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version

curl -XGET localhost:9200/rni-test/rni_plugin/_get_version 
{"rosette_sdk_version":"7.15.0"}

3.9 Interpreting RNI Scores

RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties). Scores less than 1.0 are returned for similar names, where the query name and index name vary with respect to one or more properties (such as language of origin) and one or more of the following:

Variation	Example(s)
Phonetic and/or spelling differences	Nayif Hawatmeh and Nayif Hawatma
Missing name components	Mohammad Salah and Mohammad Abd El-Hamid Salah
Rarity of a shared name component	Two English names that contain Ditters are more likely to match than two names that contain Smith
Initials	John F. Kennedy and John Fitzgerald Kennedy
Nicknames	Bobby Holguin and Robert Holguin
"Cousin" or cognate names	Pedro Calzon and Peter Calzon
Uppercase/Lowercase	Rosa Elena PACHECO and Rosa Elena Pacheco
Reordered name components	Zedong Mao and Mao Zedong
Variable Segmentation	Henry Van Dick and Henri VanDick, Robert Smith and Robert JohnSmyth
Corresponding name fields	For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox]
Truncation of name elements	For Sawyer, the similarity with Sawy is higher than the similarity with Sawi.

Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

4. Date Matching

The RNI-Elasticsearch plugin matches dates as a part of the name matching system. It returns a date match score, reflecting the similarity of two dates.

Similar to name matching, the process is to index dates in connection to the related name. Then query the date and name, and RNI-Elasticsearch returns the match score. For example, a person's name and their date of birth have separate match scores. Within your system, you can weight and combine the date and name match scores to determine the final match score.

4.1 Date Definition

A date contains a year, month, and day. You can write it in a variety of formats, such as December 30 1955, 30 Dec 1955, or 12/30/55. RNI-Elasticsearch will filter out non-date related words.

Omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12-.

4.2 Using Date Matching

4.2.1 Indexing Dates

1. Create an index.

curl -XPUT 'http://localhost:9200/rni-test'

2. Define a mapping for fields that will contain dates. The type for each of these fields is "rni_date".

curl -XPUT 'http://localhost:9200/rni-test/record/_mapping' -d '{ 
    "record" :{ 
        "properties" :{ 
            "birth_date" :{ "type" :"rni_date" }, 
            "primary_name" :{ "type" :"rni_name" } 
        } 
    } 
}'

Optionally, you can specify an Elasticsearch date format⁸. If you specify an ES date format that includes time, RNI-Elasticsearch ignores the time field. All dates must adhere to this format.

curl -XPUT 'http://localhost:9200/rni-test/record/_mapping' -d '{ 
    "record" :{ 
        "properties" :{ 
            "birth_date" :{ 
                "type" :"rni_date", 
                "format" :"MM-yyyy-dd" 
            }, 
            "primary_name" :{ 
                "type" :"rni_name" 
            } 
        } 
    } 
}'

3. Create documents that contain a date field.

curl -XPUT 'http://localhost:9200/rni-test/record/1' -d '{
    "primary_name" : "Joe Schmoe",
    "birth_date" : "07-1955-24"
}'
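Because every indexed date must follow the format declared in the mapping, it may be safer to format dates in code than by hand. The sketch below assumes the "MM-yyyy-dd" format from the mapping above and uses the equivalent Python strftime pattern. The constant and function names are illustrative.

```python
from datetime import date

# strftime equivalent of the "MM-yyyy-dd" mapping format used above.
RNI_DATE_FORMAT = "%m-%Y-%d"


def format_birth_date(d):
    """Render a date in the month-year-day layout declared in the mapping."""
    return d.strftime(RNI_DATE_FORMAT)


print(format_birth_date(date(1955, 7, 24)))  # 07-1955-24
```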

4.2.2 Querying Dates

There are many ways that you can incorporate date matching within your query. Here are two examples, one with date matching by itself, and one with date and name matching.

Base Query. The base query is a standard query against the date field. Refer to Query the Index.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "match" : {
            "birth_date" : "08-1955-25"
        }
    }
}'

RNI Rescore with Dates. Refer to Rescoring with RNI Pairwise Name Match.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "match" : { "birth_date" : "08-1955-25" }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "date_score" : {
                        "field" : "birth_date",
                        "query_date" : "08-1955-25"
                    }
                }
            },
            "query_weight" : 0
        }
    }
}'

The query returns a hit, with the RNI date match score.

"hits": {
    "total": 1,
    "max_score": 1.618923,
    "hits": [
        {
            "_index": "test",
            "_type": "record",
            "_id": "AVXMepnorGuybmuiQtQr",
            "_score": 0.8120856,
            "_source": {
                "primary_name": "Joe Schmoe",
                "birth_date": "07-1955-24"
            }
        }
    ]
}

The date match score is a measure of how similar the dates are. Similar dates have a stronger match and their date match score is closer to 1.

For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.

5. Record Matching

The RNI-Elasticsearch plugin includes a function that produces a single similarity score for documents containing multiple fields. The fields can be of type rni_name, rni_date, or any other Elasticsearch field type.

You can assign a different weight to each field and, if the built-in comparisons do not fit your data, supply a custom similarity function.

5.1 Using Record Matching

5.1.1 Indexing Records

1. Create an index with a mapping containing fields with different types.

curl -XPUT 'http://localhost:9200/rni-test' -d '{
    "mappings": {
        "record": {
            "properties": {
                "name" : { "type" : "rni_name" },
                "dob": { "type": "rni_date" },
                "height" : { "type" : "integer" },
                "nationality" : { "type" : "keyword" }
            }
        }
    }
}'

2. Index documents that contain those fields.

curl -XPUT 'http://localhost:9200/rni-test/record/1' -d '{
    "name": "Ryan McDonagh",
    "dob": "11/19/1987",
    "nationality": "USA",
    "height": 65
}'

5.1.2 Querying Records

The query can also be a record containing multiple fields. The query record has to have its fields mapped to those of the indexed documents if they aren't already.

Base Query. The base query should be a standard Elasticsearch query against multiple fields that will return good candidates for rescoring.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } }
            ]
        }
    }
}'

RNI Rescore with Records. Use the 'doc_score' function to score the indexed documents against a query record.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { "query_value": "Brian McDonough" },
                            "dob": { "query_value": "10/19/87" },
                            "height": { "query_value": 67 },
                            "nationality": { "query_value": "CANADA" }
                        }
                    }
                }
            },
            "query_weight" : 0
        }
    }
}'

Additionally, each field can be given a weight to reflect its importance in the overall matching logic.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { "query_value": "Brian McDonough", "weight": 4 },
                            "dob": { "query_value": "10/19/87", "weight": 2 },
                            "height": { "query_value": 67, "weight": 0.5 },
                            "nationality": { "query_value": "CANADA", "weight": 1 }
                        }
                    }
                }
            },
            "query_weight" : 0
        }
    }
}'
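The "rescore" portion of the weighted request can be generated from a simple table of query values and weights. The following is a sketch of that idea. The helper name doc_score_rescore and its input shape are assumptions made for this example, not part of the plugin.

```python
import json


def doc_score_rescore(fields):
    """Build the "rescore" value for a doc_score query.

    fields maps a field name to a (query_value, weight) pair.
    A weight of None leaves the "weight" key out of the request.
    """
    spec = {}
    for name, (value, weight) in fields.items():
        entry = {"query_value": value}
        if weight is not None:
            entry["weight"] = weight
        spec[name] = entry
    return {
        "query": {
            "rescore_query": {"function_score": {"doc_score": {"fields": spec}}},
            "query_weight": 0,
        }
    }


rescore = doc_score_rescore({
    "name": ("Brian McDonough", 4),
    "dob": ("10/19/87", 2),
    "height": (67, 0.5),
    "nationality": ("CANADA", 1),
})
print(json.dumps({"rescore": rescore}, indent=2))
```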

By default, if a queried-for field is null in the index, the field is removed from the score calculation, and the weights of the other fields are redistributed. However, you can override this behavior by using the score_if_null option to specify what score should be returned for this field if it is null in the index document.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { 
                                "query_value": "Brian McDonough",
                                "weight": 4,
                                "score_if_null" : 0.0
                                },
                            "dob": {
                                "query_value": "10/19/87",
                                "weight": 2
                                },
                            "height": {
                                "query_value": 67,
                                "weight": 0.5
                                },
                            "nationality": {
                                "query_value": "CANADA",
                                "weight": 1 , 
                                "score_if_null" : 1.0 
                                }
                        }
                    }
                }
            },
            "query_weight" : 0
        }
    }
}'

While the doc_score function has built-in similarity functions for many core field types, a custom similarity function can be provided at query time. In this manufactured example, we'll use a simple 'script_score' function that matches 'CANADA' and 'USA' with a high score. Refer to the Elasticsearch documentation for more details about Elasticsearch scripting⁹. Any other 'function_score' function¹⁰ can also be used.

curl 'http://localhost:9200/rni-test/record/_search' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { "query_value": "Brian McDonough", "weight": 4 },
                            "dob": { "query_value": "10/19/87", "weight": 2 },
                            "height": { "query_value": 67, "weight": 0.5 },
                            "nationality": {
                                "function": {
                                    "function_score": {
                                        "script_score": {
                                            "script": {
                                                "lang": "painless",
                                                "params": {
                                                    "query_value": "CANADA"
                                                },
                                                "inline": "if (params.query_value == '\''CANADA'\'' && doc['\''nationality'\''].value == '\''USA'\'') {return 0.8} else {return 0.2}"
                                            }
                                        }
                                    }
                               }, "weight": 1
                            }
                        }
                    }
                }
            },
            "query_weight" : 0
        }
    }
}'

Note

The single quotes in the above script are escaped ('\'') so it can be used in a curl command.

The document score is calculated by performing a weighted arithmetic mean over the similarity scores of each field. If a query field is missing from an indexed record, then that field is ignored and its weight evenly distributed across the other fields.
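The arithmetic can be sketched as follows. This is one reading of the description above, not the plugin's implementation: a field with no score and no score_if_null is dropped, its weight is split evenly among the remaining fields, and the result is a weighted arithmetic mean. The type and function names, and the per-field scores, are invented for the example.

```python
from typing import Dict, NamedTuple, Optional


class FieldResult(NamedTuple):
    weight: float
    score: Optional[float]                  # None when the field is null in the indexed document
    score_if_null: Optional[float] = None   # optional override for the null case


def combine(fields: Dict[str, FieldResult]) -> float:
    """Weighted arithmetic mean with even redistribution of dropped weights."""
    scored = []
    dropped_weight = 0.0
    for result in fields.values():
        score = result.score if result.score is not None else result.score_if_null
        if score is None:
            dropped_weight += result.weight   # field is ignored
        else:
            scored.append((result.weight, score))
    if not scored:
        return 0.0
    bonus = dropped_weight / len(scored)      # even share for each remaining field
    total_weight = sum(w + bonus for w, _ in scored)
    return sum((w + bonus) * s for w, s in scored) / total_weight


fields = {
    "name": FieldResult(weight=4, score=0.8),
    "dob": FieldResult(weight=2, score=0.5),
    "height": FieldResult(weight=0.5, score=None),
}
print(round(combine(fields), 4))
```

Setting score_if_null on "height" would keep the field in the calculation with that fixed score instead of dropping it.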

5.2 Supported Field Types

The 'doc_score' function has default support for rni_name, rni_date, and many of the Elasticsearch core field types. All default similarity scores are between 0.0 and 1.0.

Field Type(s)	Default Similarity Function	Example(s)
rni_name	name_score (refer to RNI pairwise match score)	'John David Smith' vs 'Jon D Smith' = 0.88
rni_date, date	date_score (refer to RNI pairwise date score)	'2010-11-4' vs '2010-5-11' = 0.92
keyword, text, string	Normalized edit distance	'37 Congress St.' vs '35 Congres St.' = 0.875
integer, long, short, double, float	Normalized difference (e.g., percentage)	'65' vs '59' = 0.908
boolean	Equality	'true' vs 'true' = 1.0, 'true' vs 'false' = 0.0
geo_point	Log function over Haversine distance	'[lat=42.361145, lon=-71.057083]' vs '[lat=42.3736, lon=-71.1097]' = 0.83
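The exact formulas behind these defaults are not given here. As an illustration only, the sketch below shows one normalized-difference formula that happens to reproduce the numeric example in the table ('65' vs '59'), together with the equality rule for booleans. Both functions are assumptions and may differ from what the plugin computes.

```python
def numeric_similarity(a, b):
    """Assumed formula: 1 minus the absolute difference relative to the larger magnitude."""
    if a == b:
        return 1.0
    similarity = 1.0 - abs(a - b) / max(abs(a), abs(b))
    return max(0.0, similarity)  # clamp so values of opposite sign do not go negative


def boolean_similarity(a, b):
    """Equality: 1.0 when the values agree, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


print(round(numeric_similarity(65, 59), 3))   # 0.908
print(boolean_similarity(True, False))        # 0.0
```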

6. User Configurable Features

There are many ways to configure RNI to better fit your use case and data.

6.1 Stop Patterns and Stopword Prefixes

Stop patterns and stopword prefixes strip matching name elements during indexing and queries. The stripping of prefixes (string literals) can be performed more quickly than the application of stop patterns (regular expressions), so you can rely on stopword prefixes for the efficient removal of prefixes, such as titles, that you do not want to include in name matching.

For each name, RNI first performs character-level normalization: it strips punctuation (except periods, commas, and hyphens), reduces whitespace to single spaces, and lowercases characters. RNI then cycles through the stop patterns followed by the stopwords. After each cycle, it drops the patterns and stopwords that stripped nothing, and it repeats until the list of stop patterns and stopwords is empty.

Stop Pattern. A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java 1.8 java.util.regex.Pattern; see the Javadoc for detailed documentation.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt¹¹

where LANG is a three-letter language code. Each row in the file, with the exception of rows that begin with #,¹² is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at beginning and end where needed.

Elements in the names to be processed that match any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[- ]general stop pattern is applied where applicable when general is also a stop pattern.
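The longest-first ordering can be illustrated with a small sketch. It uses Python's re module rather than java.util.regex, orders patterns by the length of the pattern text (an assumption about what "longer" means), and makes a single pass instead of the repeated cycles described earlier. The function name strip_stop_patterns is hypothetical.

```python
import re


def strip_stop_patterns(name, patterns):
    """Remove stop-pattern matches from a name, longest pattern first."""
    # Simplified normalization: lowercase and collapse whitespace.
    text = re.sub(r"\s+", " ", name.strip().lower())
    for pattern in sorted(patterns, key=len, reverse=True):
        text = re.sub(pattern, "", text)
    # Collapse whitespace left behind by removed elements.
    return re.sub(r"\s+", " ", text).strip()


patterns = ["general", r"brigadier[- ]general"]
print(strip_stop_patterns("Brigadier-General John Smith", patterns))  # john smith
```

If the shorter pattern were applied first, only "general" would be removed and "brigadier-" would be left behind in the name.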

The SDK includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries

^fnu\d
^lnu\d

indicate that the common indicators for first-name-unknown and last-name-unknown followed by nothing are to be removed.

You can also specify which field the regex is to be applied to when processing a fielded name. Simply add Tab n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,

^lnu\d 2
^lnu\d 3

indicates that the regex is to be applied to fields 2 and 3 in fielded names.

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.

Use of complex patterns may increase processing time. When possible, use stopword prefixes.

Stopword Prefixes. A stopword prefix is a string literal that strips the matching prefix from name elements during indexing and queries.

Stopword prefixes for a given language are specified in a UTF-8 file with the ISO 639 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt¹¹

where LANG is a three-letter language code. Each row in the file, with the exception of rows that begin with #,¹² is a string literal.

Prefixes in the names to be processed that match any of these string literals are removed.

Like stop patterns, longer stopword prefixes take precedence over shorter prefixes that the longer stopword contains. For example, the lieutenant colonel stopword prefix is applied where applicable when colonel is also a stopword prefix.
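A literal-prefix version of the same idea is sketched below. It assumes that a prefix must be followed by a space to count as a match and that stripping repeats until no prefix applies. Both are simplifications made for this example rather than documented plugin behavior, and the function name is hypothetical.

```python
def strip_stopword_prefixes(name, prefixes):
    """Repeatedly strip the longest matching literal prefix from a name."""
    text = " ".join(name.lower().split())      # lowercase, single spaces
    ordered = sorted(prefixes, key=len, reverse=True)
    stripped = True
    while stripped:
        stripped = False
        for prefix in ordered:
            # Assumption: the prefix must end at a token boundary.
            if text.startswith(prefix + " "):
                text = text[len(prefix):].lstrip()
                stripped = True
                break
    return text


prefixes = ["colonel", "lieutenant colonel", "dr.", "prof."]
print(strip_stopword_prefixes("Lieutenant Colonel James Kirk", prefixes))  # james kirk
print(strip_stopword_prefixes("Dr. Prof. Jane Doe", prefixes))             # jane doe
```

Plain string comparison like this is the reason prefixes are cheaper to apply than regular-expression stop patterns.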

The SDK includes files with generic stopword prefixes for names in English and Spanish. These files are in bt_root/rlpnc/data/rnm/ref/override: stopprefixes_eng.txt and stopprefixes_spa.txt. You can modify the contents of these files. To add stopword prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stopword prefixes for use with Russian text.

6.2 Overriding Name Pair Matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses ISO 639 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt¹¹

where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name. Each row in the file, with the exception of rows that begin with #, is a tab-delimited full-name pair and score:

query_name Tab index_name Tab score

The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.¹³

The SDK includes a sample file with sample entries commented out: bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,

John Doe Joe Bloggs 1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example, the following entries could appear in fullnames_jpn_eng.txt:

外山恒 Toyama Koichi 1.0 
ヒラリークリントン Hillary Clinton 1.0
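
As a rough illustration of the row format and of the commutative lookup described above, the following Python sketch parses override rows and looks up a pair in either order. It is a sketch under stated assumptions, not RNI's implementation: the function names load_fullname_overrides and override_score are hypothetical, and the unordered-pair key is just one way to model commutativity.

```python
def load_fullname_overrides(lines):
    """Parse rows of a fullnames_LANG1_LANG2[_TYPE].txt file.

    Each non-comment row is: query_name <Tab> index_name <Tab> score.
    Returns a dict keyed by an unordered pair so lookups are symmetric.
    """
    overrides = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        query_name, index_name, score_text = line.rstrip("\n").split("\t")
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score must be between 0 and 1.0: {line!r}")
        overrides[frozenset((query_name.strip(), index_name.strip()))] = score
    return overrides


def override_score(overrides, name_a, name_b):
    """Return the override score for a pair in either order, or None."""
    return overrides.get(frozenset((name_a, name_b)))


rows = [
    "# query_name<Tab>index_name<Tab>score",
    "John Doe\tJoe Bloggs\t1.0",
]
overrides = load_fullname_overrides(rows)
print(override_score(overrides, "John Doe", "Joe Bloggs"))
print(override_score(overrides, "Joe Bloggs", "John Doe"))
print(override_score(overrides, "John Doe", "Jane Roe"))
```

Validating the score range while loading surfaces malformed rows early, before they can affect matching.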

6.3 Overriding Token Pair Matches

You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, and Spanish-Spanish token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. Tokens cannot contain whitespace. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

tokens_LANG1_LANG2[_TYPE].txt¹¹

where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, with the exception of rows that begin with #, is a tab-delimited token pair and may include an indicator that at least one of the tokens is a nickname or that the tokens are cognates:

Token1 Tab Token2 [Tab NICKNAME or COGNATE or VARIANT]

If you would like to prevent a token pair from matching, use the SUPPRESS indicator.

If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.

The SDK includes bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:

Peter Pete NICKNAME 
Peter Pedro COGNATE

This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.

When you create an additional file in the same location, use the ISO639 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.

We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
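
The row format described in this section can be checked with a small parser before a file is deployed. The Python sketch below is illustrative only: parse_token_pair is a hypothetical helper, not part of RNI, and it encodes just the rules stated above (tab-delimited fields, no whitespace inside tokens, NICKNAME assumed when no indicator is present).

```python
INDICATORS = {"NICKNAME", "COGNATE", "VARIANT", "SUPPRESS"}


def parse_token_pair(row):
    """Parse one row of a tokens_LANG1_LANG2[_TYPE].txt file.

    Returns (token1, token2, indicator), or None for blank and comment rows.
    """
    if not row.strip() or row.startswith("#"):
        return None
    fields = [field.strip() for field in row.strip().split("\t")]
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited fields: {row!r}")
    token1, token2 = fields[0], fields[1]
    for token in (token1, token2):
        # Tokens cannot contain whitespace.
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"invalid token in row: {row!r}")
    # NICKNAME is assumed when the indicator is omitted.
    indicator = fields[2] if len(fields) == 3 else "NICKNAME"
    if indicator not in INDICATORS:
        raise ValueError(f"unknown indicator: {indicator!r}")
    return token1, token2, indicator


print(parse_token_pair("Peter\tPete\tNICKNAME"))
print(parse_token_pair("Peter\tPedro\tCOGNATE"))
print(parse_token_pair("Abigail\tAbby"))
```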

6.4 Normalizing Token Variants

You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:

equivalenceclasses_LANG[_TYPE].txt

For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.

Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:

[normal_form1] 
variant1_1 
variant1_2 
variant1_3 
[normal_form2] 
variant2_1 
variant2_2 
variant2_3

The SDK includes bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:

[muhammad] 
mohammed 
mahamed 
mohamed 
mohamad 
mohammad 
muhammed 
muhamed 
muhammet 
muhamet 
md 
mohd 
muhd

You can add lists of variants to this file, including the normalized form in square brackets to start each list.
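
To make the file layout concrete, here is a minimal Python sketch that reads the bracketed format into a variant-to-normal-form lookup. It is an illustration, not RNI's implementation: load_equivalence_classes and normalize_token are hypothetical names, and the case-insensitive lookup is an assumption that this guide does not state.

```python
def load_equivalence_classes(lines):
    """Map each variant (and each normal form) to its normalized form."""
    normal_form = None
    mapping = {}
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith("[") and entry.endswith("]"):
            # A bracketed row starts a new list of variants.
            normal_form = entry[1:-1].strip()
            mapping[normal_form] = normal_form
        elif normal_form is None:
            raise ValueError(f"variant before any [normal_form]: {entry!r}")
        else:
            mapping[entry] = normal_form
    return mapping


def normalize_token(token, mapping):
    """Return the normalized form of a token; unknown tokens pass through.

    Assumption: lookups are case-insensitive (not stated in the guide).
    """
    return mapping.get(token.lower(), token)


rows = ["[muhammad]", "mohammed", "mahamed", "md", "mohd", "muhd"]
mapping = load_equivalence_classes(rows)
print(normalize_token("Mohammed", mapping))
print(normalize_token("mohd", mapping))
print(normalize_token("Ali", mapping))
```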

6.5 Unimportant Tokens

You can edit the list of tokens that are given low influence in RNI. These low-weight tokens are parts of a name (such as suffixes) that contribute little to name matching accuracy.

The file name is lowWeightTokens_LANG.txt.

For example, bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".

6.6 Additional Configuration

For further configuration options, such as custom language model training and tuning name match parameters, please contact [email protected]

6.7 Date Match Parameters

As with the name matching parameters, there is a series of date matching parameters. The parameter values can be edited in the parameter_defs.yaml file, located in plugins/rni/bt_root/rlpnc/data/etc.

tryDayMonthSwap. On by default. This parameter attempts to correct for parsing errors by swapping the day and month. Turn it off if you want RNI-Elasticsearch to match the dates exactly.
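
The idea behind a day/month swap can be sketched in Python as follows. This is only a guess at the general technique, not a description of how RNI-Elasticsearch implements tryDayMonthSwap: the helper names day_month_swapped and candidate_dates are hypothetical, and how RNI scores a swapped candidate is not covered here.

```python
from datetime import date


def day_month_swapped(d):
    """Return d with day and month exchanged, or None if that is not a valid date."""
    try:
        return date(d.year, d.day, d.month)
    except ValueError:
        # For example, day 31 cannot be used as a month.
        return None


def candidate_dates(d, try_day_month_swap=True):
    """List the readings of a date to compare: as parsed, plus the swapped form."""
    candidates = [d]
    if try_day_month_swap:
        swapped = day_month_swapped(d)
        if swapped is not None and swapped != d:
            candidates.append(swapped)
    return candidates


print(candidate_dates(date(2000, 3, 12)))
print(candidate_dates(date(2000, 1, 31)))
print(candidate_dates(date(2000, 3, 12), try_day_month_swap=False))
```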

6.7.1 Weight of score elements

There are multiple elements that affect the final date match score. Each element returns its own score, which is then weighted to influence the final score. The sum of the weights is normalized to 1.

Element	Description	Example(s)
Time Distance	Time between the dates.	January 1 2000 and December 31 1999 have a high time distance score.
Year Distance	Time between the years.	1990 and 1980 have a low year distance score.
Month Distance	Numeric difference between the months.	January 20 2000 and January 3 1950 have a high month distance score. January 1 2000 and December 31 1999 have a low month distance score.
Day Distance	Numeric difference between the days.	January 1 2000 and September 1 1950 have a high day distance score. January 1 2000 and December 31 1999 have a low day distance score.
String Distance	Accounts for digit swaps and errors in data.	03-12-1909 and 03-21-1990 have a relatively high string distance score.

RNI-Elasticsearch combines the weighted match scores from each of the elements to produce the final date match score.
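
The weighting scheme can be sketched as a normalized weighted sum. In the Python example below, the function name combine_date_scores, the element keys, and every weight and score value are hypothetical placeholders chosen for illustration; the actual parameter names and defaults in parameter_defs.yaml are not shown in this section.

```python
def combine_date_scores(element_scores, weights):
    """Combine per-element scores using weights normalized to sum to 1."""
    total = sum(weights[name] for name in element_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(
        score * weights[name] / total for name, score in element_scores.items()
    )


# Hypothetical weights and element scores, for illustration only.
weights = {"time": 2.0, "year": 1.0, "month": 1.0, "day": 1.0, "string": 1.0}
scores = {"time": 0.9, "year": 0.8, "month": 0.2, "day": 0.1, "string": 0.6}
print(round(combine_date_scores(scores, weights), 4))
```

Because the weights are divided by their sum, only their ratios matter: doubling every weight leaves the final score unchanged.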

Footnotes:

¹. Copyright © 2017 by Elasticsearch BV. Licensed under The Apache License Version 2.0. ↩

². The Java-only version of the plugin supports only English, French, German, Italian, Portuguese, and Spanish. ↩

³. For RNI plugins that support earlier versions of Elasticsearch (such as 1.x.y), contact [email protected]. ↩

⁴. https://www.elastic.co/downloads/elasticsearch ↩

⁵. http://curl.haxx.se/ ↩

⁶. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html ↩

⁷. https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html ↩

⁸. https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html ↩

⁹. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html ↩

¹⁰. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#score-functions ↩

¹¹. Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type. ↩

¹². # may also be used after an entry on the same line to begin a comment. ↩

¹³. Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name Match operations, on the other hand, will return an override score of 0. ↩
