Fuzzy Entity
Matching
The details of the
matching algorithms can be found from my earlier posts. For an entity, various
attribute types are supported including integer, double, categorical, text,
time, location etc. Essentially for any pair of entities,
distance is calculated between corresponding attributes. Attribute wise
distances are aggregated over all the attributes of an entity to find the
distance between two entities.
Why is Jaccard
distance not appropriate for this use case? Given two blocks of text, Jaccard
similarity depends on the number of words common to both and the union of
words in the the two text blocks. Two words will not match whether there is
difference in one character or multiple characters. Levenshtein Distance gives us more fine grained text
matching algorithm.
Levneshtein
Distance
Given two
words, it’s defined as the number inserts and deletes of characters necessary
to make one word same as the other. The Levenshtein Distance
between two words w1 and w2 is calculated as follows
distance(w1,w2) = length(w1) + length(w2) – 2 * maxCommonSubSeqLength(w1,w2)
where
maxCommonSubSeqLength is the length of longest common sub sequence between w1 and w2.
where
maxCommonSubSeqLength is the length of longest common sub sequence between w1 and w2.
The distance is normalized, by
dividing the distance with sum of the length of the two words as below, forcing
the normalized distance to be always be between 0 and 1.0.
normDistance(w1,w2) = distance(w1,w2) / (length(w1) + length(w2))
For a pair
of text blocks, normalized distance is calculated between corresponding
word pairs. The final distance between two text fields is the distance averaged
over all the words. The implementation is the class EditDistanceSimilarity. Various
attribute distance algorithms are supported e.g., euclidian, manhattan etc.
This is not the
only way to calculate Levinshtein Distance. It is also
possible to take the whole text field as one token and calculate distance by
setting the parameter edit.dist.token to false. I have one word of caution. The
edit distance calculation computation time grows non linearly with the token
length. Treating
the whole text field as one token may be very computationally intensive.
Similarity
Map Reduce
The class SameTypeSimilarity contains the implementation. The
details of it’s inner workings can be found in my earlier posts. Essentially,
it does a hash based self join and works as follows
1.
We configure a set of buckets
2.
Each record is hashed into one of the buckets
3.
A bucket pair and the associated records is processed in one
reducer call.
4.
The reducer pairs records from each bucket in a nested loop and
calculates distances
It turns out, that
the proper configuration of the number of buckets is critical for this example.
The number of buckets should be large, so that each reducer call does not have
to process too many records. The number of buckets is set using the parameterbucket.count. While running
the map reduce, I had the Hadoop job tracker terminating a reducer task because
of the heart beat time out, until I increased the bucket count.
As thumb
rule, I would suggest choosing the number of buckets such that each bucket has
about 4 or 5 records. Otherwise you may have the unfortunate experience of
reducer taking too much processing time in one call and eventually getting
timed out terminated.
In this Map Reduce,
every record is compared with every other record. So the complexity is O(n x
n). However, for our particular example, if we had two data sets that are
supposed to be identical, a simpler approach could be taken. We
could simply compare corresponding records from each set. The complexity in
that case would have been O(n).
Attribute
Distance Threshold
One way to optimize
the problem and to convert the complexity to a almost O(n) is toabandon distance processing between two
records, as soon as the distance between an attribute pair is found
to be above a predefined threshold.
For example, if the the
distance between the name fields for two customer records is significant
enough, we can skip processing the remaining attributes and set the distance
between the two records to a large value.
The threshold value is defined
in the meta data JSON file as below. The threshold can be set for any number of
attributes. This is how it’s set for the name field of a customer record.
{
"name" : "name",
"ordinal" : 1,
"dataType" : "text",
"distThreshold" : 0.4
}
Two Near
Identical Data Sets
If you have
two data sets where the records are supposed to identical, except for small
differences, the processing can be performed in O(n) time. This is optimization is done by
enabling inter set matching as below.
inter.set.matching=true
set.ID.size=1
The first parameter enables
inter set matching. For this to work, the entity ID of each record needs to be
encoded in a special way. The first few characters of the entity ID is the set
ID and the remaining characters comprise the real entity ID within a set. The
length of the set ID is defined by the second parameter above.
Essentially,
matching is performed between records for the two sets only if entity ID after
peeling off the set ID matches. So, we are finding distances only
between corresponding entities from the two data sets.
Duplicate
Customer Records
We will be using customers
records as example. The record has the the following fields. The first field
which is the ID, does not enter into distance calculation in any way.
1.
ID
2.
Name
3.
Address
4.
City and state
5.
Phone number
The data consists of two sets
of identical customer records. For some records, I introduced small
typographical errors. Here is some sample input
106379,Richard Gomez,2934 Encino Pt,San Antonio TX 78259,210 4811932
158280,Jim Dobbs,2756 Bulls Bay Hwy,Jacksonville FL 32220,312 2850284
137943,Robert Lewis,194 Buckboard Dr,Augusta GA 30907,406 5404029
125849,Jena Marin,276 Durham St,Menlo Park CA 94025,650 2957406
156290,Dharam Patel,84 Prospect Hill Dr,Tewksbury MA 01876,718 4702915
.........
206379,Richard Gomez,2935 Encino Pt,San Antonio TX 78259,210 4811932
258280,Jim Dobbs,2756 Bulls Bay Hwy,Jacksonville FL 32220,312 2850284
237943,Robert Lewis,194 Buckboard Dr,Augusta GA 30908,406 5404029
225849,Jena Marin,276 Durham St,Menlo Park CA 94025,650 2957406
The output MR class
SameTypeSimilarity consists of ID pair followed by the distance. There there
types of output as follows
1.
Distance is 0 for identical records
2.
Distance is close to 0 for records that are identical except for
small typographical errors
3.
Different records with high distance value
The second type is
of main interest to us. Because of the fuzzy matching logic, we are
able identify records as duplicate records in spite of small typographical
errors. Here some
sample output. The last column is the distance scaled by 1000.
106379,278253,757
106379,206379,16
178253,278253,0
178253,206379,757
278253,206379,757
168295,185062,612
We see all 3 types of output as
described earlier. Let’s pay attention to the records with ID 106379 and 206379
with the distance between them as 16. It’s the case of near duplicate records
with small typographical errors. Here are the two corresponding records. There
is one wrong character in the address field.
106379,Richard Gomez,2934 Encino Pt,San Antonio TX 78259,210 4811932
206379,Richard Gomez,2935 Encino Pt,San Antonio TX 78259,210 4811932
That one misplaced character
has caused the distance between the records to be 16 instead of 0. Otherwise
the distance would have been 0.
Attribute
Distance Aggregation
The attribute
distances are aggregated for all attributes. There are different
aggregation algorithms. For this example, I have used mahattan distance a.k.a L1 distance.
All attributes are
treated equally. However it is possible to assign different
attribute weights through the meta data. For example, if I had specified a
weight greater than (e.g. 1.2) for the phone number field, it would have shrunk
the distance. In other works, the distance between two phone numbers are made
to appear less than the actual distance.