Fuzzy matching applied to chant references
December 06, 2020 -In this post, I want to share about my journey in developing and open-sourcing a side project I've been working on for the past few months. Read on to learn more about:
- The problem domain of finding references for Gregorian chants
- What fuzzy matching is and why it was a good fit for this problem
Problem
For a recent side project, I was tasked with finding matches between records of two text corpuses (specifically, between Gregorian chants and Latin Vulgate Scripture references). By nature, matches would generally not be identical, as a chant could include only part of a verse, and even direct quotes would often differ slightly based on translation or encoding.
To contextualize the scope of the problem, here are a few quick stats:
- There are approximately 36,000 references in the Latin Vulgate Bible.
- The number of chants to match varies by input corpus, but the number of chants in a corpus commonly reaches multiple thousands.
Fuzzy matching
What the situation called for is known more generally as approximate string matching, a term that aptly describes the body of techniques that perform fuzzy searches when matching strings against a pattern.
Among the numerous techniques available, I settled on FuzzyWuzzy, a library of approximate string matching functions which was written originally in Python and has ports in many common languages. A good explanation of the types of matching algorithms available is in this blog post, but I'll summarize the relevant details in the sections below.
Levenshtein distance
The fuzzywuzzy
library builds abstractions on top of a core sequence similarity measure first
developed by Levenshtein.
The Levenshtein distance of two strings is the minimum number of edits (insertions, deletions, or replacements) which must be processed for two strings to be made equivalent.
A simple example is below:
let s1 = String::from("hello");
let s2 = String::from("hallo");
assert_eq!(lev(s1, s2), 1);
Simple ratio
The first abstraction built on top of Levenshtein distance is a simple similarity ratio which represents how similar two strings are, relative to their total length. The ratio ranges from 0 (totally dissimilar) to 100 (an exact match).
A simple implementation is below:
fn simple_ratio<T: PartialEq>(s1: &Vec<T>, s2: &Vec<T>) -> u8 {
if s1.len() == 0 || s2.len() == 0 {
return 0;
}
let length = (s1.len() + s2.len()) as f32;
let distance = lev(s1, s2) as f32;
let ratio = (length - distance) / length;
((100.0 * ratio) as u8)
}
// example:
let s1 = String::from("hello");
let s2 = String::from("hallo");
assert_eq!(simple_ratio(s1, s2), 90);
Partial ratio
While simple_ratio
can be helpful when the goal is to differentiate between close matches and poor
matches, it fails to accurately capture similarity when two strings are of greatly dissimilar
length, as in the example below:
let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(simple_ratio(s1, s2), 54);
Here, the simple_ratio
measure is quite poor despite the obvious similarity, in which the chant is
a direct reference of a portion of the entire verse.
This is the problem partial_ratio
is designed to solve.
For two sequences
s1
ands2
, wherem = s1.len()
,n = s2.len()
, andm <= n
,partial_ratio(s1, s2)
represents the best-possiblesimple_ratio
measure betweens1
and slices ofs2
which have the same length ass1
.
A simple (but slightly inefficient) implementation is below:
fn partial_ratio<T: PartialEq>(s1: &Vec<T>, s2: &Vec<T>) -> u8 {
if s1.len() == 0 || s2.len() == 0 {
return 0;
}
let (shorter, longer) = if s1.len() <= s2.len() {
(s1, s2)
} else {
(s2, s1)
};
let mut min_distance = std::usize::MAX;
for i in 0..longer.len() - shorter.len() {
let distance = distance_from(shorter, block);
if distance < min_distance {
min_distance = distance;
}
}
}
let length = (shorter.len() * 2) as f32;
let ratio = (length - min_distance as f32) / length;
(ratio * 100.0).round() as u8
}
// example
let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(partial_ratio(s1, s2), 100);
Other ratios
The original fuzzywuzzy
library supports a couple other ratios as well: token_sort_ratio
and
token_set_ratio
. As their name implies, they split each input sequence into tokens before
computing ratios, which can be helpful if sequences with matching but out-of-order words should be
still-considered high-quality matches.
For the chants
matching problem domain, the most appropriate choice seemed to be partial_ratio
,
as order of words is important, but partial references are common.
Conclusion
The chants
utility I developed was mostly a wrapper of the
partial_ratio
algorithm described above with some convenient options for interacting with it on
the commandline, including:
- Downloading the Latin Vulgate texts to a local directory
- Parsing references and chants from input files and evaluating matches
- Allowing users to specify the threshold ratio for matches and returning matches which exceed this.
In future posts, I'll cover more of the journey in open-sourcing this project, with a focus on:
- Evaluating existing implemtations of fuzzy matching
- How some optimizations led to performance improvements of 25x!