# Fuzzy matching applied to chant references

December 06, 2020

In this post, I want to share my journey developing and open-sourcing a side project I've been working on for the past few months. Read on to learn more about:

- The problem domain of finding references for Gregorian chants
- What fuzzy matching is and why it was a good fit for this problem

## Problem

For a recent side project, I was tasked with finding matches between records of two text corpora (specifically, between Gregorian chants and Latin Vulgate Scripture references). By nature, matches would generally not be identical, as a chant could include only part of a verse, and even direct quotes would often differ slightly based on translation or encoding.

To contextualize the scope of the problem, here are a few quick stats:

- There are approximately 36,000 references in the Latin Vulgate Bible.
- The number of chants to match varies by input corpus, but the number of chants in a corpus commonly reaches multiple thousands.

## Fuzzy matching

What the situation called for is known more generally as approximate string matching, a term that aptly describes the body of techniques that perform fuzzy searches when matching strings against a pattern.

Among the numerous techniques available, I settled on FuzzyWuzzy, a library of approximate string matching functions originally written in Python and ported to many common languages. A good explanation of the available matching algorithms is in this blog post, but I'll summarize the relevant details in the sections below.

### Levenshtein distance

The `fuzzywuzzy` library builds abstractions on top of a core sequence similarity measure first developed by Levenshtein.

The Levenshtein distance between two strings is the minimum number of edits (insertions, deletions, or substitutions) required to transform one string into the other.

A simple example is below:

```rust
let s1 = String::from("hello");
let s2 = String::from("hallo");
// one substitution: 'e' -> 'a'
assert_eq!(lev(s1.as_bytes(), s2.as_bytes()), 1);
```
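The examples in this post call a `lev` helper without defining it; a minimal sketch of such a function, using the standard dynamic-programming formulation over slices, might look like:

```rust
// Minimal sketch of a `lev` helper: classic dynamic-programming
// Levenshtein distance, keeping only one previous row in memory.
fn lev<T: PartialEq>(s1: &[T], s2: &[T]) -> usize {
    let n = s2.len();
    // prev[j] holds the distance between the first i-1 items of s1
    // and the first j items of s2.
    let mut prev: Vec<usize> = (0..=n).collect();
    for i in 1..=s1.len() {
        let mut curr = vec![i; n + 1];
        for j in 1..=n {
            let cost = if s1[i - 1] == s2[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1)       // deletion
                .min(curr[j - 1] + 1)     // insertion
                .min(prev[j - 1] + cost); // substitution
        }
        prev = curr;
    }
    prev[n]
}
```

Operating on slices (here, bytes via `as_bytes()`) keeps the helper generic over any `PartialEq` element type, which the ratio functions below also rely on.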

### Simple ratio

The first abstraction built on top of *Levenshtein distance* is a simple similarity ratio which
represents how similar two strings are, relative to their total length. The ratio ranges from 0
(totally dissimilar) to 100 (an exact match).

A simple implementation is below:

```rust
fn simple_ratio<T: PartialEq>(s1: &[T], s2: &[T]) -> u8 {
    if s1.is_empty() || s2.is_empty() {
        return 0;
    }
    // Scale the edit distance by the combined length of the inputs.
    let length = (s1.len() + s2.len()) as f32;
    let distance = lev(s1, s2) as f32;
    let ratio = (length - distance) / length;
    (100.0 * ratio) as u8
}

// example:
let s1 = String::from("hello");
let s2 = String::from("hallo");
assert_eq!(simple_ratio(s1.as_bytes(), s2.as_bytes()), 90);
```

### Partial ratio

While `simple_ratio` can be helpful when the goal is to differentiate between close matches and poor matches, it fails to accurately capture similarity when two strings are of greatly dissimilar length, as in the example below:

```rust
let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(simple_ratio(s1.as_bytes(), s2.as_bytes()), 54);
```

Here, the `simple_ratio` measure is quite poor despite the obvious similarity: the chant is a direct quote of a portion of the verse.

This is the problem `partial_ratio` is designed to solve.

For two sequences `s1` and `s2`, where `m = s1.len()`, `n = s2.len()`, and `m <= n`, `partial_ratio(s1, s2)` represents the best possible `simple_ratio` measure between `s1` and the slices of `s2` that have the same length as `s1`.

A simple (but slightly inefficient) implementation is below:

```rust
fn partial_ratio<T: PartialEq>(s1: &[T], s2: &[T]) -> u8 {
    if s1.is_empty() || s2.is_empty() {
        return 0;
    }
    let (shorter, longer) = if s1.len() <= s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };
    // Slide a window the size of `shorter` across `longer`,
    // keeping the smallest edit distance found.
    let mut min_distance = usize::MAX;
    for i in 0..=longer.len() - shorter.len() {
        let block = &longer[i..i + shorter.len()];
        let distance = lev(shorter, block);
        if distance < min_distance {
            min_distance = distance;
        }
    }
    let length = (shorter.len() * 2) as f32;
    let ratio = (length - min_distance as f32) / length;
    (ratio * 100.0).round() as u8
}

// example: a single i/j substitution remains in the best window,
// so the score is just shy of a perfect 100
let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(partial_ratio(s1.as_bytes(), s2.as_bytes()), 99);
```

### Other ratios

The original `fuzzywuzzy` library supports a couple of other ratios as well: `token_sort_ratio` and `token_set_ratio`. As their names imply, they split each input sequence into tokens before computing ratios, which can be helpful if sequences with matching but out-of-order words should still be considered high-quality matches.
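The token-sorting idea is simple to sketch: normalize each input by sorting its words, so word order no longer affects the subsequent ratio. This is an illustrative sketch of the normalization step only, not `fuzzywuzzy`'s actual implementation:

```rust
// Sketch of the normalization behind token_sort_ratio: split on
// whitespace, sort the tokens, and rejoin. Two inputs with the same
// words in different orders normalize to identical strings, so a
// ratio computed afterwards scores them as exact matches.
fn sort_tokens(s: &str) -> String {
    let mut tokens: Vec<&str> = s.split_whitespace().collect();
    tokens.sort_unstable();
    tokens.join(" ")
}
```

For example, `sort_tokens("et odisti iniquitatem")` and `sort_tokens("iniquitatem odisti et")` produce the same string, so any ratio over the normalized forms returns 100.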

For the `chants` matching problem domain, the most appropriate choice seemed to be `partial_ratio`, as the order of words is important, but partial references are common.

## Conclusion

The `chants` utility I developed was mostly a wrapper around the `partial_ratio` algorithm described above, with some convenient options for interacting with it on the command line, including:

- Downloading the Latin Vulgate texts to a local directory
- Parsing references and chants from input files and evaluating matches
- Allowing users to specify the threshold ratio for matches and returning only matches that meet it
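The threshold option amounts to scoring every verse against a chant and filtering by the cutoff. A hypothetical sketch (the `matches_above` name and the caller-supplied `ratio` parameter are illustrative, not the utility's actual API) might look like:

```rust
// Hypothetical sketch of threshold filtering: score every verse with a
// caller-supplied ratio function (e.g. partial_ratio over byte slices)
// and keep only the (verse, score) pairs that meet the cutoff.
fn matches_above<'a, F>(
    chant: &str,
    verses: &'a [String],
    threshold: u8,
    ratio: F,
) -> Vec<(&'a str, u8)>
where
    F: Fn(&str, &str) -> u8,
{
    verses
        .iter()
        .map(|v| (v.as_str(), ratio(chant, v)))
        .filter(|&(_, score)| score >= threshold)
        .collect()
}
```

Taking the scoring function as a parameter keeps the filter decoupled from any particular ratio, which also makes it easy to swap in `simple_ratio` or a token-based ratio for experimentation.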

In future posts, I'll cover more of the journey in open-sourcing this project, with a focus on:

- Evaluating existing implementations of fuzzy matching
- How some optimizations led to performance improvements of 25x!