Fuzzy matching applied to chant references

December 06, 2020 - fuzzymatching opensource chants rust

In this post, I want to share about my journey in developing and open-sourcing a side project I've been working on for the past few months. Read on to learn more about:

The problem domain of finding references for Gregorian chants
What fuzzy matching is and why it was a good fit for this problem

Problem

For a recent side project, I was tasked with finding matches between records of two text corpuses (specifically, between Gregorian chants and Latin Vulgate Scripture references). By nature, matches would generally not be identical, as a chant could include only part of a verse, and even direct quotes would often differ slightly based on translation or encoding.

To contextualize the scope of the problem, here are a few quick stats:

There are approximately 36,000 references in the Latin Vulgate Bible.
The number of chants to match varies by input corpus, but the number of chants in a corpus commonly reaches multiple thousands.

Fuzzy matching

What the situation called for is known more generally as approximate string matching, a term that aptly describes the body of techniques that perform fuzzy searches when matching strings against a pattern.

Among the numerous techniques available, I settled on FuzzyWuzzy, a library of approximate string matching functions which was written originally in Python and has ports in many common languages. A good explanation of the types of matching algorithms available is in this blog post, but I'll summarize the relevant details in the sections below.

Levenshtein distance

The fuzzywuzzy library builds abstractions on top of a core sequence similarity measure first developed by Levenshtein.

The Levenshtein distance of two strings is the minimum number of edits (insertions, deletions, or replacements) which must be processed for two strings to be made equivalent.

A simple example is below:

let s1 = String::from("hello");
let s2 = String::from("hallo");
assert_eq!(lev(s1, s2), 1);

Simple ratio

The first abstraction built on top of Levenshtein distance is a simple similarity ratio which represents how similar two strings are, relative to their total length. The ratio ranges from 0 (totally dissimilar) to 100 (an exact match).

A simple implementation is below:

fn simple_ratio<T: PartialEq>(s1: &Vec<T>, s2: &Vec<T>) -> u8 {
    if s1.len() == 0 || s2.len() == 0 {
        return 0;
    }

    let length = (s1.len() + s2.len()) as f32;
    let distance = lev(s1, s2) as f32;
    let ratio = (length - distance) / length;
    ((100.0 * ratio) as u8)
}

// example:
let s1 = String::from("hello");
let s2 = String::from("hallo");
assert_eq!(simple_ratio(s1, s2), 90);

Partial ratio

While simple_ratio can be helpful when the goal is to differentiate between close matches and poor matches, it fails to accurately capture similarity when two strings are of greatly dissimilar length, as in the example below:

let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(simple_ratio(s1, s2), 54);

Here, the simple_ratio measure is quite poor despite the obvious similarity, in which the chant is a direct reference of a portion of the entire verse.

This is the problem partial_ratio is designed to solve.

For two sequences s1 and s2, where m = s1.len(), n = s2.len(), and m <= n, partial_ratio(s1, s2) represents the best-possible simple_ratio measure between s1 and slices of s2 which have the same length as s1.

A simple (but slightly inefficient) implementation is below:

fn partial_ratio<T: PartialEq>(s1: &Vec<T>, s2: &Vec<T>) -> u8 {
    if s1.len() == 0 || s2.len() == 0 {
        return 0;
    }

    let (shorter, longer) = if s1.len() <= s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    let mut min_distance = std::usize::MAX;

    for i in 0..longer.len() - shorter.len() {
        let distance = distance_from(shorter, block);
            if distance < min_distance {
                min_distance = distance;
            }
        }
    }

    let length = (shorter.len() * 2) as f32;
    let ratio = (length - min_distance as f32) / length;
    (ratio * 100.0).round() as u8
}

// example
let s1 = String::from("dilexisti iustitiam et odisti iniquitatem");
let s2 = String::from("dilexisti justitiam et odisti iniquitatem propterea unxit te deus deus tuus oleo laeiti pr consortibus tuis");
assert_eq!(partial_ratio(s1, s2), 100);

Other ratios

The original fuzzywuzzy library supports a couple other ratios as well: token_sort_ratio and token_set_ratio. As their name implies, they split each input sequence into tokens before computing ratios, which can be helpful if sequences with matching but out-of-order words should be still-considered high-quality matches.

For the chants matching problem domain, the most appropriate choice seemed to be partial_ratio, as order of words is important, but partial references are common.

Conclusion

The chants utility I developed was mostly a wrapper of the partial_ratio algorithm described above with some convenient options for interacting with it on the commandline, including:

Downloading the Latin Vulgate texts to a local directory
Parsing references and chants from input files and evaluating matches
Allowing users to specify the threshold ratio for matches and returning matches which exceed this.

In future posts, I'll cover more of the journey in open-sourcing this project, with a focus on:

Evaluating existing implemtations of fuzzy matching
How some optimizations led to performance improvements of 25x!