Fuzzy Match at Speed
December 09, 2020 - This post is a deep dive on the 25x performance optimizations in fuzzy matching hinted at in the previous post. Feel free to refer to that post for an intro to the problem context and the basic algorithms involved.
As a high-level overview of this post, the main source of performance improvements was related to the following optimization insight:
The fastest code is the code that doesn't run.
Restated slightly, this means that there's no substitute for algorithmic improvements which minimize the total complexity or work a program must perform.
As we'll also see, using good tools (languages or libraries which are oriented towards performance) helps as well.
Baseline
Being familiar with .NET and not having much love for Python (in which the original fuzzywuzzy library was implemented), my first inclination was to use FuzzySharp, a relatively faithful .NET port. This proved to be reasonably effective as a first pass, and early users of the chants utility found it useful.
The biggest issue was that it was fairly slow. With over 36k references and the number of chants reaching into the thousands, that meant tens of millions of comparisons, which ended up taking upwards of a day to process on a medium-tier laptop. Not great for iterating or tweaking settings!
Rust
At around this same time, I became fairly interested in Rust as a programming language. Its emphasis on performance as well as ergonomics makes it a great fit for many applications and a general pleasure to use.
I find one of the best ways to learn a new technology is to port an existing, familiar project to it. Since the chants utility was already on my mind, what better way to dive in than to port it to Rust?
Note: While I'm a fan of Rust and think it is well-suited to many applications, as mentioned earlier, most of the performance benefits were the result of algorithmic improvements, which could have been achieved in other languages as well.
In the course of the port, I did consider using existing libraries, either directly or as inspiration. However, none of them seemed to be a particularly good fit:
- strsim is a popular library with various string similarity implementations, including Levenshtein, but it lacks some key methods required by the partial ratio fuzzy matching algorithm.
- fuzzyrusty and fuzzywuzzy implement the partial ratio measures, but they only support &str instead of &[T]. In addition, they lack some of the key optimizations I was interested in adding.
In the end, it was more effective (and also more fun!) to implement the algorithms by hand.
Optimizations
Levenshtein
The first optimization was having a really fast Levenshtein implementation.
What contributes to an implementation being really fast?
- Efficient algorithms
- Low memory use (allocate only when needed)
Let's see how this played out in optimizing levenshtein.
To start, I referenced the FuzzySharp code along with code on Wikipedia to get a sense of common approaches.
From a quick scan of Wikipedia, I learned that while the overall algorithm is O(mn) (for two sequences of length m and n, respectively), some implementations allocate O(mn) memory as well, whereas more efficient ones allocate only O(n), where n <= m.
Since levenshtein will be called in our inner loop, and the input size will commonly range into the hundreds of characters each, being efficient with allocations would pay off, even for a language like Rust which doesn't have a garbage collector.
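To make that concrete, here's a rough sketch (not from the actual codebase) of the difference between the two allocation strategies for sequences of length m and n:

// Full-matrix approach: (m + 1) * (n + 1) cells allocated up front, i.e. O(mn) memory.
fn allocate_full_matrix(m: usize, n: usize) -> Vec<Vec<usize>> {
    vec![vec![0usize; n + 1]; m + 1]
}

// Single-row approach: n + 1 cells, reused for every row of the table, i.e. O(n) memory.
// This is the shape the implementation below uses.
fn allocate_single_row(n: usize) -> Vec<usize> {
    vec![0usize; n + 1]
}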
Interestingly, FuzzySharp seems to take this into account, and the core logic is quite efficient. However, it copies each sequence unnecessarily with String.ToCharArray(), which allocates. Instead, in Rust, we can use slices (&[T]), forcing the caller to send us a generic array reference which has already been allocated.
This points to one reason why I like Rust for this type of work: it's designed in a way that makes you think about things that affect performance (like allocations), which steers you toward the pit of success by default. In contrast, a C# developer might not know that ToCharArray() allocates unless they are consciously thinking about performance and reading the documentation carefully.
Outside of being efficient about allocations, there are a couple other common opportunities for improvement.
Firstly, in the event that the two sequences share matching prefixes or suffixes, it's cheaper to compare them up front, leaving only the unmatched slices for the inner logic. In the case of totally matching sequences, the runtime is O(m), whereas if we computed full levenshtein, it would still be O(m^2).
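The trimming itself is straightforward. The helpers below are an assumed sketch of what get_prefix_length and get_suffix_length (used in the full implementation further down) might look like:

// Number of leading elements shared by both sequences (assumed helper sketch).
fn get_prefix_length<T: PartialEq>(shorter: &[T], longer: &[T]) -> usize {
    shorter
        .iter()
        .zip(longer.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

// Number of trailing elements shared by both sequences (assumed helper sketch).
fn get_suffix_length<T: PartialEq>(shorter: &[T], longer: &[T]) -> usize {
    shorter
        .iter()
        .rev()
        .zip(longer.iter().rev())
        .take_while(|(a, b)| a == b)
        .count()
}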
A final optimization is adding support for a target-distance argument (min_distance in the code below). Since costs are monotonically increasing, if, in the course of computing the distance, all possible scores are known to exceed this threshold, the algorithm can short-circuit and return early with None. Otherwise, it returns Some(distance).
Of note, these optimizations aren't very sophisticated and aren't original ideas, but they also are not so obvious that they're found in all implementations.
// Edit cost constants (the exact values are assumed here; standard unit costs)
const COST_EQUALS: usize = 0;
const COST_INSERT: usize = 1;
const COST_DELETE: usize = 1;
const COST_REPLACE: usize = 1;

pub fn distance_from<T: PartialEq>(s1: &[T], s2: &[T], min_distance: usize) -> Option<usize> {
    // Return early for zero-length inputs
    if s1.len() == 0 {
        return Some(s2.len() * COST_INSERT);
    }
    if s2.len() == 0 {
        return Some(s1.len() * COST_INSERT);
    }
    let (shorter, longer) = if s1.len() <= s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };
    // Trim matching prefixes and suffixes
    let prefix_length = get_prefix_length(shorter, longer);
    let suffix_length = get_suffix_length(&shorter[prefix_length..], &longer[prefix_length..]);
    // Re-assign shorter and longer to their unmatched sections
    let (shorter, longer) = (
        &shorter[prefix_length..shorter.len() - suffix_length],
        &longer[prefix_length..longer.len() - suffix_length],
    );
    // Single cost row, reused for every row of the (virtual) DP table
    let mut costs = vec![0; shorter.len() + 1];
    // Initialize first row
    for i in 0..shorter.len() {
        costs[i + 1] = costs[i] + COST_INSERT;
    }
    for i in 0..longer.len() {
        let mut diagonal = costs[0];
        costs[0] = diagonal + COST_DELETE;
        let mut min_local_cost = std::usize::MAX;
        for j in 0..shorter.len() {
            let cost_sub = diagonal
                + if &shorter[j] == &longer[i] {
                    COST_EQUALS
                } else {
                    COST_REPLACE
                };
            let mut cost = cost_sub;
            let cost_del = costs[j + 1] + COST_DELETE;
            cost = if cost_del < cost { cost_del } else { cost };
            let cost_inst = costs[j] + COST_INSERT;
            cost = if cost_inst < cost { cost_inst } else { cost };
            if cost < min_local_cost {
                min_local_cost = cost;
            }
            diagonal = costs[j + 1];
            costs[j + 1] = cost;
        }
        // Short-circuit: costs only grow from row to row, so once the best value
        // in a row reaches min_distance, the final distance must as well.
        if min_local_cost >= min_distance && shorter.len() > 0 {
            return None;
        }
    }
    Some(costs[shorter.len()])
}
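As a usage sketch (the strings and threshold here are made up), a caller converts each sequence once and passes the distance it actually cares about:

fn main() {
    // Convert once; distance_from borrows the slices and allocates only its cost row.
    let reference: Vec<char> = "ave maris stella".chars().collect();
    let candidate: Vec<char> = "ave maria".chars().collect();

    // Pass 5 as the threshold: if every entry in some row of the cost table
    // already reaches it, the call bails out early with None.
    match distance_from(&candidate, &reference, 5) {
        Some(d) => println!("edit distance: {}", d),
        None => println!("distance is at least 5; not a useful match"),
    }
}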
Together, these algorithmic optimizations (combined with the conversion to Rust) contributed about a 5x improvement over the .NET version.
Partial Ratio
Having a fast Levenshtein implementation was a good start, but there were additional opportunities for improvement, this time in partial_ratio.
Recall from the previous post that we had the following basic code:
let mut min_distance = std::usize::MAX;
// Slide a window of the shorter sequence's length across the longer one
for i in 0..=(longer.len() - shorter.len()) {
    let block = &longer[i..i + shorter.len()];
    let distance = distance_from(shorter, block);
    if distance < min_distance {
        min_distance = distance;
    }
}
While this is correct, it's not terribly efficient, and most implementations take advantage of an important observation:
For any two sequences a and b, where a.len() < b.len(), the lowest partial edit distance (defined as the edit distance between a and a slice b' where b'.len() == a.len()) will occur on a slice b' which begins a matching sequence.
Breaking this down somewhat, we do not need to check all b.len() - a.len() + 1 slices of b that have the same length as a; instead, we can simply check the ones which begin matching slices (we also add the first slice and the final slice for good measure).
Why exactly is this true? One way to think about it is the following:
For a slice that begins with a match, consider the slice that begins at the following index. It can be no lower in edit distance than the previous slice, because, at best, it trades a matching character for another matching character.
This means that the code snippet above can be converted to:
let mut min_distance = std::usize::MAX;
// Find all matching slices between the two sequences, where a block is defined as:
// (starting index in long sequence, starting index in short sequence, length)
let matching_blocks = get_matching_blocks(shorter, longer);
for (long_start, _, _) in matching_blocks {
    let long_end = std::cmp::min(longer.len(), long_start + shorter.len());
    let block = &longer[long_start..long_end];
    let distance = distance_from(shorter, block);
    if distance < min_distance {
        min_distance = distance;
    }
}
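The repository provides get_matching_blocks, which isn't shown in this post. As a rough, hedged sketch of what such a helper could look like, the version below uses a Ratcliff/Obershelp-style recursion (find the longest common block, then recurse into the regions on either side), similar in spirit to Python's difflib.SequenceMatcher; the tuple order matches the comment above:

use std::collections::HashMap;

// Longest common block between longer[lo_l..hi_l] and shorter[lo_s..hi_s],
// returned as (start in longer, start in shorter, length). Sketch only.
fn longest_match<T: PartialEq>(
    longer: &[T],
    shorter: &[T],
    lo_l: usize,
    hi_l: usize,
    lo_s: usize,
    hi_s: usize,
) -> (usize, usize, usize) {
    let (mut best_l, mut best_s, mut best_len) = (lo_l, lo_s, 0);
    // len_ending_at[j] = length of the common block ending at longer[i - 1], shorter[j]
    let mut len_ending_at: HashMap<usize, usize> = HashMap::new();
    for i in lo_l..hi_l {
        let mut next: HashMap<usize, usize> = HashMap::new();
        for j in lo_s..hi_s {
            if longer[i] == shorter[j] {
                let k = if j > lo_s {
                    len_ending_at.get(&(j - 1)).copied().unwrap_or(0) + 1
                } else {
                    1
                };
                next.insert(j, k);
                if k > best_len {
                    best_l = i + 1 - k;
                    best_s = j + 1 - k;
                    best_len = k;
                }
            }
        }
        len_ending_at = next;
    }
    (best_l, best_s, best_len)
}

// All matching blocks as (start in longer, start in shorter, length).
fn get_matching_blocks<T: PartialEq>(shorter: &[T], longer: &[T]) -> Vec<(usize, usize, usize)> {
    let mut queue = vec![(0, longer.len(), 0, shorter.len())];
    let mut blocks = Vec::new();
    while let Some((lo_l, hi_l, lo_s, hi_s)) = queue.pop() {
        let (i, j, k) = longest_match(longer, shorter, lo_l, hi_l, lo_s, hi_s);
        if k > 0 {
            blocks.push((i, j, k));
            // Recurse into the unmatched regions before and after the block
            if lo_l < i && lo_s < j {
                queue.push((lo_l, i, lo_s, j));
            }
            if i + k < hi_l && j + k < hi_s {
                queue.push((i + k, hi_l, j + k, hi_s));
            }
        }
    }
    blocks.sort();
    blocks
}

The actual implementation in the repository may differ; what matters for the optimization is simply the set of candidate start indices it produces.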
Compared to the naive implementation, the matching-blocks version significantly reduces redundant computation, making it quite a bit faster. However, it doesn't go quite far enough. In particular, it will frequently result in too many distance_from calls towards the end of the sequence matches, where the block from the longer sequence has length less than the length of the shorter sequence.
To correct this, we can adjust the algorithm to only compute distance_from for slices of the same length as the shorter sequence.
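One way to do that (a sketch; the dedup set and the clamping are my additions here, not necessarily how the repository does it) is to clamp each candidate start index so the slice always has shorter.len() elements, and to skip starts that have already been scored:

use std::collections::HashSet;

let mut min_distance = std::usize::MAX;
let mut seen_starts: HashSet<usize> = HashSet::new();

for (long_start, _, _) in get_matching_blocks(shorter, longer) {
    // Clamp so the slice always has exactly shorter.len() elements
    let start = std::cmp::min(long_start, longer.len() - shorter.len());
    if !seen_starts.insert(start) {
        continue; // this slice was already scored
    }
    let block = &longer[start..start + shorter.len()];
    // Pass the running minimum so the levenshtein call can bail out early
    if let Some(distance) = distance_from(shorter, block, min_distance) {
        if distance < min_distance {
            min_distance = distance;
        }
    }
}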
Another important series of optimizations we can make here leverages the same short-circuiting idea from the Levenshtein section above. If we give consumers an input that represents the target partial ratio score they're aiming for, we can return Option<u8>, meaning we return Some(<score>) if the score exceeds the specified threshold, and None otherwise. While for the most part we don't do anything special except compute the edit distance threshold to pass to levenshtein, we can benefit from another observation that lets us avoid processing most blocks entirely:
Imagine there is some b' that yields the target edit distance required for the desired partial ratio. As b.len() > a.len(), the edit distance lev(a, b) in such a case can be at most the target edit distance plus the difference in string length. By extension, if this condition is not satisfied, such that lev(a, b) - b.len() + a.len() > target_distance, then we know for certain that a b' with the target distance cannot exist.
Restated slightly, this means that the edit distance on full sequences bounds the best possible edit distance on partial slices. If the former is "bad enough", we can rule out the possibility of a good match on a partial slice.
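In code, the check might look roughly like the following (a sketch; target_distance is assumed to have already been derived from the requested ratio, with a being the shorter sequence and b the longer one):

let length_difference = longer.len() - shorter.len();

// lev(a, b') >= lev(a, b) - (b.len() - a.len()) for any full-length slice b',
// so if the full-sequence distance already exceeds
// target_distance + length_difference, no slice can reach the target.
match distance_from(shorter, longer, target_distance + length_difference + 1) {
    None => return None,
    Some(full_distance) if full_distance > target_distance + length_difference => return None,
    Some(_) => {
        // A qualifying slice may still exist; fall through to the per-block search.
    }
}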
Depending on the nature of the target score and the input data, this observation can be quite an unlock. For example, if the target ratio is relatively restrictive, and most match candidates are poor, being able to short-circuit after a single levenshtein invocation is quite useful. On the input data and settings processed with the chants utility, this led to about a 5x increase in execution speed, on top of the already sizeable gains in the core levenshtein implementation and previous short-circuit opportunities.
As far as I am aware, this specific optimization is not in use anywhere, so it could be original. Perhaps it's because most ports of fuzzywuzzy directly reimplement the somewhat obscure original logic. Indeed, this observation didn't really surface until I reframed the get_matching_blocks problem into one of getting the slices in the longer sequence with the same length as the shorter sequence.
Summary and future work
The resulting performance improvement over my first .NET utility, which used a third-party fuzzy-matching library, ended up being a factor of about 25x. Not bad for a side project! The code is available here and is MIT-licensed, so feel free to use it if you have an interest in high-performance fuzzy matching in Rust!
As a final note, in the course of writing this article, I came across an interesting post regarding the use of term frequency-inverse document frequency (TF-IDF) for efficient fuzzy matching at scale. It would be interesting to explore the effectiveness of that method in this problem space, as it promises to be much faster than even a hand-tuned levenshtein-based matching algorithm in Rust. Even if it isn't as precise as Levenshtein, it would likely provide a low-cost way of reducing the search space to match candidates with a higher likelihood of producing a high-quality match, which would be cheaper than running Levenshtein on all possible match candidates.