Boost C++ Libraries Home Libraries People FAQ More

PrevUpHomeNext

Sequence Algorithms

edit_distance

The header file boost/algorithm/sequence/edit_distance.hpp provides an implementation of the edit distance between two sequences (aka Levenshtein distance, aka string distance, aka string difference).

The edit distance is the length of the shortest (or least-cost) edit script from one sequence to another, where an edit script is defined as a sequence of insertion, deletion and (optionally) substitution operations.

The function implementing the edit distance is named edit_distance. This function will return the edit distance between two sequences, where sequences may be any valid range object supporting forward iteration. The edit_distance function will also, if requested, return the edit script.

The following example illustrates basic features and functionality of edit_distance:

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using namespace boost::algorithm::sequence::parameter;

// Returns the edit distance between two sequences.
// Sequences may be any valid range object supporting forward iteration.
d = edit_distance(seq1, seq2);

// Returns the edit distance, and calls special handler methods from object 'script_handler'
// on the correspondiung "script" of edit operations:
d = edit_distance(seq1, seq2, _script = script_handler);

// Sequences need not be the same type:
d = edit_distance(my_vector, my_list | boost::adaptors::reversed);

The edit_distance function has two required parameters: the two sequences to be compared. The first sequence may be thought of as the "input" sequence. Deletion operations delete elements from this first sequence. The second sequence may be considered the "output" sequence. Insertion operations insert elements into this second sequence. Substitution operations subsitute an element in the first sequence for a different element in the second sequence.

Sequence arguments may be any range object that supports forward iteration (or any object that is convertible to such a range using the function boost::as_literal). Some examples of such valid argument types are:

  • char* (null terminated C strings)
  • std::string
  • std::vector
  • std::list
  • output from range adaptors (e.g. boost::adaptors::reversed)
  • ranges assembled from iterators (e.g. boost::make_iterator_range)

The edit_distance function also accepts a variety of optional parameters, using the Boost.Parameter named argument library:

Table 1. Optional Arguments to edit_distance

Argument

Description

Default

_cost = cost_object

define customized cost functions for edit operations insertion, deletion and substitution

unit cost: cost of insertion, deletion and substitution are all 1

_equal = equal_object

define a customized equality relation between sequence elements

standard equality: requires that the == operator is defined for sequence elements

_script = script_handler

the sequence, or "script," of edit operations is passed to methods defined on script_handler for collection or other processing by the calling code

edit script information is ignored: only the edit distance is returned

_substitution = boolean_arg

Enable or disable substitution. Note that boolean_arg may be either of type bool (enable/disable at run time), boost::false_type() or boost::true_type() (enable/disable at compile time). When substitution is disabled, only insertion and deletion are considered as edit operations.

boost::false_type() -- substitution is disabled at compile time.

_max_cost = cost_value

Provide a maxium edit distance ("cost"). If the edit cost exceeds this maximum, computation of the true mimimum edit distance will halt, and the remainder of the computation will use a fast linear-time fallback, which will not yield the true minimum distance.

No maximum: the minimum edit distance will always be computed fully

_max_cost_exception = bool_value

If bool_value == true, then edit_distance will throw an exception if edit distance exceeds the value specified by _max_cost, instead of completing with the fast fallback.

false


[Tip] Tip

Sequence elemements may also have differing types, as long as they are compatible with cost and equality functions.

A user specified object may be provided to edit_distance that defines customized cost functions for the edit operations.

A cost function object must define the methods insertion, deletion and substitution. The cost type used by edit_distance can be explicitly defined by defining cost_type, or it may be automatically inferred from the return type of the cost functions.

[Tip] Tip

When substitution is disabled at compile-time, then the substitution method is not required.

The cost function defaults to "unit cost." The cost of insertion, deletion and substitution is always exactly 1.

The following example demonstrates the use of customized cost functions:

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using namespace boost::algorithm::sequence::parameter;

struct custom_cost {
    // If cost_type is not defined, it will be automatically inferred from
    // the return type of the cost functions below:
    typedef int cost_type;

    // inserting or deleting a space is free:
    int insertion(char a) const { return (a == ' ') ? 0 : 1; }
    int deletion(char a) const { return (a == ' ') ? 0 : 1; }

    // defining this method is not required if substitution has been 
    // disabled at compile time.
    int substitution(char a, char b) const { return 1; }
};

// Invoke edit_distance with the custom cost functions:
int d = edit_distance("spaces here",  "  spaces   there ",  _cost = custom_cost());

// inserting 't' costs 1.  Inserting extra spaces costs nothing:
BOOST_ASSERT(d == 1);

[Tip] Tip

Although edit distances are typically integer values, the cost type can also be defined as a floating point.

[Note] Note

When a cost function is asymmetric (for example, if the cost of deletion is not equal to cost of insertion), edit_distance will not behave as a true distance metric, as the symmetry property is violated: edit_distance(a, b) != edit_distance(b, a).

[Caution] Caution

Cost functions are expected to always return values >= 0. The behavior of edit_distance is undefined on functions that can return negative cost values.

A user specified equality relation may be provided to edit_distance, that defines when the distance algorithms consider two sequence elements to be identical.

The equality relation defaults to applying operator == to sequence elements.

The following example demonstrates the use of a customized equality relation:

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using namespace boost::algorithm::sequence::parameter;

struct custom_equal {
    // any elements of the same case are considered identical
    bool operator()(char a, char b) const { return tolower(a) == tolower(b); }
};

// apply the customized definition of element equality:
int d = edit_distance("case", "CaSe", _equal = custom_equal());

// characters differing only by case are considered equal, and so distance is zero:
BOOST_ASSERT(d == 0);

The minimum edit distance between two sequences corresponds to a particular order of edit operations, referred to as an "edit script." The user may provide an object that defines methods for processing ("handling") the resulting script of operations.

An edit script handler must define the methods: insertion, deletion, substitution and equality.

[Tip] Tip

When substitution is disabled at compile-time, then the substitution method is not required.

By default, the edit_distance function provides no script handling object. All information about the script of edit operations is ignored, and only the final edit distance is returned.

The following example demonstrates the definition of an edit script handler:

#include <cstdlib>
#include <boost/tuple/tuple.hpp>
#include <boost/tuple/tuple_io.hpp>
using boost::make_tuple;
using std::cout;

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using namespace boost::algorithm::sequence::parameter;

struct script_handler {
    // an edit script handler that writes edit operations to stdout:
    void insertion(char element, int cost) { cout << make_tuple('+', element, cost); }
    void deletion(char element, int cost) { cout << make_tuple('-', element, cost); }

    // optional if substitution is disabled at compile time
    void substitution(char e1, char e2, int cost) { cout << make_tuple(':', e1, e2, cost); }

    // cost is known to be zero by definition, so it is not a parameter
    void equality(char, char) { /* ignore equality in output */ }
};

// A script handler is passed as a non-const reference, and must be declared prior to 
// calling edit_distance()
script_handler handler;

cout << boost::tuples::set_delimiter(' ');

// the edit script of operations will be passed to the methods of handler:
int d = edit_distance("cat in the hat", "hat on the cat", _script = handler);

cout << std::endl;

// delete 'c', insert 'h', delete 'i', insert 'o', delete 'h', insert 'c'
BOOST_ASSERT(d == 6);

// handler will output:  "(- c 1)(+ h 1)(- i 1)(+ o 1)(- h 1)(+ c 1)"
// NOTE: the exact ordering of edit script operations is not generally guaranteed.  
// For example, the following permutation might also be output:
// "(- h 1)(+ c 1)(- i 1)(+ o 1)(- c 1)(+ h 1)"

[Note] Note

Minimal edit scripts are not unique. For example a deletion followed by an insertion is often equivalent to the reverse. In general, edit_distance cannot guarantee which of a possible set of equivalent permutations will be output. The script may be collected and sorted using a custom script handler, if a particular ordering scheme is desired for an application.

The substitution operation is not necessary for computing an edit distance, as it is equivalent to a deletion followed by an insertion. Common applications of edit distance, e.g. the standard unix file diff, do not make use of subsitution. The Myers edit distance algorithm implicitly assumes no substitution. Therefore, substitution is not enabled by default in edit_distance.

Substitution is enabled or disabled via the _substitution named parameter. The function supports two modes of enable/disable: If the argument is passed a value of type bool (i.e. true, false or a bool variable), then substitution is enabled/disabled at run-time. However, if the argument is passed boost::false_type() or boost::true_type(), substitution is enabled or disabled at compile time. This is faster, as the boolean check is not present in the code. Compile-time disabled (the default) is the fastest of all, as no subsitution checking or code is ever compiled at all.

The following example demonstrates the _substitution parameter:

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using namespace boost::algorithm::sequence::parameter;

// turn substitution on or off at run-time
bool sub_switch = true;
int d = edit_distance("cat", "bat", _substitution = sub_switch);
// substitute 'c' -> 'b'
BOOST_ASSERT(d == 1);

// enable substitution at compile-time
d = edit_distance("cat", "hat", _substitution = boost::true_type());
// substitute 'c' -> 'h'
BOOST_ASSERT(d == 1);

// disable substitution at compile-time (the default)
d = edit_distance("cat", "mat", _substitution = boost::false_type());
// delete 'c', insert 'm'
BOOST_ASSERT(d == 2);

The edit_distance function is implemented to be efficient when comparing two sequences that differ only in localized areas. For example, comparing two similar text files that differ only by localized edits, or comparing two sequences of DNA that differ only at certain localized mutations.

However, edit distance in general always has a worst case time and space complexity of O(NxM), where N and M are the lengths of the two sequences being compared. For long sequences, this worst case can quickly become intractible.

The edit_distance function provides an argument _max_cost, which takes a numeric maximum cost value. If the function finds that the maximum cost exceeds this value, it will halt its normal computations and fall back to a fast linear algorithm to complete the rest of the computation. This fallback can obviously not provide a minimum distance, however its answer will build on the 'full' computations performed up to that point.

[Note] Note

Since the fallback continues after encountering the maximum cost, the resulting edit distance will be >= max-cost. Since the full edit path search is not performed, the result will also be >= the true minimum distance.

If an approximate fallback is not desired, the function may also be directed to throw an exception if the computation exceeds a given maximum cost. The argument _max_cost_exception controls this behavior.

The following example demonstrates _max_cost and _max_cost_exception:

#include <boost/algorithm/sequence/edit_distance.hpp>

using boost::algorithm::sequence::edit_distance;
using boost::algorithm::sequence::max_edit_cost_exception;
using namespace boost::algorithm::sequence::parameter;

// get the distance between two long sequences, but fall back to approximation if the distance is too large.
d = edit_distance(long_sequence, another_long_sequence, _max_cost = 1000);

// throw exception if the distance grows too large
try {
    d = edit_distance(long_sequence, another_long_sequence, _max_cost = 1000, _max_cost_exception = true);
} catch (max_edit_cost_exception& e) {
    // handle the exception
}

The implementation of edit_distance is based on a variation of the Dijkstra Single Source Shortest Path algorithm, tuned for the particular structure of an edit graph. It is efficient on long sequences with localized areas of differences (e.g. computing the diff of two similar files).

Whenever conditions permit, an even faster specialized distance algorithm is automatically applied, from:

An O(ND) Difference Algorithm and its Variations
Eugene W. Myers
Dept of Computer Science, University of Arizona

The Myers algorithm will be applied when the following conditions are met:

  • Both sequences support random access
  • Substitution is not considered
  • Unit cost for insertion and deletion

Note that these last two are the default behaviors for edit_distance, and so it is otherwise sufficient to provide the function with ranges that support random access.

[Caution] Caution

If you supply a custom cost object, the Myers algorithm will not be used, as it is not possible to determine whether or not a given cost object implements unit costs.


PrevUpHomeNext