![]() |
Home | Libraries | People | FAQ | More |
The header file boost/algorithm/sequence/edit_distance.hpp
provides an implementation of the
edit
distance between two sequences (aka Levenshtein distance,
aka string distance, aka string difference).
The edit distance is the length of the shortest (or least-cost) edit script from one sequence to another, where an edit script is defined as a sequence of insertion, deletion and (optionally) substitution operations.
The function implementing the edit distance is named edit_distance
.
This function will return the edit distance between two sequences, where
sequences may be any valid range object supporting forward iteration. The
edit_distance
function
will also, if requested, return the edit script.
The following example illustrates basic features and functionality of
edit_distance
:
#include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using namespace boost::algorithm::sequence::parameter; // Returns the edit distance between two sequences. // Sequences may be any valid range object supporting forward iteration. d = edit_distance(seq1, seq2); // Returns the edit distance, and calls special handler methods from object 'script_handler' // on the correspondiung "script" of edit operations: d = edit_distance(seq1, seq2, _script = script_handler); // Sequences need not be the same type: d = edit_distance(my_vector, my_list | boost::adaptors::reversed);
The edit_distance
function
has two required parameters: the two sequences to be compared. The first
sequence may be thought of as the "input" sequence. Deletion
operations delete elements from this first sequence. The second sequence
may be considered the "output" sequence. Insertion operations
insert elements into this second sequence. Substitution operations subsitute
an element in the first sequence for a different element in the second
sequence.
Sequence arguments may be any range object that supports forward iteration
(or any object that is convertible to such a range using the function
boost::as_literal
). Some examples of such valid
argument types are:
char*
(null terminated C strings)
std::string
std::vector
std::list
boost::adaptors::reversed
)
boost::make_iterator_range
)
The edit_distance
function
also accepts a variety of optional parameters, using the Boost.Parameter
named argument library:
Table 1. Optional Arguments to edit_distance
Argument |
Description |
Default |
---|---|---|
_cost = cost_object
|
define customized cost functions for edit operations insertion, deletion and substitution |
unit cost: cost of insertion, deletion and substitution are all 1 |
_equal = equal_object
|
define a customized equality relation between sequence elements |
standard equality: requires that the |
_script = script_handler
|
the sequence, or "script," of edit operations is passed
to methods defined on |
edit script information is ignored: only the edit distance is returned |
_substitution = boolean_arg
|
Enable or disable substitution. Note that |
|
_max_cost = cost_value
|
Provide a maxium edit distance ("cost"). If the edit cost exceeds this maximum, computation of the true mimimum edit distance will halt, and the remainder of the computation will use a fast linear-time fallback, which will not yield the true minimum distance. |
No maximum: the minimum edit distance will always be computed fully |
_max_cost_exception = bool_value
|
If |
|
![]() |
Tip |
---|---|
Sequence elemements may also have differing types, as long as they are compatible with cost and equality functions. |
A user specified object may be provided to edit_distance
that defines customized cost functions for the edit operations.
A cost function object must define the methods insertion
,
deletion
and substitution
. The cost type used by
edit_distance
can be explicitly
defined by defining cost_type
,
or it may be automatically inferred from the return type of the cost functions.
![]() |
Tip |
---|---|
When substitution is disabled at compile-time, then the |
The cost function defaults to "unit cost." The cost of insertion, deletion and substitution is always exactly 1.
The following example demonstrates the use of customized cost functions:
#include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using namespace boost::algorithm::sequence::parameter; struct custom_cost { // If cost_type is not defined, it will be automatically inferred from // the return type of the cost functions below: typedef int cost_type; // inserting or deleting a space is free: int insertion(char a) const { return (a == ' ') ? 0 : 1; } int deletion(char a) const { return (a == ' ') ? 0 : 1; } // defining this method is not required if substitution has been // disabled at compile time. int substitution(char a, char b) const { return 1; } }; // Invoke edit_distance with the custom cost functions: int d = edit_distance("spaces here", " spaces there ", _cost = custom_cost()); // inserting 't' costs 1. Inserting extra spaces costs nothing: BOOST_ASSERT(d == 1);
![]() |
Tip |
---|---|
Although edit distances are typically integer values, the cost type can also be defined as a floating point. |
![]() |
Note |
---|---|
When a cost function is asymmetric (for example, if the cost of deletion
is not equal to cost of insertion), |
![]() |
Caution |
---|---|
Cost functions are expected to always return values >= 0. The behavior
of |
A user specified equality relation may be provided to edit_distance
,
that defines when the distance algorithms consider two sequence elements
to be identical.
The equality relation defaults to applying operator ==
to sequence elements.
The following example demonstrates the use of a customized equality relation:
#include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using namespace boost::algorithm::sequence::parameter; struct custom_equal { // any elements of the same case are considered identical bool operator()(char a, char b) const { return tolower(a) == tolower(b); } }; // apply the customized definition of element equality: int d = edit_distance("case", "CaSe", _equal = custom_equal()); // characters differing only by case are considered equal, and so distance is zero: BOOST_ASSERT(d == 0);
The minimum edit distance between two sequences corresponds to a particular order of edit operations, referred to as an "edit script." The user may provide an object that defines methods for processing ("handling") the resulting script of operations.
An edit script handler must define the methods: insertion
,
deletion
, substitution
and equality
.
![]() |
Tip |
---|---|
When substitution is disabled at compile-time, then the |
By default, the edit_distance
function provides no script handling object. All information about the
script of edit operations is ignored, and only the final edit distance
is returned.
The following example demonstrates the definition of an edit script handler:
#include <cstdlib> #include <boost/tuple/tuple.hpp> #include <boost/tuple/tuple_io.hpp> using boost::make_tuple; using std::cout; #include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using namespace boost::algorithm::sequence::parameter; struct script_handler { // an edit script handler that writes edit operations to stdout: void insertion(char element, int cost) { cout << make_tuple('+', element, cost); } void deletion(char element, int cost) { cout << make_tuple('-', element, cost); } // optional if substitution is disabled at compile time void substitution(char e1, char e2, int cost) { cout << make_tuple(':', e1, e2, cost); } // cost is known to be zero by definition, so it is not a parameter void equality(char, char) { /* ignore equality in output */ } }; // A script handler is passed as a non-const reference, and must be declared prior to // calling edit_distance() script_handler handler; cout << boost::tuples::set_delimiter(' '); // the edit script of operations will be passed to the methods of handler: int d = edit_distance("cat in the hat", "hat on the cat", _script = handler); cout << std::endl; // delete 'c', insert 'h', delete 'i', insert 'o', delete 'h', insert 'c' BOOST_ASSERT(d == 6); // handler will output: "(- c 1)(+ h 1)(- i 1)(+ o 1)(- h 1)(+ c 1)" // NOTE: the exact ordering of edit script operations is not generally guaranteed. // For example, the following permutation might also be output: // "(- h 1)(+ c 1)(- i 1)(+ o 1)(- c 1)(+ h 1)"
![]() |
Note |
---|---|
Minimal edit scripts are not unique. For example a deletion followed
by an insertion is often equivalent to the reverse. In general, |
The substitution operation is not necessary for computing an edit distance,
as it is equivalent to a deletion followed by an insertion. Common applications
of edit distance, e.g. the standard unix file diff, do not make use of
subsitution. The Myers edit distance algorithm
implicitly assumes no substitution. Therefore, substitution is not enabled
by default in edit_distance
.
Substitution is enabled or disabled via the _substitution
named parameter. The function supports two modes of enable/disable: If
the argument is passed a value of type bool
(i.e. true
, false
or a bool
variable), then substitution is enabled/disabled at run-time. However,
if the argument is passed boost::false_type()
or boost::true_type()
, substitution is enabled or disabled
at compile time. This is faster, as the boolean check is not present in
the code. Compile-time disabled (the default) is the fastest of all, as
no subsitution checking or code is ever compiled at all.
The following example demonstrates the _substitution
parameter:
#include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using namespace boost::algorithm::sequence::parameter; // turn substitution on or off at run-time bool sub_switch = true; int d = edit_distance("cat", "bat", _substitution = sub_switch); // substitute 'c' -> 'b' BOOST_ASSERT(d == 1); // enable substitution at compile-time d = edit_distance("cat", "hat", _substitution = boost::true_type()); // substitute 'c' -> 'h' BOOST_ASSERT(d == 1); // disable substitution at compile-time (the default) d = edit_distance("cat", "mat", _substitution = boost::false_type()); // delete 'c', insert 'm' BOOST_ASSERT(d == 2);
The edit_distance
function
is implemented to be efficient when comparing two sequences that differ
only in localized areas. For example, comparing two similar text files
that differ only by localized edits, or comparing two sequences of DNA
that differ only at certain localized mutations.
However, edit distance in general always has a worst case time and space complexity of O(NxM), where N and M are the lengths of the two sequences being compared. For long sequences, this worst case can quickly become intractible.
The edit_distance
function
provides an argument _max_cost
,
which takes a numeric maximum cost value. If the function finds that the
maximum cost exceeds this value, it will halt its normal computations and
fall back to a fast linear algorithm to complete the rest of the computation.
This fallback can obviously not provide a minimum distance, however its
answer will build on the 'full' computations performed up to that point.
![]() |
Note |
---|---|
Since the fallback continues after encountering the maximum cost, the resulting edit distance will be >= max-cost. Since the full edit path search is not performed, the result will also be >= the true minimum distance. |
If an approximate fallback is not desired, the function may also be directed
to throw an exception if the computation exceeds a given maximum cost.
The argument _max_cost_exception
controls this behavior.
The following example demonstrates _max_cost
and _max_cost_exception
:
#include <boost/algorithm/sequence/edit_distance.hpp> using boost::algorithm::sequence::edit_distance; using boost::algorithm::sequence::max_edit_cost_exception; using namespace boost::algorithm::sequence::parameter; // get the distance between two long sequences, but fall back to approximation if the distance is too large. d = edit_distance(long_sequence, another_long_sequence, _max_cost = 1000); // throw exception if the distance grows too large try { d = edit_distance(long_sequence, another_long_sequence, _max_cost = 1000, _max_cost_exception = true); } catch (max_edit_cost_exception& e) { // handle the exception }
The implementation of edit_distance
is based on a variation of the Dijkstra Single Source Shortest Path algorithm,
tuned for the particular structure of an edit graph. It is efficient on
long sequences with localized areas of differences (e.g. computing the
diff of two similar files).
Whenever conditions permit, an even faster specialized distance algorithm is automatically applied, from:
An O(ND) Difference Algorithm and its Variations Eugene W. Myers Dept of Computer Science, University of Arizona
The Myers algorithm will be applied when the following conditions are met:
Note that these last two are the default behaviors for edit_distance
,
and so it is otherwise sufficient to provide the function with ranges that
support random access.
![]() |
Caution |
---|---|
If you supply a custom cost object, the Myers algorithm will not be used, as it is not possible to determine whether or not a given cost object implements unit costs. |