PE1AQP's program to compute edit-distances.

Introduction
String Similarity, Dissimilarity and Edit-Distance
Damerau-Levenshtein Algorithm
Download
Example Script
Limitations
Back to
The Fine Print

This little project was triggered by the needs of a (Linux, bash) script used for my ham-radio purposes. It could also be useful for other purposes, so I decided to publish it. Even so I will use ham-radio terminology on this web-page.

Introduction

I have a little (Linux-, bash-) shell script that processes ham-radio callsigns, given by a user. These callsigns could be anything, but are almost always one of a known small set.

So I wanted a way to check if the given callsign could be a misspelling of one of the known set. if so, we should ask the user if this was indeed intended.

EXAMPLE

Assume that we know that the given callsign could be anything,
but is almost always one of 
  "PE1AQP", "DK1AQP" or "G1RLZ".

- If a user enters a clearly different string, say "PE0FKO"
  We should accept that string.

- If a user enters a correct, known string "G1RLZ"
  We should also accept that string.

- But when the user enters a possibly mistyped string "DL1AQP"
  We should ask him or her:
  Did you perhaps mean "DK1AQP" or even "PE1AQP" ?

This is very similar to the spellchecking feature of most editors, email-programs etc., etc.. However, I could not find a stand-alone, command-line oriented program that could be called from a script that provides this feature.

So I had to write my own (which I enjoyed doing anyhow). The resulting program might just be of interest to others.

String Similarity, Dissimilarity and Edit-Distance

At first sight, one would require an algorithm to compute the similarity between two strings. Easier to define is, however, the degree in which two strings are dissimilar. Such string-dissimilarity can be defined as the number of edit steps required to go from the first to the second string.

A single edit-step is one of these actions:

Remove a character: "AB" -> "A"
Insert a new character: "AB" -> "AXB"
Replace an existing character with another : "AB" -> "AX"
Swap two characters: "AB" -> "BA"

The number of these steps between two strings is also known as the edit-distance between these strings.

Studies have shown that most typing mistakes people make when entering text consist of one or max two such actions. So if two strings are a distance of 1 or 2 such edit-steps apart, it is likely that an typing mistake happened. [Link to follow.]

Damerau-Levenshtein Algorithm

A well-known algorithm to compute the edit-distance between two strings is the Damerau-Levenshtein Distance (Wikipedia). In fact there are two variants of this algorithm: a simpler and a stricter variant.

Download

To follow.

Example Script

An (extremely simplified) script that uses this program could look like this example.sh:

#!/bin/sh

# This set of most likely words is normally derived from a database or file.
WORDSET="PE1AQP G1RLZ DK1AQP"

read -p "Enter your word: " WORD

CHOICE=$(dld -ei $WORD --set="$WORDSET" ) ; count=$?

if [ $count -gt 200 ] 
then
        echo Error occured !
        echo $CHOICE
        exit
elif [ $count -eq 1 ] || [ $count -eq 2 ] 
then
        echo "Did you mean one of these words?"
        echo $CHOICE
        read WORD
fi

echo "Your word was: $WORD"

Running this script a few times looks as follows:

$
$ ./example.sh 
Enter your word: PE0FKO
Your word was: PE0FKO
$
$ ./example.sh 
Enter your word: G1RLZ
Your word was: G1RLZ
$
$ ./example.sh 
Enter your word: DL1AQP
Did you mean one of these words?
*PE1AQP* G1RLZ **DK1AQP**
dd1aqp
Your word was: dd1aqp
$

As intended !

Limitations

My implementation has a few limitations:

It processes only 7-bit ASCII correctly. Feeding it strings with UTF-8 characters above this 7-bit limit causes it to estimate too large distances.
Although 7-bit ASCII is enough for my use-case (handling ham-radio callsigns), I might be tempted to improve on this.
It uses the simpler variant of Damerau-Levenshtein. Overestimating the cost of certain edit-steps.
I noticed this to occur in real data. I will probably improve on this, as well.
Although fixing both issues together might be tricky; we'll see.
The set of epected words is limited by the maximum string length of the shell in question. This is (much!) more than sufficient for my purposes. Other users could be interrested in reading this set of words from a file.

Back to

The Fine Print

My program source code is published under the GPLv3 license.
optparse.h is public-domain software from github/skeeto/optparse,
The algorithm is, of course, developed by Damerau and Levenshtein, see Damerau-Levenshtein Distance (Wikipedia).
This web-page is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.