PE1AQP's program to compute edit-distances.
Table of Contents
This little project was triggered by the needs of a (Linux, bash) script used for my ham-radio purposes. It could also be useful for other purposes, so I decided to publish it. Even so I will use ham-radio terminology on this web-page.
Introduction
I have a little (Linux-, bash-) shell script that processes ham-radio callsigns, given by a user. These callsigns could be anything, but are almost always one of a known small set.
So I wanted a way to check if the given callsign could be a misspelling of one of the known set. if so, we should ask the user if this was indeed intended.
EXAMPLE Assume that we know that the given callsign could be anything, but is almost always one of "PE1AQP", "DK1AQP" or "G1RLZ". - If a user enters a clearly different string, say "PE0FKO" We should accept that string. - If a user enters a correct, known string "G1RLZ" We should also accept that string. - But when the user enters a possibly mistyped string "DL1AQP" We should ask him or her: Did you perhaps mean "DK1AQP" or even "PE1AQP" ?
This is very similar to the spellchecking feature of most editors, email-programs etc., etc.. However, I could not find a stand-alone, command-line oriented program that could be called from a script that provides this feature.
So I had to write my own (which I enjoyed doing anyhow). The resulting program might just be of interest to others.
String Similarity, Dissimilarity and Edit-Distance
At first sight, one would require an algorithm to compute the similarity between two strings. Easier to define is, however, the degree in which two strings are dissimilar. Such string-dissimilarity can be defined as the number of edit steps required to go from the first to the second string.
A single edit-step is one of these actions:
- Remove a character: "AB" -> "A"
- Insert a new character: "AB" -> "AXB"
- Replace an existing character with another : "AB" -> "AX"
- Swap two characters: "AB" -> "BA"
The number of these steps between two strings is also known as the edit-distance between these strings.
Studies have shown that most typing mistakes people make when entering text consist of one or max two such actions. So if two strings are a distance of 1 or 2 such edit-steps apart, it is likely that an typing mistake happened. [Link to follow.]
Damerau-Levenshtein Algorithm
A well-known algorithm to compute the edit-distance between two strings is the Damerau-Levenshtein Distance (Wikipedia). In fact there are two variants of this algorithm: a simpler and a stricter variant.
Download
To follow.
Example Script
An (extremely simplified) script that uses this program could look like this example.sh
:
#!/bin/sh # This set of most likely words is normally derived from a database or file. WORDSET="PE1AQP G1RLZ DK1AQP" read -p "Enter your word: " WORD CHOICE=$(dld -ei $WORD --set="$WORDSET" ) ; count=$? if [ $count -gt 200 ] then echo Error occured ! echo $CHOICE exit elif [ $count -eq 1 ] || [ $count -eq 2 ] then echo "Did you mean one of these words?" echo $CHOICE read WORD fi echo "Your word was: $WORD"
Running this script a few times looks as follows:
$ $ ./example.sh Enter your word: PE0FKO Your word was: PE0FKO $ $ ./example.sh Enter your word: G1RLZ Your word was: G1RLZ $ $ ./example.sh Enter your word: DL1AQP Did you mean one of these words? *PE1AQP* G1RLZ **DK1AQP** dd1aqp Your word was: dd1aqp $
As intended !
Limitations
My implementation has a few limitations:
- It processes only 7-bit ASCII correctly. Feeding it strings with
UTF-8 characters above this 7-bit limit causes it to estimate too
large distances.
Although 7-bit ASCII is enough for my use-case (handling ham-radio callsigns), I might be tempted to improve on this. - It uses the simpler variant of Damerau-Levenshtein. Overestimating the
cost of certain edit-steps.
I noticed this to occur in real data. I will probably improve on this, as well. - Although fixing both issues together might be tricky; we'll see.
- The set of epected words is limited by the maximum string length of the shell in question. This is (much!) more than sufficient for my purposes. Other users could be interrested in reading this set of words from a file.
Back to
The Fine Print
- My program source code is published under the GPLv3 license.
optparse.h
is public-domain software from github/skeeto/optparse,- The algorithm is, of course, developed by Damerau and Levenshtein, see Damerau-Levenshtein Distance (Wikipedia).
- This web-page is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.