I’ve described various computer solving techniques in earlier blogs, but I only briefly mentioned scoring using word lists. For most computer applications I use tetragram scoring because it’s fast and easy to implement. But it’s not always the most accurate way. If your solving program already “knows” the word break points then it’s relatively easy and fast to look up each trial string to see if it’s a word. I’ll get to that method in a minute. But what about the many cases where the trial output is one continuous string without breaks?
There are two approaches to that situation. The first is to begin at the first letter and keep appending letters until the string is no longer a word. For example, if the trial output begins “THEMAN” this program would go through “THEM,” which it recognizes as a word, and assume the break point comes after the M and begin looking for the next word at AN…. Or, if the word THEMA (Latin, but used in English) is in your reference word list, it will break there and start looking for the next word at the N. As you can see, this does not always find words as you would by eye. You naturally think it begins with the two words THE MAN or possibly THE MANY, etc. If it doesn’t find a word starting with the first letter, it moves on and does the same with letter 2 and so forth. The scoring algorithm won’t give an accurate number if it is breaking the text incorrectly. The final score is based on how many total letters are found to be in words.
The second approach is to start with longest words first. Start with the longest word length in your available lists and then shorten the search length incrementally. Let’s say you start with length 24, the maximum word length in my normal lists. You check to see if the first 24-letter string is a valid word. If it is, you blank that out and test the next string beginning at letter 25. If not, you do the same beginning at letter 2, and so on. You probably won’t find any words of that length, but the next cycle, you try 23-letter words, and so on. Whenever you find a word, be sure you eliminate that section of the output from future testing so that it’s not double counted using shorter words found within the longer ones. The final score in this version is also the total number of letters that are in words.
Neither method is perfect. Take a sentence beginning TWOSTRINGS. The first program will find TWOS T RINGS, dropping the T from the count. The second method would get it right. But with BUTOUR, the second method gets BU TOUR, losing the BU, while the first method gets BUT OUR, the correct parse. I’ve written programs both ways and in my experience this second way is best.
Now for the question of how to determine if a string is a word. Of course it is very simple to write a program that reads from an alphabetical list and when it comes to the point where your test string should appear, determine whether or not it is there. This works reasonably fast on very short lists or if your candidate begins with an A or B. Otherwise, it is quite slow. Another approach is to use a hash function. I’m not going to discuss that since I don’t use one. If you already know what this is and how to use one, you don’t need me, and if you don’t, I can’t tell you. I believe Python and maybe other languages have this built in and it’s fast even for unordered lists but is complicated and has some computing overhead to build a hash table.
But there’s a fast method for Delphi, the language I use, which I believe is available in other languages, too. My FindWord function takes a string S as input and returns a true if it’s a word and false if it is not. It depends on using the AnsiCompareStr function. This function takes two strings as input, and compares them. If the first one is less than (i.e. alphabetically earlier than) the second, it returns a negative number, if it’s greater, it returns a positive number, and if they are the same, a zero. To use this, first you load your reference word list in alphabetical order and count the number of entries; call that X. This only needs to be done once when the program is loaded not each time the function is called. Your search function will begin its first comparison of S with the word in the middle, i.e. at position 1/2 x. If the test word is less than the list word, you next try S with the word at 1/4 x. If the first comparison shows S is greater than the list word then your second test is against the word 3/4 x. In other words, you keep splitting the search domain in half with every search. When you get a match, the function returns a True and stops looking. When you get down to the last word without a match, the function returns a false. It’s written recursively so that it automatically keeps calling itself until one of those two points is reached. For a list of 80,000 words, the maximum number of comparison you’d have to do is 17.
To make it even faster, don’t load all the words into a single list. Use a separate list for each word length. I sometimes use a two-dimensional array ranging from 2 to 24 for the word length, and 1 to 13000 for the word count. Since there are more 8-letter words than any other length and there are just under 13,000 words in that list. I treat one-letter words separately in my function, giving a False for anything besides A or I. This cuts the maximum number of searches to 14 for length 8 and for most strings, much fewer.
This can be refined even more. I’ll discuss that in my next post.