Parsing plaintext concluded

One more post on the problem of dividing plaintext and then I’ll leave the topic. I decided to try two more ways to divide up text into words. The first method, hillclimbing, was a total failure. It consisted of randomly choosing dividing points and counting how many valid words fell between the spaces, then repeatedly trying one or two random changes, checking to see if more words were produced, and either keeping the new spots or going back to the previous set. I won’t discuss the details, but take my word for it: it bombed.

The other method is to start at the beginning and shrink the string until a word is left at the beginning. For example, if the text you are parsing is “mydogatemylunch”, the program first checks the whole string to see if it’s a word. Since it isn’t, it crops the last letter, tests again, and so on until only “my”, which is a valid word, is left. It saves that, then starts with the next letter, d in this case, and does the same thing until all the letters are used. If no word is found starting at a given letter, that letter is saved as a one-letter “word” and the program moves on to the next letter.
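For readers who want to see the cropping idea in code, here is a minimal Python sketch. It is not my actual program: the function name segment_longest_prefix and the tiny word set are made up for illustration, and a real run would use a full word list.

```python
def segment_longest_prefix(text, words):
    """Split text by repeatedly taking the longest valid word at the current position.

    `words` is assumed to be a set of valid lowercase words (hypothetical input).
    """
    pieces = []
    start = 0
    while start < len(text):
        end = len(text)
        # Crop one letter at a time from the right until the candidate is a word
        # or only a single letter is left.
        while end > start + 1 and text[start:end] not in words:
            end -= 1
        # If nothing matched, end == start + 1 and the lone letter is kept as a "word".
        pieces.append(text[start:end])
        start = end
    return pieces


# Toy word set, just large enough for the example sentence.
toy_words = {"my", "dog", "ate", "lunch", "do", "at", "a"}
print(" ".join(segment_longest_prefix("mydogatemylunch", toy_words)))
```

Because the longest match always wins, a longer word can swallow the start of the next one. That is how “ands” beats “and” in one of the examples below.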

Simply put, the method I described in the previous two posts is to start with valid words from one or more lists and check to see if they are in the subject text. This new method is to take sections of the subject text and see if they are valid words. Neither method is perfect. After testing numerous trial texts, it is clear to me that the previous version (Method A) is better than this new one (B). There are some texts where B performs better, some where they’re equally good or bad, but most cases have A outperforming B. Here are some examples.

Both A and B got this perfect: oneostricheggwillfeedtwentyfourpeopleforbreakfastthejoyofcooking

A got this one perfect: slowandsteadywinstherace. B’s result: slow ands tea d y wins the race. (“ands” is a valid word as in “no ifs, ands, or buts”).

Both got this wrong, but differently: wedrinkallwecantherestwesell
A: we drink all we c anther est we sell. B: we drink all we cant heres t we sell

Lastly, one where B outperformed A: asinthesongfreebirdcouldyou
A: a sin the …   B: As in the …

This exercise has given me a new appreciation for those pros who write autocorrect software. Of course they use AI and have massive data troves to mine, while I used just a few dozen test sentences.

One good thing about trying this new method is that I learned how to determine whether a string is a valid word much more quickly than before. In the past I was just taking a file of words and sequentially checking to see if each matched my test string. That’s reasonably fast if the word is early in the list, but not otherwise. I was using lists ordered by frequency so that the most-used words would be found fast, but it still involved a lot of unnecessary test matches. For this new method I discovered a search method that is probably old hat to programmers, but new to me. Basically you start in the middle of an alphabetized word list and compare strings. If the test string is less than the list word, you do the same with the first half of the list, otherwise with the second half. You keep cutting the search space in half until you find a match or can’t shrink it any further.
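Programmers call this technique binary search. Here is a small Python sketch of it, again not my actual code: the name is_word and the sample list are invented for the example, and the list is assumed to be sorted with plain string comparison.

```python
def is_word(candidate, sorted_words):
    """Return True if candidate is in sorted_words, using binary search.

    sorted_words is assumed to be a list sorted in ascending string order.
    """
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # middle of the remaining search space
        if sorted_words[mid] == candidate:
            return True
        if candidate < sorted_words[mid]:
            hi = mid - 1              # keep searching the first half
        else:
            lo = mid + 1              # keep searching the second half
    return False                      # search space shrank to nothing


sample = sorted(["the", "race", "wins", "slow", "and", "ands", "steady"])
print(is_word("ands", sample), is_word("tead", sample))
```

Each pass halves the list, so a list of 100,000 words needs at most 17 passes, compared with up to 100,000 comparisons for the sequential check. Python's standard bisect module offers the same kind of search if you'd rather not write the loop yourself.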
