Parsing undivided plaintext

In my last post I gave a few examples of my attempts at writing a program to divide up undivided text into words. Since then, I’ve been working on the program. It’s doing better. Here are examples from that post and how the program divides them now.

what is christmas it is tenderness for the past courage for the present hope for the future.

they were indeed a queer looking party that assembled on the bank the birds with draggle d feathers the animals with their fur clinging close to them and all dripping wet cross and uncomfortable

Not perfect, but much better. So how did I approach this task? I’m not going to provide code, just discuss my thinking process. I decided to start with long words first since it is relatively rare for long words to accidentally appear, that is by juxtaposition with smaller words. The opposite is clearly not true. Small words appear in longer words all the time. Separating out small words like to and in first would break up almost every sentence incorrectly.

I used a word list to go through the text and inserted spaces before and after every found word starting with length 24. Whenever it found a word, it would effectively blank out that stretch so it could not be used in searching for words farther down the list. I also used word lists that were ordered by frequency so that it found the words most likely to appear before obscure words took up that space. In my first iteration, that’s about all I did and the examples I gave in my last post show the limitations of such an approach.

My next step was to identify common errors that this approach produced, such as “I twill” instead of “it will.” Twill is a five-letter word, so it’s found before the more common word will, and that leaves only the letter i. That breaking looks fine to the program since I and twill are both valid words. I created a list of such examples, mostly involving very small words such as “ha snot= has not”, “o four = of our” and so forth. The program checks that list after its initial parsing and fixes anything occurring there. Creating that list was a time-consuming process and is still ongoing. The only way to do it is to test many examples using the program and judge by eye. It’s not always easy. For example, is “a man” better than “am an”? Should it be changed or not? I use Google Ngram as guidance for such hard cases. I call this my “fixit” list.

This improved things, but many errors still appeared. Most common were those where the “S form” of a word (i.e. the plural of a noun or third person singular of a verb)  was found when the correct form is without the final S. I wrote a routine to find and try to correct such case. One class that was easy to find was words with a final S. I went through the separated text pair by pair. Whenever word 1 ended with an S, I tested the frequency of that word followed by word 2, then removed the S from the first word and tacked it onto word 2 and tested the frequency of that pair. Which ever scored best, I saved. The data for such word pair frequencies can again be obtained from Google Ngram. This doesn’t always work. For example “westernsquare” initially breaks into “westerns qua re,” all valid words. Swapping the S to the qua yields “western squa” not square, so the parsing does not change. “Collisionsport” breaks up as “collisions port” without this step, but the program successfully changes it to “collision sport.”

Lastly, I did the same thing for every low-scoring pair, but with an additional test. I set an arbitrary limit, X, and tested every successive pair. If the frequency was below X, I tried shifting the last letter of the first word to the beginning of word 2 to compare, and I also shifted the first letter of word 2 to the end of word 1 and whichever of the original and two variations scored highest, I kept. The improvements have been subtle, but real. The downside is that it slows down parsing tremendously. Without this final word pair improvement step, the parsing is essentially instantaneous.  With it, a sentence often takes ten or fifteen seconds. I will continue to fiddle with the value of X. The higher I put it, the more word pairs get tested and the longer the program takes. I have to weigh speed versus accuracy.

I’ll continue to add common errors to the fixit list, and to add missing words to my lists, including proper nouns, but longer lists add time. I also found it necessary to remove some words, like “doth.” It’s a valid word, but rarely used today and it causes parsing errors with “do -the -they -those,” etc.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.