Check spelling on Linux with this cool script

Take a look back at Unix history with these old-school commands to check spelling in a document.

Almost from its inception at Bell Labs, Unix served as a document processing platform. Used by technical writers and editors across Bell, Unix provided powerful tools to work on documents. But the spell checker - an important tool for technical writers - didn't appear until 1974 with Unix 5th Edition.

Technical writers before Unix 5th Edition had to check spelling in a different way. Here's how they did it.

Separate into words

To understand how writers checked spelling on Unix, we first need to remember that documents were written using the nroff or troff document preparation systems. These documents are just plain text files with special command codes to control formatting and layout. Because everything was plain text, you could use almost any tool to analyze nroff and troff source files, including checking the spelling.

To spell check a document, we need to break up the document into individual words and compare each word to a dictionary of correctly spelled words. Let's start with the Unix commands that break up a plain text file into separate words. The tr command (introduced with Unix 4th Edition in 1973) translates characters from one set to another. For example, this command translates the uppercase letters in the test.me source file to lowercase letters and writes the result to the tmp1 output file:

$ tr 'A-Z' 'a-z' < test.me > tmp1

Spaces can be turned into line breaks by using a similar tr command:

$ tr ' ' '\n' < tmp1 > tmp2

The tr program can also delete characters using the -d option. Punctuation and other special characters are not part of any word, so we can delete them:

$ tr -d '",.:;()?!_' < tmp2 > tmp3

We don't need to read from and write to temporary files. Unix 3rd Edition introduced a powerful feature, pipes, to string multiple commands together. With pipes, the output from one command is immediately directed into the next command, like this:

$ tr 'A-Z' 'a-z' < test.me | tr ' ' '\n' | tr -d '",.:;()?!_' > tmp3
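To see what this pipeline produces, here is a minimal sketch that runs the same three tr commands on a single line of text instead of test.me. The sample sentence and the variable names are made up for illustration.

```shell
# Hypothetical one-line sample standing in for test.me
sample='This is a smple Document, (mostly) plain.'

# Lowercase, split on spaces, then delete punctuation,
# in the same order as the pipeline above
words_out=$(printf '%s\n' "$sample" \
  | tr 'A-Z' 'a-z' \
  | tr ' ' '\n' \
  | tr -d '",.:;()?!_')

# Print the result: one lowercase word per line
printf '%s\n' "$words_out"
```

Each word lands on its own line, which is the form the later sorting and comparing steps expect.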

Make them unique

Many of the words will likely be repeated throughout the document. You don't need to look up twenty instances of "the" or "and" - you only need to check them once if they are all spelled the same.

The sort command is one of the oldest Unix tools, from Unix 1st Edition. Use it to sort the list of words alphabetically. Then you can use the uniq program, which first appeared in Unix 3rd Edition, to eliminate repeated entries. Because uniq only compares adjacent lines, the list must be sorted first so that identical words sit next to each other.

Using the output from the previous tr commands, the second step removes repeated words and saves the result to a new temporary file:

$ sort tmp3 | uniq > tmp4
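As a small illustration of why the sort has to come first, the sketch below runs uniq with and without sort on a made-up word list. The word list and variable names are hypothetical.

```shell
# Hypothetical word list with repeats, one word per line
word_list=$(printf '%s\n' the cat and the dog and the bird)

# uniq alone only drops adjacent repeats, so nothing is removed here
without_sort=$(printf '%s\n' "$word_list" | uniq)

# Sorting first puts identical words next to each other
unique_words=$(printf '%s\n' "$word_list" | sort | uniq)

printf '%s\n' "$unique_words"
```

The sorted version keeps one copy each of and, bird, cat, dog, and the, while the unsorted version still contains every repeat.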

Find misspelled words

Once we have a list of words from the original document, we need to look each one up in the dictionary. If the word exists in the dictionary, it's a valid word. If it's not there, it's probably misspelled.

The comm program fits this role very well. comm compares two sorted files, one line at a time, and generates three columns of text: lines that are unique to the first file, lines that are unique to the second file, and lines that are common to both. Both files must be sorted in the same order for the comparison to work.

Consider the list of words from the document we want to spell check; that's the first file. We want to compare this to a dictionary of correctly-spelled words; that's the second file. When we use comm to compare the two files, we want the list of (misspelled) words that appear in the first file, but do not appear in the list of (correct) words from the dictionary. Using the command line options comm -2 -3 will suppress the second and third columns, only displaying the list of words from the first file that do not exist in the dictionary.

$ comm -2 -3 tmp4 words
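Here is a minimal sketch of that comparison on two tiny files. The file contents are invented for illustration, and the sketch works in a scratch directory created with mktemp (an assumption; it is widely available on Linux but was not part of early Unix).

```shell
# Scratch directory so the sketch does not touch real files
dir=$(mktemp -d)

# Hypothetical sorted word list from a document (stands in for tmp4)
printf '%s\n' a document is smple this > "$dir/tmp4"

# Hypothetical tiny dictionary, also sorted (stands in for words)
printf '%s\n' a document is sample this > "$dir/words"

# Keep column 1 only: lines in tmp4 that are not in the dictionary
misspelled=$(comm -2 -3 "$dir/tmp4" "$dir/words")
printf '%s\n' "$misspelled"

rm -r "$dir"
```

Only smple is printed, because every other word in the list also appears in the dictionary.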

Putting it all together

The power of pipes lets us connect all of the programs together into one command line. This separates an nroff document called test.me into words, sorts them, finds the unique words, and compares the list against the dictionary. The output is a list of likely misspelled words:

$ tr 'A-Z' 'a-z' < test.me | tr ' ' '\n' | tr -d '",.:;()?!_' | sort | uniq | comm -2 -3 - words


Most of the words this reports are actually spelled correctly, but the last line shows that the word "sample" is misspelled as "smple" in my original document:

.de (V
.ls 1
.de )V
.uh "Test document"
.ls 2
This is a smple nroff document to demonstrate how to change line
spacing and back again. Let's say I wrote a document that used
double line spacing but included plain text equations with a
custom macro that uses single spacing:
                  1    2
  x = x  + v  t + _ a t
       0    0     2
The first paragraph will use double line spacing, the equation
will be single spaced and not use fill, and the following
paragraph (this one) will go back to double line spacing.
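Back at the shell, the one-line pipeline can be wrapped in a small function so it is easier to reuse. This is only a sketch: the function name spellcheck and the default dictionary path /usr/share/dict/words are assumptions, and lowercasing and re-sorting the dictionary is an extra step not in the original pipeline, added so the dictionary matches the lowercased word list and the order that comm expects.

```shell
# spellcheck FILE [DICT]: print words in FILE that are not in DICT.
# The function name and default dictionary path are assumptions;
# /usr/share/dict/words exists on many Linux systems but not all.
spellcheck() {
  file=$1
  dict=${2:-/usr/share/dict/words}

  # Extra step (not in the original pipeline): lowercase and re-sort
  # the dictionary so it matches the lowercased, sorted word list
  sorted_dict=$(mktemp)
  tr 'A-Z' 'a-z' < "$dict" | sort | uniq > "$sorted_dict"

  # Same stages as the pipeline: split into words, strip punctuation,
  # sort, remove repeats, then keep words missing from the dictionary
  tr 'A-Z' 'a-z' < "$file" \
    | tr ' ' '\n' \
    | tr -d '",.:;()?!_' \
    | sort | uniq \
    | comm -2 -3 - "$sorted_dict"

  rm -f "$sorted_dict"
}
```

With the function defined, running spellcheck test.me checks the sample document against the default dictionary, and a second argument selects a different word list.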