Check spelling on Linux with this cool script
Take a look back at Unix history with these old-school commands to check spelling in a document.
Almost from its inception at Bell Labs, Unix was a document processing platform. Used by technical writers and editors across Bell, Unix provided powerful tools to work on documents. But the spell checker, an important tool for technical writers, didn't appear until 1974 with Unix 5th Edition.
Technical writers before Unix 5th Edition had to check spelling in a different way. Here's how they did it.
Separate into words
To understand how writers checked spelling on Unix, we first need to remember that documents were written using the nroff or troff document preparation systems. These documents are just plain text files with special command codes to control formatting and layout. Because everything was plain text, you could use almost any tool to analyze nroff and troff source files, including checking the spelling.
To spell check a document, we need to break the document into individual words and compare each word to a dictionary of correctly spelled words. Let's start with the Unix commands that break a plain text file into separate words. The tr command (introduced with Unix 4th Edition in 1973) translates characters from one set to another, like this command that translates uppercase letters from the test.me source file to lowercase letters in the tmp1 output file:
$ tr 'A-Z' 'a-z' < test.me > tmp1
Spaces can be turned into line breaks by using a similar tr command:
$ tr ' ' '\n' < tmp1 > tmp2
The tr program can also delete characters using the -d option. Punctuation and similar special characters are not part of any word, so they can be deleted:
$ tr -d '",.:;()?!_' < tmp2 > tmp3
We don't need to read from and write to temporary files. Unix 3rd Edition introduced a powerful feature, pipes, to string multiple commands together. With pipes, the output from one command is immediately directed into the next command, like this:
$ tr 'A-Z' 'a-z' < test.me | tr ' ' '\n' | tr -d '",.:;()?!_' > tmp3
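To see what each stage contributes, here is a minimal sketch that runs the same three tr commands on a short inline sample instead of a file. The function name split_words and the sample sentence are made up for this illustration.

```shell
# Hypothetical helper: apply the three tr stages to standard input.
split_words() {
    tr 'A-Z' 'a-z' |        # fold uppercase letters to lowercase
    tr ' ' '\n' |           # turn each space into a line break
    tr -d '",.:;()?!_'      # delete punctuation characters
}

# Sample text invented for this example.
printf 'Hello, World! (Hello again.)\n' | split_words
```

This should print hello, world, hello, and again, one per line. Note that a run of several spaces becomes several line breaks, so a document with extra spacing will produce empty lines in the list.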
Make them unique
Many of the words will likely be repeated throughout the document. You don't need to look up twenty instances of "the" or "and" - you only need to check them once if they are all spelled the same.
The sort command is one of the oldest Unix tools, from Unix 1st Edition. Use it to sort the list of words alphabetically. Then use the uniq program, which first appeared in Unix 3rd Edition, to compare adjacent lines in the sorted list and eliminate repeated entries.
Using the output from the previous tr commands, this second step removes repeated words and saves the result to a new temporary file:
$ sort tmp3 | uniq > tmp4
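The sort step matters because uniq only collapses duplicate lines that sit next to each other. A small sketch with an invented word list shows the effect; the function name unique_words is hypothetical.

```shell
# Hypothetical helper: reduce a one-word-per-line list to its unique words.
unique_words() {
    # uniq only removes adjacent duplicates, so sort the list first
    sort | uniq
}

# Word list invented for this example.
printf 'the\nquick\nthe\nfox\nthe\n' | unique_words
```

This should print fox, quick, and the, one per line. On current systems, sort -u gives the same result in a single step, but the two-command form matches the tools described here.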
Find misspelled words
Once we have a list of words from the original document, we need to look each one up in the dictionary. If the word exists in the dictionary, it's a valid word. If it's not there, it's probably misspelled.
The comm program fits this role very well. comm compares two sorted files, one line at a time, and generates three columns of text: lines that are unique to the first file, lines that are unique to the second file, and lines that are common to both.
Consider the list of words from the document we want to spell check; that's the first file. We want to compare this to a dictionary of correctly spelled words; that's the second file. When we use comm to compare the two files, we want the list of (misspelled) words that appear in the first file but do not appear in the list of (correct) words from the dictionary. The command line options comm -2 -3 suppress the second and third columns, displaying only the list of words from the first file that do not exist in the dictionary.
$ comm -2 -3 tmp4 words
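A tiny example makes the column suppression easier to follow. This sketch builds two short sorted files in a scratch directory; the file names and word lists are invented, and it assumes the mktemp command is available, which is true on most current systems.

```shell
# Work in a scratch directory so the demo leaves nothing behind.
demo_dir=$(mktemp -d)

# Both inputs must already be sorted the same way for comm to work.
printf 'document\nsample\nsmple\n' > "$demo_dir/doc_words"
printf 'a\ndocument\nsample\ntext\n' > "$demo_dir/dict_words"

# Keep only column 1: lines in doc_words that are not in dict_words.
misspelled=$(comm -2 -3 "$demo_dir/doc_words" "$demo_dir/dict_words")
printf '%s\n' "$misspelled"

rm -r "$demo_dir"
```

This should print only smple, the one word in the first file that the second file does not contain.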
Putting it all together
The power of pipes lets us connect all of the programs together into one command line. This separates an nroff document called test.me into words, sorts them, finds the unique words, and compares the list against the dictionary. The output is a list of likely misspelled words:
$ tr 'A-Z' 'a-z' < test.me | tr ' ' '\n' | tr -d '",.:;()?!_' | sort | uniq | comm -2 -3 - words
+
=
0
1
let's
nf
nroff
smple
Most of these are not real misspellings: they are equation symbols, digits, an nroff request, or words missing from the dictionary. But the last line shows that the word "sample" is misspelled in my original document:
.de (V
.ls 1
.nf
..
.de )V
.ls
.fi
..
.uh "Test document"
.ls 2
.pp
This is a smple nroff document to demonstrate how to change line
spacing and back again. Let's say I wrote a document that used
double line spacing but included plain text equations with a
custom macro that uses single spacing:
.(V
                1    2
x = x  + v  t + _ a t
     0    0     2
.)V
.pp
The first paragraph will use double line spacing, the equation
will be single spaced and not use fill, and the following
paragraph (this one) will go back to double line spacing.
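The one-line pipeline can be wrapped in a small shell function so it is easier to reuse. This is only a sketch under a few assumptions: the function name spellcheck is made up, the dictionary is a plain file with one word per line, and mktemp is available. Because comm needs both inputs sorted the same way, the sketch also lowercases and sorts a scratch copy of the dictionary, a step the original pipeline leaves to the reader.

```shell
# Hypothetical wrapper around the pipeline above.
# Usage: spellcheck FILE DICTIONARY
# Example (assumes a word list at this common location):
#   spellcheck test.me /usr/share/dict/words
spellcheck() (
    doc=$1
    dict=$2

    # Assumption: the dictionary may be unsorted or mixed case, so normalize
    # a scratch copy the same way the document's words are normalized.
    sorted_dict=$(mktemp) || exit 1
    tr 'A-Z' 'a-z' < "$dict" | LC_ALL=C sort | uniq > "$sorted_dict"

    # LC_ALL=C makes sort and comm agree on a plain byte ordering.
    tr 'A-Z' 'a-z' < "$doc" |
        tr ' ' '\n' |
        tr -d '",.:;()?!_' |
        LC_ALL=C sort |
        uniq |
        LC_ALL=C comm -2 -3 - "$sorted_dict"

    rm -f "$sorted_dict"
)
```

The function body runs in a subshell, so its variables do not leak into the calling shell. Like the original command line, it reports anything that is not in the dictionary, including nroff requests and equation symbols.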