How to check spelling on old-school Unix
How did technical writers check spelling on original Unix systems? Let's learn how to make our own tools to spell-check documents.
A few weeks ago, I shared how to check spelling on Linux with a shell script. This mimicked the way technical writers would look for misspelled words in documents on the original Unix system in the 1970s. The script used a set of common Unix commands, including tr, sort, uniq, and comm, to compare the words from a document against a dictionary of correctly spelled words.
The sort command was an original program from Unix 1st Edition (November 1971), uniq arrived in Unix 3rd Edition (February 1973), and tr and comm were both introduced in Unix 4th Edition (November 1973). Yet writers at Bell Labs had been using Unix to write patents, articles, and other technical documents since 1st Edition.
Let's explore how these technical writers checked the spelling in their documents in 1971 or 1972, when tools like tr, uniq, and comm didn't yet exist.
Build your own tools
The general process to check the spelling in a text document is this:
- Separate a document into words, one word per line
- Sort the list of words
- Remove duplicate lines
- Check the list against a dictionary to find misspelled words
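For example, if the input were the two short sentences "The quick fox jumps. The lazy dog sleeps." then the word list after step 1 (left column), after sorting in step 2 (middle), and after removing duplicates in step 3 (right) would look like this:
the       dog       dog
quick     fox       fox
fox       jumps     jumps
jumps     lazy      lazy
the       quick     quick
lazy      sleeps    sleeps
dog       the       the
sleeps    the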
Unix 1st Edition included the sort command, but not the other tools. We will need to build our own versions of these tools to do the work. Some technical writers at the time knew how to write simple programs to automate their work, so while this wasn't a common task for a technical writer, it was not unexpected.
The "make words" program
Unix 2nd Edition (June 1972) provided the first C compiler. We can use the C programming language to write a simple program that scans its input and attempts to break up text into words, with one word per line. The PDP-11 that ran the Unix system at Bell Labs in 1972 had very limited memory compared to today's computers; programmers tried to use as little memory as possible for their programs. In this sample makewords.c program, we only need to examine the input one character at a time:
#include <stdio.h>
#include <ctype.h> /* isalpha, tolower .. or write your own */

main()
{
    int c;
    int newline;

    newline = 1; /* start "on a new line" so leading punctuation
                    doesn't print an empty first line */

    while ( (c = getchar()) != EOF ) {
        if (isalpha(c)) {
            putchar(tolower(c));
            newline = 0;
        }
        else {
            if (!newline) {
                putchar('\n');
                newline = 1;
            }
        }
    } /* while */

    if (!newline) {
        putchar('\n'); /* end the last word if the input didn't */
    }
}
Note that the program uses original-style C programming syntax, similar to what could be used in Unix 2nd Edition. This is just a simple demonstration program. A more serious implementation would improve on the basic algorithm (for example, using buffers to read data), but this version should be easy enough for non-programmers to understand.
This program reads one character at a time from standard input. When the program encounters an uppercase or lowercase letter, it prints the letter to the output in lowercase. Otherwise, the program starts a new line. The effect is that the program breaks up a document into words, with one word per line.
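For example, on a modern Linux system you might compile and test the program like this (the sample.txt file name is just for illustration):
$ cc -o makewords makewords.c
$ echo 'Hello, world! This is a TEST.' > sample.txt
$ ./makewords < sample.txt
hello
world
this
is
a
test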
The "dedupe" program
The dedupe.c program assumes a sorted list of words. It reads from the list and stores only two lines in memory: the current line and the previous line. If the two lines match, the program only prints the first one.
#include <stdio.h>
#include <string.h> /* strncmp .. or write your own */

main()
{
    char line[60], prev[60]; /* be careful! */

    prev[0] = 0;

    /* print lines that aren't the same as the previous line */
    while (fgets(line, 60, stdin) != NULL) {
        if (strncmp(line, prev, 60) != 0) {
            fputs(line, stdout); /* line already has '\n' */
        }
        strncpy(prev, line, 60);
    }
}
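As a quick check, assuming the program is compiled as dedupe, a sorted list with repeated words should come back with each word listed once:
$ printf 'apple\napple\nbanana\ncherry\ncherry\n' > sorted.txt
$ ./dedupe < sorted.txt
apple
banana
cherry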
The "non-words" program
The nonwords.c program reads a list of words from its input and compares each word to a dictionary of correctly spelled words. To make the program run faster, the dictionary should already be sorted. With this assumption, if the program attempts to find the misspelled word "teh," it can stop when it reaches words that start with "u," safe in the knowledge that "teh" is not in the dictionary.
#include <stdio.h>
#include <string.h> /* strncmp .. or write your own */

int lookup(word, wordlist)
char word[];
FILE* wordlist;
{
    char s[60]; /* be careful! */

    /* look up the word in the (sorted) wordlist file */
    while (fgets(s, 60, wordlist) != NULL) { /* be careful! */
        if (strncmp(word, s, 60) == 0) {
            return 1; /* found it! */
        }
        if (s[0] > word[0]) {
            return 0; /* past this letter, so not in wordlist */
        }
    }
    return 0; /* not found */
}

int main()
{
    char word[60]; /* be careful! */
    FILE* dict;

    if ( (dict = fopen("words", "r")) == NULL ) {
        puts("cannot open word list file");
        return 1;
    }

    /* read a list of words from stdin and look up each one in the
       dictionary. print words NOT in the dictionary. */
    while (fgets(word, 60, stdin) != NULL) { /* be careful! */
        if (lookup(word, dict) != 1) { /* not found */
            fputs(word, stdout); /* word already has '\n' */
        }
        rewind(dict); /* back to the beginning */
    }

    fclose(dict);
    return 0;
}
This program is slightly more complicated because it needs to look up each word in the dictionary. To simplify the process, the main function opens the dictionary file and rewinds it to the start before looking up each new word. This isn't the fastest way to look up words in a file, but it requires very little memory. The lookup function takes the word to examine and a file pointer to the dictionary, and scans the dictionary for the word.
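For a quick test, assuming the compiled program is named nonwords, we can create a tiny three-word dictionary as a stand-in for a real words file, plus a short word list with one misspelling:
$ printf 'apple\nbanana\ncherry\n' > words
$ printf 'apple\nbanna\ncherry\n' > wordlist
$ ./nonwords < wordlist
banna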
Note that dedupe and nonwords both use statically allocated string variables. While this is not great programming practice, the longest word in the words dictionary file ("pneumonoultramicroscopicsilicovolcanoconiosis") is 45 characters long, so using 60-character strings is a safe assumption here.
Check your spelling
With these programs, and with the existing sort command from Unix 1st Edition, we can check the spelling in text documents.
- The makewords program breaks apart the text file and lists each word on a separate line
- The sort command sorts the list alphabetically
- The dedupe program removes duplicate words
- The nonwords program checks the list of words against the dictionary, and prints misspelled words
Unix 1st Edition and 2nd Edition did not support pipes, a neat feature introduced in Unix 3rd Edition that lets one program read its input from the output of another. In the earliest days of Unix history, you needed to use the < metacharacter to specify that a command should read input from a file, and the > character to send the output to a different file. This requires typing each command on a separate line:
$ ./makewords < test.me > tmp1
$ sort < tmp1 > tmp2
$ ./dedupe < tmp2 > tmp3
$ ./nonwords < tmp3
nf
nroff
smple
The nf (no fill) nroff instruction is not listed in the dictionary, nor is the nroff command name itself. However, these are both valid in my test.me document. But the last line of output indicates that the word "sample" is misspelled:
.de (V
.ls 1
.nf
..
.de )V
.ls
.fi
..
.uh "Test document"
.ls 2
.pp
This is a smple nroff document to demonstrate how to change line
spacing and back again. Let's say I wrote a document that used
double line spacing but included plain text equations with a
custom macro that uses single spacing:
.(V
                1    2
x = x  + v  t + _ a t
     0    0     2
.)V
.pp
The first paragraph will use double line spacing, the equation
will be single spaced and not use fill, and the following
paragraph (this one) will go back to double line spacing.
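Once pipes arrived in Unix 3rd Edition, this spell check no longer needed the temporary files. The same programs, unchanged, chain together on a single command line on any system with pipes, including a modern Linux system:
$ ./makewords < test.me | sort | ./dedupe | ./nonwords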