Writing a simple text formatter

Let's explore how document processing works by creating a simple version of a text formatting system.

I have used both word processors and document preparation systems for a long time. Each has its own strengths: word processors are great when you need to see how your document will appear on the printed page as you type it. But document preparation systems of the 1980s and early 1990s were often more powerful than the word processor. At the expense of seeing the final page as you type it, the document processor was often capable of very fine document control and formatting, including fonts, at a time when on-screen font rendering was in its infancy.

Document preparation is not "magic." Text formatters operate according to a set of rules, and the author can control the processor's behavior by using special formatting requests or macros, such as different requests to print text in bold, italics, or underline format. Let's explore how document processing works by creating a simple version of a text formatting system.

Word processors and document processors

Since the late 1970s, I have used a desktop word processor. Our first home computer was an Apple II, and we used a simple word processor that supported a few basic word processing functions. In the early 1980s, my family replaced the Apple with an IBM Personal Computer, running a different word processor. While still somewhat limited, the newer word processor supported additional formatting that wasn't possible with the earlier software, although the printer hardware was also a limiting factor.

I used a few different word processors throughout the 1980s. As a student, my primary use case was writing class papers. That primary use continued when I enrolled as an undergraduate university student; I still used desktop word processors to write class papers.

It was during this time that I discovered the university's Unix computer lab, and I started to explore the new (to me) operating system on the campus Sun 3/50 workstations. Unfortunately, "word processing" wasn't a thing on Unix at the time - and our Unix lab didn't have a word processing package - so I learned how to use the existing Unix tools to do something similar. That was my first introduction to document processing.

Creating documents using a document processing system required thinking about writing in a new way. I wouldn't see a version of my finished document on screen in the same format it would be printed on paper. Instead, I used formatting instructions to invoke the formatting that I needed, such as centered lines, bold, and underlined text. I used the original Unix nroff system, which produced output suitable for a typewriter-like device, such as a dot matrix printer. I could have used GNU groff (released a few years earlier in 1990) to produce output for a laser printer, but I discovered a loophole in our campus computer lab policy: you had to pay per page if you wanted to use the laser printer, but printing to the dot matrix printer was free. "Free" is a great deal for an undergraduate student, so I used nroff and produced output for the dot matrix printer.

Bold and underline

To better understand how document processors work, let's examine a very limited text processing system that only supports bold and underlined text on a typewriter-like device. We'll keep this to a simple example so we can focus on just the parts that format text as bold or underlined.

This simple formatting was widely supported by dot matrix printers throughout the 1980s and 1990s. To create boldface type, print a letter, then back up one space and print the same letter on top of it, and repeat for the rest of the word. The result is bold text. Similarly, to produce underlined text, print an underscore, then back up one space and print the letter to be underlined. This overlays the letter on top of the underscore to generate underlined text.

We can represent these processes as a simple algorithm:

Function to make a word underlined (give it a word to underline)
  For each letter in the word:
    Print an underscore, then a "backspace" code, then the letter

And:

Function to make a word bold (give it a word to make bold)
  For each letter in the word:
    Print the letter, then a "backspace" code, then the same letter

Both of these have the same core feature: print a character, then a "backspace," then print another character. The only difference is that bold prints the same letter twice, while underlining uses an underscore as the first character. This shared step can be described with this algorithm:

Function to print a letter (give it the two characters to print)
  Print the first character, then a "backspace" code, then the second character

We can write these algorithms as program code, such as with the C programming language. These C functions produce the same results as the algorithms I've described:

void
fputc2(char c1, char c2, FILE *out)
{
  fputc(c1, out);
  fputc('\b', out);                    /* backsp */
  fputc(c2, out);
}

The fputc2 function (put a character to a file, twice) takes two characters and an output file as input. It prints the first character using the standard C function fputc (put a character to a file), then calls fputc again to print a backspace character. On a typewriter-like printer from the 1980s and 1990s, such as a dot matrix printer, the backspace character backs up over the previous character, so the next character is printed on top of the first. That is what the third fputc call does when it prints the second character.

With this function, we can also write the functions to make words underlined or bold:

void
underline(char *word, FILE *out)
{
  char *p;
  p = word;

  while (p[0]) {
    fputc2('_', p[0], out);
    p++;
  }
}

void
bold(char *word, FILE *out)
{
  char *p;
  p = word;

  while (p[0]) {
    fputc2(p[0], p[0], out);
    p++;
  }
}

Programmatically, these functions can work with words (called strings) of any size. The functions use a computing concept called a pointer (named p) that indicates the next letter in the word. After examining each letter, the bold and underline functions advance the pointer to the next letter of the word by incrementing it with p++.

The functions are also generalized to print output to a file called out. This file could be an actual output file, or it might be the standard output for the program, such as the computer's terminal or screen. Each function is given the same file to write to, so everything prints output to the same place.
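Because the functions write to whatever file they are given, one way to check their output is to point them at a temporary file and read the bytes back. The following is only a sketch under that assumption: capture_bold is a hypothetical helper that is not part of the boldunder program, and fputc2 and bold are repeated here so the example stands on its own.

```c
#include <assert.h>
#include <stdio.h>

/* same as the article's fputc2: overstrike c1 with c2 */
void
fputc2(char c1, char c2, FILE *out)
{
  fputc(c1, out);
  fputc('\b', out);                    /* backsp */
  fputc(c2, out);
}

/* same behavior as the article's bold */
void
bold(char *word, FILE *out)
{
  char *p;

  for (p = word; p[0]; p++) {
    fputc2(p[0], p[0], out);
  }
}

/* hypothetical helper (not in the article): run bold() against a
   temporary file and copy what it wrote into buf as a string.
   Returns the number of bytes captured, or -1 on error. */
int
capture_bold(char *word, char *buf, size_t size)
{
  FILE *tmp;
  size_t n;

  assert(size > 0);                    /* need room for the '\0' */

  tmp = tmpfile();
  if (tmp == NULL) {
    return -1;
  }

  bold(word, tmp);
  rewind(tmp);                         /* switch from writing to reading */

  n = fread(buf, 1, size - 1, tmp);
  buf[n] = '\0';
  fclose(tmp);

  return (int) n;
}
```

For the two-letter word "Hi", the captured bytes are H, backspace, H, i, backspace, i: three bytes of output for every letter of input.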

Processing files and lines

With these functions to print words as underlined or bold, we can create the rest of the text processing system to process entire files. At a high level, the algorithm looks like this:

Program to process files (give it a list of files)
  For each file:
    Open the file
    Print the file (and process underline and bold instructions)
    Close the file

We can construct this program in the C programming language like this:

int
main(int argc, char **argv)
{
  int i;
  FILE *pfile;

  for (i = 1; i < argc; i++) {
    pfile = fopen(argv[i], "r");

    if (pfile != NULL) {
      printfile(pfile, stdout);
      fclose(pfile);
    }
    else {
      fputs("cannot read file: ", stderr);
      fputs(argv[i], stderr);
      fputc('\n', stderr);
    }
  }

  /* if no files, read from stdin */

  if (argc == 1) {
    printfile(stdin, stdout);
  }

  return 0;
}

This program reads the list of files given to it; in the C language, the list of files is called an argument vector and is represented by argv. The program processes each file in turn: it opens the file, prints the file, then closes the file. If no files are provided on the command line, the program reads from the computer's standard input, such as the user typing text at the keyboard, or the output from another program.

We also need to describe a function to print the contents of a file. Along the way, this function should process any special instructions to format lines of text as underlined or bold:

Function to print a file (give it a file to work on)
  For each line in the file:
    If the line starts with a request to make it underlined:
      Print each word on the line as underlined
    If the line starts with a request to make it bold:
      Print each word on the line as bold
    Otherwise:
      Print the line as given

This is where we need to make a decision: What should we use as a request to apply formatting? The RUNOFF and nroff text processors use requests on a new line that start with a period, such as .ul to provide underline on a typewriter-like device.

But text processing systems don't need to use a period for their special requests. You can use any character that is not likely to appear on its own within a document. For our simple case, we can choose any single character that wouldn't normally start a new line. To demonstrate that any character will do the job, let's pick something other than the period. In this example, @U will indicate a line should be underlined, and @B requests to print a line as bold.
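To make the idea of a request concrete, here is a minimal sketch of one way a program might recognize these lines. Everything named here (enum request, the requests table, and the classify function) is hypothetical and only for illustration; it is not how the boldunder program is written. The sketch assumes a request is the first word on a line and is followed by a space, tab, or newline.

```c
#include <assert.h>
#include <string.h>

/* hypothetical names, for illustration only */
enum request { REQ_NONE, REQ_UNDERLINE, REQ_BOLD, REQ_UNKNOWN };

struct reqname {
  const char *name;
  enum request req;
};

/* adding a new request means adding one row to this table */
const struct reqname requests[] = {
  { "@U", REQ_UNDERLINE },
  { "@B", REQ_BOLD }
};

enum request
classify(const char *line)
{
  size_t i, len;

  assert(line != NULL);

  if (line[0] != '@') {
    return REQ_NONE;                   /* ordinary text */
  }

  /* length of the first word on the line */
  len = strcspn(line, " \t\n");

  for (i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
    if (strlen(requests[i].name) == len &&
        strncmp(line, requests[i].name, len) == 0) {
      return requests[i].req;
    }
  }

  return REQ_UNKNOWN;                  /* starts with @, but not a known request */
}
```

With this sketch, a line such as "user@example.com" is ordinary text because the @ is not the first character, while "@Bold text" is reported as an unknown request rather than being mistaken for @B.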

One way to write such a procedure is to use the getline function. Originally a GNU extension, getline was standardized in POSIX.1-2008, so it is part of POSIX rather than the C language standard itself. It reads text from a file one line at a time and manages memory on its own, making it very easy to read lines of any length:

void
printfile(FILE *in, FILE *out)
{
  char *line = NULL;
  size_t linesize = 0;
  ssize_t linelen;

  char *word;

  while ((linelen = getline(&line, &linesize, in)) > -1) {
    if (line[0] == '@') {
      /* control line */
      word = strtok(line, DELIM);

      if (strcmp(word, "@U") == 0) {
        while ((word = strtok(NULL, DELIM)) != NULL) {
          underline(word, out);
          fputc(' ', out);
        }
        fputc('\n', out);
      }
      else if (strcmp(word, "@B") == 0) {
        while ((word = strtok(NULL, DELIM)) != NULL) {
          bold(word, out);
          fputc(' ', out);
        }
        fputc('\n', out);
      }
    }
    else {
      /* print the line */
      fputs(line, out);
    }
  }

  free(line);
}

The printfile function processes a file one line at a time. If the line begins with the special character @, the function checks if the line is a request to make text underlined (@U) or bold (@B); if so, the function processes each word on that line to make it either underlined or bold. The code that reads each word is repeated for both requests; it would be cleaner to move that code to a separate function, but the repetition is at least easy to understand.

The function prints other lines as they are given, without any special handling.
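The repeated word loop could be factored out by passing the formatting function as a parameter. This is a sketch of that idea, not the article's code: printwords and printcontrol are hypothetical names, and fputc2, underline, and bold are repeated so the example stands on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DELIM " \t\n"

void
fputc2(char c1, char c2, FILE *out)
{
  fputc(c1, out);
  fputc('\b', out);                    /* backsp */
  fputc(c2, out);
}

void
underline(char *word, FILE *out)
{
  char *p;

  for (p = word; p[0]; p++) {
    fputc2('_', p[0], out);
  }
}

void
bold(char *word, FILE *out)
{
  char *p;

  for (p = word; p[0]; p++) {
    fputc2(p[0], p[0], out);
  }
}

/* hypothetical helper: format the remaining words of the line that
   strtok() is already working on, then end the line */
void
printwords(void (*format)(char *, FILE *), FILE *out)
{
  char *word;

  assert(format != NULL);

  while ((word = strtok(NULL, DELIM)) != NULL) {
    format(word, out);
    fputc(' ', out);
  }
  fputc('\n', out);
}

/* hypothetical helper: handle one control line (modifies line) */
void
printcontrol(char *line, FILE *out)
{
  char *word = strtok(line, DELIM);

  if (word == NULL) {
    return;
  }

  if (strcmp(word, "@U") == 0) {
    printwords(underline, out);
  }
  else if (strcmp(word, "@B") == 0) {
    printwords(bold, out);
  }
  /* as in printfile, unrecognized requests print nothing */
}
```

With these helpers, the control-line branch of printfile would shrink to a single call such as printcontrol(line, out), and a new request would need only one more else-if branch.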

Putting it all together

We've defined all the pieces of a program to process a file and make lines underlined or bold. To turn this into a final version, we need to assemble everything into one file and compile it. However, I've left out the header files and a definition, which I'll add here to the completed program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DELIM " \t\n"

void
fputc2(char c1, char c2, FILE *out)
{
  ...
}

void
underline(char *word, FILE *out)
{
  ...
}

void
bold(char *word, FILE *out)
{
  ...
}

void
printfile(FILE *in, FILE *out)
{
  ...
}

int
main(int argc, char **argv)
{
  ...
}

Save that program as boldunder.c and compile it on a Linux system like this:

$ gcc -o boldunder boldunder.c

Let's test it with a sample file. This input file contains three lines: the first line is a request to print in bold, the second is a request to print underlined text, and the third line is just normal text.

@B This text is in bold.
@U And this text is underlined.
The rest of the text is printed normally.

If we process the file using the boldunder program, the output will have the formatting we want. On some Linux terminals, you may see the output formatted directly on the screen; if you don't see the correct output, try viewing the output using a file pager like the less program. This is how the final output should appear:

$ ./boldunder file
This text is in bold.
And this text is underlined.
The rest of the text is printed normally.
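If the formatting is hard to see on your terminal, it can help to make the backspace characters visible. The function below is a small sketch for that purpose only; showbackspaces is a hypothetical name and is not part of the boldunder program. It copies its input to its output and writes each backspace as the two visible characters ^H.

```c
#include <assert.h>
#include <stdio.h>

/* hypothetical helper: copy in to out, showing each backspace as ^H */
void
showbackspaces(FILE *in, FILE *out)
{
  int ch;

  assert(in != NULL && out != NULL);

  while ((ch = fgetc(in)) != EOF) {
    if (ch == '\b') {
      fputs("^H", out);
    }
    else {
      fputc(ch, out);
    }
  }
}
```

Wrapped in a main function that calls showbackspaces(stdin, stdout) and given the output of boldunder, this would show the first bold word, "This", as T^HTh^Hhi^His^Hs.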

Processing files

Document preparation approaches the document creation process differently than word processors. Word processors of the past traded some fidelity for features, or some features for performance. In that same era of the 1970s and 1980s, document preparation systems provided sophisticated document formatting, at a level not possible with contemporary word processors. Document processing systems like nroff and LaTeX made it easy to create complex technical and scientific documents, with the tradeoff of not seeing the final document until after you've processed it.

While modern word processors easily allow you to see the final version of the document as you type it, sometimes a document preparation system is the right tool. But document preparation shouldn't be a "black box." By exploring a simple version of a document formatter with this boldunder program, we can see how things work "behind the scenes" to process files into documents.