raspberry-pi Writing a simple text formatter (part 2)

A further demonstration of how document preparation works behind the scenes, with a program to fill a paragraph with text.

I recently shared some sample code to demonstrate writing a simple text formatter. This was a simple document preparation system that only displayed text in bold or underline on a typewriter-like device, such as a dot matrix printer or Teletype terminal.

As a demonstration of the basics of text formatting, this worked well. To underline each letter in a word for a line printer, the program prints an underscore character, then a backspace code, then the letter to be underlined. The result is underlined text. Similarly, to display  bold text, the program prints the letter, then a backspace, then the same letter. On a typewriter-like printer, this overstrike method creates bold text.

Limitations of the text formatter

The sample text formatter only prints text in bold or underline. It does not perform the more common task of document preparation: collect words and fill paragraphs. To perform this function, we might use a BSD Unix tool that's now common across many Unix and Linux systems: fmt will reformat a plain text document to fill out paragraphs. You can set the output to use a preferred line length, or fmt will use a default line length of 75 columns.

Unfortunately, the Linux version of the fmt command doesn't account for backspace characters, so you can't use fmt to reformat text for printing after you use bold and underline. For example, let's start with this sample file that prints one line in bold, another as underlined, and the third line as plain text:

$ cat file
@B This text is in bold.
@U And this text is underlined.
The rest of the text is printed normally.

Formatting this text with the boldunder program we created in the other article, and sending the output to the fmt command, results in a paragraph that does not appear to be filled to 75 columns:

$ ./boldunder file | fmt
This text is in bold.  And
this text is underlined.
The rest of the text is printed normally.

The Linux fmt program interprets the backspace code as a regular character, so using backspace adds to the length of each word. This is easier to see if we use tr to translate backspaces into some other character, such as the "less than" symbol:

$ ./boldunder file | tr '\b' '<' | fmt
T<Th<hi<is<s t<te<ex<xt<t i<is<s i<in<n b<bo<ol<ld<d.<.  _<A_<n_<d
_<t_<h_<i_<s _<t_<e_<x_<t _<i_<s _<u_<n_<d_<e_<r_<l_<i_<n_<e_<d_<.
The rest of the text is printed normally.

If we want to emulate a document processing system that collects words and fills paragraphs, in combination with the program to format text as bold or underlined, we need to write our own. This is an opportunity to further demonstrate how document preparation systems work. Here's how to do it.

A program to refill text

At a high level, this program needs to read one or more files - and for each file, it collects words and fills paragraphs. It does that one line at a time. We can describe this using words as a kind of pseudo-code that expresses the intent of the program in a way that should be readable to everyone:

Program to collect words and fill paragraphs: For each file:  Read one line at a time  Examine each line, one word at a time  Keep track of the length of the line being printed  If the next word doesn't make the output line too long, then print it  Otherwise, start a new line and print the word

Count letters in a word

To start, we must write a procedure to count the letters in a word, taking into account that a character followed by backspace shouldn't count towards the total word length. The general algorithm can be described like this:

Function to count the length of a word (give it a word to work on)  For each letter in the word:    If the letter is a normal character,      Add one to the word length.    If the character is a backspace code,      Subtract one from the word length.

We can write such a function using any programming language. I've written a sample function called strlength using the C language. This function "walks" through the word (also called a string in programming) using a special programming concept called a pointer (named p) and examines every letter in the word. It uses the standard C function isprint to determine if the letter is a printable character; if it is, the function adds one to the length. If the letter is instead a backspace code, it subtracts one from the length:

int
strlength(char *s)
{
  char *p;
  int length = 0;

  p = s;

  while (p[0]) {
    if (isprint(p[0])) {
      length++;
    }
    else if (p[0] == '\b') {           /* backsp */
      if (--length < 0) {
        length = 0;                    /* min is zero */
      }
    }
    /* else .. nonprinting chars don't count to length */

    p++;
  }

  return length;
}

Fill lines one at a time

We can use that function to write a new procedure to fill lines, one input line at a time. This new function, named linefill, breaks up a line (also called a string) into separate words using the built-in C function strtok, according to a set of delimiters like spaces and tabs. Using the strlength function we wrote earlier, it determines if there is enough space left on the line to include the new word; if so, it prints the word. Otherwise, it starts a new line and prints the word there:

int
linefill(char *s, int len, FILE *out)
{
  char *word;
  int wordlen;
  int length;

  length = len;

  word = strtok(s, DELIM);

  while (word != NULL) {
    wordlen = strlength(word);

    if ((length + 1 + wordlen) < MAXLEN) {
      if (length > 0) {
        fputc(' ', out);               /* space */
        length++;                      /* space */
      }
    }
    else {
      fputc('\n', out);                /* newline */
      length = 0;
    }

    length += wordlen;
    fputs(word, out);

    word = strtok(NULL, DELIM);
  }

  return length;
}

Read lines one at a time

Now that we have a process that can manage one line at a time, we need to create a procedure called filefill that will read one line at a time from a file and use the linefill function to turn it into a paragraph. We can use the getline function from the C library to read a single line from a file, which saves a copy of the line in a variable called line. The rest of the function is a loop that keeps reading lines until it reaches the end of the file:

void
filefill(FILE *in, FILE *out)
{
  char *line = NULL;
  size_t linesize = 0;
  ssize_t linelen;
  int length = 0;

  /* read file one line at a time */

  while ((linelen = getline(&line, &linesize, in)) > -1) {
    length = linefill(line, length, out);
  }

  fputc('\n', out);                    /* trailing new line */

  free(line);
}

If you are familiar with C programming, you may notice that this doesn't do anything special for new paragraphs. The entire file becomes one long paragraph. And that's okay for this demonstration.

To start a new paragraph when we reach a blank line in the input file, we would need to add some code after getline to detect if the line is empty so it can start printing on a new line. However, I've omitted that from this example because I want to focus on how a text formatter works at the most basic level; I'll leave out the more advanced features.

Putting it all together

With the basic functions now defined, we can create a short program that reads a list of files and processes each of them with filefill. If there are no files to read, the program can read from the standard input, which could be the user's keyboard:

int
main(int argc, char **argv)
{
  int i;
  FILE *pfile;

  for (i = 1; i < argc; i++) {
    pfile = fopen(argv[i], "r");

    if (pfile == NULL) {
      fputs("cannot read file: ", stderr);
      fputs(argv[i], stderr);
      fputc('\n', stderr);
    }
    else {
      filefill(pfile, stdout);
      fclose(pfile);
    }
  }

  /* if no files, then read from stdin */

  if (argc == 1) {
    filefill(stdin, stdout);
  }

  return 0;
}

And now we can put the entire program together:

#include <stdio.h>
#include <stdlib.h>

#include <ctype.h>
#include <string.h>

#define DELIM " \t\n"

#define MAXLEN 60

int
strlength(char *s)
{
..
}

int
linefill(char *s, int len, FILE *out)
{
..
}

void
filefill(FILE *in, FILE *out)
{
..
}

int
main(int argc, char **argv)
{
..
}

Save that program as fill.c and compile it on a Linux system like this:

$ gcc -o fill fill.c

Let's test it with the sample file from the start of the article. If we first use the boldunder program to underline any lines that start with @U and make bold any lines that start with @B, we can send the output to the new fill program to collect words and fill a paragraph. Each line in the paragraph will be a maximum of 60 characters long, not counting the backspace controls:

$ ./boldunder file | ./fill
This text is in bold. And this text is underlined. The rest
of the text is printed normally.

If you aren't able to see the bold and underlined text directly in your terminal, try sending the output to a pager program, such as the less command.

Document preparation in a nutshell

The two programs, boldunder and fill, demonstrate the basics of document preparation. This is a different method than word processing, where you can see a version of the final document as you type it; you cannot see the final document as you type it when you use a document preparation system. The tradeoff is that document processing systems allowed technical writers to create documents at a time before video terminals; seeing a final version of a document as you entered it was impossible in this era.

We shouldn't think of document preparation systems as "magic." These text formatting systems operated according to a set of rules, and we've presented a simple implementation of those rules with boldunder and fill. One program only formats text as underlined or bold, the other collects words and fills a paragraph. Together, they form the essentials to process files into printed documents.