raspberry-pi Collect words and fill paragraphs

All text processors collect words and fill paragraphs, let’s explore what that means.

When you process a document using any kind of text processor, whether that’s one that supports minimal markup like Markdown, or a system that uses some kind of request or instruction system like LaTeX or groff, or one that uses explicit tags such as HTML or XML, the basic action is the system collects words and fills paragraphs. If the system supports other formatting, it will interpret certain tags or other instructions to modify the output. But at the core, all text processing systems collect words from the input and fill paragraphs on the output.

This is easier to understand by writing a little code. Let’s explore a very simple text processing system that doesn’t understand any formatting requests, and only collects words and fills paragraphs. We’ll call this program fill because that’s what it does, in a very simple way.

Basic outline

To process text files, our program will need to follow a basic outline. We can describe the flow in this way:

  1. Read each file
  2. Read lines in each file
  3. Split each line into words
  4. If we can print the next word to the output without going over some “maximum line length,” then do it
  5. Otherwise, start a new line and continue

That’s a very simple description of the process to fill paragraphs. We can describe this using a form of pseudocode, which provides more detail about how to break up the activities into functions:

main program {

  for every file {
    "fill" the file's contents
  }

}

fill_file function {

  output line length = 0

  for every line in the file {
    "fill" the line's contents
    keep track of output line length
  }

}

fill_line function {

  if the line is empty {
    add a blank line
    done
  }

  for each word in the line {
    if adding the word makes the output line too long {
      start a new line
    }

    if we're not at the start of a new line {
      add a space
    }

    add the word

    keep track of output line length
  }

}

The pseudocode uses { and } brackets to indicate the start and end of certain procedures. This should make the pseudocode more readable, even if you don’t know how to write a computer program. Notice that each “unit of work” in this program is broken out into “functions,” which make it easier to separate out how the program works.

Reading the files

Let’s turn this pseudocode into a program. The first step is to write a “main” program that reads all the files; this program will then call a separate function called fill_file to actually process the file’s contents.

int
main(int argc, char **argv)
{
  int i;
  FILE *pfile;

  for (i = 1; i < argc; i++) {
    pfile = fopen(argv[i], "r");

    if (pfile == NULL) {
      fputs("cannot open file: ", stderr);
      fputs(argv[i], stderr);
      fputc('\n', stderr);
    }
    else {
      fill_file(pfile, stdout);
      fclose(pfile);
    }
  }

  if (argc == 1) {
    fill_file(stdin, stdout);
  }

  return 0;
}

I wrote this program in the C programming language because C is such a common programming language. You should be able to find a C compiler on any system you use today, including Linux, Windows, and Mac.

C uses a concept called pointers to track open files. The * on the FILE *pfile line tells C that the pfile variable is actually a pointer to a file, which is also why I used the p prefix to the variable’s name.

The rest of the program should be relatively easy to understand: It loops through all parameters from the command line, which are stored in the argv (argument vector) array, opens the file, and uses the fill_file function to process the file. If the program doesn’t find any command line options, it reads from the standard input (such as the keyboard) and prints to the standard output (like the screen).

The contents of files

To process the contents of each file, we need to write a function called fill_file that reads the file one line at a time, then uses another function to process the lines. This function is quite short because it only reads lines and lets the other function do all the work:

void
fill_file(FILE *in, FILE *out)
{
  char *line = NULL;
  size_t line_size = 0;
  size_t line_len = 0;

  while (getline(&line, &line_size, in) != -1) {
    line_len = fill_line(line, line_len, out);
  }

  fputc('\n', out);                    /* break current line */

  free(line);
}

In this example, I used the getline function to read input from the file. In a very simple example, I could have used a different function called fgets to get a string from a file, but the fgets function is very limited and requires making assumptions about how long each line might be. In contrast, the getline function is more dynamic, and can read lines that are both very short or very long. The tradeoff to using getline is that it can use more memory to read the data, but this is not usually a problem on modern computers.

You might notice that the fill_file function is basically a single loop that gets the next line from the file, then uses fill_line to process it. And that’s all this function needs to do.

Words on a line

The real work of the fill program happens in the fill_line function. The function splits the line into words, and determines if that word can be added to the output line. If adding the word would make the line too long, the function starts a new line and adds the word there.

size_t
fill_line(char *input, size_t line_len, FILE *out)
{
  char *word;
  size_t word_len;

  /* break up the input into words */

  word = strtok(input, DELIM);

  while (word) {
    word_len = strlen(word);

    if ((line_len + 1 + word_len) > MAX_LEN) {
      fputc('\n', out);                /* break current line */
      line_len = 0;
    }

    if (line_len > 0) {
      fputc(' ', out);
      line_len++;
    }

    fputs(word, out);
    line_len += word_len;

    /* get next word */

    word = strtok(NULL, DELIM);
  }

  return line_len;
}

There’s a lot going on here, but if you look carefully, you should notice that the function works on each “rule” separately. After it gets the first word from the line, the function tests if printing that word would make the output line too long. If so, it breaks the line and starts a new one:

    if ((line_len + 1 + word_len) > MAX_LEN) {
      fputc('\n', out);                /* break current line */
      line_len = 0;
    }

Then, it adds a space to the output. It only makes sense to add a space between words on a line; that means the function should not add a space at the start of a new line, so it checks the line length first. If the line length is greater than “zero,” then it adds a space:

    if (line_len > 0) {
      fputc(' ', out);
      line_len++;
    }

Finally, the function prints the word to the output. It also updates the line length, then reads the next word on the line:

    fputs(word, out);
    line_len += word_len;

    /* get next word */

    word = strtok(NULL, DELIM);

All of this activity happens within a while loop, and the loop keeps repeating only while there is another word on the line. When it runs out of words, the function returns the current line length.

Recognizing empty lines

However, this fill_line function has a limitation: it collects words from each line of input. And if the line is empty, it doesn’t print anything and reads the next line. This will mean that the function will collect words from the input and fill one big paragraph on the output.

To fix this behavior so the program will collect words and fill paragraphs, one paragraph at a time, the fill_line function needs to recognize when it has read a blank line. One way to do this is to use the strlen function to determine the length of the input line. If it’s zero, it’s a blank line.

But a blank line might not be of zero length; a blank line could just contain empty spaces or tabs. To recognize non-zero length blank lines, we need a new function called is_empty that reads a line and determines if it contains only “empty” characters. If the string is entirely empty, the function should return a “true” value like 1; if the string contains any non-empty character, the function should return a “false” value like 0.

int
is_empty(char *str, char *empty)
{
  char *s;
  s = str;

  while (s[0]) {
    if (strchr(empty, s[0]) == NULL) {
      /* this char is not in the empty list,
         that means this is a non-empty char */
      return 0;
    }
    s++;
  }

  /* reached the end without finding a non-empty char,
     that means this is an empty string */
  return 1;
}

This function uses pointers, so it might not be immediately obvious what it is doing. I’ve included several comments to help explain the behavior. The function relies on the strchr function from the C standard library, which finds the first occurrence of a character in a string. Here, the is_empty function looks at each character in the line, and if that character does not appear in the list of “empty” characters, the function returns “false” because it has found a non-empty character in the line. If the function reaches the end of the line without finding any non-empty characters, then the entire line is empty and the function returns a “true” value.

With this function, we can update the fill_line function to recognize if an input line is empty, and start a new line:

size_t
fill_line(char *input, size_t line_len, FILE *out)
{
  char *word;
  size_t word_len;

  if (is_empty(input, DELIM)) {
    if (line_len > 0) {
      fputc('\n', out);                /* break current line */
    }

    fputc('\n', out);                  /* blank line */
    return 0;
  }

  /* break up the input into words */

  word = strtok(input, DELIM);

  while (word) {
    word_len = strlen(word);

    if ((line_len + 1 + word_len) > MAX_LEN) {
      fputc('\n', out);                /* break current line */
      line_len = 0;
    }

    if (line_len > 0) {
      fputc(' ', out);
      line_len++;
    }

    fputs(word, out);
    line_len += word_len;

    /* get next word */

    word = strtok(NULL, DELIM);
  }

  return line_len;
}

Collect words and fill paragraphs

Each function in this program does a single step in the overall algorithm to collect words from the input and fill paragraphs on the output. By breaking up the program into these smaller functions, it’s easier to see how the algorithm works: the program reads files one at a time, then uses a function to process the files; that function reads each file one line at a time, then uses another function to process the line; and that last function reads each word one at a time, and fills paragraphs so each line is less than MAX_LEN is length.

The entire program looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DELIM " \t\n"

#define MAX_LEN 66

int
is_empty(char *str, char *empty)
{
...
}

size_t
fill_line(char *input, size_t line_len, FILE *out)
{
...
}

void
fill_file(FILE *in, FILE *out)
{
...
}

int
main(int argc, char **argv)
{
...
}

Use this program whenever you need to “refill” the paragraphs in a plain text file. For example, when working on a Markdown file - and especially after several edits where you might add, delete, or change words within the text - the lines are likely to be of odd lengths. While this isn’t a problem for Markdown, it can be distracting to work on. Using a program like fill can make the text easier to read.

For example, let’s say we have this short Markdown file that contains a header and a brief paragraph. But each line is a different length, and we have extra spaces between some of the words:

$ cat file.md
# This   is   a   test   file

This is the start of a new paragraph,
which   should   start   on   a   new   line.
Some more text to cause a line break.

With the fill program, we can collect words and fill paragraphs so each line is up to about 66 characters long. The fill program “hard-coded” the line length at 66 characters because using a hard value made the program easier to write.

$ ./fill file.md
# This is a test file

This is the start of a new paragraph, which should start on a new
line. Some more text to cause a line break.

To save the new version of the file, first copy the file to some temporary name, then use fill on that old file and save the contents to the new file:

$ mv file.md file.old
$ ./fill file.old > file.md
$ cat file.md
# This is a test file

This is the start of a new paragraph, which should start on a new
line. Some more text to cause a line break.