How markup works: collecting words

Write a sample program to explore how markup systems collect words to fill paragraphs.

February 24, 2025

Markup languages all share one common rule: they collect words and fill paragraphs, and rely on markup to control the formatting. The specifics of the markup and formatting differ based on the system being used (HTML uses tags, LaTeX and TeX use backslash commands, and groff and nroff use dot commands) but at their core, these systems all collect words and fill paragraphs.

Let’s explore one method to collect words in a document.

Working with limited memory

Today’s computers have gigabytes of memory; my computer at home has 32GB of installed RAM. But when authors first started using computers to write documents, computers counted memory in kilobytes. To put that in perspective: 1,024 bytes is a kilobyte, 1,024 kilobytes is a megabyte, and 1,024 megabytes is a gigabyte. That means a modern computer with 32GB has over a million (1,024 × 1,024 = 1,048,576) times the memory of an older computer with 32kB, and the first computers used in document preparation had even less memory than that.

Processing documents in this small amount of memory limits how a program can read the input file. Where a modern application might load the entire file into memory before processing it, original markup systems needed to preserve memory by reading the input one character at a time to collect one word at a time.

Writing a program to collect words requires only a little memory. We need just enough to store the letter most recently read from the input, the word being collected, and the length of that word.

Start with the basics

Let’s start by looking at the most basic prototype: a function to read letters from the input and fill a “word” variable. Let’s call this variable buffer since it’s just a place to store data until we print it. When the buffer is full, we can print it to the output:

void fill_buffer(FILE *in, FILE *out)
{
    int letter;
    char buffer[BUFSIZE];
    int buflen = 0;

    while ((letter = fgetc(in)) != EOF) {
        buffer[buflen++] = letter;

        if (buflen == BUFSIZE) {
            put_string(buffer, buflen, out);
            buflen = 0;
        }
    }

    if (buflen > 0) {                  /* skip the final print if nothing is left */
        put_string(buffer, buflen, out);
    }
}

This is a good starting point because it demonstrates the basic behavior of how a markup system reads input one letter at a time to collect words. In this sample, we use only three variables: letter stores a single letter at a time, buffer holds up to BUFSIZE letters (BUFSIZE is defined elsewhere in the program), and buflen tracks how many letters we’ve saved in the buffer.

The function is just one big while loop that reads one letter at a time using the fgetc function:

    while ((letter = fgetc(in)) != EOF) {
        ...
    }

The while line may look difficult, but it’s easier to understand once you understand that the syntax for the while loop is:

    while (condition) {
        instructions ...
    }

In this case, the “condition” contains another statement:

    (letter = fgetc(in)) != EOF

This does several things at once: letter = fgetc(in) means the program first uses fgetc to read a single letter from the input, and saves it in letter. The fgetc function returns an “end of file” value when it reaches the end of the input, so the != EOF part of the condition tests if we have found the end of the file. As a result, the while loop only executes while there is data to read.
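To make the two steps in that condition easier to see, here is a sketch of the same loop written the long way, with the read and the test on separate lines. The count_letters function is a hypothetical helper for illustration; it is not part of the program in this article. It only counts the letters it reads instead of saving them.

```c
#include <stdio.h>

/* count_letters is a hypothetical helper, not part of buffer.c.
   It reads the input one letter at a time and returns how many
   letters it read, using the "long form" of the loop condition. */
long count_letters(FILE *in)
{
    int letter;                        /* int, not char, so it can hold EOF */
    long count = 0;

    for (;;) {
        letter = fgetc(in);            /* step 1: read one letter */

        if (letter == EOF) {           /* step 2: test for end of file */
            break;
        }

        count++;
    }

    return count;
}
```

The compact while form does the same work in one line, which is why it is the common idiom in C programs that read input one character at a time.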

Inside the loop, the program stores data into the buffer, and increments the buffer length with buflen++. When the buffer’s length equals the buffer’s size, the program calls the put_string function to print the buffer’s contents. Then it resets the buffer’s length to zero, ready to read more data.

At the end, the program prints any extra data that might have been saved in the buffer since the last time it was printed.

This is just one part of a larger program to read data into a buffer one letter at a time and print it out. The full program also needs a main function to process the files on the command line, plus a put_string function to print data:

#include <stdio.h>
#include <ctype.h>                     /* isspace */

#define BUFSIZE 20

void put_string(const char *str, int len, FILE *out)
{
    /* an inefficient fputs */

    for (int i = 0; i < len; i++) {
        fputc(str[i], out);
    }

    fputc('\n', out);
}

void fill_buffer(FILE *in, FILE *out)
{
    int letter;
    char buffer[BUFSIZE];
    int buflen = 0;

    while ((letter = fgetc(in)) != EOF) {
        buffer[buflen++] = letter;

        if (buflen == BUFSIZE) {
            put_string(buffer, buflen, out);
            buflen = 0;
        }
    }

    if (buflen > 0) {                  /* skip the final print if nothing is left */
        put_string(buffer, buflen, out);
    }
}

int main(int argc, char **argv)
{
    FILE *in;

    for (int i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");

        if (in != NULL) {
            fill_buffer(in, stdout);
            fclose(in);
        }
    }

    if (argc == 1) {
        fill_buffer(stdin, stdout);
    }

    return 0;
}

This sample program sets the buffer size to 20 letters, which is too small for practical use but makes the process easy to see in action. Save this source code in a file named buffer.c and compile it, for example with the GNU C Compiler on Linux:

$ gcc -o buffer buffer.c

Processing a single-line “lorem ipsum” file with 124 letters generates 7 lines of output. Except for the last line, the lines are 20 characters long because that is the size of the buffer:

$ ./buffer lorem.txt
Lorem ipsum dolor si
t amet, consectetur 
adipiscing elit, sed
 do eiusmod tempor i
ncididunt ut labore 
et dolore magna aliq
ua.

The spaces at the end of each line are easier to see if we use the tr command to replace each space with an underscore character:

$ ./buffer lorem.txt | tr ' ' '_'
Lorem_ipsum_dolor_si
t_amet,_consectetur_
adipiscing_elit,_sed
_do_eiusmod_tempor_i
ncididunt_ut_labore_
et_dolore_magna_aliq
ua.

Collecting words

We can update this primitive function to read letters from the input and collect words. Instead of a fill_buffer function, we write a find_words function. The internals of this function remain essentially the same: read the file one letter at a time, and save each letter into a buffer variable called word. The key difference is that the function should only store words. Any kind of whitespace, such as a space or tab, marks the end of one word and the start of the next.

void find_words(FILE *in, FILE *out)
{
    int letter;
    char word[WORDSIZE];
    int wordlen = 0;

    while ((letter = fgetc(in)) != EOF) {
        if (isspace(letter)) {         /* whitespace */
            if (wordlen > 0) {         /* found a space after the word */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
        else {                         /* letter */
            word[wordlen++] = letter;

            if (wordlen == WORDSIZE) { /* avoid overflow */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
    }

    if (wordlen > 0) {
        put_string(word, wordlen, out);
    }
}

I’ve included a few comments to help explain how the function works at each step. This version of the function tests each letter with the isspace function, which returns a true value if the character is some kind of whitespace. When the function finds a whitespace character, it uses wordlen to determine if there is data in the word variable; if so, it calls the put_string function to print it. Otherwise, the function continues to save letters into the word variable, only printing its contents if the saved data has filled the buffer.
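To get a feel for what isspace treats as whitespace, here is a small sketch that counts the whitespace characters in a string. The count_spaces function is a hypothetical helper for illustration only. In the default C locale, isspace is true for space, tab, newline, vertical tab, form feed, and carriage return.

```c
#include <ctype.h>

/* count_spaces is a hypothetical helper, not part of words.c.
   It counts the characters in a string that isspace() reports
   as whitespace. */
int count_spaces(const char *str)
{
    int count = 0;

    for (int i = 0; str[i] != '\0'; i++) {
        /* cast to unsigned char: isspace() expects a value in
           that range (or EOF), and plain char may be negative */
        if (isspace((unsigned char) str[i])) {
            count++;
        }
    }

    return count;
}
```

The find_words function can pass letter to isspace without a cast because fgetc already returns each character as an unsigned char value stored in an int.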

At the end, the program prints any extra data that might have been saved in the word buffer since the last time it was printed.

This is the only part of the program that needs to change in order to collect words from the input. The main function and put_string function remain the same:

#include <stdio.h>
#include <ctype.h>                     /* isspace */

#define WORDSIZE 20

void put_string(const char *str, int len, FILE *out)
{
    /* an inefficient fputs */

    for (int i = 0; i < len; i++) {
        fputc(str[i], out);
    }

    fputc('\n', out);
}

void find_words(FILE *in, FILE *out)
{
    int letter;
    char word[WORDSIZE];
    int wordlen = 0;

    while ((letter = fgetc(in)) != EOF) {
        if (isspace(letter)) {         /* whitespace */
            if (wordlen > 0) {         /* found a space after the word */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
        else {                         /* letter */
            word[wordlen++] = letter;

            if (wordlen == WORDSIZE) { /* avoid overflow */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
    }

    if (wordlen > 0) {
        put_string(word, wordlen, out);
    }
}

int main(int argc, char **argv)
{
    FILE *in;

    for (int i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");

        if (in != NULL) {
            find_words(in, stdout);
            fclose(in);
        }
    }

    if (argc == 1) {
        find_words(stdin, stdout);
    }

    return 0;
}

Save this new program as words.c and compile it with your system’s C compiler:

$ gcc -o words words.c

Processing the same single-line “lorem ipsum” file generates 19 lines of output. In this case, each line is exactly one word:

$ ./words lorem.txt 
Lorem
ipsum
dolor
sit
amet,
consectetur
adipiscing
elit,
sed
do
eiusmod
tempor
incididunt
ut
labore
et
dolore
magna
aliqua.

The word count is easier to see if we use the cat command with the -n (number lines) option:

$ ./words lorem.txt | cat -n
     1  Lorem
     2  ipsum
     3  dolor
     4  sit
     5  amet,
     6  consectetur
     7  adipiscing
     8  elit,
     9  sed
    10  do
    11  eiusmod
    12  tempor
    13  incididunt
    14  ut
    15  labore
    16  et
    17  dolore
    18  magna
    19  aliqua.
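
If the count is all you need, one possible variation is to count the words inside the program instead of printing them. This is only a sketch: count_words and its inword flag are hypothetical names, not part of words.c. Because it never stores the letters, it needs no word buffer at all.

```c
#include <stdio.h>
#include <ctype.h>

/* count_words is a hypothetical variation on find_words: it counts
   words instead of printing them, so it needs no word buffer. */
long count_words(FILE *in)
{
    int letter;
    int inword = 0;                    /* 1 while we are inside a word */
    long count = 0;

    while ((letter = fgetc(in)) != EOF) {
        if (isspace(letter)) {         /* whitespace ends the current word */
            inword = 0;
        }
        else if (!inword) {            /* first letter of a new word */
            inword = 1;
            count++;
        }
    }

    return count;
}
```

One difference from find_words: with no buffer to fill, a very long word is counted once here, where find_words would print it in pieces of WORDSIZE letters.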

Collect words and fill paragraphs

This demonstration is only the first step in how a markup system collects words and fills paragraphs, but it is simple enough to show the process in action. To take this to the next level, you might replace the put_string function with a function like add_word to add the current word to a separate variable that stores a line of text before it is printed. After that, adding other functionality such as recognizing markup can support text formatting.
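
As a rough idea of what that next step could look like, here is a minimal sketch of an add_word function. Everything beyond the add_word name is an assumption for illustration: the LINESIZE width of 60, the put_line helper, and the parameter list are my own choices, not a definitive design.

```c
#include <stdio.h>
#include <string.h>

#define LINESIZE 60                    /* assumed line width, for illustration */

/* put_line is a hypothetical helper: print the line, if any, and empty it */
void put_line(const char *line, int *linelen, FILE *out)
{
    if (*linelen > 0) {
        fwrite(line, 1, (size_t) *linelen, out);
        fputc('\n', out);
        *linelen = 0;
    }
}

/* add_word is a sketch: append a word to the line, printing the
   line first if the word would not fit. The line buffer must hold
   at least LINESIZE characters. */
void add_word(const char *word, int wordlen, char *line, int *linelen, FILE *out)
{
    if (wordlen > LINESIZE) {          /* avoid overflow: clip a huge word */
        wordlen = LINESIZE;
    }

    /* a space goes before the word unless the line is empty */
    int need = wordlen + (*linelen > 0 ? 1 : 0);

    if (*linelen + need > LINESIZE) {  /* word won't fit on this line */
        put_line(line, linelen, out);
    }

    if (*linelen > 0) {
        line[(*linelen)++] = ' ';
    }

    memcpy(line + *linelen, word, (size_t) wordlen);
    *linelen += wordlen;
}
```

To try it, find_words would declare a line buffer such as char line[LINESIZE] with an int linelen set to zero, call add_word(word, wordlen, line, &linelen, out) wherever it now calls put_string, and call put_line once after the loop to print the last partial line.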

Jim Hall is an open source software advocate and technical writer. At work, Jim is CEO of Hallmentum, an IT executive consulting company that provides hands-on IT Leadership training, workshops, and coaching.