Understanding Unix ‘nroff’

Explore a simple implementation to understand how this important document preparation system works “under the hood.”

When I first explored Unix systems as a university student, I learned how to use nroff and troff to write class papers. Nroff made it easy to write documents that I could print on dot matrix printers or other “typewriter-like” output devices. I used troff (actually, GNU groff) to create output suitable for a laser printer.

But while I understood how it worked, using the different macros and requests, I didn’t understand why it worked. I had access to the GNU groff source code, but it was too advanced for me to understand at the time. More recently, though, I’ve decided to take a step back to understand the basics of how nroff worked “under the hood.”

What is nroff?

If you haven’t used nroff, Brian Kernighan (formerly of Bell Labs, now at Princeton) explained the brief history of nroff and troff in our interview with him last year:

Nroff is for basically ASCII text in a fixed-width font; it mimics ancient terminals like the Model 37 Teletype and line printers of the 1970s. Troff is for typesetters, which provide variable-width characters in multiple fonts, which is what we see in print today.

The first such program was RUNOFF, written by Jerry Saltzer at MIT in maybe 1966; it was (in my experience) the first such formatting program and provided the model for a lot of subsequent ones. In 1968, I wrote a simpler version for printing my PhD thesis at Princeton that I called “roff,” but I think Doug McIlroy at Bell Labs wrote a different version in the same style that he also called “roff.” Nroff and troff adopted the same style of formatting commands, but were significantly more powerful and could do more complicated formatting, especially page layout.

The original version of troff by Joe Ossanna in about 1973 was aimed at a particular typesetter, the Graphics Systems Model C/A/T. Ditroff (“device independent troff”) was my updated version of troff that produced output that was independent of any particular typesetter.

Kernighan mentioned RUNOFF as an inspiration for the nroff system. I also interviewed Jerry Saltzer about the history of the RUNOFF program, where he explained:

RUNOFF was certainly not the first document preparation system. It borrowed ideas from several predecessors, including DITTO for CTSS, and COLOSSAL TYPEWRITER and JUSTIFY for the PDP-1.

RUNOFF is not a document preparation system by itself, it is one of a pair of programs, named TYPSET and RUNOFF. TYPSET was used to create or edit a document and the resulting file was then run through RUNOFF to format it for printing.

In that same interview, I highlighted features of RUNOFF, including its control words: “A RUNOFF document used control words or their abbreviation on a new line, starting with a period. For example, .center (or the abbreviation .ce) centers the next line, .begin page (or .bp) inserts a page break, and .literal (or .li) prints the next line as normal text even if it starts with a period.”

RUNOFF produced just the formatting that Saltzer needed to write his Sc.D. thesis proposal in 1963, and later to prepare his thesis document. RUNOFF was really a personal project, not meant to become a larger, more flexible document preparation system. But it inspired others to create those more advanced, more flexible systems. Bell Labs implemented a version of RUNOFF called roff, using just the abbreviated commands. Roff was used throughout the Unix team to create printed documents intended for a “typewriter-like” device, such as a Teletype or line printer.

In his book, Unix: A History and Memoir (2020), Kernighan relates that the Unix team created nroff as an offshoot of another project: The Unix team wanted to purchase a new computer system to keep working on Unix, but management said no. Around the same time, the Patents department planned to purchase another computer system to help them write patent applications, which required strict, specific formatting. The Unix team made a deal: the Patents department bought a new computer for the Unix team, and the Unix team updated roff to become nroff (“new roff”). Combined with a macro system, nroff and the later troff (“typesetter roff,” intended for use with a phototypesetter) could provide fine control over the formatting of all kinds of documents.

A basic outline

In their book, Software Tools (1976), Brian Kernighan and P. J. Plauger describe how to create a variety of useful tools, by way of teaching algorithms and program construction. One such program is called Format, a simplified version of nroff.

Kernighan and Plauger use a programming language called Ratfor, or “Rational FORTRAN,” which may appear to modern readers as a mix of C programming and pseudocode. It is actually a pre-processor that takes C-like input and generates code for FORTRAN 66 compilers. However, because not many people use Ratfor these days, I will instead present an updated outline of Format using a more general pseudocode.

First, let’s understand the basics of the program. This Format program reads one or more files, then collects words and fills paragraphs. Blank lines are preserved, but documents should really use formatting instructions like .sp to add a blank line of extra space or .ti 4 to add 4 spaces of temporary indenting, such as to start a new paragraph. These instructions always appear on a line by themselves, and at the start of a line.

At a high level, the main program looks like this:

main()
{
    for each file:

        for each line in the file:

            if first character is '.', then:
                command(line)

            else:
                text(line)
}

The text function reads text from the input and adds it to the output, writing full lines as it goes. The command function processes a formatting instruction, which usually updates some parameter in memory (like the page length, right margin, temporary indent, or another value).
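To make the outline more concrete, here is a minimal C sketch of that main loop. It is an illustration under my own assumptions, not Kernighan and Plauger's code: format_stream is a hypothetical name, the 1024-byte line buffer is an arbitrary limit (longer lines would be split), and command and text are placeholders that only count the lines they receive so the example stands on its own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counters, only here so the sketch has something to show. */
int ncommands = 0;
int ntextlines = 0;

/* Placeholder: a real version would parse the formatting instruction. */
void command(const char *line)
{
    (void) line;
    ncommands++;
}

/* Placeholder: a real version would collect words and fill lines. */
void text(const char *line)
{
    (void) line;
    ntextlines++;
}

/* Read every line from one input stream; dispatch on the first character. */
void format_stream(FILE *in)
{
    char line[1024];                   /* assumed maximum line length */

    assert(in != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == '.') {
            command(line);             /* formatting instruction */
        }
        else {
            text(line);                /* ordinary text, including blank lines */
        }
    }
}
```

Calling format_stream(stdin) would process standard input; a complete program would call it once for each file named on the command line.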

Processing commands

In their book, Kernighan and Plauger define these commands that the Format program can process (p. 233):

command   break?   default   what it does
.bp n     yes      n = +1    begin page numbered n
.br       yes                cause break
.ce n     yes      n = 1     center next n lines
.fi       yes                start filling
.fo       no       empty     footer title
.he       no       empty     header title
.in n     no       n = 0     indent n spaces
.ls n     no       n = 1     line spacing is n
.nf       yes                stop filling (“no fill”)
.pl n     no       n = 66    set page length to n
.rm n     no       n = 60    set right margin to n
.sp n     yes      n = 1     space down n lines
.ti n     yes      n = 0     temporary indent of n
.ul n     no       n = 1     underline words from next n lines

The command function evaluates the first word on the line, and takes actions based on what it is. For example:

command(line)
{
    look at the first word:

        if ".br", then:
            brk()

        or, if ".fi", then:
            brk()
            param.fill = YES

        or, if ".nf", then:
            brk()
            param.fill = NO

        or, if ".sp", then:
            get next word as "val"
            space(val)

        or, if ".ls", then:
            get next word as "val"
            param.ls = val

        or, if ".he", then:
            get rest of line as "str"
            param.header = str

        or, if ".fo", then:
            get rest of line as "str"
            param.footer = str

        ... and so on for the other commands
}

Some commands simply set a parameter, such as .ls, .he, and .fo. Other commands set a value but also take an action, such as .fi and .nf, which cause a break by calling the brk() function. And other commands do something immediately, like the .sp command, which adds blank lines to the output.
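As a rough C sketch of this kind of dispatch, the example below handles a few of the commands from the table. The names struct params, is_cmd, get_val, and apply_command are my own inventions for illustration, not taken from the book. Instead of calling brk() directly, apply_command returns 1 when the command should cause a break, so the caller can decide how to flush the output. Unknown commands are silently ignored here, and relative values such as +1 are not treated specially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of the formatting parameters. */
struct params {
    int fill;   /* 1 = fill output lines, 0 = copy lines as-is */
    int ls;     /* line spacing */
    int rm;     /* right margin */
};

/* Does the line start with the command "name", followed by space or end? */
int is_cmd(const char *line, const char *name)
{
    size_t n = strlen(name);

    return strncmp(line, name, n) == 0 &&
           (line[n] == '\0' || line[n] == ' ' ||
            line[n] == '\t' || line[n] == '\n');
}

/* Return the number after the command name, or defval if there is none. */
int get_val(const char *line, int defval)
{
    const char *p = line;
    char *end;
    long val;

    while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n') {
        p++;                        /* skip over the command name */
    }

    val = strtol(p, &end, 10);      /* strtol skips leading whitespace */
    return (end == p) ? defval : (int) val;
}

/* Update the parameters for one command; return 1 if it causes a break. */
int apply_command(struct params *param, const char *line)
{
    assert(param != NULL && line != NULL);

    if (is_cmd(line, ".br")) { return 1; }
    if (is_cmd(line, ".fi")) { param->fill = 1; return 1; }
    if (is_cmd(line, ".nf")) { param->fill = 0; return 1; }
    if (is_cmd(line, ".ls")) { param->ls = get_val(line, 1); return 0; }
    if (is_cmd(line, ".rm")) { param->rm = get_val(line, 60); return 0; }

    return 0;                       /* not handled in this sketch */
}
```

The defaults passed to get_val (1 for .ls, 60 for .rm) come from the table above, so a bare .rm resets the right margin to 60.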

Breaking lines

By dividing these actions into separate functions, the Format program is easier to write because the actions are isolated into distinct units. For example, the brk() function evaluates the output that it’s been preparing to print, stored in a global variable called output, and prints it if there’s anything there, then clears the variable in memory:

void brk(void)
{
    if (length(output) > 0) {
        puts(output);
    }

    output[0] = 0;
}

I’ve written the brk() function using the C programming language, because it’s fairly simple and it’s good to see an implementation. Even if you don’t know C programming, you can probably infer that length is a function that determines the length of a string, and puts prints the string to the output.

Others who are more familiar with C might wonder why I didn’t use strlen to find the length; it’s because the output variable might contain backspaces, which strlen doesn’t understand and treats like normal characters. But the Format program might use backspaces to create underlined or bold type, so we need the new length function to accommodate that:

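size_t length(const char *str)
{
    size_t len, i;
    unsigned char ch;

    len = 0;

    for (i = 0; str[i]; i++) {
        ch = (unsigned char) str[i];

        if (ch == '\b') {                  /* backspace: step back one column */
            if (len > 0) {
                len--;
            }
        }
        else if (ch >= 32 && ch != 127) {  /* printable (skip DEL) */
            len++;
        }
    }

    return len;
}

I’ve included an extra comparison to ensure that a character is printable, and not a control code. ASCII defines codes 0 through 31 and code 127 (DEL) as control codes, and 32 through 126 as printable characters. (Anything above 127 is technically an extended character outside ASCII, and also likely to be printable, which is why the function converts each character to unsigned char before comparing.) The function also refuses to step back past the first column, because len is an unsigned value and would otherwise wrap around to a very large number.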
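To see why backspaces matter, consider how underlining works on a typewriter-like device: print an underscore, back up one column, then print the character on top of it. The sketch below shows that overstrike technique. The underline function and the visible_width helper are hypothetical names of my own (visible_width is a compact stand-in for a backspace-aware length function, included only so the example is self-contained), and choosing to skip spaces when underlining is an assumption, not a rule from the book.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overstrike each non-space character as "_\bX".
   dst must have room for 3 * strlen(src) + 1 bytes. */
void underline(char *dst, const char *src)
{
    size_t i, j = 0;

    assert(dst != NULL && src != NULL);

    for (i = 0; src[i] != '\0'; i++) {
        if (src[i] != ' ') {
            dst[j++] = '_';
            dst[j++] = '\b';
        }
        dst[j++] = src[i];
    }

    dst[j] = '\0';
}

/* Count the columns a string occupies when printed. */
size_t visible_width(const char *str)
{
    size_t width = 0, i;

    for (i = 0; str[i] != '\0'; i++) {
        unsigned char ch = (unsigned char) str[i];

        if (ch == '\b') {
            if (width > 0) {
                width--;               /* backspace moves left one column */
            }
        }
        else if (ch >= 32 && ch != 127) {
            width++;                   /* printable character */
        }
    }

    return width;
}
```

Underlining the 8-character string "hi there" this way produces 22 bytes, so strlen reports 22, while the text still occupies only 8 columns on the page.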

Processing text

Kernighan and Plauger first describe the text function very simply (p. 229), as a function that just copies text to the output. They update it later (p. 237), after filling in more detail about the Format program. The updated text function glues several shorter lines together to fill an output line.

Kernighan and Plauger’s implementation might be described through this pseudocode:

text(line)
{
    if the line is all blanks, or zero length:
        new_para()

    if zero length,
    or if not filling output lines:
        putline(line)

    otherwise:
        for each word on the line:
            putword(word)
}

What I’ve called the new_para() function here is actually named leadbl in Kernighan and Plauger’s example. This starts the output on the left side and adds any temporary indenting that might be needed, effectively starting a new paragraph.

Adding words with the putword() function simplifies processing the text. Basically, this function adds words to the output variable one at a time. If the output variable reaches the maximum line length, it calls brk() to print the line and start a new one.
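A simplified putword() might look like the following C sketch. It makes several assumptions that are mine rather than Kernighan and Plauger's: the output buffer is 256 bytes, the width is measured with strlen (so it ignores the backspace handling discussed earlier), the line is flushed before a word that would not fit rather than after the line fills up, words too long for the buffer are dropped, and there is no justification to spread words out to the right margin. The lines_written counter is hypothetical and exists only to make the behavior visible.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXLINE 256             /* assumed size of the output buffer */

char output[MAXLINE];           /* the line being assembled */
int right_margin = 60;          /* default for .rm in the command table */
int lines_written = 0;          /* hypothetical counter, for demonstration */

/* Print the pending output line, if any, and start a new one. */
void brk(void)
{
    if (output[0] != '\0') {
        puts(output);
        lines_written++;
    }

    output[0] = '\0';
}

/* Add one word to the output line, breaking first if it will not fit. */
void putword(const char *word)
{
    size_t used, need;

    assert(word != NULL);

    used = strlen(output);
    need = strlen(word) + (used > 0 ? 1 : 0);   /* word plus a separator */

    if (used > 0 && used + need > (size_t) right_margin) {
        brk();
        used = 0;
    }

    if (used + strlen(word) + 2 > sizeof output) {
        return;                 /* too long for the buffer: dropped here */
    }

    if (used > 0) {
        strcat(output, " ");
    }
    strcat(output, word);
}
```

With the right margin set to 10, adding "hello", "big", and "world" prints "hello big" as one line and leaves "world" waiting in the buffer until the next break.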

Kernighan and Plauger describe the putline() function using pseudocode (p. 229) but I’ll simplify it further so it’s easier to understand:

putline(line)
{
    if at the top of the page or past the bottom of the page:
        make the top margins and top title

    add an indent, if any

    add the line of text to the output

    increment the line number

    if past the bottom of the page:
        make the bottom margins and bottom title
}
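The page bookkeeping in this outline can be sketched in C as follows. This is only one possible arrangement, built on assumptions of my own: the margins are blank lines with no header or footer titles, the 3-line top and bottom margins are arbitrary values, lineno counts the lines already written on the current page, and the output goes to a FILE pointer passed by the caller. The bottom margin is written as soon as the last text line of the page is printed, so the top-of-page test only needs to check for lineno being zero.

```c
#include <assert.h>
#include <stdio.h>

int page_length = 66;           /* default for .pl in the command table */
int top_margin = 3;             /* assumed value */
int bottom_margin = 3;          /* assumed value */
int indent = 0;                 /* default for .in in the command table */
int lineno = 0;                 /* lines already written on this page */

/* Write one finished line, adding page margins where needed. */
void putline(FILE *out, const char *line)
{
    int i;

    assert(out != NULL && line != NULL);

    if (lineno == 0) {                          /* at the top of a page */
        for (i = 0; i < top_margin; i++) {
            fputc('\n', out);
        }
        lineno = top_margin;
    }

    fprintf(out, "%*s%s\n", indent, "", line);  /* indent, then the text */
    lineno++;

    if (lineno >= page_length - bottom_margin) { /* reached the bottom */
        for (i = 0; i < bottom_margin; i++) {
            fputc('\n', out);
        }
        lineno = 0;                             /* next line starts a page */
    }
}
```

Every full page comes out at exactly page_length lines: the top margin, the text lines, and the bottom margin.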

The basics of nroff

Kernighan and Plauger provide an excellent overview of how to create a document processing system in their book, Software Tools. By exploring this simple implementation, we can understand how a more complex document preparation system (like nroff) works on the inside.

But this is just the beginning: enough to provide a high-level outline of the program. In a follow-up article, I will go into more detail and demonstrate how you can create a prototype text processing system like this Format program. By digging further, we can see the “moving parts” of a document preparation system and more fully understand how it works.