Understanding Unix ‘nroff’
Explore a simple implementation to understand how this important document preparation system works “under the hood.”
When I first explored Unix systems as a university student, I learned how to use nroff and troff to write class papers. Nroff made it easy to write documents that I could print on dot matrix printers or other “typewriter-like” output devices. I used troff (actually, GNU groff) to create output suitable for a laser printer.
But while I understood how it worked, using the different macros and requests, I didn’t understand why it worked. I had access to the GNU groff source code, but it was too advanced for me to understand at the time. More recently, though, I’ve decided to take a step back to understand the basics of how nroff worked “under the hood.”
What is nroff?
If you haven’t used nroff, Brian Kernighan (formerly of Bell Labs, now at Princeton) gave a brief history of nroff and troff in our interview with him last year:
Nroff is for basically ASCII text in a fixed-width font; it mimics ancient terminals like the Model 37 Teletype and line printers of the 1970s. Troff is for typesetters, which provide variable-width characters in multiple fonts, which is what we see in print today.
…
The first such program was RUNOFF, written by Jerry Saltzer at MIT in maybe 1966; it was (in my experience) the first such formatting program and provided the model for a lot of subsequent ones. In 1968, I wrote a simpler version for printing my PhD thesis at Princeton that I called “roff,” but I think Doug McIlroy at Bell Labs wrote a different version in the same style that he also called “roff.” Nroff and troff adopted the same style of formatting commands, but were significantly more powerful and could do more complicated formatting, especially page layout.
The original version of troff by Joe Ossanna in about 1973 was aimed at a particular typesetter, the Graphics Systems Model C/A/T. Ditroff (“device independent troff”) was my updated version of troff that produced output that was independent of any particular typesetter.
Kernighan mentioned RUNOFF as an inspiration for the nroff system. I also interviewed Jerry Saltzer about the history of the RUNOFF program, where he explained:
RUNOFF was certainly not the first document preparation system. It borrowed ideas from several predecessors, including DITTO for CTSS, and COLOSSAL TYPEWRITER and JUSTIFY for the PDP-1.
RUNOFF is not a document preparation system by itself, it is one of a pair of programs, named TYPSET and RUNOFF. TYPSET was used to create or edit a document and the resulting file was then run through RUNOFF to format it for printing.
In that same interview, I highlighted features of RUNOFF, including its control words: “A RUNOFF document used control words or their abbreviation on a new line, starting with a period. For example, .center (or the abbreviation .ce) centers the next line, .begin page (or .bp) inserts a page break, and .literal (or .li) prints the next line as normal text even if it starts with a period.”
RUNOFF produced just the formatting that Saltzer needed to write his Sc.D. thesis proposal in 1963, and later to prepare his thesis document. RUNOFF was really a personal project, not meant to become a larger, more flexible document preparation system. But it inspired others to create those more advanced, more flexible systems. Bell Labs implemented a version of RUNOFF called roff, using just the abbreviated commands. Roff was used throughout the Unix team to create printed documents intended for a “typewriter-like” device, such as a Teletype or line printer.
In his book, Unix: A History and Memoir (2020), Kernighan relates that the Unix team created nroff as an offshoot of another project: The Unix team wanted to purchase a new computer system to keep working on Unix, but management said No. Around the same time, the Patents department planned to purchase another computer system to help them write patent applications, which required strict, specific formatting. The Unix team made a deal: the Patents department bought a new computer for the Unix team, and the Unix team updated roff to become nroff (“new roff”). Combined with a macro system, nroff and the later troff (“typesetter roff,” intended for use with a phototypesetter) could provide fine control on formatting all kinds of documents.
A basic outline
In their book, Software Tools (1976), Brian Kernighan and P. J. Plauger describe how to create a variety of useful tools, by way of teaching algorithms and program construction. One such program is called Format, a simplified version of nroff.
Kernighan and Plauger use a programming language called Ratfor, or “Rational FORTRAN,” which may appear to modern readers as a mix of C programming and pseudocode. Ratfor is actually a preprocessor that translates C-like input into code for FORTRAN 66 compilers. However, because not many people use Ratfor these days, I will instead present an updated outline of Format using a more general pseudocode.
First, let’s understand the basics of the program. This Format program reads one or more files, then collects words and fills paragraphs. Blank lines are preserved, but documents should really use formatting instructions like .sp to add a blank line of extra space or .ti 4 to add 4 spaces of temporary indenting, such as to start a new paragraph. These instructions always appear on a line by themselves, and at the start of a line.
At a high level, the main program looks like this:
main()
{
    for each file:
        for each line in the file:
            if first character is '.', then:
                command(line)
            else:
                text(line)
}
The text function reads text from the input and adds it to the output, writing full lines as it goes. The command function processes a formatting instruction, which usually updates some parameter in memory (like the page length, right margin, temporary indent, or another value).
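The outline above maps fairly directly onto C. The following is a minimal sketch, not Kernighan and Plauger's code: process_line, format_file, MAXLINE, and the two counters are hypothetical names introduced for illustration, and command and text are stubs that only count how often they are called.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXLINE 1024    /* assumed longest input line */

int ncommands = 0;      /* the stubs below only count calls */
int ntext = 0;

/* Stub: a real command() would update a formatting parameter. */
void command(const char *line) { (void) line; ncommands++; }

/* Stub: a real text() would collect words and fill output lines. */
void text(const char *line) { (void) line; ntext++; }

/* Send one input line to command() or text(). */
void process_line(const char *line)
{
    assert(line != NULL);

    if (line[0] == '.')
        command(line);
    else
        text(line);
}

/* Read a whole file, one line at a time. */
void format_file(FILE *in)
{
    char line[MAXLINE];

    assert(in != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the newline */
        process_line(line);
    }
}
```

A main function would open each file named on the command line and pass it to format_file, or pass stdin if no files were given.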
Processing commands
In their book, Kernighan and Plauger define these commands that the Format program can process: (p. 233)
command | break? | default | what it does |
---|---|---|---|
.bp n | yes | n = +1 | begin page numbered n |
.br | yes | | cause break |
.ce n | yes | n = 1 | center next n lines |
.fi | yes | | start filling |
.fo | no | empty | footer title |
.he | no | empty | header title |
.in n | no | n = 0 | indent n spaces |
.ls n | no | n = 1 | line spacing is n |
.nf | yes | | stop filling (“no fill”) |
.pl n | no | n = 66 | set page length to n |
.rm n | no | n = 60 | set right margin to n |
.sp n | yes | n = 1 | space down n lines |
.ti n | yes | n = 0 | temporary indent of n |
.ul n | no | n = 1 | underline words from next n lines |
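One way to hold these settings in memory is a single structure. This is only a sketch under stated assumptions: the struct and field names are hypothetical, I am assuming the page length, right margin, line spacing, and indent values from the "default" column also work as starting values, and I am assuming that filling starts switched on.

```c
#include <assert.h>
#include <string.h>

#define MAXTITLE 128    /* assumed limit on header and footer length */

enum { NO = 0, YES = 1 };

/* Formatting state; every name here is illustrative. */
struct params {
    int fill;               /* YES after .fi, NO after .nf */
    int ls;                 /* line spacing (.ls) */
    int in;                 /* indent (.in) */
    int ti;                 /* temporary indent (.ti) */
    int rm;                 /* right margin (.rm) */
    int pl;                 /* page length (.pl) */
    int ce;                 /* lines still to center (.ce) */
    int ul;                 /* lines still to underline (.ul) */
    char header[MAXTITLE];  /* header title (.he) */
    char footer[MAXTITLE];  /* footer title (.fo) */
};

/* Set the assumed starting values. */
void init_params(struct params *p)
{
    assert(p != NULL);

    memset(p, 0, sizeof *p);    /* no indents, nothing to center or
                                   underline yet, empty titles */
    p->fill = YES;              /* assumption: filling starts on */
    p->ls = 1;
    p->rm = 60;
    p->pl = 66;
}
```

Keeping everything in one structure means each formatting instruction only has to change one field, and the output functions can read the current state from a single place.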
The command function evaluates the first word on the line, and takes actions based on what it is. For example:
command(line)
{
    look at the first word:

    if ".br", then:
        brk()
    or, if ".fi", then:
        brk()
        param.fill = YES
    or, if ".nf", then:
        brk()
        param.fill = NO
    or, if ".sp", then:
        get next word as "val"
        space(val)
    or, if ".ls", then:
        get next word as "val"
        param.ls = val
    or, if ".he", then:
        get rest of line as "str"
        param.header = str
    or, if ".fo", then:
        get rest of line as "str"
        param.footer = str

    ... and so on for the other commands
}
Some commands might simply set a parameter, such as .ls, .he, or .fo. Other commands set values but also take an action, such as commands like .fi and .nf that cause a break with the brk() function. And other commands do something immediately, like the .sp command to add extra blank lines on the output.
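The two steps that every branch relies on, recognizing the instruction and reading its number, could look like this in C. This is a hedged sketch rather than the book's code: is_request and request_value are hypothetical helpers, and I am assuming the "default" column in the table gives the value to use when n is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return nonzero if line starts with the instruction `name` (for
   example ".sp"), followed by white space or the end of the line. */
int is_request(const char *line, const char *name)
{
    size_t n;

    assert(line != NULL && name != NULL);

    n = strlen(name);
    return strncmp(line, name, n) == 0 &&
           (line[n] == '\0' || isspace((unsigned char) line[n]));
}

/* Return the number that follows the instruction name, or `defval`
   if there is no number there. */
int request_value(const char *line, int defval)
{
    const char *p = line;
    char *end;
    long val;

    assert(line != NULL);

    while (*p != '\0' && !isspace((unsigned char) *p))
        p++;                        /* skip the instruction name */

    val = strtol(p, &end, 10);      /* strtol skips leading blanks */
    if (end == p)
        return defval;              /* nothing parsed: use the default */

    return (int) val;
}
```

With these, the .sp branch might read: if is_request(line, ".sp") is true, call space(request_value(line, 1)).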
Breaking lines
By dividing these actions into separate functions, the Format program is easier to write because the actions are isolated into distinct units. For example, the brk() function checks the line that the program has been preparing to print, stored in a global variable called output, prints it if there’s anything there, then clears the variable in memory:
void brk(void)
{
    if (length(output) > 0) {   /* anything waiting to be printed? */
        puts(output);
    }

    output[0] = '\0';           /* start a new, empty line */
}
I’ve written the brk() function using the C programming language, because it’s fairly simple and it’s good to see an implementation. Even if you don’t know C programming, you can probably infer that length is a function that determines the length of a string, and puts prints the string to the output.
Readers who are more familiar with C might wonder why I didn’t use strlen to find the length; it’s because the output variable might contain backspaces, which strlen doesn’t understand and counts like normal characters. But the Format program might use backspaces to create underlined or bold type, so we need the new length function to accommodate that:
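#include <stddef.h>

size_t length(const char *str)
{
    size_t len = 0, i;

    for (i = 0; str[i]; i++) {
        unsigned char c = (unsigned char) str[i];

        if (c == '\b') {                /* backspace: back up one column */
            if (len > 0) {              /* size_t is unsigned; stay at 0 */
                len--;
            }
        }
        else if (c >= 32 && c != 127) { /* printable (127 is DEL) */
            len++;
        }
    }

    return len;
}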
I’ve included an extra comparison to ensure that a character is printable, and not a printer control code. ASCII defines codes 0 through 31 as control codes, and 32 through 126 as printable characters; code 127 (DEL) is also a control code. (Anything above 127 is outside ASCII; in an extended character set, it is likely to be printable.)
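To see why the distinction matters, here is a small sketch of backspace underlining. The underline helper is hypothetical, and I am assuming one common overstrike sequence: an underscore, a backspace, then the character itself. The block repeats a length function so that it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count printed columns, as the length() function above does. */
size_t length(const char *str)
{
    size_t len = 0, i;

    for (i = 0; str[i]; i++) {
        unsigned char c = (unsigned char) str[i];

        if (c == '\b') {
            if (len > 0)
                len--;
        }
        else if (c >= 32 && c != 127) {
            len++;
        }
    }

    return len;
}

/* Copy word into out, writing each non-blank character as
   underscore, backspace, character. Returns 0 on success, or -1
   if out is too small to hold the result. */
int underline(const char *word, char *out, size_t outsize)
{
    size_t i, n = 0;

    assert(word != NULL && out != NULL);

    if (outsize == 0)
        return -1;

    for (i = 0; word[i]; i++) {
        size_t need = (word[i] == ' ') ? 1 : 3;

        if (n + need + 1 > outsize)     /* +1 for the terminator */
            return -1;

        if (word[i] != ' ') {
            out[n++] = '_';
            out[n++] = '\b';
        }
        out[n++] = word[i];
    }

    out[n] = '\0';
    return 0;
}
```

Underlining the two-letter word "Hi" this way produces a six-byte string, so strlen reports 6, while length reports the 2 columns that the word actually occupies on the page.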
Processing text
Kernighan and Plauger first present the text function in a very simple form (p. 229) that just copies text to the output, then update it later (p. 237) after providing more detail in the Format program. This updated text function accommodates gluing several shorter lines together to fill an output line.
Kernighan and Plauger’s implementation might be described through this pseudocode:
text(line)
{
    if the line is all blanks, or zero length:
        new_para()

    if zero length,
    or if not filling output lines:
        putline(line)
    otherwise:
        for each word on the line:
            putword(word)
}
What I’ve called the new_para() function here is actually named leadbl in Kernighan and Plauger’s example. This starts the output on the left side and adds any temporary indenting that might be needed, effectively starting a new paragraph.
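The text outline needs two small pieces: a test for a blank line, and a way to step through the words on a line. Here is one possible sketch in C; all_blank and getword are hypothetical helpers written for this article, not the routines from the book.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if line is empty or holds only spaces and tabs. */
int all_blank(const char *line)
{
    assert(line != NULL);

    while (*line == ' ' || *line == '\t')
        line++;

    return *line == '\0';
}

/* Copy the next word from line, starting at index *pos, into word,
   and advance *pos past it. Words longer than the buffer are cut
   short. Returns the number of characters stored, or 0 when no
   words remain. */
size_t getword(const char *line, size_t *pos, char *word, size_t size)
{
    size_t i, n = 0;

    assert(line != NULL && pos != NULL && word != NULL && size > 0);

    i = *pos;
    while (line[i] == ' ' || line[i] == '\t')
        i++;                            /* skip blanks before the word */

    while (line[i] != '\0' && line[i] != ' ' && line[i] != '\t') {
        if (n + 1 < size)
            word[n++] = line[i];
        i++;
    }

    word[n] = '\0';
    *pos = i;
    return n;
}
```

The "for each word on the line" step then becomes a loop that starts with pos at 0 and calls getword until it returns 0, passing each word to putword.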
Adding words with the putword() function simplifies processing the text. Basically, this function adds words to the output variable one at a time. If the output variable reaches the maximum line length, it calls brk() to print the line and start a new one.
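A minimal putword might look like the following sketch. It makes several assumptions that are not from the book: MAXOUT, rm, and lines_printed are hypothetical names, the line is flushed just before a word that would not fit (rather than after the line is already full), and the text contains no backspaces, so strlen is good enough to measure it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXOUT 256          /* assumed size of the output buffer */

char output[MAXOUT];        /* the line being built */
int rm = 60;                /* right margin; hypothetical global */
int lines_printed = 0;      /* counts full lines, for illustration */

/* Print the pending line, if any, and start a new one. */
void brk(void)
{
    if (output[0] != '\0') {
        puts(output);
        lines_printed++;
    }

    output[0] = '\0';
}

/* Add one word to the output line, starting a new line first
   if the word would run past the right margin. */
void putword(const char *word)
{
    size_t have, need;

    assert(word != NULL);

    have = strlen(output);
    need = strlen(word);

    /* Would a space plus this word pass the right margin? */
    if (have > 0 && have + 1 + need > (size_t) rm) {
        brk();
        have = 0;
    }

    if (have + 1 + need >= MAXOUT)
        return;     /* sketch only: drop a word that cannot fit */

    if (have > 0)
        output[have++] = ' ';

    memcpy(output + have, word, need + 1);  /* copies the '\0' too */
}
```

With the right margin set to 10, adding "hello" and then "brave" prints "hello" on its own line and leaves "brave" waiting in output; a later call to brk prints whatever is still pending.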
Kernighan and Plauger describe the putline() function using pseudocode (p. 229) but I’ll simplify it further so it’s easier to understand:
putline(line)
{
    if at the top of the page or past the bottom of the page:
        make the top margins and top title

    add an indent, if any
    add the line of text to the output
    increment the line number

    if past the bottom of the page:
        make the bottom margins and bottom title
}
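The page bookkeeping in that outline can be sketched in C as follows. Every number and name here is an assumption made for illustration: top and bot are the lines I reserve for the header and footer blocks, title_block is a hypothetical helper, and lineno counts the lines already used on the current page, with 0 meaning a fresh page. Because the footer step resets lineno, this version only needs the "top of the page" test before printing.

```c
#include <assert.h>
#include <stdio.h>

int pl = 66;                /* page length (.pl) */
int top = 5;                /* assumed lines for top margin and header */
int bot = 5;                /* assumed lines for bottom margin and footer */
int in = 0;                 /* indent (.in) */
int lineno = 0;             /* lines used on this page; 0 = fresh page */
int pages = 0;              /* completed pages */
const char *header = "";
const char *footer = "";

/* Print a block of n lines with the title roughly in the middle. */
void title_block(FILE *out, const char *title, int n)
{
    int i, before;

    if (n <= 0)
        return;

    before = (n - 1) / 2;
    for (i = 0; i < before; i++)
        fputc('\n', out);
    fprintf(out, "%s\n", title);
    for (i = before + 1; i < n; i++)
        fputc('\n', out);
}

/* Print one line of text, adding header and footer as needed. */
void putline(FILE *out, const char *line)
{
    int bottom = pl - bot;      /* text may use lines up to here */

    assert(out != NULL && line != NULL);

    if (lineno == 0) {          /* at the top of a page */
        title_block(out, header, top);
        lineno = top;
    }

    fprintf(out, "%*s%s\n", in, "", line);  /* indent, then the text */
    lineno++;

    if (lineno >= bottom) {     /* reached the bottom of the page */
        title_block(out, footer, bot);
        lineno = 0;
        pages++;
    }
}
```

With the values above, each page holds 5 header lines, 56 lines of text, and 5 footer lines, which adds up to the 66-line page length.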
The basics of nroff
Kernighan and Plauger provide an excellent overview of how to create a document processing system in their book, Software Tools. By exploring this simple implementation, we can understand how a more complex document preparation system (like nroff) works on the inside.
But this is just the beginning: a high-level outline of the program. In a follow-up article, I will go into more detail and demonstrate how you can create a prototype text processing system like this Format program. By digging further, we can see the “moving parts” of a document preparation system and more fully understand how it works.