Technical writers use the command line too

Technical writers who know how to leverage the command line can become power users.

In these modern, enlightened times, technical writers can do all of their work in a graphical environment, using a point-and-click tool such as a word processor or web browser to create content. To some technical writers, the command line may seem antiquated when everything is online or on the desktop. But writers who learn a few command line tools can finish some jobs much faster than they could by pointing and clicking.

I run Linux, and I use a very modern graphical desktop. But I sometimes find it’s faster and easier to use the command line to do things. One command line tool I rely on is the AWK scripting language. AWK was developed at Bell Labs in 1977 by Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan; the name AWK comes from the initials of their surnames.

A brief introduction to AWK

AWK has a very simple syntax, which makes it easy to learn. It operates on text files, and reads one line at a time. Every AWK statement has two parts: a pattern and an action. A pattern can be text that appears on a line, or it can be a comparison of two or more values. An action is one or more commands enclosed in curly braces, which AWK runs for each line that matches the pattern.

You can use regular expressions to match text (like ^ to match the start of a line and $ for the end of a line), but let’s keep this simple and only use plain text. For example, this statement looks for the text “hello” in any line of text, and sets the variable hi to 1:

/hello/ {hi=1}
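
As a rough illustration of the kinds of patterns described above, here are three one-line programs run against a few made-up lines. The sample text and the `sample` shell variable are assumptions for this sketch; NR is AWK's built-in count of the lines read so far.

```shell
# Three made-up lines of sample input, stored in a shell variable.
sample='hello world
say hello
goodbye'

# Plain-text pattern: print lines that contain "hello" anywhere.
printf '%s\n' "$sample" | awk '/hello/ {print}'

# Regular expression pattern: print lines that start with "hello".
printf '%s\n' "$sample" | awk '/^hello/ {print}'

# Comparison pattern: print every line after the first one.
printf '%s\n' "$sample" | awk 'NR > 1 {print}'
```

The first command prints the two lines that contain “hello”, the second prints only “hello world”, and the third prints “say hello” and “goodbye”.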

If you have a long series of AWK pattern-action statements, you can save them in a file and let AWK run the file as a script. But most of the time, I run AWK with just a few statements, so I can provide them directly on the command line. Let’s extend the previous example to also print lines, but only while the hi variable has the value 1. The effect is to print everything from the first line that contains “hello” to the end of the input (with no file name given, AWK reads standard input):

$ awk '/hello/ {hi=1} hi==1 {print}'
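
Here is a minimal sketch of the script-file approach, using the same two statements. The file names `hello.awk` and `notes.txt` and the sample text are placeholders made up for this example; `-f` is the standard AWK option for reading the program from a file.

```shell
# Work in a scratch directory so the example does not clutter anything else.
workdir=$(mktemp -d)

# Save the two pattern-action statements in a script file (hypothetical name).
cat > "$workdir/hello.awk" <<'EOF'
/hello/ {hi=1}
hi==1   {print}
EOF

# Made-up sample text to process.
printf '%s\n' 'first line' 'say hello' 'last line' > "$workdir/notes.txt"

# Run the script with -f, then run the same statements on the command line.
awk -f "$workdir/hello.awk" "$workdir/notes.txt"
awk '/hello/ {hi=1} hi==1 {print}' "$workdir/notes.txt"
```

Both commands print “say hello” and “last line”, and skip “first line” because hi has not been set yet when AWK reads it.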

Extract text using AWK

With this brief introduction, we can use AWK to do real work on text files. Recently, I needed to extract the contents of an HTML page; specifically, I needed to save just the body section of an HTML document to a new file.

HTML uses tags to provide structure and formatting on a web page. The basic outline of a minimally valid HTML document looks like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title> ... </title>
  </head>
  <body>

  ...

  </body>
</html>

To extract just the text body of the document, I needed to print all text between <body> and </body>, but not the tags themselves. I created three rules in AWK to manage this:

  1. When AWK finds a line with <body>, set a variable to 1
  2. When AWK finds the closing </body> line, set the variable back to 0
  3. Only print lines when the variable has the value 1

A sample HTML document

Let’s demonstrate this by creating a sample HTML document using the pandoc program. You might use pandoc to convert a file from one format to another, but you can also use pandoc as a filter, which means it will process whatever text you pipe to it on standard input. Here’s a sample that just has the body text “Hi there!”:

$ echo "Hi there!" | pandoc --from markdown --to html --standalone --metadata title="Hello" > hello.html

Because this uses the --standalone option to create a complete HTML document, the output file is quite long (over 170 lines):

$ wc -l hello.html

The body section is at the end of the file, which we can preview with the tail command. By default, tail shows the last ten lines:

$ tail hello.html
    .display.math{display: block; text-align: center; margin: 0.5rem auto;}
  </style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
</body>
</html>

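Now let’s use AWK to extract just the content between <body> and </body>, but not print either tag. Using the rules in the order I’ve listed above is almost right, but it still prints the opening tag. AWK tests each line against every statement in order, so the <body> line sets show to 1 just before the print statement sees that same line: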

$ awk '/<body>/ {show=1} /<\/body>/ {show=0} show==1 {print}' hello.html
<body>
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>

Moving the show==1 {print} statement to the front stops the command from printing the opening tag, but now it prints the closing tag:

$ awk 'show==1 {print} /<body>/ {show=1} /<\/body>/ {show=0}' hello.html
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
</body>

Instead, we can strategically place the line that detects </body> up front, then the print statement, followed by the line to find the <body> opening tag. This arrangement of statements only prints the text between the opening and closing tags, but not the tags themselves:

$ awk '/<\/body>/ {show=0} show==1 {print} /<body>/ {show=1}' hello.html
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
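
If the one-line version is hard to read, the same three statements can be written out as a commented script. This is only a sketch: the file names `body.awk` and `sample.html` are made up, and the sample page is a shortened stand-in for the pandoc output rather than the real file.

```shell
workdir=$(mktemp -d)

# A shortened stand-in for hello.html (not the full pandoc output).
cat > "$workdir/sample.html" <<'EOF'
<html>
<head>
<title>Hello</title>
</head>
<body>
<h1 class="title">Hello</h1>
<p>Hi there!</p>
</body>
</html>
EOF

# The same statements as the one-line command, one per line, with comments.
cat > "$workdir/body.awk" <<'EOF'
# Closing tag: turn printing off before this line reaches the print statement.
/<\/body>/ {show=0}
# Print the current line only while we are inside the body section.
show==1 {print}
# Opening tag: turn printing on, which takes effect from the next line onward.
/<body>/ {show=1}
EOF

awk -f "$workdir/body.awk" "$workdir/sample.html"
```

This prints only the h1 and p lines. Note that these patterns match the literal text <body>, so a page whose opening tag carries attributes, such as <body class="main">, would need a looser pattern like /<body[ >]/. The approach also assumes that each tag sits on a line of its own.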

The command line is a power skill

You can do the same job with traditional desktop tools: it’s easy enough to use View Source in a web browser, and search the code for <body>, then use your mouse to highlight and copy all of the lines between <body> and </body>. But this can get tiresome, especially if you need to do this several times with long HTML documents.

With desktop tools, you have to scan the document visually for the text you need every time, then manually select and copy it so you can paste it into a new document. With the command line, a short AWK script does the work for you.
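
For example, a short shell loop can apply the same AWK command to every HTML file in a directory. This is a sketch under assumptions: the `pages` directory, the file names, and the `.body.txt` output suffix are all made up here, and each page is assumed to have <body> and </body> on lines of their own.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/pages"

# Create two tiny made-up pages to stand in for real documents.
for name in one two; do
  printf '%s\n' '<html>' '<body>' "<p>Page $name</p>" '</body>' '</html>' \
    > "$workdir/pages/$name.html"
done

# Extract the body of each page into a matching .body.txt file.
for page in "$workdir"/pages/*.html; do
  awk '/<\/body>/ {show=0} show==1 {print} /<body>/ {show=1}' "$page" \
    > "${page%.html}.body.txt"
done

# Show one of the results.
cat "$workdir/pages/one.body.txt"
```

The loop writes one output file per page, so the original HTML files are left untouched.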