Technical writers use the command line too
Technical writers who know how to leverage the command line can become power users.
In these modern, enlightened times, technical writers can do all of their work from a graphical environment, using a point-and-click interface like a word processor, web browser, or other tool to create content. To some technical writers, using a command line may seem antiquated when everything is online or on the desktop. But technical writers who know how to leverage the command line can become power users.
I run Linux, and I use a very modern graphical desktop. But I sometimes find it’s faster and easier to use the command line to do things. One command-line tool I rely on is the AWK scripting language. AWK was developed at Bell Labs in 1977 by Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan; the name is an abbreviation of Aho, Weinberger, and Kernighan.
A brief introduction to AWK
AWK has a very simple syntax, which makes it easy to learn. It operates on text files, and reads one line at a time. Every AWK statement has two parts: a pattern and an action. A pattern can be text that appears on a line, or it can be a comparison of two or more values. An action is one or more commands, written inside curly braces, that AWK runs on each line that matches the pattern.
You can use regular expressions to match text (like ^ to match the start of a line and $ for the end of a line), but let’s keep this simple and only use plain text. For example, this instruction looks for the text “hello” in any line of text, and sets the variable hi to 1:
/hello/ {hi=1}
If you have a long series of AWK pattern-action statements, you can save them in a file and let AWK run the file as a script. But most of the time, I just run AWK with a few statements; that means I can provide the statements on the command line. Let’s extend the previous example to also print lines from a text file, but only once the hi variable has the value 1:
$ awk '/hello/ {hi=1} hi==1 {print}'
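To see both approaches in action, here is a minimal sketch that saves the same two statements in a script file and runs them on a few lines of sample text. The file names hello.awk and hello.out and the sample text are made up for illustration. When you don't give AWK a file to read, it reads standard input, which is why the pipe works here.

```shell
# Save the two statements in a script file (hello.awk is a made-up name).
cat > hello.awk <<'EOF'
/hello/ {hi=1}
hi==1 {print}
EOF

# Pipe three sample lines into AWK; -f tells awk to read its statements from a file.
printf 'first line\nhello world\nlast line\n' | awk -f hello.awk > hello.out

# Show the result: printing starts at the first line that contains "hello".
cat hello.out
```

Because nothing ever sets hi back to 0, AWK prints the matching line and every line after it.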
Extract text using AWK
With this brief introduction, we can use AWK to do real work on text files. Recently, I needed to extract the contents of an HTML page; specifically, I needed to save just the body section of an HTML document to a new file.
HTML uses tags to provide structure and formatting on a web page. The basic outline of a minimally valid HTML document looks like this:
<!DOCTYPE html>
<html lang="en">
<head>
<title> ... </title>
</head>
<body>
...
</body>
</html>
To extract just the text body of the document, I needed to print all text between <body> and </body>, but not the tags themselves. I created three rules in AWK to manage this:
- When AWK finds a line with <body>, set a variable to 1
- When AWK finds the closing </body> line, set the variable back to 0
- Only print lines when the variable has the value 1
A sample HTML document
Let’s demonstrate this by creating a sample HTML document using the pandoc program. You might use pandoc to convert a file into something else, but you can also use pandoc as a filter, which means it will process anything you pipe to it on standard input. Here’s a sample that just has the body text “Hi there!”:
$ echo "Hi there!" | pandoc --from markdown --to html --standalone --metadata title="Hello" > hello.html
Because this uses the --standalone option to create a complete HTML document, the output file is quite long (over 170 lines):
$ wc -l hello.html
The body section is at the end of the file, which we can preview with the tail command to show the last few lines:
$ tail hello.html
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
</style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
</body>
</html>
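Now let’s use AWK to extract just the content between <body> and </body>, but not print either tag. Using the rules in the order I’ve listed above is almost right, but it still prints the opening tag. AWK checks every rule, in order, against each line, so the <body> line sets show to 1 and then immediately matches the print rule: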
$ awk '/<body>/ {show=1} /<\/body>/ {show=0} show==1 {print}' hello.html
<body>
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
By moving the show==1 {print} statement up front, the command will not print the opening tag, but will print the closing tag, because the print rule now runs on the </body> line before show is set back to 0:
$ awk 'show==1 {print} /<body>/ {show=1} /<\/body>/ {show=0}' hello.html
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
</body>
Instead, we can strategically place the rule that detects </body> up front, then the print statement, followed by the rule to find the <body> opening tag. This arrangement of statements only prints the text between the opening and closing tags, but not the tags themselves:
$ awk '/<\/body>/ {show=0} show==1 {print} /<body>/ {show=1}' hello.html
<header id="title-block-header">
<h1 class="title">Hello</h1>
</header>
<p>Hi there!</p>
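Since my goal was to save the body section to a new file, a shell redirect finishes the job. Here is a minimal sketch, assuming (as in the pandoc output above) that <body> and </body> each sit on their own line and that the opening tag has no attributes. The file names sample.html and body.html are made up, and the sample page is a hand-written stand-in so the commands run without pandoc.

```shell
# Build a small stand-in HTML page (sample.html is a made-up name).
cat > sample.html <<'EOF'
<!DOCTYPE html>
<html lang="en">
<head>
<title>Hello</title>
</head>
<body>
<p>Hi there!</p>
</body>
</html>
EOF

# Same three rules as above; the shell redirect saves the result to a new file.
awk '/<\/body>/ {show=0} show==1 {print} /<body>/ {show=1}' sample.html > body.html

# The new file holds only the line between the two tags.
cat body.html
```

If a page puts content on the same line as either tag, these rules would skip that content, so it's worth checking the page layout first.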
The command line is a power skill
You can do the same job with traditional desktop tools: it’s easy enough to use View Source in a web browser, and search the code for <body>, then use your mouse to highlight and copy all of the lines between <body> and </body>. But this can get tiresome, especially if you need to do this several times with long HTML documents.
Using desktop tools, this requires you to visually scan the document for the text you need—every time—then manually select and copy the text you want, so you can paste it into a new document. With the command line, it’s just a short AWK script to do the work for you.
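If you have several pages to process, one way to repeat the extraction is a shell loop. This is only a sketch under assumptions: the file names, the scratch directory, and the bodies output directory are all invented for illustration, and each page is assumed to keep its <body> and </body> tags on separate lines.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny stand-in pages (made-up names and content).
printf '<html>\n<body>\n<p>Page one</p>\n</body>\n</html>\n' > one.html
printf '<html>\n<body>\n<p>Page two</p>\n</body>\n</html>\n' > two.html

# Write each extracted body to a separate directory, so a second run
# of the loop does not pick up its own output files.
mkdir -p bodies
for page in *.html; do
  awk '/<\/body>/ {show=0} show==1 {print} /<body>/ {show=1}' "$page" > "bodies/$page"
done

# Show the extracted bodies.
cat bodies/one.html bodies/two.html
```

The AWK statements are the same as before; only the loop around them is new.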