Counting words when there’s code

Here's how to use the command line to count the words in a document while skipping its source code.

I enjoy writing about open source software, and I often write articles about how people can write their own programs. These are very technical articles that are often over a thousand words long.

When you’re writing a document, you might use the “word count” feature in your word processor or text editor to see how much you’ve written. But a word is not always a word. When your document also includes sample code, you might not want to include the code as words in that count.

For example, I wrote a “deep dive” article about how to use mmap on Linux to read a file into memory. I like to write the first draft of anything in Markdown, because it lets me focus on the content of what I’m writing without getting hung up on how it will appear in the final draft. The Markdown file for my mmap article was quite long at just over 300 lines and over 1,500 words. I can show the exact counts with the wc (word count) command on Linux:

$ wc mmap.md
 307 1548 9526 mmap.md

By default, wc prints the line, word, and byte counts, in that order. To see just one statistic, use the -l (lines) or -w (words) option, like this:

$ wc -l mmap.md
307 mmap.md

$ wc -w mmap.md
1548 mmap.md

But that article includes a lot of sample code. How much of that article is text and how much is code? I wanted to get a word count of just the text that I wrote, not including the source code samples. Here’s how I used the command line to count just the words I wrote, not the code.

The command line for technical writers

I run Linux at home. While I have a graphical desktop with all of the modern tools like Google Chrome, Visual Studio Code, LibreOffice, Oxygen, and other desktop applications, I often use a terminal window to run a few commands on the command line.

One command line tool I like to use is AWK, a simple but powerful scripting language that technical writers can use to make their work easier. AWK works on text files, and uses a simple pattern-action syntax for its commands. For example, to give a variable called yes the value 1 when AWK reads a line that has <body> on it, you can use this one-line statement:

/<body>/ {yes=1}

You don’t have to give yes an initial value; in AWK, a variable that has never been assigned acts as zero when used as a number (and as an empty string when used as text).

For a short series of pattern-action statements, you can put the whole program on the command line when you run AWK. For example, to read a file called file.txt:

$ awk '/<body>/ {yes=1}' file.txt
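To see the pattern-action idea at work without an HTML file on hand, here is a small sketch that feeds a few made-up lines to AWK. The sample input, the shell variables found and missing, and the END rule that prints the flag are my additions for illustration; they are not part of the one-liner above.

```shell
# Feed three sample lines to AWK; only the second one contains <body>.
# The END rule runs after the last input line and prints the flag.
# Adding 0 forces a numeric result, so an unassigned variable prints as 0.
found=$(printf '<html>\n<body>\n<p>Hello</p>\n' | awk '/<body>/ {yes=1} END {print yes+0}')
echo "with <body>: $found"

# With no matching line, yes is never assigned and still acts as zero.
missing=$(printf '<html>\n<p>Hello</p>\n' | awk '/<body>/ {yes=1} END {print yes+0}')
echo "without <body>: $missing"
```

The first command should report 1 and the second 0, which shows both the pattern match and the zero starting value.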

Markdown documents with source code

With this basic introduction to AWK, we can write a simple command line to extract just the text. Doing so relies on the fact that I wrote my article draft in Markdown. In Markdown, you format code blocks using three backticks on a new line, also called a code fence:

```
code goes here
```

With these code fences, my article was a series of lines that alternated between text and code. Using the code fence as a delimiter, the text part of my article was in every other block, and the code samples were in the alternating blocks. I can use AWK to separate these blocks and print just the text lines of the file.

One way to do this is to increment a variable whenever AWK finds a code fence line. Since AWK gives every variable a zero value when it starts up, this means the text portions of the file will be in blocks 0, 2, 4, 6, … while the code will be in blocks 1, 3, 5, 7, and so on.

/```/ {show++}
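To watch the counter at work, this sketch prints each line of a tiny made-up Markdown sample with the current value of show in front of it. The sample text, the fence shell variable, and the second rule that prints the counter are assumptions for illustration only. Matching with $0 ~ fence is meant to behave like writing the three backticks directly between the slashes.

```shell
# Three backticks, written as octal escapes (\140 is a backtick) so this
# sample is safe to paste inside a Markdown code block.
fence=$(printf '\140\140\140')

# A tiny made-up document: text, a fenced code block, then more text.
sample=$(printf '%s\n' 'Some text.' "$fence" 'int x = 1;' "$fence" 'More text.')

# Rule 1 bumps the counter on every line that contains a fence.
# Rule 2 prints the counter (forced numeric with +0) in front of each line.
numbered=$(printf '%s\n' "$sample" | awk -v fence="$fence" '$0 ~ fence {show++} {print show+0 ": " $0}')
echo "$numbered"
```

In the output, the first text line carries a 0, the opening fence and the code line carry a 1, and the closing fence and the text after it carry a 2. Because the counter is bumped before any later rule runs, the closing fence lands in an even-numbered block along with the text.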

To print the lines from the text blocks, we need to add a pattern-action statement that only takes action when the show variable is an even number. The standard way to test whether a value is odd or even is with the modulo arithmetic operator, which returns the remainder after division. The modulo of 7 divided by 2 is 1, because 2 goes into 7 three times with 1 left over. Similarly, the modulo of 8 divided by 2 is zero, because 2 goes into 8 exactly four times with nothing left over.

Using modulo, we can print lines only when the variable show is an even number by comparing the result of the modulo operator (the % symbol) with zero:

show%2 == 0 {print}
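As a quick check of the arithmetic, this sketch asks AWK to print a few numbers next to their remainder after division by 2. The range 5 through 8 and the parity shell variable are arbitrary choices for illustration; a BEGIN rule is used so that AWK needs no input file.

```shell
# Print n and n % 2 for n = 5..8. BEGIN runs before any input is read,
# so this command does not wait for a file or standard input.
parity=$(awk 'BEGIN { for (n = 5; n <= 8; n++) print n, n % 2 }')
echo "$parity"
```

The odd values 5 and 7 give a remainder of 1, and the even values 6 and 8 give 0, which is the test the print rule relies on.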

Counting the words

Running AWK with these two lines, we should only get the text portion of the Markdown file. If we preview the output by using head to see the first few lines or tail to look at the last few lines, we can see that this AWK script actually includes the closing fence of each code fence pair. That happens because the counter is incremented before the print rule tests it: the opening fence makes show odd, and the closing fence makes it even again. I’m not going to worry too much about these; a code fence only counts as one “word” … and after a thousand words, a few stray code fences as “words” won’t matter much:

$ awk '/```/ {show++} show%2 == 0 {print}' mmap.md | head
---
title: Reading a whole file at once
---

Most of the programs that I write are filter-like utilities: the program
starts up, processes data as it goes, then ends.
Usually, these programs don't have to store a lot of data in memory.
If I can find a way to only read data a character or a block at a time,
I'll do it that way.
This is probably because I learned C programming on DOS, and

$ awk '/```/ {show++} show%2 == 0 {print}' mmap.md | tail
Let's demonstrate this method
by writing a full program that uses `open_file()` to load
a data file into memory, then print the contents of the array:

```

Processing the same data file on my Linux system,
but using Unix line endings, generates this output:

```

With this, I can use wc one more time to count the words (-w) from just the text portion of my article:

$ awk '/```/ {show++} show%2 == 0 {print}' mmap.md | wc -w
1180
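If the stray fences bother you, one possible refinement is to skip the fence lines with AWK's next statement, and to anchor the pattern with ^ so that only lines that start with a fence flip the counter. The unanchored pattern matches three backticks anywhere in a line, so a line that merely mentions a fence would also flip it. This is a sketch on a made-up sample, not the command I ran on my article; the sample text and the shell variables fence, with_fences, and text_only are assumptions for illustration.

```shell
# Three backticks via octal escapes (\140 is a backtick).
fence=$(printf '\140\140\140')

# Made-up sample: five words of text around a one-line code block.
sample=$(printf '%s\n' 'One two three.' "$fence" 'int x = 1;' "$fence" 'Four five.')

# The two-rule script used above: the closing fence slips through
# and is counted as one extra word.
with_fences=$(printf '%s\n' "$sample" | awk -v fence="$fence" '$0 ~ fence {show++} show%2 == 0 {print}' | wc -w)

# Variant: match a fence only at the start of a line, and use next to
# skip the fence line itself before the print rule can see it.
text_only=$(printf '%s\n' "$sample" | awk -v fence="$fence" '$0 ~ ("^" fence) {show++; next} show%2 == 0 {print}' | wc -w)

echo "with closing fences: $with_fences"
echo "text only: $text_only"
```

On this sample, the first count comes to 6 (five words plus one closing fence) and the second to 5. On a long article the difference is small, which is why I was happy to ignore it.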

Technical writers use the command line too

While you can get the job done with traditional desktop tools, technical writers who know how to leverage the command line can become power users. Counting words while excluding code samples is tedious with desktop tools; I don’t know of any way to do it except by highlighting each text block, counting the words in that region, and adding them up for a total word count. But with the command line, it’s just a short AWK script to do the work for you.