How markup works: collecting words
Write a sample program to explore how markup systems collect words to fill paragraphs.
Markup languages all share one common rule: they collect words and fill paragraphs, and rely on markup to control the formatting. The specifics of the markup and formatting differ based on the system being used (HTML uses tags, LaTeX and TeX use backslash commands, and groff and nroff use dot commands) but at their core, these systems all collect words and fill paragraphs.
Let’s explore one method to collect words in a document.
Working with limited memory
Today’s computers have gigabytes of memory; my computer at home has 32GB of installed RAM. But when authors first started using computers to write documents, computers counted memory in kilobytes. To put that in perspective: 1,024 bytes is a kilobyte, 1,024 kilobytes is a megabyte, and 1,024 megabytes is a gigabyte. That means a modern computer with 32GB has more than a thousand thousand (1,048,576) times more memory than an older computer with 32kB—although the first computers used in document preparation had even less memory than that.
Processing documents in this small amount of memory presents certain limitations on how to read the input file. Where a modern application might load the entire file into memory before processing it, original markup systems needed to preserve memory by reading the input one character at a time to collect one word at a time.
Writing a program to collect words requires only a little memory. We need enough to store the current letter being read from the input, and a full word being collected, plus another variable to hold the length of the word.
Start with the basics
Let’s start by looking at the most basic prototype: a function to read letters from the input and fill a “word” variable. Let’s call this variable buffer since it’s just a place to store data until we print it. When the buffer is full, we can print it to the output:
void fill_buffer(FILE *in, FILE *out)
{
    int letter;
    char buffer[BUFSIZE];
    int buflen = 0;

    while ((letter = fgetc(in)) != EOF) {
        buffer[buflen++] = letter;

        if (buflen == BUFSIZE) {
            put_string(buffer, buflen, out);
            buflen = 0;
        }
    }

    put_string(buffer, buflen, out);
}
This is a good starting point because it demonstrates the basic behavior of how a markup system reads input one letter at a time to collect words. In this sample, we use only three variables: letter to store a single letter at a time, buffer which saves a “string” of letters as a “buffer” with length BUFSIZE (defined elsewhere in the program), and buflen to track how many letters we’ve saved in the buffer.
The function is just one big while loop that reads one letter at a time using the fgetc function:
while ((letter = fgetc(in)) != EOF) {
    ...
}
The while line may look difficult, but it’s easier to understand once you understand that the syntax for the while loop is:
while (condition) {
    instructions ...
}
In this case, the “condition” contains another statement:
(letter = fgetc(in)) != EOF
This does several things at once: letter = fgetc(in) means the program first uses fgetc to read a single letter from the input, and saves it in letter. The fgetc function returns an “end of file” value when it reaches the end of the input, so the != EOF part of the condition tests if we have found the end of the file. As a result, the while loop only executes while there is data to read.
Inside the loop, the program stores data into the buffer, and increments the buffer length with buflen++. When the buffer’s length equals the buffer’s size, the program calls the put_string function to print the buffer’s contents. Then it resets the buffer’s length to zero, ready to read more data.
At the end, the program prints any extra data that might have been saved in the buffer since the last time it was printed.
This is just one part of a larger program to read data into a buffer one letter at a time and print it out. The full program also needs a main function to process the files on the command line, plus a put_string function to print data:
#include <stdio.h>
#include <ctype.h> /* isspace */

#define BUFSIZE 20

void put_string(const char *str, int len, FILE *out)
{
    /* an inefficient fputs */
    for (int i = 0; i < len; i++) {
        fputc(str[i], out);
    }
    fputc('\n', out);
}

void fill_buffer(FILE *in, FILE *out)
{
    int letter;
    char buffer[BUFSIZE];
    int buflen = 0;

    while ((letter = fgetc(in)) != EOF) {
        buffer[buflen++] = letter;

        if (buflen == BUFSIZE) {
            put_string(buffer, buflen, out);
            buflen = 0;
        }
    }

    put_string(buffer, buflen, out);
}

int main(int argc, char **argv)
{
    FILE *in;

    for (int i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");
        if (in != NULL) {
            fill_buffer(in, stdout);
            fclose(in);
        }
    }

    if (argc == 1) {
        fill_buffer(stdin, stdout);
    }

    return 0;
}
This sample program sets the buffer size to 20 letters, which might be a bit small for practical use to collect words but is suitable to see the process in action. Save this source code in a file named buffer.c and compile it, such as with the GNU C Compiler on Linux:
$ gcc -o buffer buffer.c
Processing a single-line “lorem ipsum” file with 124 letters generates 7 lines of output. Except for the last line, the lines are 20 characters long because that is the size of the buffer:
$ ./buffer lorem.txt
Lorem ipsum dolor si
t amet, consectetur
adipiscing elit, sed
do eiusmod tempor i
ncididunt ut labore
et dolore magna aliq
ua.
The spaces at the end of each line are easier to see if we use the tr command to replace each space with an underscore character:
$ ./buffer lorem.txt | tr ' ' '_'
Lorem_ipsum_dolor_si
t_amet,_consectetur_
adipiscing_elit,_sed
_do_eiusmod_tempor_i
ncididunt_ut_labore_
et_dolore_magna_aliq
ua.
Collecting words
We can update this primitive function to read letters from the input and collect words. Instead of a fill_buffer function, we write a find_words function. The internals of this function remain essentially the same: read the file one letter at a time, and save each letter into a buffer variable called word. The key difference is that the function should only store words. Any kind of whitespace, such as a space or tab, marks the beginning and end of a word.
void find_words(FILE *in, FILE *out)
{
    int letter;
    char word[WORDSIZE];
    int wordlen = 0;

    while ((letter = fgetc(in)) != EOF) {
        if (isspace(letter)) { /* whitespace */
            if (wordlen > 0) { /* found a space after the word */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
        else { /* letter */
            word[wordlen++] = letter;

            if (wordlen == WORDSIZE) { /* avoid overflow */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
    }

    if (wordlen > 0) {
        put_string(word, wordlen, out);
    }
}
I’ve included a few comments to help explain how the function works at each step. This version of the function tests each letter with the isspace function, which returns a true value if the character is some kind of whitespace. When the function finds a whitespace character, it uses wordlen to determine if there is data in the word variable; if so, it calls the put_string function to print it. Otherwise, the function continues to save letters into the word variable, only printing its contents if the saved data has filled the buffer.
At the end, the program prints any extra data that might have been saved in the word buffer since the last time it was printed.
This is the only part of the program that needs to change in order to collect words from the input. The main function and put_string function remain the same:
#include <stdio.h>
#include <ctype.h> /* isspace */

#define WORDSIZE 20

void put_string(const char *str, int len, FILE *out)
{
    /* an inefficient fputs */
    for (int i = 0; i < len; i++) {
        fputc(str[i], out);
    }
    fputc('\n', out);
}

void find_words(FILE *in, FILE *out)
{
    int letter;
    char word[WORDSIZE];
    int wordlen = 0;

    while ((letter = fgetc(in)) != EOF) {
        if (isspace(letter)) { /* whitespace */
            if (wordlen > 0) { /* found a space after the word */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
        else { /* letter */
            word[wordlen++] = letter;

            if (wordlen == WORDSIZE) { /* avoid overflow */
                put_string(word, wordlen, out);
                wordlen = 0;
            }
        }
    }

    if (wordlen > 0) {
        put_string(word, wordlen, out);
    }
}

int main(int argc, char **argv)
{
    FILE *in;

    for (int i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");
        if (in != NULL) {
            find_words(in, stdout);
            fclose(in);
        }
    }

    if (argc == 1) {
        find_words(stdin, stdout);
    }

    return 0;
}
Save this new program as words.c and compile it with your system’s C compiler:
$ gcc -o words words.c
Processing the same single-line “lorem ipsum” file generates 19 lines of output. In this case, each line is exactly one word:
$ ./words lorem.txt
Lorem
ipsum
dolor
sit
amet,
consectetur
adipiscing
elit,
sed
do
eiusmod
tempor
incididunt
ut
labore
et
dolore
magna
aliqua.
The word count is easier to see if we use the cat command with the -n (number lines) option:
$ ./words lorem.txt | cat -n
     1  Lorem
     2  ipsum
     3  dolor
     4  sit
     5  amet,
     6  consectetur
     7  adipiscing
     8  elit,
     9  sed
    10  do
    11  eiusmod
    12  tempor
    13  incididunt
    14  ut
    15  labore
    16  et
    17  dolore
    18  magna
    19  aliqua.
Collect words and fill paragraphs
This demonstration is only the first step in how a markup system collects words and fills paragraphs, but it is simple enough to show the process in action. To take this to the next level, you might replace the put_string function with a function like add_word to add the current word to a separate variable that stores a line of text before it is printed. After that, adding other functionality such as recognizing markup can support text formatting.