Converting Simplified Docbook to PDF

Convert Simplified Docbook files to PDF using pandoc and LibreOffice with this Bash script.

Docbook is an XML markup language. Similar to HTML, Docbook uses semantic "tags" that tell a processing system how the text should appear. Docbook supports a variety of document types, such as books and articles, and a variant called Simplified Docbook that can be easier to use for writing technical articles. When I write with Docbook, I'm actually using Simplified Docbook.

Processing a Docbook file requires installing several dependencies and running a multi-step process: first applying an XSLT transformation to the Docbook file to create a formatted intermediate file, then using a second tool to process that intermediate file into a PDF file. This can feel a little heavy if all you want to do is write an article and view it as a PDF file.

To simplify the process on my end, I wrote a short Bash script that uses pandoc to convert a Simplified Docbook file into a Word document, then runs LibreOffice to convert the Word file into a PDF. Since I already have pandoc and LibreOffice installed on my system, this felt more lightweight.

A sample file

To demonstrate the process, let's start with a short Simplified Docbook file that just contains a title and a single paragraph:

<article>
 <title>This is the title</title>
 <para>Simplified Docbook is a markup language based on XML.</para>
</article>

Let's explain what this file contains, for anyone who is new to Simplified Docbook. The <article> on the first line defines this document as an Article using Simplified Docbook. Because XML tags come in pairs, </article> ends the Article on the last line of the file. The <title> on the second line defines the title of the document, and <para> on the third line creates a simple paragraph.

Verifying the XML file

This is a straightforward document and we can quickly see that it is correct. But the best practice is to use a tool to verify that the XML is correct; we can use the xmllint command to confirm that the XML document doesn't contain any errors. Without any options, xmllint will scan the file and print any errors, such as mismatched tags; if the XML file is correct, xmllint simply prints the file.

$ xmllint article.docbook
<?xml version="1.0"?>
<article>
 <title>This is the title</title>
 <para>Simplified Docbook is a markup language based on XML.</para>
</article>

To avoid printing the file if it's deemed correct, we can use the --noout option:

$ xmllint --noout article.docbook 

The xmllint program can also use a DTD to fully validate the XML file. On my system, I have downloaded a copy of the Simplified Docbook DTD file as sdocbook.dtd. That means I can use xmllint --dtdvalid sdocbook.dtd to confirm that my Simplified Docbook file is indeed correct:

$ xmllint --noout --dtdvalid sdocbook.dtd article.docbook

If there are no errors, the xmllint program exits silently with a return value of zero.
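
Because xmllint reports success or failure through its return value, a shell script can act on the result. Here is a minimal sketch of that idea. The check_xml function name is made up for this illustration and is not part of xmllint, and the sketch only checks that the file is well-formed, without the DTD:

```shell
# check_xml: hypothetical helper that reports whether xmllint accepts a file.
# Assumes xmllint is installed; if it is missing, every file is reported
# as failing because the command itself cannot run.
check_xml() {
  if xmllint --noout "$1" 2>/dev/null ; then
    echo "$1: OK"
  else
    echo "$1: not well-formed"
  fi
}
```

Running check_xml article.docbook on the sample file should report OK, while a file with mismatched tags produces the "not well-formed" message.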

Converting with pandoc

The pandoc program can convert to and from a long list of file formats, including Docbook, LibreOffice (ODT), and Microsoft Word (DOCX). If pandoc can't guess the correct formats from the file extensions, you need to specify the original document format and the output format on the command line. For example, to convert a Simplified Docbook file into Microsoft Word format, type this command:

$ pandoc --from=docbook --to=docx article.docbook -o article.docx

This also specifies the output file using the -o option. Here, I've saved my output file to a DOCX file, which is the Microsoft Word file format. I could have used LibreOffice format (ODT) instead, but I happen to like the default styles in the Word file; it just looks nice to me.
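
For comparison, the ODT alternative mentioned above might look like the following sketch. The docbook_to_odt function name is hypothetical, and the function assumes that pandoc is installed and that the input file name ends in .docbook:

```shell
# docbook_to_odt: hypothetical helper that converts one Simplified Docbook
# file to LibreOffice (ODT) format and prints the output file name.
# Assumes pandoc is installed and the input name ends in .docbook.
docbook_to_odt() {
  odt=${1%.docbook}.odt
  pandoc --from=docbook --to=odt "$1" -o "$odt" && echo "$odt"
}
```

For example, docbook_to_odt article.docbook would write article.odt next to the original file and print that name only if pandoc exits successfully.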

Printing to PDF

To convert the Word document to a PDF file, I used a command-line option with LibreOffice. The --convert-to pdf option does what its name implies: it converts a file to PDF. This is the same process as if you opened the Word document in LibreOffice and immediately used the File > Export As > Export Directly as PDF menu action. The --convert-to option implies the --headless option, which runs LibreOffice without opening the desktop interface, so spelling out --headless as shown here is optional.

$ libreoffice --headless --convert-to pdf article.docx

LibreOffice will print a one-line message as it converts the file.

Putting it all together

I wanted to automate all of these steps so I only need to run one command to convert a Simplified Docbook file to PDF. One way to do this is by running each command on its own, with the simple assumption that each step will run without errors. But the point of using xmllint is to catch any errors I might have made in my file, so I don't run pandoc with an invalid file. And if pandoc encounters an error, even with a correct XML file, I don't want to run LibreOffice to convert any output to PDF.

Checking for errors at each step requires using the if-then feature in the Bash shell. You should have Bash installed automatically if you run Linux or macOS, but you may need to update this script if you run another system like Windows. The if statement tests a condition, then runs a command if that condition is true. For example, to check if the previous command ran successfully (returned zero for its exit status), you can use the test [ $? -eq 0 ], where $? expands to the exit status of the most recent command.
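
To see the exit status test on its own, here is a small sketch that uses the standard true and false commands as stand-ins for a step that succeeds and a step that fails. The run_and_report function is a hypothetical name used only for this illustration:

```shell
# run_and_report: hypothetical helper that runs a command, then tests $?
# to decide which message to print.
run_and_report() {
  "$@"
  if [ $? -eq 0 ] ; then
    echo "success"
  else
    echo "failure"
  fi
}

run_and_report true     # prints "success"
run_and_report false    # prints "failure"
```

Note that $? must be tested immediately after the command of interest, because every later command overwrites it with its own status.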

You can also use several if statements inside each other. And that's the feature I used to create my script: it runs xmllint to verify the XML file; if successful, the script runs pandoc to convert from Simplified Docbook to Word format. If that command exits successfully, the script then runs LibreOffice to convert the output to a PDF file.

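#!/bin/bash
# wrapper to convert a list of files, from Docbook to PDF, using
# pandoc (Docbook to docx) and LibreOffice (docx to PDF)
# note: sdocbook.dtd is looked up relative to the current directory
for file in "$@" ; do
  # output name: replace a trailing .docbook with .docx
  word=${file%.docbook}.docx
  # step 1: validate against the Simplified Docbook DTD
  xmllint --noout --dtdvalid sdocbook.dtd "$file"
  if [ $? -eq 0 ] ; then
    # step 2: convert Docbook to Word, only if validation passed
    pandoc --from=docbook --to=docx "$file" -o "$word"
    if [ $? -eq 0 ] ; then
      # step 3: convert Word to PDF, only if pandoc succeeded
      libreoffice --headless --convert-to pdf "$word"
    fi
  fi
done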
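
The word=${file%.docbook}.docx line in the script uses shell parameter expansion to build the output file name: the % operator removes a matching suffix from the end of the value. The following sketch isolates just the loop and the name handling, so you can see the result without running any conversion. The list_outputs function is a hypothetical name used only here:

```shell
# list_outputs: hypothetical helper that prints the .docx name the script
# would use for each argument, without converting anything.
list_outputs() {
  for file in "$@" ; do
    echo "${file%.docbook}.docx"
  done
}

list_outputs article.docbook "my notes.docbook" notes.xml
# prints:
# article.docx
# my notes.docx
# notes.xml.docx
```

The last line shows that a file without the .docbook extension keeps its full name and simply gains .docx. The quoted "$@" keeps a file name with spaces in one piece.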

This script also uses the for feature in Bash to process multiple files in a single run. I saved the script as sdocbook, which lets me quickly process one or more Simplified Docbook files to generate PDF files:

$ sdocbook short.docbook
convert /home/jhall/Documents/docbook/short.docx as a Writer document -> /home/jhall/Documents/docbook/short.pdf using filter : writer_pdf_Export

Figure: My Simplified Docbook file as a PDF