XML as a document markup language

Because XML is extensible, you can define your own XML tags to create a custom document markup language.

XML is a widely used file format that can hold all kinds of data. As the name "Extensible Markup Language" implies, you can extend XML with your own tags to describe whatever data you need. The key to using XML is that the system creating the XML file and the system reading it must both understand the rules for how the data is arranged using tags.

An XML file is really just a one-line XML declaration followed by a single parent data block (the root element). The data block may contain other data elements or data blocks within it. What each block is called and how it is used is up to you. For example, a minimal XML file might contain only an empty data block called <data>, like this:

<?xml version="1.0" encoding="UTF-8"?>
<data>
</data>
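
To see the "reading" side of that agreement, here is a minimal sketch that loads the file above with Python's standard xml.etree.ElementTree module. The variable name XML_TEXT is made up for this illustration, and Python is only one of many languages that can read XML.

```python
import xml.etree.ElementTree as ET

# The minimal file from above, embedded as bytes so the example is self-contained.
XML_TEXT = b"""<?xml version="1.0" encoding="UTF-8"?>
<data>
</data>
"""

root = ET.fromstring(XML_TEXT)

print(root.tag)   # name of the parent data block: data
print(len(root))  # number of data blocks nested inside it: 0
```

The parser only confirms that the file is well-formed XML. What a <data> block means is still up to the programs that write and read it.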

XML as document markup

XML doesn't have to be used only for transmitting data. You can also define an XML file that contains document data. In this way, XML becomes a document markup language.

To demonstrate how XML might be used for document markup, let's imagine our own document markup system. In this example, we might use a generic tag like <document> as our parent data block. Everything else in the document, from metadata to actual content, resides inside this parent data block.

In our imaginary markup language, we might collect all document metadata in a single data block called <metainfo>. This block could capture any data we need to describe the document, such as the title, author, summary, publication date, and so on. For our demonstration, we'll define only three tags: <title> for the title of the document, <author> for the author, and <date> for the publication date.

We need a separate data block for the body of our document. Let's call that the <content> section. To keep our example simple, we'll allow only <paragraph> for body paragraphs. And within a paragraph, we'll support only two basic kinds of text formatting: bold and italic text, using <bold> and <italics>.

This XML markup language uses rather verbose tags. That might make it cumbersome for writing a larger document such as a book or even an article, but it suits our purposes for demonstrating the made-up markup language, because the meaning behind each tag is clear.
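
The rules of our imaginary markup language can be written down as a small table of which tags may appear inside which. The sketch below is one possible way to check a document against that table in Python. The names ALLOWED and check are invented for this example, and it assumes that <bold> and <italics> hold only text, which the description above does not spell out.

```python
import xml.etree.ElementTree as ET

# Allowed child tags for each element in the made-up markup language.
# Assumption: inline tags and metadata tags contain text only.
ALLOWED = {
    "document": {"metainfo", "content"},
    "metainfo": {"title", "author", "date"},
    "content": {"paragraph"},
    "paragraph": {"bold", "italics"},
    "title": set(), "author": set(), "date": set(),
    "bold": set(), "italics": set(),
}

def check(elem, errors=None):
    """Collect a message for every child tag the rules do not allow."""
    if errors is None:
        errors = []
    for child in elem:
        if child.tag not in ALLOWED.get(elem.tag, set()):
            errors.append(f"<{child.tag}> is not allowed inside <{elem.tag}>")
        check(child, errors)
    return errors

good = ET.fromstring(
    "<document><content><paragraph>Plain and <bold>bold</bold> text."
    "</paragraph></content></document>"
)
bad = ET.fromstring(
    "<document><content><heading>Oops</heading></content></document>"
)

print(check(good))  # an empty list: nothing breaks the rules
print(check(bad))   # one message about <heading> inside <content>
```

A real system would more likely express these rules in a schema language such as a DTD, but a table like this is enough to show that the reader and the writer need to share the same rules.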

It might be easiest to demonstrate this new markup language by drafting a new document in it. Here is a brief sample document, with text borrowed from an earlier article about the dot matrix printers of the 1980s:

<?xml version="1.0" encoding="UTF-8"?>
<document>

<metainfo>
<title>About dot matrix printing</title>
<author>Jim Hall</author>
<date>October 2, 2023</date>
</metainfo>

<content>

<paragraph>The <italics>dot matrix</italics> or "impact" printer used a head
that passed left and right over the paper, and had a series of vertical pins
that would strike the paper at the right positions as the head moved to form
an array of dots. By printing an arrangement of dots next to each other, the
dot matrix printer could produce letters.</paragraph>

<paragraph>Epson introduced the popular <bold>MX-80 F/T</bold> series printer
in 1980, and the <bold>FX-80</bold> series printer a few years later. Both
were <italics>dot matrix</italics> printers.</paragraph>

<paragraph>While other dot matrix printers preceded these to the market, the
Epson series of dot matrix printers were omnipresent. Throughout the 1980s, it
seemed all DOS software assumed an Epson dot matrix printer as a default. The
Epson printers were everywhere: in schools, in homes, and in
offices.</paragraph>

<paragraph>I grew up in the 1980s, and we had the Epson FX-80 at home. The
FX-80 was a reliable yet affordable printer that made it a good fit for most
casual print jobs.</paragraph>

</content>
</document>
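
Because the sample document follows a known structure, a program can pull out the pieces it needs by name. As a rough sketch, the Python code below reads the metadata and the paragraph text with the standard xml.etree.ElementTree module. To keep the example short, SAMPLE is an abbreviated copy of the document above with only one of its paragraphs.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document: same metadata, one paragraph.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<metainfo>
<title>About dot matrix printing</title>
<author>Jim Hall</author>
<date>October 2, 2023</date>
</metainfo>
<content>
<paragraph>Epson introduced the popular <bold>MX-80 F/T</bold> series printer
in 1980, and the <bold>FX-80</bold> series printer a few years later. Both
were <italics>dot matrix</italics> printers.</paragraph>
</content>
</document>
"""

root = ET.fromstring(SAMPLE)

# findtext() returns the text of the first element that matches the path.
title = root.findtext("metainfo/title")
author = root.findtext("metainfo/author")
date = root.findtext("metainfo/date")

# itertext() yields a paragraph's text with the inline tags dropped.
paragraphs = ["".join(p.itertext()) for p in root.findall("content/paragraph")]

# iter() visits every <bold> element, however deeply it is nested.
bold_terms = [b.text for b in root.iter("bold")]

print(f"{title}, by {author} ({date})")
print(len(paragraphs), "paragraph(s); bold terms:", bold_terms)
```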

Transforming the XML document

With XML as a markup language, you can easily translate or transform a document from XML into some other format. A transformation engine or program reads the XML file and applies a set of rules to generate a new output file.

Output to HTML

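One possible transformation converts the XML markup into HTML. This is simple enough to do by hand, because the document markup is not very complex: each <paragraph> becomes a <p>, <bold> becomes <b>, <italics> becomes <i>, and the metadata fills in the page title and header. The HTML output might look like this: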

<!DOCTYPE html>
<html lang="en">

<head>
<title>About dot matrix printing</title>
</head>

<body>

<header>
<h1>About dot matrix printing</h1>
<p>Jim Hall</p>
<p>October 2, 2023</p>
</header>

<main>

<p>The <i>dot matrix</i> or "impact" printer used a head that passed left and
right over the paper, and had a series of vertical pins that would strike the
paper at the right positions as the head moved to form an array of dots. By
printing an arrangement of dots next to each other, the dot matrix printer
could produce letters.</p>

<p>Epson introduced the popular <b>MX-80 F/T</b> series printer in 1980, and
the <b>FX-80</b> series printer a few years later. Both were <i>dot matrix</i>
printers.</p>

<p>While other dot matrix printers preceded these to the market, the Epson
series of dot matrix printers were omnipresent. Throughout the 1980s, it
seemed all DOS software assumed an Epson dot matrix printer as a default. The
Epson printers were everywhere: in schools, in homes, and in offices.</p>

<p>I grew up in the 1980s, and we had the Epson FX-80 at home. The FX-80 was a
reliable yet affordable printer that made it a good fit for most casual print
jobs.</p>

</main>
</body>
</html>

As viewed in a web browser, the output looks like this:

[Figure: Sample output from an HTML transformation]
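
The same HTML conversion can also be automated. The following is a minimal sketch in Python, not a complete transformation engine: the function names inline_html and to_html and the INLINE rule table are invented for this example, the page language is hard-coded as "en", and SAMPLE is an abbreviated copy of the sample document with one paragraph.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abbreviated copy of the sample document: same metadata, one paragraph.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<metainfo>
<title>About dot matrix printing</title>
<author>Jim Hall</author>
<date>October 2, 2023</date>
</metainfo>
<content>
<paragraph>Epson introduced the popular <bold>MX-80 F/T</bold> series printer
in 1980, and the <bold>FX-80</bold> series printer a few years later. Both
were <italics>dot matrix</italics> printers.</paragraph>
</content>
</document>
"""

# Rule table for inline formatting: source tag -> HTML tag.
INLINE = {"bold": "b", "italics": "i"}

def inline_html(elem):
    """Render an element's text and inline children as an HTML fragment."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        tag = INLINE[child.tag]  # KeyError for a tag outside our language
        parts.append(f"<{tag}>{inline_html(child)}</{tag}>")
        # The tail is the text that follows the child's closing tag.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)

def to_html(root):
    """Build a complete HTML page from a parsed <document> element."""
    title = escape(root.findtext("metainfo/title"), quote=False)
    author = escape(root.findtext("metainfo/author"), quote=False)
    date = escape(root.findtext("metainfo/date"), quote=False)
    lines = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        f"<title>{title}</title>",
        "</head>",
        "<body>",
        "<header>",
        f"<h1>{title}</h1>",
        f"<p>{author}</p>",
        f"<p>{date}</p>",
        "</header>",
        "<main>",
    ]
    for para in root.findall("content/paragraph"):
        lines.append(f"<p>{inline_html(para)}</p>")
    lines += ["</main>", "</body>", "</html>"]
    return "\n".join(lines) + "\n"

html_out = to_html(ET.fromstring(SAMPLE))
print(html_out)
```

The detail to watch is the text that follows an inline tag. ElementTree stores it on the child element as its "tail," so the sketch has to emit that text right after the closing tag.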

Output to LaTeX

Another transformation might be to translate the XML markup document into LaTeX, so the document can be processed using that system. The transformation process would convert the XML tags into the appropriate LaTeX markup, with blank lines between paragraphs. For example, the output might look like this:

\documentclass{article}
\begin{document}

\title{About dot matrix printing}
\author{Jim Hall}
\date{October 2, 2023}
\maketitle

The \textit{dot matrix} or ``impact'' printer used a head that passed left
and right over the paper, and had a series of vertical pins that would
strike the paper at the right positions as the head moved to form an
array of dots. By printing an arrangement of dots next to each other,
the dot matrix printer could produce letters.

Epson introduced the popular \textbf{MX-80 F/T} series printer in 1980,
and the \textbf{FX-80} series printer a few years later. Both were
\textit{dot matrix} printers.

While other dot matrix printers preceded these to the market, the Epson
series of dot matrix printers were omnipresent. Throughout the 1980s,
it seemed all DOS software assumed an Epson dot matrix printer as a
default. The Epson printers were everywhere: in schools, in homes,
and in offices.

I grew up in the 1980s, and we had the Epson FX-80 at home. The FX-80
was a reliable yet affordable printer that made it a good fit for most
casual print jobs.

\end{document}

Using LaTeX to process the new file results in this finished document:

[Figure: Sample output from a LaTeX transformation]
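
A program can generate the LaTeX file in much the same way. The sketch below is one possible approach in Python, under a few assumptions: the names tex, inline_latex, to_latex, INLINE, and LATEX_SPECIALS are invented for this example, SAMPLE is an abbreviated copy of the sample document with one paragraph, and the code escapes LaTeX's special characters but makes no attempt at typographic quotes.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document: same metadata, one paragraph.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<metainfo>
<title>About dot matrix printing</title>
<author>Jim Hall</author>
<date>October 2, 2023</date>
</metainfo>
<content>
<paragraph>Epson introduced the popular <bold>MX-80 F/T</bold> series printer
in 1980, and the <bold>FX-80</bold> series printer a few years later. Both
were <italics>dot matrix</italics> printers.</paragraph>
</content>
</document>
"""

# Rule table for inline formatting: source tag -> LaTeX command name.
INLINE = {"bold": "textbf", "italics": "textit"}

# Characters that LaTeX treats specially, and a safe replacement for each.
LATEX_SPECIALS = str.maketrans({
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
})

def tex(text):
    """Escape plain text so LaTeX prints it literally."""
    return (text or "").translate(LATEX_SPECIALS)

def inline_latex(elem):
    """Render an element's text and inline children as LaTeX."""
    parts = [tex(elem.text)]
    for child in elem:
        parts.append("\\" + INLINE[child.tag] + "{" + inline_latex(child) + "}")
        parts.append(tex(child.tail))  # text after the child's closing tag
    return "".join(parts)

def to_latex(root):
    """Build a complete LaTeX file from a parsed <document> element."""
    lines = [
        r"\documentclass{article}",
        r"\begin{document}",
        "",
        r"\title{" + tex(root.findtext("metainfo/title")) + "}",
        r"\author{" + tex(root.findtext("metainfo/author")) + "}",
        r"\date{" + tex(root.findtext("metainfo/date")) + "}",
        r"\maketitle",
        "",
    ]
    for para in root.findall("content/paragraph"):
        lines += [inline_latex(para), ""]  # blank line ends each paragraph
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"

latex_out = to_latex(ET.fromstring(SAMPLE))
print(latex_out)
```

Only the rule table and the surrounding boilerplate differ from an HTML conversion. The way the program walks the XML document stays the same, which is what makes XML a convenient source format.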

Direct transformation

Translating XML into another markup language isn't the only way to process an XML document. It's also possible to build a separate markup processor that reads the XML input and applies its own formatting rules to generate a custom output file. This type of direct transformation is likely rare in 2023 (do we need another document processing system?) but it is possible. For example, one such processing system might transform the XML document into a PostScript file, ready for printing:

[Figure: Sample output from a direct transformation]
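
To give a feel for what a direct transformation involves, here is a deliberately tiny sketch in Python that writes PostScript itself instead of handing the layout to another system. It is not the processor that produced the output shown above. The names ps_escape and to_postscript are invented for this example, SAMPLE is an abbreviated copy of the sample document, and the layout rules are assumptions chosen to keep the code short: 10-point Courier, lines wrapped at 72 characters, a single page, and bold and italic formatting dropped.

```python
import textwrap
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document: same metadata, one paragraph.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<metainfo>
<title>About dot matrix printing</title>
<author>Jim Hall</author>
<date>October 2, 2023</date>
</metainfo>
<content>
<paragraph>Epson introduced the popular <bold>MX-80 F/T</bold> series printer
in 1980, and the <bold>FX-80</bold> series printer a few years later. Both
were <italics>dot matrix</italics> printers.</paragraph>
</content>
</document>
"""

def ps_escape(text):
    """Escape the characters that are special inside a PostScript string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def to_postscript(root, width=72, left=72, top=720, leading=12):
    """Lay out the document as plain text lines on one PostScript page."""
    lines = ["%!PS", "/Courier findfont 10 scalefont setfont"]
    # Metadata first, then each paragraph with its inline tags flattened.
    blocks = [
        root.findtext("metainfo/title"),
        root.findtext("metainfo/author"),
        root.findtext("metainfo/date"),
    ]
    blocks += ["".join(p.itertext()) for p in root.findall("content/paragraph")]
    y = top  # vertical position in points, measured from the bottom edge
    for block in blocks:
        # Collapse the source line breaks, then wrap to a fixed column width.
        for line in textwrap.wrap(" ".join(block.split()), width):
            lines.append(f"{left} {y} moveto ({ps_escape(line)}) show")
            y -= leading
        y -= leading  # leave a blank line between blocks
    lines.append("showpage")
    return "\n".join(lines) + "\n"

ps_out = to_postscript(ET.fromstring(SAMPLE))
print(ps_out)
```

Even this toy version has to make decisions that HTML and LaTeX normally make for you, such as where each line breaks and where it sits on the page. That extra work is a large part of why direct transformation is uncommon.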

XML means extensible markup

When you use XML as document markup, the transformation might pass through an intermediate format. For example, translating XML into LaTeX requires the extra step of processing the LaTeX file to produce the final output. Alternatively, a custom document processing system might read the XML input and generate the output file directly, as in the last example that transformed the document into a PostScript file.

XML is more than just a data container. You can use XML to describe many different kinds of documents. And because XML is extensible, you can define your own tags to create a custom document markup language, which you can translate or transform into different output types.