Digging into tagged PDFs

PDF provides an excellent way to capture print-like output for online viewing. Tagged PDF is the format that specifies marked content.

January 26, 2024

HTML was originally designed as a file format that merely contains the logical structure of a document. End users could format it in a way that was most suitable for them. For example people with reading disabilities could make the text bigger or even use a screen reader. As time went on web site developers wanted pixel perfect control over the layout on end user machines whether this made sense or not. This led to inventing a side channel to control layout. Since HTML was not originally designed for visual design, this caused an impedance mismatch which meant a lot of work and head scratching to make it work. There is no "proper" solution so problems persist to this day.

PDF was originally designed as a file format for pixel perfect layout of graphics on every conceivable device. In this way people could be sure that their design was not randomly mangled along the way. As time went on people wanted to make PDF documents more broadly usable, for example to be able to copy/paste text out of them and to expose the logical structure of the document to end users to the benefit of e.g. people with disabilities. This led to inventing a side channel to describe structure but since PDF was not originally designed for semantic content, this caused an impedance mismatch which meant a lot of work and head scratching to make it work. There is no "proper" solution so problems persist to this day.

Both of these formats also use JavaScript, but let's not go there.

In the case of PDF, the logical format is called "tagged PDF" and is implemented by writing magic tags and commands in the PDF stream to specify "marked content". A PDF generator also has to write many different dictionaries and arrays all of which have criss-cross-references to each other to make it work. Or that's my theory at least, since I have been unable to prove that CapyPDF's tagged PDF generation actually works. At best, what can be said is that no PDF processor I have used it with has reported errors.

Going through these less traversed parts of the file format teaches you quite quickly that the PDF specification varies wildly in quality. For example let's look at the aforementioned references. PDF has both dictionaries and arrays as native data types. Thus if you have to map arbitrary text keys to values you'd use a dictionary whereas mapping consecutive integers from zero upwards you'd use an array. Seems simple, right?

One of the data structures needed for tagged PDF has gone beyond and reinvented this fairly simple concept. It has keys counting from zero upwards. Not only does the specification say that consecutive integers are needed, it even says that the PDF producer must write to a separate dictionary a key called ParentTreeNextKey. It goes on to say that when a new entry is added to this array (nee dictionary) then it must use the given key for the value. A more logical name for this would be ArraySize but that is not even the worst of it.

This array is actually a key-value store where every other entry is the key (or "index") as an integer and every other entry is the corresponding value. Yes, this means that the contents of the array look like [ 0 "value0" 1 "value1" 2 "value2" … ]. The actual values happen to also be index arrays, but they contain only values. In case you don't believe me, here is a screenshot from the official PDF specification:

screenshot from PDF spec — The PDF specification says the index must be an element in the first array

Presumably the rationale is that you could leave some elements out of the array. A simpler approach would have been to store an empty array as a placeholder value, but one should not meddle with the affairs of adobeans, for they are subtle and quick to anger.

Fortunately, declaring a PDF as "tagged" is simple. There is a specific key in one of the metadata dictionaries, and when that is set to true, the file is considered tagged. The pdfinfo program agrees with this assessment:

screenshot from pdfinfo output — Yes, pdfinfo says this is a tagged PDF

Good, good. Just to be sure, let's verify that it behaves the same on Acrobat Reader:

screenshot from Acrobat Reader — Why isn't this recognized as a tagged PDF?

I wonder if I still have some leftover glögi?

This article is adapted from Tagged PDF funsies. Republished with permission from Jussi Pakkanen.

Jussi Pakkanen has worked on almost all layers of the Linux software stack starting from the kernel to system development, packaging, developer tools, UI development and web work. Jussi is also the creator of Meson, an open source build system.