Fixing quotes after pandoc

Use this Bash script to convert curled quotes to plain text when using pandoc.

I like to write the first draft of anything in Markdown, then convert it to whatever output format I need using the pandoc command line tool. Using pandoc lets me convert between almost any document format, including Markdown, HTML, DOCX, ODT, LaTeX, asciidoc, and PDF. Most of the time, I convert from Markdown to ODT, or Markdown to HTML.

By default, pandoc converts straight quotes and apostrophes into curled quotes and curled apostrophes. This looks really nice, and is what I intend, when I convert to DOCX or ODT. But when I export a Markdown file into HTML, I prefer the straight quotes.

Not a perfect conversion

I can add the --ascii=true option when converting to HTML with pandoc, but this isn't a perfect conversion. Here's a sample Markdown file to show what I mean:

I don't want "curled quotes" in HTML---I prefer straight quotes.

If I use pandoc to convert this from Markdown to HTML, I get the curled quotes by default:

$ pandoc --from markdown --to html file
<p>I don’t want “curled quotes” in HTML—I prefer straight quotes.</p>

Using the --ascii=true option generates code points for the curled quotes and other special characters like the em dash:

$ pandoc --from markdown --to html --ascii=true file
<p>I don&#x2019;t want &#x201C;curled quotes&#x201D; in HTML&#x2014;I prefer straight quotes.</p>

The &#x entities are numeric character references that give Unicode code points in hexadecimal. They display correctly in any modern font such as Noto Sans, which should include glyphs for these characters. But if the website specifies a legacy font without full Unicode coverage, these characters may not be defined, resulting in empty boxes instead of actual characters (these are called "tofu").

Editing the output

I wish pandoc generated standard HTML entities like &mdash; for an em dash, &rsquo; for a right single quote, and &ldquo; and &rdquo; for left and right double quotes, but these are easily translated using a separate command like sed.

On Unix-like systems, sed is the stream editor, and it is most often used to convert text that has been sent to it on the command line. Like many other Unix commands that work with text, sed uses regular expressions to find the text you want to replace, although in this case we can do the job with just regular text.

The basic sed command to change text is the s instruction, which substitutes one piece of text for another. Each sed instruction on the command line is introduced with -e, which marks it as an expression to evaluate (with a single instruction the -e is optional, but you need it once you chain several). Put the text you want to find and its replacement between slashes, like this to capitalize the h in "hello world":

$ echo hello world | sed -e 's/h/H/'
Hello world

The s instruction only acts on the first occurrence of the matching text on each line. To change every occurrence on a line, add g to the end of the instruction, which makes it global for the line. Here's the same text, but capitalizing every l in "hello world":

$ echo hello world | sed -e 's/l/L/g'
heLLo worLd
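Because the script later in this article chains several s instructions, it may help to see two of them combined first. This is a small sketch that reuses the same "hello world" text; sed applies the expressions to each line in the order they are given.

```shell
# Apply two expressions in order: capitalize the first h, then every l.
echo hello world | sed -e 's/h/H/' -e 's/l/L/g'
```

This prints HeLLo worLd, the result of running both earlier examples in a single sed command.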

Fixing the quotes

With this basic usage of sed, we can create a series of s instructions to change every instance of curled quotes and other special characters that pandoc produces when writing HTML. By experimenting, I find that these are the special characters from my documents that I want to make plain text in HTML:

  • &#x201C; and &#x201D; (")
  • &#x2018; and &#x2019; and &#39; (' or &apos;)
  • &#x2014; (&mdash;)
  • &#x2013; (&ndash;)
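One way to do this kind of experimenting is to list every numeric character reference that appears in the HTML. The following is a minimal sketch, not part of the original script: list_refs is a hypothetical helper name, and it assumes a grep that supports the -o option (GNU and BSD grep both do).

```shell
# list_refs: hypothetical helper that reads HTML on standard input and
# prints each distinct numeric character reference with a count.
list_refs() {
    grep -oE '&#[xX]?[0-9A-Fa-f]+;' | sort | uniq -c
}

# Try it on the sample line that pandoc produced with --ascii=true:
echo '<p>I don&#x2019;t want &#x201C;curled quotes&#x201D; in HTML&#x2014;I prefer straight quotes.</p>' | list_refs
```

For a real document, the same helper could sit at the end of a pandoc pipeline, for example pandoc --from markdown --to html --ascii=true file | list_refs.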

To automate these edits with sed, I created a Bash script called markdown that processes one or more Markdown files with pandoc and prints the HTML to the terminal:

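#!/bin/bash

# Convert Markdown to HTML, then turn pandoc's numeric character
# references into plain quotes and named HTML entities.
# In a sed replacement, & means "the matched text", so it is escaped as \&.
pandoc --from markdown --to html --ascii=true "$@" | sed \
  -e 's/&#x201C;/"/g' -e 's/&#x201D;/"/g' \
  -e 's/&#x2018;/\&apos;/g' -e 's/&#x2019;/\&apos;/g' -e 's/&#39;/\&apos;/g' \
  -e 's/&#x2014;/\&mdash;/g' -e 's/&#x2013;/\&ndash;/g'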

With this script, my HTML truly uses plain ASCII characters:

$ markdown file
<p>I don&apos;t want "curled quotes" in HTML&mdash;I prefer straight quotes.</p>
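To check the sed stage without running pandoc, it can be split into a shell function and fed the sample line from earlier. This is a sketch under a few assumptions: fix_quotes is a hypothetical name, and the bracket expressions [CD] and [89] are a compaction of my own that merges pairs of the original s instructions. Note that in the replacement half, a bare & means "the matched text", so a literal ampersand is written as \&.

```shell
# fix_quotes: hypothetical name for the sed stage of the script,
# split out so it can be tested on its own.
fix_quotes() {
    sed -e 's/&#x201[CD];/"/g' \
        -e 's/&#x201[89];/\&apos;/g' \
        -e 's/&#39;/\&apos;/g' \
        -e 's/&#x2014;/\&mdash;/g' \
        -e 's/&#x2013;/\&ndash;/g'
}

# Feed it the HTML that pandoc generated with --ascii=true:
echo '<p>I don&#x2019;t want &#x201C;curled quotes&#x201D; in HTML&#x2014;I prefer straight quotes.</p>' | fix_quotes
```

The function prints the same line as the script output shown above, which suggests the substitutions behave as intended before pandoc is involved at all.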

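If I want to save the output to an HTML file (which is more likely), I just need to remember not to use the -o option: the script passes its arguments straight to pandoc, so -o would make pandoc write the file directly and skip the sed edits. Instead, I redirect the output to a file on the command line: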

$ markdown file > file.html
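When there are several files to convert, the redirect can be wrapped in a loop. The sketch below assumes the markdown script from this article is on your PATH and that the input files end in .md; html_name and convert_all are hypothetical helper names.

```shell
# html_name: hypothetical helper that maps draft.md to draft.html.
html_name() {
    printf '%s\n' "${1%.md}.html"
}

# convert_all: run the markdown script on each file and redirect the
# output to a matching .html file next to the source.
convert_all() {
    for f in "$@"; do
        markdown "$f" > "$(html_name "$f")"
    done
}
```

With these functions defined, convert_all *.md would write one HTML file per Markdown file in the current directory.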

Using well-recognized HTML entities like &apos; and &mdash; means this HTML output should display correctly with any font, even legacy fonts without full Unicode coverage. If you have other special characters that you use in your documents, you can use the same method to convert them.
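As one example of extending the method, pandoc also turns three periods into an ellipsis character, which --ascii=true writes as &#x2026;. A sketch of the extra s instruction, assuming you want the named entity &hellip; instead:

```shell
# Replace the numeric reference for an ellipsis with the named entity.
# The & in the replacement is escaped so sed treats it literally.
echo 'Wait&#x2026; what?' | sed -e 's/&#x2026;/\&hellip;/g'
```

This prints Wait&hellip; what? and the same -e expression could be appended to the sed command in the script.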