Fixing quotes after pandoc
Use this Bash script to convert curled quotes to plain text when using pandoc.
I like to write the first draft of anything in Markdown, then convert it to whatever output format I need using the pandoc command line tool. Using pandoc lets me convert between almost any document format, including Markdown, HTML, DOCX, ODT, LaTeX, asciidoc, and PDF. Most of the time, I convert from Markdown to ODT, or Markdown to HTML.
By default, pandoc converts straight quotes and apostrophes into curled quotes and curled apostrophes. This looks really nice, and is what I intend, when I convert to DOCX or ODT. But when I export a Markdown file into HTML, I prefer the straight quotes.
Not a perfect conversion
While I can add the --ascii=true option when I use pandoc to convert to HTML, this isn't a perfect conversion. Here's a sample Markdown file to show what I mean:
I don't want "curled quotes" in HTML---I prefer straight quotes.
If I use pandoc to convert this from Markdown to HTML, I get the curled quotes by default:
$ pandoc --from markdown --to html file
<p>I don’t want “curled quotes” in HTML—I prefer straight quotes.</p>
Using the --ascii=true option generates code points for the curled quotes and other special characters like the em dash:
$ pandoc --from markdown --to html --ascii=true file
<p>I don&#x2019;t want &#x201C;curled quotes&#x201D; in HTML&#x2014;I prefer straight quotes.</p>
The &#x entities are numeric character references that give each character's Unicode code point in hexadecimal. These are valid for any modern font such as Noto Sans, which should include Unicode character mappings. But if the website specifies a legacy font without full Unicode coverage, these characters may not be defined, resulting in empty boxes instead of actual characters (these are called "tofu").
Editing the output
I wish pandoc generated standard HTML entities like &mdash; for an em dash, &rsquo; for a right single quote, and &ldquo; and &rdquo; for left and right double quotes, but these are easily translated using a separate command like sed.
On Unix-like systems, sed is the stream editor, and it is most often used to convert text that has been sent to it on the command line. Like many other Unix commands that work with text, sed uses regular expressions to find the text you want to replace, although in this case we can do the job with just regular text.
The basic sed command to change text is the s instruction. This substitutes ("swaps") one piece of text for another. Start each sed instruction with -e to mark it as an expression to evaluate. Put the text you want to find and its replacement between slashes, like this to capitalize the h in "hello world":
$ echo hello world | sed -e 's/h/H/'
Hello world
The s instruction only acts on the first occurrence of the matching text on a line. To change every occurrence on a line, add g to the end of the instruction; this makes it global for the line. Here's the same text, but capitalizing every l in "hello world":
$ echo hello world | sed -e 's/l/L/g'
heLLo worLd
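Several s instructions can also be chained in a single sed call, each behind its own -e, and they are applied in order to every line. A small sketch that reuses the same sample text:

```shell
# Two instructions, applied left to right: capitalize the first h,
# then capitalize every l on the line.
echo hello world | sed -e 's/h/H/' -e 's/l/L/g'
# prints: HeLLo worLd
```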
Fixing the quotes
With this basic usage of sed, we can create a series of s instructions to change every instance of curled quotes and other special characters that pandoc produces when writing HTML. By experimenting, I find that these are the special characters from my documents that I want to make plain text in HTML:
- &#x201C; and &#x201D; (")
- &#x2018; and &#x2019; and &#39; (' or &apos;)
- &#x2014; (&mdash;)
- &#x2013; (&ndash;)
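One way to do that experimenting is to list every hexadecimal character reference in the HTML and count how often each one appears. This is only a sketch: the sample line stands in for real pandoc output so that it runs without pandoc, list_refs is a made-up helper name, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
# Stand-in for the output of: pandoc --to html --ascii=true file
sample='<p>I don&#x2019;t want &#x201C;curled quotes&#x201D; in HTML&#x2014;I prefer straight quotes.</p>'

# Hypothetical helper: -o prints each match on its own line,
# then sort | uniq -c counts how often each reference occurs.
list_refs() {
  grep -o '&#x[0-9A-Fa-f]*;' | sort | uniq -c
}

printf '%s\n' "$sample" | list_refs
```

Each line of the result is a count followed by one reference, which gives a checklist of the characters that still need an s instruction.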
To automate these edits with sed, I created a Bash script called markdown that processes one or more Markdown files with pandoc and prints the HTML to the terminal:
#!/bin/bash
pandoc --from markdown --to html --ascii=true "$@" | sed -e 's/\&#x201C;/"/g' -e 's/\&#x201D;/"/g' -e 's/\&#x2018;/\&apos;/g' -e 's/\&#x2019;/\&apos;/g' -e 's/\&#39;/\&apos;/g' -e 's/\&#x2014;/\&mdash;/g' -e 's/\&#x2013;/\&ndash;/g'
With this script, my HTML truly uses plain ASCII characters:
$ markdown file
<p>I don&apos;t want "curled quotes" in HTML&mdash;I prefer straight quotes.</p>
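A detail worth knowing when writing replacements like these: in the replacement half of an s instruction, an unescaped & stands for the text that was matched, so an entity's leading ampersand needs a backslash there. The sketch below uses a stand-in string rather than real pandoc output, so it runs on its own:

```shell
# Stand-in for a fragment of pandoc --ascii output.
sample='HTML&#x2014;I prefer'

# Escaped: \& inserts a literal ampersand.
printf '%s\n' "$sample" | sed -e 's/&#x2014;/\&mdash;/g'
# prints: HTML&mdash;I prefer

# Unescaped: & is replaced by the matched text, which garbles the entity.
printf '%s\n' "$sample" | sed -e 's/&#x2014;/&mdash;/g'
# prints: HTML&#x2014;mdash;I prefer
```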
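If I want to save the output to an HTML file (which is more likely), I just need to remember not to use the -o option: the script passes its arguments straight to pandoc, so pandoc would write the file itself and sed would never see the HTML. Instead, I use the command line to redirect the output to a file: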
$ markdown file > file.html
Using well-recognized HTML entities like &apos; and &mdash; means this HTML output should display correctly in any font, even legacy fonts without full Unicode coverage. If you have other special characters that you use in your documents, you can use the same method to convert them.
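As an illustration of extending the method, here is a sketch with two extra rules that are not part of the original script: one for the horizontal ellipsis (U+2026) and one for the non-breaking space (U+00A0). The function name fix_extra and the sample line are made up for the example, and the exact references pandoc emits for these characters are an assumption, so check them against real output first.

```shell
# Hypothetical extra rules, not part of the original script:
# horizontal ellipsis (U+2026) and non-breaking space (U+00A0).
fix_extra() {
  sed -e 's/&#x2026;/\&hellip;/g' -e 's/&#xA0;/\&nbsp;/g'
}

# Stand-in for pandoc --ascii output, so this runs without pandoc.
printf '%s\n' '<p>Wait&#x2026; 10&#xA0;km</p>' | fix_extra
# prints: <p>Wait&hellip; 10&nbsp;km</p>
```

To use rules like these for real, add them as more -e instructions at the end of the sed command in the markdown script.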