Earlier this year, the editors of The Programming Historian decided to move the site from a Wordpress installation to a static website hosted on GitHub Pages. This post is a brief overview of how we made the switch, using some of the same tools and computational methods featured in our lessons.
I’m going to focus on how we converted the HTML pages generated by our Wordpress site into Markdown files that were ready to be deployed on GitHub. In the process, I’ll show how it’s possible to build on a series of Programming Historian lessons to solve new problems. Be aware, however, that this post will be slightly more technical than our usual lessons; it may be most beneficial for readers who are already comfortable using command line tools like Pandoc and are contemplating a similar conversion for their Wordpress website.
Our new website uses a publishing platform called Jekyll to turn a repository of files written in Markdown into an HTML website. In the case of the Programming Historian, Jekyll uses this repository to generate this website. Lessons that look like this are converted by Jekyll into lessons that look like that.
Thanks to the power of Jekyll, generating our new website was easy once all of our lessons and pages were formatted correctly in Markdown. Our challenge was to get all of the HTML pages from the Wordpress site and convert them into Markdown that Jekyll could understand. This was a multi-stage process made easier by tools like Wget, Pandoc, and Python.
Downloading the Old Site with Wget
Our first step was to get HTML versions of all the pages and lessons on our old site. To do this, we used the wget
tool described by Ian Milligan in Automated Downloading with Wget. Applying that lesson to our particular problem, we were able to download all the original HTML pages from our Wordpress site. (Note that these original HTML pages, as well as most of the scripts described below, reside on the master branch of our GitHub repository. The live, current site resides on the default gh-pages branch.)
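Applied to our site, the command looked something like the following sketch (Milligan's lesson explains each option in detail; our exact invocation may have differed):

``` bash
# Politely mirror the old site: recurse through its pages, stay below
# the starting point, and pause between requests.
wget --recursive --no-parent --wait=2 http://programminghistorian.org/
```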
Preparing the Old HTML for Pandoc Conversion
For the next step—the conversion of these HTML files to Markdown—we decided to use Pandoc, a powerful tool described by Dennis Tenen and Grant Wythoff in Sustainable Authorship in Plain Text using Pandoc and Markdown.
That lesson focuses on using Pandoc to convert from Markdown into other formats, but Pandoc is also able to turn HTML into Markdown, which is what we wanted to do. It can even locate metadata in the HTML, such as the author, title, and date, and convert it into a YAML metadata block in the Markdown output that Jekyll will recognize.
But Pandoc needs some help to do this. For example, it expects to find metadata in <meta>
tags that look like this:
<meta content="Caleb McDaniel" name="author"/>
<meta content="Data Mining the Internet Archive Collection" name="title"/>
<meta content="03-03-2014" name="date"/>
<meta content="William J Turkel" name="reviewers"/>
And if you want Pandoc to put that information into a YAML block when converting to Markdown, then those tags have to be located between the <head>
and </head>
tags of the HTML document.
This deepened our challenge, however, because our original HTML files did not place metadata like author and title inside <meta>
tags. Here’s an example of how the metadata above appeared in the original HTML downloaded from our Wordpress site:
<article>
<header>
<p class="kicker">March 3, 2014</p>
<h1><a href="/lessons/data-mining-the-internet-archive">Data Mining the Internet Archive Collection</a></h1>
<p class="byline">By Caleb McDaniel</p>
<ul class="credits">
<li class="technical-reviewer">Technical Reviewer: William J Turkel</li>
So, before we could use Pandoc to convert that HTML to Markdown, we needed to modify the HTML to look more like the <meta>
examples above. And to do that, we used a Python library called Beautiful Soup, which was explained by Jeri Wieringa in her lesson, Intro to Beautiful Soup.
In that lesson, Wieringa focused on the power of Beautiful Soup to “pull particular content from a webpage” by extracting it from particular tags. In the example above, we could use a snippet of code that looks something like this to get the content of the <p class="byline">By Caleb McDaniel</p>
tag:
from bs4 import BeautifulSoup  # Beautiful Soup 4
soup = BeautifulSoup(html, 'html.parser')
author = soup.find(class_='byline')
(Note that this code snippet assumes we have already used the standard Python techniques for reading from a text file to get our original HTML from a file and assign it to the variable html
.)
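In case it’s useful, that setup step might look something like this (the filename here is just an example):

``` python
# Read one downloaded Wordpress page into a string for Beautiful Soup.
with open('data-mining-the-internet-archive.html') as f:
    html = f.read()
```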
Beautiful Soup can do more than just help us to find content in HTML, however; it can also be used to modify the HTML tree. For example, a code snippet that looks something like the one below would (1) save all the content from the original <head>
tag, (2) create a new <meta>
tag that contains the content from our author
variable, but with the leading By
stripped out; and (3) insert that new <meta>
tag into the original <head>
tag:
original_head = soup.head
author_tag = soup.new_tag('meta')
author_tag.attrs['name'] = 'author'
if author:
    # Strip the leading "By " prefix. (Note that lstrip('By ') would be
    # a bug: it removes any leading run of the characters 'B', 'y', and
    # ' ', not the literal prefix.)
    author_tag.attrs['content'] = author.string.replace('By ', '', 1)
original_head.append(author_tag)
We can follow a similar two-step procedure (find the metadata in the original HTML, and then move it to a tag where Pandoc will be able to recognize it as metadata) to modify the way that the lesson date, reviewers, and title are stored in the HTML document. We also can and did use Beautiful Soup to make other modifications to the HTML in preparation for Pandoc conversion, such as locating and discarding (or “decomposing,” in the language of Beautiful Soup) the comments section on old posts.
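For instance, the date and the comments section might be handled with snippets along these lines (a sketch rather than our actual code; the kicker class comes from the HTML excerpt above, while the comments id is hypothetical):

``` python
# Move the lesson date into a <meta> tag where Pandoc will find it.
date_tag = soup.new_tag('meta')
date_tag.attrs['name'] = 'date'
kicker = soup.find(class_='kicker')  # e.g. <p class="kicker">March 3, 2014</p>
if kicker:
    date_tag.attrs['content'] = kicker.string
    soup.head.append(date_tag)

# Locate and discard ("decompose") the comments section.
comments = soup.find(id='comments')  # hypothetical id for the comments markup
if comments:
    comments.decompose()
```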
This script shows all the changes we eventually made in this way. (See the comment lines to get a sense of what each part of the script does.) After running that script on our folder of original HTML pages downloaded from our Wordpress site, we had a new folder of modified HTML files that was ready (or at least readier) to be fed into Pandoc for conversion into Markdown files.
Converting Our Modified HTML to Markdown
Pandoc can convert an HTML file to Markdown with one simple command:
pandoc -f html -t markdown data-mining-the-internet-archive.html
But in our case, the conversion was not quite so simple. Consider what happens if you run that command on one of our modified HTML files. You would get a Markdown file that begins like this:
Lesson Goals
------------
The collections of the [Internet Archive](http://archive.org/) (IA)
include many digitized sources of interest to historians, including
[early JSTOR journal content](https://archive.org/details/jstor_ejc),
[John Adams's personal
library](https://archive.org/details/johnadamsBPL), and the [Haiti
collection](https://archive.org/details/jcbhaiti) at the John Carter
Brown Library. In short, to quote Programming Historian [Ian
Milligan](http://activehistory.ca/2013/09/the-internet-archive-rocks-or-two-million-plus-free-sources-to-explore/),
"The Internet Archive rocks."
In this lesson, you'll learn how to download files from such collections
using a Python module specifically designed for the Internet Archive.
You will also learn how to use another Python module designed for
parsing MARC XML records, a widely used standard for formatting
bibliographic metadata.
For demonstration purposes, this lesson will focus on working with the
digitized version of the [Anti-Slavery
Collection](http://archive.org/details/bplscas) at the Boston Public
Library in Copley Square. We will first download a large collection of
MARC records from this collection, and then use Python to retrieve and
analyze bibliographic information about items in the collection. For
example, by the end of this lesson, you will be able to create a list of
every named place from which a letter in the antislavery collection was
written, which you could then use for a mapping project or some other
kind of analysis.
For Whom Is This Useful?
------------------------
This intermediate lesson is good for users of the Programming Historian
who have completed general lessons on downloading files and performing
text analysis on them, but would like an applied example of these
principles. It will also be of interest to historians or archivists who
work with the MARC format or the Internet Archive on a regular basis.
Before You Begin
----------------
We will be working with two Python modules that are not included in
Python's standard library.
The first,
[internetarchive](https://pypi.python.org/pypi/internetarchive),
provides programmatic access to the Internet Archive. The second,
[pymarc](https://pypi.python.org/pypi/pymarc/), makes it easier to parse
MARC records.
The easiest way to download both is to use pip, the python package
manager. Begin by installing pip using [this Programming Historian
lesson](/lessons/installing-python-modules-pip/).
Then issue these commands at the command line: To install
internetarchive:
``` {.bash}
sudo pip install internetarchive
```
To install pymarc:
``` {.bash}
sudo pip install pymarc
```
Now you are ready to go to work!
Pandoc has converted our links and code blocks into Markdown syntax, and eliminated a lot of the unnecessary content from the beginning of our HTML file. But right away, we can notice two things that make this Markdown conversion imperfect. First, despite all the work that we did to capture the author, title, and date from the original HTML, that metadata does not appear in this output.
A less obvious problem appears in the way that Pandoc has rendered code blocks—by surrounding the code with two lines of three backticks, the first of which also contains the language in curly braces. This is what Pandoc calls a fenced code block. Its purpose is to allow keywords in the code to be highlighted with appropriate colors, as you can see in many Programming Historian lessons. But Jekyll, the engine that powers GitHub Pages, does not recognize fenced code blocks that are formatted in this way.
There are actually other, even less obvious problems with the default Markdown conversion that Pandoc produced, but I’ll focus on these two. The fixes for them illustrate Pandoc’s power and the general principles we used to improve our bulk conversion workflow.
First, let’s consider the lack of metadata in our Markdown output. Recall that we did a lot of work with Beautiful Soup to create <meta>
tags for title, reviewer, author, and date. We hoped that Pandoc would pick up that information from the modified HTML and put it into our converted Markdown file, ideally as a YAML front matter block that could be recognized by Jekyll. So what happened?
The short answer is that we needed to add two more things to our Pandoc command. The first is the --standalone
option, which is described on the Pandoc User’s Guide page.
By default, Pandoc outputs “snippets”; it focuses on converting the input text into the output format. In most cases, the metadata of the original document doesn’t affect that conversion, so Pandoc simply ignores it. In the case of HTML, Pandoc’s default behavior is to convert what is between the <body> tags into your desired output format, and simply ignore what is between the <head> tags.
One of the things that the --standalone
option does is to override that default, and instead capture any header and metadata information so that it can be put into the output document. So we should have run this Pandoc command instead of the one above:
pandoc -f html -t markdown --standalone data-mining-the-internet-archive.html
At first glance, that command won’t appear to have made a difference: you’ll get what seems to be the same Markdown output, with no metadata.
Behind the scenes, however, Pandoc is grabbing the metadata we stored in our <meta>
tags and assigning them to Pandoc template variables based on the name
attribute of these tags: for example, author
, reviewers
, and title
. (If you really want to understand what’s going on under the hood, try running the above command, with and without the --standalone
option, but changing -t markdown
to -t native
. Even without understanding the output you see, you can compare the first lines of the native output for a standalone document with the first lines of output without the standalone option. Notice that with standalone, something that looks like our metadata appears in the native output.)
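Concretely, that comparison might run like this, with head just keeping the output short:

``` bash
# Without --standalone, the native output contains only the body blocks.
pandoc -f html -t native data-mining-the-internet-archive.html | head

# With --standalone, the values from our <meta> tags appear at the top.
pandoc -f html -t native --standalone data-mining-the-internet-archive.html | head
```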
In short, --standalone
has captured the metadata we wanted and assigned it to variables. But we also need to tell Pandoc where to put that metadata in our output. To do that, we used a custom Pandoc template that looked like this:
---
title: $title$
author: $for(author)$$author$$sep$, $endfor$
date: $date$
reviewers: $reviewers$
---
$body$
You can read more about templates in the Pandoc documentation, but the important thing to note here is that we are telling Pandoc where to output our metadata variables (represented by words with dollar signs around them, like $title$
) and where to output the main body of the HTML file (represented by the $body$
variable). The words that are not wrapped in dollar signs in our template will pass literally into our output document.
We can save that template in a file called jekyll.md
and then add the option --template=jekyll.md
to our Pandoc command above, like so:
pandoc -f html -t markdown --standalone --template=jekyll.md data-mining-the-internet-archive.html
When we do, the start of our Markdown output should now look like this:
---
title: Data Mining the Internet Archive Collection
author: Caleb McDaniel
date: 03-03-2014
reviewers: William J Turkel
---
Lesson Goals
------------
The collections of the [Internet Archive](http://archive.org/) (IA)
include many digitized sources of interest to historians, including
[early JSTOR journal content](https://archive.org/details/jstor_ejc),
[John Adams's personal
library](https://archive.org/details/johnadamsBPL), and the [Haiti
collection](https://archive.org/details/jcbhaiti) at the John Carter
Brown Library. In short, to quote Programming Historian [Ian
Milligan](http://activehistory.ca/2013/09/the-internet-archive-rocks-or-two-million-plus-free-sources-to-explore/),
"The Internet Archive rocks."
In this lesson, you'll learn how to download files from such collections
using a Python module specifically designed for the Internet Archive.
You will also learn how to use another Python module designed for
parsing MARC XML records, a widely used standard for formatting
bibliographic metadata.
For demonstration purposes, this lesson will focus on working with the
digitized version of the [Anti-Slavery
Collection](http://archive.org/details/bplscas) at the Boston Public
Library in Copley Square. We will first download a large collection of
MARC records from this collection, and then use Python to retrieve and
analyze bibliographic information about items in the collection. For
example, by the end of this lesson, you will be able to create a list of
every named place from which a letter in the antislavery collection was
written, which you could then use for a mapping project or some other
kind of analysis.
For Whom Is This Useful?
------------------------
This intermediate lesson is good for users of the Programming Historian
who have completed general lessons on downloading files and performing
text analysis on them, but would like an applied example of these
principles. It will also be of interest to historians or archivists who
work with the MARC format or the Internet Archive on a regular basis.
Before You Begin
----------------
We will be working with two Python modules that are not included in
Python's standard library.
The first,
[internetarchive](https://pypi.python.org/pypi/internetarchive),
provides programmatic access to the Internet Archive. The second,
[pymarc](https://pypi.python.org/pypi/pymarc/), makes it easier to parse
MARC records.
The easiest way to download both is to use pip, the python package
manager. Begin by installing pip using [this Programming Historian
lesson](/lessons/installing-python-modules-pip/).
Then issue these commands at the command line: To install
internetarchive:
``` {.bash}
sudo pip install internetarchive
```
To install pymarc:
``` {.bash}
sudo pip install pymarc
```
Now you are ready to go to work!
Notice that our metadata is now inserted in the output as a Jekyll metadata block. Hooray!
Converting Code Block Syntax
The other problem we identified with our Markdown output—the way fenced code blocks are marked—remains, however. This one was a bit trickier to solve, because Jekyll’s default Markdown parser does not recognize code blocks fenced with backticks at all. But after some experimentation, we discovered that we could configure Jekyll to recognize and highlight code blocks that look like this:
``` bash
sudo pip install pymarc
```
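That configuration lives in Jekyll’s _config.yml file. We won’t walk through our exact settings here, but at the time something along these lines, using the Redcarpet Markdown parser, would do the trick (a sketch, not necessarily what we used):

``` yaml
# _config.yml (sketch): parse Markdown with Redcarpet and switch on its
# support for backtick-fenced code blocks.
markdown: redcarpet
redcarpet:
  extensions: ["fenced_code_blocks"]
```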
Pandoc, as we’ve seen, was wrapping bash
in curly braces and a period, like so: {.bash}
. That’s because by default, Pandoc is taking the class
attribute in this line of our HTML and then putting its value in braces. If that attribute contained more than one class, Pandoc would include each of them, prefaced by a period, inside those curly braces, as described in the documentation under Extension: fenced_code_attributes
.
Fortunately, in this case we have a simpler solution than before, because Pandoc provides a command-line option for turning off this behavior. And once that behavior—or “extension”—is turned off, the documentation tells us what will happen:
If the fenced_code_attributes
extension is disabled, but input contains class attribute(s) for the codeblock, the first class attribute will be printed after the opening fence as a bare word.
In plainer English, that’s exactly what we need in order to produce fenced code blocks that Jekyll can, with some configuration, recognize! So, following the documentation for Pandoc’s general options, we modified our Pandoc command again to disable the fenced_code_attributes
extension, like so:
pandoc -f html -t markdown-fenced_code_attributes --standalone --template=jekyll.md data-mining-the-internet-archive.html
Run that command on the same modified HTML file we’ve been using above, and the new Markdown output should begin like this:
---
title: Data Mining the Internet Archive Collection
author: Caleb McDaniel
date: 03-03-2014
reviewers: William J Turkel
---
Lesson Goals
------------
The collections of the [Internet Archive](http://archive.org/) (IA)
include many digitized sources of interest to historians, including
[early JSTOR journal content](https://archive.org/details/jstor_ejc),
[John Adams’s personal
library](https://archive.org/details/johnadamsBPL), and the [Haiti
collection](https://archive.org/details/jcbhaiti) at the John Carter
Brown Library. In short, to quote Programming Historian [Ian
Milligan](http://activehistory.ca/2013/09/the-internet-archive-rocks-or-two-million-plus-free-sources-to-explore/),
“The Internet Archive rocks.”
In this lesson, you’ll learn how to download files from such collections
using a Python module specifically designed for the Internet Archive.
You will also learn how to use another Python module designed for
parsing MARC XML records, a widely used standard for formatting
bibliographic metadata.
For demonstration purposes, this lesson will focus on working with the
digitized version of the [Anti-Slavery
Collection](http://archive.org/details/bplscas) at the Boston Public
Library in Copley Square. We will first download a large collection of
MARC records from this collection, and then use Python to retrieve and
analyze bibliographic information about items in the collection. For
example, by the end of this lesson, you will be able to create a list of
every named place from which a letter in the antislavery collection was
written, which you could then use for a mapping project or some other
kind of analysis.
For Whom Is This Useful?
------------------------
This intermediate lesson is good for users of the Programming Historian
who have completed general lessons on downloading files and performing
text analysis on them, but would like an applied example of these
principles. It will also be of interest to historians or archivists who
work with the MARC format or the Internet Archive on a regular basis.
Before You Begin
----------------
We will be working with two Python modules that are not included in
Python’s standard library.
The first,
[internetarchive](https://pypi.python.org/pypi/internetarchive),
provides programmatic access to the Internet Archive. The second,
[pymarc](https://pypi.python.org/pypi/pymarc/), makes it easier to parse
MARC records.
The easiest way to download both is to use pip, the python package
manager. Begin by installing pip using [this Programming Historian
lesson](/lessons/installing-python-modules-pip/).
Then issue these commands at the command line: To install
internetarchive:
``` bash
sudo pip install internetarchive
```
To install pymarc:
``` bash
sudo pip install pymarc
```
Now you are ready to go to work!
Notice the difference in the way the two bash
code blocks at the end of that snippet are now formatted. We did it!
Completing the Bulk Conversion from Wordpress
In the above sections, I’ve scratched the surface of what was, even with the help of Pandoc, a big job. In the examples above, we’ve looked at only one brief section of one lesson, and other lessons presented us with new challenges to solve. Making sure that all of our converted Markdown got close to the Markdown that Jekyll required took lots of trial and error of the sort I’ve described. Other Pandoc options had to be enabled or disabled to get our Markdown to look just right, as you can see from the final script we used to process all of our modified HTML files with Pandoc.
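At its heart, that script is just a loop over the folder of modified HTML files, along these lines (a simplified sketch; the folder names here are hypothetical):

``` bash
# Convert every modified HTML file into a Jekyll-ready Markdown file.
for f in modified-html/*.html; do
    pandoc -f html -t markdown-fenced_code_attributes --standalone \
      --template=jekyll.md -o "markdown/$(basename "$f" .html).md" "$f"
done
```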
We even used this Python filter to convert some of the URL paths in our original Markdown to relative paths, which were more suitable for our static website. (Pandoc filters, one of the most powerful ways of extending Pandoc, are described in more detail here. The problem that they solved in this case could also have been solved with further Beautiful Soup modifications to our original HTML. But the advantage of the filter is that it can be used in the future if we want to change relative links again in all our Markdown lessons.)
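To give a flavor of what a filter looks like, here is a minimal sketch in the spirit of ours, written with the pandocfilters Python module and attached to the conversion with Pandoc’s --filter option. (A caveat: this is an illustration, not our actual filter; the base URL is hypothetical, and it assumes the pandoc AST of the time, in which a link carried no attributes.)

``` python
#!/usr/bin/env python
from pandocfilters import toJSONFilter, Link

BASE = 'http://programminghistorian.org'  # hypothetical absolute prefix

def relativize(key, value, format, meta):
    # In this version of the AST, a Link's value is [inlines, [url, title]].
    if key == 'Link':
        inlines, (url, title) = value
        if url.startswith(BASE):
            # Strip the site prefix, leaving a root-relative path.
            return Link(inlines, [url[len(BASE):], title])

if __name__ == '__main__':
    toJSONFilter(relativize)
```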
After all that, our bulk conversion process still did not produce pristine Markdown at a single stroke. Some of the markup in the original Wordpress site was irretrievably lost in the conversion, and other errors remained in the Markdown that our editors had to correct by hand. One lesson we learned from the whole process is that there may be no such thing as a totally automated workflow for converting from a Wordpress site like ours to a Jekyll static site.
Nonetheless, software like Pandoc and techniques like those taught in our lessons on Wget and Beautiful Soup made a difficult job somewhat easier. The results are now available for you to enjoy on our GitHub Pages static site. Happy hacking!
About the author
Caleb McDaniel is an associate professor of history at Rice University.