Case study: Enrique Cocero getting political data from PDFs

Political strategy is international now.

Enrique Cocero works from Madrid for his consultancy 7-50 Electoral Math, using data to understand voters and candidates in election campaigns across the world.

He’s struggled with PDFs for a long time, and recently found PDF Tables via a Google search. He says:

“I used to have nightmares – I’m sleeping better now!”

Enrique got into political analysis while living in Boston in the US, volunteering on the Warren vs. Brown Senate election, which was the most expensive contest in Senate history.

Unfortunately, particularly in the US, lots of the raw data about politics comes on digital paper. For example, this PDF has 25 pages of detailed Missouri primary election results.

Missouri Primary results

Enrique uses the data to build models in the statistical software R, which needs structured tables as input. So he first uses PDF Tables to convert the PDF files into Excel. The output looks like this:

Missouri Primary in spreadsheet
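
To illustrate why a structured table matters for the modelling step, here is a minimal sketch in Python (not Enrique's actual R workflow). It assumes the converted spreadsheet has been saved as CSV with three hypothetical columns, `county`, `candidate` and `votes`; the rows below are made up for illustration and are not taken from the Missouri results.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: made-up rows standing in for a converted results table.
SAMPLE_CSV = """county,candidate,votes
County A,Candidate X,"1,200"
County A,Candidate Y,950
County B,Candidate X,400
County B,Candidate Y,610
"""

def load_results(text):
    """Parse CSV text into a list of dicts with integer vote counts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Vote counts may carry thousands separators, so strip them first.
        votes = int(row["votes"].replace(",", "").strip())
        rows.append({
            "county": row["county"].strip(),
            "candidate": row["candidate"].strip(),
            "votes": votes,
        })
    return rows

def totals_by_candidate(rows):
    """Sum votes per candidate across all counties."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["candidate"]] += r["votes"]
    return dict(totals)

results = load_results(SAMPLE_CSV)
totals = totals_by_candidate(results)
print(totals)
```

Once every cell sits in the right column, a few lines like these are enough to start aggregating. If cells are merged or misplaced, the integer conversion or the grouping is likely to fail or give wrong totals.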

Enrique tried various other conversion tools, such as Adobe’s, but found the quality wasn’t high enough. In particular, cells were merged between columns, and data was misplaced. Cleaning up the output took too much time, and often even that didn’t work.

Enrique’s models calculate, precinct by precinct, which voters to target. There are some people who will vote for you whatever happens, and others who never will. Where are the voters in between, along the chaotic edge? You need to learn as much as you can about the motives of those voters.
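
As a rough illustration of the precinct-by-precinct idea, the sketch below flags precincts where a two-way contest was close. This is not Enrique's model, which the case study does not describe in detail. The function name `competitive_precincts`, the 5% threshold and the sample figures are all assumptions made for the example.

```python
def competitive_precincts(results, max_margin=0.05):
    """Return names of precincts whose two-way vote margin is within max_margin.

    `results` maps a precinct name to a (our_votes, their_votes) tuple.
    """
    close = []
    for name, (ours, theirs) in results.items():
        total = ours + theirs
        if total == 0:
            continue  # no votes recorded, so no margin to compute
        margin = abs(ours - theirs) / total
        if margin <= max_margin:
            close.append(name)
    return sorted(close)

# Hypothetical sample data, made up for illustration.
sample = {
    "Precinct 1": (520, 480),
    "Precinct 2": (700, 300),
    "Precinct 3": (455, 545),
    "Precinct 4": (0, 0),
}

close = competitive_precincts(sample)
print(close)
```

A real targeting model would draw on far more than a single margin, such as turnout history and demographics, but the per-precinct loop over structured results is the common starting point.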

Because there is lots of open data, and a culture of data analysis in campaigns, the US is a very active market for 7-50 Electoral Math. Being based in Spain, Enrique works a lot there too. There are fewer PDFs in Spain and more traditional web scraping, such as these Catalonia Parliamentary election results.

Catalunya Parliament results

Enrique often has to go to separate sites for each region, and then into separate pages for each year.

The system in Spain is very secretive. There’s not as much detailed data available, so instead Enrique has to reach conclusions by approximations, and make projections from the data there is.

Israel, in contrast, is a “paradise for elections” according to Enrique. With 120 seats in the Knesset, politicians have constantly shifting alliances. They jump jobs a lot, making it a fun place to analyse.

PDF Tables has lots of customers getting political data from PDFs. One day, the world will work out a popular data interchange method. For now we’re glad Enrique is at least sleeping a bit better!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!