The united lobbyists of pdf

Thanks to a nice bit of newsmaking, political lobbying is in the headlines.

To keep at bay the potential scandals that could threaten this lucrative and important business, the lobbyists have established a self-regulatory body, the Association of Professional Political Consultants (APPC).

But you can’t have a regulatory body without a register, so they publish a quarterly APPC register as a PDF, which means you can’t use it as a database.

Introducing pdftoxml

pdftoxml is the pdftohtml program run with the -xml setting, so each line of text in the PDF tends to show up as an element like:

<text top="150" left="74" width="36" height="9" font="3">some text</text>

Some code for parsing the company titles (which have font="0") out of that register can be found at appc-register-of-entries/code
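As a rough illustration of the idea (not the scraper's actual code), here is a minimal Python sketch that picks out the font="0" elements from pdftohtml -xml style output. The sample XML, the extract_titles function and the company names in it are all made up for the example; only the <text> element shape comes from the line shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape shown above; these are NOT real
# entries from the APPC register.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1262" width="892">
<text top="100" left="74" width="120" height="14" font="0">Example Consultancy Ltd</text>
<text top="150" left="74" width="36" height="9" font="3">some text</text>
<text top="300" left="74" width="140" height="14" font="0"><b>Another</b> Firm LLP</text>
</page>
</pdf2xml>"""


def extract_titles(xml_source, title_font="0"):
    """Return the text of every <text> element whose font attribute matches.

    Assumes, as in the register described here, that titles are the
    elements with font="0"; other PDFs will number their fonts differently.
    """
    root = ET.fromstring(xml_source)
    titles = []
    for element in root.iter("text"):
        if element.get("font") == title_font:
            # itertext() also collects text nested inside <b> or <i> tags.
            text = "".join(element.itertext()).strip()
            if text:
                titles.append(text)
    return titles


if __name__ == "__main__":
    print(", ".join(extract_titles(SAMPLE_XML)))
```

The font number is the only thing that marks a line as a title here, so it is worth checking by eye which font id the headings use before relying on it for a different document.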

This gives the list: 

Advocate, Atherton Associates, APCO WORLDWIDE, B2L Public Affairs, BayMor Solutions, Bellenden Public Affairs, Blue Rubicon, Burson Marsteller, Butler Kelly Ltd, Cavendish Communications, Chambré PA LLP, Champollion Communications Consultancy, Cherton Enterprise, Cicero Consulting, Citigate Dewe Rogerson Public Policy, Cogitamus, College Public Policy, Communiqué, Connect Communications, DJH Associates, Edelman, EPPA UK Ltd, EUK Consulting Ltd., Euro RSCG Apex Communications, Fishburn Hedges, Fleishman-Hillard, Foresight Consulting, Four Communications Plc, Freshwater Public Affairs, (formerly Waterfront Public Affairs), Gardant Communications, Grayling Political Strategy, Greenhaus Communications, Green Issues, Hanover Communications International Ltd, Heathcroft Communications, Helen Johnson Consulting Limited, Hill & Knowlton, Illiam Costain McCade, INSIGHT PUBLIC AFFAIRS, JDS Associates, JMC Partners LLP, Lansons Public Affairs, Lexington Communications, Mandate Communications, Munro & Forster, NEW CONSENSUS COMMUNICATIONS LIMITED, Open Road, PLMR Ltd, Political Developments Limited, POLITICAL INTELLIGENCE, PoliticsDirect, Politics International, Portland Communications, Positif Politics Ltd, PPS Group, Rosemary Grogan, Stratagem (NI) Ltd., Sovereign Strategy, Tetra Strategy, Weber Shandwick Public Affairs, Whitehouse Consultancy

If only I had more time to work on this scraper…   

One Response to “The united lobbyists of pdf”

  1. yorksranter August 8, 2010 at 3:03 pm #

    The docs could do with attention – it’s not at all obvious what pdftoxml will spit out or on what basis (for example – is everything always text? text first? text with other stuff interspersed?)
