ruby – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Programmers past, present and future
Tue, 04 Jun 2013 09:20:38 +0000

As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only came to code and data tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

Zappia's hierarchy of coder self-actualisation

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through Javascript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs, and regular expressions soon followed. Then a second language, Python, and a third, Shell. I don’t know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that I will sometime soon.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back in the beginning. Progressing from a point where websites are just like posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling, is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what has handily been labelled on the above diagram, “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. Javascript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.
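
As a small illustration of that difference (a sketch that is not from the original sketch on the whiteboard; the function name `render_page` and the sample data are made up), here is a Python function that uses a loop and a condition to generate the kind of markup that HTML itself can only state:

```python
from html import escape


def render_page(title, paragraphs):
    """Build a tiny HTML fragment: one heading, then one <p> per non-blank paragraph."""
    lines = ["<h1>" + escape(title) + "</h1>"]
    for text in paragraphs:      # a loop
        if text.strip():         # a condition: skip blank paragraphs
            lines.append("<p>" + escape(text) + "</p>")
    return "\n".join(lines)


print(render_page("Cat videos", ["First paragraph.", "   ", "Tom & Jerry"]))
```

The markup that comes out is still just a description of a page; the decisions about what goes into it were made by the code.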

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t) but truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, Javascript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.

Happy New Year and Happy New York!
Tue, 03 Jan 2012 20:32:42 +0000

We are really pleased to announce that we will be hosting our very first two-day US Journalism Data Camp, in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation, on February 3rd and 4th 2012.

We have been working with Emily Bell @emilybell, Director of the Tow Center, and Susan McGregor @SusanEMcG, Assistant Professor at the Columbia J School, to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects, liberating New York data, and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or information technology, coders, statisticians and public data boffs to dig up the data!

Please pick a stream and sign up to help us make New York a data-driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.

Zarino has created a map of our US events, which we will update as we add locations.

Job advert: Lead programmer
Thu, 27 Oct 2011 11:04:47 +0000

Oil wells, marathon results, planning applications…

ScraperWiki is a Silicon Valley style startup, in the North West of England, in Liverpool. We’re changing the world of open data, and how data science is done together on the Internet.

We’re looking for a programmer who’d like to:

  • Revolutionise the tools for sharing data, and code that works with data, on the Internet.
  • Take a lead in a lean startup, having good hunches on how to improve things, but not minding when A/B testing means axing weeks of code.

In terms of skills:

  • Be polyglot enough to be able to learn Python, and do other languages (Ruby, Javascript…) where necessary.
  • Be able to find one end of a clean web API and a test suite from another.
  • We’re a small team, so you’ll need to be able to do some DevOps on Linux servers.
  • Desirable – able to make igloos.

About ScraperWiki:

  • We’ve got funding (Knight News Challenge winners) and are in the brand new field of “data hubs”.
  • We’re before product/market fit, so it’ll be exciting, and you can end up a key, senior person in a growing company.

Some practical things:

We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.

Must be either willing to relocate to Liverpool, or able to work from home and travel to our office here regularly (once a week). So somewhere nearby is preferred.

To apply – send the following:

  • A link to a previous project that you’ve worked on that you’re proud of, or a description of it if it isn’t publicly visible.
  • A link to a scraper or view you’ve made on ScraperWiki, involving a dataset that you find interesting for some reason.
  • Any questions you have about the job.

Send them by email with the word swjob2 in the subject (and yes, that means no agencies, unless the candidates do that themselves).

Pool temperatures, company registrations, dairy prices…

Lots of new libraries
Wed, 26 Oct 2011 11:53:50 +0000

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install.

Also, let us know why you want them as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer if ever you see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner who are making a newsroom data hub.
  • requests, a lovely new layer over Python’s HTTP libraries made by Kenneth Reitz. Makes it easier to get and post.
  • Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, to access the old classic C library that cleans up HTML files.
  • SciPy is at the analysis end, and builds on NumPy giving code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata) for calling various Google APIs to get data.
  • Twill is its own browser automation scripting language and a layer on top of Mechanize.
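
To give a flavour of the reverse-template idea described for Scrapemark above, here is a minimal sketch using only Python's standard `re` module. It does not use Scrapemark's actual API or template syntax; the `{name}` placeholder style, the `extract` function and the sample HTML are all assumptions made for illustration.

```python
import re


def extract(template, html):
    """Turn a template with {name} placeholders into a regex and pull the values out of html."""
    pattern = ""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", template)
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)            # literal text must match exactly
        else:
            pattern += "(?P<" + part + ">.*?)"    # a placeholder captures a value
    return [match.groupdict() for match in re.finditer(pattern, html, re.DOTALL)]


html = "<li><b>Ruby</b> 1995</li><li><b>Python</b> 1991</li>"
rows = extract("<li><b>{name}</b> {year}</li>", html)
print(rows)
```

Each dictionary in `rows` is one match of the template, keyed by placeholder name. A real reverse-template library copes with messy, inconsistent HTML far better than a regular expression like this one can.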

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper round curl with an easier syntax, and that lets you do parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!
Ruby screen scraping tutorials
Fri, 28 Jan 2011 16:44:24 +0000

Mark Chapman has been busy translating our Python web scraping tutorials into Ruby.

They now cover three tutorials on how to write basic screen scrapers, plus extra ones on using .ASPX pages, Excel files and CSV files.

We’ve also installed some extra Ruby modules – spreadsheet and FasterCSV – to make them possible.

These Ruby scraping tutorials are made using ScraperWiki, so you can of course do them from your browser without installing anything.

Thanks Mark!

ScraperWiki adds Ruby as its third language
Tue, 07 Sep 2010 17:13:30 +0000

We’re very pleased to announce that the third official ScraperWiki language is Ruby!

Ruby was a much-requested enhancement to ScraperWiki and complements the two existing languages, Python and PHP. We hope that its inclusion will encourage a new community of developers to start writing screen scrapers.

For more information about ScraperWiki see our FAQ.

The Ruby logo is copyright © 2006, Yukihiro Matsumoto. It is released under the terms of the Creative Commons Attribution-ShareAlike 2.5 License.