End User Programming at the Office for National Statistics https://blog.scraperwiki.com/2015/06/end-user-programming-at-the-office-for-national-statistics/ Wed, 10 Jun 2015

The Office for National Statistics (ONS) approached us about a task that involves transforming data in a spreadsheet: basically, unpivoting it.

Data transformation is quite a general problem, but one with recurring patterns. Marginal variables are usually, well, somewhere in the margin. Cells generally refer to an observation or the name or value of a marginal variable. But there is enough variation that we cannot hope to capture all the possibilities in a GUI tool. Enter the formal language.
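
To make "unpivoting" concrete, here is a minimal sketch in Python, using pandas purely for illustration (the library choice is ours, not ONS's):

    import pandas as pd

    # A typical "wide" statistical table: one row per region,
    # one column per year, observations in the cells.
    wide = pd.DataFrame({
        "Region": ["North", "South"],
        "2013": [102, 98],
        "2014": [105, 101],
    })

    # Unpivoting turns each (region, year) pair into its own row,
    # so every observation carries its marginal variables with it.
    long = wide.melt(id_vars="Region", var_name="Year", value_name="Value")
    print(long)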

[Image: ONS and ScraperWiki working on DataBaker]

DataBaker, introduced in Dragon’s earlier article, is essentially a formal language for describing particular ways of transforming data. More precisely, it is a dialect of Python: it is Python, but specialised for describing spatial relationships within spreadsheets (SQLAlchemy and numpy are more famous examples of libraries that can be considered dialects of Python).
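
To give a flavour of what "Python specialised for spreadsheet geometry" can look like, here is a self-contained toy whose vocabulary we have invented for illustration (it is not DataBaker's actual API):

    # A spreadsheet modelled as a grid of (column, row) -> cell value.
    grid = {
        ("A", 1): "Region", ("B", 1): "2013", ("C", 1): "2014",
        ("A", 2): "North",  ("B", 2): 102,    ("C", 2): 105,
        ("A", 3): "South",  ("B", 3): 98,     ("C", 3): 101,
    }

    def header_above(col):
        """The marginal variable at the top of an observation's column."""
        return grid[(col, 1)]

    def label_left(row):
        """The marginal variable at the left of an observation's row."""
        return grid[("A", row)]

    # Every numeric cell is an observation; describe it purely by its
    # spatial relationships to the margins.
    for (col, row), value in grid.items():
        if isinstance(value, (int, float)):
            print(label_left(row), header_above(col), value)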

It might seem unusual to invent a formal language for this task, but we have read Nardi’s “A Small Matter of Programming” and are encouraged by quotes such as “ordinary people unproblematically learn and use formal languages and notations in everyday life”.

Earlier this week I interviewed Darren, lead of the ONS team that approached ScraperWiki. The team had essentially no previous programming experience, and are now successfully using DataBaker in their work. They are not professional programmers using a general-purpose programming language; they are domain specialists using an end user programming language.

We chose Python because of its clarity and its proven ability to be learned quickly by relative newcomers (for example, Python is a cornerstone of Software Carpentry’s bootcamps, which help scientists learn to code). Darren’s team have no interest in learning Python per se, only in using DataBaker to do their job. It’s testimony to our success that they never have to think “I’m programming in Python”.

We are sneaking programming in through the back door, and this works because staff at ONS are already familiar with the domain of spreadsheets, which makes it easier for them to understand the core concepts behind DataBaker. As Nardi says, “people are likely to be better at learning and using computer languages that closely match their interests and their domain knowledge”.

Another part of the success of this project was that the ONS team had what Nardi refers to as a local developer. These are “domain experts who happen to have an intrinsic interest in computers and have more advanced knowledge of a particular program” (Nardi, again). Their local developer is the team’s go-to person for programming problems, and writes scripts, helps curate knowledge, and trains the team peer-to-peer.

[Image: cover of “A Small Matter of Programming”]

A programming language provides the ultimate flexibility, but it should only be offered as a solution with care, whilst being attentive to the situation and expertise of the end users. The task for which Darren’s team uses DataBaker has no alternative solution: without DataBaker, the task simply wouldn’t be done. End User Programming for the win!

Footnote

The Nardi quotes are from Bonnie Nardi’s most excellent and sadly little-known book, “A Small Matter of Programming”.

Programmers past, present and future https://blog.scraperwiki.com/2013/06/programmers-past-present-future/ Tue, 04 Jun 2013

As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who came to code and data only tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

[Image: Zappia’s hierarchy of coder self-actualisation]

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through Javascript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs and regular expressions soon followed. Then came a second language, Python, and a third, Shell. I don’t know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that I soon will.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back at the beginning. Progressing from a point where websites are just posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what is handily labelled on the diagram above as “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. Javascript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.
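
A side-by-side sketch in Python makes the contrast concrete: markup can only state what things are, while a language with loops and conditions can decide and repeat.

    # Markup declares structure; it cannot compute anything.
    html = "<h1>A heading</h1><p>A paragraph</p>"

    # A programming language can act: loop over data and branch on conditions.
    for n in range(1, 6):
        if n % 2 == 0:
            print(n, "is even")
        else:
            print(n, "is odd")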

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t), but the truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.
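
The same realisation translates into Python terms: a function is just a value bound to a name, however cryptic the name. (A minimal sketch; fade_in is an invented stand-in for jQuery’s effect.)

    # jQuery's $ is nothing magic: it's a variable holding a function.
    # Python makes the same point explicit: functions are ordinary values.
    def fade_in(selector):
        print("fading in all <" + selector + "> elements")

    f = fade_in          # bind the same function to a terser name
    f("span")            # identical behaviour under the new name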

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, Javascript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.

Lots of new libraries https://blog.scraperwiki.com/2011/10/lots-of-new-libraries/ Wed, 26 Oct 2011

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know: yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install them.

Also, let us know why you want them: it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer, if you ever see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files, made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner that is building a newsroom data hub.
  • requests, a lovely new layer over Python’s HTTP libraries, made by Kenneth Reitz. It makes it easier to GET and POST (see the sketch after this list).
  • Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, for accessing HTML Tidy, the classic C library that cleans up HTML files.
  • SciPy is at the analysis end; it builds on NumPy, providing code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib, which can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata) for calling various Google APIs to get data.
  • Twill is its own browser automation scripting language and a layer on top of Mechanize.
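
For instance, here is roughly what requests makes easy (a minimal sketch; the URLs are placeholders):

    import requests

    # GET a page; requests wraps Python's HTTP libraries in a friendlier API.
    response = requests.get("https://example.com/data.csv")
    response.raise_for_status()                # error out on 4xx/5xx responses
    print(response.status_code)                # e.g. 200
    print(response.headers.get("Content-Type"))
    print(response.text[:200])                 # first 200 characters of the body

    # POSTing form data is just as brief.
    requests.post("https://example.com/submit", data={"q": "scraping"})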

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper around curl with an easier syntax that lets you make parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!