PHP – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web
Last updated Tue, 09 Aug 2016 06:10:13 +0000

Programmers past, present and future
https://blog.scraperwiki.com/2013/06/programmers-past-present-future/
Tue, 04 Jun 2013 09:20:38 +0000

As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only came to code and data tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

[Figure: Zappia's hierarchy of coder self-actualisation]

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through JavaScript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs, and regular expressions soon followed. Then a second language, Python, and a third, Shell. I don’t know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that I soon will.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back in the beginning. Progressing from a point where websites are just like posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling, is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what has handily been labelled on the above diagram, “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. JavaScript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t) but truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.
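
To make the “just a function” point concrete, here is a rough Python analogue. Python doesn't allow `$` in names, so the sketch uses `S` instead; `Selection`, `select` and `fade_in` are made-up names for illustration, not part of jQuery or any real library.

```python
class Selection:
    """Hypothetical stand-in for the object a jQuery-style call returns."""

    def __init__(self, selector):
        self.selector = selector
        self.visible = False

    def fade_in(self):
        # Pretend to fade the matched elements in, then return the
        # selection itself so further method calls can be chained.
        self.visible = True
        return self


def select(selector):
    """An ordinary function: takes a selector string, returns a Selection."""
    return Selection(selector)


# The short name carries no magic; it is simply another name for `select`.
S = select

result = S("span").fade_in()
print(result.selector, result.visible)
```

Once the short name is seen as an ordinary function, the rest of the line reads as a plain method call on whatever that function returned.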

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, JavaScript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.

Happy New Year and Happy New York!
https://blog.scraperwiki.com/2012/01/happy-new-year-and-happy-new-york/
Tue, 03 Jan 2012 20:32:42 +0000

We are really pleased to announce that we will be hosting our very first US event, a two-day Journalism Data Camp, on February 3rd and 4th 2012. It will be run in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation.

We have been working with Emily Bell (@emilybell), Director of the Tow Center, and Susan McGregor (@SusanEMcG), Assistant Professor at the Columbia J School, to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects; liberating New York data; and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or information technology, coders, statisticians and public data boffs to dig up the data!

Please pick a stream and sign up to help us make New York a data-driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.

Zarino has created a map of our US events, which we will update as we add locations: https://scraperwiki.com/events/

Lots of new libraries
https://blog.scraperwiki.com/2011/10/lots-of-new-libraries/
Wed, 26 Oct 2011 11:53:50 +0000

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install them.

Also, let us know why you want them as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer if ever you see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files, made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner that is building a newsroom data hub.
  • requests, a lovely new layer over Python’s HTTP libraries, made by Kenneth Reitz. It makes it easier to get and post.
  • Scrapemark, a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py, requested by Tony Hirst, which can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, to access the old classic C library that cleans up HTML files.
  • SciPy, at the analysis end, which builds on NumPy to give code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib, which can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata), for calling various Google APIs to get data.
  • Twill, its own browser automation scripting language and a layer on top of Mechanize.
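
If “reverse templates” sounds abstract, the idea can be sketched in a few lines of standard-library Python. This is not Scrapemark's API: the `{{ name }}` placeholder syntax and the `template_to_regex` and `extract` helpers are assumptions made up for this illustration. A normal template takes values and produces HTML; a reverse template takes HTML and gives the values back.

```python
import re


def template_to_regex(template):
    """Turn a template with {{ name }} placeholders into a compiled regex.

    The {{ name }} syntax is an assumption made for this sketch only.
    """
    pattern = ""
    position = 0
    for match in re.finditer(r"\{\{\s*(\w+)\s*\}\}", template):
        # Literal text between placeholders must match exactly.
        pattern += re.escape(template[position:match.start()])
        # Each placeholder becomes a non-greedy named capture group.
        pattern += "(?P<{}>.*?)".format(match.group(1))
        position = match.end()
    pattern += re.escape(template[position:])
    return re.compile(pattern, re.DOTALL)


def extract(template, html):
    """Return one dict of captured values per place the template matches."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in regex.finditer(html)]


html = '<ul><li><a href="/a">Alpha</a></li><li><a href="/b">Beta</a></li></ul>'
template = '<li><a href="{{ url }}">{{ title }}</a></li>'

rows = extract(template, html)
print(rows)
```

A real library has to cope with messy whitespace, optional elements and nested repeats, which a plain regular expression like this one does not attempt.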

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper round curl with an easier syntax, and that lets you do parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!