‘Documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing’
(25 May 2011)

You may have noticed that the design of the ScraperWiki site has changed substantially.

As part of that, we made a few improvements to the documentation. Lots of you told us we had to make our documentation easier to find, more reliable and complete.

We’ve reorganised it all under one contents page, linked as Documentation throughout the site, including from within the code editor. All the documentation is listed there. (The layout is shamelessly inspired by Django’s.)

Of course, everyone likes different kinds of documentation – talk to a teacher and they’ll tell you all about different learning styles. Here’s what we have on offer, all available in Ruby, Python and PHP (thanks Tom and Ross!).

  • New-style tutorials – very directed recipes that show you exactly how to make something specific in under 30 minutes. More on these in a future blog post.
  • Live tutorials – these are what we now call the ScraperWiki special sauce. Self-contained chunks of code with commentary that you fork, edit and run entirely in your browser. (thanks Anna and Mark!)
  • Copy and paste guides – a new type of library reference, which gives you code snippets you can copy into your scraper with one click. (thanks Julian!)
  • Interactive API documentation – for how to get data out of ScraperWiki. More on that in a later blog post. (thanks Zarino!)
  • Reference documentation – we’ve gone through it to make sure it covers exactly what we support.
  • Links for further help – an FAQ and our Google Group. For more gnarly questions, ask on the Stack Overflow scraperwiki tag.

We’ve got more stuff in the works – screencasts and copy & paste guides to specific view/scraper libraries (lxml, Nokogiri, Google Maps…). Let us know what you want.

Finally, none of the above is what really matters about this change.

The most important thing is our new Documentation Policy (thanks Ross): our promise to keep documentation up to date, and equally available for all the languages we support.

Normally on websites it is much more important to have a user interface that doesn’t need documentation. Of course, you need documentation for when people get stuck, and it has to be good quality. But you really do want to get rid of as much of it as you can.

But programming is fundamentally about language. Coders need some documentation, even if it is just the quickest answer they can get by Googling an error message.

We try hard to make it so that as little documentation as possible is needed, but what’s left isn’t an add-on. It is a core part of ScraperWiki.

(The quote in the title of this blog post is attributed to Dick Brandon on lots of quotation sites on the Internet, but none very reliably.)

ScraperWiki Datastore – The SQL
(6 April 2011)

Recently at ScraperWiki we replaced the old datastore, which was creaking under the load, with a new, lighter and faster solution: all your data is now stored in SQLite tables, as part of the move towards pluggable datastores. As well as the increase in performance, SQLite brings other benefits, such as letting us transparently modify the schema, and letting you access your data using SQL via the ScraperWiki API or via the SQLite View. If you don’t know SQL, or just need to refresh your memory of the syntax, there is a great SQL tutorial at w3schools.com which might get you started.
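The queries involved can start very simply indeed. As a minimal sketch, this returns every row and every column a scraper has stored (using the default swdata table that scrapers write to, which also appears later in this post):

SELECT * FROM swdata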

To get your data out of ScraperWiki you can try the SQLite View, which makes it easy to pick the fields you want to query as well as to run powerful queries on the data. To show how, we’ll use the scraper created by Nicola in her recent post Special Treatment for Special Advisers in No. 10, which you can access on ScraperWiki and, from there, create a new view. If you choose General SQLite View, you’ll get a nice easy interface to query and study the data. This dataset comes from the Cabinet Office (UK central government) and logs gifts given to advisers to the top ministers – all retrieved by Nicola after only three weeks of knowing how to program.

If you’re more confident with your SQL, you can get a more direct interface by clicking the ‘Explore with ScraperWiki API’ link on the overview page of any scraper. This also gives you a link that you can use elsewhere to get direct access to your data in JSON or CSV format. For those who are still learning SQL, or not quite as confident as they’d like to be, the SQLite View is a good place to start. When you first open the SQLite View you’ll see something similar to the following, but without the data filled in yet.

As you can see, the view gives you a description of the fields in the SQLite table (highlighted in yellow) and a set of fields where you can enter the information you require. If you are feeling particularly lazy you can simply click on the highlighted column names and they will be added to the SELECT field for you! (Accessing data across scrapers is done slightly differently, and is hopefully the subject of a future post.) By default this view displays the output data as a table, but you can change it to do what you wish by editing the HTML and JavaScript underneath – it is pretty straightforward. Once you have added the fields you want (making sure to surround any field names that contain spaces with backticks), clicking the query button makes a request to the ScraperWiki API and displays the results on your page. It also shows you the full query, so that you can copy it and save it away for future use.

Now that you have an interface where you can modify your SQL, you can access your data almost any way you want. You can do simple queries by leaving the SELECT field set to *, which returns all of the columns, or you can specify individual columns and the order in which they are retrieved. You can even set a column’s title using the AS keyword: setting the SELECT field to `Name of Organisation` AS Organisation shows that field under the new, shorter column name.
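Put together, the full query the view builds for you looks something like this (a minimal sketch against the Special Advisers scraper’s swdata table):

SELECT `Name of Organisation` AS Organisation
FROM swdata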

Aside from ordering your results (put a field name in ORDER BY, followed by desc if you want descending order), limiting your results (put the number of records you want in LIMIT) and the aforementioned renaming of columns, one thing SQLite will let you do is group your results, to show information that isn’t immediately visible in the full result set. Using the Special Advisers scraper again, grouping the data on `Name of Organisation` and using the count function in the SELECT field shows the total number of gifts given by each organisation – surely a lot faster than counting how many times each organisation appears in the full output!
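Here is a sketch of that grouped query (the `Number of gifts` alias is our own, added for readability):

-- one row per organisation, counting how many gifts it gave
SELECT `Name of Organisation`,
    count(*) AS `Number of gifts`
FROM swdata
GROUP BY `Name of Organisation`
ORDER BY `Number of gifts` desc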

In addition to using the count function in SELECT, you can also use sum, or even avg to obtain an average of numerical values. Not only can you add these individual functions to your SELECT field, you can combine them into much more complicated queries to get a better overall view of the data, as in the Arts Council Cuts scraper. Here you can see the total revenue per year and the average percent change by artform, and draw your own conclusions on where the cuts are, or are not, happening:

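-- note: the first column really is named `Artform ` with a trailing space,
-- which is why it is quoted with backticks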
SELECT `Artform `,
    sum(`Total Revenue 10-11`) as `Total Revenue for this year`,
    sum(`11-12`) as `Total Revenue for 2011-2012`,
    sum(`12-13`) as `Total Revenue for 2012-2013`,
    sum(`13-14`) as `Total Revenue for 2013-2014`,
    (avg(`Real percent change -Oct inflation estimates-`)*100) 
    as `Average % change over 4 years (Oct inflation estimates)`
FROM swdata
GROUP BY `Artform `
ORDER BY `Total Revenue for this year` desc
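Notice that the ORDER BY clause reuses the `Total Revenue for this year` alias defined in the SELECT, rather than repeating the sum() expression – SQLite lets you refer to output columns by their alias here.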

If there is anything you’d like to see added to any of these features, let us know either in the comments or via the website.
