Start Talking to Your Data – Literally!
https://blog.scraperwiki.com/2011/09/start-talking-to-your-data-literally/
Fri, 23 Sep 2011

Because ScraperWiki has a SQL database and an API with SQL extraction, I can SQL inject (haha!) straight into the API URL and use the JSON output.
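For instance, a request along these lines (a sketch of the URL format used by the code further down; the SQL here is illustrative) returns the query result as JSON:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=special_advisers_gifts_and_hospitality&query=SELECT+*+FROM+swdata+LIMIT+5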

So what does all that mean? I scraped the CSV files of Special Advisers’ meetings, gifts and hospitality at Number 10. The data stays up to date as new files are published, because I can schedule the scraper to run, and if it fails to run I get notified via email.

Now, I’ve written a script that publishes this information, along with data from four other scrapers relating to Number 10 Downing Street, to a Twitter account, @Scrape_No10. Because I’ve made a Twitter bot, I can tweet out a sentence and control the order and timing of tweets. I can even attach a hashtag, which I can then rescrape to find what the social media sphere has attached to each data point. This has the potential to make the data go fishing for you as a journalist, but it is not immediately useful to the newsroom.
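As a rough sketch of the tweeting step (this assumes the tweepy library; the credentials, sentence and hashtag are placeholders, not the real script):

import tweepy

# Placeholder app credentials - substitute real keys
CONSUMER_KEY = 'xxx'
CONSUMER_SECRET = 'xxx'
ACCESS_TOKEN = 'xxx'
ACCESS_SECRET = 'xxx'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)

# One sentence per data point, with a hashtag that can be rescraped later
status = ('On 2011-07-14 a special adviser got dinner from an organisation '
          '#Scrape_No10')
api.update_status(status)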

So I give you MoJoNewsBot! I have written a script as a module in an IRC chat bot. It takes what I write in the chat room, injects it into a SQL query against my data via the ScraperWiki API, and extracts the answer from the resulting JSON, giving me a written reply back in the chat room. For example:
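Something like this (the .advisers command name and the names below are made up for illustration; the reply format is what the code further down produces):

<journo> .advisers Example Adviser
<MoJoNewsBot> On 2011-05-12 Example Adviser got Lunch from Example Company Ltd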

Now I can write the commands in a private chat window with MoJoNewsBot or I can do it in the room. This means that rooms can be made for the political team in a newsroom, or the environment team, or the education team, and they can have their own bots with modules specific to their data streams. That way, computer-assisted reporting can be collaborative and social. If you’re working on a story that has a political and an educational angle, you pop into both rooms, so both teams can see what you’re asking of the data. In that sense, you’ve got a social, data-driven, virtual newsroom. With that in mind, I’ve added other modules for the modern journalist.

With MoJoNewsBot you can look for Twitter trends, search tweets, look up a user’s last tweets, get the latest headlines from various news sources and check Google News. The bot also has basic functions like Google search, Wolfram Alpha lookup, Wikipedia lookup, reminder setting and even a weather checker.

Here’s an example of the code needed to query the API and return a string from the JSON:

import json
import urllib
import urllib2

# userinput, Number and phenny are supplied by the bot's command handler
data_format = 'jsondict'
scraper = 'special_advisers_gifts_and_hospitality'
site = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite?'

# Drop the user's input from the chat room straight into the SQL
query = ('SELECT `Name of Special Adviser`, `Type of hospitality received`, '
         '`Name of Organisation`, `Date of Hospitality` '
         'FROM swdata WHERE `Name of Special Adviser` = "%s" '
         'ORDER BY `Date of Hospitality` DESC' % userinput)

params = {'format': data_format, 'name': scraper, 'query': query}

# URL-encode the parameters and fetch the result from the ScraperWiki API
url = site + urllib.urlencode(params)
swjson = json.loads(urllib2.urlopen(url).read())

# Say the most recent entries into the channel; Number is how many to show
for entry in swjson[:Number]:
    ans = ('On ' + entry["Date of Hospitality"] + ' ' + userinput +
           ' got ' + entry["Type of hospitality received"] +
           ' from ' + entry["Name of Organisation"])
    phenny.say(ans)
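For context, that snippet lives inside a phenny command handler, which is what supplies userinput, Number and phenny above. A minimal sketch of the wiring (the .advisers command name is my placeholder):

def advisers(phenny, input):
    userinput = input.group(2)  # whatever follows ".advisers" in the channel
    Number = 5                  # how many recent entries to report
    # ... build the query, call the API and phenny.say() as above ...
advisers.commands = ['advisers']
advisers.example = '.advisers Example Adviser'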

This is just a prototype and a proof of concept. I would add to the module so the query could cover a specific date range (see the sketch below). After that, I could go back to ScraperWiki and write a scraper that pulls in the other four Number 10 scrapers and constructs a larger database. Then all I need to do is change the name of the scraper in my module to this new one, and I can query the much larger dataset that includes ministers and permanent secretaries!
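A rough sketch of that date-range query (same table and columns as above; start_date and end_date are hypothetical inputs, and BETWEEN assumes the dates are stored in a sortable format):

query = ('SELECT `Name of Special Adviser`, `Type of hospitality received`, '
         '`Name of Organisation`, `Date of Hospitality` '
         'FROM swdata WHERE `Name of Special Adviser` = "%s" '
         'AND `Date of Hospitality` BETWEEN "%s" AND "%s" '
         'ORDER BY `Date of Hospitality` DESC'
         % (userinput, start_date, end_date))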

Now that’s computer-assisted reporting!

PS: I’ve fixed the bug in .gn so the links match the headlines.
