ScraperWiki and Enemy Images
https://blog.scraperwiki.com/2014/04/scraperwiki-and-enemy-images/
Mon, 14 Apr 2014

This is a guest blog post by Elizaveta Gaufman, a researcher and lecturer at the University of Tübingen.

The theme of my PhD dissertation is enemy images in contemporary Russia.

However, I am not only interested in governmental rhetoric, which is relatively easy to analyse, but also in the way enemy images circulate at the popular level. As I don't have an IT background, I was constantly looking for easy-to-use tools that would help me analyse data from social networks, which for me represent a sort of petri dish of social experimentation. Before I came across ScraperWiki I had to use a bunch of separate tools: one to scrape the data, another to visualise it, and a third to create word clouds of the most frequently used terms. A separate problem was scraping visuals, which my previous methodology did not allow at all.

In my dissertation I conceptualize enemy images as ‘an ensemble of negative conceptions that describes a particular group as threatening to the referent object’. As an enemy image is not easy to operationalize, or to create a query for, I filtered a list of threats through opinion poll data and another tool that helped me establish which threats are debated in Russian mass media (both print and online). Then it was ScraperWiki’s turn.

ScraperWiki allows its users to analyse information in the cloud, including the number of re-tweets, frequency analysis of the tweet texts, language, and a number of other parameters, including visuals, which is extremely useful for my research. Another advantage is that I did not need to download any social network analysis programs, which tend to run only on Windows while I have a Mac. In order to analyse Twitter data, I first entered the identified threats into separate datasets that extract tweets from the Cyrillic segment of Twitter, because the threats are in Cyrillic as well (in the case of Pussy Riot, for example, I entered its Cyrillic transliteration, ‘Пусси’).

ScraperWiki data hub

The maximum number of datasets allowed on the cheaper plan is 10, so I included the 8 threats that scored on both parameters from my previous filtering (the West, fundamentalism, China, terrorism, inter-ethnic conflict, the US, Pussy Riot and homosexuality) and subsequently tested the remaining threats (Georgia, Estonia, foreign agents, Western investment, and Jews) to check which ones would yield more tweets and so earn a place in the data hub. Surprisingly enough, the two extra datasets that completed my data hub ended up being ‘foreign agents’ and ‘Jews’. To illustrate ScraperWiki’s features, I show two screenshots below that explain the data visualisation.

Summary of Twitter data for query ‘yevrei’

The most important data for my dissertation, shown in the second screenshot, is the word frequency analysis, which shows which words were used most often in the tweets, in this case from the query ‘yevrei’ (Jews). ScraperWiki also offers an option to view the data in a table, so it is also possible to view the ‘raw’ data and close-read the relevant tweets, which in this dataset include the words ‘vinovaty’ (guilty), ‘den’gi’ (money), ‘zhidy’ (kikes) and so on. But even without the close reading, some of the frequently used words reveal ‘negative conceptions about a group of people’, the cornerstone of an enemy image.
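For readers who want to reproduce this kind of word frequency analysis outside ScraperWiki, here is a minimal sketch in Python. It assumes the tweets have already been exported as a list of strings; the variable names, example tweets and stop-word list are illustrative choices of mine, not part of ScraperWiki's output.

[sourcecode language="python"]
# Minimal sketch: count the most frequent words in a list of (Cyrillic) tweet texts.
# `tweets` and the stop-word list are illustrative; swap in your exported data.
import re
from collections import Counter

tweets = [
    "Кто виноват? Опять деньги...",   # invented examples, for illustration only
    "Деньги и деньги",
]

stopwords = {"и", "в", "не", "на", "что", "кто"}  # extend as needed

counts = Counter()
for text in tweets:
    # In Python 3, \w matches Cyrillic letters, so this simple tokeniser handles Russian.
    for word in re.findall(r"\w+", text.lower()):
        if word not in stopwords and not word.isdigit():
            counts[word] += 1

# The 20 most frequent words, similar to ScraperWiki's word frequency view.
for word, n in counts.most_common(20):
    print(word, n)
[/sourcecode]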

Visuals from query ‘yevrei’

 

The second most important data for my dissertation is contained in the ‘media’ section, where the collection of tweeted pictures is summarised. The visuals in this collection are sorted from most to least re-tweeted, and I can then analyse them according to my visual analysis methodology.

Even though ScraperWiki solves a lot of my data collection and analysis problems, my analysis is also made easier by the language itself: Cyrillic provides a sort of cut-off principle for the tweets that would not be possible with English (unless ScraperWiki adds a feature that allows geographic filtering of tweets). But in general it is a great solution for social scientists like me who do not have a lot of funding or substantial IT knowledge, but still want to perform solid research involving big data.

 

Digitally enhanced social research
https://blog.scraperwiki.com/2014/03/digitally-enhanced-social-research/
Fri, 21 Mar 2014

Guest post by Dr Rebecca Sandover.

Rebecca Sandover

The continued expansion of social media activity raises many questions about how this ever-changing digital life spreads ideas and how ‘contagious’ online events arise. Exeter University’s Contagion project has been running since September 2013, funded by the UK Economic and Social Research Council, to explore how such events spread across several spheres. Presently the focus of the project is on developing methods for researching ‘contagious’ or ‘viral’ online social interactions. The aim is to discern the environments from which spikes of activity arise and to investigate these spikes in order to further our understanding of online social activity.

The team involved in the project are human geographers, mostly with backgrounds in research significantly informed by social theory. Therefore the key aims of the project are to complement the insights of social theory with the capabilities of data analysis and to develop a toolbox of methods that other social scientists could adapt. We’re keen to stress our novice status in the field of data science, and user-friendly platforms such as ScraperWiki are powering our investigations, enabling us to specify data searches, manipulate data and output those data in ways that are intuitive and easily developed. Tools such as ScraperWiki, Google Refine, Gephi and BASE enabled us to be up and running with our investigations as soon as the project began.


On starting the project we ran a pilot study on social media activity around last year’s badger cull in the UK. Like similar research projects, we decided to concentrate on Twitter because its data is publicly accessible. By analysing the cull through various hashtags we have begun to identify formations of online communities, interactions of influencers, and the intersections of conventional and burgeoning media in shaping responses to events. It certainly helps to enliven research when events go viral, such as the moment when the badgers were said to have ‘moved the goalposts’. Spikes like this show social media at its best: responsive, inventive, and creating humour out of tense situations.

Through the pilot study we have been able to acquaint ourselves with the open access tools mentioned above and to develop different approaches to analysing the data emerging from the event. Using Google Refine to clean complex and messy data, and Gephi to represent the data spatially and graphically, we have created visual representations of social networks interacting on Twitter. Combining OpenOffice Base with data from ScraperWiki has allowed us to develop SQL queries to interrogate patterns emerging from the data; for example, drawing out data for time series graphs to identify peaks of activity (a representative query is sketched below). The broad-based social science backgrounds of the team motivate an investigation beyond the numbers, exploring as fully as we can the social frameworks feeding these events through various online and offline factors. This experimental and mixed-methods approach will be reflected in forthcoming research papers and conference presentations exploring these methods for ‘scraping the social’ (Marres & Weltevrede 2013), which we will share through the project website.
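As a rough illustration of the kind of query involved, the sketch below counts tweets per day for a single hashtag from a local SQLite export, the sort of result you would then plot as a time series. The file, table and column names (`tweets`, `created_at`, `text`) are my own assumptions for the example, not the project's actual schema.

[sourcecode language="python"]
# Minimal sketch: count tweets per day to find peaks of activity.
# The database file, table name and column names are illustrative assumptions;
# substitute whatever your ScraperWiki export actually contains.
import sqlite3

conn = sqlite3.connect("badger_cull_tweets.sqlite")

query = """
    SELECT date(created_at) AS day, COUNT(*) AS mentions
    FROM tweets
    WHERE text LIKE '%#badgercull%'      -- one hashtag at a time
    GROUP BY day
    ORDER BY day
"""

# Assumes `created_at` holds ISO-formatted timestamps, which SQLite's date() understands.
for day, mentions in conn.execute(query):
    print(day, mentions)

conn.close()
[/sourcecode]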

Badger mentions
Dr Rebecca Sandover is an Associate Research Fellow working in the Department of Geography at the University of Exeter. She is currently working on Prof. Stephen Hinchliffe‘s Contagion project, which is investigating the diffusion of ideas and cultural change through social media, as well as exploring such diffusions within the financial sector and through the spread of disease.
Getting SkilledUp
https://blog.scraperwiki.com/2013/03/getting-skilledup/
Thu, 21 Mar 2013

Guest post by SkilledUp‘s Nick Gidwani.

ScraperWiki is a revolutionary tool. Not just because it allows you to collect data, but because it allows anyone – including journalists who now must specialize in data – to organize and draw conclusions from vast data sets. That skill set (organizing what is now called “big data”) was not something that was expected of journalists just a few years ago. Now, in our infographic and data-hungry new world, being versed in analysis is critical.

Knowledge work – the type of work that requires creativity, problem solving and “thinking” – is the most important and valued work of the future. Increasingly, to become a knowledge worker, one must learn a varied set of skills rather than just being a master of one’s own domain. We are already seeing “Growth Hacker” become an increasingly valued position, where individuals with comfort and expertise in data, analysis, marketing and product development can combine those skills to do what was once the job of several people. It is a rarity these days to find a writer who doesn’t have at least some proficiency in social media concepts, optimizing for SEO or editing an image in Photoshop.

Few university graduates arrive on the job with any of these hard skills – the ability to contribute immediately. Those starting off at a Fortune 100 company stand a good chance of having an internal training system to get them started and trained on the newest tools. Everyone else, though, is expected to learn on their own: not an easy proposition, especially in today’s very competitive labour market.

Well, the good news is that there is an army of experts, startup CEOs, and others who have run companies, mentored and trained employees, and had success in business, and who are willing to share their skills – the how, the why, and the what – with large audiences via online learning. Even better, much of this content is free or available at a very low cost, and widely accessible. Although companies like lynda.com have been around for a while, we’ve seen exponential growth in the variety and number of online courses introduced in the last 18 months. New, easy-to-use tools for building online courses have also accelerated their creation. Unlike the university model, this educational content is highly fragmented, rarely features certification, and is priced all over the map.

SkilledUp‘s goal is to organize, curate and review the world of online education, with a focus on the type of education that imparts marketable job skills. We believe that in this new world, many of the best workers are going to be those who teach themselves these skills by first learning from experts, and then trying their hand by experimenting in their jobs or on their own. While there is a lot of great training from excellent instructors, much of it is prohibitively expensive or too lengthy to apply quickly. We hope to make it easier to separate the best from the rest.

We’ve begun by creating our online course search app, which is currently organizing 50,000 courses from over 200 unique libraries. We expect these numbers to continuously grow, especially in areas like MOOCs (Massive Open Online Courses), Talks (things like TED.com or Poptech.com) and even E-books. We understand that everyone learns differently, and while some may prefer a 12-hour course with video, others prefer a high quality e-book, perhaps with exercise files.

Over time, we expect to add signal data to these courses, so that parsing the ‘best’ from the ‘rest’ is easier than reading a review.

To create something that is truly comprehensive and robust, we are asking the ScraperWiki community for help: more course libraries need gathering, sorting and adding to our database, especially of the open (aka free) variety. We’re looking for people who can scrape sites, suggest learning libraries they know about, or otherwise help us build the most robust and useful index possible.

If you have any ideas, or can contribute a few simple lines of code, please get in touch via the ScraperWiki Google Group, drop a comment below, or email us directly at data@skilledup.com.

Nick Gidwani uses ScraperWiki as part of his startup company: SkilledUp.

Combining survey data with scraped data to tell a story
https://blog.scraperwiki.com/2013/03/combining-survey-data-with-scraped-data-to-tell-a-story/
Mon, 18 Mar 2013

Guest post by Dan Armstrong.

My last three jobs have involved telling stories with survey data. Since every story requires a conflict, I look for conflicts. For instance, I’ll go to CFOs and ask them if their marketing people are doing a good job. Then I’ll ask the marketers if they think they’re doing a good job. Here’s a surprise: They don’t agree. And therein lies a story.

One way I’ve been able to uncover stories is to compare what people tell me with other data that they don’t tell me.  That’s where Scraperwiki comes in. There’s a lot of company information available on sites ranging from EDGAR to Wikipedia to Yahoo Finance.  It’s all public. You can buy it from people who aggregate and repackage it. Or you can scrape it. I choose the latter.

In my current job at ITSMA, the IT Services Marketing Association, we did a brand awareness survey of buyers of IT services. We asked them: “Which companies come to mind when you think of technology consulting and services?” The executives named 49 companies more than once. We tracked the percentage of mentions that each company received, ranging from 52% (for IBM) down to a single mention (114 companies were named only once). (Nobody mentioned ScraperWiki, but I’m confident that they will next year.)

What drives brand awareness? It’s partly how much the company spends to promote itself. But it’s mostly size, usually measured as number of employees, market capitalization or annual revenue.  So I collected the ticker symbols of the companies mentioned by our survey respondents and wrote a scraper (with the help of the good people on the Scraperwiki Google Group) to collect data on company size. I chose to go to Yahoo Finance because the page layout there is easy to parse, and the URLs are based on ticker symbols. The most recent version of the scraper is at https://scraperwiki.com/scrapers/yahoo_finance_company_info/.
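The scraper itself is no longer online, but the basic pattern is simple: build a URL from each ticker symbol, fetch the page, and pull out the company-size fields. The sketch below is an illustrative reconstruction only; the URL template and the CSS selector are placeholders of mine and do not reflect Yahoo Finance's actual page structure, which has changed many times since this was written.

[sourcecode language="python"]
# Illustrative sketch of a ticker-driven scraper. The URL template and the CSS
# selector are placeholders -- they do not describe Yahoo Finance's real pages.
import requests
from bs4 import BeautifulSoup

tickers = ["IBM", "ACN", "INFY"]  # ticker symbols collected from the survey mentions

for ticker in tickers:
    url = "https://example.com/finance/profile/{0}".format(ticker)  # hypothetical URL pattern
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selector for a "full-time employees" cell on the profile page.
    cell = soup.select_one("td.employees")
    employees = cell.get_text(strip=True) if cell else None

    print(ticker, employees)
[/sourcecode]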


I found that the company size variable most closely correlated with brand awareness was the number of professionals available to take on large-scale projects.  Clearly there’s a power law at work: a handful of companies with very high brand awareness and a “long tail” of firms that are barely on the radar.

As the chart (created using Tableau) shows, this pattern mirrors another dynamic: the fact that there are a few very big companies and many small ones. Brand perceptions reflect the underlying reality of company size.

My next project will be to scrape the ages of CEOs to update this age distribution from a few years back: http://www.analyzethis.net/2009/11/30/the-93-year-old-ceo/  (The oldest CEO, Walter Zable, died last year at the age of 97.) The age data is all there on Yahoo Finance. It’s going to take some regular expressions to get at it. But I’m confident there will be a story in the results.
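As a taste of what those regular expressions might look like, here is a minimal sketch that pulls ages out of executive-profile text. The sample text and the "Age: NN" labelling are assumptions made for illustration; the real pages may format the field differently.

[sourcecode language="python"]
# Minimal sketch: extract executive ages from profile text with a regular expression.
# The sample text and the "Age: NN" pattern are illustrative assumptions only.
import re

profile_text = """
Mr. John Example, Chairman of the Board, Age: 72
Ms. Jane Sample, President and Chief Executive Officer, Age: 55
"""

# Capture the number following "Age:", allowing for optional whitespace.
ages = [int(age) for age in re.findall(r"Age:\s*(\d{2,3})", profile_text)]

print(ages)                          # [72, 55]
print(sum(ages) / float(len(ages)))  # mean age -- the start of an age distribution
[/sourcecode]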

You can catch Dan on Twitter via @ITSMA_B2B, or post feedback, suggestions and ideas to him on the ScraperWiki Group.

Glue Logic and Flowable Data
https://blog.scraperwiki.com/2013/03/glue-logic-and-flowable-data/
Mon, 04 Mar 2013

Guest post by Tony Hirst.

Tony Hirst

As well as being a great tool for scraping and aggregating content from third party sites, ScraperWiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more ScraperWiki database tables and then a view to develop an application-style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using ScraperWiki, though, and that is to give life to data as flowable web data.

Ever since I first read The UN peacekeeping mission contributions mostly baked just over a year ago, I’ve had many guilty moments using ScraperWiki, grabbing down data and then… nothing. In that post, Julian Todd opened up with a salvo against those of us who create a scraper that sort of works and then think: “Job done.”

Many of the most promising webscraping projects are abandoned when they are half done. The author often doesn’t know it. “What do you want? I’ve fully scraped the data,” they say.

But it’s not good enough. You have to show what you can do with the data. This is always very hard work. There are no two ways about it.

So whilst I have created more than my fair share of half-abandoned scraper projects, I always feel a little bit guilty about not developing an actual application around a scraped dataset. My excuse? I don’t really know how to create applications at all, let alone applications that anyone might use… But on reflection, Julian’s quote above doesn’t necessarily imply that you always need to build an application around a dataset; it just suggests that you should “show what you can do with the data”. Which, to my mind, includes showing how you can free up the data so that it can naturally flow into other tools and applications.

One great way of showing what you can do with the data is to give some example queries over your ScraperWiki database using the ScraperWiki API. A lot of “very hard work” may be involved (at least for a novice) in framing a query that allows you to ask a particular sort of question of a database, but once written, such queries can often be parametrised, allowing the user to just change one or two search terms, or modify a specified search range. Working queries also provide a good starting point for developing new queries by means of refining old ones. The ScraperWiki API offers a variety of output formats (CSV, JSON, HTML data tables), which means that your query might actually provide the basis for a minimum viable application, providing URL-accessible custom data views via an HTML table, for example.
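To make the idea of a parametrised query concrete, here is a minimal sketch of the pattern: a templated SQL statement whose search term and date are filled in at request time and sent to a SQL-over-HTTP endpoint that returns JSON. The endpoint URL below is a placeholder rather than the exact ScraperWiki API address, and the column names are illustrative; `swdata` is simply ScraperWiki's default table name.

[sourcecode language="python"]
# Minimal sketch of a parametrised query against a SQL-over-HTTP data API.
# The base URL and column names are placeholders; `swdata` is ScraperWiki's default table name.
import requests

API_URL = "https://example.com/api/sqlite"  # hypothetical endpoint, not the real API address

def daily_mentions(search_term, since):
    """Count matching rows per day after a given date, returned as parsed JSON."""
    query = (
        "SELECT date(created_at) AS day, COUNT(*) AS n "
        "FROM swdata "
        "WHERE text LIKE '%{term}%' AND created_at >= '{since}' "
        "GROUP BY day ORDER BY day"
    ).format(term=search_term, since=since)

    response = requests.get(API_URL, params={"format": "json", "query": query}, timeout=30)
    response.raise_for_status()
    return response.json()

# Changing one or two parameters gives a completely different view over the same data.
print(daily_mentions("peacekeeping", "2013-01-01"))
[/sourcecode]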

Through playing with a wide variety of visualisation tools and toolkits, I have learned a couple of very trivial-sounding but nonetheless important lessons:

  1. different tools work out-of-the-box with different data import formats
  2. a wide variety of data visualisation toolkits can generate visualisations for you out-of-the-box if the data has the right shape (for example, the columns and rows are defined in a particular way or using a particular arrangement).

As an example of the first kind, if you have data in the Google Visualisation API JSON data format, you can plug it directly into a sortable table component, or a more complex interactive dashboard. For example, in Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API I describe how data in the appropriate JSON format can be plugged into an HTML view that generates a sortable table with just the following lines of code:

[sourcecode language="javascript"]
// jsonData contains data in the appropriate Google Visualisation API JSON format
var json_table = new google.visualization.Table(document.getElementById('table_div_json'));
var json_data = new google.visualization.DataTable(jsonData, 0.6);
json_table.draw(json_data, {
  showRowNumber: true
});
[/sourcecode]

An example of the second kind might be getting data formatted correctly for a motion chart.

To make life easier, some of the programming libraries that are preinstalled on ScraperWiki can generate data in the correct format for you. In the case of the above example, I used the gviz_api Python library to transform data obtained using a ScraperWiki API query into the format expected by the sortable table.

Also in that example, I used a single Scraperwiki Python view to pull down the data, transform it, and then paste it into a templated HTML page.
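For anyone wanting to follow the same route, the skeleton of such a view looks roughly like the sketch below. The rows are invented for the example and the column names are mine; the gviz_api calls (DataTable, LoadData, ToJSon) are the library's documented interface.

[sourcecode language="python"]
# Minimal sketch of the transformation step: turn query results into the Google
# Visualisation API JSON expected by the sortable table. The rows and column names
# here are invented; in the original example they came from a ScraperWiki API query.
import gviz_api

description = {"country": ("string", "Country"),
               "troops":  ("number", "Troop contribution")}

rows = [{"country": "Bangladesh", "troops": 8000},
        {"country": "Pakistan",   "troops": 7500}]

data_table = gviz_api.DataTable(description)
data_table.LoadData(rows)

# jsonData can then be dropped into the HTML template used by the JavaScript snippet above.
jsonData = data_table.ToJSon(columns_order=("country", "troops"), order_by="troops")
print(jsonData)
[/sourcecode]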

Another approach is to create a view that generates a data feed, accessible via its own URL, and then consume that feed either in another view or in another application or webpage outside the ScraperWiki context altogether. This is the approach I took when trying to map co-director networks using data sourced from OpenCorporates (described in Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx).

Here’s the route I took; one thing to reflect on is not necessarily the actual tools used, but the architecture of the approach:

OpenCorporates graph flow

In this case, I used ScraperWiki to run a bootstrapped search over OpenCorporates: first finding the directors of a particular company, then searching for other directors with the same name, then pulling down company details corresponding to those director records, and so on. A single ScraperWiki table could then be used to build up a list of (director, company) pairs showing how companies were connected based on co-director links. This sort of data is naturally described using a graph (graph theory style, rather than line chart! ;-), the sort of object that is itself naturally visualised using a network-style layout. But they’re hard to generate, right?

Wrong… JavaScript libraries such as d3.js and sigma.js both provide routes to browser-based network visualisations if you supply data in the right sort of JSON or XML format. And it just so happens that another of the Python libraries available on ScraperWiki, networkx, is capable of generating those data formats from a graph that is straightforwardly constructed from the (director, company) paired data.

Company director network

As a graph processing library, networkx can also be used to analyse the corporate data using network methods: for example, reducing the director1-company-director2 relations to director1-director2 relations (showing directors related by virtue of being co-directors of one or more of the same companies), or performing a complementary mapping for company1-director-company2 style relations (a minimal sketch of this construction follows below).
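To make that concrete, here is a minimal networkx sketch of the construction described above: build the bipartite director/company graph from (director, company) pairs, project it onto the directors, and serialise the result into formats that d3.js, sigma.js or Gephi can read. The sample pairs are invented; node-link JSON and GEXF are the formats mentioned in the linked write-up, though this code is a reconstruction rather than the original scraper.

[sourcecode language="python"]
# Minimal sketch: build a director/company graph from (director, company) pairs,
# project it down to a director-director network, and export formats that d3.js,
# sigma.js or Gephi can read. The pairs themselves are invented examples.
import json

import networkx as nx
from networkx.algorithms import bipartite
from networkx.readwrite import json_graph

pairs = [("Director A", "Company 1"),
         ("Director B", "Company 1"),
         ("Director B", "Company 2"),
         ("Director C", "Company 2")]

# Bipartite graph: one node per director, one per company, an edge per directorship.
G = nx.Graph()
for director, company in pairs:
    G.add_node(director, kind="director")
    G.add_node(company, kind="company")
    G.add_edge(director, company)

# Reduce director1-company-director2 relations to director1-director2 relations.
directors = {n for n, d in G.nodes(data=True) if d["kind"] == "director"}
director_net = bipartite.projected_graph(G, directors)

# Node-link JSON is the shape d3.js force layouts expect; GEXF suits Gephi and sigma.js.
print(json.dumps(json_graph.node_link_data(director_net)))
nx.write_gexf(director_net, "co_directors.gexf")
[/sourcecode]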

As ScraperWiki views can be passed, and hence act on, URL variables, it’s possible to parametrise the “graph view” to return data in different formats (JSON or XML) as well as different views (full graph, director-director mapping, etc.). All of which, I hope, shows one or two ways of using the data, as well as making it easier for someone else with an eye for design to pull that data into an actual end-user application.

Tony Hirst is an academic in the Dept of Communication and Systems at The Open University and a member of the Open Knowledge Foundation Network. He blogs regularly at ouseful.info and on OpenLearn. Tony’s profile pic from ouseful.info was taken by John Naughton.
