thomaslevine – ScraperWiki: Extract tables from PDFs and scrape the web

Data Business Models (27 February 2013)
https://blog.scraperwiki.com/2013/02/data-business-models/

If it sometimes feels like the data business is full of buzzwords and hipster technical jargon, then that’s probably because it is. But don’t panic! I’ve been at loads of hip and non-hip data talks here and there and, buzzwords aside, I’ve come across four actual categories of data business model in this hip data ecosystem. Here they are:

  1. Big storage for big people
  2. Money in, insight out: Vertically integrated data analysis
  3. Internal data analysis on an organization’s own data
  4. Quantitative finance

1) Big storage for big people

This is mostly Hadoop. For example,

  • Teradata
  • Hortonworks
  • MapR
  • Cloudera

Some people are using NoHadoop. (I just invented this word.)

  • Datastax (Cassandra)
  • Couchbase (Couch but not the original Couch)
  • 10gen (Mongo)

Either way, these companies sell consulting, training, hosting, proprietary special features &c. to big businesses with shit tons of data.

2) Money in, insight out: Vertically integrated data analysis

Several companies package data collection, analysis and presentation into one integrated service. I think this is pretty close to “research”. One example is AIMIA, which manages the Nectar card scheme; as a small part of this, they analyze the data that they collect and present ideas to clients. Many producers of hip data tools also provide hip data consulting, so they too fall into this category.

Data hubs

Some companies produce suites of tools that approach this vertical integration; when you use these tools, you still have to look at the data yourself, but it is made much easier. This approaches the ‘data hubs’ that Francis likes talking about.

Lots of advertising, web and social media analytics tools fall into this category. You just configure your accounts, let data accumulate, and look at the flashy dashboard. You still have to put some thought into it, but the collection, analysis and presentation are all streamlined and integrated and thus easier for people who wouldn’t otherwise do this themselves.

Tools like Tableau, ScraperWiki and RStudio (combined with its tangential R services) also fall into this category. You still have to do your own analysis, but they let you do all of it in one place, and connections between that place, your data sources and your presentation media are easy. Well, that’s the idea at least.

3) Internal data analysis

Places with lots of data have internal people do something with them. Any company that’s making money must have something like this. The mainstream companies might call these people “business analysts”, and they might do all their work in Excel. The hip companies are doing “data science” with open source software before it gets cool. And the New York City government has a team that just analyzes New York data to make the various government services more efficient. For the current discussion, I see these as similar sorts of people.

I pondered distinguishing analysis that informs businessy decisions from models that get written into software. But since I’m just categorising business models, and both sorts of output could be produced by the same person working inside a company with lots of data, I chose not to distinguish between them.

4) Quantitative finance

Quantitative finance is special in that the data analysis is very close to being a product in itself. The conclusion of an analysis or algorithm is “make these trades when that happens”, rather than “if you market to these people, you might sell more products”.

This has some interesting implications. For one thing, you could have a whole company doing quantitative finance. On a similar note, I suspect that the analyses can be more complicated, because they might only need to be conveyed to people with quantitative literacy; in the other categories, it might be more important to convey insights to non-technical managers.

The end

Pretend that I made some insightful, conclusionary conclusion in this sentence. And then get back to your number crunching.

Hip Data Terms (26 February 2013)
https://blog.scraperwiki.com/2013/02/hip-data-terms/

“Big Data” and “Data Science” tend to be terms whose meaning is defined the moment they are used. They are sometimes meaningful, but their meaning depends on context. From the agendas of many hip and not-so-hip data talks, we can pull together some of the definitions people have in mind and try to describe how big data and data science are used now.

Big data

When some people say big data, they are describing something physically big—in terms of bytes. So a petabyte of data would be big, at least today. Other people think of big data in terms of thresholds of big: if data don’t fit into random-access memory, or cannot be stored on a single hard drive, they’re talking about big data. More generally, we might say that if the data can’t be handled in Excel (the world’s standard data analysis tool), they are certainly big, and you need to know something more about computing in order to store and access them.

Judging data-bigness by physical size sometimes works today, but sizes that seem big today are different from what seemed big twenty years ago and from what will seem big in twenty years. Let’s look at two descriptions of big data that get at the causes of data-bigness instead.

One presenter at Strata London 2012 proposed that big data comes about when it becomes less expensive to store data than to decide whether or not to delete it. Filing cabinets and libraries are giving way to Hadoop clusters and low-power hard drives, so it has recently become reasonable to just save everything.

The second thing to look at is where all of this data comes from. Part of this big data thing is that we can now collect much more data automatically. Before computers, if the post office wanted to study where mail was sent, it could sample letters at various points and record their destinations, return addresses and routes. Today we already have all our emails, Twitter posts and other correspondence in reasonably standard digital formats, so the process is far more automatic, and we can collect much more data.

Data science

So, what is ‘data science’? It broadly seems to be some combination of ‘statistics’ and ‘software engineering’. They’re in quotes because these categories are ambiguous and because they are difficult to define except in terms of one another. Let’s define ‘data science’ by relating it to ‘statistics’ and ‘software engineering’, and we’ll start with statistics.

‘Data science’ and ‘statistics’

First off, the statistical methods used in ‘data science’ and ‘big data’ seem quite unsophisticated compared to those used in ‘statistics’. Often, it’s just search. For example, the data team at La Nación demonstrated how they’re acquiring loads of documents and allowing journalists to search them. They will certainly start doing crude quantitative analyses on the overall document sets eventually, but even the search alone has already been valuable. (Their team pulls in another hip term: ‘data journalism’.)

The quantitative analyses that do happen are often quite simple. Consider the FourSquare checkin analyses that a couple of people from FourSquare demoed at DataGotham. The demo mostly comprised scatterplots of checkins on top of a map, sometimes played over time. They touched on the models they were using to guess where someone wanted to check in, but they emphasised the knowledge gained from looking at checkin histories, and these simple plots were helpful for conveying it.

In other cases, ‘data science’ simply implies ‘machine learning’. Compared to ‘statistics’, ‘machine learning’ implies a focus on prediction rather than inference. Statisticians seem to make use of more complex models on simpler datasets, and to be more concerned with consuming and applying data than with the modelling of the data.

‘Data science’ and ‘software engineering’

The products of ‘software engineering’ tend to be tools, and the products of ‘data science’ tend to be knowledge. We can break that distinction into some components for illustration. (NB: These components exaggerate the differences.)

Realtime v. batch: If something is ‘realtime’, it is the result of ‘software engineering’; ‘data science’ is usually done in batches. (Let’s not worry too much about what ‘realtime’ means; taking it to mean push rather than pull is a reasonable enough definition.)

Organization: ‘Data scientists’ are embedded within organizations that have questions about data (typically about their own data, though that depends on how we think of ownership). Consider any hip web startup with a large database. ‘Software engineers’, on the other hand, make products to be used by other organizations or by other departments within a large organization. Consider any hip web startup ever. Also consider some teams within large companies; I know someone who worked at Google as a ‘software engineer’ to write code for packaging ChromeBooks.

What about ‘analysts’?

If we simplify the world to a two-dimensional space, ‘data scientists’, ‘statisticians’, ‘software engineers’ and ‘engineers’ might land as in the diagram below. (The chart says ‘developer’ where this post says ‘software engineer’. Oops.)

[Diagram: where data scientists, statisticians, developers and engineers fall relative to one another]

Conflating ‘data science’ and ‘big data’

Some people conflate ‘data science’ and ‘big data’. For some definitions of these two phrases, the conflation makes perfect sense, like when ‘big data’ means that the data are big enough that you need to know something about computers. Some people are more concerned with ‘data science’ than with ‘big data’, and vice versa. For example, ‘big data’ is much talked about at Strata, but ‘data science’ isn’t discussed as much; perhaps ‘big data’ is buzzier and more popular with the marketing departments. To other people, ‘data science’ is the more common term, partly to emphasise that they can do useful things with small datasets too. And it might simply be that we want some word to describe what we do: ‘statistician’ and ‘software developer’ aren’t close enough, but ‘data scientist’ is decent.

Utility of these definitions

Consider taking this post with a grain of salt. Some definitions may be clearer to one group of people than to another, and they may be over-simplified here. On the other hand, these definitions are intended to be descriptive rather than prescriptive, so they might be more useful than some other definitions that you’ve heard. No matter how you define a hip or un-hip term, it is impossible to avoid all ambiguities.

How to test shell scripts (12 December 2012)
https://blog.scraperwiki.com/2012/12/how-to-test-shell-scripts/

Extreme hipster superheroes like me need tests for their shell. Here’s what’s available.

YOLO: No automated testing

Few shell scripts have any automated testing because shell programmers live life on the edge. Inevitably, this results in tedious manual ‘testing’. Loads of projects use this approach.

Here are some more. I separated them because they’re all shell profiles.

This is actually okay much of the time. The programs I reference above are reasonably complex, but shell scripts are often much simpler; shell is often convenient for small connections among programs and for simple configuration. If your shell scripts are short and easy to read, maybe you don’t need tests.

Posers: Automated commands with manual human review

You can easily generate a rough test suite by just saving the commands you used for manual debugging; this creates the illusion of living only once while actually living multiple times. Here are some examples.

These scripts just run a range of commands, and you look for weird things in the output. You can also write up the intended output for comparison.

Mainstream: Test cases are functions

This approach is somewhat standard in other languages: write functions inside of files or classes, and run assertions within those functions. Failed assertions and other errors are caught and reported.

In Roundup, test cases are functions, and their return code determines whether the test passes. Shell already has a nice assertion command called test, so Roundup doesn’t need to implement its own. It also helps you structure your tests; you can use the describe function to name your tests, and you can define before and after functions to be run before and after test cases, respectively. For an example of Roundup in action, check out spark.
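
A Roundup plan is just a shell file; here’s a minimal sketch (greet.sh and its expected output are invented for illustration), saved as something like greet-test.sh and run with roundup:

describe "greet.sh"

it_prints_a_greeting() {
    greeting=$(./greet.sh World)
    test "$greeting" = "Hello, World"
}

it_exits_zero_on_success() {
    ./greet.sh World > /dev/null
}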

shunit is similar. One notable difference is that it defines its own assertion functions, like assertEquals and assertFalse. git-ftp uses it.
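
With shunit, the same two checks might look roughly like this (again, greet.sh is invented, and the path to shunit2 depends on where you installed it):

testPrintsAGreeting() {
    greeting=$(./greet.sh World)
    assertEquals "greeting should name the greetee" "Hello, World" "$greeting"
}

testExitsZeroOnSuccess() {
    ./greet.sh World > /dev/null
    assertTrue "greet.sh should exit zero" $?
}

# Source shunit2 last; adjust the path to wherever it is installed.
. ./shunit2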

tf is also similar, but it is cool because it provides some special shell-style assertions (“matchers”) that are specified as shell comments. Rather than just testing status codes or stdout, you can also test environment characteristics, and you can test multiple properties of one command. rvm uses it.

There are some language-agnostic protocols with assertion libraries in multiple languages. The idea is that you can combine test results from several languages. I guess this is more of a big deal for shell than for other languages because shell is likely to be used for a small component of a project that mostly uses another language. WvTest and the Test Anything Protocol (this site is down for me right now) are examples of that.
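
For example, a test run that speaks the Test Anything Protocol just prints lines like these (the test names are invented), and a TAP consumer in any language can aggregate them:

1..3
ok 1 - scraper exits zero
ok 2 - output contains a header row
not ok 3 - handles empty input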

Even though all of these frameworks exist, artisanal test frameworks are often specially crafted for specific projects. This is the case for bash-toolbox and treegit.

Implementing your own framework like this is pretty simple; the main thing you need to know is that $? gives you the exit code of the previous command, so something like this will tell you whether the previous command passed.

test "$?" = '0'
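
A whole hand-rolled framework doesn’t need much more than that. Here’s a sketch of one (the checks are placeholders for whatever your project needs):

#!/bin/sh
# Tiny hand-rolled test runner: run each check and count failures.
failures=0

check() {
    "$@"
    if test "$?" = '0'; then
        echo "PASS: $*"
    else
        echo "FAIL: $*"
        failures=$((failures + 1))
    fi
}

check test -x ./myscript.sh
check sh -n ./myscript.sh              # syntax check only
check grep -q 'usage' ./myscript.sh

exit "$failures"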

Ironic elegance: Design for the shell

Assertion libraries are common and reasonable in other languages, but I don’t think they work as well for shell. Shell uses a bizarre concept of input and output, so the sort of assertion functions that work in other languages don’t feel natural to me in shell.

In Urchin, test cases are executable files. A test passes if its exit code is 0. You can define setup and teardown procedures; these are also files. For an example of Urchin tests, check out nvm. (By the way, I wrote both Urchin and the nvm tests.)
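
As a sketch (the file names are invented), an Urchin test directory might look like this:

tests/setup                              # runs before each test in this directory
tests/test_prints_usage                  # an executable file; it passes if it exits 0
tests/test_exits_nonzero_on_bad_input

where an individual test is just a small executable script, for example:

#!/bin/sh
# tests/test_prints_usage
./myscript.sh --help | grep -q 'usage'

You then point urchin at the directory to run everything in it.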

In cmdtest, one test case spans multiple files. Minimally, you provide the test script, but you can also provide files for the stdin, the intended stdout, the intended stderr and the intended exit code. Like in Urchin, the setup and teardown procedures are files.
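
If I have the conventions right, a cmdtest suite is also a directory of files, along these lines (the foo test is invented):

tests/foo.script     # the commands to run
tests/foo.stdin      # fed to the script's standard input (optional)
tests/foo.stdout     # the intended standard output (optional)
tests/foo.stderr     # the intended standard error (optional)
tests/foo.exit       # the intended exit code, if it isn't 0 (optional)

cmdtest runs each .script file and diffs what actually happened against the intended files.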

The fundamental similarity that I see between Urchin and cmdtest is that they are based on files rather than functions; this is much more of a shell way to do things. There are obviously other similarities between these two frameworks, but I think most of the other similarities can be seen as stemming from the file basis of test cases.

Here’s one particularly cool feature that might not be obvious. Earlier, I mentioned some protocols for testing in multiple languages. I found them somewhat strange because I see shell as the standard interface between languages. In Urchin and cmdtest, test cases are just files, so you can actually use these frameworks to test code written in any language.

Which framework should I use?

If you are writing anything complicated in shell, it could probably use some tests. For the simplest tests, writing your own framework is fine, but for anything complicated, I recommend either Urchin or cmdtest. You’ll want to use a different one depending on your project.

cmdtest makes it easy to specify inputs and test outputs, but it doesn’t have a special way of testing what files have changed. Also, the installation is a bit more involved.

Urchin doesn’t help you at all with outputs, but it makes testing side-effects easier. In Urchin, you can nest tests inside of directories; to test a side-effect, you make a subdirectory, put the command of interest in the setup_dir file and then test your side effects in your test files. Urchin is also easier to install; it’s just a shell script.
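
For example, to check that a hypothetical install.sh creates a configuration file, the nested directory might contain these two files:

#!/bin/sh
# tests/install/setup_dir: run the command whose side effects we care about
./install.sh

#!/bin/sh
# tests/install/test_creates_config_file: check one side effect
test -f "$HOME/.myscriptrc"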

I recommend cmdtest if you are mainly testing input and output; otherwise, I recommend Urchin. If you are working on a very simple project, you might also consider writing your own framework.

For hip trend-setters like me

Test-driven development is mainstream in other languages but uncommon in shell. Nobody does test-driven development in shell, so all of these approaches are ahead of the curve. Hip programmers like me know this, so we’re testing our shell scripts now, before shell testing gets big.

The Big Clean (5 November 2012)
https://blog.scraperwiki.com/2012/11/the-big-clean/

[Image: a “BIG CLEAN” logo that looks like a logo for soap]

I’m just about to return from Prague, Czech Republic, where I gave a workshop at the Big Clean. What a nice little conference this was!

It had two tracks: talks and the workshop. So I didn’t get to see many of the talks :(. But this meant I had the whole day to teach people about cleaning up data.

We started with some overview thoughts on cleaning up data, and then I went through the architecture of an analog data-cleaning process, like it might have been done 30 years ago and is still done more often than you’d realize. I work with computers enough that I can’t stand them, so I drew out a diagram on paper instead of using slides.

The fun part happens when we realize that the architecture is the same when we digitize the process. Once we realize this, the process seems less magic; it’s just a faster version of what people would do. Also, when you break up the project like this, it’s easier to work on it in stages.

I went through the writing of a simple web scraper script, then we broke for lunch, which was prepared by HotKarot using big data experimental social media crowdsourced realtime open-source catering methodologies.

[Screenshot: a webpage with a network graph diagram covering most of the screen and #bigcleancz tweets on the right]

After stuffing ourselves at lunch, we worked on some of the participants’ projects/ideas.

  1. We added a column to a spreadsheet of Czech municipality characteristics by finding municipality areas on another website.
  2. We talked about various approaches to parsing PDF documents for one of Juha‘s projects.
  3. We pulled the song-play history out of Last.fm. (I unfortunately don’t recall whose account we were looking at.) Last.fm exposes loads of data about your activity through its surprisingly convenient API, and this gets interesting if you’ve been using it for seven years.


Party (21 September 2012)
https://blog.scraperwiki.com/2012/09/party/

I went to a three-day party in Buenos Aires this past month. The first two days were talks and workshops.

I gave a talk on how awesome I am and a workshop on cleaning data. The latter involved no computers and no slides, so I held it outside!

I modeled an analog version of the Army Corps 404 Website Scraper—what it might have been like before the internet. I took volunteers to play a courier, an army secretary and staff/volunteers at an advocacy group. The simulated advocacy group acquired paper notices about applications to build on wetlands, archived the notices, recorded some structured information on a table (with four legs!) and then tried to find notices that should not be approved. I had no slides for the workshop, but here are some of my notes/outlines.


The third day was a shockingly organized hackathon. Hacks/Hackers Buenos Aires has a website for discussing hackathon ideas, and people worked on those ideas during the hackathon! Anyway, my team found references to money in court documents.

More importantly, I ate pizza, milanesa, empanadas, dulce de leche and ice cream.

Newspaper articles about the party

Hipster non-print news about the party and about Hacks/Hackers Buenos Aires in general

DumpTruck 0.0.3 (6 August 2012)
https://blog.scraperwiki.com/2012/08/dumptruck-0-0-3/

I’ve added some new features to DumpTruck.

Changes

Dictionary case sensitivity

I removed the dictionaries with case-insensitive keys because that just seemed to be delaying the conversion to case sensitivity.

Ordered Dictionaries

DumpTruck.execute now returns a collections.OrderedDict for each row rather than a dict for each row. Also, order is respected on insert, so you can pass OrderedDicts to DumpTruck.insert or DumpTruck.create_table to specify column order.
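
For example (a minimal sketch against a throwaway database; the table and columns are invented):

from collections import OrderedDict
import dumptruck

dt = dumptruck.DumpTruck(dbname='demo.db')

# Column order follows the OrderedDict's key order.
row = OrderedDict([('name', 'Thomas'), ('height_cm', 180)])
dt.create_table(row, 'people')
dt.insert(row, 'people')

# Each row comes back as a collections.OrderedDict.
print(dt.execute('SELECT * FROM people'))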

Index creation syntax

Previously, indices were created with

DumpTruck.create_index(table_name, column_names)

This order was chosen to match SQL syntax. It has been changed to

DumpTruck.create_index(column_names, table_name)

to match the syntax for DumpTruck.insert.

Handling NULL values

Null value handling has been documented and tweaked.

RowId

dt.insert returns the rowid(s) that were inserted.
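
Continuing the hypothetical people table from above, that should look roughly like this:

rowid = dt.insert({'name': 'Anna'}, 'people')                       # one dict in, one rowid back
rowids = dt.insert([{'name': 'Bob'}, {'name': 'Carol'}], 'people')  # a list in, a list of rowids back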

scraperwiki_local

The DumpTruck interface has changed slightly, so I also adjusted scraperwiki_local based on those changes.

Install

Get the new version from pip.

pip install dumptruck

More documentation

The full documentation is on GitHub.

Do all “analysts” use Excel? (31 July 2012)
https://blog.scraperwiki.com/2012/07/do-all-analysts-use-excel/

We were wondering how common spreadsheets are as a platform for data analysis. It’s not something I’ve really thought about in a while; I find it way easier to clean numbers with real programming languages. But we suspected that virtually everyone else used spreadsheets, specifically Excel, so we did a couple of things to check that.

Methods

First, I looked for job postings for “analyst” jobs. I specifically looked at companies that provide tools or analysis for social media stuff. For each posting, I marked whether it required knowledge of Excel. They all did. And not only did they require knowledge of Excel, they required “Excellent computer skills, especially with Excel”, “Advanced Excel skills a must”, and so on. I generally felt that Excel was presented as the most important skill for each particular job.

Second, I posted on Facebook to ask “analyst” friends whether they use anything other than Excel.

Thomas Levine posted the Facebook status "I'm wondering how common Excel is. If you work as an 'analyst', could you tell me whether you do your analysis in anything other than Excel?", and two of his friends commented saying, quite strongly, that they use Excel a lot.

It seems that they don’t.

Conclusions

It seems that Excel is a lot more common than I’d realized. Moreover, it seems that “analyst” is basically synonymous with “person who uses Excel”.

Having investigated this and concluded that “analyst” is synonymous with “person who uses Excel”, I personally am going to stop saying that I “analyze” data because I don’t want people to think that I use Excel. But now I need another word that explains that I can do more advanced things.

Maybe that’s why people invented that nonsense role “data scientist”, which I apparently am. Actually, Dragon thought we should define “big data” as “data that you can’t analyze in Excel”.

For ScraperWiki as a whole, this analysis (sorry, data science) gives us an idea of the technical expertise to expect of people with particular job roles. We’ve recognized that the current platform expects people to be comfortable programming, so we’re working on something simpler. We pondered making Excel plugins for social media analysis functions, but now we think that they would be far too complicated for the sort of people who could use them, so we’re thinking about ways of making the interface even simpler without being overly constrained.

Twitter Scraper Python Library (4 July 2012)
https://blog.scraperwiki.com/2012/07/twitter-scraper-python-library/

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna‘s basic Twitter scraper into a library. Here’s how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport
search = swimport('twitter_search').search 

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack']) 

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn’t get me all of the tweets because I waited like a week to do this. In order to get a more complete list of the tweets, I looked at the tweets returned from that first search; I searched for tweets referencing the users who had tweeted those tweets.

from scraperwiki.sqlite import save, select, get_var, save_var
from time import sleep

# Search by user to get some more; `search` comes from the swimport above.
# Order the users and save a checkpoint so the script can resume where it left off.
users = [row['from_user'] + ' tcamp12' for row in
    select('distinct from_user from swdata where from_user > "%s" order by from_user'
           % get_var('previous_from_user', ''))]

for user in users:
    search([user], num_pages = 2)
    save_var('previous_from_user', user)
    sleep(2)

By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.
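
One way to do that counting is an ordinary SQL query against the swdata table that the search function writes to; a sketch:

from scraperwiki.sqlite import select

# Number of tweets per user, most prolific first.
counts = select('from_user, count(*) as tweets from swdata group by from_user order by tweets desc')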

Library

Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Middle Names in the United States over Time (15 June 2012)
https://blog.scraperwiki.com/2012/06/middle-names-in-the-united-states-over-time/

I was wondering what proportion of people have middle names, so I asked the Census.

Recently you requested personal assistance from our on-line support
center. Below is a summary of your request and our response.

We will assume your issue has been resolved if we do not hear from you
within 48 hours.

Thank you for allowing us to be of service to you.

To access your question from our support site, click the following
link or paste it into your web browser.
https://ask.census.gov/app/account/questions/detail/i_id/186591


Subject
---------------------------------------------------------------
What proportion of people have middle initials?


Discussion Thread
---------------------------------------------------------------
Response Via Email(CLMSO - EMM) - 03/14/2011 16:04
Thank you for using the US Census Bureau's Question and Answer Center.
Unfortunately, the subject you asked about is not one for which the Census
Bureau collects data. We are sorry we were not able to assist you.


Question Reference #110314-000041
---------------------------------------------------------------
 Category Level 1: People
 Category Level 2: Miscellaneous
     Date Created: 03/14/2011 15:29
     Last Updated: 03/14/2011 16:04
	   Status: Pending Closure
	       Cc:

Since they didn’t know, I looked at students at a university. Cornell University email addresses can contain two or three letters, depending on whether the Cornellian has a middle name. I retrieved all of the email addresses of then-current Cornell University students from the Cornell Electronic Directory and came up with this plot.

[Plot: middle name prevalence by school among Cornell University students. 15,824 students had middle names and 6,649 did not; the proportion varies substantially by school, with a particularly low rate in the graduate school and a particularly high rate in the agriculture school.]

Middle name prevalence among Cornell University students

Based on discussions with some of the students in that census, I suspected that students underreport rather than overreport middle names and that the under-reporting is generally an accident.

A year later, I finally got around to testing that. I looked at the names of 85,822,194 dead Americans and came up with some more plots.

[Plot: middle name prevalence as a function of time by state, showing a relatively sharp increase from 10% to 80% between 1880 and 1930, a plateau until 1960, then a smaller jump to 95% by 1975.]

Middle name prevalence as a function of time and state

The rate of middle names these days is about 90%, which is a lot more than the Cornell University student figures; this supports my suspicion that people under-report middle names rather than overreport them.

I was somewhat surprised that reported middle name prevalence varied so much over time but relatively little by state. I suspect that most of the increase over time is explained by improvement of office procedures, but I wonder what explains the slower increases around 1955 and 1990.


The death file provides a lot more data than I’ve shown you here, so check with me in a couple months to see what else I come up with.

Local ScraperWiki Library (7 June 2012)
https://blog.scraperwiki.com/2012/06/local-scraperwiki-library/

It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use

pip install scraperwiki_local 

[Image: a dump truck dumping its payload]
You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck

Differences

DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values

What happens if you do this?

import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw']
scraperwiki.sqlite.save([], {'shopping_list': shopping_list})

On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:

[u'carrots', u'orange juice', u'chainsaw'] 

In the local version, it is encoded to JSON, so it looks like this:

["carrots","orange juice","chainsaw"] 

If it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names

SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]: scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
In [3]: scraperwiki.sqlite.save([], {'sHOpPiNg_liST': shopping_list})
In [4]: scraperwiki.sqlite.select('* from swdata')
Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data = scraperwiki.sqlite.select('* from swdata')
In [6]: [row['shopping_list'] for row in data]
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: data[0]['Shopping_List']
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, scraperwiki.sqlite.select returns a list of ordinary dictionaries. In the local version, scraperwiki.sqlite.select returns a list of special dictionaries that have case-insensitive keys.

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted one. For plenty of other things, there will be differences, but the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!
