excel – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone
https://blog.scraperwiki.com/2015/09/henry-morris-entrepreneur-for-social-mobility-on-getting-contacts-from-pdf-into-his-iphone/
Wed, 30 Sep 2015 14:11:16 +0000

Henry Morris

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach. They’re amazing businesses that focus on social mobility.

We interviewed him for PDFTables.com

He’s been using it to convert delegate lists that arrive as PDFs into Excel, and from there into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system: a simple and effective way of keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

 

Data Business Models
https://blog.scraperwiki.com/2013/02/data-business-models/
Wed, 27 Feb 2013 09:46:39 +0000

If it sometimes feels like the data business is full of buzzwords and hipster technical jargon, then that’s probably because it is. But don’t panic! I’ve been at loads of hip and non-hip data talks here and there and, buzzwords aside, I’ve come across four actual categories of data business model in this hip data ecosystem. Here they are:

  1. Big storage for big people
  2. Money in, insight out: Vertically integrated data analysis
  3. Internal data analysis on an organization’s own data
  4. Quantitative finance

1) Big storage for big people

This is mostly Hadoop. For example,

  • Teradata
  • Hortonworks
  • MapR
  • Cloudera

Some people are using NoHadoop. (I just invented this word.)

  • Datastax (Cassandra)
  • Couchbase (Couch but not the original Couch)
  • 10gen (Mongo)

Either way, these companies sell consulting, training, hosting, proprietary special features &c. to big businesses with shit tons of data.

2) Money in, insight out: Vertically integrated data analysis

Several companies package data collection, analysis and presentation into one integrated service. I think this is pretty close to “research”. One example is AIMIA, which manages the Nectar card scheme; as a small part of this, they analyze the data that they collect and present ideas to clients. Many producers of hip data tools also provide hip data consulting, so they too fall into this category.

Data hubs

Some companies produce suites of tools that approach this vertical integration; when you use these tools, you still have to look at the data yourself, but it is made much easier. This approaches the ‘data hubs’ that Francis likes talking about.

Lots of advertising, web and social media analytics tools fall into this category. You just configure your accounts, let data accumulate, and look at the flashy dashboard. You still have to put some thought into it, but the collection, analysis and presentation are all streamlined and integrated and thus easier for people who wouldn’t otherwise do this themselves.

Tools like Tableau, ScraperWiki and RStudio (combined with its tangential R services) also fall into this category. You still have to do your analysis, but they let you do all of it in one place, and connections between that place, your data sources and your presentation media are easy. Well, that’s the idea at least.

3) Internal data analysis

Places with lots of data have internal people do something with them. Any company that’s making money must have something like this. The mainstream companies might call these people “business analysts”, and they might do all their work in Excel. The hip companies are doing “data science” with open source software before it gets cool. And the New York City government has a team that just analyzes New York data to make the various government services more efficient. For the current discussion, I see these as similar sorts of people.

I pondered distinguishing analysis that affects businessy decisions from models that get written into software. Since I’m just categorising business models, and both of these could be produced by the same guy working inside a company with lots of data, I chose not to distinguish between them.

4) Quantitative finance

Quantitative finance is special in that the data analysis is very close to a product in itself. The conclusion of an analysis or algorithm is “Make these trades when that happens” rather than “If you market to these people, you might sell more products.”

This has some interesting implications. For one thing, you could have a whole company doing quantitative finance. On a similar note, I suspect that the analyses can be more complicated because they might only need to be conveyed to people with quantitative literacy; in the other categories, it might be more important to convey insights to non-technical managers.

The end

Pretend that I made some insightful, conclusionary conclusion in this sentence. And then get back to your number crunching.

Scraping guides: Excel spreadsheets
https://blog.scraperwiki.com/2011/09/scraping-guides-excel-spreadsheets/
Wed, 14 Sep 2011 15:55:29 +0000

Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to them from the documentation page.

The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page.

As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?

The format of Excel files can vary a lot – how columns are arranged, where tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not formatted in cells – entire rows in one cell, or data stored in notes.
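To make that concrete, here is a minimal sketch in Python of the kind of clean-up code we mean. It is not taken from the scraping guide. It assumes each worksheet has already been read into a list of rows (lists of cell values) by whatever spreadsheet library you use, and the column names, the sample data and the ";" separator for rows crammed into one cell are all made up for illustration.

```python
def find_header(rows, wanted):
    """Return (index, lower-cased names) of the first row naming every wanted column."""
    for index, row in enumerate(rows):
        names = [str(cell).strip().lower() for cell in row]
        if all(name in names for name in wanted):
            return index, names
    raise ValueError("no header row found")


def normalise_sheet(rows, wanted=("site", "area_ha")):
    """Map one worksheet's rows onto a fixed set of columns, as strings."""
    header_index, names = find_header(rows, wanted)
    positions = [names.index(name) for name in wanted]
    records = []
    for row in rows[header_index + 1:]:
        cells = [str(cell).strip() for cell in row]
        filled = [cell for cell in cells if cell]
        if not filled:
            continue  # skip blank rows
        if len(filled) == 1 and ";" in filled[0]:
            # Assumption: a whole row typed into one cell is ";"-separated
            # and follows the same column order as this sheet's header.
            cells = [part.strip() for part in filled[0].split(";")]
        if len(cells) <= max(positions):
            continue  # too short to hold the columns we want
        records.append({name: cells[pos] for name, pos in zip(wanted, positions)})
    return records


# Hypothetical worksheets: a title row above the table, a blank row,
# one row crammed into a single cell, and a different column order.
sheet_a = [
    ["Brownfield sites 2010", "", ""],
    ["Site", "Area_ha", "Owner"],
    ["Mill Lane", 1.5, "Council"],
    ["", "", ""],
    ["Dock Road; 0.8; Private", "", ""],
]
sheet_b = [
    ["Owner", "Site", "Area_ha"],
    ["Private", "Canal St", 2.0],
]

combined = normalise_sheet(sheet_a) + normalise_sheet(sheet_b)
```

Looking the header row up by name, rather than hard-coding column positions, is what lets the same function cope with sheets that arrange their columns differently.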

We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.

Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.

This merchant oil shipping scraper does a few regular expressions to parse some text in one of the columns.
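As a rough illustration of that last idea, the sketch below uses Python’s standard `re` module to pull structured fields out of a free-text cell. The text format, the field names and the `parse_cargo` helper are invented for this example; they are not the format or code of the scraper mentioned above.

```python
import re

# Hypothetical cell format: "<quantity> <unit> of <product> from <origin>"
CARGO_RE = re.compile(
    r"(?P<quantity>\d[\d,]*)\s*"
    r"(?P<unit>tonnes|barrels)\s+of\s+"
    r"(?P<product>[a-z ]+?)\s+from\s+"
    r"(?P<origin>[A-Z][A-Za-z ]+)$"
)


def parse_cargo(text):
    """Return a dict of fields parsed from one cell, or None if it doesn't match."""
    match = CARGO_RE.search(text.strip())
    if match is None:
        return None
    return {
        # Drop thousands separators before converting to an integer.
        "quantity": int(match.group("quantity").replace(",", "")),
        "unit": match.group("unit"),
        "product": match.group("product"),
        "origin": match.group("origin"),
    }


parsed = parse_cargo("45,000 tonnes of crude oil from Ras Tanura")
unparsed = parse_cargo("n/a")
```

Returning `None` for cells that don’t match makes it easy to log the odd ones and inspect them by hand instead of silently saving bad data.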

Next time – parsing HTML with CSS selectors.

Ruby screen scraping tutorials
https://blog.scraperwiki.com/2011/01/ruby-screen-scraping-tutorials/
Fri, 28 Jan 2011 16:44:24 +0000

Mark Chapman has been busy translating our Python web scraping tutorials into Ruby.

They now cover three tutorials on how to write basic screen scrapers, plus extra ones on using .ASPX pages, Excel files and CSV files.

We’ve also installed some extra Ruby modules – spreadsheet and FasterCSV – to make them possible.

These Ruby scraping tutorials are made using ScraperWiki, so you can of course do them from your browser without installing anything.

Thanks Mark!
