Technology Radar Report

ScraperWiki blog, Fri, 26 Jun 2015
https://blog.scraperwiki.com/2015/06/technology-radar-report/

Creating a sustainable technology company involves keeping up with technology. The thing about technology is that it changes, and we have to look to the future, and invest our time now in things that will be valuable in the future. Or, we could switch to doing SharePoint consultancy for the rest of our lives, but I think most of us here would regard that as “checking out”.

This is a partly personal perspective of the future as I see it from our little hill in the Northwest of England (Brownlow Hill). Iʼm really just sketching out a few things that I see as being important for ScraperWiki. And since theyʼre important for ScraperWiki, theyʼre important for you! Or at least, you might be interested too.

The future is already here – itʼs just not evenly distributed. — William Gibson

Gibsonʼs quote certainly applies to the software industry. All of the things I highlight already exist and are in use (some for quite a long time now); they just haven’t reached saturation yet. So looking to the near future is a matter of looking to the now, and making an educated guess as to which technologies will become increasingly abundant.

Python 3

(I have been saying this for 6 years now, but) Python 3 is a real thing and in five yearsʼ time we will have stopped using Python 2 and all switched to Python 3. If you think this all seems obvious, I don’t think we can say the same about the transition from Perl 5 to Perl 6 (which lives in a perpetual state of being “out by Christmas”) or from LaTeX 2e to LaTeX 3.

Encouragingly, in 2015 real people are using it for real projects (including ScraperWiki!). I would now consider it foolish to start a greenfield Python project in Python 2. If you maintain a Python library, it is starting to look negligent if it doesn’t work with Python 3.

Your Python 2 programming skills will mostly transfer to Python 3. There will be some teething trouble: print() and urllib still fox me sometimes, and I find myself using list() a lot more when debugging (because more things return lazy iterators or views rather than lists). Niggly details aside, basically everything works and most things are a bit better.
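To make those niggles concrete, here is a small sketch of the three I mentioned. It is Python 3 only, and the URL and dictionary contents are made up for illustration.

```python
# Python 3 only. In Python 2 these helpers lived in urllib and urlparse.
from urllib.parse import urlencode, urlparse

# print is a function, so it accepts keyword arguments such as sep.
print("a", "b", sep=", ")

# map() returns a lazy iterator; printing it shows an opaque object,
# so wrap it in list() to see the values while debugging.
squares = map(lambda n: n * n, range(4))
print(squares)
square_list = list(squares)
print(square_list)  # [0, 1, 4, 9]

# dict.keys() is a view rather than a list; list() makes it concrete.
ports = {"http": 80, "https": 443}  # made-up example data
key_list = list(ports.keys())
print(key_list)

# Building and picking apart a URL with the relocated helpers.
query = urlencode({"q": "python 3"})
parsed_query = urlparse("https://example.com/search?" + query).query
print(parsed_query)  # q=python+3
```

Note that a map object can only be consumed once, which is why the sketch stores the result of list() rather than calling it twice.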

The Go Programming Language

Globally I think the success of Go (the programming language) still remains uncertain, but its ecosystem is now large enough to sustain it in its own right. The risks here are not particularly technical but in the community. I think we would have difficulty hiring a Go programmer (we would have to find a programmer and train them).

The challenge for the next year or so is to work out what existing skills people have that transfer to Go, and, related to that, what a good framework of precursor skills for learning Go looks like. Personally speaking, when learning Go my C skills help me a lot, as does the fact that I already know what a coroutine is. I would say that knowledge of Java interfaces will help.

I don’t think there’s a good path to learning Go yet; it will be interesting to see what develops. For the “Go curious”, A Tour of Go is worth a look.

Docker / containers

Docker is healthy, and while it might not win the “container wars”, containers are clearly going to be technically useful for the next few years (flashback to OS VM). Effort in learning Docker is likely to also be useful in other “API over container” solutions.

Services

Increasingly software is accessed not via a library but via a service available on the web (Software as a Service, SaaS). For example, ScraperWiki has a service to convert PDFs to tables.

ScraperWiki already use a few of these (for email delivery, database storage, accounting, payments, uptime alerts, notifications), and we’ll almost certainly be using more in the future. The obvious difference compared to using a library or building it yourself is that Software as a Service has a direct monetary cost. But that doesn’t necessarily make it more expensive. Consider e-mail delivery. ScraperWiki definitely has the technical expertise to manage our own mail delivery. But as a startup, we don’t have the time to maintain mail servers or the desire to keep our mail server skills up to date. We’d rather buy that expertise in the form of the service that SendGrid offers.

The future is much like the present. We will continue to make buy/build decisions, and increasingly the “buy” side will be a SaaS. The challenges will be in evaluating the offerings. Do they have a nice icon?

Amazon Web Services (AWS)

The mother of all SaaS.

It’s not going away and it’s getting increasingly complex. Amazon release new products every few weeks or so, and the web console becomes increasingly bewildering. I think @frabcus’s observation that “operating the AWS console” is a skill is spot on. I think there is an analogy (suggested by @IanHopkinson_) with the typing pool to desktop word processor transition: a low-paid workforce skilled in typing got replaced by giving PCs with word processors to high-paid executives with no typing skills. We no longer need IT technicians to build racks and wire them together, but instead relatively well paid devops staff do it virtually.

CloudFormation. It’s a giant “JSON language” that describes how to create and wire together any piece of AWS infrastructure.

sigh

Probably the thing to look at though. Even if we don’t use it directly (for example, we might use some replacement for Elastic Beanstalk or generate CloudFormation files with scripts), knowing how to read it will be useful.
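As a rough idea of what “generating the files with scripts” might look like, here is a minimal sketch that builds a template as a Python dict and dumps it to JSON. The function name make_template and the logical name ExampleDataBucket are hypothetical, and a real template would carry far more properties than this single bare S3 bucket.

```python
import json


def make_template(bucket_logical_name):
    """Return a minimal template as a plain dict (illustrative only)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template generated by a script",
        "Resources": {
            # One resource: an S3 bucket with default settings.
            bucket_logical_name: {"Type": "AWS::S3::Bucket"},
        },
        "Outputs": {
            # Ref on a bucket resource resolves to the bucket's name.
            "BucketName": {"Value": {"Ref": bucket_logical_name}},
        },
    }


# ExampleDataBucket is a made-up logical name for this sketch.
template = make_template("ExampleDataBucket")
template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```

Even this tiny example shows the shape worth learning to read: named resources, each with a Type, wired together by references such as Ref.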

Big instances versus MapReduce

Whilst I think MapReduce will remain an important technology for the sector as a whole, this will be in opposition to the “single big instance”. Don’t get too hung up on terminology; I’m really using MapReduce as a placeholder for all MapReduce and Hadoop-like “big data query” technologies.

Amazon Web Services makes it possible to rent “High Performance Computing” class nodes, for reasonable amounts of money. In 2015, you can get a 16 core (32 hyperthreads) instance with 60 or 244 Gigabytes of RAM for a couple of bucks per hour. I think the gap between laptops and big instances is widening, meaning that more ad hoc analysis will be done on a transient instance. You can process some pretty big datasets with 244 GB of RAM without needing to go all Hadoopy.

That is not to say that we should ignore MapReduce, but the challenge may be to find datasets of interest that actually require it.
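For anyone hazy on what the MapReduce shape actually is, here is a toy sketch of the same word count done two ways: once as explicit map, shuffle and reduce steps, and once as a plain in-memory count of the kind a single big instance allows. The sample lines and function names are made up, and everything here runs in one process; a real framework would spread the map and reduce steps across machines.

```python
from collections import Counter, defaultdict
from itertools import chain

# Made-up sample data standing in for a large dataset.
lines = ["big data big instance", "big instance small data"]


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


mapped = chain.from_iterable(mapper(line) for line in lines)
mapreduce_counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())

# The single-instance version: just count everything in memory.
single_instance_counts = Counter(word for line in lines for word in line.split())

print(mapreduce_counts)
print(dict(single_instance_counts))
```

Both approaches give the same totals; the second is shorter and only stops being an option when the data no longer fits in RAM.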

Crypto

Snowden’s revelations tell us that the NSA, and other state-level actors, are basically everywhere. In particular, there are hostile actors in the data centre. We should consider node to node communications as going across the public internet, even if they are in the same data centre. Practically speaking, this means HTTPS / TLS everywhere.

If we provide a data service to our clients using AWS then ideally only the client, us, and AWS should have access to that data. It is unfortunate that AWS have to have access to the data, but it is a practical necessity. Having trusted AWS, we can’t stop them shipping all of our data to the NSA (or even know if they do), so it is a matter of their reputation that they not do that. At least if we encrypt our network traffic, AWS have to take fairly aggressive steps to send our data to anyone else (they have to fish our session keys out of their RAM, or mass transfer the contents of their RAM somewhere).
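In practice, “TLS everywhere” for node to node traffic can start with something as small as the sketch below, which uses Python’s standard ssl module. The helper names are hypothetical, and the TLS 1.2 floor is my assumption rather than anything ScraperWiki has standardised on; it also needs Python 3.7 or later for minimum_version.

```python
import socket
import ssl


def make_client_context():
    """Build a client-side TLS context that verifies the peer."""
    # create_default_context() turns on certificate verification and
    # hostname checking using the system's trusted CA certificates.
    context = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def open_tls_connection(host, port, context):
    """Open a verified TLS connection to another node (not called here)."""
    raw_socket = socket.create_connection((host, port))
    return context.wrap_socket(raw_socket, server_hostname=host)


context = make_client_context()
print(context.verify_mode, context.check_hostname)
```

The point of treating the data centre like the public internet is that the second function is used even when the other node is in the next rack.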

There is lots more to do and discuss here. Fortunately ScraperWiki is pretty healthy in this regard: we are sensitive to it and we’re always discussing security.

Browser IDE

Here I’m talking about the “behind the scenes” world that is accessed from the Developer Tools. There is an awesome box of tools there. Programmers are probably all aware of the JavaScript Console and the Web Inspector, but these are the tip of a very large and featureful iceberg. Almost everything is dynamic: adding and disabling CSS rules updates the page live, as does editing the HTML. There is a fully featured single-step debugger that includes a code editor. Only the other day I learnt of the “emulate mobile device” mode for screen size and network.

Spend time poking about with the Developer Tools.

Machine Learning

Although it’s not an area that I know much about, I suspect that it’s not just a buzzword and it may turn out to be useful.

git / Version Control

git is great and there is a lot to learn, but don’t forget its broader historical context. Believe it or not, git is not the first version control tool to come along, and github.com is not the first Software Configuration Management company. Just because git does it one particular way doesn’t mean that that way is best. It means that it is merely good enough for one person to manage the flow of patches that go to make up the Linux kernel. I would also remind everyone that git != github. Practically, be aware of which bits of your workflow are git, and which are github.

(I’m bound to say something like that, Software Configuration Management used to be part of my consultancy expertise)

Google have declared this race won. They’ve shut down their own online code management product and have started hosting projects on github.

A plausible future is where everyone uses git and most people are blind to there being anything better and most people think that git == github. Whingeing aside, that future is a much better place to work in than if sourceforge had won.

The Future Technology Radar Report

Who knows what will be on the radar in the future.
