firebox – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

New backend now fully rolled out
https://blog.scraperwiki.com/2011/10/new-backend-now-fully-rolled-out/
Thu, 06 Oct 2011 14:46:56 +0000

The new faster, safer sandbox that powers ScraperWiki is now fully rolled out to all users.

You should find running and developing scrapers and views faster than before, and that you’re using much more recent versions of Ruby, Python and associated libraries.

Thank you to everyone (and there were lots of you) who helped us beta test it!

Now, Ross and Julian are fighting for the right to delete all the old code we don’t need any more…

A faster, safer sandbox to play in
https://blog.scraperwiki.com/2011/09/a-faster-safer-sandbox-to-play-in/
Mon, 12 Sep 2011 09:45:26 +0000

When programmers first hear about ScraperWiki, their initial reaction is often: “What! You let anyone edit general-purpose code and run it on your servers?”

The answer is that, yes, we do, but in an isolated environment. Your own “sandbox” if you like, where you can safely build castles without knocking others over. Or, as The Julian calls it, a “firebox” where you can burn logs without burning down the whole house.

We’re rolling out an upgrade to that environment, changing its core technology. We used to use UML (User Mode Linux), and we’re now changing our sandbox to use LXC (Linux Containers).

It’s just been deployed, but enabled only for beta test users. Changes are:

  • Safe: The scripts now run in better isolation from each other. This means we can offer private scrapers securely, making sure they cannot read each other’s data and code.
  • Fast: Both in the editor and when scheduled, scrapers and views run a lot quicker. The old system used a particularly slow method to identify scrapers, making it pause for half a second each time a page was scraped or a write was made to the datastore (for Unix geeks: it spawned “lsof” each time). This now takes a fraction of the time, as it just looks at a bridge network IP address.
  • Robust: We don’t have any long-running virtual machines any more; LXC is light enough that it effectively “boots up” each time a script is run. After we’ve fixed any bugs in the daemon that manages all this, it should be fundamentally more reliable.
  • Updated languages: With the migration, we’re also moving from Python 2.6.2 to Python 2.7.1, and from Ruby 1.8.7 to Ruby 1.9.2. The Ruby move is particularly significant: it should be faster and make scraping Unicode easier.
  • Updated libraries: We’ve updated all the third-party libraries in the sandbox to their most recent versions.
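To give a flavour of the “Fast” point above, here is a minimal sketch of identifying a scraper by its container’s bridge IP address rather than by spawning a process per request. All names here (`register`, `scraper_for`, the `10.0.3.x` addresses) are hypothetical and purely illustrative; this is not the actual daemon code, just the idea of replacing a slow per-request lookup with a constant-time hash lookup keyed on the peer’s IP.

```ruby
# Illustrative sketch: each container gets a known IP on the bridge network
# when it starts, so the daemon can map the peer address of an incoming
# datastore connection straight to the scraper that owns it.
containers = {}  # bridge IP => scraper name

# Record a container's bridge IP when its script starts.
def register(containers, ip, scraper)
  containers[ip] = scraper
end

# Constant-time lookup by peer IP, instead of spawning "lsof" per request.
def scraper_for(containers, peer_ip)
  containers.fetch(peer_ip, nil)
end

register(containers, "10.0.3.17", "my_scraper")
scraper_for(containers, "10.0.3.17")  # => "my_scraper"
```

In a real daemon the peer IP would come from the accepted socket (e.g. `Socket#peeraddr`); the point is only that the lookup becomes a hash read rather than a subprocess.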

What next? We’ll spend about a week with beta testers, testing the new containers for bugs, compatibility and performance. If you’d like to help test, please do get in touch. We can enable it so that all scrapers and views you own will run in the new LXC environment.

After that, we will start rolling it out whether you like it or not! This will break some scrapers. Specifically, there are some minor syntax changes in Ruby 1.9, and some of the library upgrades might cause problems. We’ll be eliminating as many of these as possible in the test phase, and will make another announcement before we start rolling it out for everyone. But it is possible that you will have to fix up some of your scrapers. Let us know if you need help fixing them and we’ll do our best to get one of our developers to help you out.
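A few typical examples of the kind of 1.8 → 1.9 changes that can break a scraper (illustrative, not an exhaustive list of what might need fixing):

```ruby
# In Ruby 1.8, "abc"[0] returned the byte value 97; in 1.9 it returns
# the one-character string "a". Use String#bytes if you want the byte.
char = "abc"[0]           # "a" under 1.9 (was 97 under 1.8)
byte = "abc".bytes.first  # 97 under both 1.8.7 and 1.9

# String#each was removed in 1.9; iterate lines with each_line instead.
lines = []
"a\nb\n".each_line { |l| lines << l.chomp }

# Also note: under 1.9, source files containing non-ASCII string literals
# need a magic comment at the top, e.g.  # encoding: utf-8
```

If a scraper indexes into strings or iterates over them line by line, these are the first places to look when it breaks after the upgrade.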

Bear in mind that, after that, everything will be faster 🙂
