Saturday 23 August 2008

Amazon Elastic Block Store

This has been presented as a small, incremental step forward for Amazon's cloud-based computing. It's not. It's a big deal.

First, some grossly-simplified terminology. In the beginning there was Amazon Elastic Compute Cloud (EC2), which is your server(s). There was also Amazon Simple Storage Service (S3) which you can think of as being your network-attached file system. A little later they also brought out (though it's still in beta a.t.m.) Amazon SimpleDB which is your database and Amazon Simple Queue Service which is your messaging infrastructure. So what's the problem? Why do we need EBS?

Well I said "servers" above, but it's more accurate to think of EC2 instances as being like a CPU + RAM, where the RAM has enough space for a virtual disk. Your server does have some disk space attached ("instance storage" in the lingo) to which it can write files, but remember that the whole point of EC2 is that servers can get started automatically in response to demand, and that when demand falls, they can get automatically switched off. At that point, anything they've written to their local disk just disappears. I suppose it's the same as an object going out of scope and getting garbage-collected.

So if your web application, running on your EC2 instance, has done some work and updated some records, then it needs to salt them away somewhere outside the server-instance itself, and that's where S3 and SimpleDB come in. And of course, once you realise that your EC2 instance is probably going to need to save some data, it's but a step to see that it's also probably going to have to initialise itself, once it's up and running, from a data store too. So data passes between EC2 and S3 and SimpleDB in both directions.

So again, where's the problem? Well, I said that S3 was like a file system. It is, it's like some old, clunky PC-DOS 1.x floppy-disk-only file system, before you had directories and certainly before you had subdirectories. Each user account gets I think it's 100 bags (they're the floppies) and each of those bags can hold vast amounts of objects, but all at the top level, and all in the same namespace. If you want to impose any kind of structure on top of that, then you must do it by appropriately decorating the names (keys) that you use. It's as though your whole hard disk could only have a single folder and you had to keep everything organised by using tremendously long filenames.

Again, I said SimpleDB was the database, but it's not an SQL database. In fact it doesn't even have tables. It's more like a set of objects and a set of attributes, and any object can have (or not have) any of those attributes, and you can search on the attributes. So imaging that you could add some kind of metadata tags to S3 files and search on the tags, and that's SimpleDB.

I'm not criticising them, the constraints of replication, timeliness, availability etc. etc. very likely imposed all these limitations. And in an odd kind of way, the objects you get in S3 and SimpleDB remind me more of the sort of constructs you get in programming languages like Java or Javascript, so you can see that as building blocks they might be very powerful and useful.

But it's probably true to say that the software that would run well in this environment would have to be written or adapted specially for it. A lot has been written too, in the couple of years that EC2 and S3 have been available. But there's a whole class of more traditional applications that haven't, and which won't work in this environment, and they represent 99.9999% of all the software that's out there.

So now Amazon introduce EBS, and you can think of that as an infinite array of network-attached hard disks (but this time they are proper ones, with subdirectories and everything :) ). Previously, if I wanted to store my web application's data in say a PostgreSQL database, to get the benefit of SQL based access, I had to put that database on the EC2 instance's hard disk (instance storage) and had to make sure that the data in that database got pickled to S3 before the instance was shut down. Now I can just put the database on the EBS disk and forget about it.

Of course, for the very largest, most fault-tolerant and distributed of applications, EC2 + S3 + SimpleDB + SQS was probably architecturally the right way to go anyway; the sheer size of such applications might make exactly the same constraints necessary. The difference is in the area where I live: the web site that can fit comfortably on a single server, or even a single virtual server, with an SQL database on the same machine or on a different machine on the same network. That's to say, the space where the smallest to the medium sized database driven websites live.

These are well served at the moment by the numerous small hosting providers who'll give you a virtual server or two, or your own box. But if your web site does well, then the upgrade path can be difficult and expensive. And if your web site very suddenly does well then you likely won't be able to resource-up in time and it may fall over. And the pricing models are very coarse-grained in comparison to Amazon's.

For those web sites, Amazon EC2 wasn't an option because they would have had to rewrite their data layer and still end up with something that wasn't as flexible as SQL. But with EBS, those kinds of sites can be ported directly.

1 comment: