Posts Tagged ‘s3’

Torrent download Cloud appliance

// September 12th, 2011 // 13 Comments » // Uncategorized

The Why



A friend of mine, who's also a Linux systems geek, was tasked with building a library of Linux distro ISOs. This involved downloading tens of ISOs, many of which were only offered as torrents. Even if that weren't the case, it would still be good practice to download such large binaries over BitTorrent to avoid loading any single mirror too much. Anyway, we were chatting about it, and since bandwidth (especially upload) is a scarce resource where I live, he was considering paying some service to download the torrents he needed and convert them to HTTP!

I mentioned I could build something to do just that in about an hour! It wouldn't even be complex. Armed with Ensemble, I can very simply launch an EC2 instance, install rtorrent (my favorite CLI torrent client) and rtgui (a web UI for rtorrent), and have it ready to crunch through any of your torrenting needs. We both became interested in seeing how well that would work, and so here we go...

The How


I'll assume you already know how to get started with Ensemble. Let's see what it takes to deploy my torrent appliance:

$ bzr branch lp:~kim0/+junk/rtgui
$ ensemble bootstrap
# Wait for ec2 to catch up (2~5 mins)
$ ensemble deploy --repository . rtgui
$ ensemble expose rtgui

That is basically all you need to "use" this appliance! Give it another few minutes for the rtgui appliance to boot, install and configure itself. You can check the status with:

$ ensemble status 
2011-09-12 13:23:53,868 INFO Connecting to environment.
machines:
  0: {dns-name: ec2-50-19-19-234.compute-1.amazonaws.com, instance-id: i-2c37ee4c}
  1: {dns-name: ec2-107-20-96-125.compute-1.amazonaws.com, instance-id: i-1c28f17c}
services:
  rtgui:
    exposed: true
    formula: local:rtgui-9
    relations: {}
    units:
      rtgui/0:
        machine: 1
        open-ports: [80/tcp, 55556/tcp, 55557/tcp, 55558/tcp, 55559/tcp, 55560/tcp,
          6881/udp]
        relations: {}
        state: started
2011-09-12 13:24:01,334 INFO 'status' command finished successfully

The important bit to watch for is "state: started"; if it's anything else, the EC2 instance is still being configured. Note the ports that have been opened: 55556-55560 because rtorrent is configured to use them, port 80 for the web UI, and UDP port 6881 for the DHT network. I am in no way a torrent expert, so this could be completely unoptimized, but hey, it seems to work.

Ready to test? Machine 1 runs rtgui, so go ahead and visit it in a browser. For me that's http://ec2-107-20-96-125.compute-1.amazonaws.com/rtgui (replace that DNS name with the right one for your instance, and don't forget the trailing /rtgui like I always do). Click "Add torrent" and pass it the URL of a torrent file; I'll be testing with the Ubuntu 11.10 beta1 amd64 torrent. Once the torrent is added, click the green play button to start it. Since EC2 instances have quite a bit of bandwidth available to them, this Ubuntu torrent downloaded in just a few seconds. I ship a default rtorrent configuration that limits upload speed to 100KB/s (since you're paying for bandwidth), but you can change that from the web UI. Here's how the whole thing looks:


Once a torrent has finished downloading, you can fetch the files over HTTP from http://ec2-107-20-96-125.compute-1.amazonaws.com/complete (again, replace the machine DNS name with the correct one in your case).

A single torrent appliance is of course not limited to a single torrent! You can keep adding as many as you want; eventually, though, you're going to hit some limit (disk I/O, network I/O, disk space, etc.). At that point (probably only if you're downloading a really large number of torrents) you may need to "scale up" this torrent download appliance (well, it's a cloud for God's sake!). If that's what you wish for, you only need to:
$ ensemble add-unit rtgui

Simple as that, as with everything Ensemble! So now you know how to download your 11.10 copy without loading Ubuntu's servers; actually, you'd be helping them and the millions of Ubuntu users if you use this method on release day. Once you're done playing with the appliance, you need to destroy it (to stop paying Amazon for the machines):

$ ensemble destroy-environment
WARNING: this command will destroy the 'sample' environment (type: ec2).
This includes all machines, services, data, and other resources. Continue [y/N]y
2011-09-12 13:53:33,018 INFO Destroying environment 'sample' (type: ec2)...
2011-09-12 13:53:36,641 INFO Waiting on 2 EC2 instances to transition to terminated state, this may take a while
2011-09-12 13:54:18,617 INFO 'destroy_environment' command finished successfully

Want to improve it?


Things I wish I had time to improve:
  • Once a file is downloaded, upload it to S3. You could then terminate the appliance and still download the files at your own pace (see the sketch after this list)
  • Parameterize rtorrent rc configuration file, such that you can pass it parameters from Ensemble (such as upload rate...etc)
  • Integrate notification upon download completion (SMS me, email me, IM me ...etc)
  • Add an auto-redirect to /rtgui :)
  • Figure out a way to download completed files from within the rtgui web UI
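
For the S3 idea above, here's a minimal sketch of what an upload script could look like. Everything in it is an assumption on my part rather than part of the formula: it uses the boto Python library, a made-up bucket name, a placeholder path for rtorrent's completed-downloads directory, and AWS credentials taken from the usual environment variables.

#!/usr/bin/env python
# Hypothetical sketch: push everything rtorrent has finished downloading to S3.
# Assumes boto is installed and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set.
import os
import boto
from boto.s3.key import Key

COMPLETE_DIR = "/var/lib/rtorrent/complete"   # placeholder download directory
BUCKET_NAME = "my-torrent-dropbox"            # placeholder bucket name

conn = boto.connect_s3()
bucket = conn.lookup(BUCKET_NAME) or conn.create_bucket(BUCKET_NAME)

for dirpath, _dirnames, filenames in os.walk(COMPLETE_DIR):
    for filename in filenames:
        path = os.path.join(dirpath, filename)
        key = Key(bucket, os.path.relpath(path, COMPLETE_DIR))
        if not key.exists():                  # skip files that were already uploaded
            key.set_contents_from_filename(path)
            print("uploaded %s" % key.name)

rtorrent can also be configured to run an external command when a download finishes (its event.download.finished hook), which would be a natural trigger for a script like this.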

If you're interested in improving that appliance, drop by in #ubuntu-ensemble on IRC/Freenode and ping me (kim0) or any of the friendly folks around.

What kind of skills do you need to hack on this project? Just bash shell scripting foo! The feature I love most about Ensemble as a cloud orchestration tool is that it doesn't twist you into using some abstracted syntax. You get to write in whatever language you feel like using; for me that's bash. You can find the script that does all of the above right here.

Interested in learning more about Ensemble and automating Ubuntu server deployments in the cloud or on physical servers?
Want to hack on this torrent appliance, or do something similar?
Have comments or a better idea?

Let me know about it, just drop me a comment right here! You can also grab me (kim0) on Freenode IRC.

TKLBAM: a new kind of smart backup/restore system that just works

// September 8th, 2010 // Comments Off // Uncategorized

Drum roll please...

Today, I'm proud to officially unveil TKLBAM (AKA TurnKey Linux Backup and Migration): the easiest, most powerful system-level backup anyone has ever seen. Skeptical? I would be too. But if you read all the way through you'll see I'm not exaggerating and I have the screencast to prove it. Aha!

This was the missing piece of the puzzle that has been holding up the Ubuntu Lucid based release batch. You'll soon understand why and hopefully agree it was worth the wait.

We set out to design the ideal backup system

Imagine the ideal backup system. That's what we did.

Pain free

A fully automated backup and restore system with no pain. That you wouldn't need to configure. That just magically knows what to backup and, just as importantly, what NOT to backup, to create super efficient, encrypted backups of changes to files, databases, package management state, even users and groups.

Migrate anywhere

An automated backup/restore system so powerful it would double as a migration mechanism to move or copy fully working systems anywhere in minutes instead of hours or days of error prone, frustrating manual labor.

It would be so easy that you would, shockingly enough, actually test your backups. No more excuses. You'd test as frequently as you know you should, avoiding unpleasant surprises at the worst possible time.

One turn-key tool, simple and generic enough that you could just as easily use it to migrate a system:

  • from Ubuntu Hardy to Ubuntu Lucid (get it now?)
  • from a local deployment, to a cloud server
  • from a cloud server to any VPS
  • from a virtual machine to bare metal
  • from Ubuntu to Debian
  • from 32-bit to 64-bit

System smart

Of course, you can't do that with a conventional backup. It's too dumb. You need a vertically integrated backup that has system level awareness. That knows, for example, which configuration files you changed and which you didn't touch since installation. That can leverage the package management system to get appropriate versions of system binaries from package repositories instead of wasting backup space.

This backup tool would be smart enough to protect you from all the small paper-cuts that conspire to make restoring an ad-hoc backup such a nightmare. It would transparently handle technical stuff you'd rather not think about like fixing ownership and permission issues in the restored filesystem after merging users and groups from the backed up system.

Ninja secure, dummy proof

It would be a tool you could trust to always encrypt your data. But it would still allow you to choose how much convenience you're willing to trade off for security.

If data stealing ninjas keep you up at night, you could enable strong cryptographic passphrase protection for your encryption key that includes special countermeasures against dictionary attacks. But since your backup's worst enemy is probably staring you in the mirror, it would need to allow you to create an escrow key to store in a safe place in case you ever forget your super-duper passphrase.

On the other hand, nobody wants excessive security measures forced down their throats when they don't need them and in that case, the ideal tool would be designed to optimize for convenience. Your data would still be encrypted, but the key management stuff would happen transparently.

Ultra data durability

By default, your AES encrypted backup volumes would be uploaded to inexpensive, ultra-durable cloud storage designed to provide 99.999999999% durability. To put 11 nines of durability in perspective, if you stored 10,000 backup volumes you could expect to lose a single volume once every 10 million years.
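
A quick back-of-the-envelope check of that claim, assuming losses are independent and treating the 11 nines as an annual per-object durability figure:

# Back-of-the-envelope check of the "one loss per 10 million years" claim.
annual_loss_probability = 1e-11          # 1 - 0.99999999999
volumes = 10000

expected_losses_per_year = volumes * annual_loss_probability
print("expected losses per year: %g" % expected_losses_per_year)           # ~1e-07
print("years per one expected loss: %g" % (1 / expected_losses_per_year))  # ~1e+07, i.e. 10 million years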

For maximum network performance, you would be routed automatically to the cloud storage datacenter closest to you.

Open source goodness

Naturally, the ideal backup system would be open source. You don't have to care about free software ideology to appreciate the advantages. As far as I'm concerned any code running on my servers doing something as critical as encrypted backups should be available for peer review and modification. No proprietary secret sauce. No pacts with a cloudy devil that expects you to give away your freedom, nay worse, your data, in exchange for a little bit of vendor-lock-in-flavored convenience.

Tall order huh?

All of this and more is what we set out to accomplish with TKLBAM. But this is not our wild eyed vision for a future backup system. We took our ideal and we made it work. In fact, we've been experimenting with increasingly sophisticated prototypes for a few months now, privately eating our own dog food, working out the kinks. This stuff is complex so there may be a few rough spots left, but the foundation should be stable by now.

Seeing is believing: a simple usage example

We have two installations of TurnKey Drupal6:

  1. Alpha, a virtual machine on my local laptop. I've been using it to develop the TurnKey Linux web site.
  2. Beta, an EC2 instance I just launched from the TurnKey Hub.

In the new TurnKey Linux 11.0 appliances, TKLBAM comes pre-installed. With older versions you'll need to install it first:

apt-get update
apt-get install tklbam webmin-tklbam

You'll also need to link TKLBAM to your TurnKey Hub account by providing the API-KEY. You can do that via the new Webmin module, or on the command line:

tklbam-init QPINK3GD7HHT3A

I now log into Alpha's command line as root (e.g., via the console, SSH or web shell) and do the following:

tklbam-backup

It's that simple. Unless you want to change defaults, no arguments or additional configuration required.

When the backup is done a new backup record will show up in my Hub account:

To restore I log into Beta and do this:

tklbam-restore 1

That's it! To see it in action watch the video below or better yet log into your TurnKey Hub account and try it for yourself.

Quick screencast (2 minutes)

Best viewed full-screen. Having problems with playback? Try the YouTube version.

The screencast shows TKLBAM command line usage, but users who dislike the command line can now do everything from the comfort of their web browser, thanks to the new Webmin module.

Getting started

TKLBAM's front-end interface is provided by the TurnKey Hub, an Amazon-powered cloud backup and server deployment web service currently in private beta.

If you don't have a Hub account already, request an invitation. We'll do our best to grant them as fast as we can scale capacity on a first come, first served basis. Update: currently we're doing ok in terms of capacity so we're granting invitation requests within the hour.

To get started log into your Hub account and follow the basic usage instructions. For more detail, see the documentation.

Feel free to ask any questions in the comments below. But you'll probably want to check with the FAQ first to see if they've already been answered.

Upcoming features

  • PostgreSQL support: PostgreSQL support is in development but currently only MySQL is supported. That means TKLBAM doesn't yet work on the three PostgreSQL based TurnKey appliances (PostgreSQL, LAPP, and OpenBravo).
  • Built-in integration: TKLBAM will be included by default in all future versions of TurnKey appliances. In the future when you launch a cloud server from the Hub it will be ready for action immediately. No installation or initialization necessary.
  • Webmin integration: we realize not everyone is comfortable with the command line, so we're going to look into developing a custom webmin module for TKLBAM. Update: we've added the new TKLBAM webmin module to the 11.0 RC images based on Lucid. In older images, the webmin-tklbam package can also be installed via the package manager.

Special salute to the TurnKey community

First, many thanks to the brave souls who tested TKLBAM and provided feedback even before we officially announced it. Remember, with enough eyeballs all bugs are shallow, so if you come across anything else, don't rely on someone else to report it. Speak up!

Also, as usual during a development cycle we haven't been able to spend as much time on the community forums as we'd like. Many thanks to everyone who helped keep the community alive and kicking in our relative absence.

Remember, if the TurnKey community has helped you, try to pay it forward when you can by helping others.

Finally, I'd like to give extra special thanks to three key individuals that have gone above and beyond in their contributions to the community.

By alphabetical order:

  • Adrian Moya: for developing appliances that rival some of our best work.
  • Basil Kurian: for storming through appliance development at a rate I can barely keep up with.
  • JedMeister: for continuing to lead as our most helpful and tireless community member for nearly a year and a half now. This guy is a frigging one man support army.

Also special thanks to Bob Marley, the legend who's been inspiring us as of late to keep jamming till the sun was shining. :)

Final thoughts

TKLBAM is a major milestone for TurnKey. We're very excited to finally unveil it to the world. It's actually been a not-so-secret part of our vision from the start. A chance to show how TurnKey can innovate beyond just bundling off the shelf components.

With TKLBAM out of the way we can now focus on pushing out the next release batch of Lucid based appliances. Thanks to the amazing work done by our star TKLPatch developers, we'll be able to significantly expand our library so by the next release we'll be showcasing even more of the world's best open source software. Stir It Up!

Finding the closest data center using GeoIP and indexing

// August 16th, 2010 // Comments Off // Uncategorized

We are about to release the TurnKey Linux Backup and Migration (TKLBAM) mechanism, which aims to be the simplest way, ever, to back up a TurnKey appliance across all deployments (VM, bare-metal, Amazon EC2, etc.), as well as provide the ability to restore a backup anywhere - essentially appliance migration or upgrade.

Note: We'll be posting more details really soon - In this post I just want to share an interesting issue we solved recently.

Backups need to be stored somewhere - preferably somewhere that provides unlimited, reliable, secure and inexpensive storage. After exploring the available options, we decided on Amazon S3 for TKLBAM's storage backend.
 

The problem

Amazon has four data centers, called regions, spanning the world: Northern California (us-west-1), Northern Virginia (us-east-1), Ireland (eu-west-1) and Singapore (ap-southeast-1).
 
The problem: which region should be used to store a server's backups, and how should it be determined?
 
One option was to require the user to specify the region during backup, but we quickly decided against polluting the user interface with options that can be confusing, and opted for a solution that could automatically determine the best region.
 

The solution

The map below plots the countries/states with their associated Amazon region:
 
Generated automatically using Google Maps API from the indexes.
 
The solution: determine the location of the server, then look up the Amazon region closest to the server's location.
 

Part 1: GeoIP

This was the easy part. The TurnKey Hub is developed using Django, which ships with GeoIP support in contrib. Despite being totally new to geo-location, I had part 1 up and running within a few minutes.
 
When TKLBAM is initialized and a backup is initiated, the Hub is contacted to get authentication credentials and the S3 address for backup. The Hub performs a lookup on the IP address and enumerates the country/state.
 
In a nutshell, adding GeoIP support to your Django app is simple: install MaxMind's C library and download the appropriate dataset. Then, once you update your settings.py file, you're all set.
 
settings.py

GEOIP_PATH = "/volatile/geoip"
GEOIP_LIBRARY_PATH = "/volatile/geoip/libGeoIP.so"

code

from django.contrib.gis.utils import GeoIP

# Look up the city-level record for the client's IP address
ipaddress = request.META['REMOTE_ADDR']
g = GeoIP()
g.city(ipaddress)
    {'area_code': 609,
     'city': 'Absecon',
     'country_code': 'US',
     'country_code3': 'USA',
     'dma_code': 504,
     'latitude': 39.420898,
     'longitude': -74.497703,
     'postal_code': '08201',
     'region': 'NJ'}
 

Part 2: Indexing

This part was a little more complicated.
 
Now that we have the server's location, we can look up the closest region. The problem is creating an index of each and every country in the world, as well as each US state, and associating them with their closest Amazon region.
 
Creating the index could have been really painstaking, boring and error prone if done manually, so I devised a simple automated solution:
  • Generate a mapping of country and state codes with their coordinates (latitude and longitude).
  • Generate a reference map of the server farms' (Amazon regions') coordinates.
  • Using a simple distance-based calculation, determine the closest region to each country/state, and finally output the index files (a minimal sketch follows below).
I was also planning on incorporating data about internet connection speeds and trunk lines between countries, and adding weight to the associations, but decided that was overkill.
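
For the curious, here's a minimal sketch of that closest-region calculation, using the haversine formula mentioned in the update below; the region coordinates are rough illustrative values I picked, not the ones used to generate the published indexes:

from math import radians, sin, cos, asin, sqrt

# Approximate coordinates of the four regions (illustrative values only)
REGIONS = {
    'us-east-1':      (38.9, -77.4),    # Northern Virginia
    'us-west-1':      (37.4, -122.0),   # Northern California
    'eu-west-1':      (53.3, -6.3),     # Ireland
    'ap-southeast-1': (1.3, 103.8),     # Singapore
}

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def closest_region(lat, lon):
    """Return the region code whose coordinates are nearest to (lat, lon)."""
    return min(REGIONS, key=lambda region: haversine(lat, lon, *REGIONS[region]))

print(closest_region(39.42, -74.50))    # New Jersey -> us-east-1

Generating the index files is then just a matter of running this over every country and US state code and writing the results out.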
 
We are making the indexes available for public use (countries.index, unitedstates.index).
 
More importantly, we need your help to tweak the indexes - as you have better knowledge and experience on your connection latency and speed. Please let us know if you think we should associate your country/state to a different Amazon region.
 
[update] We updated the indexes to include the new AWS regions (Oregon, Sao Paulo, Tokyo), tweaked the automatic association to use the haversine formula, and added overrides based on underwater internet cables. Lastly, we've open-sourced the whole project on GitHub (check out the live map mashup).

Review: Cloud automation tools – Computerworld

// June 13th, 2010 // Comments Off // Uncategorized

"In this groundbreaking test, we looked at products from RightScale, Appistry and Tap In Systems that automate and manage launching the job, scaling cloud resources, making sure you get the results you want, storing those results, then shutting down the pay-as-you-go process."

Identifying When a New EBS Volume Has Completed Initialization From an EBS Snapshot

// March 24th, 2010 // Comments Off // Uncategorized

On Amazon EC2, you can create a new EBS volume from an EBS snapshot using a command like

ec2-create-volume --availability-zone us-east-1a --snapshot snap-d53484bc

This returns almost immediately with a volume id which you can attach and mount on an EC2 instance. The EBS documentation describes the magic behind the immediate availability of the data on the volume:

“New volumes created from existing Amazon S3 snapshots load lazily in the background. This means that once a volume is created from a snapshot, there is no need to wait for all of the data to transfer from Amazon S3 to your Amazon EBS volume before your attached instance can start accessing the volume and all of its data. If your instance accesses a piece of data which hasn’t yet been loaded, the volume will immediately download the requested data from Amazon S3, and then will continue loading the rest of the volume’s data in the background.”

This is a cool feature which allows you to start using the volume quickly without waiting for all the blocks to be populated. In practice, however, there is a period of time when you experience high iowait as your application accesses disk blocks that still need to be loaded. This manifests as anything from mild to extreme slowness for minutes to hours after the creation of the EBS volume.

For some applications (mine) the initial slowness is acceptable, knowing that it will eventually pick up and perform with the normal EBS access speed once all blocks have been populated. As Clint Popetz pointed out on the ec2ubuntu group, other applications might need to know when the EBS volume has been populated and is ready for high performance usage.

Though there is no API status to poll or trigger to notify you when all the blocks on the EBS volume have been populated, it occurred to me that there was a method which could be used to guarantee that you are waiting until the EBS volume is ready: simply request all of the blocks from the volume and throw them away.

Here’s a simple command which reads all of the blocks on an EBS volume attached to device /dev/sdX (substitute your device name) and does nothing with them:

sudo dd if=/dev/sdX of=/dev/null bs=1G

By the time this command is done, you know that all of the data blocks have been copied from the EBS snapshot in S3 to the EBS volume. At this point you should be able to start using the EBS volume in your application without the high iowait you would have experienced if you started earlier. Early reports from Clint are that this method works in practice.
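
If you'd like progress reporting while the blocks are being pulled down, here's a rough Python equivalent of the dd command above. It's only a sketch: it assumes the volume is attached as /dev/sdX and that you run it as root.

import os, sys, time

DEVICE = "/dev/sdX"          # substitute your attached EBS device
CHUNK = 1024 * 1024          # read 1 MiB at a time

fd = os.open(DEVICE, os.O_RDONLY)
total = os.lseek(fd, 0, os.SEEK_END)    # size of the block device in bytes
os.lseek(fd, 0, os.SEEK_SET)

read_bytes = 0
start = time.time()
while True:
    chunk = os.read(fd, CHUNK)
    if not chunk:                       # end of device: every block has been touched
        break
    read_bytes += len(chunk)
    if read_bytes % (100 * CHUNK) == 0: # print progress roughly every 100 MiB
        pct = 100.0 * read_bytes / total
        rate = read_bytes / 1e6 / (time.time() - start)
        sys.stdout.write("\r%.1f%% done (%.1f MB/s)" % (pct, rate))
        sys.stdout.flush()
os.close(fd)
print("\nall blocks have been read; the volume should now perform at full speed")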

Exploring S3 based filesystems S3FS and S3Backer

// January 11th, 2010 // Comments Off // Uncategorized

In the last couple of days I've been researching Amazon S3 based filesystems, to figure out if maybe we could integrate that into an easy to use backup solution for TurnKey Linux appliances.

Note that S3 could only be a part of the solution. It wouldn't be a good idea to rely exclusively on S3 based automatic backups because of the problematic security architecture it creates. If an attacker compromises your server, he can easily compromise and subvert or destroy any S3 based automatic backups. That's bad news.

S3 performance, limitations and costs

S3 performance

  • S3 itself is faster than I realized. I've been fully saturating our server's network connection and uploading/downloading objects to S3 at 10MBytes/s.

  • Each S3 transaction comes with fixed overhead of 200ms for writes and about 350ms for reads.

    This means you can only access about 3 objects a second sequentially, which will of course massively impact your data throughput (e.g., if you read many 1-byte objects sequentially you'll get about 3 bytes a second).

S3 performance variability

S3 is usually very fast, but it's based on a complex distributed storage network behind the scenes that is known to vary in its behavior and performance characteristics.

Use it long enough and you will come across requests that take 10 seconds to complete instead of 300ms. Point is, you can't rely on the average behavior ALWAYS happening.

S3 limitations

  • Objects can be at most 5GB in size.
  • You can't update part of an object. If you want to update 1 byte in a 1GB object you'll have to reupload the entire GB.

S3 costs

Three components:

  1. Storage: $0.15 per GB/month
  2. Data transfer: $0.10 per GB in, $0.17 per GB out
  3. Requests: $0.01 per 1,000 PUT requests, $0.01 per 10,000 GET and other requests

A word of caution: some people using S3-based filesystems have made the mistake of focusing on just the storage costs and forgetting about the other expenses, especially requests, which look so inexpensive.

You need to watch out for that: with one filesystem under its default configuration (4KB blocks), storing 50GB of data cost $130 just in PUT request fees, more than 17X the storage fees!
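
The arithmetic behind that figure, using the prices listed above:

# Rough cost check: uploading 50GB as 4KB blocks means one PUT per block
volume_gb = 50
block_kb = 4

blocks = volume_gb * 1024 * 1024 / block_kb      # ~13.1 million 4KB blocks
put_cost = blocks / 1000.0 * 0.01                # $0.01 per 1,000 PUT requests
storage_cost = volume_gb * 0.15                  # $0.15 per GB/month

print("PUT requests: %d" % blocks)               # 13107200
print("PUT fees:     $%.2f" % put_cost)          # ~$131
print("storage/mo:   $%.2f" % storage_cost)      # $7.50, so PUT fees are ~17x storage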

Filesystems

s3fs

Which s3fs?

It's a bit confusing but there are two working projects competing for the name "s3fs", both based on FUSE.

One is implemented in C++, last release Aug 2008:

http://code.google.com/p/s3fs/wiki/FuseOverAmazon

Another implemented in Python, last release May 2008:

https://fedorahosted.org/s3fs/

I've only tried the C++ project, which is better known and more widely used (e.g., the Python project comes with warnings regarding data loss) so when I say s3fs I mean the C++ project on Google Code.

Description

s3fs is a direct mapping of S3 to a filesystem paradigm. Files are mapped to objects. Filesystem metadata (e.g., ownership and file modes) is stored inside the object's metadata. Filenames are keys, with "/" as the delimiter to make listing more efficient, etc.

That's significant because it means there is nothing terribly magical about a bucket being read/written to by s3fs, and in fact you can mount any bucket with s3fs to explore it as a filesystem.
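
To illustrate that mapping, here's a small sketch using the boto library (my choice for the example, not something s3fs depends on) and a made-up bucket name; it lists a bucket the same delimiter-based way s3fs does when it presents a directory view:

import boto
from boto.s3.prefix import Prefix

# Assumes AWS credentials in the environment and an existing bucket
conn = boto.connect_s3()
bucket = conn.get_bucket("my-s3fs-bucket")       # hypothetical bucket name

# Listing with delimiter="/" returns plain keys ("files") and
# common prefixes ("directories"), which is how a flat object name
# like "mypath/myfile" gets presented as a directory hierarchy.
for entry in bucket.list(prefix="", delimiter="/"):
    if isinstance(entry, Prefix):
        print("dir:  %s" % entry.name)
    else:
        print("file: %s" % entry.name)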

s3fs's main advantage is its simplicity. There are however a few gotchas:

  • If you're using s3fs to access a bucket it didn't create and have objects in it that have directory-like components in their names (e.g., mypath/myfile), you'll need to create a dummy directory in order to see them (e.g., mkdir mypath).

  • The project seems to be "regretware". The last open source release was in August 2008. Since then the author seems to have continued all development of new features (e.g., encryption, compression, multi-user access) under a commercial license (subcloud), and with that inherent conflict of interest the future of the GPL-licensed open source version is uncertain.

    In fact, a few of the unresolved bugs (e.g., deep directory renames) in the open source version have long been fixed in the proprietary version.

  • No embedded documentation. Probably another side effect of the proprietary version, though the available options are documented on the web site.

  • Inherits S3's limitations: no file can be over 5GB, and you can't partially update a file so changing a single byte will re-upload the entire file.

  • Inherits S3's performance characteristics: operations on many small files are very efficient (each is a separate S3 object, after all).

  • Though S3 supports partial/chunked downloads, s3fs doesn't take advantage of this so if you want to read just one byte of a 1GB file, you'll have to download the entire GB.

    OTOH, s3fs supports a disk cache, which can be used to mitigate this limitation.

  • Watch out, the ACL for objects/files you update/write to will be reset to s3fs's global ACL (e.g., by default "private"). So if you rely on a richer ACL configuration for objects in your bucket you'll want to access your S3FS bucket in read-only mode.

  • By default, s3fs doesn't use SSL, but you can get that to work by using the -o url option to specify https://s3.amazonaws.com/ instead of http://s3.amazonaws.com/

    It's not documented very well of course. Cough. Proprietary version. Cough.

S3Backer

S3Backer is a true open source project under active development, which has a very clever design.

Also based on FUSE, but instead of implementing a usable filesystem on top of S3, it implements a virtual loopback device on top of S3:

mountpoint/
    file       # (e.g., can be used as a virtual loopback)
    stats      # human readable statistics

Except for this simple virtual filesystem S3Backer doesn't know anything about filesystems itself. It just maps that one virtual file to a series of dynamically allocated blocks inside S3.

The advantage to this approach is that it allows you to leverage well tested code built into the kernel to take care of the higher level business of storing files. For all intents and purposes, it's just a special block device (e.g., use any filesystem, LVM, software Raid, kernel encryption, etc.).

In practice it seems to work extremely well, thanks to a few clever performance optimizations. For example:

  • In-memory block cache: so rereads of the same unchanged block don't have to go across the network if it's still cached.
  • Delayed, multi-threaded write queue: dirty blocks aren't written immediately to S3 because that can be very inefficient (e.g., the smallest operation would update the entire block). Instead changes seem to be accumulated for a couple of seconds and then written out in parallel to S3.
  • Read-ahead algorithm: will detect and try to predict sequential block reads so the data is available in your cache before you actually ask for it.

Bottom line, under the right configuration S3Backer works well enough that it can easily saturate our Linode's 100Mbit network connection. Impressive.

In my testing with Reiserfs I found the performance good enough so that it would be conceivable to use it as an extension of working storage (e.g., as an EBS alternative for a system outside of EC2).

There are a few gotchas however:

  • High risk of data corruption due to the delayed writes (e.g., if your system crashes or the connection to AWS fails). Journaling doesn't help, because as far as the filesystem is concerned the blocks have already been written (i.e., to S3Backer's cache).

    In other words, I wouldn't use this as a backup drive. You can reduce the risk by turning off some of the performance optimizations to minimize the amount of data in limbo waiting to be written to Amazon.

  • Too small block sizes (e.g., the 4K default) can add significant extra request costs (e.g., $130 in PUT fees for 50GB stored as 4K blocks).

  • Too large block sizes can add significant data transfer and storage fees.

  • Memory usage can be prohibitive: by default it caches 1000 blocks. With the default 4K block size that's not an issue, but most users will probably want to increase the block size.

    So watch out: 1000 x 256KB blocks = 256MB.

    You can adjust the amount of cached blocks to control memory usage.

Future versions of S3Backer will probably include disk based caching which will mitigate data corruption and memory usage issues.

Tips:

  • Use Reiserfs. It can store multiple small files in the same blocks.

    It also doesn't populate the filesystem with too many empty blocks when the filesystem is created which makes filesystem creation faster and more efficient.

    Also, with Reiserfs I tested expanding and shrinking of the filesystem (after increasing/decreasing the size of the virtual block device) and it seemed to work just fine.

  • S3Backer supports storing multiple block devices in the same bucket using prefixes.

Conclusions

  • s3fs: safe, efficient storage of medium-large files. Perfect for backup / archiving purposes.
  • S3Backer: high performing live storage on top of S3. EBS alternative outside of EC2. Not safe for backups at this stage.

Practical Cloud Computing | "The cloud" explained for normal people

// January 1st, 2010 // Comments Off // Uncategorized

"Basically, by moving these services out of your data center and into the cloud, you no longer need to have your own data center and supporting staff. Also, you save money in taxes as you transform your capital expenditure to operational expenditure."