Posts Tagged ‘cloud computing’
// January 20th, 2012 // 1 Comment » // Uncategorized
Amazon has made it easier for authorized business users to manage their Amazon Web Services infrastructure after signing on — once — to their corporate network.
This is the latest in a steady drip, drip, drip of functionality that Amazon adds to its services over time. This week, for example, Amazon added free Windows “micro” instances to its EC2 Elastic Compute Cloud service on Sunday, and three days later added the DynamoDB NoSQL database to its roster.
In this case, the aim is to make it easier for authorized users to maintain and tweak their Amazon-based services. Once the user is identified and authenticated by whoever manages the AWS account, he or she can sign onto the corporate network using existing credentials, then navigate to the AWS Management Console without re-entering a password, according to an AWS blog posted late Thursday. Before, users had to sign into the AWS Management Console separately.
When that user requests entry into the management console, the identity broker “validates that user’s access rights and provides temporary security credentials which includes the user’s permissions to access AWS. The page includes these temporary security credentials as part of the sign-in request to AWS,” according to the blog.
This all requires up-front work. The person in charge of a company’s AWS account must set up the user’s identity and federate it to the appropriate services. When the user signs into the corporate network, the identity broker pings Amazon’s Security Token Service (STS) to request temporary security credentials. Until now, those credentials gave specified users access to Amazon services for a set period of time (up to 36 hours). Now those same credentials will be good for AWS Management Console as well.
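The broker-to-console handoff described above boils down to exchanging the temporary STS credentials for a federation sign-in URL. Below is a minimal sketch of that URL construction, assuming the temporary credentials have already been obtained from STS; the endpoint and parameter names follow AWS's documented federation endpoint, and the function name is our own.

```python
import json
import urllib.parse

FEDERATION_ENDPOINT = "https://signin.aws.amazon.com/federation"

def build_signin_urls(access_key, secret_key, session_token,
                      destination="https://console.aws.amazon.com/"):
    """Build the two federation requests an identity broker makes,
    embedding the temporary STS credentials from GetFederationToken."""
    session = json.dumps({
        "sessionId": access_key,
        "sessionKey": secret_key,
        "sessionToken": session_token,
    })
    # Step 1 (an HTTP GET, not performed here): exchange the temporary
    # credentials for a short-lived SigninToken.
    token_request = FEDERATION_ENDPOINT + "?" + urllib.parse.urlencode({
        "Action": "getSigninToken",
        "Session": session,
    })
    # Step 2: redirect the user's browser to the login action with the
    # SigninToken attached -- no AWS password ever entered.
    login_request = FEDERATION_ENDPOINT + "?" + urllib.parse.urlencode({
        "Action": "login",
        "Destination": destination,
        "SigninToken": "<token returned by step 1>",
    })
    return token_request, login_request
```

The user only ever authenticates against the corporate directory; the broker does the credential plumbing behind the scenes.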
The bulk of Amazon services — including Amazon EC2, Amazon S3, VPC, ElastiCache — support that identity federation to the management console. The company is working to add the new Amazon DynamoDB NoSQL database service to that list, said Amazon Web Services Evangelist Jeff Barr in the post.
As Microsoft beefs up its Azure cloud offering with expected Infrastructure-as-a-Service capabilities, and more OpenStack-based IaaS offerings come online, the competition to provide cloud services will only heat up.
Feature photo courtesy of Flickr user Will Merydith
Related research and analysis from GigaOM Pro:
Subscriber content. Sign up for a free trial.
// January 12th, 2012 // 3 Comments » // Uncategorized
The public cloud is looking pretty good as a development platform this year, and gaining cloud development skills is a top priority for many developers, according to new research.
Of the 3,335 developers surveyed by Zend Technologies about what work they expect to do this year, 61 percent said they expect to use a public cloud for their projects. And, of those going that route, 30 percent named Amazon Web Services as their public cloud of choice; 28 percent did not yet know which cloud they would use; 10 percent named Rackspace; and 6 percent cited Microsoft Azure. “Other” public clouds came in at 5 percent and IBM Smart Cloud at 3 percent.
These numbers come courtesy of the Zend Developer Pulse. (Zend, a provider of PHP tools, is the same company that broke the news that a surprising number of PHP developers are also Metallica fans.)
In terms of overall types of projects, a whopping 66 percent of respondents said they will be doing mobile development this year – hardly surprising given the glut of smartphones, tablets and app stores flooding the market.
Forty-one percent said they expect to work on cloud-based development and 40 percent said they see big data work in their immediate future. Those cloud and big data numbers seem pretty low given the level of interest around both topics, but then again, “mobile development” is a broad term that could be interpreted to include cloud work.
Nearly half of those surveyed (48 percent) said they will work on APIs and 45 percent said they will work on social media integration this year. Zend surveyed the developers in November 2011, and plans to make the Zend Developer Pulse an annual event.
Photo courtesy of Flickr user skuds
// January 10th, 2012 // 1 Comment » // Uncategorized
A new year, an old topic. Complexity.
“CLOUD IS COMPLEX” screamed the headline of two recent blog posts from my Clouderati alumni, James Urquhart and Sam Johnston. Really? Who would have thought? There really is nothing that gets past these two guys, is there?
Joking aside, their respective copy brings a sharp focus on a topic that has, in my opinion, two very different sets of meanings and two very different sets of challenges, depending upon which side of the proverbial cloud fence one is resting one’s posterior upon.
If, dear reader, you are a vocal proponent of public cloud, renowned and famed to, upon occasion, theatrically wave your arms around while openly condemning the very notion of private cloud to be evil and reprehensible, then I would metaphorically place you in the “don’t know, don’t care” bucket when it comes to understanding how (to quote Sam Johnston) “the delivery of information technology as a service rather than a product” is brought to your browser and your credit card, respectively. Hmm, the classic “power grid” analogy – just plug it in and it works.
Nothing wrong with that. Not at all.
If, however, like this author, you are somewhat willing to entertain the concept of private cloud, either through experience or hope, or even as a much needed and logical evolution of the galling monotony and crippling legacy of today’s large enterprise IT environments, then I might suggest that the complexity thrust upon you via a veritable pot-pourri of technologies, services, operating models and organizational challenges will place you, either wittingly or unwittingly, somewhere between Levels 1 and 2 of the Conscious Competence Ladder.
I’ve never been a particularly big fan of the “cloud / utility computing is the same as the move away from building your own power station to the public grid” analogy. It’s fine for an incredibly basic mental picture of the difference between having a substation located at the bottom of your garden versus a medium voltage cable and a meter connected to the local provider, but as far as the depth of the analogy’s relevance to the practical application of any cloud strategy goes, it would be easier to say “someone else provides the capacity”.
Job done. Same result. Not much use at all.
The major flaw in the analogy is that in today’s rapidly changing enterprise IT world, where virtualization has only just begun to take hold (arguably) but is widely accepted as being a cornerstone of any cloud, it sadly isn’t as simple as just sticking a fat pipe into a service provider and letting someone else provide the capacity. If it was, then everyone would do it. Imagine if every organization since time immemorial had asked the electricity provider to take over running the rest of the machinery, plant or robotic equipment that its invisible juice powered? Well, to me, that’s the crux of the application of cloud infrastructure. Hardly apples to apples.
There is complexity in public cloud, and there is complexity in private cloud – it’s simply a case of who owns and manages the complexity, and how much appetite you have for running your services on each. But equally, as service providers do a better and better job of managing “simplexity”, most enterprises continue to wrestle with their strategies, egged on by a myriad of vendors who now have the word “cloud” in every piece of marketing literature. It’s not a one-size-fits-all model; there is no either/or.
In my incredibly humble opinion, it is increasingly arguable that the case for private clouds is stronger than ever, yet, as the struggle to keep up to date with technology trends and models gains momentum, I don’t see any sign of the landscape becoming simpler to design, implement or operate. In fact, I think many enterprises, in their best efforts to implement all that they are told they can’t do without, are heading for more complexity than they ever dreamed possible - creating an environment so complex, that even Rube Goldberg might raise a mechanical eyebrow.
Physical servers, virtual servers, physical switches, virtual switches, physical interfaces, virtual interfaces, physical storage, virtual storage, physical load balancers, virtual load balancers, physical firewalls, virtual firewalls, physical networks (?), virtual networks, physical interfaces, virtual interfaces, physical IP address (?), virtual IP address, physical data centers, virtual data centers, physical people, virtual people, Mechanical Turks.
Mechanical what? Mechanical Turks. I thought that’s what you said. And so it goes on and on, something like this.
“IT? Where’s my server? Oh where did it go?”
(We are all losing money and patience is low)
“I want it to work, you must call the Turk.”
(We should call the Turk, he’ll get it to work)
“We have something to tell you, the Turk isn’t real.”
(The Turk isn’t real? That’s quite a big deal.)
“The server is somewhere, it just can’t have gone.”
(We just need to find out which storage it’s on)
“The CMDB! Now this one’s in the bag.”
(But the CMDB has just waved a white flag)
“It can’t be the DevOps, it can’t be those guys!”
(The DevOps are admins? That’s quite a surprise)
“So it must be the network, it’s eaten my app.”
(As the Net guys will tell you, that’s monstrous crap)
“We’ve found it, don’t worry, we’ll just bring it back.”
(Now several young admins are facing the sack)
“That downtime has killed us, we’ve lost fifty grand.”
(..as the CEO enters still waving his hand)
“It’s much more efficient !” IT screams out loud.
(But the moral is simply “shit happens in cloud”)
Today, there isn’t a CMDB tool on earth (yet) that can realistically and efficiently keep pace with the inherent fluidity, agility and flexibility of even the most well-intentioned cloud deployments, and the ditty above is a not-so-tongue-in-cheek example of what happens when complexity is mixed with a lack of clear visibility.
Interestingly, this problem isn’t unique to technology, nor cloud. In my everyday life, I come across an E&C (Engineering & Construction) industry-wide problem that relates to a concept called “wrench time”. Typically, wrench time is a measure of crafts personnel at work, using tools, in front of jobs. Wrench time does not include obtaining parts, tools or instructions, or the travel associated with those tasks.
In some cases, wrench time can be as low as 20% of a total working week, meaning roughly 8 hours spent fixing problems, with the remaining 32 hours spent on non-value-added tasks such as finding and qualifying maintenance record information.
The parallels are obvious. Difficulty in finding and qualifying information, in and amongst these complex systems – clouds or power stations – leads to inefficient maintenance, poor RTO times and eventually to revenue or reputation loss.
Spanner in the works, anyone?
(Cross-posted @ The Loose Couple's Blog)
// October 27th, 2011 // 1 Comment » // Uncategorized
Orchestra is one of the most exciting new capabilities in 11.10. It provides automated installation of Ubuntu across sets of machines. Typically, it’s used by people bringing up a cluster or farm of servers, but the way it’s designed makes it very easy to bring up rich services, where there may be a variety of different kinds of nodes that all need to be installed together.
There’s a long history of tools that have been popular at one time or another for automated installation. FAI is the one I knew best before Orchestra came along and I was interested in the rationale for a new tool, and the ways in which it would enhance the experience of people building clusters, clouds and other services at scale. Dustin provided some of that in his introduction to Orchestra, but the short answer is that Orchestra is savvy to the service orchestration model of Juju, which means that the intelligence distilled in Juju charms can easily be harnessed in any deployment that uses Orchestra on bare metal.
What’s particularly cool about THAT is that it unifies the new world of private cloud with the old approach of Linux deployment in a cluster. So, for example, Orchestra can be used to deploy Hadoop across 3,000 servers on bare metal, and that same Juju charm can also deploy Hadoop on AWS or an OpenStack cloud. And soon it should be possible to deploy Hadoop across n physical machines with automatic bursting to your private or favourite public cloud, all built in. Brilliant. Kudos to the conductor.
Private cloud is very exciting – and with Ubuntu 11.10 it’s really easy to set up a small cloud to kick the tires, then scale that up as needed for production. But there are still lots of reasons why you might want to deploy a service onto bare metal, and Orchestra is a neat way to do that while at the same time preparing for a cloud-oriented future, because the work done to codify policies or practices in the physical environment should be useful immediately in the cloud, too.
For 12.04 LTS, where supporting larger-scale deployments will be a key goal, Orchestra becomes a tool that every Ubuntu administrator will find useful. I bet it will be the focus of a lot of discussion at UDS next week, and a lot of work in this cycle.
// October 14th, 2011 // 12 Comments » // Sticky Posts
The much talked about Ubuntu Cloud Live 11.10 image given away at the OpenStack Essex Conference is now available for download at:
The image uses OpenStack Diablo, requires an x86_64-compatible desktop/laptop machine, and is approximately 560MB in size. We recommend flashing to a 4GB USB drive (or larger) to allow for proper setup and use of the cloud. Use the ‘dd’ command to copy the image over to your USB drive. For example, if your USB drive is connected to /dev/sdb, make sure the drive isn’t mounted, and then run `dd if=ubuntu-11.10-cloud-live-amd64.img of=/dev/sdb`. WARNING: THIS COMMAND WILL ERASE ALL DATA PREVIOUSLY STORED ON THE TARGET DEVICE. MAKE SURE YOU HAVE THE CORRECT DEVICE WHEN FLASHING.
Once flashed, simply boot your laptop/desktop from the USB drive and follow the instructions displayed on the desktop.
// February 21st, 2011 // Comments Off // Uncategorized
I got good feedback on last week’s post about the stuff I’d achieved in OpenStack, so I figured I’d do the same this week.
We left the hero of our tale (that would be me (it’s my blog, I can entitle myself however I please)) last Friday somewhat bleary eyed, hacking on a mountall patch that would more gracefully handle SIGPIPE caused by Plymouth going the way of the SIGSEGV. I got the ever awesome Scott James Remnant to review it and he (rightfully) told me to fix it in Plymouth instead. My suggested patch was much more of a workaround than a fix, but I wasn’t really in the mood to deal with Plymouth. Somehow, I had just gotten it into my head that fixing it in Plymouth would be extremely complicated. That probably had to do with the fact that I’d forgotten about MSG_NOSIGNAL for a little bit, and I imagined fixing this problem without MSG_NOSIGNAL would probably mean rewriting a bunch of I/O routines which I certainly didn’t have the brain power for at the time. Nevertheless, a few attempts later, I got it worked out. I sent it upstream, but it seems to be stuck in the moderation queue for now.
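For the curious, the MSG_NOSIGNAL flag mentioned above is what makes the Plymouth fix simple: passing it to send(2) makes a write to a dead peer fail with EPIPE instead of delivering SIGPIPE, so no I/O routines need rewriting. Here's a quick illustration of the flag in Python on Linux (in unprepared C code, the default SIGPIPE disposition would kill the process; Python surfaces it as an exception either way):

```python
import socket

# A connected pair of UNIX stream sockets standing in for the
# client/daemon connection.
a, b = socket.socketpair()
b.close()  # the peer goes away, SIGSEGV-style

try:
    # MSG_NOSIGNAL (Linux): report EPIPE as an error on this call
    # instead of raising SIGPIPE against the whole process.
    a.send(b"ping", socket.MSG_NOSIGNAL)
    broken = False
except BrokenPipeError:
    broken = True

a.close()
```

After the call, `broken` is True: the dead peer shows up as an ordinary error at the call site, which the caller can handle gracefully.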
I spent almost a day and a half wondering why some of our unit tests were failing “randomly”. It only happened every once in a while, and every time I tried running it under e.g. strace, it worked. It had “race condition” written all over it. After a lot of swearing, rude gestures and attempts to nail down the race condition, I finally noticed that it only failed if a randomly generated security group name in the test case sorted earlier than “default”, which it would do about 20% of the time. We had recently fixed DescribeSecurityGroups to return an ordered resultset, which broke an assumption in this test case. Extremely annoying. My initial proposed fix was a mere 10 characters; it ended up slightly larger, but the resulting code was easier on the eyes.
Log file handling has been a bit of an eyesore in Nova since The Big Eventlet Merge™. Since then, the Ubuntu packages have simply piped stdout and stderr to a log file and restarted the workers when the log files needed rotating. I finally got fed up with this and resurrected the logdir option, and after one futile attempt, I got the log files to rotate without even reloading the workers. Sanity restored.
With all this done, I could now reliably run all the instances I wanted. However, I’d noticed that they’d all be run sequentially. Our workers, while built on top of eventlet, were single-threaded. They could only handle one RPC call at a time. This meant that if the compute worker was handling a long request (e.g. one that involved downloading a large image, and postprocessing it with copy-on-write disabled), another user just wanting to look at their instance’s console output might have to wait minutes for that request to be served. This was causing my tests to take forever to run, so a’fixin’ I went. This means that each worker can now (theoretically) handle 1024 (or any other number you choose) requests at a time.
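The effect of that change is easy to demonstrate. Nova's workers use eventlet green threads; the stdlib-threads sketch below shows the same idea with made-up RPC names: with a pool of one, a short console-output request queues behind a slow image-heavy one, while with a bigger pool it is served immediately.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rpc_call(name, seconds):
    """Stand-in for an RPC handler; the slow one mimics a compute
    worker downloading and post-processing a large image."""
    time.sleep(seconds)
    return name

# Single-threaded worker: get-console-output waits in line behind
# the long-running call.
with ThreadPoolExecutor(max_workers=1) as worker:
    t0 = time.time()
    worker.submit(rpc_call, "run-instances", 0.2)
    console = worker.submit(rpc_call, "get-console-output", 0.0)
    console.result()
    serial_wait = time.time() - t0  # roughly the full 0.2s

# Concurrent worker (Nova now allows e.g. 1024 green threads): the
# short call is served while the slow one is still running.
with ThreadPoolExecutor(max_workers=1024) as worker:
    t0 = time.time()
    worker.submit(rpc_call, "run-instances", 0.2)
    console = worker.submit(rpc_call, "get-console-output", 0.0)
    console.result()
    concurrent_wait = time.time() - t0  # effectively instant
```

Same two requests, but the second worker returns the console output without waiting for the image download to finish.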
To test this, I cranked up the concurrency of my tests so that up to 6 instances could be started at the same time on each host. This worked about 80% of the time. The remaining 20% of instances would fail to spawn entirely. As could have been predicted, this was a new race condition that was uncovered because we suddenly had actual concurrency in the RPC workers. This time, iptables-restore would fail when trying to run multiple instances at the exact same time. I’ve been wanting to rework our iptables handling for a looong time anyway, so this was a great reason to get to work on that. By 2 AM between Friday and Saturday, I still wasn’t quite happy with it, so you’ll have to read the next post in this series to know how it all worked out.
// February 12th, 2011 // Comments Off // Uncategorized
With OpenStack’s second release safely out the door last week, we’re now well on our way towards the next release, due out in April. This release will be focusing on stability and deployability.
To this end, I’ve set up a Hudson (now Jenkins) box that runs a bunch of tests for me. I’ve used Jenkins before, but never in this (unintentional TDD) sort of way, and I’d like to share how it’s been useful to me.
I have three physical hosts. One runs Lucid, one runs Maverick, and one runs Natty. I’ve set them up as slaves of my Hudson server (which runs separately on a cloud server at Rackspace).
I started out by adding a simple install job. It would blow away existing configuration and install afresh from our trunk PPA, create an admin user, download the Natty UEC image and upload it to the “cloud”. This went reasonably smoothly.
Then I started exercising various parts of the EC2 API (which happens to be what I’m most fluent in). I would:
- create a keypair (euca-create-keypair),
- find the image id (euca-describe-images with a bit of grep),
- run an instance (euca-run-instances),
- wait for it to go into the “running” state (euca-describe-instances),
- open up port 22 in the default security group (euca-authorize),
- find the ip (euca-describe-instances),
- connect to the guest and run a command (ssh),
- terminate the instance (euca-terminate-instances),
- close port 22 in the security group again (euca-revoke),
- delete the keypair (euca-delete-keypair).
I was using SQLite as the data store (the default in the packages) and it was known to have concurrency issues (it would timeout attempting to lock the DB), so I wrapped all euca-* commands in a retry loop that would try everything up to 10 times. This was good enough to get me started.
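The retry loop was nothing fancy; conceptually it looked something like this (the euca-* command names are real, the wrapper itself is a hypothetical reconstruction):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=10, delay=1.0):
    """Run a command (e.g. an euca-* tool), retrying on failure.

    SQLite's locking meant any call could time out waiting for the
    DB, so every command gets up to `attempts` tries."""
    for _ in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        time.sleep(delay)
    raise RuntimeError(f"{cmd[0]} failed after {attempts} attempts: "
                       f"{result.stderr.strip()}")
```

Every step of the scenario above then goes through the wrapper, e.g. `run_with_retries(["euca-run-instances", image_id, "-k", "mykey"])`. Crude, but good enough to get a test pipeline running.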
So, pretty soon I would see instances failing to start. However, once Jenkins was done with them, it would terminate them, and I didn’t have anything left to use for debugging. I decided to add the console log to the Jenkins output, so I just added a call to euca-get-console-output. The console logs revealed that every so often, instances would fail to get an IP from dnsmasq. The syslog had a lot of entries from dnsmasq refusing to hand out the IP that Nova asked it to, because it already belonged to someone else. Clearly, Nova was recycling IP’s too quickly. I read through the code that was supposed to handle this several times, and it looked great. I tried drawing it on my whiteboard to see where it would fall through the cracks. Nothing. Then I tried logging the SQL for that specific operation, and it looked just fine. It wasn’t until I actually copied the SQL from the logs and ran it in sqlite3’s CLI that I realised it would recycle IP’s that had just been leased. It took me hours to realise that sqlite didn’t compare these as timestamps, but as strings. They were formatted slightly differently, so it would almost always match. An 11-character patch later, this problem was solved. 1½ days of work. -11 characters. That’s about -1 character an hour. Rackspace is clearly getting their money’s worth having me work for them. I could do this all day!
That got me a bit further. Instances would now reliably come up, one at a time. I expanded out a bit, trying to run two instances at a time. This quickly blew up in my face. This time I made do with a 4 character patch. Awesome.
At this point, I’d had too many problems with sqlite locking that I got fed up. I was close to just replacing it with MySQL to get it over with, but then I decided that it just didn’t make sense. Sure, it’s a single file and we’re using it from different threads and different processes, but we’re not pounding on it. They really ought to be able to take turns. It took quite a bit of Googling and wondering, but eventually I came up with a (counting effectively changed lines of code) 4 line patch that would tell SQLAlchemy to don’t hold connections to sqlite open. Ever. That totally solved it. I was rather surprised, to be honest. I could now remove all the retry loops, and it’s worked perfectly ever since.
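In SQLAlchemy terms that patch amounts to disabling connection pooling for sqlite (something like its NullPool, which opens and closes a real connection per use; I'm naming the mechanism, not quoting the actual Nova patch). The principle is shown below with plain sqlite3: each operation opens a connection, does its work, and closes, so the file lock is never held across operations and concurrent workers simply take turns.

```python
import os
import sqlite3
import tempfile
import threading

db = os.path.join(tempfile.mkdtemp(), "nova.sqlite")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE fixed_ips (address TEXT)")
conn.commit()
conn.close()

def lease(address):
    # Open, write, close: the write lock is held only for the
    # duration of one short transaction.
    c = sqlite3.connect(db, timeout=5.0)
    try:
        with c:  # commits on success
            c.execute("INSERT INTO fixed_ips VALUES (?)", (address,))
    finally:
        c.close()

threads = [threading.Thread(target=lease, args=(f"10.0.0.{i}",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

c = sqlite3.connect(db)
count = c.execute("SELECT COUNT(*) FROM fixed_ips").fetchone()[0]
c.close()
```

All 8 concurrent writes land without a retry loop in sight, because no worker sits on an open connection while others wait.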
So far, so good. Then I decided to try to go even more aggressive. I would let the three boxes all target a single one, so they’d all three run as clients against the same single-box “cloud”. I realised that because I used private addressing, I had to expand my tests and use floating ip’s to be able to reach VM’s from another box. Having done so, I realised that this didn’t work on the box itself. A 4 line patch (really only 2 lines, but I had to split them for pep8 compliance) later, and I was ready to rock and roll.
It quickly turned out that, as I had suspected, my 4 character patch earlier wasn’t broad enough, so I expanded a bit on that (4 lines modified).
Today, though, I found that a surprising number of VM’s were failing to boot, ending up with the dreaded:
General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
Give root password for maintenance
(or type Control-D to continue):
I tried changing the block device type (we use virtio by default, so I tried ide and scsi), I tried not using copy-on-write images, I tried disabling any code that would touch the images. Nothing worked. I blamed the kernel, I blamed qemu, everything. I replaced everything, piece by piece, and it still failed quite often. After a long day of debugging, I ended up looking at mountall. It seems Plymouth often segfaults in these settings (where the only console is a serial port), and when it does, mountall dies, killed by SIGPIPE. A 5 line (plus a bunch of comments) patch to mountall, that is still pending review, and I can now run hundreds of VM’s in a row and (5-10-ish) in parallel with no failures at all.
So, in the future, Jenkins will provide me with a great way to test drive and validate my changes, making sure that I don’t break anything, but right now, I’m extending the tests, discovering bugs and fixing them as I extend the test suite, very test-driven-development-y. It’s quite nice. At this rate, I should have pretty good test coverage pretty soon and be able to stay confident that things keep working.
I also think it’s kind of cool how much of a difference this week has made in terms of stability of the whole stack, and only 19 lines of code have been touched.
// January 14th, 2011 // Comments Off // Uncategorized
tl;dr: I now have daily backups of my laptop, powered by Rackspace Cloud Files (powered by Openstack), Deja-Dup, and Duplicity.
I’ve been using computers for a long time. If memory serves, I got my first PC when I was 9, so that’s 20 years ago now. At various times, I’ve set up some sort of backup system, but I always ended up
- annoyed that I couldn’t actually *use* the biggest drive I had, because it was reserved for backups,
- annoyed because I had to go and connect the drive and do something active to get backups running, because having the disk always plugged into my system might mean the backup got toasted along with my active data when disaster struck,
- and annoyed at a bunch of other things.
Cloud storage solves the hardest part of this. With Rackspace Cloud Files, I have access to an infinite amount of storage. I can just keep pushing data, Rackspace keeps it safe, and I pay for exactly how much space I’m using. Awesome.
All I need is something that can actually make backups for me and upload them to Cloud Files. I’ve known about Duplicity for a long time, and I also knew that it’s been able to talk to Cloud Files for a while, but I never got into the habit of running it at regular intervals, and running it from cron was annoying, because maybe I didn’t have my laptop on when it wanted to run, and if I wasn’t logged in, my homedir would be encrypted anyway, etc. etc. Lots of chances for failure.
Enter Deja-Dup! Deja-dup is a project spearheaded by my awesome former colleague at Canonical, Mike Terry. It uses Duplicity on the backend, and gives me a nice, really simple frontend to get it set up. It has its own timing mechanism that runs in my GNOME desktop session. This means it only runs when my laptop is on and I’m logged in. Every once in a while, it checks how long it’s been since my last backup. If it’s more than a day, an icon pops up in the notification area that offers to run a backup. I’ve only been using this for a day, so it’s only asked me once. I’m not sure if it starts on its own if I give it long enough.
A couple of caveats:
- Deja-dup needs a very fresh version of libnotify, which means you need to either be running Ubuntu Natty, use backported libraries, or patch Deja-dup to work with the version of libnotify in Maverick. I opted for the latter approach.
- I have a lot of data. Around 100GB worth. Some of it is VM’s, some of it is code, some of it is various media files. Duplicity doesn’t support resuming a backup if it breaks halfway, and I “only” have 8 Mbit/s upstream bandwidth. That meant I had to stay connected to the Internet for 28 hours straight (in a perfect world) and not have anything unexpected happen along the way. I wasn’t really interested in that, so I made my initial backup to an external drive and I’m now copying the contents of that to Rackspace at my own pace. I can stop and resume at will. The tricky part here was to get Deja-Dup to understand that the backup it thinks is on an external drive really is on Cloud Files. I’ll save that for a separate post.
(Maybe not actually infinite, but infinite enough.)
// December 6th, 2010 // 2 Comments » // Uncategorized
Over the course of the next week, I will be moving the majority of my professional blogging to the new website for Telematica. As many of you know, I've re-established the consulting practice of Telematica, and over the past year have focused my energies on cloud computing, its infrastructure and platforms, as well as the requirements for an 'intercloud' --...