grack.com

Blog

AppEngine vs. EC2 (an attempt to compare apples to oranges)

This is an expanded version of my answer on Quora to the question “You’ve used both AWS and GAE for startups: do they scale equally well in terms of availability and transaction volume?”:

I’ve had an opportunity to spend time building products in both EC2 (three years at DotSpots and the last year working on Gri.pe). At DotSpots, we started using EC2 in the early days, back when it was just a few services: EC2, S3, SQS and SDB. It grew a great deal in those years, tacking on a number of useful services, some which we used (<3 CloudFront) and a number that we didn’t. Last year, around April, we switched over to building Gri.pe on AppEngine and I’ve come to appreciate how great this platform is and how much more time I spend building product rather than infrastructure. Other developers enjoy building custom infrastructure, but I’m happy to outsource it to Google.

Given these two technologies, it’s difficult to directly compare the them because they are two different beasts: EC2 is a general purpose virtual machine host, while AppEngine is a very sophisticated application host. AppEngine comes with a number of services out-of-the-box. The Amazon Web Services suite tacks on a number of various utilities to EC2 that give you access to structured, query-able storage, automatic scaling at the VM level, monitoring and other goodies too numerous to mention here.

Transaction Volume

When dealing with AppEngine, the limit to your scaling is effectively determined by how well you program to the AppEngine environment. This means that you must be aware of how transactions are processed at the entity group level and make judicious use of the AppEngine memcache service. If you program well against the architecture and best practices of AppEngine, you have the potential of scaling as well as some of Google’s own properties.

Here’s an example of one of the more expensive pages we render at Gri.pe. Our persistence code automatically keeps entities hot in memcache to avoid hitting the datastore more than a few times while rendering a page:

On EC2, scaling is entirely in your hands. If you are a MySQL wizard, you can scale that part of the stack well. Scaling is half limited only by the constraints of the boxes you rent from Amazon behalf by your skill and the other half by creativity in making them perform well. At DotSpots, we spent a fair bit of time scaling MySQL up for our web crawling activities. We built out custom infrastructure for serving key/value data fast. There was infrastructure all over the place just to keep up with what we wanted to do. Our team put a lot of work into it, but at the end of the day, it was fast.

It’s my opinion that your potential for scaling on AppEngine is much higher for a given set of resources, if your application fits within the constraints of the “ideal application set” for AppEngine. There are some upcoming technologies that I’m not allowed to talk about right now that will expand this set of ideal applications for AppEngine dramatically.

Reliability and availability

As for reliability and availability, it’s not an exact comparison here again. On EC2, an instance will fail from time-to-time for no reason. In some cases it’s just a router failure and the instance comes back in a few minutes to a few hours later. Other times, the instance just dies, taking its ephemeral state with it. In our large DotSpots fleet, we’d see a machine lock up or disappear somewhere around once a 1 month or so. The overall failure rate here was pretty low, but enough that you need to keep on your toes for monitoring and backups. We did have a catastrophic failure while using Elastic Block Store that effectively wiped out all of the data we were storing on it - that was the last time we used EBS (in fairness, that was in the early days of EBS and this probably not as likely to happen again).

On AppEngine, availability is a bit of a different story. Up until the new High-replication datastore, you’d be forced to go down every time the Datastore went into maintenance - a few hours a month. With the new HR datastore, this downtime is effectively gone, in exchange for slightly higher transaction processing fees on Datastore operations. These fees are negligible overall and definitely worth the tradeoff for increased reliability.

AppEngine had some rough patches for Datastore reliability around last September, but these have pretty much disappeared for us. Google’s AppEngine team has been working hard behind the scenes to keep it ticking well. There are some mysterious failures in application publishing from time-to-time on AppEngine. They happen for a few minutes to a few hours at a time, then get resolved as someone internally at Google fixes it. These publish failures don’t affect your running code - just your ability to publish new code. We’re doing continuous deployment on AppEngine, so this affects us more than others.

If you measure reliability in terms of the stress imposed on developers keeping the application running, AppEngine is a clear winner in my mind. If you measure it by the time that your application isn’t unavailable from forces beyond your control, EC2 wins (but only by a small amount, and by a much smaller margin when comparing the HR datastore).

Follow me (@mmastrac) on Twitter.

Now that's a prediction

Amusing: IBM’s Jeopardy-winning knowledge bot was named Watson, after IBM’s first president, Dr. Thomas J. Watson.

In Robert Heinlein’s 1966 novel, The Moon is a Harsh Mistress, one of the main characters is named Mike, nicknamed so after Mycroft Holmes in a short story written by the same Dr. Watson, before founding IBM:

Mike was not official name; I had nicknamed him for Mycroft Holmes, in a story written by Dr. Watson before he founded IBM. This story character would just sit and think—and that’s what Mike did. Mike was a fair dinkum thinkum, sharpest computer you’ll ever meet.

Mike is a computer, of the HOLMES IV (High-Optional, Logical, Multi-Evaluating Supervisor, Mark IV, Mod. L.) variety. And it turns out that Mike is designed to answer messy natural-language-type questions:

Remember Mike was designed, even before augmented, to answer questions tentatively on insufficient data like you do; that’s “high optional” and “multi-evaluating” part of name. So Mike started with “free will” and acquired more as he was added to and as he learned—and don’t ask me to define “free will.” If comforts you to think of Mike as simply tossing random numbers in air and switching circuits to match, please do.

By then Mike had voder-vocoder circuits supplementing his read-outs, print-outs, and decision-action boxes, and could understand not only classic programming but also Loglan and English, and could accept other languages and was doing technical translating—and reading endlessly. But in giving him instructions was safer to use Loglan. If you spoke English, results might be whimsical; multi-valued nature of English gave option circuits too much leeway.

There’s one big difference between the two: in the story, Mike “wakes up”.

Inception (the ARM implementation)

As a way of getting better acquainted with the ARM-based devices cluttering up my desk, I picked up this gem of a textbook at Amazon:

Reading through the interrupts section reveals an implementation of Inception on the ARM platform:

SSH escape sequences (or "don't kill -9 that process")

Back when I first started SSH (around the RedHat 4.x days), I’d occasionally be connected to another host via SSH when the host or the network connection would suddenly lock up. I’d end up trying to figure out which SSH process was the one that was frozen and kill -9ing it. That is, until someone showed me how to use SSH escape sequences. Occasionally I see people talking about killing frozen SSH sessions and it reminds me to pass on this tip.

If SSH is running on an interactive terminal, it listens for an escape character whenever it is listening for a the first character after a newline (or the first character in the stream). By default, this character is the tilde (~), but you can specify a different character using the -e argument.

You can get a list of escape sequences by typing “~?” after a newline:

Supported escape sequences:
  ~.  - terminate connection (and any multiplexed sessions)
  ~B  - send a BREAK to the remote system
  ~C  - open a command line
  ~R  - Request rekey (SSH protocol 2 only)
  ~^Z - suspend ssh
  ~#  - list forwarded connections
  ~&  - background ssh (when waiting for connections to terminate)
  ~?  - this message
  ~~  - send the escape character by typing it twice
(Note that escapes are only recognized immediately after newline.)

The one I use most frequently is ~.. This one kills the SSH terminal, along with any of the port forwardings you might have started as part of the command-line or in your .ssh/config file. ~& is also pretty useful when using port forwarding: it closes the SSH terminal, backgrounds SSH and leaves the port forwardings alive until the last one terminates.

These escape sequences will work, even if the underlying TCP/IP connection is toast and SSH seems completely unresponsive. If SSH isn’t picking up your escape sequences, make sure that you’re hitting the enter key first. It won’t pick up escape sequences in the middle of a line.

A week with a ChromeOS netbook

A box showed up earlier this week in the mail with an interesting set of markings. It wasn’t a big surprise - I’d been eagerly anticipating the arrival of a Cr-48 since it was shipped late last week.

Inside the box was the netbook, a set of instructions and the new set of Cr-48 decals. The decals are pretty flashy and look good, but I figured I’d wait a bit before putting them on (hey, this thing looks pretty good as-is).

The first thing you notice when starting the netbook up is that it’s fast. Pushing the power button to the firstboot or login screen is a matter of seconds. It’s the same while signing out or powering down. Oh, and the power button functions as a signout key as well. Hold it for a few seconds and it signs you out. Keep it held down a few more seconds and it powers down.

There aren’t a lot of surprises on this box. It’s basically a giant battery strapped to the Chrome browser. The battery is pretty amazing. Popping the power cord out yielded a runtime of just under eight hours when I first got the machine. A few discharge/charge cycles later and it’s sitting at more than eight hours.

Overall, the hardware is pretty decent. It’s an Atom N445 processor with 2GB RAM. It has 16GB of onboard solid-state storage. For comparison, the Dell Inspiron Mini 10 I just bought had a similar processor, but half the memory and way more storage (albeit spinning bits instead). The screen is really great and the keyboard is very comfortable to use.

I’ve heard bad things about the trackpad on the Cr-48. It seemed to be working well after I first started up the machine, but over time it’s clearly shaping up to be the weakest part of the system. The trackpad is unreliable at times. It gets stuck in a clicked state at time, where it thinks that you’re holding a finger down and moving a finger around starts selecting things. Other times it fails to recognize the two-finger right-click, making for a frustrating experience trying to copy and paste from one place to another.

Aside: I swear that when I first got this machine, the trackpad didn’t support the ability to click and drag by pushing down on the whole trackpad with one finger and dragging the other. This is working now and I can’t explain it. *shrug*

The Chrome browser runs fairly well on this hardware given the size of its CPU, but it’s definitely not as slick as Chrome on my Macbook. It can start to feel a bit sluggish when you end up with a number of tabs open. Sites that use position:fixed or background-attachment:fixed are terribly slow to scroll as well. I imagine that future versions of the OS will bring hardware-accelerated compositing to scrolling.

The netbook supports multiple users, but it can’t support more than one user logged in at a time. That’s likely to avoid having more than one user hogging the limited resources of the box. I’d really love to see something along the lines of tab hibernation used, instead of forcing one user to log out to let another log in. Once a user signs out, the state of their session should be persisted to disk locally and restored after they log in again.

I’ve been trying to get used to a world without any apps beyond the browser. It’s tough. I set up Guacamole to get access to a Linux desktop where I could run a bunch of applications that I need access to. As a developer, I can’t really live without a few desktop apps. If there were a way for me to get access to the applications on my desktop remotely, I’d be bringing this netbook everywhere instead of lugging around the much heavier Macbook Pro.

Overall, I’m really impressed with the ChromeOS netbook. It feels designed, not just made. I’m confident that a lot of the issues I’ve seen can be fixed in software updates. There are probably a lot of people that could make a switch full-time to this netbook. I’m not one of those right now, but I’d love to use something small like this for more of my computing needs.

This blog post was composed entirely on the Cr-48, including the awkward dance to download my previous Cr-48 pics from Twitpic and upload them into Wordpress.

Follow me on Twitter: @mmastrac.