AppEngine vs. EC2 (an attempt to compare apples to oranges)

March 1st, 2011

This is an expanded version of my answer on Quora to the question “You’ve used both AWS and GAE for startups: do they scale equally well in terms of availability and transaction volume?”:

I’ve had an opportunity to spend time building products on both EC2 (three years at DotSpots) and AppEngine (the last year). At DotSpots, we started using EC2 in the early days, back when it was just a few services: EC2, S3, SQS and SDB. It grew a great deal in those years, tacking on a number of useful services, some of which we used (<3 CloudFront) and a number that we didn’t. Last year, around April, we switched over to building on AppEngine and I’ve come to appreciate how great this platform is and how much more time I spend building product rather than infrastructure. Other developers enjoy building custom infrastructure, but I’m happy to outsource it to Google.

Given these two technologies, it’s difficult to compare them directly because they are two different beasts: EC2 is a general purpose virtual machine host, while AppEngine is a very sophisticated application host. AppEngine comes with a number of services out-of-the-box. The Amazon Web Services suite tacks a number of utilities onto EC2 that give you access to structured, query-able storage, automatic scaling at the VM level, monitoring and other goodies too numerous to mention here.

Transaction Volume

When dealing with AppEngine, the limit to your scaling is effectively determined by how well you program to the AppEngine environment. This means that you must be aware of how transactions are processed at the entity group level and make judicious use of the AppEngine memcache service. If you program well against the architecture and best practices of AppEngine, you have the potential of scaling as well as some of Google’s own properties.
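To make the memcache advice concrete, here is a minimal read-through cache sketch in Python 3. It is not the persistence code described in this post and it does not use the AppEngine APIs: `FakeDatastore` and `ReadThroughCache` are hypothetical names, and plain dictionaries stand in for both the datastore and memcache.

```python
class FakeDatastore:
    """Stand-in for a slow persistent store (an assumption for this sketch).

    It counts reads so the effect of the cache is visible.
    """

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class ReadThroughCache:
    """Serve entities from an in-memory cache, falling back to the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # plays the role of memcache in this sketch

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: no datastore read
        value = self.store.get(key)     # cache miss: one datastore read
        if value is not None:
            self.cache[key] = value     # keep the entity hot for next time
        return value

    def put(self, key, value):
        # Write to the store first, then refresh the cached copy.
        self.store.rows[key] = value
        self.cache[key] = value


store = FakeDatastore({"user:1": "alice"})
entities = ReadThroughCache(store)
entities.get("user:1")
entities.get("user:1")
print(store.reads)  # only the first lookup reached the store
```

The point of the pattern is that repeated lookups of the same entity during a page render cost a single datastore read.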

Here’s an example of one of the more expensive pages we render. Our persistence code automatically keeps entities hot in memcache to avoid hitting the datastore more than a few times while rendering a page:

On EC2, scaling is entirely in your hands. If you are a MySQL wizard, you can scale that part of the stack well. Scaling is limited only by the constraints of the boxes you rent from Amazon, half by your skill and half by your creativity in making them perform well. At DotSpots, we spent a fair bit of time scaling MySQL up for our web crawling activities. We built out custom infrastructure for serving key/value data fast. There was infrastructure all over the place just to keep up with what we wanted to do. Our team put a lot of work into it, but at the end of the day, it was fast.

It’s my opinion that your potential for scaling on AppEngine is much higher for a given set of resources, if your application fits within the constraints of the “ideal application set” for AppEngine. There are some upcoming technologies that I’m not allowed to talk about right now that will expand this set of ideal applications for AppEngine dramatically.

Reliability and availability

As for reliability and availability, it’s again not an exact comparison. On EC2, an instance will fail from time to time for no apparent reason. In some cases it’s just a router failure and the instance comes back a few minutes to a few hours later. Other times, the instance just dies, taking its ephemeral state with it. In our large DotSpots fleet, we’d see a machine lock up or disappear about once a month. The overall failure rate here was pretty low, but high enough that you need to stay on your toes with monitoring and backups. We did have a catastrophic failure while using Elastic Block Store that effectively wiped out all of the data we were storing on it – that was the last time we used EBS (in fairness, that was in the early days of EBS and this is probably not as likely to happen again).

On AppEngine, availability is a bit of a different story. Up until the new High-replication datastore, you’d be forced to go down every time the Datastore went into maintenance – a few hours a month. With the new HR datastore, this downtime is effectively gone, in exchange for slightly higher transaction processing fees on Datastore operations. These fees are negligible overall and definitely worth the tradeoff for increased reliability.

AppEngine had some rough patches for Datastore reliability around last September, but these have pretty much disappeared for us. Google’s AppEngine team has been working hard behind the scenes to keep it ticking well. There are some mysterious failures in application publishing from time-to-time on AppEngine. They happen for a few minutes to a few hours at a time, then get resolved as someone internally at Google fixes it. These publish failures don’t affect your running code – just your ability to publish new code. We’re doing continuous deployment on AppEngine, so this affects us more than others.

If you measure reliability in terms of the stress imposed on developers keeping the application running, AppEngine is a clear winner in my mind. If you measure it by the time that your application is unavailable due to forces beyond your control, EC2 wins (but only by a small amount, and by a much smaller margin when comparing against the HR datastore).

Follow me (@mmastrac) on Twitter and let us know what you think!

View/vote on HackerNews

(this is Thing A Week #6)

Now that’s a prediction

February 28th, 2011

Amusing: IBM’s Jeopardy-winning knowledge bot was named Watson, after IBM’s first president, Dr. Thomas J. Watson.

In Robert Heinlein’s 1966 novel, The Moon is a Harsh Mistress, one of the main characters is named Mike, nicknamed so after Mycroft Holmes in a short story written by the same Dr. Watson, before founding IBM:

Mike was not official name; I had nicknamed him for Mycroft Holmes, in a story written by Dr. Watson before he founded IBM. This story character would just sit and think—and that’s what Mike did. Mike was a fair dinkum thinkum, sharpest computer you’ll ever meet.

Mike is a computer, of the HOLMES IV (High-Optional, Logical, Multi-Evaluating Supervisor, Mark IV, Mod. L.) variety. And it turns out that Mike is designed to answer messy natural-language-type questions:

Remember Mike was designed, even before augmented, to answer questions tentatively on insufficient data like you do; that’s “high optional” and “multi-evaluating” part of name. So Mike started with “free will” and acquired more as he was added to and as he learned—and don’t ask me to define “free will.” If comforts you to think of Mike as simply tossing random numbers in air and switching circuits to match, please do.

By then Mike had voder-vocoder circuits supplementing his read-outs, print-outs, and decision-action boxes, and could understand not only classic programming but also Loglan and English, and could accept other languages and was doing technical translating—and reading endlessly. But in giving him instructions was safer to use Loglan. If you spoke English, results might be whimsical; multi-valued nature of English gave option circuits too much leeway.

There’s one big difference between the two: in the story, Mike “wakes up”.

Inception (the ARM implementation)

February 27th, 2011

Warning: serious geek content ahead.

As a way of getting better acquainted with the ARM-based devices cluttering up my desk, I picked up this gem of a textbook at Amazon:

Reading through the interrupts section reveals an implementation of Inception on the ARM platform:

(this is Thing A Week #5)

SSH escape sequences (or “don’t kill -9 that process”)

February 23rd, 2011

Meta: I’m a bit behind on “thing-a-week” posts due to my cold from hell last week. I’ll be packing a few more into this week to make up for it.

Back when I first started using SSH (around the RedHat 4.x days), I’d occasionally be connected to another host via SSH when the host or the network connection would suddenly lock up. I’d end up trying to figure out which SSH process was the one that was frozen and kill -9ing it. That is, until someone showed me how to use SSH escape sequences. Occasionally I see people talking about killing frozen SSH sessions and it reminds me to pass on this tip.

If SSH is running on an interactive terminal, it watches for an escape character as the first character after a newline (or the first character in the stream). By default, this character is the tilde (~), but you can specify a different character using the -e argument.

You can get a list of escape sequences by typing “~?” after a newline:

Supported escape sequences:
  ~.  - terminate connection (and any multiplexed sessions)
  ~B  - send a BREAK to the remote system
  ~C  - open a command line
  ~R  - Request rekey (SSH protocol 2 only)
  ~^Z - suspend ssh
  ~#  - list forwarded connections
  ~&  - background ssh (when waiting for connections to terminate)
  ~?  - this message
  ~~  - send the escape character by typing it twice
(Note that escapes are only recognized immediately after newline.)

The one I use most frequently is “~.”. This one kills the SSH terminal, along with any of the port forwardings you might have started as part of the command-line or in your .ssh/config file. “~&” is also pretty useful when using port forwarding: it closes the SSH terminal, backgrounds SSH and leaves the port forwardings alive until the last one terminates.

These escape sequences will work, even if the underlying TCP/IP connection is toast and SSH seems completely unresponsive. If SSH isn’t picking up your escape sequences, make sure that you’re hitting the enter key first. It won’t pick up escape sequences in the middle of a line.
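The “only after a newline” rule is easy to picture as a tiny state machine. The Python 3 sketch below is a toy model, not OpenSSH’s actual implementation: `find_escapes` is a hypothetical helper, and its set of known commands is simply copied from the help text above (leaving out ~^Z for brevity).

```python
# Commands taken from the "Supported escape sequences" list above (minus ~^Z).
KNOWN_COMMANDS = set(".BCR#&?")


def find_escapes(keystrokes, escape="~"):
    """Return the escape sequences a toy client would act on.

    An escape is only recognized as the first character of the stream
    or directly after a newline, mirroring the rule described above.
    """
    recognized = []
    at_line_start = True   # the start of the stream counts as a line start
    pending = False        # True right after an escape character at line start
    for ch in keystrokes:
        if pending:
            pending = False
            if ch in KNOWN_COMMANDS:
                recognized.append(escape + ch)
                at_line_start = False
            elif ch == escape:
                at_line_start = False   # typed twice: a literal escape character
            else:
                at_line_start = ch in "\r\n"   # unknown: treated as ordinary input
            continue
        if at_line_start and ch == escape:
            pending = True
            continue
        at_line_start = ch in "\r\n"
    return recognized


print(find_escapes("ls\n~."))   # escape typed right after Enter
print(find_escapes("echo ~."))  # escape typed mid-line is ignored
```

This is why hitting Enter before “~.” matters: in the second call the tilde sits in the middle of a line, so nothing is recognized.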

(this is Thing A Week #4)

A week with a ChromeOS netbook

February 5th, 2011

Meta: apologies for taking so long to approve comments on the blog. I haven’t set WordPress up to notify me by mail of new comments, so it takes a bit of time to notice them.

A box showed up earlier this week in the mail with an interesting set of markings. It wasn’t a big surprise – I’d been eagerly anticipating the arrival of a Cr-48 since it was shipped late last week.

Inside the box was the netbook, a set of instructions and the new set of Cr-48 decals. The decals are pretty flashy and look good, but I figured I’d wait a bit before putting them on (hey, this thing looks pretty good as-is).

The first thing you notice when starting the netbook up is that it’s fast. Going from pushing the power button to the first-boot or login screen takes a matter of seconds. It’s the same while signing out or powering down. Oh, and the power button functions as a signout key as well. Hold it for a few seconds and it signs you out. Keep it held down a few more seconds and it powers down.

There aren’t a lot of surprises on this box. It’s basically a giant battery strapped to the Chrome browser. The battery is pretty amazing. Popping the power cord out yielded a runtime of just under eight hours when I first got the machine. A few discharge/charge cycles later and it’s sitting at more than eight hours.

Overall, the hardware is pretty decent. It’s an Atom N455 processor with 2GB RAM. It has 16GB of onboard solid-state storage. For comparison, the Dell Inspiron Mini 10 I just bought had a similar processor, but half the memory and way more storage (albeit spinning bits instead). The screen is really great and the keyboard is very comfortable to use.

I’ve heard bad things about the trackpad on the Cr-48. It seemed to be working well after I first started up the machine, but over time it’s clearly shaping up to be the weakest part of the system. The trackpad is unreliable at times. It gets stuck in a clicked state at times, where it thinks that you’re holding a finger down, and moving a finger around starts selecting things. Other times it fails to recognize the two-finger right-click, making for a frustrating experience trying to copy and paste from one place to another.

Aside: I swear that when I first got this machine, the trackpad didn’t support the ability to click and drag by pushing down on the whole trackpad with one finger and dragging the other. This is working now and I can’t explain it. *shrug*

The Chrome browser runs fairly well on this hardware given the size of its CPU, but it’s definitely not as slick as Chrome on my Macbook. It can start to feel a bit sluggish when you end up with a number of tabs open. Sites that use position:fixed or background-attachment:fixed are terribly slow to scroll as well. I imagine that future versions of the OS will bring hardware-accelerated compositing to scrolling.

The netbook supports multiple users, but it can’t support more than one user logged in at a time. That’s likely to avoid having more than one user hogging the limited resources of the box. I’d really love to see something along the lines of tab hibernation used, instead of forcing one user to log out to let another log in. Once a user signs out, the state of their session should be persisted to disk locally and restored after they log in again.

I’ve been trying to get used to a world without any apps beyond the browser. It’s tough. I set up Guacamole to get access to a Linux desktop where I could run a bunch of applications that I need access to. As a developer, I can’t really live without a few desktop apps. If there were a way for me to get access to the applications on my desktop remotely, I’d be bringing this netbook everywhere instead of lugging around the much heavier Macbook Pro.

Overall, I’m really impressed with the ChromeOS netbook. It feels designed, not just made. I’m confident that a lot of the issues I’ve seen can be fixed in software updates. There are probably a lot of people that could make a switch full-time to this netbook. I’m not one of those right now, but I’d love to use something small like this for more of my computing needs.

This blog post was composed entirely on the Cr-48, including the awkward dance to download my previous Cr-48 pics from Twitpic and upload them into WordPress.

Follow me on Twitter: @mmastrac and check out my latest project,

(this is Thing A Week #3)

The heartbreak of the pivot

January 23rd, 2011

(This was originally a response to a question asked of me on Quora, but I’ve copied it here to my blog so I can expand on it)

“Pivot” is a big catchphrase in the startup world these days. It has a glamorous feel to it: you picture the Twitter team dropping Odeo on the ground and moving on to the brand new idea without a moment’s thought. There’s more to it than that.

In May of 2010 we stopped spending all of our resources on DotSpots and started working on our new idea. By our September launch at TechCrunch Disrupt, we were focusing all of our efforts on it and decided that we’d start the process of shutting down DotSpots.

Moving on from DotSpots was bittersweet. We spent a few years pouring our heart into the product, which we all honestly believed would revolutionize how people consumed news. Unfortunately we underestimated how difficult it would be to get publishers to sign on and missed the window overall for success.

Another thing that’s hard for me is having the great GWT-based infrastructure we built to pop up UI on top of any webpage basically sitting idle. I’d love to open-source that one day, but it’s not pressing right now. Perhaps if the right project comes along, we might be able to hand off some of it to them.

We did manage to use some of the DotSpots technology to bootstrap our efforts at The core web infrastructure for serving pages is basically what we were going to use for serving I’ll be opening that up at some point in the future at

The great part about moving on is that you have a chance to let go of all the future plans that were weighing on your mind. After a few years at a startup, you’ve got stuff that you’ve always wanted to do but never had time for. Pivoting to a new idea lets you finally let go of those in your mind.

You also have a chance to make your core technology choices again. At DotSpots, we built everything in Java running Jetty on EC2. This time around, we’ve chosen to work on AppEngine, which opens up so many exciting possibilities with all of the technology Google is hooking up to it.

It’s also a great feeling to be working on an application that’s seeing significant traction in the market. We blew by the total lifetime number of users of DotSpots within a few months and continue to get great press and a continuous stream of signups. It makes you feel good seeing numbers that show people are using the application and it’s making a positive impact on their lives. I would have loved for DotSpots to have a positive impact too, but it just never caught on.

In the end, it’s an awesome move for us. On the other hand it’s a bit sad – like moving from a house you’ve lived in for years and have built memories in.

Follow me on Twitter: @mmastrac and check out my latest project,

(this is Thing A Week #2)

A Thing A Week for 2011

January 17th, 2011

Inspired by Andrew Brown, I’m going to try something new in 2011: A Thing A Week. Rather than putting out blog posts randomly, I’ll focus on putting one out every week on whatever subject is easiest to write about. With practice, the effort of writing should get easier.

As someone deeply interested in hacking stuff, most of my posts will likely be focused on programming and new technology gadgets (lots of Android stuff!).

In the spirit of things, I’m calling this Thing A Week #1.

Solving the Carwoo code-breaking challenge

December 19th, 2010

Erik at Carwoo posted an interesting codebreaking challenge to the Carwoo blog earlier today.

This particular challenge piqued my curiosity. One of my hobbies is rooting Android phones and I enjoy all sorts of reverse-engineering problems. More importantly, this one happened to match up with some downtime I had this afternoon.

I fired up Python and quickly got to work. At my day job, we use Java. I love Java as a development language, but I find that Python’s interactive mode is best for quickly experimenting with data in unknown formats. There’s a great deal of power in the standard Python libraries and it’s well-suited for quick byte and bit manipulation.

The first thing I did was to take a look at the individual bytes of the base64 decoded data:

>>> min(s), max(s)

The range of 32 (‘ ‘) to 90 (‘Z’) suggested that the bytes had been coerced to a printable range from their natural, though binary-uncomfortable, base-59 representation. I worked from this part on assuming that this was true as it held for the challenge itself, as well as the various sample encodings:

>>> qbf
[1, 24, 38, 18, 30, 16, 43, 40, 15, 39, 35, 29, 28, 46, 48, 32, 33, 7, 19, 12, 45, 2, 1, 39, 4, 45, 45, 53, 23, 56, 41, 14, 47, 45, 23, 37, 47, 49, 34, 53, 43, 44, 13, 48, 54, 34, 30, 7, 22, 32, 52, 0, 27, 10, 13, 35, 51, 14, 21, 34, 5, 15, 27, 2, 44, 7, 47, 49, 31, 14, 6, 49, 3, 49, 24, 1, 25, 33, 15, 16, 55, 21, 46, 50, 20, 40, 6, 52, 19, 28, 32, 51, 20, 47, 14, 16, 12, 48, 15, 15, 42, 1, 21, 21, 17, 35, 21, 22, 7, 15, 22, 13, 49, 9, 23, 40, 9, 42, 27, 23, 20, 1, 25, 35, 57, 55, 34, 53, 16, 49, 55, 21, 17, 33, 55, 20, 18, 27, 13, 36, 7, 27, 47, 33, 40, 34, 29, 57, 55, 45, 38, 43, 46, 45, 23, 43, 22, 34, 30, 30, 30, 39, 0, 27, 42, 43, 54, 4, 41, 54, 20, 3, 42, 49, 29, 27, 56, 53, 6, 29, 11, 8, 57, 52, 32, 10, 7, 9, 11, 5, 2, 9, 2, 23, 12, 3, 39, 4, 56, 30, 18, 42, 52, 40, 2, 32, 1, 51, 12, 30, 53, 11, 52, 33, 54, 31, 22, 5, 9, 20, 54, 53, 18, 11, 28, 3, 9, 55, 18, 5, 29, 53, 49, 42, 40, 50, 23, 58, 34, 23, 43, 32, 31, 33, 19, 1, 2, 41, 14, 31, 42, 37, 17, 50, 42, 34, 12, 56, 5, 1, 38, 7, 31, 30, 44, 31, 29, 56, 56, 14, 7, 4, 50, 7, 56, 56, 2, 50, 8, 52, 33, 53, 14, 53, 16, 26, 23, 20, 26, 14, 2, 32, 25, 45, 25, 37, 47, 41, 39, 39, 26, 35, 25, 18, 31, 43, 58, 9, 34, 18, 57, 6, 30, 29, 55, 19, 42, 13, 42, 12, 6, 8, 53, 24, 7, 15, 40, 7, 23, 25, 58, 33, 53, 35, 53, 37, 28, 49, 32, 35, 43, 39, 24, 9, 56, 12, 25, 5, 28, 27, 31, 47, 48, 47, 43, 40, 51, 28, 16, 2, 58, 36, 33, 33, 53, 50, 25, 18, 1, 10, 28, 33, 39, 38, 45, 23, 10, 49, 23, 28, 12, 17, 50]
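For reference, the list above can be reproduced from the base64 text by subtracting 32 from each decoded byte. Here is a small Python 3 sketch that uses only the first 16 base64 characters of the sample encoding (the prefix is copied from the program further down; the variable names are illustrative):

```python
from base64 import b64decode

# First 16 base64 characters of the "quick brown fox" sample encoding.
prefix = "IThGMj4wS0gvR0M9"

raw = b64decode(prefix)          # printable bytes in the range 32..90
values = [b - 32 for b in raw]   # shift down to base-59 digits (0..58)

print(raw)
print(values)
```

The resulting values match the first twelve entries of `qbf`, and the first five bytes are the “!8F2>” from the hint.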

It didn’t appear to be a simple encoding as the number of bits wasn’t a nice, round number:

>>> len(qbf)/43.

I explored the possibility that it was a base-59 encoded version of something as well, but nothing seemed to pop out (none of the base-59 numbers had promising base-2, 10 or 16 representations):

>>> reduce(lambda x,y: x*59+y, qbf)
>>> hex(reduce(lambda x,y: x*59+y, qbf))
>>> bin(reduce(lambda x,y: x*59+y, qbf))

I also used numpy’s histogram function to see if there were signs of a simple substitution. While there were spikes of various values, nothing seemed to be significant when you compared the histograms of the various encodings of “THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG”.

The fact that more than eight output positions on average represented a single input position strongly suggested that there was some sort of reduction involved in decoding. I tried a number of different approaches with bit counting and different groupings, but none yielded any promising results.

The breakthrough came about a minute after Erik posted his last hint:

If you base64 decode this


The first 5 characters are “!8F2>”

Those 5 characters represent the first “T”. Another “T” won’t necessarily result in the same code. Actually, the odds of ever seeing that code again for a “T” is low.

I ran through a few quick tests and discovered the first part of the solution: the sum of the first five characters, modulo 59, equalled the first character of the output!

>>> sum(qbf[:5])
>>> sum(qbf[:5])%59
>>> ord('T')-32

I went through the other versions of the sample encoding and saw that the same held for the first five digits. I checked the next five and didn’t see the correlation there. It also didn’t hold for the next six. The pattern *did* hold for the next seven. Even more tantalizing was that the challenge itself appeared to follow the same pattern:

>>> chr(sum(orig[:5])%59+32)
>>> chr(sum(orig[5:12])%59+32)
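The same checks can be replayed on the sample encoding with a few lines of Python 3. The helper name `group_to_char` is made up for this sketch; the numbers are the first twelve entries of `qbf` shown earlier.

```python
# First twelve base-59 values of the "quick brown fox" sample (from qbf above).
first_twelve = [1, 24, 38, 18, 30, 16, 43, 40, 15, 39, 35, 29]


def group_to_char(group):
    """Sum a group of values, reduce modulo 59, and shift back to ASCII."""
    return chr(sum(group) % 59 + 32)


print(group_to_char(first_twelve[:5]))    # sum 111 -> 52 -> 'T'
print(group_to_char(first_twelve[5:10]))  # sum 153 -> 35 -> 'C', not a match for 'H'
print(group_to_char(first_twelve[5:11]))  # sum 188 -> 11 -> '+', not a match either
print(group_to_char(first_twelve[5:12]))  # sum 217 -> 40 -> 'H'
```

A group of five gives the “T”, and only a group of seven gives the “H” that follows it.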

At this point I figured that I’d write a quick program to brute force the rest of the string.

from base64 import *
qbf = [ord(x)-32 for x in b64decode("""IThGMj4wS0gvR0M9PE5QQEEnMyxNIiFHJE1NVTdYSS5PTTdFT1FCVUtMLVBW Qj4nNkBUIDsqLUNTLjVCJS87IkwnT1E/LiZRI1E4ITlBLzBXNU5SNEgmVDM8 QFM0Ty4wLFAvL0ohNTUxQzU2Jy82LVEpN0gpSjs3NCE5Q1lXQlUwUVc1MUFX

orig = [ord(x)-32 for x in b64decode("""QCgyPSghK1kwTlRJO0MrVTtZLUxWLCg3MypEUCNQJjNGTlMuIyU8TzwpSDNP ND0tRFg7RkUoJjotJzQwIz8pJUFCODtQI0AyPCFaTkNTUyQ/PUgqN0VKLCEi VUxWRjo2SUYnSiAiMkdGVCpGSCw+NDAqVyhUIEAtWTVMRiQzMi8jIiMuPyFP

text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
output = ""

i = 0
start = 0
text_i = 0

while True:
        c = chr(sum(qbf[start:i])%59+32)
        print sum(qbf[start:i]), c, text[text_i]
        if c == text[text_i]:
                print "Found at %d-%d %d" % (start,i,i-start)
                output += chr(sum(orig[start:i])%59+32)
                start = i
                i = i + 4
                text_i = text_i + 1
                print output

        i = i + 1

The output from this test program yielded most of the solution:

0   T
1 ! T
25 9 T
63 $ T
81 6 T
111 T T
Found at 0-5 5
153 C H
188 + H
217 H H
Found at 5-12 7
187 * E
194 1 E
213 D E
225 P E
270 B E
272 D E
273 E E
Found at 12-23 11

... [a few dozen lines snipped] ...

Found at 347-360 13
213 D D
Found at 360-365 5
104 M O
132 . O
165 O O

The solution wasn’t perfect at this point, but when you look at the output, the pattern becomes pretty clear:

Found at 0-5 5
Found at 5-12 7
Found at 12-23 11
Found at 23-36 13
Found at 36-41 5
Found at 41-48 7
Found at 48-59 11
Found at 59-72 13
Found at 72-77 5
Found at 77-84 7
Found at 84-95 11
Found at 95-108 13
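The search that produced this log can be condensed into a short Python 3 function. This is a sketch rather than the original script: `find_group_lengths` is a hypothetical name, the minimum group length of five mirrors the way the script above skips ahead after each match, and the input is just the first 36 values of `qbf`.

```python
def find_group_lengths(values, known_text, min_len=5):
    """Find group lengths whose sums (mod 59) spell out known_text."""
    lengths = []
    start = 0
    for ch in known_text:
        target = ord(ch) - 32
        # Try progressively longer groups until the sum matches the target.
        for end in range(start + min_len, len(values) + 1):
            if sum(values[start:end]) % 59 == target:
                lengths.append(end - start)
                start = end
                break
        else:
            break  # ran out of values without finding a match
    return lengths


# First 36 values of the "quick brown fox" sample, i.e. the groups for "THE ".
sample = [1, 24, 38, 18, 30, 16, 43, 40, 15, 39, 35, 29,
          28, 46, 48, 32, 33, 7, 19, 12, 45, 2, 1,
          39, 4, 45, 45, 53, 23, 56, 41, 14, 47, 45, 23, 37]

print(find_group_lengths(sample, "THE "))
```

On this input the search recovers the repeating 5, 7, 11, 13 cycle seen in the log.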

From the pattern, I quickly whipped up a solution. It’s not the most elegant bit of Python, but the results are the most important part:

from base64 import *


def decode(code):
        values = [ord(a)-32 for a in b64decode(code)]
        offsets = [5,7,11,13]*len(code) # note: won't need all of these, but guaranteed more than enough
        decoded = ""
        while len(values):
                decoded += chr(sum(values[:offsets[0]])%59+32)
                values = values[offsets[0]:]
                offsets = offsets[1:]
        return decoded

print decode(qbf)
print decode(orig)
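For readers on Python 3, the same idea can be sketched over an already-decoded list of values, using `itertools.cycle` in place of the repeated offsets list. The name `decode_values` is an assumption of this sketch, and the sample is only the first 36 values of `qbf`, so it decodes just the first four characters.

```python
from itertools import cycle


def decode_values(values, group_sizes=(5, 7, 11, 13)):
    """Decode base-59 values by summing groups of cycling sizes."""
    decoded = []
    pos = 0
    for size in cycle(group_sizes):
        if pos >= len(values):
            break
        # Each group sums (mod 59) to one output character, offset by 32.
        decoded.append(chr(sum(values[pos:pos + size]) % 59 + 32))
        pos += size
    return "".join(decoded)


sample = [1, 24, 38, 18, 30, 16, 43, 40, 15, 39, 35, 29,
          28, 46, 48, 32, 33, 7, 19, 12, 45, 2, 1,
          39, 4, 45, 45, 53, 23, 56, 41, 14, 47, 45, 23, 37]

print(repr(decode_values(sample)))
```

The four groups decode to “THE ” (with the trailing space), matching the start of the known plaintext.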

The output, which I posted quickly to the blog and the HN story:


And thanks to Mark Trapp, the classic Seinfeld scene:

AppEngine: “Peace-of-mind scalability”

November 25th, 2010

I had a lot of great feedback on my AppEngine post the other day. We put our trust in the AppEngine team to keep things running smoothly while we work on our app and live the rest of our lives. Today is pretty quiet around the Gripe virtual offices (aka Skype channels): it’s Thanksgiving in the US and I’m getting hit by a cold too hard to do much today besides write a short post.

We had a great surprise this morning. The View episode we were featured in was in re-runs this morning and we had a huge traffic spike. Nobody noticed until about 30 minutes in, since everything was scaling automatically and the site was still serving up as quickly as ever:

This is a whole new way to build a startup: no surprise infrastructure worries, no pager duty, no getting crushed by surprise traffic. It’s peace-of-mind scalability.

Now back to fighting my cold, without the worry of our site’s load hanging over my head. For those of you in the US, enjoy your Thanksgiving and watch out for Turkey drops!

WKRP Turkey Drop from Mitch Cohen on Vimeo.

Why we’re really happy with AppEngine (and not going anywhere else)

November 23rd, 2010

There’s been a handful of articles critical of Google’s AppEngine that have bubbled up to the top of Hacker News lately. I’d like to throw our story into the ring as well, but as a story of a happy customer rather than a switcher.

We’ve been building out our product since last May, after pivoting from our previous project, DotSpots. The only resource available to develop Gripe in the early days was myself. I’d been playing with AppEngine on and off, creating a few small applications to get a feel for AppEngine’s strengths and weaknesses since its initial Python release. We were also invited to the early Java pre-release at DotSpots, but it would have been too much effort to make the switch from our Spring/MySQL platform to AppEngine.

My early experiments on AppEngine near its first release showed that it was promising, but was still an early release product. Earlier this year, I started work on a small personal-use aggregator that I’ve always wanted to write. I targeted AppEngine again and I was pleasantly surprised at how far the platform had matured. It was ready for us to test further if we wanted to tackle projects with it.

Shortly after that last experiment, one of our interns at DotSpots came to us with an interesting idea: a social, mobile complaint application that we eventually named Gri.pe. We picked AppEngine as our target platform for the new product, given the platform’s new maturity. It also helped that as the sole developer on the first part of the project, I wanted to focus on building the application rather than spending time building out EC2 infrastructure and the ops work that goes along with productizing your startup idea. I prototyped it on the side for a few months with our designer and once we determined that it was a viable product, we decided to focus more of the company’s effort on it.

There were a number of great bonuses to choosing AppEngine as well. We’ve been wanting to get out of the release-cycle treadmill that was killing us at DotSpots and move to a continuous deployment environment. AppEngine’s one-liner deployment and automated versioning made this a snap (I hope to detail our Hudson integration in another blog post). The new task queue functionality in AppEngine let us do stuff asynchronously as we always wanted to do at DotSpots, but found to be awkward to automate with existing tools like Quartz. The AppEngine Blobstore does the grunt work of dealing with our image attachments without us having to worry about signing S3 requests (in fairness, we’re using S3 signed requests for our new video upload feature, but the Blobstore let us launch image attachments with a single day’s work).

When it came time for us to launch at TechCrunch Disrupt this year, I was a bit concerned about how AppEngine would deal with the onslaught of traffic. The folks on the AppEngine team assured me that as long as we weren’t doing anything to cause a lot of write contention on single entity groups, we’d scale just fine. And scale we did:

In the days after our launch, AppEngine hit a severe bit of datastore turbulence. There was the occasional latency spike on AppEngine while we were developing, but late September/early October was much rougher. Simple queries that normally took 50ms would skyrocket up to 5s. Occasionally they would even time out. Our application was still available, but we were seeing significant error rates all over. We considered our options at that point, and decided to stick it out.

Shortly after the rough period started, the AppEngine team fixed the issues. And shortly after that, a bit of datastore maintenance chopped the already good latencies down even further. It’s been smooth sailing since then and the AppEngine team has been committed to improving the datastore situation even more as time goes on:

We didn’t jump ship on AppEngine for one rough week because we knew that their team was committed to fixing things. We’ve also had our rough weeks with the other cloud providers. In 2009, Amazon lost one of our EBS drives while we were prototyping DotSpots infrastructure on it. Not just a crash with dataloss, but actually lost. The whole EBS volume was no longer available at all. We’ve also had weeks where EC2 instances had random routing issues between them, or would lock up or get wedged with no indication of problems on Amazon’s side. Slicehost had problems with our virtual machines losing connectivity to various parts of the globe.

“Every cloud provider has problems” isn’t an excuse for any provider to get sloppy. It’s an understanding I have as the CTO of a company that bases its future on any platform. No matter which provider we choose, we are putting some faith in a third party that they will solve the problems as they come up. As a small startup, it makes more sense for us to outsource management of IT issues than to spend 1/2 of an engineer’s time dealing with this. We’ve effectively hired them to deal with managing the hard parts of our scalability infrastructure and they are working for a fraction of what it would cost to do this ourselves.

Putting more control in the hands of a third party means that you have to give up the feeling of being in control of every aspect of your startup. If your self-managed colo machine dies, you might be down for hours while you get your hands dirty fixing, reinstalling or repairing it. When you hand this off to Google (or Amazon, or Slicehost, or Heroku), you give up the ability to work through a problem yourself. It took some time for me to get used to this feeling, but the AppEngine team has done amazing work in gaining the trust of our organization.

Since that rough week in September, we’ve had fantastic service from AppEngine. Our application has been available 100% of the time and our page rendering times are way down. We’re committing 100% of our development resources to banging out new features and not having to worry about infrastructure.

On top of the great steady-state service, we were mentioned on ABC’s The View and had a massive surge in traffic which was handled flawlessly by AppEngine. It transparently scaled us up to 20+ instances that handled all of the traffic without a sweat. In the end, this surge cost us less than $1.00:

There’s a bunch of great features for AppEngine in the pipeline, some of which you can see in the 1.4 prerelease SDK and others that aren’t publicly available yet but address many of the issues and shortcomings of the AppEngine platform.

If you haven’t given AppEngine a shot, now is a great time.

Post-script: We do still use EC2 as part of our infrastructure. We have a small nginx instance set up to redirect our naked domain and deal with some other minor items. We also have EC2 boxes that run Solr to deal with some search infrastructure. We talk to Solr from GAE using standard HTTP. As mentioned in the article above, we also use S3 for video uploads and transcoding.

Follow @mmastrac on Twitter and let us know what you think!