N-Tier Services and Systems Complexity

As usual, this is an opinion-filled essay, with my own wacky outlook on things at Amazon, and plenty of half-baked assertions thrown in.

It's not easy to write about multi-tier architectures. You can spend months writing a book, perhaps one with Friendly Martians on the cover, or you can take a few cheap pot shots and get back to the business of bouncing your servers. I chose the latter.

Today's essay is mostly about the following question, so before you read on, ask yourself what you think the answer is:

Do Amazon's internal data services and their corresponding object-oriented APIs reduce or increase our overall systems complexity?

I'm writing this editorial because, to be quite honest, I thought the answer was patently obvious: that services have obviously increased our overall systems complexity.

But I've heard a few smart people suggest that the whole point of services was to reduce our complexity. They say that the services will hide the complexity of our code behind "well-defined interfaces". It's become a company-wide slogan. Everything will supposedly become easier to understand and maintain, and as a result, our quality measures (availability, data integrity, flexibility) will finally start to improve. Or so the theory goes.

I'm coming at this from the perspective of an app developer, having been in an app group — specifically, Customer Service Apps — for the past four years. We certainly don't see it that way. I think the majority of development at Amazon is done by application groups who are trying to support business objectives, rather than by infrastructure groups who support or monitor the app developers.

So the level of difficulty required in order to produce applications here is a pretty important topic.

Why Services?

So why did we launch Services, anyway? Anyone remember? By Services, I mean the big services like Customer Master Service (CMS), Order Master Service (OMS), Item Master Service (IMS), Customer Account Management Plus Something (CAMPS), and the like. These ideas got started back in the Dark Ages of Amazon, in early 1997 or so, and the implementations were finished, by and large, some years later.

If you weren't around here in 1997, you might surmise that we launched services to reduce our systems complexity. That's not why we did it, though.

We launched services because of a fundamental problem with our code organization called the 2-tier problem. 2-tier is the name we give to systems in which SQL is embedded in client code. It's pretty much that simple.

Hypothetical Example

I'll use a running example, customer hair color, to illustrate why we needed services.

Let's say you have a pressing business need to be able to look up customers by hair color. You have a database with customer information in it, and one of those bits of information is their hair color. Someone wants to send out a promotional mailing to everyone with red hair, so you write yourself a script that includes a snippet of SQL that might look something like this:

SELECT customer_name FROM customers WHERE hair_color='red'

Easy! No matter what language you use to make this query, you'll get back a list of names of customers who have red hair. As part of the configuration for this code, you need to specify which database to connect to, a username, and a password, all conveniently embedded in your code.[1]

Now let's suppose that some percentage of our red-headed customers reply to the promotional mail and say "hey, my promotion code didn't work." This rarely happens in practice, of course.[2] But when it does, we get a bunch of contacts, and CS reps need to investigate. The reps will probably need to look up a customer's order to see what happened.

Unfortunately, our customers aren't always as clued-in as we'd like them to be, so the most information they might be able to offer the CS rep is: "Well, um, I have red hair." So CS asks CS Apps to write an Arizona page, temporary of course, that pulls up all the orders placed by red-haired customers, so they can try to figure out which one it is.

CS Apps obligingly writes the following SQL query:

SELECT customer_name, order_id FROM customers, orders
WHERE blah blah blah AND hair_color='red'

   (Note: some details omitted.)

Well now, this is where the problem starts. You see, "customers" and "orders" are two different database tables, and while they lived in the same database a long time ago (ACB, which stands for "Amazon.Com Books"), in 1998 we outpaced Moore's Law, and we had to split the database in half before the machine "upped and died"[3].

One classic solution to splitting a hurting database is called vertical partitioning, and it means "move a bunch of tables into another database". Easy enough, right? Well, sure. But doing it breaks our CS Apps query, and in fact breaks the promotional-mailing query as well, if you decided to move the "customers" table out of ACB and into a new database called, for instance, CUST.

That's the problem with 2-tier architectures. Any time the data modelers and/or database administrators need to make a change to the underlying schema, any tools with SQL in them may stop working. And then all of our red-headed customers will be upset, which of course would be bad.
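
Here's a rough sketch of that breakage, using Python and an in-memory SQLite database as stand-ins. The sample rows, the function name, and the use of SQLite are all my assumptions for illustration; only the ACB and CUST names and the query itself come from the example above.

```python
import sqlite3

# Hypothetical stand-in for ACB: one database that holds the customers table.
acb = sqlite3.connect(":memory:")
acb.execute("CREATE TABLE customers (customer_name TEXT, hair_color TEXT)")
acb.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Alice", "red"), ("Bob", "brown"), ("Carol", "red")])


def red_haired_customers(conn):
    # The SQL lives in client code. That is all "2-tier" means.
    rows = conn.execute(
        "SELECT customer_name FROM customers "
        "WHERE hair_color = 'red' ORDER BY customer_name")
    return [name for (name,) in rows]


before_split = red_haired_customers(acb)

# Vertical partitioning: move the customers table into a new database (CUST).
cust = sqlite3.connect(":memory:")
cust.execute("CREATE TABLE customers (customer_name TEXT, hair_color TEXT)")
cust.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    acb.execute("SELECT customer_name, hair_color FROM customers").fetchall())
acb.execute("DROP TABLE customers")

# The client still points at ACB, so its embedded query now fails.
try:
    red_haired_customers(acb)
    broke = False
except sqlite3.OperationalError:
    broke = True

# The data is fine; it just lives somewhere the client doesn't know about.
after_split = red_haired_customers(cust)
```

Nothing is wrong with the data or the query. The client simply knew too much about where the table lived, and every tool with the same SQL baked in fails the same way.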

Services to the Rescue

This is an industry-standard problem, and it comes with a handy industry-standard solution: 3-tier architectures.[4]

In a 3-tier architecture, you add an extra layer between your database and your client code. Clients are no longer allowed to make direct SQL calls. Instead, they invoke something called a Service API. The API might look something like this:

List getCustomersByHairColor("red")

Or:

List getAllCustomerOrdersWhereCustomerHairColorIs("red")

This is an Object-Oriented API Call, and it is the mainstay of Services architectures. This is because all modern languages, and even a few crufty old ones like C++, use "Object-Oriented Design", which is really, really different from relational database design. Instead of Tables, Rows, and Columns, Object-Oriented (OO) programs use Classes, Objects, and Fields. Lots of blood has been spilt on this battleground already, so let's not worry about it for now. It's just the way things are.

The important thing to focus on is that for every possible SQL statement I could make, I must create a corresponding OO API call that tells me the same exact thing. The difference is that by going through the services layer, my client code is no longer directly tied to the database table schema.[5]
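
A minimal sketch of that decoupling, again in Python with SQLite standing in for the real database. The class names, the second schema (cust_profile, full_name, hair), and the sample rows are invented for illustration; the point is only that the client function contains no SQL and survives a schema change untouched.

```python
import sqlite3

PEOPLE = [("Alice", "red"), ("Bob", "brown"), ("Carol", "red")]


class CustomerService:
    """Hypothetical service layer: the only code that knows the schema."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customers (customer_name TEXT, hair_color TEXT)")
        self._db.executemany("INSERT INTO customers VALUES (?, ?)", PEOPLE)

    def get_customers_by_hair_color(self, color):
        rows = self._db.execute(
            "SELECT customer_name FROM customers "
            "WHERE hair_color = ? ORDER BY customer_name", (color,))
        return [name for (name,) in rows]


class RepartitionedCustomerService(CustomerService):
    """Same API after a made-up schema change: new table and column names."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE cust_profile (full_name TEXT, hair TEXT)")
        self._db.executemany("INSERT INTO cust_profile VALUES (?, ?)", PEOPLE)

    def get_customers_by_hair_color(self, color):
        rows = self._db.execute(
            "SELECT full_name FROM cust_profile "
            "WHERE hair = ? ORDER BY full_name", (color,))
        return [name for (name,) in rows]


def promotional_mailing(service):
    # Client code: no SQL, no table names, no connection details.
    return ["Dear " + name for name in service.get_customers_by_hair_color("red")]


old_letters = promotional_mailing(CustomerService())
new_letters = promotional_mailing(RepartitionedCustomerService())
```

The mailing code gets the same answer from both services, which is the whole selling point. The catch, as we'll see, is that somebody has to write get_customers_by_hair_color first.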

To illustrate with a concrete example, let's say my query is, in pseudo-SQL:

GET ME ALL customers WHO ARE worth THEIR WEIGHT IN pesos

I can create a Service call, suitable for processing in my language of choice[6], that does the same thing:

List getMeAllCustomersWhoAreWorthTheirWeightIn("pesos", "please")

This process is called "Object/Relational Mapping" (O/R mapping, or just "ORM"), and occupies much developer time, and much of this article as well.

That's because it's pretty darn hard.

Some Rescue, Bub

SQL is a language designed specifically for interacting with relational databases. It's a rich, expressive language, almost like English, and it gives you the ability to ask fine-grained questions about your data.

So in the olden days, client writers could simply issue a one-line SQL query in their code, sort of like consulting the Oracle at Delphi, and they'd get an instant response[7].

However, with the introduction of Services, things got a little more complex. Today, client writers have to call an API that probably doesn't exist yet. There's a new dependency here: the data is right there, in the database, but we can't get to it anymore. We have to ask the service owner to add the API for us. More on this later.

Sadly, even if the API did exist, it's quite a hurdle to call it. It's not as simple as issuing an SQL statement in your code. You have to do a truly remarkable number of things in order to invoke the service API. Here's an incomplete list of steps you need to perform, as a client writer, to make a service call:

  1. Figure out which protocol the service speaks. There are lots of distributed-computing protocols out there, and although it's the subject of another essay, we've brought the entire Tower of Babel in-house. Our services all speak different protocols: CORBA, Publish/Subscribe, Iquitos, Peru, SOAP, XML/RPC, NFS, and who knows what else.

  2. Figure out the language bindings from your language to the protocol of the service. If you're lucky, the bindings actually exist.

  3. Call some sort of Service Locator to find the service-specific URL of the service you're invoking.

  4. Dork endlessly with your subnet-fabric configuration, or your firewall, or your Linux version, or whatever, until you can actually contact the service.

  5. Create proxy objects that know how to marshal and unmarshal the parameters that have already been marshalled and unmarshalled by the service you're calling, just not in the language you're using (even if it's C++). The service will use something weird and non-helpful, like IDL or XDR or PilferGrommet, and you need to do another round of transformations before it's usable in your program.

  6. Grub around in the data that's returned, hoping that what you're looking for is in there somewhere.

Ah, for the good old days. Was it really worth splitting those databases?

In any case, after days of slogging through the various tomes and grimoires needed to explain the protocol you're using to talk to the service, you finally manage to connect, and NOW you're finally ready to get the "API not found" message.

So you email the service owner and say: "Hello from CS Apps. I have a strong business need to look up customer orders by customer nose size, on account of this recent busted promotion. Can you add the API call for me?"

Their standard response, of course, is to laugh so hard that they crack a rib. You see, nobody else has the slightest need to look anything up by nose size, and besides, they're busy deprecating[8] versions 1.7062 through 21.9738 of their service, and they don't have time to talk to the likes of you.

2-Tier to the Rescue

Ha ha, just kidding. We wouldn't sneak behind the Service Owners' backs and do direct database access. It would be better to just send all those long-nosed customers a gift certificate, or something. But gosh, we can't, since we can't get to the data unless the Service Owner provides an API call for us.

So we have a problem, one that divides service owners into at least two philosophical camps. The problem is:

Who is going to add all those API calls?

And, more to the point, how do they avoid spending all their time doing it? Because there are dozens if not hundreds of client writers, each writing dozens if not hundreds of custom SQL queries, and someone has to convert all those queries to OO Service APIs. Right?

There are two solutions to this problem in use at Amazon today. Both of them suck, and that's through no fault of the service owners. I said in the beginning that writing services is a hard problem, and make no mistake: it is our hardest problem, for more reasons than I'm going to point out in this editorial - but then, my intention is simply to show that services aren't curing our complexity problem, and maybe to suggest one or two simplifications that will ease the pain. But we'll get there later.

The two solutions in use at Amazon are:

  1. Thousands of Teensy Teller Calls That Never Get Written. (OMS)
  2. One Gigantic Database Slurp That Gives You Everything. (CMS)

It's interesting that two of our biggest services chose these two very different approaches. The service owners can't possibly keep up with the demand imposed by the client writers, and both approaches look tidy, at least if you're on a services team and not an app team.

Let's examine both of them in detail, and see what turns up.

Teller Calls

I'm using an oversimplification of the term teller call, but it would take a series of long books to cover these issues in the detail they deserve.

For our purposes, a teller call is a fine-grained API call that maps directly to some fine-grained SQL query. If the query is asking for all customers who ordered elephants last Thursday using super-saver shipping, then the API call will, of course, be:

findAllCustomersWhoOrderedElephantsLastThursdayUsing("super-saver shipping")

Or maybe:

findAll("customers", WHO_ORDERED, new Date("last Thursday"), SHIP_METHOD_SSS)

Or even:

doQuery("SELECT name FROM customers, orders WHERE date=today()-5 AND ...")

Har. Just kidding about that last one.

The point is that with super fine-grained queries like this one (and we DO need queries with this kind of granularity on a regular basis, particularly for mass-cleanup efforts), there isn't a convenient way to structure the method so that it's reusable[9].

Moreover, these queries are often one-offs. The same goes for updates and other backfills. There's no point in adding a complicated service call for a query you're only going to use once (or once in a while).
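
To put a rough number on it, here's a back-of-the-envelope sketch. It assumes, purely for illustration, that the service exposes one teller call per non-empty combination of filter fields, and the four field names are made up. Real services won't enumerate calls this mechanically, but the growth is the point.

```python
from itertools import combinations

# Hypothetical fields a client might want to filter customers by.
FIELDS = ["item", "order_date", "ship_method", "hair_color"]


def teller_method_names(fields):
    """One method name per non-empty combination of filter fields."""
    names = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            names.append("find_customers_by_" + "_and_".join(combo))
    return names


names = teller_method_names(FIELDS)
method_count = len(names)  # 2**4 - 1 combinations of four fields
```

Four fields already need 15 methods under this assumption, and each extra field roughly doubles the count. Equality filters are the easy case, too: date ranges and joins against other tables only make it worse.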

Gosh, it's sure seeming like it would be nice to get occasional access to our databases via SQL. You know, for those elephantine emergencies. Unfortunately, you give those client writers an inch, and they'll take a mile. Before long, they'll have a giant /bin directory filled with command-line tools called "thursday-elaphants"[sic], and "UPS-truck-caught-fire", and "sorry-ruined-christmas", and everything else that's ever gone wrong that CS had to clean up. And they'll scream bloody murder if any one of those precious tools breaks.

You can see why OMS isn't "finished" yet. It never will be. Even if our business stopped changing, the scenarios we generate will not.

What's a service owner to do?[10]

Getting More Than You Asked For

There is another approach to the problem. Imagine you're the owner of a restaurant, perhaps a steakhouse. You have a menu of offerings: customers can order their steaks prepared anywhere from charcoaled to still mooing. You feel you've offered a good set of choices.

However, your customers insist on ordering subtle variations that aren't on the menu: they want to choose the cut, the weight, the preparation, the sauces, and so on. It's becoming a problem, because your waiters are spending all their time taking notes to give to the chefs, who now have no time for cleaning the kitchen, inventing dishes to attract vegetarian customers, and so on.

You hit upon a clever solution: lead a live cow out on a leash, give it to the customer, and tell them: "Here's a cow. Have at it."

There are, of course, a few minor downsides to this approach:

  • The cow may not have room to navigate through the restaurant.
  • Eating the cow is somewhat less convenient in its current form.
  • The customer may not be hungry enough to eat the entire cow.

However, it does address the basic problem. The customer can have their steak any way they like it, and the restaurant owner barely needs to be involved.

This is the approach used by the Customer Master Service (CMS). It's a bit of an oversimplification, of course; they also give you a machete, so you don't have to use your knife and fork to subdue the cow. But it's an apt comparison. When you ask for any customer information that's not on the "menu" of existing API calls, you get back the entire customer: all of their addresses, credit cards, wish list settings, ignored unsubscribe options - everything. It comes back in one giant tree structure, and you can then use your Computer Science data structures and algorithms knowledge to start faking table joins and whatnot. You can Have It Your Way.

This has end-to-end performance implications, of course:

  • It requires some expensive and potentially complex database joins to snarf up all the data.

  • You're using a lot more memory and network bandwidth than you'd need for a fine-grained call.

  • The client has to do computationally-expensive (and probably error-prone) tree traversals to "fake" the original SQL query.

But in the grand scheme of things, the performance isn't killing us. You can build fancy caching so you don't have to re-do the query all the time, and we're running gigabit networks, and fast machines, and so on. The performance is certainly an issue, but it appears to be a tractable one.
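
For a feel of what "faking" a query looks like on the client side, here's a small sketch. The nested record shape, the field names, and get_all_customers are assumptions of mine for illustration, not the actual CMS interface.

```python
# Hypothetical shape of a "give me everything" response:
# one nested record per customer, whether you wanted it all or not.
CUSTOMERS = [
    {"name": "Alice", "hair_color": "red",
     "addresses": [{"city": "Seattle"}],
     "orders": [{"order_id": 101, "ship_method": "super-saver"},
                {"order_id": 102, "ship_method": "standard"}]},
    {"name": "Bob", "hair_color": "brown",
     "addresses": [],
     "orders": [{"order_id": 103, "ship_method": "super-saver"}]},
    {"name": "Carol", "hair_color": "red",
     "addresses": [],
     "orders": []},
]


def get_all_customers():
    """Stand-in for the coarse-grained service call: here's a cow."""
    return CUSTOMERS


def orders_for_hair_color(color):
    # Client-side "join": walk every customer tree and keep matching orders.
    # In SQL this is a single WHERE clause across customers and orders.
    result = []
    for customer in get_all_customers():
        if customer["hair_color"] != color:
            continue
        for order in customer["orders"]:
            result.append((customer["name"], order["order_id"]))
    return result


red_orders = orders_for_hair_color("red")
```

The traversal is short here because the data is tiny and the question is simple. Every application that needs a slightly different question ends up writing its own slightly different loop.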

The real problem is that application development just got a lot harder at Amazon. Every time we want to launch a new feature that requires looking up customer information, the programmers need to do a lot more work to get the info, because they no longer have the expressiveness of SQL at their disposal.

Moreover, the code the clients are writing for these custom queries and transactions is unlikely to be sharable. That means it's getting duplicated (always with slight variations) in all the applications. The implication is that if we find a bug in the code, it may have to be found and fixed in multiple places, on multiple teams.

This is a hidden cost. It's hard to measure. You can't see it, but it's there.

So Which Way Is Best?

I've now outlined three models for doing database access:

  1. Direct SQL access from your client code. This is bad because changing the database layout means potentially breaking hundreds of client applications.

  2. Create an API call for every possible client query. This is bad because it doesn't scale: the service owner can't keep up with the demand for new calls, and it doesn't make sense to add one-off calls into a service interface.

  3. Give the entire database back to the client. This is bad because it pushes additional complexity out to the app developers in a way that's non-sharable, and because it has potential performance problems. In fact, some people argue that it's in some ways a throwback to a 2-tier model, but I'm not sure I agree with that.

A fourth model, one that I don't think is in use at Amazon today, might be to provide a way for client writers to add their own API calls to your service. A Self-Service Service Service, as it were. This actually does happen here and there, on an ad-hoc basis. A truly desperate client group can figure out how to build the service locally, add in the code for the call, and request a code review from the service team. It's easier for a service team to allocate a few hours for a code review than a few days for writing the code. But it doesn't happen very often, for various reasons beyond the scope of this article.[11]

Summary and Recommendations

First, there's a lot of FUD being spread about Services having been launched to reduce our systems complexity. It's not true. We launched them specifically to solve the 2-tier problem. The services are making our systems much more complex, at least for application developers.

That means that having Services isn't going to magically fix our systems-availability problems, nor will it intrinsically make us more flexible. We'll have to tackle those problems directly.

Building services is really hard, particularly at Amazon, because of our massive scale. There doesn't seem to be one ideal approach: all approaches involve serious trade-offs. It's one of the industry's thorniest problems.

Query languages like SQL (and their equivalents in the XML world) are rich and beautiful. They handle transactions seamlessly, they provide almost infinite flexibility in the granularity of user queries, they require no builds or linking cycles, and they're exceptionally well-documented.

Most developers, both worldwide and at Amazon, don't seem to realize that an Object-Oriented service API is a pretty piss-poor replacement for a query language. But let's face it: APIs require way more work from both the service owners and the clients.

It stands to reason that once you have a reasonably well-defined service interface to a set of databases, with all the caching and transactional semantics worked out in production, the next logical step is to build a new query language that abstracts the APIs away for the clients. At least, it seems pretty clear to me.
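
To give a flavor of what I mean, here's a toy sketch, not a proposal for a real design. The service takes a small declarative request (which fields to return, which equality filters to apply) and translates it into whatever its storage looks like. The class, the field names, and the request shape are all invented for this example.

```python
import sqlite3


class CustomerQueryService:
    """Hypothetical service that accepts a tiny declarative query instead of
    exposing one method per question. All names here are made up."""

    # Public field names mapped to private column names. Clients only see
    # the keys, so the columns can change without breaking them.
    _FIELDS = {"name": "customer_name", "hair": "hair_color", "city": "city"}

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customers "
            "(customer_name TEXT, hair_color TEXT, city TEXT)")
        self._db.executemany(
            "INSERT INTO customers VALUES (?, ?, ?)",
            [("Alice", "red", "Seattle"),
             ("Bob", "brown", "Reno"),
             ("Carol", "red", "Portland")])

    def query(self, select, where):
        if not select:
            raise ValueError("select must name at least one field")
        # Unknown field names raise KeyError, so only whitelisted columns
        # reach the SQL text; filter values are passed as bound parameters.
        columns = [self._FIELDS[field] for field in select]
        conditions = [self._FIELDS[field] + " = ?" for field in where]
        sql = "SELECT " + ", ".join(columns) + " FROM customers"
        if conditions:
            sql += " WHERE " + " AND ".join(conditions)
        sql += " ORDER BY " + columns[0]
        rows = self._db.execute(sql, list(where.values()))
        return [dict(zip(select, row)) for row in rows]


service = CustomerQueryService()
red_heads = service.query(select=["name", "city"], where={"hair": "red"})
everyone = service.query(select=["name"], where={})
```

The client asks for exactly the fields it wants and nothing more, and the service owner never had to add a getCustomerNamesAndCitiesByHairColor call. A real query layer would also need joins, ranges, transactions, and access control, which is where the hard part lives.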

If you go to the effort of building N-tier services, you should make sure you actually realize all the potential benefits. One of the potential benefits is language neutrality. But you have to make sure of it yourself. If you choose to use CORBA or Tibco publish-subscribe or any other protocol, you need to make sure there's a way to create language bindings.

And I mean it, too. Because in the middle of the night, when we're trying to clean up after a major catastrophe by writing programs that fix our database data, it would be pretty frigging stupid to write the cleanup code in C++, or even Java.

If you create a service to encapsulate your data store, be aware that small interfaces will inevitably become large ones, because, hey, you didn't create a general-purpose query language, did you? And once your API is large, with hundreds or even thousands of methods, you're going to be tweaking it all the time. Versioning, deployment, QA, and other aspects of production pushes will start to become a nightmare. So make sure you refactor your services constantly! It's always better to push a small service than a big one.

And don't make any new services with the word "Master" in them! Or "Kitchen Sink", for that matter. They'll just bloat up into unmanageable leviathans.

I could offer lots more tips, but you get the idea. Just be aware that no matter how carefully you plan your service, it's going to suck. It's OK, though, because nobody's figured out how to do it right yet. Nobody.

This is a hard problem.

Notes

[1] Before you security engineers have a heart attack, I'll confess that we mooooostly don't do this anymore.

[2] [Editor's Note: That was heavy sarcasm. It actually happens on essentially every promotion Amazon runs, or at least it used to.]

[3] In DBA terminology.

[4] Also called "N-tier architectures", because N, as we all know, is way cooler than 3.

[5] It's sort of like adding a teller to a bank. Instead of going and getting my cash directly out of the vault, I ask a teller to do it for me. The bank could decide to invest all my cash in South American junk bonds, but I'd never know, since I'm still talking to a teller. We'll revisit this metaphor later.

[6] Or C++, I guess.

[7] "Database not found. TNSNAMES.ORA missing or corrupted."

[8] Deprecation is a software term meaning "asking people not to do something anymore." It's roughly as effective as a liquor-store owner posting a sign saying "Robbing our store is now deprecated. Thank you."

[9] Or even usable, if you want to be just plain mean.

[10] OK, a real footnote for once. There are other problems with the teller model, beyond the exponentially increasing number of API calls required by the clients. For one thing, splitting the APIs makes it much more difficult to maintain cache consistency and transactional integrity, since the calls can now be made outside the context of a transaction. It's harder to maintain state across API calls, and it's easier for ill-behaved clients to invoke the calls zillions of times and bring the service down via "friendly fire". There is virtually no end to the headaches, but it's really way beyond the scope of this article to talk about them. Just feel sorry for both the client writers and the service writers, and your karmic debt will be paid in full.

[11] I'm sure you can imagine what they are. For one thing, client writers are going to spend much longer writing a service call than the service owner would, because they need to come up to speed on a lot of service-specific "stuff" before they can even get started. And the negotiation usually involves more than just getting a code review; the service owner has to agree to take operational responsibility for the new code, and possibly participate in the design, if it's more than a trivial pass-through call. It's usually more convenient for client writers to put their request in the queue and wait (possibly a very long time) for it to be added.