Scaling a Microblogging Service – Part III

There used to be a big difference between API access and regular human-oriented HTML access: the speed at which requests are made. When a request is made via a browser, there is inherent delay from human interaction, browser response, page rendering, and the fetching of images. Most of this is gone once a machine makes the call. However, with recent improvements in browser technology and the wide use of AJAX techniques on the client side, even human-readable pages can make API calls to render content.

Scalability plays a central role when designing the ways in which data can be requested from a service, be it via an API call or HTML page request. Both types fetch raw data, process it, and then format it into a presentation format such as HTML, XML, JSON, etc. In the case of a server-rendered HTML page, all the different requests are made internally, hidden from the user, and a single page is returned. If the page uses AJAX scripts, the browser makes multiple API calls to fetch individual data sets, but the server still has to fetch the raw data, process it, and format it. It is the size of the batch that makes the difference.

Making an API call is like asking a question, and if the question is simple enough, the answer is easy to come by. For example, asking to describe a single user is simple: a single database lookup using a single unique key is fast and generally easy to scale. This is the most optimized use case in relational databases. However, asking for the last 20 messages of all the friends I am following is not a simple question. It is complex because it is really a list of questions batched together. To find the last 20 messages, we must ask:

  • Who do I follow?
  • What are their last 20 messages?
  • Of all the messages found, which are the most recent 20?
  • What is the content and other metadata for each of these 20 messages?
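
As a rough sketch, these four sub-questions can be answered with a per-friend fetch followed by a merge. In-memory dicts stand in for database tables here; the names and data are illustrative, not the service's actual schema:

```python
import heapq

# Hypothetical stand-ins for database tables.
follows = {"me": ["alice", "bob"]}                # who each user follows
messages = {                                      # per-user messages, newest
    "alice": [(5, "a3"), (3, "a2"), (1, "a1")],   # first, as (id, text)
    "bob":   [(4, "b2"), (2, "b1")],
}

def timeline(user, limit=20):
    """Who do I follow? -> their latest messages -> merge down to `limit`."""
    per_friend = [messages.get(f, [])[:limit] for f in follows.get(user, [])]
    # Each per-friend list is already newest-first, so an id-descending
    # merge yields the combined timeline in order.
    merged = heapq.merge(*per_friend, key=lambda m: m[0], reverse=True)
    return list(merged)[:limit]
```

Note that even this sketch touches every friend's messages on every read, so the cost grows with the size of the friend list.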

This list of questions becomes pretty inefficient with a very large set of friends being followed. However, the data set can be reduced by first finding the top 20 friends with the most recent updates, and only requesting their 20 latest messages. Even with this optimization, we are looking at fetching up to 400 messages in order to display 20. This gets much worse when the request is for the second page of messages (messages 21-40) where we can fetch up to 1600 messages from up to 40 friends in order to display 20 messages. By the 5th page we are fetching up to 10,000 messages to display 20 (which might explain why Twitter temporarily disabled the paging feature of their API).
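The blow-up follows a simple square law: page p considers the top 20·p friends, each contributing up to 20·p messages. A quick sketch of the arithmetic:

```python
def worst_case_fetch(page, page_size=20):
    """Worst-case number of messages fetched to render one page of
    `page_size` messages, under the top-friends optimization above."""
    candidates = page * page_size    # friends considered for page p
    return candidates * candidates   # each may contribute that many messages
```

So page 1 costs up to 400 fetched messages, page 2 up to 1,600, and page 5 up to 10,000, matching the figures above.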

All the popular microblogging services offer a web interface for requesting this personalized timeline, as well as an API call. They all let you ask the direct question: what is the latest with my friends? But that comes at a significant performance cost. In theory, there should not be any difference in the amount of resources needed to fulfill this request. In the end, all the sub-questions must be answered, and if you only ask once, the server-side batch solution will be faster. But when the client starts asking this once a minute or more, resources go to waste and scalability becomes more expensive.

What makes breaking this single request into multiple smaller ones more efficient is the fact that the client can be stateful and store information between calls. For example, the client can keep track of the list of friends, remember the last message id retrieved, and fetch all 5 pages at once and page through them locally. If the client keeps track of its last request, it can refresh its status using simple questions, such as: has anything changed for this user since message id 4839? To which the server can reply quickly. It also makes usability changes easier to implement, such as allowing users to temporarily hide verbose friends.
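A minimal sketch of such a stateful client, assuming a hypothetical `fetch(since_id=...)` endpoint that returns only messages newer than a given id:

```python
class TimelineClient:
    """Remembers the last message id seen, and afterwards asks only the
    cheap incremental question: anything new since that id?"""

    def __init__(self, fetch):
        self.fetch = fetch    # stands in for the HTTP API call
        self.last_id = 0
        self.cache = []       # locally paged timeline, newest first

    def refresh(self):
        new = self.fetch(since_id=self.last_id)
        if new:
            self.last_id = max(m[0] for m in new)
            self.cache = sorted(new + self.cache,
                                key=lambda m: m[0], reverse=True)
        return new
```

Because the cache persists between calls, local paging (messages 21-40 and beyond) costs the server nothing.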

A big part of Twitter’s success came from opening up their infrastructure via a dead-simple API. For the most part, the API design was a result of converting existing HTML pages into a machine-readable representation of the same data. The rest was driven by their active developer community. To a large degree, providing this super-simple API is one of Twitter’s biggest scaling challenges, and it is a pattern repeated by all the other providers in the space.

Offering a simple API is important, and in no way is this post promoting the idea of sacrificing ease of use. But developers can be motivated to build better applications using a somewhat more complex set of requests instead of the easy but very expensive ones. For example, Twitter can enforce stricter limitations on the /friends_timeline API call, but allow a much higher quota for other, more restrictive calls, such as asking whether a specific user has new updates since a given time or id. Even more importantly, services should change their websites to use their own API with client-side scripts, similar to how Google Reader works. Not only will the user experience improve, but it will also serve as a live demo of how to use the API (as well as unify the server platform around a single method of access).

Once the server is only serving data with limited scope, usually providing the messages of a single person with the optional perspective of the reader (to enforce access control), scaling becomes a much easier task, as data can be segmented more easily. In a world where APIs are becoming a necessity, developers should make sure their platform not only allows API access, but supports the behavior patterns it dictates. Finding the perfect balance between a short learning curve and a highly scalable platform is key to long-term success.

7 thoughts on “Scaling a Microblogging Service – Part III”

  1. regarding paging, if you set up a table index as (msg_date, user_id) then I think you avoid unnecessary disk I/O. with mysql, for example, you would use LIMIT 20. so the initial query would be like… msg_date < current time and user_id in (?). and if the user requests the next page, you just use the date of the last displayed msg for the next lookup.
    anyway, it was an interesting writeup.
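A minimal sketch of this keyset-pagination idea, using sqlite3 in place of MySQL; the schema and data are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE msgs (user_id INTEGER, msg_date INTEGER, body TEXT)")
db.execute("CREATE INDEX idx_date_user ON msgs (msg_date, user_id)")
# One message per day, alternating between two users.
db.executemany("INSERT INTO msgs VALUES (?, ?, ?)",
               [(1 if d % 2 else 2, d, "m%d" % d) for d in range(1, 8)])

def page(friend_ids, before_date, limit=3):
    """Fetch the next page strictly older than `before_date` (keyset
    pagination: no OFFSET, so each page is an index range scan)."""
    placeholders = ",".join("?" * len(friend_ids))
    q = ("SELECT msg_date, body FROM msgs "
         "WHERE msg_date < ? AND user_id IN (" + placeholders + ") "
         "ORDER BY msg_date DESC LIMIT ?")
    return db.execute(q, [before_date, *friend_ids, limit]).fetchall()
```

One caveat: equal msg_date values need a tie-breaker column (such as the message id) in the key, otherwise rows sharing the boundary date can be skipped between pages.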

  2. It seems to me that one of the main problems has already been identified: the database. Specifically, the mismatch between the kinds of queries you want to do, and the relational model. As another commenter said, using the proverbial hammer on a screw. Teaching an elephant to tap-dance is a difficult task, at best.
    I would suggest that you look at other models that are more suited to the problem domain, such as network databases, where the main modeling is done with nodes and properties instead. Specifically, there is a database that implements this model and which is insanely performant for exactly the kind of queries you mention, and it’s called Neo4j. I think you will find that you can express the queries quite easily.
    I’ve seen one demo (“The Facebook Demo”) where they populated a database with a shitload of users, each having relations to each other, and then implemented it using Postgres and Neo4j respectively. Finding relations between users through SQL was non-trivial, and the queries took seconds and sometimes more than a minute to compute, whereas Neo4j did it with sub-second responses. No magic to it, really, since the model is designed to do precisely that, but still cool in this era of social networking, where the current technology is not really able to match the new requirements. I would strongly suggest that you check it out.

  3. I don’t see how neo4j solves this problem. Navigating graphs in a relational DB is tricky, yes. And trivial in a network database, assuming you already have your finger on the node(s) of interest. But that does not mean there’s no cost, nor that the cost of traversing a couple of relations away and then filtering the timeline of messages goes away. Network databases have their own issues and they don’t automatically make things like NP-hard problems go away.

  4. I have thought of a solution to the twitter problem and have posted it (well a simple version) on Techcrunch. Before we changed the direction of my company we had a real time friend feed similar to Facebook’s – but real time. We designed it to be distributed across multiple MySQL databases using a query distribution engine but we never needed to scale it which means I am not 100 percent sure that the solution would have worked. I wrote a little article on Techcrunch and will write a full one next weekend when I finally get my own blog live.
    I do agree that it is a very hard problem to solve if you still want to provide all the functionality that Twitter currently does, i.e. a real-time view of the information and not a delayed view of it. I anticipate that my solution to the problem would have some latency issues and would also increase the demand on the processor.
    Post 137 – Please let me know what you guys think

  5. Users ask for timelines every few seconds but they don’t post every few seconds. So, to me, it makes sense to get as much work done on the write side of things as possible, since you have less load there.
    Imagine a million users with a million buckets. When user A makes a post, that post is dropped in each of A’s subscribers’ buckets. That way, when any user logs in, all they need to do is look in their own bucket for the timeline. There is no searching friend-by-friend for updates. Also, paging a bucket is straightforward if you maintain order as you drop in the posts.
    This idea produces lots and lots of duplicates, but that’s not really so bad since storage is the cheapest part of a system. You could try to minimize dups by dropping a pointer instead of the entire post, but then you’re left with dereferencing the pointer on each timeline request. I say spend your money on storage and keep reading dead simple and fast.
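A toy sketch of this fan-out-on-write bucket idea (the names and structures are illustrative):

```python
from collections import defaultdict

followers = {"alice": ["me", "bob"]}   # who subscribes to each author
buckets = defaultdict(list)            # one pre-built timeline per reader

def post(author, msg_id, text):
    """Write side does the work: copy the post into every subscriber's
    bucket, keeping each bucket newest-first."""
    for reader in followers.get(author, []):
        buckets[reader].insert(0, (msg_id, author, text))

def read_timeline(reader, limit=20):
    """Read side is a single lookup: no searching friend-by-friend."""
    return buckets[reader][:limit]
```

Paging falls out for free, since each bucket is already kept in order.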

Comments are closed.