

Twitter, an Evolving Architecture

Posted by Abel Avram on Jun 26, 2009

Evan Weaver, Lead Engineer in the Services Team at Twitter, whose primary job is optimization and scalability, talked at QCon London 2009 about Twitter's architecture, and especially about the optimizations performed over the last year to improve the website.

Most of the tools used by Twitter are open source. The stack is made up of Rails for the front end, C, Scala, and Java for the middle business layer, and MySQL for storing data. Everything is kept in RAM, and the database is just a backup. The Rails front end handles rendering, cache composition, DB querying, and synchronous inserts. This front end mostly glues together several client services, many written in C: a MySQL client, a Memcached client, a JSON one, and others.

The middleware uses Memcached, Varnish for page caching, and Kestrel, an MQ written in Scala; a Comet server is in the works, also written in Scala, to be used by clients that want to track a large number of tweets.

Twitter started as a “content management platform not a messaging platform” so many optimizations were needed to change the initial model based on aggregated reads to the current messaging model where all users need to be updated with the latest tweets. The changes were done in three areas: cache, MQ and Memcached client.

Cache

Each tweet is tracked on average by 126 users, so there is clearly a need for caching. In the original configuration, only the API had a page cache, which was invalidated each time a tweet came in from a user; the rest of the application was cacheless:

[Figure: Twitter's original architecture, with a page cache in front of the API only]

The first architectural change was to create a write-through Vector Cache containing an array of tweet IDs, serialized as 64-bit integers. This cache has a 99% hit rate.
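
To make the write-through idea concrete, here is a minimal Ruby sketch of such a vector cache, using a plain Hash as a stand-in for memcached; the class and key names are illustrative, not Twitter's actual code:

    # Write-through vector cache sketch: a Hash stands in for memcached and
    # "timeline:<user>" keys map to tweet IDs packed as 64-bit integers.
    class VectorCache
      def initialize(store = {})
        @store = store   # stand-in for a memcached client
      end

      # Write-through: a new tweet ID is appended to the cached vector at the
      # same time the tweet is written to the database.
      def push(timeline_key, tweet_id)
        ids = fetch(timeline_key)
        ids << tweet_id
        @store[timeline_key] = ids.pack("Q>*")   # serialize as 64-bit integers
      end

      # Read side: unpack the serialized vector, or return an empty array on a miss.
      def fetch(timeline_key)
        blob = @store[timeline_key]
        blob ? blob.unpack("Q>*") : []
      end
    end

    cache = VectorCache.new
    cache.push("timeline:evan", 1_234_567_890)
    cache.push("timeline:evan", 1_234_567_891)
    p cache.fetch("timeline:evan")   # => [1234567890, 1234567891]

Because the vector is updated in place on every write, it can be repaired online instead of being destroyed each time a new tweet ID comes in.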

The second change was adding another write-through Row Cache containing database records: users and tweets. This one has a 95% hit rate and it is using Nick Kallen’s Rails plug-in called Cache Money. Nick is a Systems Architect at Twitter.

The third change was introducing a read-through Fragment Cache containing serialized versions of the tweets accessed through API clients, packaged as JSON, XML, or Atom, with the same 95% hit rate. The fragment cache “consumes the vectors directly, and if a serialized fragment is currently cached it doesn’t load the actual row for the tweet you are trying to see so it short-circuits the database the vast majority of times”, said Evan.
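
A read-through cache works the other way around: the fragment is rendered only on a miss. Below is a rough Ruby sketch, again with a Hash standing in for memcached and with made-up names; the vector-consumption detail is omitted:

    # Read-through fragment cache sketch: on a hit, the serialized fragment is
    # returned and the tweet row is never loaded; on a miss, the block renders
    # the fragment once and the result is cached.
    require "json"

    class FragmentCache
      def initialize(store = {})
        @store = store   # stand-in for a memcached pool
      end

      # The format is part of the key because API clients may ask for JSON, XML or Atom.
      def fetch(tweet_id, format)
        key = "frag:#{tweet_id}:#{format}"
        @store[key] ||= yield   # render only when nothing is cached
      end
    end

    # Stand-in for the expensive path: loading the row and serializing it.
    def render_tweet(tweet_id)
      JSON.generate({ id: tweet_id, text: "hello", user: "evan" })
    end

    fragments = FragmentCache.new
    fragments.fetch(42, :json) { render_tweet(42) }               # miss: renders and caches
    puts fragments.fetch(42, :json) { raise "not reached on a hit" }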

Yet another change was creating a separate cache pool for the page cache. According to Evan, the page cache pool uses a generational key scheme rather than direct invalidation, because clients can

send HTTP if-modified-since and put any timestamp they want in the request path, and we need to slice the array and present them with only the tweets they want to see, but we don’t want to track all the possible keys that the clients have used. There was a big problem with this generational scheme because it didn’t delete all the invalid keys. Each page that was added, corresponding to the number of tweets people were receiving, would push out valid data in the cache, and it turned out that our cache only had a five-hour effective lifetime because of all these page caches flowing through.

When the page cache was moved into its own pool, the cache misses dropped about 50%.
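
The generational scheme can be sketched roughly as follows: every page key embeds a per-user generation number, and invalidation just bumps that number instead of deleting each derived key. This is an illustrative Ruby sketch, not Twitter's implementation:

    # Generational key scheme sketch: each cached page key embeds a per-user
    # generation number, so invalidation is one counter bump instead of
    # deleting every key a client's if-modified-since slices produced.
    class GenerationalPageCache
      def initialize(store = {})
        @store = store   # stand-in for the dedicated page-cache pool
      end

      def generation(user)
        @store["gen:#{user}"] ||= 0
      end

      # Called when the user receives a new tweet: all cached pages go stale at once.
      def invalidate(user)
        @store["gen:#{user}"] = generation(user) + 1
      end

      def fetch(user, request_params)
        key = "page:#{user}:gen#{generation(user)}:#{request_params}"
        @store[key] ||= yield
      end
    end

    pages = GenerationalPageCache.new
    pages.fetch("evan", "since=123") { "rendered page" }   # rendered and cached
    pages.invalidate("evan")                               # new tweet arrives
    pages.fetch("evan", "since=123") { "fresh page" }      # old key is orphaned, page re-rendered

The orphaned keys from old generations are never deleted explicitly; they simply age out of the cache, which is why moving the page cache into its own pool mattered: the churn no longer evicted the vector, row, and fragment entries.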

This is the current cache scheme employed by Twitter:

[Figure: Twitter's current cache scheme]

Since 80% of Twitter's traffic comes through the API, there are two additional levels of cache, each servicing up to 95% of the requests coming from the preceding layer. The overall cache changes, between 20 and 30 optimizations in total, brought a

10x capacity improvement, and it would have been more but we hit another bottleneck at that point … Our strategy was to add the read-through cache first, make sure it invalidates OK, and then move to a write-through cache and repair it online rather than destroying it every time a new tweet ID comes in.

Message Queue

Since each user has, on average, 126 followers, 126 messages are placed in the queue for each tweet. Besides that, there are times when traffic peaks, as during Obama's inauguration, when it reached several hundred tweets per second, or tens of thousands of messages going into the queue, three times the normal traffic at that time. The MQ is meant to absorb the peak and disperse it over time so they don't have to add lots of extra hardware. Twitter's MQ is simple: it is based on the Memcached protocol, there is no ordering of jobs and no shared state between servers, everything is kept in RAM, and it is transactional.
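
Because the MQ speaks the Memcached protocol, any memcached client can enqueue a job with a set and dequeue one with a get. A bare-bones Ruby sketch over a raw socket follows; the host, port, and queue name are illustrative assumptions, not Twitter's configuration:

    # Bare-bones client for a Kestrel-style queue that speaks the memcached text
    # protocol: `set` enqueues a job, `get` dequeues one.
    require "socket"

    class QueueClient
      def initialize(host = "localhost", port = 22133)   # illustrative host/port
        @sock = TCPSocket.new(host, port)
      end

      def enqueue(queue, payload)
        @sock.write("set #{queue} 0 0 #{payload.bytesize}\r\n#{payload}\r\n")
        @sock.gets   # "STORED\r\n" on success
      end

      def dequeue(queue)
        @sock.write("get #{queue}\r\n")
        header = @sock.gets                  # "VALUE <queue> <flags> <bytes>\r\n" or "END\r\n"
        return nil if header.nil? || header.start_with?("END")
        bytes = header.split[3].to_i
        value = @sock.read(bytes)
        @sock.read(2)                        # trailing "\r\n" after the data block
        @sock.gets                           # closing "END\r\n"
        value
      end
    end

    # client = QueueClient.new
    # client.enqueue("deliver_tweet", "tweet_id=42 follower_id=126")
    # client.dequeue("deliver_tweet")        # => "tweet_id=42 follower_id=126" (or nil if empty)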

The first implementation of the MQ used Starling, written in Ruby, and did not scale well, largely because Ruby's GC is not generational. That led to MQ crashes, because at some point the entire queue processing stopped while the GC finished its job. The decision was made to port the MQ to Scala, which uses the more mature JVM GC. The current MQ is only 1,200 lines of code and runs on three servers.

Memcached Client

The Memcached client optimization was intended to optimize cluster load. The current client used is libmemcached, Twitter being its most important user and contributor to the code base. Based on it, the Fragment Cache optimization over one year led to a 50x increase in page requests served per second.

[Figure: page requests served per second over the past year]

Because of poor request locality, the fastest way to deal with requests is to precompute data and store it in network RAM, rather than recompute it on each server when necessary. This approach is used by the majority of Web 2.0 sites, which run almost completely from memory. The next step is “scaling writes, after scaling reads for one year. Then comes the multi co-location issue”, according to Evan.

The slides of the QCon presentation have been published on Evan’s site.

///

Twitter is by far the largest Ruby on Rails application. Its page views grew from zero to several million within a few months, and Twitter today is 10,000% faster than it was earlier this year.

Platform

Ruby on Rails 

Erlang 

MySQL 

Mongrel 

Munin 

Nagios 

Google Analytics 

AWStats 

Memcached

Stats

Many thousands of users; the real number is kept secret

600 requests per second

An average of 200-300 connections per second, peaking at 800 connections

MySQL handles 2,400 requests per second

180 Rails instances, with Mongrel as the web server

1 MySQL server (one big 8-core box) and 1 slave for read-only statistics and reporting

30+ processes for handling the rest of the work

8 Sun X4100s

Rails handles a request in 200 milliseconds

Average time spent in the database is 50-100 milliseconds

More than 16 GB of memcached

Architecture

1. Twitter ran into very common scalability problems.

2. At first Twitter had no monitoring, no graphs, and no statistics, which made troubleshooting very difficult. Munin and Nagios were added later. Using the tools on Solaris was a bit hard. Google Analytics was in place, but the pages weren't loading, so it wasn't much use.

3. Memcached is used heavily for caching:

- For example, if getting a count is slow, you can memoize the count into memcached and get it back in a millisecond (see the sketch after this list).

- Getting your friends' statuses is complicated, for security and other reasons, so a friend's status is pushed into the cache when it is updated rather than fetched with a query. It never touches the database.

- ActiveRecord objects are huge, so they aren't cached. Twitter stores the critical attributes in a hash and lazy-loads the rest on access.

- 90% of requests are API requests, so no page or fragment caching is done on the front end. The pages are so time-sensitive that caching them isn't worthwhile, but Twitter does cache API requests.
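
As an illustration of the count memoization mentioned above, here is a small Ruby sketch in which a Hash with expiry times stands in for memcached; the key name and TTL are made up for the example:

    # Count memoization sketch: a Hash with expiry times stands in for memcached.
    # The expensive count runs once; later reads come straight from memory.
    class CountCache
      Entry = Struct.new(:value, :expires_at)

      def initialize(ttl)
        @ttl = ttl
        @store = {}
      end

      def fetch(key)
        entry = @store[key]
        return entry.value if entry && entry.expires_at > Time.now
        value = yield                                    # e.g. a slow SELECT COUNT(*)
        @store[key] = Entry.new(value, Time.now + @ttl)
        value
      end
    end

    counts = CountCache.new(30)                          # 30-second TTL, chosen for the example
    counts.fetch("followers_count:evan") { 126 }         # slow path runs once
    puts counts.fetch("followers_count:evan") { raise "not called on a hit" }   # => 126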

4. Messaging

- Messaging is used heavily. Producers produce messages and put them on a queue, and the messages are then distributed to consumers. Twitter's main function is to act as a messaging bridge between different formats (SMS, web, IM, and so on).

- DRb, which stands for Distributed Ruby, was used first: a library that lets you send and receive messages from remote Ruby objects over TCP/IP, but it is a bit fragile.

- They moved to Rinda, a shared queue that uses the tuplespace model, but the queues are persistent and messages are lost on failure.

- Erlang was tried.

- They moved to Starling, a distributed queue written in Ruby.

- The distributed queues are made to survive system crashes by writing them to disk. Other large websites use this simple approach as well.

5. SMS is handled through a third-party gateway's API, which is very expensive.

6. Deployment

- Twitter does a review and rolls out new mongrel servers. There is no graceful way to do it yet.

- If a mongrel server is being replaced, an internal error is thrown at the user.

- All the servers are killed at once. A rolling blackout isn't used, because the message-queue state is kept in the mongrels and a rolling approach would leave the remaining mongrels backed up.

7. Abuse

- The system would often go down because people madly added anyone and everyone as a friend, 9,000 friends in 24 hours, which would bring the site down.

- Build tools to detect these problems so you can find out when and where they happen.

- Be ruthless about deleting these users.

8. Partitioning

- Partitioning is planned for the future but isn't done yet; the changes made so far have been enough.

- The partitioning plan is based on time, not on users, because most requests are temporally local.

- Partitioning will be hard because of memoization. Twitter cannot guarantee that read-only operations really are read-only; it is possible to write to a read-only slave, which is bad.

9. Twitter's API traffic is 10 times that of the Twitter site.

- The most important thing Twitter has done is its API.

- Keeping the service simple lets developers build things on top of Twitter's infrastructure that are better ideas than Twitter itself could have come up with. For example, Twitterrific is a beautiful way to use Twitter.

Lessons Learned

1. Talk to the community. Don't hide and try to solve every problem yourself. If you ask, plenty of smart people are willing to help.

2. Treat your scaling plan like a business plan, and gather a group of advisers to help you.

3. Build it yourself. Twitter spent a lot of time trying other people's solutions that seemed like they should work, and failed. It is better to build some things yourself, so you at least have control and can build in the features you need.

4. Build in user limits. People may try to bring your system down. Put in reasonable limits and detection mechanisms to protect your system from being killed (a short sketch follows).
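
As a rough illustration of such a limit, the Ruby sketch below caps friend-adds per user per day; the cap and class name are invented for the example, and a real system would keep the counters in a shared store and alert on users who hit the cap:

    # Per-user friend-add limit sketch (numbers and names are illustrative).
    class FriendAddLimiter
      DAY = 24 * 3600

      def initialize(max_per_day)
        @max = max_per_day
        @events = Hash.new { |h, k| h[k] = [] }
      end

      def allow?(user_id, now = Time.now)
        recent = @events[user_id].select { |t| now - t < DAY }   # keep the last 24 hours
        return false if recent.size >= @max                      # over the limit: refuse and flag
        @events[user_id] = recent << now
        true
      end
    end

    limiter = FriendAddLimiter.new(3)
    4.times { |i| puts "add #{i + 1}: #{limiter.allow?('user42')}" }   # the fourth add is refused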

5. Don't let the database become the primary bottleneck. Not everything needs a huge join. Cache data, and think of other creative ways to get the result. A good example is discussed in Twitter, Rails, Hammers, and 11,000 Nails per Second.

6. Make your application easy to partition from the start. That way you will always have a way to scale your system.

7. Recognize that your system is slow, and add reporting right away to track down the problems.

8. Optimize the database.

- Index everything; Rails won't do this for you.

- EXPLAIN your queries to see how they run; the indexes may not be used the way you imagine.

- Denormalize a lot. For example, Twitter stores user IDs and friend IDs together, which avoids a lot of expensive joins, as sketched below.
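
To illustrate the denormalization point, here is a plain-Ruby sketch in which friend IDs are kept as a single delimited column next to the user record, so reading a friend list touches one row instead of joining a friendships table; the data layout is a simplification, not Twitter's schema:

    # Denormalization sketch (plain Ruby standing in for MySQL): friend IDs live
    # in one delimited column next to the user, so reading a friend list is a
    # single-row read with no join.
    users = {
      1 => { screen_name: "evan", friend_ids: "" }   # friend_ids is the denormalized column
    }

    def add_friend(users, user_id, friend_id)
      row = users[user_id]
      ids = row[:friend_ids].split(",")
      row[:friend_ids] = (ids << friend_id.to_s).join(",")
    end

    def friend_ids(users, user_id)
      users[user_id][:friend_ids].split(",").map(&:to_i)
    end

    add_friend(users, 1, 126)
    add_friend(users, 1, 127)
    p friend_ids(users, 1)   # => [126, 127], one read, no join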

9. Cache everything. Individual ActiveRecord objects are not currently cached, though; the lookups are already fast enough.

10. Test everything.

- You want to know, when you deploy, that everything works correctly.

- Twitter now has a full test suite, so when the caching broke, the problem could be found before going live.

11. Use exception notifiers and exception logging to get immediate notification of errors, so you can address them the right way.

12. Don't do stupid things.

- Scale changes what counts as stupid.

- Trying to load 3,000 friends into memory at once can bring a server down, but it works fine when there are only 4 friends (a batching sketch follows).
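
A tiny Ruby sketch of the batching point: walk a large friend list in fixed-size slices so memory stays bounded, instead of materializing all 3,000 records at once (the slice size of 100 is arbitrary):

    # Batching sketch: only one slice of the friend list is handled at a time.
    friend_ids = (1..3_000).to_a        # pretend these IDs came from the database

    delivered = 0
    friend_ids.each_slice(100) do |batch|
      delivered += batch.size           # deliver / notify just this batch
    end
    puts delivered                      # => 3000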

13. Most of the performance comes not from the language but from the application design.

14. Turn your site into an open service by creating an API. Twitter's API is a big reason for its success. It lets users build an ever-expanding ecosystem around Twitter. You can never do all the work your users can do, and you won't be as creative, so open up your system and make it easy for others to integrate their applications with yours.
