Napkin math for MongoDB performance

文章来源：http://rickosborne.org/blog/2010/02/napkin-math-for-mongodb-performance/

As we all know, there are lies, damned lies, and statistics. What I’m about to present shouldn’t even qualify as statistics—it’s just a bunch of damned lies. I’m not set up to do any sort of rigorous performance testing, so these should not be construed as anything but what they are: one guy’s half-assed and probably flawed measurements.

I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL’s COUNT(DISTINCT column) functionality. The short answer is: don’t do it . Or, if you do it, figure out a better way than I did. Along the way, I gathered some metrics on what types of operations cause what kinds of performance hits.

The Setup

My set up is a database of 3,397,115 records, all of which look something like this:

{

"_id" : 3002827,

"mm" : 7,

"stars" : 5,

"date" : "2005-07-18",

"dd" : 18,

"cust" : 2213,

"movie" : 14889,

"yy" : 2005,

"title" : "Species",

"year" : 1995

}

Yeah, I just took the Netflix prize data and inserted ~3M records. I did the inserts across 3 shard services, all running on the same machine, which led to 9 chunks of roughly equal size. I let MongoDB handle the sharding—I didn’t manually split the shards. I ensured one index on the collection, over movie and cust , which isn’t really used for the query in question, but I thought it was worth mentioning.

Yeah, I know performance is going to suffer because I’m running 3 shards from the same hard drive. That’s kindof the point.

I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks while running the tests, but nothing that should have interfered greatly.

The Queries

Here’s the starting query’s SQL equivalent:

SELECT releaseYear,

COUNT(*) AS nRecords,

COUNT(DISTINCT movie) AS mMovies,

COUNT(DISTINCT cust) AS cCustomers,

SUM(stars) AS totalStars,

AVG(stars) AS avgStars

FROM training

WHERE (releaseYear = 1990)

GROUP BY releaseYear

And the MapReduce query itself, as I wrote it:

db.runCommand({

mapreduce: "training",

query: {

year: 1990

map: function() {

var m = {}, c = {};

m[this.movie] = true;

c[this.cust] = true;

emit(

this.year,

{ "stars": this.stars, "n": 1, "m": m, "c": c }

)},

reduce: function(key, vals) {

var stars = 0, n = 0, m = {}, c = {};

for(var i = 0; i < vals.length; i++) {

var v = vals[i];

stars += v.stars;

n += v.n;

for (var im in v.m) m[im] = true;

for (var ic in v.c) c[ic] = true;

}

return { "stars": stars, "n": n, "m": m, "c": c };

finalize: function(key, val) {

val.avg = val.stars / val.n;

var m = 0, c = 0;

for (var im in val.m) m++;

for (var ic in val.c) c++;

val.m = m;

val.c = c;

return val;

out: "result1",

verbose: true

});

Those nasty bits with the for-in loops are for the COUNT(DISTINCT column) logic. This query produces the following result set:

{

"_id" : 1990,

"value" : {

"stars" : 593179,

"n" : 154617,

"m" : 7,

"c" : 120259,

"avg" : 3.8364410123078314

}

The Results

All times below are in mm:ss format. (Minutes, not hours.)

Query	Total Time	Shards Time	Final Function
1	10:44	03:46	06:58
This was the starting query above, as written.
2	90:48	36:26	54:22
I widened the release year restriction from just 1990 to 1990-1999, via { year: { $gte: 1990, $lte: 1999 } } . That's close to a linear relationship between emitted records and time elapsed.
3	21:33	13:53	07:40
I used movechunk to consolidate all of the chunks on one shard server, then shut down the other two. I reduced the release year restriction back to just 1990. It takes 2x longer than the first query, presumably due to disk bottlenecks? One shard trying to reduce 9 chunks at once?
4	02:08	02:08	-
I removed the for-in loops and COUNT(DISTINCT) logic, leaving only the plain record count and average, but was still on the one shard server, implying a 10x slowdown for that type of logic.
Query	Total Time	Map Time	Emit Loop
5	00:13	00:06	00:13
I connected to the one remaining shard directly, instead of through mongos , and ran the previous query (no for-in ). Again, this implies a 10x slowdown due to trying to process chunks simultaneously.
6	05:24	00:15	01:14
Still connected directly to the one shard (no mongos ) with all of the records, I ran the original query (with for-in logic). A slowdown of 25x seems a little high, but I ran the query twice to verify it.

Lessons Learned

Queries scream when a single shard is left to its own devices—but when parallelism is attempted on the same shard you get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.
Don't try to emulate COUNT(DISTINCT) . Really.

I have to wonder if mongos can be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues?

推荐阅读：MongoDB: Terrible MapReduce Performance

MongoDB's performance on aggregation queries

Is this Map Reduce performance normal or I am missing something

Napkin math for MongoDB performance

The Setup

The Queries

The Results

Lessons Learned

继续阅读

《Hive权威指南》第八章：HiveQL索引8 HiveQL：索引

MapReduce运行Wordcount时一直卡在INFO mapreduce.Job: Running job，web查看一直处于accepted阶段

MapReduce(一)：入门级程序wordcount及其分析

HiveQl语句应用实例：WordCount具体步骤如下：

用mapreduce计算wordCount和手机流量统计程序运行过程WordCount统计手机流量统计

Hadoop之运行wordcount

Eclipse运行WordCount（详细版）相关连接Eclipse运行WordCount

ROS机器人Diego 1#制作（五）base controller---角速度的标定

可变参数宏， Variadic Macros

Windows系统Qt5/mingw-64配置GSL科学计算库

GSL科学计算库——计算高斯-勒让德积分

专家访谈：搜索开源力量：Lucene技术前景

不用iconv函数实现UTF-8编码转换GB2312的PHP函数

MapReduce的几个企业级经典面试案例MapReduce的几个企业级经典面试案例

Ubuntu14.04 LTS下安装mongodb

无组件上传图片到数据库中，最完整解决方案