Sunday, October 7, 2012

MongoDB 2.2 Aggregation And MapReduce Performance

TL;DR - Given 1 million documents with readings across 20 sensors (1 reading per document), find the number of readings and their average per sensor. The aggregation framework does it in 4.1 seconds, map/reduce in 67.2 seconds, map/reduce with jsMode=true in 45.5 seconds, and reading all documents into PHP and doing it there takes 3.3 seconds. Conclusion: MongoDB's map/reduce is by far the slowest option here, for reasons that aren't obvious.

Load the initial data with
time mongo load_initial.js
which on my laptop took 1m32.271s.

// load_initial.js
//
// Create a collection of documents where each document is
// { "sensor_id": 4, "ts": 123456.78, "reading": 2.3 }

for(var i = 0; i < 1000000; i++) {
  var sensor_id = Math.floor(Math.random() * 20 + 1); // Random int 1-20.
  var ts = Math.random() * 3000000;                   // Random float in [0, 3000000).
  var reading = Math.random() * 100;                  // Random float in [0, 100).
  var obj = {"sensor_id": sensor_id, "ts": ts, "reading": reading};
  db.sensors.save(obj);
}
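As a quick sanity check that doesn't need a running mongod, the sensor_id expression can be exercised in plain JavaScript (runnable with node); it should only ever produce integers 1 through 20:

```javascript
// Standalone check of the sensor_id expression from load_initial.js.
function randomSensorId() {
  return Math.floor(Math.random() * 20 + 1); // same expression as above
}

var min = Infinity, max = -Infinity;
for (var i = 0; i < 100000; i++) {
  var id = randomSensorId();
  if (id !== Math.floor(id)) throw new Error("non-integer sensor_id: " + id);
  if (id < min) min = id;
  if (id > max) max = id;
}
console.log("min=" + min + " max=" + max);
```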
Run
mongo --eval 'db.sensors.count()'
as a sanity check, which on my laptop printed
MongoDB shell version: 2.2.0
connecting to: localhost:27017/test
1000000
For each sensor, let's count the number of readings and find the average reading. First let's do it using the aggregation framework:
// count_avg_by_sensor_aggregate.js
//
// For each sensor, find the number of readings, and their average value.

res = db.sensors.aggregate(
  {
    $group: {
      _id: "$sensor_id",
      "readings": {$sum: 1},
      "average": {$avg: "$reading"}
    }
  }
).result.sort(
  function(a, b) {
    return a._id - b._id;
  }
);

printjson(res);
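Conceptually, that $group stage is just a fold over the documents: count readings per key and keep a running total for the average. A minimal plain-JavaScript sketch of the same computation (the sample documents are made up for illustration; this is not how the server implements $group):

```javascript
// Plain-JavaScript equivalent of the $group stage above:
// group by sensor_id, counting readings and averaging the reading field.
var docs = [
  { sensor_id: 1, reading: 10.0 },
  { sensor_id: 1, reading: 20.0 },
  { sensor_id: 2, reading: 5.0 }
];

var groups = {};
docs.forEach(function (doc) {
  var g = groups[doc.sensor_id] ||
          (groups[doc.sensor_id] = { _id: doc.sensor_id, readings: 0, total: 0 });
  g.readings += 1;        // {$sum: 1}
  g.total += doc.reading; // running total, used for {$avg: "$reading"}
});

var result = Object.keys(groups).map(function (k) {
  var g = groups[k];
  return { _id: g._id, readings: g.readings, average: g.total / g.readings };
});

console.log(JSON.stringify(result));
```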
Run it with
time mongo count_avg_by_sensor_aggregate.js
I omitted the output, but this took 0m4.111s. Now let's do this with a map/reduce:
// count_avg_by_sensor_mr.js
//
// For each sensor, find the number of readings, and their average value.

if(typeof jsMode === "undefined") {
  jsMode = false;
}

res = db.sensors.mapReduce(
  function() {
    emit(this.sensor_id, { "readings": 1, "total": this.reading });
  },
  function(key, values) {
    var result = { "readings": 0, "total": 0 };
    for(var i = 0; i < values.length; i++) {
      result.readings += values[i].readings;
      result.total += values[i].total;
    }
    return result;
  },
  {
    "out": { "inline": 1 },
    "finalize": function(key, value) {
      return {
        "readings": value.readings,
        "average": value.total / value.readings
      };
    },
    "jsMode": jsMode
  }
).results;

newres = [];
for(var i = 0; i < res.length; i++) {
  newres.push(
    {
      // When jsMode is true, _id becomes a string for some reason.
      "_id": parseInt(res[i]._id),
      "readings": res[i].value.readings,
      "average": res[i].value.average
    }
  );
}

newres = newres.sort(
  function(a, b) {
    return a._id - b._id;
  }
);

printjson(newres);
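One thing worth noting about the script: mapReduce may call the reduce function repeatedly on partial results, so reduce must accept its own output as input and give the same answer however the values are grouped. The reduce function above has that property, which can be checked in plain JavaScript (node, no mongod; the emitted values are made up):

```javascript
// The reduce function from count_avg_by_sensor_mr.js.
function reduce(key, values) {
  var result = { readings: 0, total: 0 };
  for (var i = 0; i < values.length; i++) {
    result.readings += values[i].readings;
    result.total += values[i].total;
  }
  return result;
}

// Made-up emitted values for one key.
var emitted = [
  { readings: 1, total: 2.5 },
  { readings: 1, total: 4.0 },
  { readings: 1, total: 3.5 },
  { readings: 1, total: 10.0 }
];

// Reducing everything at once...
var allAtOnce = reduce(4, emitted);

// ...must match reducing in stages, as the server is free to do.
var staged = reduce(4, [reduce(4, emitted.slice(0, 2)),
                        reduce(4, emitted.slice(2))]);

console.log(JSON.stringify(allAtOnce), JSON.stringify(staged));
```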
Run it with
time mongo count_avg_by_sensor_mr.js
This took 1m7.171s. Now do the same thing, but use jsMode=true:
time mongo --eval 'jsMode=true' count_avg_by_sensor_mr.js
This took 0m45.459s. And finally, let's do the same calculation in PHP:
<?php

// count_avg_by_sensor.php
//
// For each sensor, find the number of readings, and their average value.

$mongo = new Mongo('mongodb://localhost:27017', array(
  'db' => 'test'
));
$db = $mongo->selectDB('test');
$sensors = $db->selectCollection('sensors');

$cursor = $sensors->find();
$result = array();
foreach($cursor as $doc) {
  $_id = (int)$doc['sensor_id'];
  if(!array_key_exists($_id, $result)) {
    $result[$_id] = array("readings" => 0, "total" => 0);
  }
  $result[$_id]["readings"] += 1;
  $result[$_id]["total"] += $doc['reading'];
}
ksort($result);

$newresult = array();
foreach($result as $_id => $value) {
  $newresult[] = array(
    "_id" => $_id,
    "readings" => $value["readings"],
    "average" => $value["total"] / $value["readings"]
  );
}

echo json_encode($newresult, JSON_PRETTY_PRINT);
Run it:
time php count_avg_by_sensor.php
This took 0m3.313s.
