Thursday, May 26, 2011

Better than linear scaling is possible

As part of my research for my Ph.D. thesis, I spent a lot of time
understanding the impact of CPU caches on the performance of a DBMS.
I concluded that in a parallel data server it is actually possible
to get better than linear scaling in certain workloads.

Consider a benchmark executing on 2 machines, each with 8 cores sharing a
2 MByte cache, giving a total of 4 MByte of CPU cache. Assuming that the
benchmark executes with a data set of 4 GByte, then 0.1% of the data fits
in the CPU caches. As the number of machines grows, the available CPU cache
also grows; with 32 machines we have 64 MByte of cache available, which
means that we can now store 1.6% of the data set in the CPU caches.
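
To make the arithmetic concrete, here is a small back-of-the-envelope sketch;
the per-machine cache size and the data set size are the assumptions from the
example above, not measurements:

#include <cstdio>

// Rough sketch of how the aggregate CPU cache grows with the cluster,
// using the numbers from the example above (assumptions, not measurements).
int main() {
  const double cache_per_machine_mb = 2.0;   // 8 cores sharing a 2 MByte cache
  const double data_set_mb = 4.0 * 1024.0;   // 4 GByte data set

  for (int machines = 2; machines <= 32; machines *= 2) {
    double total_cache_mb = machines * cache_per_machine_mb;
    double fraction = 100.0 * total_cache_mb / data_set_mb;
    std::printf("%2d machines: %6.1f MByte cache, %.2f%% of the data set\n",
                machines, total_cache_mb, fraction);
  }
  return 0;
}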

For benchmarks one mostly scales the data set size when increasing the number
of nodes in the system. This is not necessarily true in real-life applications.
For a real-life application the working set is largely constant; it can grow
over time as more customers join the service or for other reasons, but it
doesn't grow simply because you grow the number of machines in the database
cluster.

It's well known that there are many things that drive sublinear scaling, the
most important of which is the extra cost of communication in a larger
cluster. The number of communication lanes in a fully connected cluster is
n * (n - 1) / 2, so the number of lanes grows with the square of the number
of machines, O(n^2). The total communication only increases linearly with the
number of machines, which means that each lane carries linearly fewer bytes
in a larger cluster. Given that the communication cost is
fixed_cost + #bytes * cost_per_byte, the effective cost per byte sent
increases in a larger cluster, since there are smaller packets and thus fewer
bytes over which to amortize the fixed cost.
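
Here is a small sketch of this cost model; the fixed cost, the cost per byte
and the traffic per node are made-up illustrative numbers, only the trend
matters:

#include <cstdio>

// Sketch of the communication-cost argument above. The constants are
// illustrative values, not measurements of any real cluster.
int main() {
  const double fixed_cost = 2000.0;    // cost to send one packet
  const double cost_per_byte = 1.0;    // incremental cost per byte
  const double bytes_per_node = 1e6;   // traffic generated per node, per interval

  for (int n = 2; n <= 64; n *= 2) {
    double lanes = n * (n - 1) / 2.0;            // fully connected cluster
    double total_bytes = n * bytes_per_node;     // traffic grows linearly with n
    double bytes_per_lane = total_bytes / lanes; // each lane carries fewer bytes
    double eff_cost_per_byte = cost_per_byte + fixed_cost / bytes_per_lane;
    std::printf("n=%2d: %8.0f bytes per lane, effective cost per byte %.3f\n",
                n, bytes_per_lane, eff_cost_per_byte);
  }
  return 0;
}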

The above is one reason why sharding is a good idea: by partitioning the
problem we only use a subset of the communication lanes and thus avoid the
increased cost of communication as the number of machines grows. Obviously
sharding also imposes limitations on the type of queries you can handle
efficiently.

Now to some specific facts about MySQL Cluster and why we can obtain
better than linear scaling here (reported in earlier blogs here and here).
For reads we got 1.13M on 8 nodes, 2.13M on 16 nodes and 4.33M
on 32 nodes. For updates we got 687k on 4 nodes, 987k on 8 nodes
and finally 2.46M on 16 nodes. All the data in this benchmark was also
replicated.

The data nodes in MySQL Cluster use an architecture with up to 4 threads
that handle the local database work, each owning its own partitions. Next
we have one thread that handles the transaction coordinator role, and one
thread that takes care of the receive part of the communication. Finally we
have a set of threads taking care of file system communication. What this
effectively means is that as we grow the cluster size and the cost of
communication grows, each data node will consume more CPU power, but the
architecture of MySQL Cluster is done in such a way that this extra CPU
power is spent on its own CPU cores. Thus we simply use a bit more of the
CPU cores for communication when the cluster size grows.
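
As a conceptual sketch only (this is not the actual data node code, just an
illustration of how the roles map onto dedicated threads):

#include <cstdio>
#include <thread>
#include <vector>

// Conceptual sketch: each role gets its own thread, so extra communication
// work lands on its own CPU cores rather than stealing cycles from the
// threads doing the local database work.
void local_data_thread(int partition) { std::printf("local data thread, partition %d\n", partition); }
void transaction_coordinator_thread() { std::printf("transaction coordinator thread\n"); }
void receive_thread()                 { std::printf("receive thread\n"); }
void file_system_thread(int i)        { std::printf("file system thread %d\n", i); }

int main() {
  std::vector<std::thread> threads;
  for (int p = 0; p < 4; ++p)          // up to 4 threads owning their own partitions
    threads.emplace_back(local_data_thread, p);
  threads.emplace_back(transaction_coordinator_thread);
  threads.emplace_back(receive_thread);
  for (int i = 0; i < 2; ++i)          // a set of threads for file system I/O
    threads.emplace_back(file_system_thread, i);
  for (auto &t : threads) t.join();
  return 0;
}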

The benefit of this approach is that it is easy to scale the number
of CPU cores used for communication. Given that modern machines often
come with quite a high number of CPU cores, this means that as machines
get beefier, we can actually deliver better than linear scaling of the
workload by growing the number of data nodes in MySQL Cluster.

In MySQL Cluster each execution thread has its own scheduler.
This scheduler becomes more and more efficient as the load grows, for
two reasons. The first is that as the load grows, the queue gets longer
and we need to refill it fewer times, which means that we spend more time
executing the same code over and over again. Thus the instruction cache
for that code stays very hot and we train the branch predictor in the CPU
very well. We get this benefit both in the code refilling the queue and
in the code executing the actual database workload. The second is that
when the load is high we also avoid running code that checks for messages
when there are no messages around. Thus as the load increases the
efficiency increases and the actual number of instructions executed per
message decreases.
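
A minimal sketch of the batching effect, with hypothetical names; the real
scheduler in the data nodes is of course more involved:

#include <cstdio>
#include <deque>

// Sketch only: the point is that a longer queue means fewer refills and
// fewer empty poll cycles, so the hot execution path stays in the
// instruction cache and the branch predictor stays well trained.
struct Message { int op; };

int poll_network(std::deque<Message> &q) {
  // Placeholder for the receive side handing over a batch of messages.
  for (int i = 0; i < 10; ++i) q.push_back({i});
  return 10;
}

void execute(const Message &m) {
  // Executing many messages back to back keeps this code path hot.
  std::printf("executing op %d\n", m.op);
}

int main() {
  std::deque<Message> job_queue;
  for (int round = 0; round < 3; ++round) {
    if (job_queue.empty())
      poll_network(job_queue);     // only check for new messages when idle
    while (!job_queue.empty()) {   // drain the whole batch in one go
      execute(job_queue.front());
      job_queue.pop_front();
    }
  }
  return 0;
}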

When I presented this at the defense of my Ph.D. thesis it was only a
theory. In the real world it's very uncommon to see the effect of CPU
caches and similar effects outweigh the added burden of a larger cluster.
However, I have seen it twice in my career. The first was a benchmark
performed in 2002 on a very large computer where we hosted 32 nodes
(single-CPU nodes in those days) and 23 benchmark applications.
Here we scaled from 0.5 million to 1.5 million going from
16 to 32 nodes. Also in the results presented at the
MySQL Users Conference and in my previous blogs we achieved better
than linear scaling, in particular for the write benchmark, but also
to some extent for the read benchmarks. I am sure the above isn't the
entire explanation of these effects, but the things explained above
certainly play a role.