Thursday, April 23, 2009

Join Executor for MySQL Cluster

Jonas in the Cluster team reported on his work on executing
joins in the NDB kernel for MySQL Cluster here.

This is very interesting work we have in progress at MySQL.
We are working on an extension of the Storage Engine API
where the MySQL Server will present an Abstract Query Tree
to the Storage Engine. The Storage Engine can then decide to
execute the query on its own or decide that the MySQL Server
should execute it in the classic manner. In the first prototype
the optimisation will be done as usual, and only after the
optimisation phase will we present the join to the storage
engine. However, the specification also covers work on
integrating this with the optimiser, as well as enabling the
storage engine to execute parts of the query rather than the
entire one. The specification of this work can be found here.
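
To make this concrete, here is a minimal sketch of what such a
handler hook could look like. The names (Abstract_query_tree,
push_query and so on) are purely hypothetical illustrations, not
the actual interface from the specification:

  // Hypothetical sketch of the Storage Engine API extension;
  // none of these names are the real interface.
  struct Abstract_query_tree;   // join tree built by the server

  enum push_result {
    PUSH_ACCEPTED,  // engine executes the pushed (part of the) query
    PUSH_REFUSED    // server executes it in the classic manner
  };

  class handler_ext {
  public:
    // Called after optimisation: the engine inspects the query
    // tree and decides whether it can execute it natively.
    virtual push_result push_query(const Abstract_query_tree *aqt) = 0;
    virtual ~handler_ext() {}
  };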

Jonas is working on the backend part for this interface in
MySQL Cluster.

What is interesting about pushing joins to the NDB kernel is that
it becomes very easy to parallelize the join execution. So what
will happen when this feature is ready is that MySQL Cluster
will shine on join performance and enable very good
performance for all sorts of applications using SQL.

The reason that MySQL Cluster can so easily parallelize the query
execution of the join is the software architecture of the
NDB kernel. The NDB kernel is entirely developed as a message
passing architecture. To fork a new thread of execution in the
NDB kernel one simply sends two messages while executing one
message, and to stop a thread of execution one simply doesn't
send any messages while executing a message. The problem is thus
rather to not parallelize too much, so that one doesn't run out
of resources in the system.
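
The following toy model (my illustration, nothing like the real
NDB signal handling code) shows the idea: each message handler
can emit zero, one or two follow-up messages, which terminates,
continues or forks a logical thread of execution.

  #include <deque>
  #include <functional>

  // Toy message-passing kernel: a "message" is just a callable.
  struct Kernel {
    std::deque<std::function<void(Kernel &)>> queue;

    void send(std::function<void(Kernel &)> msg) {
      queue.push_back(std::move(msg));
    }

    void run() {
      while (!queue.empty()) {
        auto msg = std::move(queue.front());
        queue.pop_front();
        msg(*this);  // handler may send() 0, 1 or 2 new messages:
                     // 0 stops this thread of execution, 1 continues
                     // it, 2 forks it into two parallel ones
      }
    }
  };

A join scan could, for instance, fork by sending one scan message
per partition, and each branch simply stops sending messages once
its partition is exhausted.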

So with this development MySQL Cluster will also shine at
Data Mining in an OLTP database. MySQL Cluster is designed for
systems where you need massive amounts of read and write
bandwidth (the cost of writing your data is close to the cost
of reading it). So with the new features it will be possible
to do Data Mining on data updated in real-time. Most Data
Mining is performed on a specialised Data Warehousing solution,
but to achieve this you need to transfer the data to the Data
Warehouse. With MySQL Cluster it will be possible to use the
database for OLTP applications with heavy updates constantly
occurring, while simultaneously querying the data with parallel
queries. MySQL Cluster is very efficient at executing individual
queries in the NDB kernel and can also scale to very many
machines and CPU cores.

Data on MySQL Performance

If you like to sift through tons of benchmark data about various
MySQL versions, Dimitri at the Sun Benchmark Labs has published
a serious amount of benchmark data in a report published here.

The report shows that the new MySQL 5.4.0 release
has very good performance. The report also shows what the day
of a developer of performance improvements looks like, and the
massive amount of benchmark data that needs to be analysed and
sifted through to understand the impact of new performance
improvements.

I personally met Dimitri for the first time in 2002 when I was
working together with him for a couple of weeks on a benchmark on
NDB Cluster (the storage engine of MySQL Cluster). Our goal then
was to perform 1 million reads per second on a 72-CPU SPARC box
with UltraSparc-III CPUs @900MHz. We struggled a lot at the time
but finally we managed to achieve the numbers we were hoping for.
We actually surpassed the goal and reached 1.5 million reads per
second, and we also tried an update benchmark where we managed to
do 340,000 update transactions per second (generating a disk
write bandwidth of 250 MByte per second).

This benchmark was interesting from a scientific point of view.
When I defended my Ph.D. thesis I claimed that one can get
superlinear performance increases when adding more CPUs to a
problem in the database world. To achieve this the workload needs
to be kept constant while the number of CPUs is increased. By
increasing the number of CPUs and keeping the workload constant,
more CPU cache memory in total is applied to the problem. This
means that each CPU will execute more efficiently.

In the above benchmark we managed to verify this claim from my
Ph.D. defence, which I found very positive. The results we
achieved on a 16-node cluster were 500,000 reads per second,
while on a 32-node cluster we reached 1,500,000 reads per second.
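
Spelled out, the superlinearity is easy to see:

  nodes:      32 / 16             = 2.0x
  throughput: 1,500,000 / 500,000 = 3.0x

Doubling the hardware tripled the throughput, i.e. clearly more
than linear scaling.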

Dimitri has a background in developing his own homegrown
database, so we have had many interesting discussions, both then
and now, on how to achieve the best performance from NDB and
the MySQL Server.

Wednesday, April 22, 2009

MySQL 5.4 Patches: Improvements to spin-loop

InnoDB contains an implementation of both mutexes
and RW-locks. The RW-lock implementation has been
improved by the Google SMP patches. Both of these
implementations rely on spin-loops. The default
behaviour in InnoDB is to check the condition and,
if it's not ok to enter, to spin for about
5 microseconds and then come back to check the
condition again.

If one reads the Intel manual on how to do spin-loops,
they propose to use a PAUSE instruction and then
check the condition again, i.e. a much more active
checking of the condition. When we tried this out
using the sysbench benchmark we found that the
Intel approach worsened performance. So instead
we tried the approach of putting the PAUSE
instruction into the InnoDB spin-loop itself.
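
A minimal sketch of the resulting loop (my illustration, not the
actual InnoDB code; function and parameter names are made up):

  #include <atomic>
  #include <emmintrin.h>  // _mm_pause() on x86

  // Spin on a lock word, issuing PAUSE inside the delay loop
  // rather than re-checking the condition after every PAUSE.
  static void spin_until_free(std::atomic<bool> &locked,
                              unsigned spin_rounds)  // tuned default
  {
    while (locked.load(std::memory_order_relaxed)) {
      // Delay loop: PAUSE is cheap on the CPU pipeline and avoids
      // memory-order violations when the lock word finally changes.
      for (unsigned i = 0; i < spin_rounds; i++)
        _mm_pause();
    }
  }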

This approach turned out to be a success. Even on
machines with only one thread per core we were able
to get a 3-4% increase in throughput. We also tried
various settings for the default spin time in the
spin-loop and found that the original default values
were very close to the optimum: the optimum was about
20% away from the old defaults, so we made this slight
change to the default values of the spin-loop.

It's my expectation that this optimum value will
change as we remove locks, so that mutexes and
RW-locks become less contended and the waiting is
spread over more locks. The current best setting
is very likely governed by the fact that most
waiting happens on very hot locks. So with
improvements to the mutexes and RW-locks we should
expect to see better performance with a shorter
time in the spin-loop.

On the new SPARC CPUs that Sun has developed, the
CMT boxes, we used the results from the paper
www.ideal.ece.ufl.edu/workshops/wiosca08/paper2.pdf,
which stated that the optimum instruction to use
is a cache miss instruction. However, as I don't
know how to program a cache miss instruction, we
opted for the second best instruction, which was a
dummy test-and-set instruction. So the PAUSE
instruction is replaced by a test-and-set instruction
on SPARC CPUs.
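
A sketch of what such a dummy test-and-set delay can look like
(my illustration, not the actual patch; on SPARC the GCC builtin
below typically compiles to the ldstub test-and-set instruction):

  // Dummy test-and-set on a scratch byte; the result is thrown
  // away, we only want the instruction's effect on the pipeline.
  static inline void cpu_relax(void)
  {
    static volatile unsigned char scratch;
    (void) __sync_lock_test_and_set(&scratch, 1);
  }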

We expect that the improvement due to this small
change is even bigger when there are multiple
threads per core, since contention on the CPU
pipeline is higher in those cases and it is
important that the spin-loop stays away as much
as possible from actively executing instructions.

Tuesday, April 21, 2009

MySQL 5.4 Scaling to 16-way x86 and 64-way CMT Servers

The release of MySQL 5.4 contains patches which increase
the scalability of the MySQL Server. I am planning to blog
about those changes in some detail over the next few days. This
blog will give an introduction and show the overall results we
have achieved.

The changes we have done in MySQL 5.4 to improve scalability and
the ability to monitor the MySQL Server are:

1) Google SMP patch
2) Google IO patches
3) Update of many antiquated defaults in the MySQL Server
4) New InnoDB Thread Concurrency algorithm
5) Improved Spinloop in InnoDB mutexes and RW-locks
6) A couple of performance fixes backported from 6.0
7) Operating system specific optimisations
8) Ported the Google SMP patch to Solaris x86 and SPARC, with
work underway for Windows and the Intel compiler as well
9) Introducing DTrace probes in the MySQL Server
10) A build script to make it easier for the community to build
an efficient MySQL Server based on source code

As an example of the improvements made available through this work we
have some benchmarks using sysbench readwrite and readonly.

We have consistently seen improvements on the order of 30-40% in
top sysbench numbers, and at high thread counts 5.4.0 drops
much less in performance than 5.1. The new InnoDB Thread
Concurrency patch makes the results at high thread counts even
more impressive: those results have gone up by another 5-15% at
the expense of 1% on the top results (there are even some DBT2
runs that gave a 200% improvement with the new algorithm).

There is also a benchmark on EAStress which shows a 59% increase in
performance from 5.1 to 5.4 using the new 16-way x86 Nehalem servers.

MySQL Cluster 7.0 scales linearly in two dimensions

As recently reported on my blog we have managed to get MySQL Cluster CGE 6.3 to scale linearly with the addition of more nodes into the system.
In MySQL Cluster CGE 6.3 each node has a single thread handling most of
the work together with a set of file system threads.

In MySQL Cluster 7.0 the data nodes are now multithreaded. The
design in 7.0 follows the very efficient design of 6.3, where
each thread has absolutely no lock contention with other threads.
All communication between threads happens through messages. This
means that the scalability of the data nodes is excellent. The
single thread has been split into up to four local data threads,
one transaction handling thread, and one socket communication
thread, plus the already existing file system threads. With this
set-up each data node can process 4.6X more DBT2 transactions
compared to 6.3.
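
As an illustration, a fragment of a cluster config.ini for the
multithreaded data node (ndbmtd) could look as follows; the
values are examples only and need to be tuned per system:

  [ndbd default]
  NoOfReplicas = 2
  # Lets the data node run the full set of threads described
  # above (up to four local data threads plus transaction and
  # socket threads).
  MaxNoOfExecutionThreads = 8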

This means that a 2-node cluster in 7.0 has the same performance
as a 10-node cluster in 6.3, and a 4-node cluster has similar
performance to a 20-node cluster in 6.3. As blogged earlier, each
data node can handle many hundreds of thousands of operations per
second, so a cluster of such nodes can handle many millions of
operations per second.

The efficiency of the data node is such that one data node can
handle the traffic from a set of MySQL Servers residing on a
24-core server. So an example of a basic set-up for MySQL Cluster
7.0 is to use 2 8-core boxes with lots of memory and lots of disk
bandwidth for the data nodes, and 2 24-core servers for the MySQL
Servers, which mostly require CPU and networking bandwidth.

An important consideration when setting up MySQL Cluster 7.0 is
to ensure that interrupts from the network stack don't kill
performance, and also to have separate network infrastructure
between the data nodes in the cluster, since it is very easy to
overload the network given the capabilities of the MySQL Cluster
software.

MySQL 5.4 Acknowledgements

The work started when MySQL was acquired by Sun has now begun to
bear fruit. Very soon after the acquisition a Sun team was formed
to assist the MySQL performance team in improving the scalability
of the MySQL Server. At the same time Google has been very active
in improving the scalability of InnoDB. The MySQL 5.4 scalability
improvements are very much the result of the efforts of the MySQL
performance team, the Sun performance team and Google.

It's extremely fruitful to work with such a competent set of
people. The Sun team has experience from scaling Oracle, DB2,
Informix and so forth and knows extremely well how the
interaction of software and hardware affects performance. The
Google patches have shown themselves to be of excellent quality.
In our early internal testing we found two bugs, and both had
already been fixed by the Google team, so the turnaround time was
a day or two. For the last months we haven't found any issues.
The MySQL performance team has also been able to add a few small
but effective improvements on top of the Google patches.

MySQL 5.4 also introduces DTrace support in the MySQL Server.
This code is the result of a cooperation with the MySQL 6.0
development team; the original patch was developed for 6.0. We
have spent quite some time getting the DTrace support working on
all variants of the Solaris and Mac OS X platforms. For anyone
interested in getting DTrace probes into their application, I
think the MySQL example is probably the most advanced example
currently available of user-level DTrace probes and of building
such probes into a complex build system.
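
For readers who haven't embedded static probes before, this is
roughly what it looks like at the C level (a generic <sys/sdt.h>
sketch with made-up provider and probe names, not the actual
MySQL probe macros):

  #include <sys/sdt.h>

  void dispatch_query(const char *query, unsigned long conn_id)
  {
    /* Fires when a query starts; visible to dtrace(1M) scripts. */
    DTRACE_PROBE2(myapp, query__start, query, conn_id);
    /* ... execute the query ... */
    DTRACE_PROBE2(myapp, query__done, query, conn_id);
  }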

Working with competent and motivated people is always great fun,
so this has been a very rewarding project for me personally. I
have always liked working on performance improvements; in my work
on founding MySQL Cluster we were involved in many such efforts,
and so far they have almost always been successful. Actually,
we're releasing a new version of MySQL Cluster, 7.0, now as well,
with its own set of extreme performance improvements, which I
will mention in a separate blog.