Wednesday, December 03, 2008

LOCK_open, THE mutex :)

In all my days at working at MySQL the
LOCK_open mutex have always been a
key mutex to understand, now that I'm
working on scalability improvements of
the server it's as important to change
this mutex into something less contentious.

So last week I finally decided to start
thinking about how we can resolve this
mutex which is at the heart of the MySQL
Server. In principle the idea is that
LOCK_open has been used to protect a
hash table with all the open tables in
the MySQL Server. However it has been
used for many other purposes as well.
So it's not trivial to move around it.

However the main scalability problem
with LOCK_open is the hash lock it
provides. So what to do about it?

My current thinking is that a four-thronged
approach will do the trick.

1) Divide and concquer, perform the hash calculation
outside of the mutex and divide the hash into
e.g. 16 smaller hashes. This creates one problem
which is how to prune the open table cache.
Obviously there is no longer a simple linked
list where I can find the oldest entry. This
problem I'm still contemplating, there's
probably already a number of known good solutions
to this problem since I find it popping up in
almost every similar design. So it's a problem
looking for a solution pattern.

2) Shrink the amount of data it protects
by only allowing it to protect the hash table
and nothing more. This means e.g. that some
counters need to be updated with atomic
instructions instead.

3) Shrink the time it is protected by inserting
the table share into the hash rather than the
table object (this is actually Monty's idea).

4) Use a different technique for the lock that
works better for short-term locks (usually
spinlocks are more successful here).

A combination of these techniques will hopefully
make it possible to decrese the impact of
LOCK_open on the server code.

Recovery features for ALTER TABLE of partitioned tables

A feature which hasn't been so public about the implementation
of partitioning is the support for atomicity of many ALTER TABLE
statements using partitioned tables.

This atomicity exists for
ALTER TABLE ADD PARTITION ....
ALTER TABLE REORGANIZE PARTITION ...
ALTER TABLE DROP PARTITION ...
ALTER TABLE COALESCE PARTITION

Given that partitioning often works with very large tables it
was desirable to have a higher level of security for ALTER TABLE
of partitioned tables. To support this a DDL log was implemented.
This DDL log will in future versions be used also for many other
meta data statements. The DDL log will record all files added,
renamed and dropped during an ALTER TABLE command as above.

The design is done in such a way that the ALTER TABLE will either
fail and then all temporary files will be removed (even in the
presence of crashes of MySQL Server). Otherwise the ALTER TABLE
will succeed even if not all old files have been removed at
the time of crash. The DDL log will be checked at restart of
MySQL Server and will REDO or UNDO all necessary changes to
complete the ALTER TABLE statement.

Given that MySQL Server crashes isn't likely to happen very often
in customer environments it was also desirable to add error
injection to the MySQL Server for testing purposes.

Here is a short cut from the file sql_partition.cc that displays
what happens here:

if (write_log_drop_shadow_frm(lpt) ||
ERROR_INJECT_CRASH("crash_drop_partition_1") ||
mysql_write_frm(lpt, WFRM_WRITE_SHADOW) ||
ERROR_INJECT_CRASH("crash_drop_partition_2") ||
write_log_drop_partition(lpt) ||
ERROR_INJECT_CRASH("crash_drop_partition_3") ||

At each ERROR_INJECT_CRASH it is possible to prepare
MySQL Server such that it will crash at this point in
the next statement using dbug statements that can
be issued also as SQL statements now.

So here one can see that we first log preparatory
actions, insert a test point, continue with the
next step of ALTER TABLE, insert a new test point,
write the next log entry, insert new test point,
and so forth.

With this recovery mechanism the new ALTER TABLE
statements should not cause problems with the
partitioned table after the ALTER TABLE even in
the presence of crashes in the middle of the
ALTER TABLE statement.