Infinite MySQL

The Puzzle of Infinite MySQL



The status of this is 'write-up', but the puzzle is real and the solution is real too

  • 'Infinite' is a strong word. No resource is infinite.
  • The original puzzle was given to me by one of Yahoo's executives several years ago (c. 2007/2008), in the form "how can we scale MySQL to billions of rows?"
  • At that moment I had no pragmatic answer. Today I do have a pragmatic answer.
  • There are some bottlenecks, obviously, so the "right" architecture has to deal with those bottlenecks.
  • There is not one bottleneck, but many.
  • Some bottlenecks can (and should) be eliminated by only allowing a subset of SQL. That's OK.
  • I've seen some people deal with some of the bottlenecks, but I am certain that at least one (important) piece of the puzzle is missed by many.
  • Maybe it's missed by everybody.
  • That bottleneck sits at the software/physical layer - usually the hardest kind.
  • The principal architecture is, of course, map/reduce: a controlling node maps the query out and then reduces the results.
  • At the moment I know of at least 3 existing solutions for the controlling node.
  • I wrote one such component myself - in 1-2 weeks - in 2006. I was about to release it as Open Source, but Open Source people were already in "hired bullies" mode (the 2008 wave), so it did not happen.
  • It does not matter in the modern world, though, because there are already several MySQL map/reduce components to pick from, so if I had to do it today I would piggy-back on one of those instead of writing my own.
  • Progress is good.
  • The resulting solution consists of several parts.
    • A controlling node not written by us (but it has to be tweaked by us).
    • A bunch of MySQL workers (as-is, InnoDB) - we can't afford to write a storage engine from scratch.
    • The OS layer - we do tweak it.
    • A small set of orchestration scripts.
    • Backup and/or replication - handled by the orchestration layer. Could be seen as a separate subsystem.
    • There is a twist related to the 'many writes' scenario. It can be addressed, but it implies more coding (and would need integration of another component).
  • This design is similar to the design of the HDFS replacement.
  • The design worked for an 'infinite file system' - it should work for 'infinite MySQL'.
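The controlling-node shape described above can be sketched in a few lines. This is only an illustration under assumptions: sqlite3 in-memory databases stand in for the MySQL/InnoDB workers, and all table and function names are made up.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pool: sqlite3 in-memory databases stand in for
# the MySQL/InnoDB workers; in reality each would be a MySQL connection.
workers = [sqlite3.connect(":memory:", check_same_thread=False) for _ in range(3)]
for i, w in enumerate(workers):
    w.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
    w.executemany("INSERT INTO t VALUES (?, ?)",
                  [(i * 100 + n, n) for n in range(10)])

def map_query(sql):
    # "map": run the same restricted-SQL query on every worker in parallel
    with ThreadPoolExecutor(len(workers)) as pool:
        return list(pool.map(lambda w: w.execute(sql).fetchall(), workers))

def reduce_sum(partials):
    # "reduce": combine per-worker partial aggregates on the controlling node
    return sum(rows[0][0] for rows in partials)

total = reduce_sum(map_query("SELECT SUM(val) FROM t"))
```

The restriction to a subset of SQL shows up here: only queries whose partial results compose (SUM, COUNT, MIN/MAX, filtered scans) fan out this cleanly; arbitrary joins across workers would not.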

December 14, 2014

Background

  • Univca bashing on memcached mailing list

    In 2006, while employed by Live Nation, I was asked to write a memcached replacement that would actually be robust and 'just work'. At that point memcached was a rather questionable piece of software with zero support. Zero support because the original author of memcached had abandoned its code. You can find it all in the archives, I guess. Like I did in 2006. Anyway. There was no support for memcached, there was not a single stress test, but there were (many) bugs. There were bugs in memcached's home-grown network protocol. There were bugs in memcached's home-grown memory allocator. The only expiration policy was LRU - so it pushed the implementation of complex expiration policies onto the outside app. Likewise, re-fetching data and placing it into the cache - memcached pushes that to the outside too. That creates race conditions - especially if you need to deal with expiration of objects that depend on each other. Compare this to a 'universal' cache that handles all those complexities on its own and frees developers from dealing with those issues. The Live Nation engineering management (the best I've met in the SFBA) decided not to gamble their enterprise careers by deploying memcached as a cornerstone of an enterprise website. Compare that to Digg's engineering management, who at some point made a comparable gamble on Cassandra. And lost. You can find it all in the archives. Fun stuff.


    Long story short: I implemented a simple HTTP 1.1 protocol (because of keep-alives) - to make use of a Netscaler if we ever needed load-balancing (we turned out fine without it). Univca version 1 supported everything memcached did (get/put, etc.). The main difference was that Univca was stress-tested/benchmarked with AB. That was possible only because Univca used HTTP for a protocol - so I got the stress-test system for free. A stress-test system means that I can deploy the component into an enterprise website and sleep well, knowing the website will not collapse. Engineering approach vs bullshit approach, basically. That was before 2008, of course. I also compared php/memcached vs php/Univca benchmarks running for days - there was no performance difference. So all that noise about memcached being super efficient - it's mostly bullshit. It is not super efficient. Anybody can take a stock STL hashtable, slap an HTTP server on top, and get a (robust) memcached clone in 1-2 weeks. Try it. Another interesting thing that came out of running C++ code for days was the impact time has on the performance of a HashMap vs a BinaryTree. I suggest you try that too - it was a bit surprising. But then again - not many people have the task of stress-testing a caching server, right?
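The "stock hashtable plus an HTTP server" shape is easy to demonstrate. Univca itself was C++; this Python sketch only illustrates the idea (a dict behind HTTP 1.1 GET/PUT, keep-alive on, so ab can hit it), with all names being mine, not Univca's.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# A plain dict + lock stands in for the "stock STL hashtable"; HTTP 1.1
# gives keep-alives and lets ab(1) serve as the stress-test harness.
store, lock = {}, threading.Lock()

class CacheHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # keep-alive by default

    def do_GET(self):
        with lock:
            value = store.get(self.path)
        if value is None:
            self.send_response(404)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Length", str(len(value)))
            self.end_headers()
            self.wfile.write(value)

    def do_PUT(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        with lock:
            store[self.path] = body
        self.send_response(204)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):   # keep benchmark runs quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), CacheHandler)
```

After `server.serve_forever()` is running, `ab -k -n 100000 http://127.0.0.1:PORT/somekey` exercises it the same way AB exercised Univca - which is the whole point of using HTTP as the cache protocol.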


    After version 1, I implemented MySQL read-through - so that Univca goes directly to the MySQL server if a key/value is missing from the cache. Then, in another couple of weeks, I implemented XSP - a PHP-to-C++ cross-compiler of sorts. I re-used the caching component from Univca, but I also had to piggy-back on one more C++ component for that. That was before HipHop. The performance gains of XSP were measured in 1000s of percent. We switched parts of the Live Nation production website from Zend+memcached to XSP and went from 50 hps to 5000 hps - measured by AB on the local network - with a hit/miss ratio close to 100%, on purpose, because that was the business case. That is, naturally, way better than HipHop. That is, primarily, because HipHop does not optimize the things that are worth optimizing.
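Read-through is the key difference from plain memcached: on a miss, the cache fetches from the database itself instead of pushing that (and its race conditions) onto the app. A minimal sketch, with sqlite3 standing in for the MySQL backend and all names being illustrative:

```python
import sqlite3

# sqlite3 stands in for the MySQL backend; table/key names are made up.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO kv VALUES ('greeting', 'hello')")

cache = {}

def get(key):
    # Read-through: on a miss, the cache itself queries the database
    # and populates itself, so callers never race to re-fetch.
    if key not in cache:
        row = db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        cache[key] = row[0] if row else None
    return cache[key]
```

Note that even the negative result (`None`) is cached, so a hot missing key does not hammer the database.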


    I'm writing this down because I've been asked about it a lot recently. Also, because it shows how long it took me to arrive at the 'infinite MySQL' architecture - there was some preceding work. To be precise, it all started several years earlier, when a little SFBA startup asked me to implement the fastest possible XSLT engine. The way we did it was cross-compiling XSLT stylesheets into C code. One can do some cool things that way. That's why Steve Jobs was trying to shit on cross-compilers ("Steve Jobs says cross-compilers (like Flash CS5) make sub-standard apps").


    Maybe somebody finds this interesting. Eventually.

  • Update

    For a while now I have been waiting for somebody (anybody) to take the next natural, generalizing step after memcached. Alas, nobody did. They are all rehashing memcached, not adding much power to it. The next step is (naturally) to blend the caching layer with a declarative layer, like 'make' did. Patentable, clearly.
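One reading of "blend the caching layer with a declarative layer, like 'make' did" is: each cached key declares its dependencies and a rebuild rule; invalidation cascades to dependents, and reads rebuild lazily in dependency order. A sketch of that reading - all names and the API are my own invention, not a description of any existing system:

```python
# A make-style declarative layer over a cache: each key declares its
# dependencies and a rebuild rule; invalidating a key invalidates all
# dependents, and reads rebuild lazily. All names are illustrative.
rules = {}    # key -> (deps, build function)
cache = {}

def rule(key, deps=()):
    def register(fn):
        rules[key] = (deps, fn)
        return fn
    return register

def get(key):
    if key not in cache:
        deps, fn = rules[key]
        cache[key] = fn(*[get(d) for d in deps])  # build deps first, like make
    return cache[key]

def invalidate(key):
    cache.pop(key, None)
    for k, (deps, _) in rules.items():
        if key in deps:
            invalidate(k)               # cascade to dependents

@rule("user")
def load_user():
    return {"name": "alice"}

@rule("page", deps=("user",))
def render_page(user):
    return f"hello, {user['name']}"
```

This is exactly the race the plain-memcached model leaves to the app: when `user` expires, `page` must expire with it, and here the declarative layer handles that instead of the caller.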

    September 08, 2019