When Apple acquired FoundationDB, nothing was published about the price, but a couple of venture capital firms had previously invested some 22 million dollars in FoundationDB. It is hard to imagine them going home with less.
We were curious because we had never before restricted the usually complex data models in DVTDS to simple flat tables with a single column. In 2014 FoundationDB had reached 14.4 million transactional updates per second on a cluster of 32 c3.8xlarge virtual machines in the Amazon cloud, hosting one billion subscribers. The respective blog article has since been removed, probably in line with Apple's information policy, but traces of it can still be found in several places. Just google for "FoundationDB" and "14.4 million".
Extrapolation from our existing benchmark suites suggested that DVTDS would be faster when operated as a simple key-value store, just like FoundationDB. But proving it was somewhat harder than expected. At these levels of throughput, memory architectures and disk subsystems reach their limits, and even as database experts we breathe thinner air there, same as everyone else.
One problem was the RFC 5805 compliant transaction protocol. In the LDAP world this is the standard every directory server should comply with. But it contains redundant elements, and worse, these parts of the protocol require synchronous operation across the distributed database, introducing superfluous latencies. The larger the cluster, the more they show up.
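To give a feeling for the cost (a simplified back-of-the-envelope model with assumed latencies, not a measurement of DVTDS or FoundationDB): every additional synchronous round trip a transaction must wait for puts a hard ceiling on the rate a single client connection can achieve, and that ceiling drops as inter-node latency grows.

```python
# Simplified model: per-connection transaction rate as a function of the
# number of synchronous round trips a transaction protocol requires.
# All latency values are illustrative assumptions.

def max_tps_per_connection(rtt_ms: float, sync_round_trips: int) -> float:
    """Upper bound on transactions per second for one client connection
    that must wait for every round trip before issuing the next one."""
    return 1000.0 / (rtt_ms * sync_round_trips)

for rtt_ms in (0.1, 0.5, 1.0):        # intra-rack vs. cross-datacenter style latencies
    for round_trips in (1, 3):        # 1 = streamlined commit, 3 = e.g. start / update / end
        tps = max_tps_per_connection(rtt_ms, round_trips)
        print(f"RTT {rtt_ms:4.1f} ms, {round_trips} round trips -> "
              f"max {tps:8.0f} tx/s per connection")
```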
The next issue was the behavior of Amazon's virtual machines. A c3.8xlarge has 32 virtual cores mapped to 16 physical Intel Xeon cores and their hyperthreads. Nothing special, we believed; DVTDS's internal multi-threaded design should make maximum use of such hardware. But in contrast to Amazon's newer c4.8xlarge (Haswell / DDR4) instances, the older Ivy Bridge / DDR3 system did not scale above eight threads. Last year we had hit the same point with a physical two-socket / 24-core server, but we hoped that the virtualization layer would mitigate the problem. We were wrong.
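A quick way to check whether such a ceiling comes from the memory subsystem rather than from the database itself is to run a purely bandwidth-bound workload with an increasing number of worker processes and watch where the aggregate rate flattens out. The sketch below is our illustration of such a probe, not the tool we actually used; buffer size and worker counts are assumptions.

```python
# Minimal memory-bandwidth scaling probe (illustrative only).
# Each worker repeatedly copies a large buffer (essentially a memcpy), so the
# aggregate rate is dominated by memory bandwidth, not by Python bytecode.
import multiprocessing as mp
import time

BUF_MB = 256          # per-worker buffer size; reduce on machines with little RAM
SECONDS = 3.0         # measurement window per run


def worker(result_queue):
    buf = bytearray(BUF_MB * 1024 * 1024)
    copied = 0
    deadline = time.monotonic() + SECONDS
    while time.monotonic() < deadline:
        _ = buf[:]                      # full copy of the buffer (memcpy-like)
        copied += BUF_MB
    result_queue.put(copied / SECONDS)  # MB/s achieved by this worker


if __name__ == "__main__":
    for nproc in (1, 2, 4, 8, 16, 32):
        q = mp.Queue()
        procs = [mp.Process(target=worker, args=(q,)) for _ in range(nproc)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        total = sum(q.get() for _ in procs)
        print(f"{nproc:2d} workers -> {total / 1024:6.1f} GB/s aggregate")
```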
Finally there was Amazon's disk quota system. In a way it is fair, because you get more performance the more you pay. But it ties I/O performance to the amount of allocated space: if you want acceptable throughput you have to lease immense storage capacities, which in turn leads to high charges.
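To make that concrete, here is a rough sizing calculation based on the 3 IOPS per GiB baseline that EBS gp2 volumes offered at the time; the price per GiB-month is an assumption and both figures have changed since.

```python
# Rough EBS sizing arithmetic: how much gp2 capacity must be allocated
# just to reach a target IOPS level. The 3 IOPS/GiB baseline is the gp2
# figure from that era; the price per GiB-month is purely illustrative.
IOPS_PER_GIB = 3
PRICE_PER_GIB_MONTH = 0.10          # USD, assumed

def gib_needed(target_iops: int) -> float:
    return target_iops / IOPS_PER_GIB

for target in (3_000, 10_000, 20_000):
    gib = gib_needed(target)
    print(f"{target:6d} IOPS -> {gib:8.0f} GiB allocated "
          f"(~${gib * PRICE_PER_GIB_MONTH:7.2f} per month per volume)")
```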
So the whole thing took longer than expected. We made the following adjustments:
- Transaction protocol: DVTDS now supports a streamlined protocol whose performance is no longer sensitive to the number and distance of database nodes. The solution is fully backward compatible with RFC 5805 and of course complies with the ACID paradigm, but it removes the redundancies of the standard protocol and avoids the latencies associated with them.
- NUMA aware database startup: It turned out that the c3.8xlarge machines can be squeezed to their full potential when several database shards run on the same machine with CPU affinity (see the sketch after this list). The optimum was reached with four shards, the first running on CPUs 0 ... 7, the next on 8 ... 15, and so on. This behavior is very strange. A quarter-of-something scheme cannot be explained by the notoriously weak inter-socket connect (Intel QPI), which must be used whenever data is communicated between CPUs in a two-socket system. Oracle's SPARC and IBM's POWER servers should be better off in this regard, because their engineers dedicated a lot of attention to the bandwidth and latency of the inter-processor links. With Intel processors we would expect a half-of-something problem at this point, unless Amazon maps these virtual machines to four-socket hardware. It is more reminiscent of the four-channel memory architecture of Xeon processors, but we never experienced anything like this with single-socket Xeon machines featuring the same four memory channels. Anyway, started with CPU affinity, the database engines scaled as expected.
- Abandon Amazon network attached storage (EBS): We conducted the cluster scaling tests with the In-Memory version of DVTDS and ran a single-machine test with four database shards on an i2.8xlarge with the On-Disk version. The latter machine type features more memory and eight internal SSDs at full speed; the rest of the hardware is identical to the c3.8xlarge instances. We also performed the same single-machine test with the In-Memory version of DVTDS. By relating the In-Memory results to the On-Disk results we could extrapolate from the single-machine test, because the inter-node communication across machines scales linearly and is completely independent of DVTDS's disk storage subsystem.
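Here is a minimal sketch of the affinity scheme described in the NUMA item above: four shard processes, each pinned to one block of eight virtual cores using the standard Linux taskset tool. The dvtds-server binary name and its --port option are placeholders, not the real DVTDS command line.

```python
# Launch four database shards, each pinned to a block of eight virtual cores
# (0-7, 8-15, 16-23, 24-31 on a 32-vCPU c3.8xlarge). "dvtds-server" and its
# --port option are placeholders for the actual shard launch command.
import subprocess

SHARDS = 4
CORES_PER_SHARD = 8

processes = []
for shard in range(SHARDS):
    first = shard * CORES_PER_SHARD
    last = first + CORES_PER_SHARD - 1
    cmd = [
        "taskset", "-c", f"{first}-{last}",   # pin to one 8-core block
        "./dvtds-server", "--port", str(3890 + shard),
    ]
    processes.append(subprocess.Popen(cmd))

for p in processes:
    p.wait()
```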
Now the bottom line:
Operated in memory, DVTDS needed half the number of virtual machines to outperform FoundationDB. On disk, DVTDS reached 640,000 transactional updates per second on a single machine hosting eight times the number of subscribers compared to FoundationDB. Scaled to a setup with 32 machines it would score 19 million transactional updates per second. All in all, a 33% advantage in throughput and an eightfold advantage in data capacity. Please read the detailed results on our benchmark page.