On snake oil, physics and databases

A while ago, I let out kind of a rant over at Netways' blog (in German).

As the topic keeps popping up, I'll put a somewhat condensed English version of it here.

In Neal Stephenson's wonderful novel Cryptonomicon, there's a little side story on a bandsaw. My favorite sentence is this one:

"Anecdotes about accidents involving the bandsaw were told in hushed voices 
and not usually commingled with other industrial-accident anecdotes."

At any PostgreSQL conference, there will be some kind of "war story" round over a couple of drinks. Entity–Attribute–Value (EAV) schemas, untested backups, dynamic SQL using unmasked user input etc. Most of these problems can simply be blamed on the users/designers/architects. Of course, we tend to #blamemagnus for some others... ;-)

However, many of the catastrophic experiences with MySQL or MariaDB can not simply be blamed to users. As they did everything "by the book" and still ended up with garbage in their databases.

But why? Maybe these things happen due to a general lack of thinking (on your own)? Sometimes even e general understanding of physics and natural laws would help...

We've seen questions to the PostgreSQL Performance Mailinglist about e.g. a query that Oracle (allegedly) finishes in 0,009 ms while PostgreSQL needs more than 2 minutes. Which is something that could potentially happen!
But then it turned out that the table in question was read end to end ("SELECT * FROM table1;") and had 9.649.110 rows, summing up to ~ 3,5GB. Even DDR5 DRAM peaks at 5,2 GT/s. In 0,000009 seconds, light travels "barely" 2,7 kilometers! You couldn't get that transfer rate by dropping a hard disk... ;-)

Which leads us to Multi Master (MM) or "Multi-Queen" (MQ), "Multi-Primary" (MP), "Write-Anywhere" or ... replication.
As the bee scheme ("queen(s)" and "princess(es)") is generally accepted and non-discriminating, I'll go with "MQ" here (huhu, Lætitia!).

A question often seen on mailing lists, the Telegram or Slack channels etc., usually by people coming over from MySQL, MariaD, CockroachDB, Oracle,… is for a "cookbook recipe" to set up a MQ cluster with PostgreSQL.

And the conversation usually goes like:
Q: “I'm coming from <DBMS xyz> and run a MQ cluster with it. How do I set this up with PostgreSQL?”
A: “You can't without extra products like BDR. But you probably don't want to anyway.”
Q: “Oh yes I do! I need the high availabilty! And the write balancing (one node doesn't deal with my write load)!”
A: “MM costs performance and doesn't really improve your availability. Just use streaming replication and a proper design...”

That sounds rude at first glance, but in essence it means "don't use a bandsaw to cut cardboard".

Why is it hard to run a "write anywhere" cluster?

Let's elaborate on what it means for a DBMS (similar issues arise for any distributed system like clustered filesystems etc.):

sequences (for primary keys or “auto-increment/identity” columns) have to be synchronised to prevent double use of values that are to be unique
writing statements (esp. UPDATE and DELETE) could destroy the referential integrity (tuple DELETEd on on node is referred to by an UPDATE on another node); so there has to be a way to LOCK tuples (or even tables) across all nodes in a cluster
conflicting UPDATES on the same tuple need to be resolved. If that resolution is not to be left to basically chance (the replication lag), there has to be some kind of policy in place.
every cluster node needs to have the same data at some point, so the write load per node is just the same. Even in a shared storage solution!

So, there's a lot of coordination that is necessary here, and the required effort scales exponentially with the number of nodes. And each and every coordinative activity costs at least 1-2 network roundtrips. So, the farther the nodes are away from each other, physics alone will have a huge impact on your performance!

2ndQuadrant, current maintainer of basically all FOSS MQ implementations for PostgreSQL (Postgres-XL, Postgres-XC and BDR, ignoring Bucardo here), essentially states that the limitations and additional measures needed to get MM working are so intrusive that MQ will probably never make it to the PostgreSQL core. The implementations also are explicitly not designed to provide High Availability or High Throughput, but to achieve eventual consistency on geographically distributed systems.

So, in essence, the effort to get a proper MQ system is remarkable, which everyone thinking about the requirements (and a general understanding of DBMS and networking) should be able to understand.

MQ-Clustering is a bit like driving the Autobahn. At 250 km/h. At night. With heavy rain. As the front-seat passenger...

You sure hope the person driving

isn't night-blind (i.e.: “has RTFM”)
knows the car and its - or better his - limitations (that "seine" pun only works in German...)
has had at least one, better a couple of safe driving training(s)

And yet you're subject to powers you have no influence on. And sure enough, most times, it will all go well.

But then again, you may end up in the hands of someone who set up the MQ cluster using some "10 easy steps" tutorial from the net...

Considering that certain DBMS' have a reputation of not rating data integrity too high even without bidirektional replication, this gets us back to the bandsaw theme and the spectacular accidents happening from time to time... ;-)

But the availability!

When I read claims like this at galeracluster.com: “Transparent to Applications: Required no or minimal changes to the application” (Oracle talks similarly about RAC), I always ask myself "how do they manage to get around the connection reset if a host goes down?". Turns out they don't, you get a different client library (for RAC at least, I reckon Galera etc. do it similarly?).

Now, seriously, if you're unable to write code that is able to handle a "server down" event, including repeating a failed transaction, in a typical "one queen that is backed by 1+N princess(es) and has some kind of HA solution in place" setup, you'll probably not be able to run a RAC/Galera/youNameIt cluster without hitting the ground hard (the later, the harder, btw.)!

But... performance!

Yeah, right. So, scaling read-only queries out to read-only princesses is a common technique. Sure requires a bit of thinking, but anyway...

I remember Oracle claiming essentially "every node you add to a RAC cluster adds 90% of the performance of the node added before" (might have been more or less %, but definitely below 100%). That's not exactly linear:

# WITH perf AS (
SELECT nodes, 0.9^(nodes-1) AS addednodes_performance
FROM generate_series(1,8) AS nodes
)
SELECT *
,avg(addednodes_performance) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS average_performance
,nodes * avg(addednodes_performance) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS overall_performance
FROM perf;
 nodes | addednodes_performance |  average_performance   |  overall_performance
-------+------------------------+------------------------+------------------------
     1 |     1.0000000000000000 | 1.00000000000000000000 | 1.00000000000000000000
     2 |     0.9000000000000000 | 0.95000000000000000000 | 1.90000000000000000000
     3 |     0.8100000000000000 | 0.90333333333333333333 | 2.70999999999999999999
     4 |     0.7290000000000000 | 0.85975000000000000000 | 3.43900000000000000000
     5 |     0.6561000000000000 | 0.81902000000000000000 | 4.09510000000000000000
     6 |     0.5904900000000000 | 0.78093166666666666667 | 4.68559000000000000002
     7 |     0.5314410000000000 | 0.74529014285714285714 | 5.21703099999999999998
     8 |     0.4782969000000000 | 0.71191598750000000000 | 5.69532790000000000000

Mind you, the price per node is 100% of the former one. Maybe putting a little bit of effort into read scaling your software pays off pretty quickly?

Galera in turn claims to essentially offer (my wording) synchronous, bidirectional replication with basically no performance loss. I wonder how they defeat the natural laws (e.g. speed of light)?!?

Summary

Multi-node computing is never easy.
Database scalability is a problem that you can only solve by throwing money at it to a certain degree.
After, thinking is necessary. Realising the general challenges is absolutely necessary.
Some people will not get the latter done. Those are helped by companies offering ~~snake oil~~ turnkey products and solutions.
Very few use cases (and thus users) can make serious, good use of MQ solutions.

In other words:

The likelyhood that you need MQ is ridiculously low!

The likelyhood that a MQ solution will bring you much more trouble than benefit is high.

This is true not only for PostgreSQL.

On snake oil, physics and databases

Why is it hard to run a "write anywhere" cluster?

But the availability!

But... performance!

Summary

Neuen Kommentar hinzufügen