Digging Down into Tech – Lots of unnecessarily complicated tech

Featured

Todo

Exim for all my nodes. Gotta keep emails from getting on the internet except for a few specific recipient addresses.
Kubernetes cluster. Maybe, just maybe I can learn to not hate it with such a burning passion.
Give Percona XtraDB Cluster another whack. Use the pcmk cluster to run a three-node setup and then use sysbench to stress it. Would be real nice to have Btrfs snapshots of the data dirs to test recoverability. If I could only find a way to recover from all three nodes disagreeing…
Test NixOS on a VM.

Done

Master-slave MySQL cluster on cluster1, cluster2, cluster3 to keep database load off Ceph. Pacemaker+Corosync automatically promotes and demotes nodes on failures, fencing if necessary. [ Still testing stability of setup. ]
WHM/cPanel testbed available via internet using separate VLAN.
Routing, traffic control and firewall lab. I wish VirtualBox VMs could have more network interfaces… Gonna run it on my Windows 10 workstation. [ Set up. Done with static and dynamic routing. ]

Scrapped

Setup Clonezilla server. [ Easier to just run Clonezilla from USB when needed. Didn’t particularly like the fact that Clonezilla has to be a DHCP server. ]

Featured

I like high availability, probably more than I should. Banking, phone systems and the electrical grid do a good job with this but we have a lot of complex stuff at home nowadays which makes it trickier to keep them up and running reliably. It was more than ten years ago that I first tried to build a server setup that could keep things up and running even when individual servers failed or had to be brought down for maintenance. I didn’t have enough hardware with virtualization support so I had to use Xen(32-bit edition) with its paravirtualization support.

I used DRBD for shared storage and OCFS2 I think on top of that. It worked not so well. For various reasons I ended up having a single server with Solaris 11 and zRaid-5 storage so at least I had data redundancy even if the system as a whole wasn’t replicated. I later ended up with a master-slave setup with two identical Core i5-based “servers” where the master node replicated data over to the slave using Btrfs snapshots. The filesystems used Btrfs RAID1 on both nodes so there was a LOT of redundancy. There’s a story with zRaid-5, a weird hard drive failure and a couple of days of worrying behind this 2-server – each with internal 2-way replication – setup.

One of the servers in the master-slave setup gave up a few months ago so it was time to replace it with something new and now I think it’s pretty much complete.

Hardware

Node name: pve1
Microtower Atom C3558
16GB RAM
4 3.5″ hot swap spaces + 2 internal SATA connections
1 250 GB NVMe
1 250 GB SATA
1 500 GB SATA
1 2 TB 5400 RPM SATA
1 4 TB 5400 RPM SATA
4 Gbit LAN
IPMI

Node name: pve2
Microtower Xeon-D 1541
32 GB RAM
4 3.5″ hot swap spaces + 2 internal SATA connections
1 250 GB NVMe
2 250 GB SATA
1 500 GB SATA
1 2 TB 5400 RPM SATA
1 4 TB 5400 RPM SATA
2 Gbit LAN, 2 10GbaseT LAN
IPMI

Node name: pve3
Microtower Atom C3558
32GB RAM
4 3.5″ hot swap spaces + 2 internal SATA connections
1 250 GB NVMe
1 250 GB SATA
1 500 GB SATA
1 2 TB 5400 RPM SATA
1 4 TB 5400 RPM SATA
4 Gbit LAN
IPMI

Node name: nearline
Self-assembled Core i5-system
16GB RAM
Tandberg LTO3 tape drive
Hodgepodge of hard drives

Networking equipment:
Cisco RV082 Router
Netgear LB2120( for backup internet connection )
HPE 1920S switch

The tiny little screen to the top right is sort of a crude monitoring system. I don’t much feel like running a big LCD screen just to show the load on my servers. So this tiny screen tells me if any important hosts are down and what the load is on each Proxmox machine.

Structure

The nodes pve1-pve3 form a Proxmox 5.4 cluster with Ceph Luminous as shared storage. The journals for all Ceph OSDs is stored on NVMe partitions which took a while to set up since Proxmox’ own Ceph tools don’t want to do that. They say they do, but they don’t.

Some storage is kept out of Ceph because of reliability reasons. Basically I think of Ceph as a single point of (unlikely) failure. So virtual machines I want running even when I try to figure out why Ceph refuses to work are stored on LVM-thin volumes.

The nearline machine(not shown below since it is mostly turned off and so it makes little sense to monitor) is also a manifestation of my distrust in Ceph. I used to rsync data over to Btrfs-volumes once a day that I then snapshot. But the drives I put into the machine were junk so that had to stop. Now I got Bacula up and running again so therefor store backups on my cherished LTO-tapes. *hugs*

The Proxmox nodes use bonded network ports to connect to the HP Enterprise switch that serves mainly to connect the cluster together but it’s also the core switch of the network. The HP switch connects to my good old Cisco RV082 router which in turn connects to the fiber-modem that gives a nice 100 Mbit connection out. Now it also has 4G modem connected to WAN2 as a fallback.

The nodes with green links to the cloud symbol are stored on Ceph so can be migrated from physical hosts while running. Some nodes are not shown in the graphic above. Mostly they’re testbeds like my CloudLinux install with WHM and cPanel, a copy of my pacemaker-cluster and so on.

Software

Proxmox

A Debian-based virtualization platform with cluster and HA-support? Yes please… Has a great GUI and integration with Ceph. Kind of a pain to install new SSL certificates but it can be done. Basically Proxmox is an alternative to VMware vSphere and whatever Xen offers nowadays. Wish it had a way of configuring fencing for cluster nodes integrated with its own HA functionality. Has pretty good built-in performance monitoring as well.

Ceph

So I took the plunge to start using Ceph. It wasn’t entirely easy since I already had my cluster set up to use GlusterFS. It’s great to be able to move virtual machines from node to node using live migration but you can’t do that between separate shared storage systems now can you? I can handle some downtime but thought of it as an interesting experiment. Since I had DNS servers and MySQL servers set up in a cluster of virtual machines those virtual machines could be shut down one at a time, recreated on a new cluster with Ceph as a backend and then the process could be repeated one physical node at a time.

I didn’t need to create a new cluster but I figured I might as well go the entire way and upgrade to Proxmox 5.4. All in all there was like 5-10 minutes of down time inherent in the move and 1-2 hours of downtime because I’m a klutz who configures two servers to use the same IP address and then wonder why things don’t work so well.

Many decisions made about this setup reflect my lack of trust in Ceph but by now I’ve actually come to trust it quite well. It performed remarkably well when I screwed up the IP-addresses and other things. Haven’t encountered a split brain situation yet, which is more than I can say for GlusterFS(note: when increasing the node count in a GlusterFS setup, you have to change quorum levels manually…).

Ceph also has its own monitoring system which is nice.

Audible.com

At approximately 18:40 I thought I might acquire some kind of audiobook… thing. Maybe Audible.com? They have some subscription thing, right? At 19:35 I gave up trying to access Pandora’s Lab, part of the content included in my trial month, to have a nap.

Re-envigorated by my nap I was able to listen to the book I acquired using my monthly credit. Success! Does that mean I can access that Pandora’s Lab thing as well? Does it bollocks. I can buy it when I visit audible.com using Google Chrome – the same browser which let me listen to the aforementioned book, so I’m logged in it would seem – but listen to it as included content? Nope.

The Windows 10 App does show it as included, which seems like a step in the right direction:

This doesn’t work quite the entire way:

After reading up on this it turns out that their Windows 10 App doesn’t yet have support for Audible Plus content. Wow! So that’s where the Windows platform is at right now, getting the cold shoulder from content providers?

Well, okey then. I guess they only really support mobile apps? Nope, not Android at least. It also just let’s me buy that book.

So I’m coming away from this somewhat unclear on the benefits of Audible Plus. I’m not entirely confident in buying books from them either given this evening’s little expedition. Maybe I need to buy an iPhone, a Kindle or Alexa to have solid access to their content? I’ll admit that a Kindle would probably be okey but I wouldn’t be able to show my face at work carrying an iPhone. Techies don’t use iPhones! I’m not even sure I would be able to look myself in the mirror if I used an iPhone. Yet it would be better than an Alexa which in turn is only slightly better than being kidnapped by an Eastern European organ-harvesting syndicate.

What is it with providers of audio content? Because this is largely my experience with iTunes as well. I’ve set up a multi-master MariaDB cluster in the time it takes me to jump through their hoops of updates of the iTunes executable, my account details, managing which units can access me content et cetera. I haven’t played any of the content on my iTunes account for at least two years at this point.

Netflix? My mom watches lots of their content on her television and she’s not exactly a grade A hacker. Sure I had to set it up for her but it’s largely painless from that point. So for some reason video content is easy to access from many devices while audio content is provided by organisations that compete with one another in who can most closely resemble a Soviet era bureaucracy. “Here is your car. Please wait 8-10 years for your petrol purchasing-license to clear.”

*§#% Ceph

So I updated my Proxmox servers a few weeks ago. One worked fine but for two of them Ceph couldn’t start OSD:s for two drives. The processes segfaulted as it read through each disk and started over. I had to blank the drives and resync them. Well that’s not kosher! I was going to update Proxmox anyway so maybe it will be better with Ceph Nautilus which comes with Proxmox 6.2. Does it bollocks!

I upgrade one server at a time. One server is blanked and installed with Proxmox 6.2 and a new Ceph cluster is created with replica size 1. I then rsync data over from the old cluster. I had run this setup for 48 hours when a sudden reboot of the server left Ceph unable to read from two hard drives containing the CephFS. This time there was at least an error logged, not just a segfault:

2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119) initiated 2020-09-08 21:06:50.724397 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)

Well, you get the idea. So that’s #¤§* it! It’s not I can’t work around these issues and blank drives and bring them back into the cluster, but I have zero tolerance for buggy file systems and database engines. I run the old Luminous release for which there has been ample time to iron out kinks and OSD:s can’t start. I switch over to Nautilus: same but different.

I’ll run a toy Ceph-cluster in a couple of VM:s but for actual workloads? How about no? This means standard logical volumes for VM:s and btrfs for my 2TB of files. I’ll go back to my old btrfs script with the “send snapshot” functionality activated. I’ve run it without that send-snapshot thing for a while as a DB backup solution(a plan that turned out less than ideal which I should have predicted for such high IOPS applications).

Memo to self: if running Ceph in production, set up a staging cluster that can be upgraded and beaten about before rolling out the same release to the real cluster.

HA RDBMS

With pacemaker and corosync not really handling a MySQL master/slave the way I want it to, I’m working on my own high availability manager, this time for MariaDB since that’s the direction in which the world is going.

Challenge 1

Testing MariaDB master/slave characteristics manually is kind of important before writing code to do perform the same operations automatically. So I thought I’d dust off my old script for testing Percona XtraDB Cluster. But it didn’t work. Hmm, maybe it’s because I’m running the MySQL connector against a MariaDB instance? Well let’s change that. Now it crashes at a specific cursor.execute? No error message? Not even strace indicates an error.

Turns out the handling of database cursors in threads has changed somewhere along the way. Apparently passing a database connection to a thread-object and then starting the thread messes things up and execution just dies when you try to execute a query. Those were three wonderful hours of debugging.

Challenge 2

A cluster manager needs to consist of software running on independent nodes capable of communicating with each other. I was first thinking about Python sockets but Erlang has a wonderful setup for message-passing. No, that’s to awkward. I don’t have time to get back into Erlang properly. Shell script? No, proper structuring is what we want so we’re back to Python.

But sockets? Really? No, let’s try some streaming layer like Kafka. No! ZooKeeper! That’s a better fit since it serves as a consistent data space for multiple nodes. It turns out ZooKeeper is written in Java and to interact with it properly you need to write Java code. I love Java but for scripting starting and stopping databases? What alternatives are there? Ah, etcd…

Turns out the Ubuntu package comes with a default configuration file for a single node cluster. Nice! But I’ll change the settings to be for a proper cluster since that’s the actual use case for me. Hang on, why does it not connect? No matter what changes I make the two test nodes run as separate clusters. Fast forward a couple of hours and it turns out that when etcd starts with a configuration that in some way constitutes a cluster, the configuration file settings for clustering is largely ignored.

By deleting /var/lib/etcd/default you clear the node and make it actually peer using the configuration file settings.

MySQL safeguards

As we saw in part 1 of this series, MySQL is pretty robust in the face of hard resets as long as you use InnoDB as the underlying engine rather than MyISAM. Heck, InnoDB even held together with the cache-setting “Writeback (unsafe)” in Proxmox chosen for the underlying disk. But an ib_logfile getting corrupted or going missing somehow leaves us in a pickle. Some percentage of tables are likely to be impossible to export if that happens. So we better set things up so that we have something to fall back on if that happens.

Let’s start with my favorite, Btrfs snapshots. Some background:Ext4 is pretty much the default file system for Linux today and it’s pretty darned good. I remember back in the day when we only had Ext2 and it was a big deal when Ext3 introduced journals that made recovery from crashes much faster. Now we’re at Ext4. Btrfs on the other hand is a completely new file system heavily inspired by ZFS. It’s Copy-on-Write amongst other things which is good for snapshots. So Ext4 is tried and tested and Btrfs feature-rich.

Let’s switch from Ext4 to Btrfs for a MySQL-deployment. Note that I’m completely wiping the data stored in MySQL here! This is a test-rig.

I’ll make cron run a script that makes snapshots of the MySQL data directory every five minutes. It could be done in a simpler way but my script cleans up old snapshots automatically based on an assigned lifespan which is handy. This is what we end up with after slightly less than an hour of running the standard workload-script and also taking snapshots every five minutes:

root@mysqlrecov1:/var/lib/mysql# ls -lh Snapshots/data 
total 0 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2340_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2345_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2350_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2355_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0000_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0005_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0010_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0015_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0020_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0025_12h

So let’s say something bad happened after the last snapshot was taken and we want to get data out of it. Can we just start a MySQL-instance with the snapshot directory as the datadir? No:

I make all my snapshots read-only because they’re meant to be backups. The more immutable they are the better. MySQL can’t run against a read only copy. But all we need to do is to make a writable copy of the read-only snapshot we’re interested in:

I guess I could have run mysqlcheck to demonstrate that things were okey but they were just fine. We could have taken whichever of these snapshots we wanted and gone back to that point in time. As much as I like this setup I do recognize some weaknesses.

First, backups are necessarily part of the same filesystem as the original. So file system corruption can leave us completely boned. There is a mechanism for sending snapshots from one Btrfs file system to another – even on another server – which can remedy that situation to some extent. But Btrfs snapshots are not great to write to tape however you twist and turn things.

Second, we’re not actually saving snapshots that MySQL gives guarantees for. MySQL has to do a sort of crash recovery process when starting up these snapshots because they look just like what you would get if the server had crashed unexpectedly and left lots of databases open.

One way to remedy this is to use a backuptool like xtrabackup. There’s a version for MariaDB called mariabackup that’s basically the same. I’ll use xtrabackup in this case. We’ll make three snapshots, consisting of one full backup and two incremental backups. Then we’ll do a restore by applying each incremental backup to the full backup and then moving the result into the place MySQL expects data to be.

This has the benefit of saving cleaner snapshots and you get them in distinct files that are easier to save to tape or backup to some external storage provider.

Now let’s try something a bit more robust, replication! I have that set up for my own databases but with an older method of keeping the various copies in sync. I’d like to try out GTID. I wonder if it’s enabled by default on this version of MySQL?

mysql> SHOW GLOBAL VARIABLES LIKE '%gtid%'; 
+----------------------------------+-------+ 
| Variable_name                    | Value | 
+----------------------------------+-------+ 
| binlog_gtid_simple_recovery      | ON    | 
| enforce_gtid_consistency         | OFF   | 
| gtid_executed                    |       | 
| gtid_executed_compression_period | 1000  | 
| gtid_mode                        | OFF   | 
| gtid_owned                       |       | 
| gtid_purged                      |       | 
| session_track_gtids              | OFF   | 
+----------------------------------+-------+ 
8 rows in set (0.00 sec)

Nope. Okey, I wonder if I can add it after the fact? Well activating the feature after the fact can obviously be done, but will it work? Will slaves only see the modifications that happened after GTID was activated? We can test it. Let’s enable GTID and create a replication user.

Now for a tricky move. It’s not strictly necessary since we know writes aren’t going on but for the sake of doing things by the book, we flush all table and lock them in one terminal window and then dump all databases in another. As long as the “FLUSH TABLES WITH READ LOCK;”-session is open in MySQL the lock is held and the dump is consistent.

Then we copy the dump to mysqlrecov2 where we import it into a blank setup. Then we configure that server for replication with it’s own server-id. Then we set up the replication and run a test to see if it works. We should be able to run our testscript and find that the maximum id in the Purchase-table(for instance) is the same for mysqlrecov1 and mysqlrecov2.

That seems to work fine. Let’s level this up one more notch. I’ll schedule snapshots to be made of the /var/lib/mysql/data directory every five minutes just like on mysqlrecov1 before. This will be a variant of what we did before. So the testscript is running against mysqlrecov1 for a while and mysqlrecov2 is replicating all the changes in real time. Snapshots are being stored in /var/lib/mysql/Snapshots/data/ on mysqlrecov2. There are actually good reasons for not storing snapshots on mysqlrecov1 which we can get to later. For now, let’s just note that the relay-log from mysqlrecov1 is ticking up nicely on mysqlrecov2:

Now let’s simulate a bonehead mistake. Something like deleting all customers in the database with an id below 5000? That should leave people quite sad. As we saw before we can restore from a Btrfs snapshot but we can lose up to 5 minutes of transactions. In fact, if we don’t recognize our bonehead mistake immediately and take MySQL down on mysqlrecov1 we could rack up even more transactions that will be lost when we go back to our earlier state.

Replication worked just as advertised. Within milliseconds of me deleting a bunch of customers at random on mysqlrecov1, mysqlrecov2 followed suit. As SHOW SLAVE STATUS illustrated at the end, I kept the testscript running for a while longer so more and more gtid_sets were stored on mysqlrecov2 full of legitimate transactions.

The test script started printing a lot more error messages after some random customers got deleted but kept running anyway. So what are we going to do now? Well restoring the last Btrfs snapshot from before the deletion is a good start. We’ll turn off mysqlrecov1 and set mysqlrecov2 so that it can only be reached from the server itself, to keep people from tinkering with the data while we’re working on it.

Seems I chose the right snapshot. We have customer with ids below 5000. Now let’s check in the relay-log from the data_borked directory. We don’t want MySQL to use that directory any more because the data in ib_logdata-files and the sysbench-directory reflects the state where thousands of customers are missing. But the relay-log file is still very valuable. If we run mysqlbinlog –verbose /var/lib/mysql/data_borked/mysqlrecov1-relay-log.00005 we see all the changes made to the database since replication began. This data gets rotated out eventually but here neither time nor space-constraints led any of the changes to be lost before the “mistake” was discovered. We only really need the changes made from the point the latest good snapshot is from to make this work though.

So how does the binlog look?

SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:1586'/*!*/; 
# at 472 
#200328 22:41:23 server id 1  end_log_pos 337 CRC32 0x633797ec  Query   thread_id=3     exec_time=0     error_code=0 
SET TIMESTAMP=1585435283/*!*/; 
SET @@session.pseudo_thread_id=3/*!*/; 
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/; 
SET @@session.sql_mode=1436549152/*!*/; 
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/; 
/*!\C utf8 *//*!*/; 
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/; 
SET @@session.lc_time_names=0/*!*/; 
SET @@session.collation_database=DEFAULT/*!*/; 
BEGIN 
/*!*/; 
# at 550 
#200328 22:41:23 server id 1  end_log_pos 399 CRC32 0x2b1e9803  Table_map: `sysbench`.`Customer` mapped to number 108 
# at 612 
#200328 22:41:23 server id 1  end_log_pos 480 CRC32 0xc9c162b2  Write_rows: table id 108 flags: STMT_END_F 
 
BINLOG ' 
k9J/XhMBAAAAPgAAAI8BAAAAAGwAAAAAAAMACHN5c2JlbmNoAAhDdXN0b21lcgAEAw8DDwT/AP8A 
CAOYHis= 
k9J/Xh4BAAAAUQAAAOABAAAAAGwAAAAAAAEAAgAE//CF4AMAFGNiZnVkbmJmbWxtd2h1YWptbHht 
AAAAAA94c3dqZnh1ZmhwZ3NpbWGyYsHJ 
'/*!*/; 
### INSERT INTO `sysbench`.`Customer`
 ### SET 
###   @1=254085 
###   @2='cbfudnbfmlmwhuajmlxm' 
###   @3=0 
###   @4='xswjfxufhpgsima' 
# at 693 
#200328 22:41:23 server id 1  end_log_pos 511 CRC32 0x0a071603  Xid = 46 
COMMIT/*!*/; 
# at 724 
#200328 22:41:24 server id 1  end_log_pos 576 CRC32 0x05680f0e  GTID    last_committed=1        sequence_number=2       rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:1587'/*!*/; 
# at 789 
#200328 22:41:24 server id 1  end_log_pos 654 CRC32 0x79a206c9  Query   thread_id=12    exec_time=0     error_code=0 
SET TIMESTAMP=1585435284/*!*/; 
BEGIN 
/*!*/; 
# at 867 
#200328 22:41:24 server id 1  end_log_pos 716 CRC32 0xa2bb0bb0  Table_map: `sysbench`.`Customer` mapped to number 108 
# at 929 
#200328 22:41:24 server id 1  end_log_pos 797 CRC32 0xf9f2cb8b  Write_rows: table id 108 flags: STMT_END_F
<...>

Okey, slightly obscure. Can we find our DELETE FROM Customer WHERE id < 5000? No. We only find thousands of these:

SET @@SESSION.GTID_NEXT= ‘b6efad3f-6c71-11ea-9af3-eaa32fa23308:26067’/*!*/;
# at 14852899
#200328 23:02:33 server id 1 end_log_pos 14852762 CRC32 0xb87efd39     Query   thread_id=4     exec_time=0     error_code=0
SET TIMESTAMP=1585436553/*!*/;
BEGIN
/*!*/;
# at 14852975
#200328 23:02:33 server id 1 end_log_pos 14852825 CRC32 0x6a5f0dd7     Table_map: `sysbench`.`Purchase_list` mapped to number 110
# at 14853038
#200328 23:02:33 server id 1 end_log_pos 14852894 CRC32 0x53b1b9b6     Delete_rows: table id 110 flags: STMT_END_F

BINLOG ‘
idd/XhMBAAAAPwAAANmi4gAAAG4AAAAAAAEACHN5c2JlbmNoAA1QdXJjaGFzZV9saXN0AAQDAwMD
AADXDV9q
idd/XiABAAAARQAAAB6j4gAAAG4AAAAAAAEAAgAE//Dw5AEAlMQAAKJQAAAEAAAA8PLkAQCUxAAA
n3MAAAYAAAC2ubFT
‘/*!*/;
### DELETE FROM `sysbench`.`Purchase_list`
### WHERE
###   @1=124144
###   @2=50324
###   @3=20642
###   @4=4
### DELETE FROM `sysbench`.`Purchase_list`

So what is the latest transaction before the first of these thousands of DELETE-operations? Well, I know approximately what time we’re interested in. Then we can just look for DELETE FROM `sysbench`.`Customer`-statements.

So this probably contains the data we’re interested in:

###   @2='trirwseikshnmaeuqgzr' 
###   @3=3522 
###   @4='peahbpigywdapfo' 
# at 14574860 
#200328 23:01:06 server id 1  end_log_pos 14574678 CRC32 0xab48065f     Xid = 758546 
COMMIT/*!*/;
# at 14574891 
#200328 23:01:07 server id 1  end_log_pos 14574743 CRC32 0x18eab10c     GTID    last_committed=24333    sequence_number=24334   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:25919'/*!*/; 
# at 14574956 
#200328 23:01:07 server id 1  end_log_pos 14574821 CRC32 0xe10dc87f     Query   thread_id=16    exec_time=0     error_code=0 
SET TIMESTAMP=1585436467/*!*/; 
SET @@session.foreign_key_checks=0/*!*/; 
BEGIN 
/*!*/; 
# at 14575034 
#200328 23:01:07 server id 1  end_log_pos 14574883 CRC32 0x6e71054b     Table_map: `sysbench`.`Customer` mapped to number 108 
# at 14575096 
#200328 23:01:07 server id 1  end_log_pos 14583060 CRC32 0x65200e62     Delete_rows: table id 108

All right, what is the last state found in the last good snapshot? SHOW SLAVE STATUS says that Read_Master_Log_Pos is 14507560. That’s a reasonable number given that we just found the marker “end_log_pos 14574678” just above where all the bad deletes started. It’s important to know if 14507560 is an end_log_pos or a start position. Searching through the binlog-output reveals that it is an end_pos:

###   @3=1682 
###   @4='vlntfrggrwhykju' 
# at 14507742 
#200328 23:00:00 server id 1  end_log_pos 14507560 CRC32 0x333e7f0b     Xid = 753352 
COMMIT/*!*/; 
# at 14507773

So we will want to replay from the start position 14507773 that comes directly after the number we found in SHOW SLAVE STATUS. And the final position to include is 14574678.

Okey, so we were able to apply parts of the binlog without running into “DUPLICATE KEY” errors or losing lots of customer data. Shall we try to apply the transactions after the erroneous deletes? Should be as simple as replaying the binlog from position 14766269 to the end.

root@mysqlrecov2:/var/lib/mysql/data_borked# mysqlbinlog --skip-gtids --start-position=14766269 mysqlrecov2-relay-bin.000005 > after_deletes.sql                                   
root@mysqlrecov2:/var/lib/mysql/data_borked# mysql < after_deletes.sql  
root@mysqlrecov2:/var/lib/mysql/data_borked#

Seems okey. According the output from the testscript one of the last transactions was a purchase registered to customer 8786 with a cost of 351.

root@mysqlrecov2:/var/lib/mysql/data_borked# mysql 
Welcome to the MySQL monitor.  Commands end with ; or \g. 
Your MySQL connection id is 9 
Server version: 5.7.29-0ubuntu0.18.04.1-log (Ubuntu) 
 
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved. 
 
Oracle is a registered trademark of Oracle Corporation and/or its 
affiliates. Other names may be trademarks of their respective 
owners. 
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 
 
mysql> USE sysbench;        
Reading table information for completion of table and column names 
You can turn off this feature to get a quicker startup with -A 
 
Database changed 
mysql> SELECT * FROM Purchase WHERE customer_id = 8786; 
+-------+-------------+------+--------+ 
| id    | customer_id | cost | weight | 
+-------+-------------+------+--------+ 
| 58333 |        8786 |  351 |    745 | 
+-------+-------------+------+--------+ 
1 row in set (0.00 sec) 
 
mysql>

If this had been a live system I would have been okey with bringing this online again. Since this was a small database a mysqldump and restore back on the master would work. For a bigger system maybe promoting the fixed slave is better and making the previous master into the new slave. Anyway, the testscript agrees that the data is correct:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7f05b26329d0> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$

Oh right, I promised to get back to why we would want to avoid making Btrfs snapshots on the master server. Btrfs and ZFS are not considered ideal for database storage. See for instance the wonderfully titled blog post https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

It’s a few years old by now but I don’t doubt that Btrfs is still terrible for your database storage if you want to maximize throughput. My real world databases sees maybe 10-20 queries per seconds. My backup software can peak at hundreds per second(I blame the Java ORM-layer, not the me the programmer) for short bursts. Btrfs doesn’t slow that down noticeably. If someone asks me to setup a system that sees sustained loads of 5000 transactions per second I’ll ditch Btrfs in an instant on the master and only use it on the slaves. Because slaves typically don’t do much work. They just stream the results that result from all the work done on the master.

Anyway, I think those are the building blocks for safeguarding MySQL databases. There are some additional tweaks like having one slave intentionally trail behind the master a certain number of minutes to make recovery from bonehead mistakes easier. There are also more synchronous solutions like Galera(I’ve used Percona’s XtraDB variant) to minimize the gap during which a write to a master might be lost because it hasn’t reached any slaves yet.

MySQL replication notes

NOTE! These are mostly for my reference working on this post.

root@mysqlrecov2:~# mysqlbinlog /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less

Well that didn’t produce very readable information. This turned out to be better for viewing:

root@mysqlrecov2:~# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less

#200322 21:14:57 server id 1  end_log_pos 894 CRC32 0x8df90820  Query   thread_id=9     exec_time=0     error_code=0 
SET TIMESTAMP=1584911697/*!*/; 
BEGIN 
/*!*/; 
# at 1107 
#200322 21:14:57 server id 1  end_log_pos 956 CRC32 0x18b0e814  Table_map: `sysbench`.`Customer` mapped to number 116 
# at 1169 
#200322 21:14:57 server id 1  end_log_pos 1037 CRC32 0xa46a830a         Write_rows: table id 116 flags: STMT_END_F  
BINLOG ' 
UdV3XhMBAAAAPgAAALwDAAAAAHQAAAAAAAMACHN5c2JlbmNoAAhDdXN0b21lcgAEAw8DDwT/AP8A 
CBTosBg= 
UdV3Xh4BAAAAUQAAAA0EAAAAAHQAAAAAAAEAAgAE//B64AMAFHZ5YmdveWd0bWhwdHd6eGd3bWZy 
AAAAAA9maWhmeW9sYndha21wYm0Kg2qk 
'/*!*/; 
### INSERT INTO `sysbench`.`Customer` 
### SET 
###   @1=254074 
###   @2='vybgoygtmhptwzxgwmfr' 
###   @3=0 
###   @4='fihfyolbwakmpbm' 
# at 1250

How to find current log position? Or some way of figuring which position in the bin log a specific file system snapshot matches. This?

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                     | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| mysql-bin.000001 |      786 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769 | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec)

But it says this at the end of mysqlrecov2-relay-bin.000002…

#200322 21:23:07 server id 1  end_log_pos 1739591 CRC32 0xa5d8a74e      Xid = 82080

So what’s all this about mysqlrecov2-relay-bin.000001? And position 786 is suspect. Well, I guess File and Position are separate from GTID. So I shouldn’t be looking at File and Position because I use GTID? But I can’t very well make out anything sequential from a GTID. They’re UUIDs.

Okey, okey. Now I get it. So we have a GTID for the data synced from the master, that’s _b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769_ and we can find matching information in mysqlrecov2-relay-bin.000002. Indeed the last segment in that file reads:

last_committed=2766     sequence_number=2769

So that checks out. The other GTID refers to the local machine and so we find that information in /var/log/mysql/mysql-bin.000001 . Notice that this is not a binlog we replicated in from the master. So nothing relay about this file. This GTID set only contains the part where I added the sysbench user so that I could run the script against the slave to make sure it was consistent. I’m not replicating the mysql-database since I don’t know if that’s safe.

All right, let’s break the database and try to recover using a combination of snapshots and binlog.

mysql> SET FOREIGN_KEY_CHECKS=0; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> DROP FROM Customer WHERE id < 5000;                 
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'FROM Customer WHERE id < 5000' at l
ine 1 
mysql> DELETE FROM Customer WHERE id < 5000;     
Query OK, 4285 rows affected (4.87 sec)

Let’s see what the sanity check says:

192.168.1.154 
Connection to 192.168.1.154: <mysql.connector.connection.MySQLConnection object at 0x7fa1487cf8d0> by X 
self.conn.in_transaction: False 
Checking correctness 
Sums of credit(753818484), costs(-20278036 and deposits(787452535) don't check out. Something is wrong. 
Credit exists for 1 but no such customer exists. 
Credit exists for 2 but no such customer exists. 
Credit exists for 4 but no such customer exists. 
Credit exists for 5 but no such customer exists. 
Credit exists for 6 but no such customer exists. 
Credit exists for 9 but no such customer exists. 
Credit exists for 10 but no such customer exists. 
Credit exists for 11 but no such customer exists. 
Credit exists for 14 but no such customer exists. 
Credit exists for 15 but no such customer exists.
 <...>

And so on for thousands of lines. Found the last bit before the big delete block:

### INSERT INTO `sysbench`.`Product` 
### SET 
###   @1=57331 
###   @2='awcioutbwumewtopjnhx' 
###   @3=54 
###   @4=9 
# at 15852327 
#200323 21:54:32 server id 1  end_log_pos 15852145 CRC32 0x4918e08a     Xid = 731899 
COMMIT/*!*/; 
# at 15852358 
#200323 21:54:27 server id 1  end_log_pos 15852210 CRC32 0x2b6858df     GTID    last_committed=32482    sequence_number=32483   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32483'/*!*/; 
# at 15852423 
#200323 21:54:27 server id 1  end_log_pos 15852288 CRC32 0xe0637123     Query   thread_id=93    exec_time=0     error_code=0 
SET TIMESTAMP=1585000467/*!*/; 
SET @@session.foreign_key_checks=0/*!*/; 
BEGIN 
/*!*/; 
# at 15852501

So the sequence number is 32483. So let’s try to start up the latest snapshot and see what the last sequence number was then.

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                      | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| mysql-bin.000003 |      194 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-19756 | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec)

All right, between 19756 and 32483 is the koscher sequence to replay:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=19756 --stop-position=32483 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql 
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 2054581866, event_type: 103 
ERROR: Could not read entry at offset 19756: Error in log format or read error.

Huh? Oh for the love of… I have to use log_pos instead of sequence number? All right, so 32483 has end_log_pos 15852288 and 19756 has an “# at 10105794:

# at 10105794 
#200323 21:50:01 server id 1  end_log_pos 10105646 CRC32 0x6ab793c0     GTID    last_committed=19751    sequence_number=19756   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:19756'/*!*/;

Let’s try this again:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10105794 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql

All right, let’s try it.

root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay.sql  
ERROR 1062 (23000) at line 35: Duplicate entry '54389' for key 'PRIMARY'

Okey, must have brought with us too many. Let’s see if I can find the next position:

root@mysqlrecov2:/var/lib/mysql# less replay.sql  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10106096 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay2.sql      
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay2.sql

Success! Maybe… Let’s try the sanity check:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb052407810> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$

Aw, yeah! So let’s dump out what we got here as valid snapshot:

root@mysqlrecov2:/var/lib/mysql# mysqld --daemonize --log-error-verbosity=3 --pid-file=/tmp/mysqld2.pid --socket=/tmp/mysqld2.sock --datadir=/var/lib/mysql/backup --port=3306 
root@mysqlrecov2:/var/lib/mysql# mysqldump sysbench > after_bin1.sql 
mysqldump: Got error: 2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) when trying to connect 
root@mysqlrecov2:/var/lib/mysql# mysqldump -S /tmp/mysqld2.sock sysbench > after_bin1.sql 
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to res
tore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events.  
root@mysqlrecov2:/var/lib/mysql# mysqldump --set-gtid-purged=OFF -S /tmp/mysqld2.sock sysbench > after_bin1_no_gtid.sql

Now let’s see if we can apply the events that followed after the “unfortunate” deletion of customers.

# at 16050579 
#200323 21:54:32 server id 1  end_log_pos 16050431 CRC32 0xd46e91ef     GTID    last_committed=32482    sequence_number=32484   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32484'/*!*/; 
# at 16050644 
#200323 21:54:32 server id 1  end_log_pos 16050509 CRC32 0x6f245997     Query   thread_id=66    exec_time=0     error_code=0 
SET TIMESTAMP=1585000472/*!*/; 
SET @@session.foreign_key_checks=1/*!*/;

Okey, so we only need start-position this time since we want everything from the end of the DELETE block:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=16050579 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay_last.sql                                    
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay_last.sql  
root@mysqlrecov2:/var/lib/mysql#

And is this consistent?

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb1bd586710> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer. 
cjp@kdemain:~/sysbench$

The last legit operations(after I did the big “accidental” delete but just before I brought down the database master to contain the damage) were:

Purchase: Deducted 958 from customer 218820 landing at 1442 
Deleted (83708,) 
Credit id 270922: Added 6224 to 261395 
Deleted (98105,) 
Credit id 270952: Added 7045 to 261281 
Purchase: Deducted 630 from customer 91130 landing at 2144

So is that last purchase now in the restored copy?

mysql> SELECT * FROM Purchase WHERE customer_id = 91130; 
+-------+-------------+------+--------+ 
| id    | customer_id | cost | weight | 
+-------+-------------+------+--------+ 
| 55193 |       91130 |  630 |    550 | 
+-------+-------------+------+--------+ 
1 row in set (0.00 sec)

And do we have customers with an ID below 5000?

mysql> SELECT COUNT(*) FROM Customer WHERE id < 5000;  
+----------+ 
| COUNT(*) | 
+----------+ 
|     4285 | 
+----------+ 
1 row in set (0.00 sec)

Indeed we do! So we have successfully cut out a particular disastrous sequence of operations, piecing together a coherent view of customers, purchases, credits etc. And it only took me an hour and a half. Probably not the response time considered ideal by a trader seeing 100 or so purchases per second.

Note that there’s no guarantee that this method will work out this well in practice. Indeed I may well have lost purchases by doing this but in the interest of un-screwing things it would probably have been the best way forward. So don’t view snapshots and binlogs as a license to clown around with a production database.

On a personal note: would it not be beneficial to have less brittle data structures than this? In a relational database we might have the state of Texas as a row in the table States and if it goes missing then every row representing customers from Texas will generate an error on read. Foreign key constraint error presumably. Maybe MongoDB is going too far in the other direction but DUPLICATE KEY and FOREIGN KEY CONSTRAINT errors make relational databases so sensitive. I guess this is a struggle each IT department has to have with their respective accounting departments.

MySQL crash recovery

Let’s crash some MySQL-databases, shall we? Well that’s harder than it seems. Especially if you’ve had to recover crashed databases on production servers. If they – set up to be very reliable – can crash during a network failure or hard reset, how come I can’t replicate the same when I tweak all settings to be at their most unreliable? Maybe the uptime of the server, the size of the database on disk, number of distinct databases and so on play a role? Well, let’s try to investigate MySQL crash recovery as best we can.

Hmm, maybe we should first establish some method of simulating a workload? Well, I have a script that simulates the sort of workload that you would see in high volume eCommerce. It makes full use of transactions to make multiple updates to multiple tables happen atomically(i.e. completely or not at all). Basically it pretends that there are customers that add credit to their account and then make purchases using that credit. Let’s try it out:

The script ownsysbench.py starts on process per IP or hostname provided and each runs 12 threads that each simulate the workload. As we saw in mytop, the number of queries per second varies between 700 and 100. This is on a dual core VM running on an Intel Atom server. So I’m kind of impressed. But in fairness I ran this on an ext4 filesystem with nobarrier on and relaxed commit rules(https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit). Let’s try it again with standard barrier for ext4 and default commit rules. Oh, maybe I should turn off Cache(unsafe) for the disk in Proxmox? That probably gave this run a performance advantage. Okey, so no extra disk caching, default flush of transactions to disk and ext4 has its barrier again:

So 600 to 800 queries per second. That’s not so bad. It’s not like it’s a tiny database either:

Well, 90 MB is kind of tiny but 200 000 customers is on the big side. With a smaller database we get up to 1200 and 1300 queries per second. Now, the point here is actually just to get a feel for what kind of load we’re putting on the system when we tweak variables. Even with all safeties on we have several hundred queries per second, of which many are INSERT and UPDATE.

So one could be forgiven for thinking that rebooting the server would cause problems. Whenever you pull the plug there will probably be a couple of transactions running. Let’s try it with these safe settings. I’m using a slightly smaller database size but we can try the big one later on:

I have set MySQL to not start automatically, to make it easier for me to spot error states. But as you can see, MySQL starts up without issue. It’s only when we look in the error.log(it’s a general log, not really sure why they named it that) that we see that some recovery-work had to be done. But it was entirely automatic.

5 transaction(s) which must be rolled back or cleaned up in total 6 row operations to undo 
2020-03-21T21:19:59.885968Z 0 [Note] InnoDB: Trx id counter is 373760 
2020-03-21T21:19:59.885997Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
<...>
2020-03-21T21:20:00.715726Z 0 [Note] InnoDB: Rolling back trx with id 373313, 1 rows to undo 
2020-03-21T21:20:00.715914Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 
2020-03-21T21:20:00.715974Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables 
2020-03-21T21:20:00.716079Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... 
2020-03-21T21:20:00.724026Z 0 [Note] InnoDB: Rollback of trx with id 373313 completed 
2020-03-21T21:20:00.724074Z 0 [Note] InnoDB: Rolling back trx with id 373312, 1 rows to undo 
2020-03-21T21:20:00.731498Z 0 [Note] InnoDB: Rollback of trx with id 373312 completed 
2020-03-21T21:20:00.731537Z 0 [Note] InnoDB: Rolling back trx with id 373309, 1 rows to undo 
2020-03-21T21:20:00.734520Z 0 [Note] InnoDB: Rollback of trx with id 373309 completed 
2020-03-21T21:20:00.734557Z 0 [Note] InnoDB: Rolling back trx with id 373286, 1 rows to undo 
2020-03-21T21:20:00.737645Z 0 [Note] InnoDB: Rollback of trx with id 373286 completed 
2020-03-21T21:20:00.737681Z 0 [Note] InnoDB: Rolling back trx with id 373213, 2 rows to undo 
2020-03-21T21:20:00.890034Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. 
2020-03-21T21:20:00.891339Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active. 
2020-03-21T21:20:00.891411Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active. 
2020-03-21T21:20:00.892242Z 0 [Note] InnoDB: Waiting for purge to start 
2020-03-21T21:20:00.892640Z 0 [Note] InnoDB: Rollback of trx with id 373213 completed
2020-03-21T21:20:00.890034Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. 
2020-03-21T21:20:00.891339Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active. 
2020-03-21T21:20:00.891411Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active. 
2020-03-21T21:20:00.892242Z 0 [Note] InnoDB: Waiting for purge to start 
2020-03-21T21:20:00.892640Z 0 [Note] InnoDB: Rollback of trx with id 373213 completed 
2020-03-21T21:20:00.892756Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed

mysqlcheck gave the tables a clean bill of health and the internal consistency is correct according the benchmark software itself:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.154 --check true                 
192.168.1.154 
Connection to 192.168.1.154: <mysql.connector.connection.MySQLConnection object at 0x7fbbda456710> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$

So I’ve tried turning off ext4’s barrier(letting writes be done a bit more sloppy), flush of innodb logs on transaction commit(they are then done once a second instead) and even that caching-mode in Proxmox called “Writeback (unsafe)”. InnoDB hasn’t given me a single complaint… As mentioned earlier, I have seen InnoDB crash in the wild so it’s not completely safe but I at least can hit it with a shovel for a couple of days and fail to make a dent.

I was planning on provoking an error which could be solved by starting MySQL in InnoDB force recovery mode level 1 and demonstrate how that can fix things. This setting(https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html) is something you hope not having to fiddle with but when InnoDB does crash in some way, innodb_force_recovery is the way forward. We would move on to something that requires recovery level 2 or 3. All the way up past level 4. Levels 4 and up typically involve some data loss and usually requires exporting and importing all the databases.

As noted above however, no go on generating these kinds of errors as intended. But if we can’t get one of those elusive errors that can be solved with innodb_force_recovery=3 then I know a way to generate errors that can be “solved” with innodb_force_recovery=6. So let me reset the database server while it is running again, with 200 000 customers and maximum danger settings on. Then we can do some damage ourselves and see what happens.

Okey, that actually didn’t go to plan… I was expecting to have to go to level 6 and for mysqldump to fail at some stage. Maybe I deleted the wrong ib_logfile? I have a snapshot of the whole mysql partition after the reset. I can try again. No, now not even level 6 worked. All right, let’s start over. I have – as previously mentioned – been hitting this setup with a shovel for a couple of days.

So I switched to Btrfs instead of Ext4, for no other reason than feeling like it. Now let’s try that intentional break again:

Again we only had to go to level 5 of innodb_force_recovery but importantly we got stuck when trying to dump a specific table from the database. In a real life situation it’s easy to imagine how losing the list of items for all purchases would be a bad thing. I’m pretty sure audits by the IRS go better when you can show what you sold and to whom.

So while this doesn’t show tremendously well how to deal with more subtle forms of InnoDB crashes you get the general idea. Notice how I created a Btrfs snapshot before starting to tinker with stuff. That can be a good idea if the option is available. Here I intentionally broke something but the process of fixing things could actually break things even more. And the most important take-away here is that not all corruptions are recoverable. Even if all the data in the sysbench-folder is correct a failure in one of the ib_logfiles can leave you with entire tables being unrecoverable. You got to have backups!

So what about MyISAM? Well it’s a lot easier to break MyISAM. In fact, since the benchmark tool is based on transactions – and MyISAM doesn’t support transactions – I only need to run it for a few seconds to get garbage data out. Let’s first dig ourselves out of the hole we dug for ourselves(fine, the hole I dug for ourselves).

But MyISAM says this clearly on the box so we can’t very well complain. Need transactions? Don’t use MyISAM. Need crash tolerance? Don’t use MyISAM. Have large table? Don’t use MyISAM. It’s surprising how many MySQL-databases in the world don’t conflict with those constraints and merrily run MyISAM without issue. But let’s break a MyISAM setup just for kicks!

This run was done with maximum safe settings. Btrfs doesn’t have a nobarrier equivalent as far as I know. But Proxmox added no caching and MySQL’s settings were conservative. The errors that arose were recoverable but we can’t tell if the contents of the databas is consistent with the application because as we saw before, MyISAM can’t guarantee that even when there’s no one doing a hard reset every couple of minutes.

So that was a slightly disjointed tour of MySQL failure and recovery. The next installment will be about how to cover our backs. Step 1 is clearly to use InnoDB as the underlying engine and I guess step 2 is not deleting ib_logfiles. Snapshots, binlogs and replication will follow.

Pacemaker cluster

Pacemaker is a piece of software for running a cluster of computers. It uses corosync to provide cluster communication and membership management in most deployments I believe. The command line tool on Ubuntu Linux is pcs. pcs status tells your what the cluster is doing and which nodes are currently members. pcs cluster start node2 starts node2, even if the command is executed on another server in the cluster.

What makes Pacemaker a bit different from other cluster management systems is that it handles those awkward mutual exclusion scenarios quite well. You may have heard about Keepalived used to provide high availability and Kubernetes is a popular cluster platform today. But neither can safely handled resources that require mutual exclusion. It’s perfectly possible for a Kubernetes node running a MariaDB instance to freeze, at which point Kubernetes starts up the same instance on another server, only for the original to suddenly come alive again.

Pacemaker supports fencing which makes sure that something that had a unique resource can not run it again without approval from the cluster. This is typically done by killing the previous runtime(virtual machine, physical server or whatever it may be) and waiting for a response message that gives us the assurance the previous runtime isn’t going to do something unexpected.

For physical hardware IPMI is typically used to do fencing, cutting the power to all the parts of the computer that aren’t the IPMI-board itself. It is necessary for the IPMI-unit to continue operating because fencing requires both the kill action and some verification that the kill action worked. A common method is to let IPMI reset the power to the server and the verification is simply the server waking up and asking the cluster software if it can join again.

If implemented properly fencing should give complete assurance that two instances of a database server won’t be accepting writes concurrently, or instances of a file system writing to shared storage without being synchronized. But that means that if we lose contact with a server and its IPMI module, then the node can’t be fenced and we can’t safely bring up a new instance of whatever software failed along with the server. A human being can typically resolve the situation but that may take time.

Note that it is highly desirable to have clusters that do not rely on mutual exclusion. Ceph is a storage system with no need for fencing as it is quorum based(a majority of nodes have to agree on things, leaving some wonky node that freezes for a minute every now and then out in the cold) and each node has it’s own local storage(so none of that troublesome shared storage). Similarly Galera is an addition the MySQL-based databases that allow for multiple masters simultaneously thanks to a system of mutual certification of each proposed transaction so that no two changes to the contents of the database are ever in conflict.

Setup

There is good documentation on setting up a Pacemaker cluster on the website for the Pacemaker project: https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Clusters_from_Scratch/index.html

But creating somewhat elaborate configurations isn’t super-easy. Let’s look at creating a MySQL cluster. I’m no expert on this but I’ll explain it as best I can.

# pcs cluster cib sveaha.xml
# pcs -f sveaha.xml resource create db_master_ip ocf:heartbeat:IPaddr2 ip=192.168.1.67 cidr_netmask=32 op monitor interval=30s on-fail=restart
# pcs -f sveaha.xml resource create mysql ocf:heartbeat:mysql \
binary=”/usr/bin/mysqld_safe” config=”/etc/mysql/my.cnf” \
datadir=”/var/lib/mysql” pid=”/var/lib/mysql/mysql.pid” \
socket=”/var/lib/mysql/mysql.sock” \
replication_user=”replication_user” replication_passwd=”MYPASSWORD” \
–master \
additional_parameters=”–bind-address=0.0.0.0″ op start timeout=60s \
op stop timeout=60s op monitor interval=20s timeout=30s \
on-fail=standby
# pcs -f sveaha.xml resource clone mysql clone-max=3 clone-node-max=1
# pcs -f sveaha.xml constraint colocation add master mysql with db_master_ip score=INFINITY
# pcs cluster cib-push sveaha.xml –config

First we have pcs cluster cib sveaha.xml which simply dumps the configuration of the cluster to an XML-file called sveaha.xml. We’ll make pacemaker store our changes in that file for now and then when we’re done we load it all into the running cluster as a single transaction.

Next pcs -f sveaha.xml resource create db_master_ip ocf:heartbeat:IPaddr2 ip=192.168.1.67 cidr_netmask=32 op monitor interval=30s on-fail=restart creates an IP-address 192.168.1.67 that we call db_master_ip. We’re not telling the cluster how and when to use it yet, merely that it exists and should use the IPaddr2 resource script from the ocf:heartbeat library.

Now the big one:
# pcs -f sveaha.xml resource create mysql ocf:heartbeat:mysql \
binary=”/usr/bin/mysqld_safe” config=”/etc/mysql/my.cnf” \
datadir=”/var/lib/mysql” pid=”/var/lib/mysql/mysql.pid” \
socket=”/var/lib/mysql/mysql.sock” \
replication_user=”replication_user” replication_passwd=”MYPASSWORD” \
–master \
additional_parameters=”–bind-address=0.0.0.0″ op start timeout=60s \
op stop timeout=60s op monitor interval=20s timeout=30s \
on-fail=standby

This creates a mysql database resource named mysql based on the ocf:heartbeat:mysql resource script. We specify the path of the binary, the mysql config file, the directory that mysql should use to store the data files, what pid-file to use and what path should be used for the socket for local communication. Phew, next bit: we have to specify the replication user and the password used. The ocf:heartbeat:mysql script uses these values when turning server nodes into slave, executing queries like “CHANGE MASTER TO”. The –master part is very important and declares that this resource shall be divided into master/slave instances. Otherwise all nodes would be the same, either masters or slaves. Finally we say that the servers should listen for traffic intended for any IP-address(–bind-address=0.0.0.0) and give some reasonable values for starting, stopping and so on.

So far we’ve basically said that these MySQL instances should be present on all nodes but we need to be more precise. That’s the next bit: pcs -f sveaha.xml resource clone mysql clone-max=3 clone-node-max=1

This states that the mysql resource which is made up of clones should only have a total of three active clones at any time and no node should have more than one clone. My cluster has three nodes so maybe I could have skipped the clone-max setting but it seemed better to include.

What about that IP-address? pcs -f sveaha.xml constraint colocation add master mysql with db_master_ip score=INFINITY
This states that the master mysql node and the db_master_ip should always be colocated on the same node, no exceptions. If the score is 5 and some other rule with score 6 says that the db_master_ip should be somewhere else, then it will win.

Finally, let’s state that the IP-address for the mysql master should be started until after the master node is ready: pcs -f sveaha.xml constraint order start mysql-master then db_master_ip

I also run three DNS-servers on these nodes and they aren’t divided into master/slave so the configuration looks slightly different:

# pcs -f sveaha.xml resource create bind ocf:heartbeat:named named_user=”bind” named_config=”/etc/bind/named.conf”
# pcs -f sveaha.xml resource clone bind clone-max=3 clone-node-max=1
# pcs -f sveaha.xml resource create bind_service_ip ocf:heartbeat:IPaddr2 ip=192.168.1.68 cidr_netmask=32 op monitor interval=30s on-fail=restart
# pcs -f sveaha.xml constraint order start bind-clone then bind_service_ip
Adding bind-clone bind_service_ip (kind: Mandatory) (Options: first-action=start then-action=start)

So nothing here about –master or saying that the IP-address used to access the DNS-service has to be placed on a specific node. As long as some bind-clone runs on a node Pacemaker can put the IP-address there. But no IP-address is allowed to run on multiple nodes at the same time by design.

Let’s add some fencing. I rewrote the demonstration fencing-script that uses SSH to have my server hypervisors kill the cluster nodes. It’s not super-pretty but it works: https://static.hemsida.eu/wp-content/uploads/2019/11/proxmox

You will need to set up passwordless SSH to the hypervisor nodes and place the script linked above in the directory for external scripts, on Ubuntu: /usr/lib/stonith/plugins/external/

To add these fencing devices to your Pacemaker-cluster:

# pcs -f sveaha.xml stonith create proxmox_fence_pcmk1 external/proxmox proxmoxhost=pve1 vmid=117 pcmk_host_list=”pacemaker1.svealiden.se”
# pcs -f sveaha.xml stonith create proxmox_fence_pcmk2 external/proxmox proxmoxhost=pve2 vmid=120 pcmk_host_list=”pacemaker2.svealiden.se”
# pcs -f sveaha.xml stonith create proxmox_fence_pcmk3 external/proxmox proxmoxhost=pve3 vmid=122 pcmk_host_list=”pacemaker3.svealiden.se”
# pcs -f sveaha.xml property set stonith-enabled=true
pcs cluster cib-push sveaha.xml –config

This just says that the pcmk1 fencing device based on script external/pve shall fence the virtual machine with vmid 117 and that this has to be done on the Proxmox host pve1. The host list is what Pacemaker will look at when it tries to fence something. So pacemaker1.svealiden.se is now fenced by pcmk1 by running a command on Proxmox host pve1 using 117 as an argument to that command. Simiarly for pcmk2 and pcmk3. Finally we enable stonith(Shoot the Other Node in the Head) and load the configuration into the cluster.

I’ve been running a live MySQL and bind cluster like this for some months and by and large it works. I think I’m going to set Pacemaker to autostart itself on each node after power on though. By default it requires nodes to be started manually. One interesting and infuriating thing is that MySQL sometimes ends up with split brain requiring manual intervention. You can see if a slave is broken by running:

MariaDB [(none)]> SHOW SLAVE STATUS\G;

If you see error messages about duplicate keys and stuff, then you’re also in trouble. I don’t understand why this happens. I’ve read through the scripts used by Pacemaker and they all seem to make sense. If you’re sure nothing important got written to the node that Pacemaker can’t bring back into functioning slave-mode you can just delete everything in the MySQL data directory and create a blank template(example has node3 as a master, node2 as a broken slave that can’t catch up and node1 as a functioning slave):

node2 # pcs cluster stop –node2
node2 # rm -rf /var/lib/mysql/*
node2 # mysql_install_db

You then create a dump of the database from one of the functioning slaves:

node1 # mysqldump –dump-slave=1 –all-databases > dumpfromslave.sql

Then import it into the blank MySQL install on the broken slave, making sure to start the MySQL instance yourself, not through Pacemaker:

node2 # systemctl start mysql
node2 # mysql < dumpfromslave.sql

Next you need to manually get this node to catch up with the rest of the cluster before getting Pacemaker to manage the node. Not sure why this is but I spent hours with split brain issues even using this new “reset broken slave with SQL dump file” until I followed this advice. Let’s find out what the coordinates for synchronization were in the SQL dump file:

node2 # mysql
MariaDB [(none)]> SHOW MASTER STATUS;
+——————-+———-+————–+——————+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+——————-+———-+————–+——————+
| mysqld-bin.000015 | 328 | | |
+——————-+———-+————–+——————+
1 row in set (0.00 sec)

Then run the whole command for making this node synchronize with the master:

> CHANGE MASTER ” TO MASTER_LOG_FILE=’mysqld-bin.000015′, MASTER_LOG_POS=328, MASTER_HOST=’node3′, MASTER_USER=’replication_user’, MASTER_PASSWORD=’MYPASSWORD’;

Note that in my example node2 was the broken slave, node1 was a functioning slave and node3 was the master. We didn’t get an SQL dump from node3 because that would lock tables, preventing writes. But once the dump was read into node2 we set it to synchronize data from node3. It is possible to get a slave to synchronize from a slave but Pacemaker will try to set things up with all slaves synchronizing from the master so that’s how I prepare node2.

With this done I can close the MySQL instance started via systemd and let Pacemaker take over again:

node2 # systemctl stop mysql
node2 # pcs cluster start

This makes node2 start its own Pacemaker instance and join the cluster. Note that pcs status and pcs resource won’t tell you if MySQL slaves are running correctly, merely if they are live and labelled as slaves by Pacemaker. I use a Zabbix plugin to keep track of whether or not slaves are actually synchronizing. It warns “Slave SQL Not Running” if a slave isn’t keeping up with the master.

I’d like to try ClusterControl and MySQL Orchestrator at some point. We’ll see what time allows.

Percona XtraDB Cluster again

So I got fed up with Percona XtraDB Cluster some time back. It was the backend for Zabbix and a bad crash left me with a cluster that just wouldn’t form a quorum and that was it for me. Other problems had included lots of errors on commits due to inconsistencies between nodes which I felt was difficult to justify since only one of the nodes actually received any writes…

Well I have since written a python script that can write two multiple master servers compatible with MySQL. It complies with Percona’s Strict Mode and after some grinding I got it to maintain consistency of data. The code is absolutely awful but it works! One of the neat tricks is that it can at any point go through the database to determine that the “business logic” is maintained. That’s how I dug up tricky corner cases where a customer could have a purchase registered to him after he had been removed from the Customer-table.

So what happens when I run the script at full tilt(it’s not fast but 15-30 transactions per second isn’t bad given only three queries can run in parallel due to the aforementioned awful code) and I pull the plug on all three Percona XtraDB Cluster VMs? Well the script stops printing out information about completed operations because it can’t continue. So I kill the script.

Next I reboot all three nodes and they just automatically get back into working order. No fiddling with extraneous tables or having to set innodb_force_recovery. I was surprised. I’ve tried it a few times and it just works. Running the consistency-checker shows that atomicity has been maintained. I even took minute-by-minute snapshots of the mysql partition(running Btrfs) and forced a node back in time so it looked like it had lost five minutes of updates. Still no problem!

Now I find myself in the surprising position of having to roll up my sleeves and meticulously engineer a scenario to replicate what I once encountered in the wild. I will basically have to:

Reset the database back to a shared state X
Advance the cluster by writing from my script
Snapshot node 1
Reset the database back to the previous state X
Write some more
Snapshot node 2
Repeat for node 3

Eventually I will have three snapshots – one on each node – which constitutes a completely split brain. No quorum and no automatic syncing possible. Then I can test how to recover manually. Hmm,. I should probably snapshot all three nodes in each of the split brain versions. Then I can try a 2-1 situation and see how Percona XtraDB Cluster will handle it. It should let the 2 nodes that are in agreement start up but the 1 should be kept out, not resynced or overwritten – just excluded.

Well, that’s what I think anyway.

Addendum 1: Now something interesting happened for some reason. I had shut down the cluster cleanly and yet standard start via systemctl didn’t work:

root@pcmk1:~# systemctl start mysql
Job for mysql.service failed because the control process exited with error code.
See “systemctl status mysql.service” and “journalctl -xe” for details.

I had to cat the contents of /var/lib/mysql/data/grastate.dat to find the node with either the highest sequence number or safe_to_bootstrap: 1. I found the latter:

root@pcmk1:~# cat /var/lib/mysql/data/grastate.dat
GALERA saved state
version: 2.1
uuid: ee445958-ed07-11e9-a650-a6dcefe074b9
seqno: 458661
safe_to_bootstrap: 1

The other two nodes had the same seqno. Anyway, with this information in hand pcmk1 could be started with /etc/init.d/mysql bootstrap-pxc and then I could start the other two nodes with systemctl start mysql and then all was well again.

I’m getting the idea that the startup script for Percona XtraDB Cluster does some self-bootstrapping that isn’t just done by running systemctl start mysql. Or this is just one more mystery.

Notes to self:
Point X: Snapshots/data/data@auto-2019-10-29-2111_48h
Start of brain 1: Snapshots/data/data@auto-2019-10-29-2123_48h
End of brain 1: Snapshots/data/data@auto-2019-10-29-2126_48h
Start of brain 2: Snapshots/data/data@auto-2019-10-29-2134_48h
End of brain 2: Snapshots/data/data@auto-2019-10-29-2137_48h
Start of brain 3: Snapshots/data/data@auto-2019-10-29-2143_48h
End of brain 3: Snapshots/data/data@auto-2019-10-29-2146_48h

Okey, testing brain 2 on pcmk3 and pcmk2 and let pcmk1 use brain 1. pcmk2 and pcmk3 had slightly different seqno which isn’t so weird since the snapshots can’t be initiated at the exact same microsecond and can’t really be expected to complete in exactly the same number of hard drive writes. So I had to edit the grastate.dat file for pcmk3 so that safe_to_bootstrap was set to 1 and then ran /etc/init.d/mysql bootstrap-pxc. I needed to run systemctl start mysql a couple of times to get pcmk2 to sync but eventually it got into place.

As expected pcmk1 couldn’t join and was left in its own partition. So far so good. Now to let each node use their own brain.

2019-10-30T21:21:22.923454Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2019-10-30T21:21:22.923806Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924357Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924830Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 0 (pxc-cluster-node-1)
2019-10-30T21:21:22.925159Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 1 (pxc-cluster-node-2)
2019-10-30T21:21:22.925188Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():324: Reversing history: 464082 -> 462077, this member has applied 2005 more events than the primary component.Data loss is possible. Aborting.
2019-10-30T21:21:22.925212Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Seems reasonable. One node bailed out real quick but two kept trying.

| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | ee445958-ed07-11e9-a650-a6dcefe074b9 |
| wsrep_cluster_status | non-Primary

But the cluster didn’t process queries during this attempt to form a quorum. Just as it intended. So let’s say we know for a fact that brain 3 is the “best”. Brain 1 and brain 2 are just wrong, not actually holding information that needs to be preserved. Percona XtraDB Cluster shouldn’t make this determination, as it hasn’t. But we need to be able to do this when we’ve reviewed the situation(and need to get things back up and running). So let’s run this on pcmk2:

rm -rf /var/lib/mysql/data/*
systemctl start mysql

And a check of the STATUS:

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306
wsrep_cluster_weight 2
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid 0a119cbd-fb5e-11e9-a854-a317f06d04a7
wsrep_cluster_conf_id 2
wsrep_cluster_size 2
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

And the logic check passes when running my Python script. Okey, so let’s join the last node… And yeah, works the same. So if one is patient all you need to do is completely blank the datadir(which in my case is /var/lib/mysql/data but for most people will be /var/lib/mysql) and start the node again. It will do state transfer to initialize itself and then join.

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306,192.168.1.64:3306
wsrep_cluster_weight 3
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid ce76d2c9-fb5e-11e9-a04e-926b86eb0f23
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

What to do if you have a split brain scenario and no node is actually “wrong” but there actually is a discrepancy in valuable data? Well then you’re boned! Databases do not have a convenient way of merging data like that. I’m of the opinion that any application that writes to a database should have some fallback mechanism whereby it can scan versions of the database that diverged away from the master and then inject the changes into the master.

In my dummy application this would involve retrieving all purchases and all deposits of credit from the diverged copy. Any purchase or deposit of credit unique to the diverged copy would then be processed as if though they were brand new transactions coming in from customers, with the customers’ credit balance being changed accordingly. In some systems this may not be possible but it’s really a desirable feature when you don’t want a split brain to be an insurmountable obstacle to getting your records straight.

Debugging and profiling quick guide

strace -c -f -p 2021

Count( -c ) the system calls of process with pid 2021 and all the children spawned by that process( -f ). Produces nice set of stats:

<Ctrl-C> 
 strace: Process 31867 detached
 strace: Process 31868 detached
 strace: Process 31869 detached
 strace: Process 31870 detached
 strace: Process 18996 detached
 % time     seconds  usecs/call     calls    errors syscall
  31.06    0.854875        3143       272           io_getevents
  26.19    0.720795       34324        21         4 restart_syscall
  14.65    0.403190        2965       136        19 futex
  14.38    0.395702        2105       188           poll
  12.12    0.333484       10421        32           select
   0.76    0.020852         348        60           fsync
   0.28    0.007772          13       605       201 recvfrom
   0.21    0.005732          24       236           sendto
   0.15    0.004123          29       140           io_submit
   0.06    0.001635          51        32           pwrite64
   0.04    0.001057          31        34           read
   0.03    0.000827          44        19           madvise
   0.02    0.000640          13        48           lseek
   0.02    0.000582          34        17           write
   0.01    0.000320          80         4           getdents
   0.01    0.000215          15        14           fcntl
   0.01    0.000138          23         6           setsockopt
   0.00    0.000090          23         4           close
   0.00    0.000061          31         2           accept
   0.00    0.000059          30         2           shutdown
   0.00    0.000057          29         2           openat
   0.00    0.000052          26         2           access
   0.00    0.000047          24         2           getpeername
   0.00    0.000032          16         2           fstat
   0.00    0.000030          15         2           gettid
   0.00    0.000023          12         2           sched_yield
 
 100.00    2.752390                  1884       224 total

strace -f -p 2021 -o strace.log -s 500

Write strace of process with PID 2021 and all of its child processes to file strace.log and include up to 500 bytes of output for each line.

mkdir traces
cd traces
strace -ff -p 2021 -s 1000 -tt -o strace

For process 2021 and all its child processes, trace system calls and place them in files with names like strace.pid (that’s the -ff part). Include 1000 bytes of data per call(-s 1000) and let each call be marked with wall clock time(-tt).

strace -p 2021 -T -o strace.log

Trace process 2021 and mark each call by the time it took between start and end of the call. Log it to strace.log.

strace -p 2021 -e trace=file

Trace only file-related calls. See more categories: http://man7.org/linux/man-pages/man1/strace.1.html

Email login failures

An email provider will want to prevent brute force hacking attempts against their servers. Otherwise it will be too easy for hackers to keep guessing which passwords are used for various accounts and eventually break into them. Are there tools for this in the Linux ecosystem? Sure, there’s the Login Failure Daemon(lfd) that’s part of the ConfigServer Security & Firewall(csf). It counts how many failed login attempts are made from an IP address and when a certain limit has been reached within a certain time it blocks the IP address, for a certain time. There’s also OSSEC which is closer to my heart as well as the up-and-coming Imunify.

Before looking at how these measure up it’s worth to point out a few things:

People choose bad passwords
People screw things
Email clients are often extremely badly coded

Depending on your environment you can enforce some password complexity to deal with item 1. Item 2 is impossible to guard yourself again. People will connect multiple devices to a single IMAP accounts in 2016 and upon changing the password in 2019 have no clue that they have the account connected to anything but their smartphone. At which point the detective work begins: what’s trying to log in to my email account and getting me blocked every fifteen minutes?

The third point is the most aggravating. Why would anyone write an email client that keeps sending login attempts to an email account thousands and thousands of times while being constantly told by the server: “Login credentials invalid”. It’s not like a programmer has to guess what the magical error message from the server is going to be because that’s standardized! See: https://tools.ietf.org/html/rfc5530

But developers big and small continue making email clients that just keep spamming the server with the same login credentials, sometimes at a rate of once a second. So how do we deal with this at the server level? Ideally we would never block actual customers trying to login to their own accounts even when they keep sending the wrong password. But the correct username and password combination is precisely how we distinguish between valid users and everyone else.

The obvious answer is to block those IP-addresses that keep guessing new passwords all the time and not block those that just keep using the same credentials over and over again. In Dovecot the setting auth_verbose_passwords=sha1 can be used to display the sha1 hashsum of passwords used in failed login attempts. Courier has DEBUG_LOGIN=2 though that logs the password in plain text. Login failure daemon can only be made to use this information by rewriting the Perl script that is lfd.

OSSEC might be possible to configure this way but I’m not sure. There’s no ready-made field “password” that you can include and no rule “not same password”. There is for source IP and a rule “same_srcip”, but not for password. I’m not necessarily blaming the developers of lfd and OSSEC for not writing their software to handle a world in which email clients are written by monkeys. But given the whole monkeys-writing-software situation we’re in, they fall short.

Imunify doesn’t try to factor in the password used in a failed login attempt either but does at least have some intelligent alternatives. Most notably that a customer logging on to something like cPanel whitelists their IP address. I would propose that a successful login to any email account on a server should also lead the IP address to be whitelisted on that server.

If successful login to an email account from IP address A, ignore failed logins for 1 day.

This does leave open the possibility of a valid customer trying to hack other accounts on the server with his own account on it. This seems to me a very small risk but one could be more restrictive:

If successful login to any email account x@y.z from IP address A, ignore any failed login attempts to email accounts ending with @y.z.

That wouldn’t work with Imunify white lists because it doesn’t have that kind of granularity. I think we must conclude that we have little choice but to write our own code for handling email login failures and if we’re going to do that then we might as well do it right and analyze a log file with information of the password used.