Threat Stack has an event processing pipeline that processes millions of messages per second sent to us by customer agents, our internal services, cloud providers, and more. Some of these messages turn into alerts, others get written to a database to provide deep context into what is happening in a customer environment. When a Threat Stack user or a Security Analyst in our SOC finds a concerning alert, the next step is to dig a little deeper and use our platform to analyze if the behavior is expected or not.
Recently, we noticed that every Sunday morning we’d encounter a backup in writing these raw events to one of our databases. Writes would stop for about 10 minutes, our processing queues would back up, and our on-call engineer for that week would get paged. Generally, this kind of thing would clear itself up with some light human interaction… until the following week. Ideally, you want on-call activities to be actionable and infrequent, so this was a fast ticket to engineer burnout. How’d we fix the event?
It’s easy to get used to looking at the application code for bugs when you’re lucky enough to have access to it, as is the case for our internal services. But this was a red herring: in the end, the issue was related to lower-level system configuration. After we determined that the system wasn’t receiving any load spikes from customers, we started poking around the system itself, and that’s where we found a few weekly cron jobs that run by default.
One of these jobs will update the
man page database, another one would check to see if there was a new distribution version available, and the third one ran a command called
fstrim. This was initially unremarkable to us – we were running in a cloud environment where we’re not selecting our own disks or controllers, so surely something outside of the OS was handling this important function.
As you may guess, this post is here to tell you: the OS was handling this important function in a way that we did not expect – the
fstrim weekly cron command!
A Brief History of Deleting Files
When you delete a file on an actual spinning hard drive, the file gets marked with an identifier that the position on the disk is available for use. Over time, the bits that make up your file get overwritten with the bits of your new files, regardless of their physical position on the disk. Sometimes you could run an optimization tool like a disk defragmenter to make sure that all the individual blocks that made up a file were in order. This would increase read performance, and keep users happy with their desktop systems. In the server space, people running specialized commercial database software could (and did) increase performance by putting frequently accessed data as close to the platter as possible.
Solid State Drives changed this dynamic. SSD controllers abstracted away much of the “physical drive” bits to be compatible with existing computers. SSD’s store files in “pages” vs. block locations on a spinning platter. And, pages need to be contiguous – so the SSD needs to reserve some space to be able to move files around when you create (or delete) files. To facilitate this action, a new hardware operation called TRIM was implemented. This operation was a way to let your operating system tell your SSD to free up space for a file that was no longer needed.
The TRIM operation can be sent to the SSD with every delete (run continuously) or it can be run in batches. On a Linux system, you can use the mount option
discard to run the TRIM operation continuously. Alternatively, you can use the
fstrim command to run it all at once. We had not mounted our drive with the discard option, and had thought a controller might be handling this operation for us – on our directly attached storage provided to us by our cloud provider. We ran fstrim on a single database node, and writes stopped for a certain amount of time. We realized two things at this point:
- CPU IO Wait spiked up as did drive writes.
- There wasn’t any sort of more advanced storage controller between us and the drives – they were in fact (and as advertised) directly attached.
fstrimweekly was the source of our Sunday morning pain.
We deployed configuration to add that flag, and the issue was fixed. Right? Well, not quite – in fact, we posted a service availability note, because deploying that configuration caused delays in making raw events searchable. We reverted and processed the backlog at expected speeds, but we were disappointed: wasn’t continuous TRIM the solution to our Sunday morning wake up call? It was only part of the issue.
How to Store an Event
We leverage a search-optimized database for storing events for later analysis. This database supports storing data across multiple folders – however, we found that writes were not evenly balanced across the drives in that configuration, so we made a RAID 0 array of both drives and had the database use that. We chose RAID 0 because we built the system to replicate data across availability zones and different nodes. Disk errors on one or many nodes can be solved in software – when a node fails, a new node replaces it and any replicas are recreated on the new node.
When initially investigating the disk issue, I wanted to know if the drives actually supported the TRIM operation in the first place. Documentation said yes, but it’s always good to double check. When we ran
lsblk -D, the linux command which prints information about a drive’s support for the TRIM operation, the output looked like this:
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO nvme0n1 0 0B 0B 0 └─nvme0n1p1 0 0B 0B 0 nvme1n1 512 512B 2T 0 └─md0 0 512B 32K 0 nvme2n1 512 512B 2T 0 └─md0 0 512B 32K 0
We had implemented our RAID 0 array using Linux’s multi-device RAID driver, which creates a single logical device (/dev/md0 in this case) that can be formatted and accessed like a normal disk. We expected the root drive to not support the TRIM operation, since it is network attached block storage. Seeing this confirmed that we did have actual drives attached directly to the system.
In another part of our infrastructure, we had used LVM to set up a RAID 0 device. Interestingly to us the
lsblk -D output showed:
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO nvme0n1 0 0B 0B 0 └─nvme0n1p1 0 0B 0B 0 nvme1n1 512 512B 2T 0 └─vg--data-disk 0 512B 2T 0 nvme2n1 512 512B 2T 0 └─vg--data-disk 0 512B 2T 0
While the drives support up to 2TB as part of a TRIM operation, the way in which we created the multi-device raid0 had limited the maximum size of a TRIM operation to 32KB. This made our performance degradation make a lot of sense: we saw many thousands of IO requests per second when the
fstrim command would run, and our database would experience degradation while performing writes. This didn’t happen elsewhere in our infrastructure.
We brought the LVM provisioning code into this database’s provisioning system, tested our changes, and finally applied in production. Finally, the issue was resolved! Indeed, LVM sits on top of the multi-device RAID driver in Linux, so it’s possible we might have been able to address this issue without switching to LVM. Fortunately, one other benefit of this work is that it introduced some consistency in how storage is provisioned on hosts with multiple directly-attached storage devices.
The cloud abstracts a lot away from us in terms of hardware – most of the issues we encounter these days don’t involve optimizing the interaction between the OS and hardware. After all, cloud service providers do a good job of optimizing their hypervisors to be “out of the way,” and things function well enough that further optimization can result in diminishing returns. Every so often, however, an interesting behavior comes up and we like to write about it. If you find yourself in the same situation, remember to poke around the physical systems you’re running on. Beyond that, books like Brendan Gregg’s Systems Performance: Enterprise and the Cloud can help provide valuable methodologies and tools to help debug these kinds of problems.
This was originally posted on the Threat Stack blog.
2020-06-09 17:00 +0000
0e2c9ad @ 2020-08-02