bug hunting in the real world
Tue, Mar 31, 2015

Finding the root cause of bugs in the real world is hard. It can be full of false starts, and sometimes, even after you understand how all the moving parts fit together, you're still left wondering whose fault (if anyone's) the bug is. Bugs that are the product of emergent behaviour are the hardest to pin down.
I’ve run into two of these in the last few months. Both bugs span multiple open source projects and involved lots of blood, sweat, and tears before I finally determined the root cause.
Bug #1: excessive salt-minion CPU usage in Docker
I recently made a few small changes to my team’s salt repo. Since I’m a semi-responsible developer, I wanted to test these changes before unleashing them on the rest of the team. I fired up docker (which has the nice advantage of being able to cache intermediate stages of a build) and installed salt-minion.
Things seemed to work just fine. Salt was installed, functioning, no weird errors were seen in the logs. I had to work around the fact that iptables doesn’t seem to work quite the same way within a container, but no huge issues.
Then I noticed my laptop’s fan wouldn’t shut off. I usually blame my corporate Outlook install, since it has a habit of pinning a single core for absolutely no reason, but in this case Outlook was blameless. Running top showed me that salt-minion was using all the CPUs on my computer. After several minutes CPU usage settled down.
The first tool I reach for in cases like this is strace. Unfortunately strace barfed with a permission denied error. I tried starting a docker container as the root user and still got a permission denied error (note: a colleague later told me I could try running the container in something called privileged mode). Gdb gave me similar errors. Eventually I realized that this was related to AppArmor on Ubuntu (don’t ask me how I figured it out; iirc I was googling pretty hard-core at that point). Apparently the AppArmor profile shipped with the docker package wisely considers calling ptrace() on containerized processes ill-advised. I ended up calling some combination of apparmor unload and monkeying with some files until strace worked. I could now strace a process in the container. Step one accomplished.
Using strace can be a little tricky when working with a daemon that uses worker processes. Usually I end up writing something like this (paraphrasing):
# Attach to every salt process; expands to something like: strace -s1024 -fp 10 -fp 22
strace -s1024 $(pgrep salt | awk '{print "-fp",$1}' | xargs)
The first result shows some pretty odd behaviour:
close(one bajillion)         = -1 EBADF (Bad file descriptor)
close(one bajillion and one) = -1 EBADF (Bad file descriptor)
close(one bajillion and two) = -1 EBADF (Bad file descriptor)
One wonders why python feels the need to close thousands of non-existent file descriptors. If we answer that question, maybe we’ll understand why salt runs so slowly in docker. I used some old-school printf debugging and some liberal editing of system python library files (that’s how I roll), and discovered that the closes were all triggered by some salt code that was attempting to find the available network interfaces.
It was possible that the keyword argument close_fds had something to do with the closing of invalid fds. However, it’s odd that calling subprocess.Popen (a python stdlib call) would perform this poorly: python is often used as a glue language and is no stranger to spawning processes. Calling subprocess.Popen(close_fds=True) didn’t seem slow anywhere other than in docker. I now had my test case!
python -c "import subprocess;subprocess.Popen(['/bin/ls'], close_fds=True).wait()"
I’d accomplished something important: I’d created a small, self-contained program to trigger the bug. I’d also shown that the bug didn’t really have anything to do with salt. The bug went from being a salt/ubuntu/docker bug to being a python stdlib/docker bug.

With that in mind, let’s find out more about how the close_fds flag works. close_fds is a security feature that prevents child processes from having access to the filehandles of their parent process. close_fds is implemented differently in different python versions. In many versions of python it’s implemented by determining the max possible fd and then calling close() on every fd from 3 up to MAX. Computers are pretty fast, so MAX would have to be really high to make subprocess.Popen this broken, and even then it shouldn’t take that long. The equivalent C code would look something like:
int i;
int max = getmax();

for (i = 3; i < max; i++) {
    close(i);
}
However, since this is python, the closing code is pure python and actually looks something like this:

for i in range(3, getmax()):
    try:
        os.close(i)
    except:
        pass  # ignore whatever exception was just constructed and thrown
So instead of just making getmax() syscalls, it makes getmax() syscalls and also constructs, throws, and catches getmax() exception objects. That is what’s so CPU intensive. So the final piece of the puzzle is to determine whether getmax() (the limit on open files) is exceptionally high under salt. In linux the default number of open files for a user process is usually 1024. Applications such as Cassandra or http load-generating apps will ask for more, but it’s relatively rare. It turns out that the docker container runs with a default of 500k!
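A minimal sketch of that overhead (close_fd_range is my name, and the fd range is chosen to stay well clear of any real descriptors): every failed close() costs both a syscall and the construction of an exception object, so the cost scales with the fd limit, not with the number of open files.

```python
import os
import resource
import time

# Check this process's open-file limit; in the affected docker
# containers the limit was ~500k instead of the usual 1024.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE:", soft, hard)

def close_fd_range(lo, hi):
    """Mimic the pure-python close loop: try to close every fd in [lo, hi)."""
    failed = 0
    for fd in range(lo, hi):
        try:
            os.close(fd)
        except OSError:
            failed += 1  # fd wasn't open; an exception was still built and caught
    return failed

start = time.monotonic()
failed = close_fd_range(10_000, 60_000)
print(failed, "EBADF closes in", round(time.monotonic() - start, 3), "s")
```

Running this with the range cranked up toward 500k is a decent stand-in for what every subprocess.Popen(close_fds=True) was paying inside the container.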
So now we know exactly why salt performs so slowly in docker on ubuntu:
- Ubuntu’s docker package starts docker with a user file handle limit of 500k
- Containers spawned by docker inherit that high limit
- Python reads that limit when spawning a process with close_fds=True and tries to close every fd up to it
- Slowness ensues
Whose fault is this? Is it python’s fault for having a stdlib call that can perform so pathologically? Is it docker’s fault for running its daemon with such high filehandle limits? Is it salt’s fault for running slowly?
The blame seems to rest on docker and python. Docker has plans to implement a ulimit API (Issue 9876). Python itself has learned some new tricks: on some platforms the /proc filesystem can be used to find the open fds instead of iterating up to the max, and in newer versions of python process spawning goes through C code (_posixsubprocess.fork_exec).
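The /proc trick is worth seeing: on Linux you can enumerate exactly which fds are open instead of blindly closing every possible one (open_fds is my name for the helper, a sketch of the idea rather than CPython's actual implementation).

```python
import os

def open_fds():
    """Return the fds currently open in this process, read from /proc (Linux-only)."""
    return sorted(int(name) for name in os.listdir("/proc/self/fd"))

# Closing only these is O(number of open fds), not O(fd limit),
# so a 500k RLIMIT_NOFILE no longer matters.
print(open_fds())
```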
Bug #2: random data corruption in Go
This bug was reported to me by a colleague who was stumped. He was encountering some sort of ‘spooky action at a distance’ where a call to some function was corrupting data that had nothing to do with that function. I thought it wouldn’t take long to track down the bug: obviously we were seeing some unintentional sharing of data in a go []byte slice (which is mutable). The code that triggered the error looked something like this (error checking skipped for brevity):
type FooRow struct {
    ID      int
    Payload json.RawMessage
}

original := new(FooRow)
result := db.QueryRow("...", 1)
result.Scan(&original.ID, &original.Payload)

var newRow FooRow
json.NewDecoder(req.Body).Decode(&newRow)
db.Exec("some update statement", newRow.ID, []byte(newRow.Payload))
It seemed that without fail original.Payload was correct prior to the call to Exec, and after Exec it was corrupt. In our case it looked like the original JSON had been replaced with similar JSON missing the first few characters.
My first instinct was to create a simple test that reproduced the problem. In our case the code was being called in the context of an http handler, so I wrote a simple main function, supplied my own request and response using http.NewRequest() and httptest.NewRecorder(), and recreated the problem. Since I suspected this was the result of unintentional data sharing, I traced where the corrupt memory region was allocated, who it was shared with, etc. Since this is go, I built and ran my code with the race detector, which is quite good at finding unintentional data sharing among goroutines. My code unfortunately had no races. The original data structure was allocated using new(), so it started with a nil Payload. The payload itself was allocated when calling Rows.Scan. That seemed like a dead end unless there was a driver bug: in no code path was original.Payload touched.
However, I knew for a fact that data was being shared. As part of my process to determine how the corruption occurred, I printed out a pointer to the first element of the corrupt data. Go slices have a data pointer, a length, and a capacity, and I needed to verify that the data pointer was not being changed; only the data pointed to was being mutated. I knew that somewhere in the code there was a big byte slice whose data region (the span from the data pointer to the data pointer + length) must overlap with original.Payload.
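The kind of overlap I was hunting for can be sketched in a few lines (variable names are mine): two slices that share a backing array, where a write through one silently rewrites the other.

```go
package main

import "fmt"

func main() {
	// driverBuf stands in for a driver's reusable receive buffer.
	driverBuf := []byte("hello, world")

	// payload aliases the first five bytes of driverBuf: same data pointer,
	// shorter length. No copy happens here.
	payload := driverBuf[:5]

	// The "driver" reuses its buffer for unrelated work...
	copy(driverBuf, []byte("XXXXX"))

	// ...and payload is corrupted without ever being touched directly.
	fmt.Printf("%s\n", payload) // prints "XXXXX"
}
```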
So I started looking into the database/sql code and the mysql driver. I also started doing some printf debugging to determine where exactly within Exec the data was overwritten. I had to do something really hacky here: I created a package-level []byte variable in the mysql package and assigned my corrupt/soon-to-be-corrupt memory to it. After much debugging I determined that the corruption occurred in packets.go:
func writeExecutePacket(...) {
    ...
    paramValues = append(paramValues, v...)
    ...
}
This adds up: the Exec function eventually results in writeExecutePacket() being called, and the code I was looking at has something to do with writing bind parameters to the wire. We now have two dots to connect: how does original.Payload get connected to the outgoing buffer in the mysql driver?
It turns out that the go mysql driver has a “zero copy-ish” buffer that it uses to send and receive data. When data is received, original.Payload ends up being basically a pointer into the receive buffer. When the driver prepares to write data, that buffer is re-used. So now we understand the mechanics of how data is being shared and corrupted.
The question remains why? Scanning data from a database into a byte slice is a relatively common op in go. I’ve never had to worry about the data scanned into one of my structs getting overwritten before. What’s different now? Have I uncovered a bug in the mysql driver? Has all the database go code I’ve written up until now been wrong and I just didn’t know it? I must search the documentation for answers.
The database/sql package talks a little bit about shared []byte slices in Rows.Scan. It turns out that when you scan into a []byte you get a copy of the data; if you choose to pass Scan a RawBytes pointer, you don’t. It sure sounded like our data was being treated like RawBytes and not []byte. This seemed odd because I’d never even heard of RawBytes before, and I knew nobody in our codebase was using it. However, I was not passing a []byte to Scan(): if you read the code above carefully, you’ll see that Payload is actually a json.RawMessage. Rows.Scan doesn’t document what happens when you pass a json.RawMessage in, and some more println debugging showed that changing the type of Payload from json.RawMessage to []byte solved the problem!
It turns out that json.RawMessage buffers are also shared/zero-copy. This behaviour is not explicit but falls out of the fact that the database/sql package only copies values that are exactly []byte. Passing any other alias of []byte is effectively undefined behaviour and ends up being treated as a RawBytes value. Problem solved!
I tried to kick off a discussion here. Imo it’s odd that if I write:
type MyType []byte
then MyType is automatically opted in to the database/sql zero-copy behaviour.
So let’s recap how this bug occurred:
- Someone fetched JSON from the database and scanned it into a json.RawMessage field
- That json.RawMessage was only valid for the current database op
- A second database op occurred
- The json.RawMessage’s contents were replaced with a random piece of the driver’s send buffer
- Corruption ensues
Reflections
Sometimes it seems like I must use software in strange ways; I keep uncovering bugs that other people must have run into before. I wonder why I find so many obscure bugs now. I think that as a junior engineer I ran into similar bugs but lacked the skill to find the root cause, so I usually ended up just finding a workaround for the general category of failure. As my skills have increased, I can now identify things that aren’t working and chase the bugs all the way from application weirdness through the stdlib, strace, apparmor, and incorrect or misleading documentation.

This isn’t always an efficient use of time, and I can see why some might call it yak shaving. But to me it’s important that the foundational libraries I build my software on work, and work in either expected or documented ways. It’s also important for me to be able to deliver software that works end-to-end rather than just pointing to some area I lack knowledge in and saying “the bug must be there”.

We’ve all met those people who are constantly pointing to their broken software and saying things like “caching issue”, “compiler bug”, or “kernel bug” without any data to back that up. Don’t be that engineer.