On Nov 15 2022 @cszatmary DM'd me about a flaky test he was having trouble with.
I don't want to get too in the weeds here about the specific problem or the challenges of our codebase, but if you would like a tl;dr of the actual issue we encountered, click show gory details
below:
Click here to show gory details
- A brand new Supertest test would sometimes time out after 60 seconds, but only on CircleCI
- Turning gzip compression off would make the tests pass reliably, this turned out to be a red herring
- Turns out that deep inside one of our internal error handling middlewares, we were calling `next(err)` after sending a response, this is incorrect and leads to racy behavior
- This race meant our openapi response validator only 500'd invalid responses (the desired behavior when running tests) very rarely, when it won a race (which only happened sometimes, on slow CI, and with gzip on)
🤯 More interesting than all of this is some of the tricks I picked up trying to troubleshoot this.
The most valuable tool in your toolbox is the ability to diff logs. If you can get logs from a successful run, and a dirty run, you can isolate what happened... in theory.
If you're lucky, this means just running diff -u bad.txt good.txt
. Often, there's a lot that's accidentally different (timestamps, UUIDs) and the .diff is too noisy to read. Sometimes, it's worth cleaning up the differences by hand in a text editor (column selection is nice for this!), but also it's not terrible to just visually scan the two files side by side in a text editor.
Many (but not all) Node/npm modules use the debug
package. You can opt into receiving debug info from specific packages like: DEBUG="knex:*"
, or multiple: DEBUG=nock:*,knex:*"
It was very helpful to search for ROLLBACK
here, with one transaction per test we can clearly distinguish between test runs without doing any logging in tests, and see what's going on in postgres.
The problem with DEBUG
is, if you don't know where an issue is happening, it's tough to guess which modules you want to debug. You can opt into all modules that support debug
with DEBUG=*
, but this gets very noisy.
We had to disable jest's --coverage
because it would log 200MB of source mappings to stdout. If you can get your logs down to a reasonably sized output (<1MB), some of the techniques below will help deal with the noise.
tee
is a program that writes input from stdout
to both stdout
and a file. e.g.
# This produces no output on the screen
yarn test > output.txt
# This outputs to the screen _and_ to a file for later use
yarn test | tee output.txt
Being able to get files off of the CircleCI box was necessary.
# note: redacting CircleCI IP address with 127.0.0.1
scp -P 64535 [email protected]:/home/circleci/project/output.txt .
Depending on the CircleCI instance I was on, this sometimes failed with: subsystem request failed on channel 0
.
Looks like they have different sshd configurations on different servers. OK, a pain, but we can still just use SSH:
# note: redacting CircleCI IP address with 127.0.0.1
ssh -p 64535 127.0.0.1 cat /home/circleci/project/output.txt > output.txt
On some systems, when you pipe output to tee
, you lose ANSI terminal colors. (And despite Rob Pike's warm take, colors are in fact a very good way to communicate information hierarchy and are invaluable when debugging.)
You can fix this with unbuffer
(ref: https://superuser.com/a/751809/27002)
sudo apt-get install expect
# or: brew install expect
unbuffer yarn test | tee output.txt
The flaky test as described above manifested as a 60 second timeout. When looking at output.txt
after the fact, even with some +10ms
and +44ms
lines here and there, it's still not straightforward to see where the 1 minute delay is. It's easy to see it hang as it's running, but if you're not taking notes, you're out of luck.
With ts
, we can add timestamps to all stdout/stderr output, without changing any code.
sudo apt-get install moreutils
# or: brew install moreutils
# The 2>&1 makes sure stderr also gets timestampped too.
unbuffer yarn test 2>&1 | ts '[%F %.T]' | tee output.txt
Here we can clearly see in a log file a ~60s delay in between individual log lines. This would be very tough to reason about without the timestamp.
We can also pass -s
to ts
to make it show elapsed (wall-clock) execution time on each line, which might be easier to visually distinguish / do mental arithmetic on, see the clear 60s difference between 28
and 88
below:
unbuffer yarn test 2>&1 | ts -s "%.s" | tee output.txt
If you are lucky enough to be in the Github Code Search beta, you can do org-wide or even Github-wide searches for things. This is incredibly useful, but since it's not generally available, I'll show you the same idea with an open source tool, ripgrep
.
Now we have some logs, but we're not super sure where they're coming from.
28.169225 express:router errorCapture : /the/url/was/redacted/by/me +1ms
28.169784 finalhandler default 401 +0ms
28.170991 compression gzip compression +1s
28.174286 finalhandler default 500 +4ms
28.174323 finalhandler cannot 500 after headers sent +0ms
88.145870 knex:query ROLLBACK trx5 +60s
88.146385 knex:bindings undefined trx5 +60s
88.146700 knex:tx trx5: releasing connection +1m
88.146743 knex:client releasing connection to pool: __knexUid2 +1m
Let's search for cannot 500 after headers sent
.
No results, drats. Don't give up here though, search for smaller substrings:
sudo apt-get install ripgrep
# or:
# brew install ripgrep
$ cd node_modules
$ rg "after headers sent"
finalhandler/index.js
93: debug('cannot 404 after headers sent')
127: debug('cannot %d after headers sent', status)
Aha, there it is: cannot %d after headers sent
. We know where in the code things are happening! But how did we arrive here?
Remember, we're desperate here: we can only reproduce this issue inside an SSH session which expires every 2 hours (often less, as I found out... 😡). This is a throwaway environment, but we can reproduce things here. Don't try this in production.
We can see here the exact line that's giving us a clue: https://github.com/pillarjs/finalhandler/blob/5ceb3e3e2482404cb71e9810bd10a422fe748f20/index.js#L127
But we have no idea who's calling that or why. It would be nice if we could edit the code and print a stack trace here.
ls node_modules/finalhandler/
total 36
-rw-r--r-- 1 circleci circleci 4292 Nov 16 20:09 HISTORY.md
-rw-r--r-- 1 circleci circleci 6689 Nov 16 20:09 index.js
-rw-r--r-- 1 circleci circleci 1119 Nov 16 20:09 LICENSE
drwxr-xr-x 4 circleci circleci 29 Nov 16 20:03 node_modules
-rw-r--r-- 1 circleci circleci 1207 Nov 16 20:09 package.json
-rw-r--r-- 1 circleci circleci 4133 Nov 16 20:09 README.md
-rw-r--r-- 1 circleci circleci 1202 Nov 16 20:09 SECURITY.md
Here, we were lucky enough to find the index.js
file immediately. What is published to npm and what is on Github aren't always the same. Source might be .ts while npm will have .js; similar distinction for src/
vs dist/
. You might have to dig a bit to find the exact file.
Keep in mind if you run yarn
commands like yarn add
, yarn del
, or yarn install
, your node_modules changes will be clobbered. Keep it small, targetted, and keep notes of what you've changed (being forcibly logged out every 25 minutes really drives this point home).
Now we edit the file: vim node_modules/finalhandler/index.js
, and add a console.trace
.
console.trace will print a log message and a stack trace at your current position.
I like adding strings like XXXX
, YYYY
, ZZZZ
, to throwaway log output, because I can have iTerm2 highlight them by searching with CMD+F.
Now when we run it, we get a stack trace! Notably, the stack trace is printed before the 60 second delay.
Unfortunately, this trace is almost useless: it's showing Node internals and nothing about my app code or my library code. (I'm not an operating system developer, everything I know about trampolines I learned from The Simpsons.)
Before Node 14, you were kind of screwed if you wanted long stack traces. Now, it's fairly trivial with trace (enables long stack traces) and clarify (suppresses node internals, no more trampolines).
yarn add --dev trace clarify
# We have to pass flags to node directly, so we're not going to use `yarn test` anymore.
unbuffer node -r trace -r clarify --stack_trace_limit=10000 jest file_to_test.ts 2>&1 | ts -s "%.s" | tee output.txt
This is significantly better!
So far, we've done the following:
- Found a log message that we've verified only occurs only when tests break (this is important!)
- Found out where in node_modules the log was being emitted from
- Found out how the suspected log is being called, and where from
To solve the issue, we just repeated the same tactics, down the call stack.
This line gave us a clue:
31.311666 at Object.ResponseValidator.modded_express_mung_1.default.onError (/home/circleci/project/node_modules/express-openapi-validator/src/middlewares/openapi.response.validator.ts:48:14)
This library wasn't logging anything. We would NEVER have known to even look here.
We added some more logs here, and to some of the other places further down the callstack, and we were able to show exactly what was going on.
I won't go further into the solution of our problem (see the show gory details
at the top of the gist), but we just rinse and repeat with these tactics, and eventually figured out what was wrong.