A Heat Map Reporter for Minitest

I built a custom Minitest reporter called Minitest Heat to help address test failures more efficiently. I originally put together a custom reporter a couple of years ago on top of the Minitest Reporters gem, but since it was an exploratory concept hacked together in an afternoon, it wasn’t up to fulfilling the vision in my head.

However, it was still really helpful, and using it the last couple of years I genuinely missed having it on projects that didn’t use it. So it felt time to make a proper reporter. Fortunately, Minitest has a solid API for writing extensions. So I started dabbling, and after dog-fooding and adjusting for a little while now, it’s usable enough for others to try out.

It’s still a work-in-progress, but it’s good enough to use as a daily driver if you use Minitest and are alright not being able to adjust settings just yet. Given the diversity of how folks handle automated testing, it would be great to get wider feedback outside of my own usage.

The core idea revolves around generating a heat map that shows the files and line numbers that were most problematic for a given run of your test suite. It really goes beyond the heat map though. The entire reporter strives to present information based on context so that you can focus on only the most important issues and more efficiently identify and correct problems with less scrolling through test failures.

There are a few elements that work together to get there, and so far it’s primarily based on how I’ve found myself working with test suites. The progress reporter isn’t much different from the default, other than adding a few symbols that represent some of the nuance in issue types. 1

Screenshot of the progress reporter for Minitest Heat showing green dots and diamonds, E's, B's, F's, and S's
1 The reporter uses dots for successful tests, diamonds for slow tests, 'S' for skipped tests, 'E' when the source code raises an exception, 'B' (for 'Broken') when an exception occurs directly in the test, and 'F' when an assertion fails in a test.
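
As a rough illustration of that caption (not Minitest Heat's actual code), the mapping from issue type to progress marker could be sketched like this. The symbol names and the exact diamond character are assumptions. Only the letters and the idea of dots and diamonds come from the description above.

```ruby
# Hypothetical lookup table: issue type => progress marker.
# The diamond character is a guess at what "diamonds" looks like.
PROGRESS_MARKERS = {
  success: ".",
  slow:    "♦",
  painful: "♦",
  skipped: "S",
  error:   "E", # exception raised from the source code
  broken:  "B", # exception raised directly in the test
  failure: "F"  # failed assertion
}.freeze

# Builds the progress line for a list of issue types.
def progress_line(types)
  types.map { |type| PROGRESS_MARKERS.fetch(type) }.join
end

puts progress_line(%i[success success error broken failure skipped slow])
```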

Recognizing Nuance

Before we get too deep, it’s worth addressing that the reports show more than just Failures, Successes, and Skips. Minitest Heat organizes test issues into six different categories based on the following priority order:

  1. Errors represent instances when the source code raises an exception. An exception is different from an assertion failure (unless the assertion is about exceptions), so errors get a bolder shade of red.
  2. Broken Tests occur when an exception is raised directly from the test. (That is, the final line of the stack trace is a test file.) In that case, it’s important to fix it, but the fact the exception occurred in the test immediately helps narrow down where to look for the causes of the exception.
  3. Failures are simply the failed assertions that you’re likely already all too familiar with.
  4. Skips are the same old skipped tests you’re familiar with as well.
  5. Slows and Painfully Slows are slow tests based on (eventually) configurable values according to the speeds your project can tolerate. For now, the slow threshold is 1 second and the painfully slow threshold is 3 seconds.
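
To make the priority order concrete, here is a minimal sketch of how a result could be sorted into those categories. It is not the gem's implementation. The `Result` struct, its field names, and the `test/..._test.rb` naming convention are assumptions for illustration, as is treating the thresholds as inclusive. It also assumes "the final line of the stack trace" means the frame where the exception was raised, which is the first entry in a Ruby backtrace.

```ruby
# Stand-in for a test result; the field names are hypothetical.
Result = Struct.new(:exception_backtrace, :failed, :skipped, :time, keyword_init: true)

SLOW_THRESHOLD = 1.0           # seconds
PAINFULLY_SLOW_THRESHOLD = 3.0 # seconds

# Assumed convention: test files live under test/ and end in _test.rb.
def test_file?(path)
  path.match?(%r{(\A|/)test/.*_test\.rb\z})
end

def classify(result)
  if result.exception_backtrace
    # The frame where the exception was raised decides error vs. broken.
    raised_in = result.exception_backtrace.first[/\A(.+?):\d+/, 1].to_s
    return test_file?(raised_in) ? :broken : :error
  end
  return :failure if result.failed
  return :skipped if result.skipped
  return :painful if result.time >= PAINFULLY_SLOW_THRESHOLD
  return :slow    if result.time >= SLOW_THRESHOLD

  :success
end
```

Checking exceptions first mirrors the priority order in the list: a test that raised is never reported as merely slow.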

We’ll look closer at how Minitest Heat presents information about each of these categories later, but first, let’s take a look at the summary and heat map it shows at the end of the test suite run.

Heat Map & Summary

Instead of manually scrolling and scanning through a randomized litany of failures to hopefully recognize patterns in failures, what if the reporter could show you the most problematic places? While the test details are critical, it’s nice to have a more thorough overview about which files generated issues.

So while the test suite runs, it looks at the failure lines reported by Minitest as well as the lines in the stack trace if the test generated an exception. As it progresses, it keeps a list of the files and line numbers with issues as well as the type of issue.
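
One way to picture that bookkeeping is a nested hash keyed by file and then by line number. This is only a sketch of the idea. The `HeatMap` class and its method names are made up for illustration and may not resemble the gem's internals.

```ruby
# A minimal heat map: file path => { line number => [issue types] }.
class HeatMap
  def initialize
    @hits = Hash.new do |files, file|
      files[file] = Hash.new { |lines, line| lines[line] = [] }
    end
  end

  # Records one "hit" of the given issue type at file:line.
  def add(file, line, type)
    @hits[file][line] << type
  end

  # Returns [file, total hits, sorted line numbers] rows,
  # ordered by most hits first (ties broken by file name).
  def hottest
    @hits
      .map { |file, lines| [file, lines.values.sum(&:size), lines.keys.sort] }
      .sort_by { |file, count, _lines| [-count, file] }
  end
end

map = HeatMap.new
map.add("app/models/user.rb", 12, :error)
map.add("app/models/user.rb", 12, :failure)
map.add("app/models/user.rb", 4, :failure)
map.add("test/models/user_test.rb", 30, :broken)
map.hottest.each { |file, count, lines| puts "#{file} (#{count}) #{lines.join(' ')}" }
```

Keeping the issue type alongside each hit is what would let a reporter color the line numbers by category later.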

In the case of a test suite with failures, it presents a traditional summary, but it visually downplays the presence of any slow or skipped tests because it presumes that if the test suite isn’t passing, the last thing you’re interested in is addressing skipped or slow tests. 2

A screenshot of the summary of the test run showing counts for each category of issue, timing, and then a list of files and line numbers where the most problematic issues occurred.
2 In this context, you'll notice that while there are slow and skipped tests in the test suite, they're visually muted a bit because the failing tests are the more important element to focus on. At the very bottom, you can see the heat map sorted by files with the most "hits" and the sorted line numbers where those hits occurred. Furthermore, the line numbers are colored to match their corresponding category.

Once you’ve fixed the tests with exceptions and failures, Minitest Heat will next focus on the skipped tests and visually mute the information about slow tests. 3 I’ve found it really handy for getting back on track when I’ve skipped tests. I can more quickly remember where I put them and which ones I should focus on first. I also end up leveraging the ability to mark tests as skipped because they’re less intrusive.

A screenshot of the summary of the test run showing counts only for skips and slows since there are no failures. The performance information and heat map are displayed as well.
3 Once there aren't any failures or exceptions, the summary slims down to focus on emphasizing any skipped tests while visually downplaying information about slow tests.

And, as you might have guessed, once the skipped tests have been addressed, the summary shifts its focus to the slow tests and differentiates between tests that are really slow and those that are kind of slow. 4 Slow tests are still passing tests, so they stay green, but the painfully slow tests are a little bolder.

A screenshot of the summary of the test run showing counts only for slows since there are no failures or skips. The performance information and heat map are displayed as well.
4 With slow tests, the summary only emphasizes the number of slow tests when there are no failures or skipped tests. Even without test failures, the heat map comes in really handy by making it crystal clear which tests are slowing you down.

And of course the end goal is a passing test suite without any kinds of issues, and it’s intentionally minimal by design since there aren’t any issues to surface or a need for a heat map. 5

A screenshot of the test suite summary with everything working perfectly. It only shows the total amount of time for the test suite and the tests and assertions rates.
5 When everything goes well, there's really not much to display. It shows the total time it took to run the suite, the number of tests, and the average performance of the tests and assertions.
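
The way the summary shifts its focus can be summarized as a small decision function. This is a sketch under assumptions, not the reporter's real logic: the method name, the category symbols, and the shape of the return value are all hypothetical.

```ruby
# Given counts per category, decide what the summary should emphasize
# and what it should visually mute. Missing categories count as zero.
def summary_focus(counts)
  problems = counts.values_at(:error, :broken, :failure).compact.sum

  if problems.positive?
    # Failing suite: skips and slows are the last thing to worry about.
    { emphasize: %i[error broken failure], mute: %i[skipped slow painful] }
  elsif counts.fetch(:skipped, 0).positive?
    { emphasize: %i[skipped], mute: %i[slow painful] }
  elsif (counts.fetch(:slow, 0) + counts.fetch(:painful, 0)).positive?
    { emphasize: %i[painful slow], mute: [] }
  else
    # Everything passed quickly: nothing to surface.
    { emphasize: [], mute: [] }
  end
end

p summary_focus({ failure: 2, skipped: 1, slow: 4 })
p summary_focus({ skipped: 1, slow: 4 })
```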

Showing Issue Details

While the summary can help you narrow down where to start with a test suite that has more than a couple of failures, it wouldn’t be much help without issue details that help you determine what went wrong.

Before we dive into examples, there are a few things Minitest Heat does with the issue details that are worth talking about.

Priorities

First, it orders issues so that the most-likely-to-be-most-important failures will be at the bottom of the output in order to reduce the need for scrolling. So if any exceptions were raised, those will always be displayed right above the test suite summary. That way, you can always be fairly confident that the final issue detail you see at the bottom of a test run is a good one to start with. Of course, based on the heat map results, you may want to start somewhere else. Either way, you’ll hopefully scroll a lot less choosing where to start with multiple failures.
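
In code, "most important at the bottom" amounts to sorting by priority in reverse. The sketch below assumes issues are simple hashes with a `:type` key and that painfully slow tests outrank merely slow ones. Both are assumptions for illustration.

```ruby
# Highest priority first; the order of the last two entries is an assumption.
PRIORITY = %i[error broken failure skipped painful slow].freeze

# Orders issues so the highest-priority ones are printed last,
# right above the summary. Ties keep their original order.
def display_order(issues)
  issues
    .each_with_index
    .sort_by { |issue, position| [-PRIORITY.index(issue[:type]), position] }
    .map(&:first)
end

issues = [
  { type: :failure, name: "a" },
  { type: :error,   name: "b" },
  { type: :slow,    name: "c" }
]
puts display_order(issues).map { |issue| issue[:name] }.join(" ")
```

The index is included in the sort key because Ruby's `sort_by` is not guaranteed to be stable.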

Stack Traces

Any issues around exceptions do their best to show fairly rich stack traces. These are currently the most volatile area where I’ve been toying with different approaches and continue to adjust how they’re displayed. Currently, they filter the stack trace to files within the project (and thus under your direct control), but there are a handful of places where that’s been too limiting.
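
Filtering a backtrace down to project files can be done by comparing each frame's path against the project root. This is a minimal sketch of that idea. The method name is hypothetical, and it assumes relative paths in a backtrace are relative to the project root.

```ruby
# Keeps only the backtrace lines whose file lives under project_root.
# Note: gems vendored inside the project directory would also be kept.
def project_frames(backtrace, project_root)
  root = File.join(File.expand_path(project_root), "")

  backtrace.select do |line|
    path = line[/\A(.+?):\d+/, 1].to_s
    File.expand_path(path, project_root).start_with?(root)
  end
end

backtrace = [
  "/home/me/app/app/models/user.rb:10:in `save'",
  "/usr/lib/ruby/gems/3.2.0/gems/somegem/lib/somegem.rb:5:in `call'",
  "test/models/user_test.rb:7:in `test_save'"
]
puts project_frames(backtrace, "/home/me/app")
```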

The stack traces also show the corresponding line of code for each line of the stack trace. 6 That frequently helps loosely trace what’s happening and make a quick decision about which location is most likely to be the source of the problem. At the moment, it also highlights filenames and line numbers to more easily identify them in the wall of text that is a stack trace, but the jury’s still out on whether this helps or creates too much visual noise. I’m still experimenting with a few ideas here.

A screenshot of a stack trace from Minitest Heat with the description and then a selected set of lines with the relevant file names and line numbers highlighted and the source code from the location displayed next to it.
6 At the moment, Minitest Heat strives to condense the stack trace while making it easier to identify which file is most likely to be the key to understanding the exception. It also notes which file from the stack trace was modified most recently because that can occasionally be helpful for determining which line from the stack trace is most relevant.

Minitest Heat also looks at the displayed lines of the stack trace for an issue and notes the “Most Recently Modified” file. While that doesn’t always mean that file is the source of the problem, I’ve found it adds helpful context that frequently comes in handy helping me focus on the right file and line.
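
Both ideas, showing the source line for a frame and flagging the most recently modified file, need nothing beyond Ruby's `File` API. The sketch below is one possible approach rather than the gem's code. `Frame`, `parse_frame`, `source_line`, and `most_recently_modified` are hypothetical names.

```ruby
Frame = Struct.new(:path, :line_number)

# Parses "path/to/file.rb:12:in `method'" into a Frame (nil if it doesn't match).
def parse_frame(backtrace_line)
  match = backtrace_line.match(/\A(?<path>.+?):(?<line>\d+)/)
  match && Frame.new(match[:path], match[:line].to_i)
end

# Returns the stripped line of source a frame points at, or nil if unavailable.
def source_line(frame)
  File.foreach(frame.path).with_index(1) do |text, number|
    return text.strip if number == frame.line_number
  end
  nil
rescue Errno::ENOENT
  nil
end

# Returns the path among the frames with the newest modification time.
def most_recently_modified(frames)
  frames.map(&:path).uniq
        .select { |path| File.exist?(path) }
        .max_by { |path| File.mtime(path) }
end
```

Modification time is only a hint, as noted above, so a reporter would likely present it as context rather than as a verdict.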

Custom Treatments for Issues

Minitest Heat recognizes that different types of issues need different types of information. While it strives to maintain a level of consistency in the structure across issue types, it takes a slightly different approach for each type of issue. So the issue type influences what information is displayed, and like with the summaries, it suppresses information about lower priority issues. For example, if you have test failures, you won’t see any details about skipped or slow tests.

We’ll start with “Errors” (or exceptions). When an exception is raised, we have a stack trace, and Minitest Heat tries to put that stack trace information to good use in a handful of ways. While the final line of a stack trace is usually the most helpful, the steps leading up to it often play a significant role in explaining the problem. 7

With all tests, the first line explains the type of issue and shows the test description. The next line identifies the test with the line number where it’s defined as well as the last line number in the test that started the chain of events leading to the exception. That’s followed by a quick summary of the final line of the stack trace. (It’s a bit redundant with the stack trace below it, so that’s another area that’s still in flux.)

A screenshot of the details for a test where the source code raised an exception.
7 With exceptions raised from the source code, they're flagged as an "Error" (mainly because it's shorter than "Exception") with a slightly more bold red. It shows the related details about the test that prompted the exception and the source of the exception with a consolidated stack trace.

The next example is a “Broken Test” where an exception occurred from directly within a test. 8 In these contexts, it’s helpful to know right away that the problem isn’t in the source code but within the test. If you’re curious, it makes this determination based on whether the final line of the stack trace is in a test file.

In these cases, the information is mostly the same, but that “Most Recently Modified File” note often helps make it even more clear that the problem is truly something that’s broken within the test definition.

A screenshot of the details for a test where the test code raised an exception.
8 When an exception is raised directly from test code, it's labeled as a "Broken Test" to make it clear that the test is the problem. While source code exceptions can stem from details in the test, it's nice to short-circuit the investigation process by knowing the exception came directly out of the test code.

Then we have your standard failures. While these are often fairly consistent, the failure summary can vary significantly depending on the type of assertion. 9 In the case of failures, it still identifies the source of the failure, but then it follows up with the summary.

One nice thing it shares with other issue types is that it exposes the line of source code that triggered the failure. I’ve found that when I have several failures, that often helps me know which one to address first. I can either start with the quick wins with obvious solutions in order to reduce noise in the test suite, or I can skip straight to the less clear failures that may require more time but are more central to everything else running smoothly.

A screenshot of the details for a test where the assertion failed under normal circumstances. It shows failing examples of `assert_raises`, `assert` with a custom message, and `assert_equal`.
9 Test failures get a less-loud red 'Failure' label and replace the stack trace with the details of the failed assertion.

Then there are skipped tests. 10 These follow the basic structure of the previous issue types, but there’s not much other context to display other than the reason for the skip, which is implicitly displayed by showing the offending line of source code from the test.

A screenshot of a skipped test result with source code for the skip shown at the bottom.
10 Skips are pretty simple and labeled with a yellow 'Skipped' and include the source code where the skip was defined.

And last but not least are slow tests. Every project and team has different levels of tolerance for slow tests. However, I’ve found that I personally have two separate thresholds for every project. There’s “yeah, it’s slow, but doesn’t need to be addressed immediately” and then there’s “go ahead and address it because if it’s that slow, something must be really off.” So while the thresholds aren’t yet configurable, there are two separate tiers for slow tests. 11

A screenshot showing the details of a slow test and a 'painfully' slow test with the time each test took displayed out to the side.
11 In the case of slow tests, all that really matters is how slow it was and where it's defined. So details of slow tests are intentionally simple with the only difference being that the painfully slow tests are labeled with a slightly more bold green.

What’s next?

While it’s generally reliable and usable, it’s very much an alpha at this point. I’ve been using it and making adjustments for a while now, but minor issues still surface from time to time. In those cases, I disable it and fall back to the standard reporters without much trouble.

My primary next step is adding a little more resiliency along with configuration options. At the moment, it’s not configurable, but it’s designed such that it’s clear which values will most likely need to be configurable. Once I’ve gathered wider feedback, I hope to add configuration options to help folks tune it to their needs. For instance, it will invariably need an option to adjust the color scheme for folks with color-blindness, and the thresholds for slow tests will need to be adjustable for different projects as well.

I also expect to provide features like fast failures so projects with longer-running test suites can start working on a failing test while the rest of the test suite finishes running. Prioritizing these things will depend on getting some perspective from folks other than myself.

It’s worth noting that the current version completely overrides and replaces any other reporters you may have configured. That’s only temporary, though. I fully expect it to play nicely with any other reporters in the long run. While it’s been under rapid development, it’s been much more convenient to be confident that any issues are isolated to Minitest Heat and not a side effect of other custom reporters.
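
For anyone unfamiliar with how a plugin can take over reporting, here is a minimal sketch using Minitest's extension hooks. It is not Minitest Heat's source. `CountingReporter` and the plugin name `counting` are hypothetical, and clearing the composite reporter's list is just one common way to replace the defaults.

```ruby
require "minitest"

module Minitest
  # Hypothetical reporter that only tallies result codes (".", "F", "E", "S").
  class CountingReporter < AbstractReporter
    attr_reader :counts

    def initialize
      super
      @counts = Hash.new(0)
    end

    # Called by Minitest once for every finished test.
    def record(result)
      @counts[result.result_code] += 1
    end

    # Called by Minitest once after the whole suite has run.
    def report
      puts @counts.map { |code, count| "#{code}: #{count}" }.join(", ")
    end
  end

  # Minitest calls plugin_<name>_init for each plugin it loads.
  def self.plugin_counting_init(_options)
    reporter.reporters.clear          # drop the default reporters
    reporter << CountingReporter.new  # and install only this one
  end
end
```

For Minitest to discover a plugin like this automatically, the file would need to be named `minitest/counting_plugin.rb` and be on the load path. Leaving out the `clear` call would let the custom reporter run alongside the defaults instead of replacing them.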

Once everything is sorted out with the existing core functionality, I’d like to explore factoring in test coverage with SimpleCov. It would likely be treated more as a source of additional context than a discrete number or percentage. For instance, with test failures, Minitest Heat might also add some minimal data if an offending file has poor test coverage. Or, if everything else is running smoothly, it might show the top three files that could stand to have some added test coverage.

If you’re a Minitest user and interested in trying something new, it’s ready enough. It would be great to see how it does in other projects and get some ideas for how to further improve it. It’s on RubyGems.org and ready to tinker with now.

Or, if you’ve got some ideas for a totally different direction, it could help you get going with a custom Minitest reporter of your own. Although, I will add that if you’re interested in going that route, give it another couple of months for me to finish tidying up, improving the tests, and documenting the source code. You’ll likely have a much easier time with it. I plan on writing up the various modules that could be useful to help someone else build their own more advanced custom reporter.