Flaky tests

What’s a flaky test?

It’s a test that sometimes fails, but if you retry it enough times, it passes, eventually.

What are the potential causes for a test to be flaky?

State leak

Label: flaky-test::state leak

Description: Data state has leaked from a previous test. The actual cause is probably not the flaky test here.

Difficulty to reproduce: Moderate. Usually, running the same spec files up to and including the failing one reproduces the problem.

Resolution: Fix the previous tests and/or places where the test data or environment is modified, so that it’s reset to a pristine state after each test.

Examples:

  • Example 1: State leakage can result from data records created with let_it_be and shared between test examples, when some test modifies the model, deliberately or unintentionally, leaving out-of-sync data in later test examples. This can result in PG::QueryCanceled: ERROR in the subsequent test examples or retries. For more information about state leakages and resolution options, see GitLab testing best practices.
  • Example 2: A migration test might roll back the database, perform its testing, and then migrate the database back up into an inconsistent state, so that the following tests might not know about certain columns.
  • Example 3: A test modifies data that is used by a following test.
  • Example 4: A test for a database query passes in a fresh database, but fails in a CI/CD pipeline where the database has already been used by previous tests. This likely means that the query itself needs to be updated to work in a non-clean database.
  • Example 5: Unrelated database connections in asynchronous requests were checked back in, causing the tests to accidentally use these unrelated database connections. The failure was resolved in this merge request.
  • Example 6: The maximum time to live for a database connection causes these connections to be disconnected, which in turn causes tests that rely on the transactions on these connections to fail. The issue was fixed in this merge request.
  • Example 7: A TCP socket used in a test was not closed before the next test, which also used the same port with another TCP socket.
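
The leak-and-reset pattern described above can be sketched in plain Ruby. This is an illustration only: SHARED_SETTINGS and the three helper methods are hypothetical stand-ins for shared test data and RSpec examples, not GitLab code.

```ruby
# Hypothetical stand-in for state that outlives a single test,
# such as a record shared between examples.
SHARED_SETTINGS = { locale: 'en' }

# Mutates the shared state and never restores it.
def example_that_leaks
  SHARED_SETTINGS[:locale] = 'fr'
  SHARED_SETTINGS[:locale] == 'fr'
end

# Only passes when the shared state is still pristine.
def example_that_assumes_pristine_state
  SHARED_SETTINGS[:locale] == 'en'
end

# Runs a block and restores the shared state afterwards,
# similar in spirit to cleaning up in an `after` hook.
def with_pristine_settings
  snapshot = SHARED_SETTINGS.dup
  yield
ensure
  SHARED_SETTINGS.replace(snapshot)
end

# The second example passes when run first, and fails after the leaking one.
passes_alone = example_that_assumes_pristine_state
example_that_leaks
passes_after_leak = example_that_assumes_pristine_state

# With the reset in place, the order no longer matters.
SHARED_SETTINGS.replace({ locale: 'en' })
with_pristine_settings { example_that_leaks }
passes_after_reset = example_that_assumes_pristine_state

puts passes_alone, passes_after_leak, passes_after_reset
```

The failing example is the victim here, not the culprit: the fix belongs in the example that modifies the shared state.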

Dataset-specific

Label: flaky-test::dataset-specific

Description: The test assumes the dataset is in a particular (usually limited) state or order, which might not be true depending on when the test runs during the test suite.

Difficulty to reproduce: Moderate, as the amount of data needed to reproduce the issue might be difficult to achieve locally. Ordering issues are easier to reproduce by running the tests repeatedly.

Resolution:

  • Fix the test to not assume that the dataset is in a particular state. Don’t hardcode IDs.
  • Loosen the assertion if the test shouldn’t care about ordering but only about the elements.
  • Fix the test by specifying a deterministic ordering.
  • Fix the app code by specifying a deterministic ordering.

Examples:

  • Example 1: The database is recreated when any table has more than 500 columns. It could pass in the merge request, but fail later in master if the order of tests changes.
  • Example 2: A test asserts that trying to find a record with a nonexistent ID returns an error message. The test uses a hardcoded ID that’s supposed to not exist (for example, 42). If the test is run early in the test suite, it might pass as not enough records were created before it, but as soon as it would run later in the suite, there could be a record that actually has the ID 42, hence the test would start to fail.
  • Example 3: Without an explicit ORDER BY, the database does not guarantee a deterministic ordering, or a data race can happen in the tests.
  • Example 4.
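
As a minimal sketch of two of the resolutions above (no hardcoded IDs, order-insensitive assertions), the following plain Ruby uses an in-memory array in place of a database table. Record, find_record, and nonexistent_id are hypothetical names introduced for illustration.

```ruby
Record = Struct.new(:id, :name)

# Hypothetical lookup against an in-memory "table".
def find_record(table, id)
  table.find { |record| record.id == id }
end

# Derives an ID that cannot exist, instead of hardcoding one such as 42.
def nonexistent_id(table)
  (table.map(&:id).max || 0) + 1
end

early_suite = [Record.new(1, 'a')]                        # few rows exist so far
late_suite  = (1..50).map { |i| Record.new(i, "r#{i}") }  # ID 42 now exists

hardcoded_ok_early = find_record(early_suite, 42).nil?    # passes early in the suite
hardcoded_ok_late  = find_record(late_suite, 42).nil?     # fails once ID 42 exists
derived_ok_late    = find_record(late_suite, nonexistent_id(late_suite)).nil?

# When only the elements matter, compare them without relying on their order.
unordered_names = %w[b c a] # an order a query without ORDER BY might return
order_insensitive_match = unordered_names.sort == %w[a b c]

puts hardcoded_ok_early, hardcoded_ok_late, derived_ok_late, order_insensitive_match
```

In an RSpec suite, an order-insensitive matcher such as match_array serves the same purpose as the sorted comparison.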

Random input

Label: flaky-test::random input

Description: The test uses random values that sometimes match the expectations, and sometimes do not.

Difficulty to reproduce: Easy, as the test can be modified locally to use the “random value” used at the time the test failed.

Resolution: Once the problem is reproduced, it should be easy to debug and fix either the test or the app.

Examples:

  • Example 1: The test isn’t robust enough to handle a specific input that only appears sporadically because the input data is random.
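
One way to make such failures replayable is to draw the random input from an explicitly seeded generator. The sketch below assumes hypothetical normalize and random_input methods. In practice the seed would come from the failing job's log, whereas here a failing seed is searched for so that the example is self-contained.

```ruby
# Hypothetical code under test: breaks on one specific input.
def normalize(value)
  raise ArgumentError, 'zero is not supported' if value.zero?

  100 / value
end

# Draws the "random" input from an explicitly seeded generator,
# so that a failing run can be replayed with the same seed.
def random_input(seed)
  Random.new(seed).rand(0..9)
end

# Stand-in for reading the seed from a CI job log:
# look for a seed that produces the problematic input.
failing_seed = (1..1000).find { |seed| random_input(seed).zero? }

# Replaying with the same seed always produces the same input,
# so the failure becomes deterministic and easy to debug.
replayed_input = random_input(failing_seed)
failure_reproduced =
  begin
    normalize(replayed_input)
    false
  rescue ArgumentError
    true
  end

puts failure_reproduced
```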

Unreliable DOM Selector

Label: flaky-test::unreliable dom selector

Description: The DOM selector used in the test is unreliable.

Difficulty to reproduce: Moderate to difficult, depending on whether the DOM selector is duplicated, appears after a delay, and so on. Adding a delay in the API or controller could help reproduce the issue.

Resolution: It depends on the problem. It could be to wait for requests to finish, to scroll down the page, and so on.

Examples:

  • Example 1: A non-unique CSS selector matching more than one element, or a non-waiting selector method that does not allow rendering time before throwing an element not found error.
  • Example 2: An element matching a CSS selector only appears after a GraphQL request has finished and the UI has updated.
  • Example 3: A false-positive test, where Capybara immediately returns true after the page visit although the page is not fully loaded, or where the element is not detectable by the webdriver (such as being rendered outside the viewport or behind other elements).
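
The difference between a non-waiting lookup and a waiting one can be sketched in plain Ruby. This is not how Capybara is implemented: FakePage, find_now, and wait_until are hypothetical names, and the delays are arbitrary values chosen for illustration.

```ruby
# Polls the block until it returns a truthy value or the timeout expires.
def wait_until(timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Hypothetical page whose element only "renders" after a delay,
# for example once a request has finished.
class FakePage
  def initialize(render_delay)
    @ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + render_delay
  end

  # Non-waiting lookup: returns nil if the element is not there yet.
  def find_now(selector)
    selector if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= @ready_at
  end
end

page = FakePage.new(0.3)
immediate = page.find_now('.merge-button')              # too early, nothing rendered
waited = wait_until { page.find_now('.merge-button') }  # polls until it appears

puts immediate.inspect, waited.inspect
```

A test built on the non-waiting lookup passes or fails depending on how fast the page renders, which is exactly the kind of timing dependence that makes a selector unreliable.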

Datetime-sensitive

Label: flaky-test::datetime-sensitive

Description: The test is assuming a specific date or time.

Difficulty to reproduce: Easy to moderate, depending on whether the test consistently fails after a certain date, or only fails at a given time or date.

Resolution: Freezing the time is usually a good solution.

Examples:

  • Example 1: A test that breaks after some time has passed.
  • Example 2: A test that breaks on the last day of the month.
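
The end-of-month case can be reproduced by pinning the date instead of relying on the day the suite happens to run. The sketch below uses only Ruby's standard library. renewal_date and same_day_next_month? are hypothetical names, and the date is injected as an argument as a simple stand-in for freezing the time.

```ruby
require 'date'

# Hypothetical code under test; `today` is injectable so tests can pin it.
def renewal_date(today = Date.today)
  today >> 1 # one calendar month later, clamped to the end of the month
end

# Fragile expectation: assumes the day of the month is always preserved.
def same_day_next_month?(today = Date.today)
  renewal_date(today).day == today.day
end

mid_month = same_day_next_month?(Date.new(2024, 1, 15)) # holds: 2024-02-15
month_end = same_day_next_month?(Date.new(2024, 1, 31)) # breaks: 2024-02-29

puts mid_month, month_end
```

In a Rails test suite, the ActiveSupport time helpers (for example travel_to or freeze_time) are a common way to pin the clock without changing method signatures.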

Unstable infrastructure

Label: flaky-test::unstable infrastructure

Description: The test fails from time to time due to infrastructure issues.

Difficulty to reproduce: Hard. It’s really hard to reproduce CI infrastructure issues. It might be possible by using containers locally.

Resolution: Starting a conversation with the Infrastructure department in a dedicated issue is usually a good idea.

Examples:

  • Example 1: The runner is under heavy load at this time.
  • Example 2: The runner is having networking issues, making a job fail early.

How to reproduce a flaky test locally?

  1. Reproduce the failure locally
    • Find the RSpec seed from the CI job log
    • OR run while :; do bin/rspec <spec> || break; done in a loop to find a seed
  2. Reduce the examples by bisecting the spec failure with bin/rspec --seed <previously found> --bisect <spec>
  3. Look at the remaining examples and watch for state leakage
    • e.g. Updating records created with let_it_be is a common source of problems
  4. Once fixed, rerun the specs with seed
  5. Run scripts/rspec_check_order_dependence to ensure the spec can be run in random order
  6. Run while :; do bin/rspec <spec> || break; done in a loop again (and grab lunch) to verify it’s no longer flaky
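
As a variation on the shell loop in the steps above, the following Ruby sketch passes an explicit --seed on every run, so the failing seed is known as soon as a run fails. find_failing_seed, its runner parameter, and the spec path are hypothetical. The default runner shells out to bin/rspec, while the demonstration uses a fake runner so that the snippet runs anywhere.

```ruby
# Runs the spec once per seed and returns the first seed whose run fails,
# or nil if every run passes. `runner` receives the full command as an array.
def find_failing_seed(spec, seeds:, runner: ->(args) { system(*args) })
  seeds.find do |seed|
    # `system` returns false (or nil) when the command fails.
    !runner.call(['bin/rspec', '--seed', seed.to_s, spec])
  end
end

# Demonstration with a fake runner that pretends seed 7 makes the spec fail.
fake_runner = ->(args) { args[2] != '7' }
seed = find_failing_seed('spec/models/example_spec.rb', seeds: 1..20, runner: fake_runner)

puts seed
```

The returned seed can then be fed to the bisect command from step 2.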

Quarantined tests

When we have a flaky test in master:

  1. Create a ~"failure::flaky-test" issue with the relevant group label.
  2. Quarantine the test after the first failure. If the test cannot be fixed in a timely fashion, there is an impact on the productivity of all the developers, so it should be quarantined.

RSpec

Fast quarantine

Unless you really need to have a test disabled very fast (< 10min), consider using the ~pipeline::expedited label instead.

To quickly quarantine a test without having to open a merge request and wait for pipelines, you can follow the fast quarantining process.

Please always proceed to open a long-term quarantine merge request after fast-quarantining a test! This is to ensure the fast-quarantined test was correctly fixed by running tests from the CI/CD pipelines (which are not run in the context of the fast-quarantine project).

Long-term quarantine

Once a test is fast-quarantined, you can proceed with the long-term quarantining process. This can be done by opening a merge request.

First, ensure the test file has a feature_category metadata, to ensure correct attribution of the test file.

Then, you can use the quarantine: '<issue url>' metadata with the URL of the ~"failure::flaky-test" issue you created previously.

it 'succeeds', quarantine: 'https://gitlab.com/gitlab-org/gitlab/-/issues/12345' do
  expect(response).to have_gitlab_http_status(:ok)
end

This means it is skipped in CI. By default, the quarantined tests will run locally.

We can skip them in local development as well by running with --tag ~quarantine:

bin/rspec --tag ~quarantine

After the long-term quarantining MR has reached production, you should revert the fast-quarantine MR you created earlier.

Find quarantined tests by feature category

To find all quarantined tests for a feature category, use ripgrep:

rg -l --multiline -w "(?s)feature_category:\s+:global_search.+quarantine:"

Jest

For Jest specs, you can use the .skip method along with the eslint-disable-next-line comment to disable the jest/no-disabled-tests ESLint rule and include the issue URL. Here’s an example:

// quarantine: https://gitlab.com/gitlab-org/gitlab/-/issues/56789
// eslint-disable-next-line jest/no-disabled-tests
it.skip('should throw an error', () => {
  expect(response).toThrowError(expected_error)
});

This means it is skipped unless the test suite is run with the --runInBand Jest command line option:

jest --runInBand

A list of files with quarantined specs in them can be found with the command:

yarn jest:quarantine

For both test frameworks, make sure to add the ~"quarantined test" label to the issue.

Once a test is in quarantine, there are 3 choices:

  • Fix the test (that is, get rid of its flakiness).
  • Move the test to a lower level of testing.
  • Remove the test entirely (for example, because there’s already a lower-level test, or it’s duplicating another same-level test, or it’s testing too much etc.).

Automatic retries and flaky tests detection

On our CI, we use RSpec::Retry to automatically retry a failing example a few times (see spec/spec_helper.rb for the precise retries count).

We also use a custom Gitlab::RspecFlaky::Listener. This listener runs in the update-tests-metadata job in maintenance scheduled pipelines on the master branch, and saves flaky examples to rspec/flaky/report-suite.json. The report file is then retrieved by the retrieve-tests-metadata job in all pipelines.

This was originally implemented in: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/13021.

If you want to enable retries locally, you can use the RETRIES environment variable. For instance RETRIES=1 bin/rspec ... would retry the failing examples once.

To generate the reports locally, use the FLAKY_RSPEC_GENERATE_REPORT environment variable. For example, FLAKY_RSPEC_GENERATE_REPORT=1 bin/rspec ....
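
The idea behind retry-based detection can be sketched as follows: an example that fails at first and then passes on a retry is classified as flaky. This is an illustration under stated assumptions, not the implementation of RSpec::Retry or Gitlab::RspecFlaky::Listener, and classify is a hypothetical name.

```ruby
# Runs the block up to `retries + 1` times and classifies the outcome.
# The block receives the attempt number and returns true when the example passes.
def classify(retries:)
  attempts = 0
  (retries + 1).times do
    attempts += 1
    next unless yield(attempts)

    # Passing on the first attempt is a normal pass; passing later is flaky.
    return attempts == 1 ? :passed : :flaky
  end
  :failed
end

outcomes = [false, false, true] # fails twice, then passes
flaky  = classify(retries: 2) { |attempt| outcomes[attempt - 1] }
stable = classify(retries: 2) { |_attempt| true }
broken = classify(retries: 2) { |_attempt| false }

puts flaky, stable, broken
```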

Usage of the rspec/flaky/report-suite.json report

The rspec/flaky/report-suite.json report is imported into Snowflake once per day, for monitoring with the internal dashboard.

Problems we had in the past at GitLab

Order-dependent flaky tests

These flaky tests can fail depending on the order they run with other tests.

To identify the tests that lead to such failure, we can use scripts/rspec_bisect_flaky, which would give us the minimal test combination to reproduce the failure:

  1. First obtain the list of specs that ran before the flaky test. You can search for the list under Knapsack node specs: in the CI job output log.
  2. Save the list of specs as a file, and run:

    cat knapsack_specs.txt | xargs scripts/rspec_bisect_flaky
    

If there is an order-dependency issue, the script above will print the minimal reproduction.
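
The bisection idea can be sketched in plain Ruby. This is not the code of scripts/rspec_bisect_flaky: bisect_culprit is a hypothetical helper, and it assumes that exactly one earlier spec causes the failure. The block stands in for "run these specs followed by the flaky test and report whether it failed".

```ruby
# Returns the single spec in `candidates` that triggers the failure,
# or nil if running all candidates does not reproduce it.
# Assumes exactly one culprit; the block returns true when the run fails.
def bisect_culprit(candidates, &fails)
  return nil unless fails.call(candidates)

  while candidates.size > 1
    half = (candidates.size / 2.0).ceil
    left, right = candidates.each_slice(half).to_a
    # With a single culprit, it is in whichever half still fails.
    candidates = fails.call(left) ? left : right
  end
  candidates.first
end

specs = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb e_spec.rb]
culprit = bisect_culprit(specs) { |subset| subset.include?('d_spec.rb') }

puts culprit
```

Each step halves the list, so the culprit among N preceding specs is found in roughly log2(N) runs instead of N.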

Time-sensitive flaky tests

Array order expectation

Feature tests

Capybara expectation times out

Hanging specs

If a spec hangs, it might be caused by a bug in Rails.

Suggestions

Split the test file

It could help to split large RSpec files into multiple files in order to narrow down the context and identify the problematic tests.

Recreate job failure in CI by forcing the job to run the same set of test files

Reproducing a job failure in CI always helps with troubleshooting why and how a test fails. This requires us to run the same test files with the same spec order. Since we use Knapsack to distribute tests across parallelized jobs, and files can be distributed differently between two pipelines, we can hardcode this job distribution through the following steps:

  1. Find a job that you want to reproduce, identify the commit that it ran against, and set your local gitlab-org/gitlab branch to the same commit to ensure we are running with the same copy of the project.
  2. In the job log, locate the list of spec files that were distributed by Knapsack - you can search for Running command: bundle exec rspec, the last argument of this command should contain a list of filenames. Copy this list.
  3. Go to tooling/lib/tooling/parallel_rspec_runner.rb where the test file distribution happens. Have a look at this merge request as an example, store the file list you copied from step 2 into a TEST_FILES constant and have RSpec run this list by updating the rspec_command method as done in the example MR.
  4. Skip the tests in spec/tooling/lib/tooling/parallel_rspec_runner_spec.rb so it doesn’t cause your pipeline to fail early.
  5. Since we want to force the pipeline to run against a specific version, we do not want to run a merged results pipeline. We can introduce a merge conflict into the MR to achieve this.
  6. To preserve spec ordering, update the spec/support/rspec_order.rb file by hard coding Kernel.srand with the value shown in the originally failing job, as done here. You can find the srand value in the job log by searching for Randomized with seed, which is followed by this value.
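
To see why hard coding the seed preserves ordering, the sketch below shows that seeding Ruby's default random number generator with the same value yields the same shuffle every time. shuffled_with_seed, the file names, and the seed 12345 are hypothetical, and this is not the content of spec/support/rspec_order.rb.

```ruby
# Seeds Ruby's default random number generator, then shuffles.
# With the same seed, the resulting order is always the same.
def shuffled_with_seed(files, seed)
  Kernel.srand(seed)
  files.shuffle
end

files = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb e_spec.rb]

first_run  = shuffled_with_seed(files, 12345)
second_run = shuffled_with_seed(files, 12345)

puts first_run == second_run
```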

Resources

