June 22, 2025

How Flakiness Scoring Works Under the Hood

Euriqa Team

6 min read

Technical

On this page

  • The Silent Killer of CI Reliability
  • How Teams Handle Flakiness Today
  • Euriqa's Approach: A Score from 0 to 1
  • How the Score Is Calculated
  • A Concrete Example
  • Trend Tracking
  • Auto-Quarantine
  • A Practical Example
  • Tips for Reducing Flakiness in Playwright
  • Track It, Fix It, Prove It

The Silent Killer of CI Reliability

A test that passes 90% of the time sounds reliable. It is not.

If that test runs on every push to main, it blocks your deploy pipeline once every ten runs. Now imagine you have thirty tests with a similar pass rate. The probability that all of them pass on a single run is 0.9^30, roughly 4%. Your main branch is almost never green, not because of real bugs, but because of noise.

This is the flaky test problem. A flaky test is a test that produces different outcomes across runs without any code changes. It passed yesterday. It fails today. It passes again if you re-run CI. Nothing changed in the codebase.

In Playwright specifically, flakiness tends to come from a predictable set of sources:

  • Timing and race conditions — The test expects an element to be visible, but an API response has not arrived yet. Sometimes it wins the race, sometimes it loses.
  • Shared state — One test writes to localStorage or a database, and another test reads from it. The outcome depends on execution order.
  • Network dependencies — A test hits a real API endpoint that occasionally times out or returns different data.
  • Animation waits — An element is technically in the DOM but still animating. Click handlers fire inconsistently.
  • Viewport-dependent selectors — A responsive layout shifts elements depending on viewport size. A selector that works on a 1920x1080 screen fails on the CI runner's default viewport.

Each of these is fixable. The problem is knowing which tests are affected, how badly, and whether they are getting worse.
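To make the shared-state mode concrete, here is a minimal Playwright sketch. The routes, test ids, and the shared demo record are made up for illustration: both tests touch the same row in a shared test database, so the second test's outcome depends on execution order.

import { test, expect } from '@playwright/test'

test('renames the demo project', async ({ page }) => {
  await page.goto('/projects/demo')
  // Mutates a record in the shared test database
  await page.getByTestId('project-name').fill('Renamed Project')
  await page.getByTestId('save').click()
})

test('shows the demo project under its default name', async ({ page }) => {
  await page.goto('/projects')
  // Passes or fails depending on whether the rename test ran first
  await expect(page.getByTestId('project-row').first()).toHaveText('Demo Project')
})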

How Teams Handle Flakiness Today

Most teams deal with flaky tests using a combination of manual processes that do not scale.

Spreadsheets. Someone creates a shared document listing known-flaky tests. It is outdated within a week. New flaky tests appear, old ones get fixed but stay on the list, and nobody trusts it.

Re-running CI. The test failed? Hit the re-run button. It passed this time? Ship it. This works until you notice your team is burning through CI credits on retries while the underlying instability goes unaddressed.

Ignoring failures. "Oh, that test is flaky — just ignore it." This is fine until someone ignores a real regression because they assumed it was flakiness. Now you have a bug in production.

Disabling tests. test.skip() everywhere. The flaky test is gone, and so is the coverage it provided. Nobody remembers to re-enable it.

None of these approaches give you signal. They are all reactive — you find out a test is flaky when it blocks you, and you deal with it in the moment. There is no system tracking whether your suite is getting more or less stable over time.

Euriqa's Approach: A Score from 0 to 1

Euriqa assigns every test in your suite a flakiness score between 0 and 1. This score is not a binary "flaky or not" label. It is a continuous measure of how inconsistent a test's outcomes are across recent runs.

Here is how to read it:

  • 0 — Perfectly stable. The test either passes consistently or fails consistently. Consistent failure is not flakiness — it is a broken test, which is a different problem.
  • 0.1 to 0.3 — Occasionally flaky. Worth monitoring. These tests mostly behave but have shown some inconsistency.
  • 0.3 to 0.7 — Moderately flaky. These tests should be investigated. They are producing enough noise to erode trust in your CI results.
  • 0.7 to 1.0 — Highly flaky. Needs immediate attention. These tests are essentially coin flips and actively harm your pipeline reliability.

A few things make this scoring practical rather than theoretical.

First, the Euriqa reporter tracks retry attempts separately. Playwright's built-in retry mechanism means a test might fail on the first attempt, then pass on the second. In most reporting tools, that test shows as "passed" and nobody knows it needed a retry. Euriqa marks it as "flaky" — the test needed help to pass, and that is a signal.
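That signal is visible directly in Playwright's reporter API. Here is a minimal custom reporter sketch, for illustration only and not Euriqa's actual reporter, that logs retry-dependent passes:

// flaky-aware-reporter.ts (illustration only)
import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter'

class FlakyAwareReporter implements Reporter {
  onTestEnd(test: TestCase, result: TestResult) {
    // Playwright reports outcome() as 'flaky' when a test failed on an
    // earlier attempt and then passed on a retry
    if (result.status === 'passed' && test.outcome() === 'flaky') {
      console.log(`FLAKY: ${test.title} (passed on retry ${result.retry})`)
    }
  }
}

export default FlakyAwareReporter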

Second, scores update with every new run. If you fix a flaky test, its score drops within days as new stable runs accumulate. If a previously stable test starts flaking, its score rises. You are always looking at current reality, not stale history.

How the Score Is Calculated

The flakiness score considers multiple factors from a test's recent execution history:

  • Number of recent runs — More data produces a more reliable score. A test that ran twice is not assessed the same way as one that ran fifty times.
  • Ratio of inconsistent outcomes — The core signal. How often does the test switch between passing and failing across consecutive runs? A test that fails consistently has a low flakiness score. A test that alternates between pass and fail has a high one.
  • Recency weighting — Recent runs count more than older ones. A test that was flaky two months ago but has been stable for the last three weeks should have a low score. The weighting ensures the score reflects the test's current behavior, not its entire lifetime.
  • Retry behavior — Tests that pass only after retries are inherently flaky, even if the final outcome is "pass." The number of retries required and how frequently retries are needed both factor into the score.
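Euriqa does not publish the exact formula, but a simplified sketch shows how factors like these can combine: a recency-weighted count of outcome flips, blended with a penalty for retry-dependent passes. The weights and half-life below are illustrative assumptions, not Euriqa's production values.

type RunOutcome = {
  passed: boolean     // final outcome after all retry attempts
  retriesUsed: number // 0 means the test settled on the first attempt
}

function flakinessScore(runs: RunOutcome[], halfLife = 10): number {
  if (runs.length < 2) return 0 // too little data to assess inconsistency

  let flips = 0       // weighted pass/fail alternations between consecutive runs
  let retrySignal = 0 // weighted passes that needed at least one retry
  let totalWeight = 0

  runs.forEach((run, i) => {
    // Exponential recency weighting: the newest run counts the most
    const age = runs.length - 1 - i
    const weight = Math.pow(0.5, age / halfLife)
    totalWeight += weight

    if (i > 0 && run.passed !== runs[i - 1].passed) flips += weight
    if (run.passed && run.retriesUsed > 0) retrySignal += weight
  })

  // Blend the two normalized signals and clamp to [0, 1]
  return Math.min(1, 0.6 * (flips / totalWeight) + 0.4 * (retrySignal / totalWeight))
}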

A Concrete Example

Consider a test called should load dashboard. Over the last week, it ran 20 times:

  • 11 runs: Passed on the first attempt. No issues.
  • 6 runs: Failed on the first attempt, then passed on retry.
  • 3 runs: Failed on all attempts.

This test has a high flakiness score. The 6 retry-dependent runs are strong flakiness signals — the test's outcome is non-deterministic. The 3 full failures could be real bugs or severe flakiness. The inconsistency pattern (pass, fail-then-retry, pass, full-fail, pass) across runs is exactly what the score is designed to capture.

If this test were simply broken (failing on all 20 runs), the score would be low — a consistently failing test is not flaky, it is broken. The score specifically measures inconsistency.
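Running histories like these through the sketch scorer above makes the contrast explicit:

// A mixed history scores high; a uniformly failing one scores 0
const flaky: RunOutcome[] = [
  { passed: true, retriesUsed: 0 },
  { passed: true, retriesUsed: 1 },  // needed a retry to pass
  { passed: false, retriesUsed: 2 }, // failed every attempt
  // ...17 more runs in a similar mix
]
const broken: RunOutcome[] = Array(20).fill({ passed: false, retriesUsed: 2 })

flakinessScore(flaky)  // high: outcomes flip and retries are common
flakinessScore(broken) // 0: consistent failure is broken, not flaky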

Trend Tracking

A single score tells you the current state. The trend tells you the story.

Euriqa tracks flakiness scores over time so you can see how each test's stability evolves. This surfaces patterns that a point-in-time score alone cannot reveal.

A test that held a score of 0.05 for three months and then jumped to 0.4 in a single week is a signal. Something changed. Maybe a dependency was updated. Maybe a new test introduced shared state that creates a race condition. Maybe an infrastructure change altered network timing in CI.

The trend chart makes these correlations visible. You can overlay score changes against your commit history and CI infrastructure events to pinpoint when instability was introduced.

Trends also show you whether your efforts are working. If your team spends a sprint fixing flaky tests, you should see a measurable drop in scores across the suite. If scores are climbing despite fixes, something systemic is wrong — maybe your CI environment is degrading, or new tests are being written without proper isolation.

Auto-Quarantine

Knowing a test is flaky is step one. Managing it without manual overhead is step two.

Euriqa supports auto-quarantine with a configurable threshold. The default is 0.3 — any test whose flakiness score exceeds this value is automatically quarantined.

Quarantine does not mean the test stops running. It still executes on every run, and Euriqa still tracks its results. Quarantine means the test is flagged in the dashboard as a managed flaky test. Your team can see at a glance which tests are quarantined, why (the score and trend are right there), and whether they are improving or getting worse.

This replaces the spreadsheet. Instead of a manually maintained list that drifts out of sync with reality, you have an automated system that quarantines and un-quarantines tests based on their actual behavior. When a quarantined test's score drops below the threshold after a fix, it exits quarantine automatically.

The threshold is configurable per project. Some teams set it at 0.2 for stricter management. Others set it higher if they are early in their stability journey and want to focus on the worst offenders first. Manual override is available for tests you want to quarantine or un-quarantine regardless of their score.
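In configuration terms this is a single setting. The package and option names below are hypothetical, so check the Euriqa docs for the real ones; the point is where the threshold lives:

// playwright.config.ts (reporter package and option names are hypothetical)
import { defineConfig } from '@playwright/test'

export default defineConfig({
  reporter: [
    ['@euriqa/playwright-reporter', { quarantineThreshold: 0.2 }],
  ],
})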

A Practical Example

Here is a real-world scenario that plays out regularly.

A test called should display user profile has a flakiness score of 0.45. It failed 9 out of its last 20 runs. Euriqa flags it in the dashboard and shows the trend chart: the score was 0.1 two weeks ago and has been climbing steadily.

The team investigates. They open the failing runs and notice a pattern — the test clicks a profile link and immediately asserts that user data is visible. Sometimes the API response arrives before the assertion, sometimes it does not.

The problematic code looks like this:

import { test, expect } from '@playwright/test'

test('should display user profile', async ({ page }) => {
  await page.goto('/dashboard')
  await page.click('[data-testid="profile-link"]')

  // This assertion sometimes fails — the API call hasn't resolved yet
  const userName = page.locator('[data-testid="user-name"]')
  await expect(userName).toHaveText('Jane Doe')
})

The click triggers a navigation and an API call. The assertion runs immediately, but the component has not received its data yet. The fix is to wait for the element to be in the expected state using Playwright's web-first assertions, which auto-retry until the condition is met or the timeout expires:

import { test, expect } from '@playwright/test'

test('should display user profile', async ({ page }) => {
  await page.goto('/dashboard')
  await page.click('[data-testid="profile-link"]')

  // Web-first assertion — retries automatically until the text appears
  const userName = page.locator('[data-testid="user-name"]')
  await expect(userName).toBeVisible()
  await expect(userName).toHaveText('Jane Doe')
})

The fix is deployed. Over the next week, the test passes on the first attempt every time. The flakiness score drops from 0.45 to 0.05. If auto-quarantine is active, the test exits quarantine automatically as its score falls below the threshold.

No spreadsheet to update. No Slack thread to close. The system tracked the problem, the team fixed it, and the system confirmed the fix worked.

Tips for Reducing Flakiness in Playwright

While Euriqa helps you find and track flaky tests, the best long-term strategy is writing tests that are less prone to flakiness in the first place. Here are practical patterns that work.

Use Web-First Assertions

Playwright's web-first assertions auto-retry until a condition is met. Prefer them over manual checks.

// Good — retries automatically
await expect(page.locator('.modal')).toBeVisible()

// Avoid — checks once and fails if the element isn't ready
const isVisible = await page.locator('.modal').isVisible()
expect(isVisible).toBe(true)

Avoid Hard Waits

page.waitForTimeout() is almost always the wrong solution. It either waits too long (slowing your suite) or not long enough (causing flakiness). Use event-based waits instead.

// Good — waits for a specific condition
await page.waitForResponse((resp) => resp.url().includes('/api/users') && resp.status() === 200)

// Avoid — arbitrary delay
await page.waitForTimeout(3000)
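One caveat: start the wait before the action that triggers the request, otherwise the response can arrive before the listener attaches and the wait will time out. The trigger selector below is a made-up example:

// Attach the listener first, trigger the request, then await the response
const responsePromise = page.waitForResponse(
  (resp) => resp.url().includes('/api/users') && resp.status() === 200
)
await page.click('[data-testid="load-users"]')
await responsePromise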

Handle Animations

Animations cause flakiness when elements are still moving during interactions. Note that Playwright's animations: 'disabled' option only applies to screenshot assertions, not to clicks and fills. For general stability, emulate reduced motion in your config so that animation-aware UIs settle instantly.

// playwright.config.ts
import { defineConfig } from '@playwright/test'

export default defineConfig({
  use: {
    // Emulates prefers-reduced-motion; UIs that honor it skip animations
    contextOptions: { reducedMotion: 'reduce' },
  },
})

Use Stable Selectors

Selectors that depend on CSS class names or DOM structure break when the UI changes. Use data-testid attributes for test-critical elements.

// Good — stable, explicit
await page.click('[data-testid="submit-button"]')

// Fragile — breaks when CSS or DOM changes
await page.click('.btn.btn-primary.submit-form > span')
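On Playwright 1.27 or newer, the built-in test id locator expresses the same thing more directly:

// Equivalent, using Playwright's built-in test id locator
await page.getByTestId('submit-button').click()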

Configure Retries

Set retries in your Playwright config so that flaky tests are detected early. When a test needs a retry to pass, Playwright marks it accordingly — and the Euriqa reporter captures that signal.

// playwright.config.ts
import { defineConfig } from '@playwright/test'

export default defineConfig({
  retries: 2,
})
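A common refinement is enabling retries only on CI, so local failures stay loud while the pipeline still produces the retry signal:

// Retry on CI, where flakiness bites hardest; fail fast locally
export default defineConfig({
  retries: process.env.CI ? 2 : 0,
})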

Isolate Test State

Avoid sharing state between tests. Each test should set up its own data and clean up after itself. Use test.describe.configure({ mode: 'serial' }) only when tests genuinely depend on each other — and prefer to refactor them so they do not.
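As a sketch, with a hypothetical seeding endpoint, per-test data creation looks like this: each test provisions a unique record, so nothing it does can collide with tests running in parallel.

import { test } from '@playwright/test'

test('should update the display name', async ({ page, request }) => {
  // Unique per-test data: no other test can see or mutate this user
  const email = `user-${Date.now()}@example.com`
  await request.post('/api/test-users', { data: { email } }) // hypothetical seed route

  await page.goto('/profile')
  // ...exercise the feature...

  await request.delete(`/api/test-users/${email}`) // clean up what we created
})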

Track It, Fix It, Prove It

Flakiness is not a mystery. It is a measurable property of your test suite, and it responds to focused effort. The key is having the data: which tests are flaky, how flaky, whether they are getting worse, and whether your fixes are working.

Track flakiness across your entire Playwright suite — sign up free at app.euriqa.dev.
