mongo/buildscripts/monitor_build_status
Jonas Bergler fd4932b623 SERVER-114327: update code lockdown groups for storage engines (#44324)
GitOrigin-RevId: 6ecd069a6f73edb298a084a445106c3e1aab8d0f
2025-11-25 01:39:56 +00:00
..
BUILD.bazel SERVER-100631 Add ruff into "bazel run lint" (#32166) 2025-06-11 14:11:11 +00:00
OWNERS.yml SERVER-109155: Add documentation for block-on-red code (#40327) 2025-08-22 15:57:33 +00:00
README.md SERVER-109155: Add documentation for block-on-red code (#40327) 2025-08-22 15:57:33 +00:00
__init__.py SERVER-90907 Gather BF data from Jira and analyze for lockdown criteria (#22737) 2024-05-31 06:20:19 +00:00
cli.py SERVER-114327: update code lockdown groups for storage engines (#44324) 2025-11-25 01:39:56 +00:00
code_lockdown_config.py SERVER-114327: update code lockdown groups for storage engines (#44324) 2025-11-25 01:39:56 +00:00
issue_report.py SERVER-109434 Expand Code Lockdown Notification to Staging Branches (#40190) 2025-08-21 16:34:51 +00:00
jira_service.py SERVER-101882: include AFs in #10gen-mongo-code-lockdown (#33224) 2025-04-10 00:25:11 +00:00

README.md

Block-on-Red

TL;DR: During times of high BF volume, code approvals and merging in 10gen/mongo master will be restricted to only allow changes that help reduce BFs, Bugs, Performance Regressions, and paying down technical debt.

Motivation

The master branch should remain stable to develop the Server efficiently, and to be within 30 days of releasing at all times. If it becomes too unstable, or "too red," we want to aggressively focus on getting it back into the green. As a side benefit to releasability, a "greener" build should make patch build failures more meaningful. This will also reduce release time stress by having the release time period look and feel more like normal business.

Strategy

Each team carries a quota (see below for details). When a team exceeds their quota - they enter a "code lockdown".

  • Team Level: The intention here is to stop work with a small blast radius in the first instance, and address the releasability risk from that team and their owned code.
  • VP Level: We roll the quotas up to a VPs entire organization as the next step of "code lockdown". The expectation is that redirecting resources within a VPs organization to help address BFs is likely more effective and less disruptive than a global freeze.
  • Global Level: Finally, if the global quota is exceeded, the entire server organization enters a "code lockdown" until we meet the threshold for unfreezing.

Impact of a "Code Lockdown"

Allowed Code Changes

During a "code lockdown," Code Owners are expected to only approve work that closes BFs or helps us reduce/avoid the next Blocking state. i.e. aimed at fixing a BF, a class of BFs, bugs, performance regression, etc.

If your PR does not meet this criteria, it may be pending for some time until the system becomes unblocked. There are of course reasonable exceptions, below.

Feature Work

All feature work stops during a "code lockdown." In exceptional circumstances VPs can approve exceptions.

Non-feature Work

We understand that in many cases addressing the larger BF problem requires refactoring, modularity improvements, changes to our test and paying down other kinds of technical debt. During a "code lockdown" this work is expressly permitted and mergeable - with the guidance that teams index heavily on risk when deciding what to work on. If a piece of work feels like it makes the BF problem worse before it gets better, talk to your director about how to proceed.

Allowable Examples (not exclusive):

  • Refactoring components to make them more unit testable
  • Increasing code coverage through high quality tests that block PRs
  • Making the development loop faster (decreasing build times, fixing slow tests, etc)
  • Improving guardrails that improve code quality (fixing clang-tidy warnings, compiler warnings, etc)

If a team is in a lockdown, but the rest of the org is not - their focus should likely skew towards work that expedites their lockdown exit.

If the org is in a lockdown, but a team doesnt have BFs to work on - they should balance helping other teams with the work theyve identified as addressing the underlying BF problem.

The higher the risk of the work, the more involvement the Staff+ engineers and the Director/VP should have in the decision about what is ok to merge and what isnt.

Code Owner Responsibilities

Code Owners should join the #10gen-mongo-code-lockdown Slack channel to receive daily updates on the status of the build. It produces daily metrics with instructions if there is a state change.

If we change to a blocking state, code owners should use their discretion to only approve changes that are allowed (see above). If we exit the blocking state, code owners should approve PRs as usual.

Quotas and State-Changes

Currently monitored thresholds:

Quota Team
(older than 48h)
VP
(older than 7d)
Global
(older than 7d)
Hot 6 16 60
Cold 10 32 100

Source-of-truth implementation: etc/code_lockdown.yml.

Dashboards

A dashboard is available at go/issue-quota.

This shows relevant JIRA queries for a more live and interactive view of the state.

FAQ

BFs remaining open only on older branches

Some teams may fix a BF in master, but are "waiting for fix" on older branches, which keeps the BF counted against the thresholds. Guidance here is currently evolving.

If the build failure is not frequently occurring, it can be marked as P5-Trivial, and it wont count towards your teams build failures for the block merge.

As we iterate on our processes for this, the exclude-from-master-quota label can be used to exclude BFs that should not be included in these quotas. The expectation is that this is an interim solution as we improve our processes especially around BFs that remain open pending backports.

Specifically:

  • If a BF is only waiting for a backport on a branch older than master, apply the exclude-from-master-quota label to the ticket.
  • If a BF is failing on master, not a serious bug (or a test-only issue that can't affect the real clients), not noisy, and we are choosing not to fix it, set the Priority to P5 - Trivial and apply the keep-trivial label.
  • If a BF is failing on an older branch and we are choosing not to backport a fix, set the Priority to P5 - Trivial and apply the keep-trivial-X.Y label appropriately.

Contributing

For any new proposals, changes to thresholds, or concerns regarding their application, please escalate to your Director/VP. We want advocacy from all levels to make this a successful change to our engineering culture.

CLI

Run the following to read about supported options:

python buildscripts/monitor_build_status/cli.py --help

Testing locally

For Jira API authentication, use the JIRA_AUTH_PAT env variable. More about Jira Personal Access Tokens (PATs) can be found here.

Use your PAT to run the following and output its results:

JIRA_AUTH_PAT=<auth-token> python buildscripts/monitor_build_status/cli.py

The above will not spam the slack channel, unless you explicitly use --notify.