OpenTelemetry (OTel) in resmoke

We use OTel to capture metrics from resmoke.

What Do We Capture

Using OTel, we capture the following:

  1. How long a resmoke suite (a collection of JS tests) takes to run
  2. How long each test in a suite (a single JS test) takes to run
  3. How long the hooks that run before and after each test or suite take
  4. The resmoke archiver (when there is a failure, we archive core dumps)

To see this visually, navigate to the resmoke dataset and view a recent trace.

A Look at the Source Code

Configuration

The bulk of the configuration is done in the _set_up_tracing(...) method in configure_resmoke.py#L164. This method includes documentation on how it works.

BatchedBaggageSpanProcessor

See the documentation in batched_baggage_span_processor.py#L8.

FileSpanExporter

See the documentation in file_span_exporter.py#L16.

Capturing Data

We mostly capture data by decorating methods. This example is taken from job.py#L200:

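from opentelemetry import trace

TRACER = trace.get_tracer("resmoke")

# The decorator opens a span named "func_name" around every call to the function.
@TRACER.start_as_current_span("func_name")
def func_name(*args, **kwargs):
    # Fetch the span the decorator started and attach an attribute to it.
    span = trace.get_current_span()
    span.set_attribute("attr1", True)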

This approach is convenient because the decorator records exceptions and other failures on the span and always closes it, so a user can never forget to close a span. Occasionally we also start a span using Python's with statement. The decorator is preferred, however, because the with form has a larger impact on the readability of the code. This example is taken from job.py#L215:

with TRACER.start_as_current_span("func_name", attributes={}):
    func_name(...)
    ...

Insights We Have Made (so far)

Using this dashboard and this query, we can see the most expensive individual JS tests. We plan to file tickets for teams to fix these long-running tests, which should save both cloud costs and developer time.