Working with dbt's artifacts: run_results.json
dbt is quickly becoming the standard tool for managing data-pipeline transformations (the T in ETL), and having worked with it for a year I'm getting used to some of its quirks. In particular, as I used the tool more and more, it became apparent that I needed to extract metrics that could be used to optimize our transformation pipeline and to pinpoint performance issues. Luckily for us, the artifacts produced by executing the most common dbt commands contain some very useful information for us to dig our data-analysis teeth into. However, I found that the dbt documentation on artifacts doesn't give more than a passing mention of what may be done with the data contained in them. As such, this post is dedicated to offering examples of the utility that may be extracted from the run_results.json dbt artifact.
For reference, all the work presented was done with the v2 schema of run_results.json, as produced by version 0.20.0 of dbt. If the schemas are updated in the future, I can't guarantee that the examples presented here will still work, but hopefully the information available in newer versions is similar enough. The schema for run_results.json is documented by dbt here.
To parse the JSON files, I'll be using jq, a tool that is available for most OS distributions.
Reviewing your latest run_results.json
As you have probably already seen in the dbt logs, we can get timing and status information from run_results.json when using any of the following dbt commands (a quick way to peek at the resulting file follows the list):
docs generate
run
seed
snapshot
test
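To get a feel for the file's overall shape before reaching for jq, here is a minimal Python sketch. It assumes the default target/ output directory and the v2 top-level fields (metadata, elapsed_time and results):
import json
from pathlib import Path

# dbt writes its artifacts to the target/ directory by default.
run_results = json.loads(Path("target/run_results.json").read_text())

# Top-level fields of the v2 schema: metadata about the invocation,
# the total elapsed time, and one entry per executed node in "results".
print("dbt version: ", run_results["metadata"]["dbt_version"])
print("generated at:", run_results["metadata"]["generated_at"])
print("elapsed time:", run_results["elapsed_time"], "seconds")
print("nodes run:   ", len(run_results["results"]))
Everything we'll look at from here on lives in the results array.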
Diagnosing run_results.json
We may be interested in doing a quick diagnosis of our latest dbt command to answer, for example:
- How many models were successful?
- How many models did not succeed?
We can do that with the following jq command:
cat target/run_results.json | jq -r '[.results[].status] | group_by(.) | map({"status": unique, "total": length})'
Which would leave us with an output like:
[
{
"status": [
"error"
],
"total": 8
},
{
"status": [
"skipped"
],
"total": 10
},
{
"status": [
"success"
],
"total": 30
}
]
To do some debugging, we can take a look at the actual error messages by running:
cat target/run_results.json | jq -r '.results | map(select(.status == "error")) | map({"model": .unique_id, "error_message": .message})'
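If you'd rather wire this diagnosis into a script than run jq by hand, the same two checks are only a few lines of Python. This is just a sketch that mirrors the jq commands above:
import json
from collections import Counter
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())["results"]

# Count results per status, mirroring the jq group_by above.
print(Counter(result["status"] for result in results))

# Print the message of every errored node to start debugging.
for result in results:
    if result["status"] == "error":
        print(result["unique_id"], "->", result["message"])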
Finding the most expensive models
Timing information is probably the first thing we look at when trying to diagnose performance issues. We can use the following command-line script to find the most expensive models:
cat target/run_results.json | jq -r '[.results[] | {model: .unique_id, execution_time: .execution_time}] | sort_by(.execution_time) | .[-1]'
This pulls the most expensive model right out of the results. The output will look something like:
{
"model": "model.dbt_project.fct_visits",
"execution_time": 465.81539845466614
}
It's important to note that execution_time is measured in seconds.
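If a single model isn't enough, the same idea extends to a ranked list. Here is a small Python sketch that sorts every node by execution_time and keeps the slowest five (the cutoff is arbitrary):
import json
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())["results"]

# Sort all nodes by execution_time (seconds) and keep the five slowest.
slowest = sorted(results, key=lambda r: r["execution_time"], reverse=True)[:5]
for result in slowest:
    print(f'{result["execution_time"]:10.2f}s  {result["unique_id"]}')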
Loading results into Amazon Redshift
This information may be loaded into our data warehouse for further processing or to connect with a visualization tool. Of course we could just insert the values after running our command-line script, but most data warehouses offer some utility to work with JSON files directly. For example, when working with Amazon Redshift, we can upload our run_results.json to an Amazon S3 bucket and create a table for the results.
Our table schema may look like the following:
CREATE TABLE dbt_run_results (
status VARCHAR,
unique_id VARCHAR,
compile_start_ts TIMESTAMP NULL,
compile_end_ts TIMESTAMP NULL,
execute_start_ts TIMESTAMP NULL,
execute_end_ts TIMESTAMP NULL,
exec_time FLOAT NULL,
message VARCHAR
);
Unfortunately, Redshift cannot handle nested JSON structures like the one in run_results.json, so we have to do some preprocessing1. Essentially, we need to flatten each entry in results into its own top-level JSON object, one per line, which is a layout Redshift's COPY command can load. Said preprocessing may be done in Python, like so:
import json


def run_results_to_json_lines(json_str: str) -> str:
    """Flattens dbt run results into JSON objects, one per line, for Redshift's COPY."""
    json_dict = json.loads(json_str)
    json_lines = []
    for result in json_dict["results"]:
        new_result = {
            "status": result["status"],
            "unique_id": result["unique_id"],
            "exec_time": result["execution_time"],
            # message key is not required, so it may be missing
            "message": result.get("message"),
        }
        # Each timing entry is named "compile" or "execute"; turn each one
        # into a pair of *_start_ts / *_end_ts columns.
        for timing in result["timing"]:
            key_prefix = timing["name"]
            new_result[f"{key_prefix}_start_ts"] = timing.get("started_at")
            new_result[f"{key_prefix}_end_ts"] = timing.get("completed_at")
        json_lines.append(json.dumps(new_result))
    return "\n".join(json_lines)
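As a sketch of how this could be wired together (the bucket name is a placeholder, and boto3 is just one way to do the upload), we can write the flattened file and push it to S3:
import boto3
from pathlib import Path

raw = Path("target/run_results.json").read_text()
Path("run_results_processed.json").write_text(run_results_to_json_lines(raw))

# Hypothetical bucket and key; replace with your own.
boto3.client("s3").upload_file(
    "run_results_processed.json", "mybucket", "run_results_processed.json"
)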
After we have uploaded our new run_results_processed.json file, we can load it with:
COPY dbt_run_results FROM 's3://mybucket/run_results_processed.json'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS JSON;
In conclusion and what comes next
I set out to write this more practical post as I did some exploring of run_results.json myself, and I hope that folks working with dbt and looking to do some quick run analysis can also find some use in it. Ideally, this post becomes the one you Google all the time when you need to remember a jq command to work with run_results.json!
Finally, I would like this post to be part of a “mini-series” of posts that work with dbt artifacts: run_results.json is just the first one I decided to take a look at. Coming up, I plan to dig into manifest.json, as there is some valuable information to extract from there as well.
1. A previous version of this post suggested the usage of a JSONPath file. Since then I've learned that Redshift does not support any of the more advanced JSONPath features, like wildcards, making iterating over a nested structure impossible.