Understanding Failing Tests in SQL Queries

Understanding the Problem

The task is to build a table that records which tables are failing data quality tests. The goal is to identify tests that have failed on the same table for several consecutive days.

Background

To approach this problem, we need to understand the query provided and break it down into its components.

  1. Query Overview
    • The query uses a Common Table Expression (CTE) named “a” to filter tables with failed tests.
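    • It groups the results by table name, expectation type, run time (truncated to the day), origin type, meta, and expectation columns, so each failing test appears at most once per day.
  2. The Issue
    • Currently, the query accumulates all rows where the test result is false, whether or not the failures fall on consecutive days.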

Solution Approach

To solve this problem, we need to modify the query to keep track of consecutive days for each table.

Step 1: Identify the Requirements

We want to:

  • Keep track of consecutive days for each table.
  • Only include tables that have failed tests for more than two consecutive days.

Solution Breakdown

To achieve this, we’ll use a combination of window functions and aggregation techniques.

Partitions by Full Table Name and Expectation Type

We first partition the data by full table name and expectation type. This will allow us to track consecutive days for each unique combination of these two columns.

### PARTITION BY FULL TABLE NAME AND EXPECTATION TYPE

WITH a AS (
    SELECT
        full_table_name,
        expectation_type,
        date_trunc('day', run_time) AS run_time,
        origin_type,
        meta,
        expectation_columns
    FROM MY_TABLE
    WHERE is_successful = false
        AND origin_type = 'DWH_TESTS'
    GROUP BY 1, 2, 3, 4, 5, 6
)
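To see what the CTE does with several runs on the same day, here is a minimal sketch on inline sample rows. It assumes a PostgreSQL-style dialect; the sample rows and the names dwh.orders and not_null are hypothetical stand-ins for MY_TABLE.

```sql
-- Hypothetical sample rows standing in for MY_TABLE (PostgreSQL-style syntax assumed).
WITH sample (full_table_name, expectation_type, run_time) AS (
    VALUES
        ('dwh.orders', 'not_null', TIMESTAMP '2024-12-01 02:00:00'),
        ('dwh.orders', 'not_null', TIMESTAMP '2024-12-01 14:00:00'),
        ('dwh.orders', 'not_null', TIMESTAMP '2024-12-02 02:00:00')
)
SELECT
    full_table_name,
    expectation_type,
    date_trunc('day', run_time) AS run_time
FROM sample
GROUP BY 1, 2, 3
ORDER BY 3;
-- Returns two rows (2024-12-01 and 2024-12-02): the two runs on
-- 2024-12-01 collapse into a single failing day.
```

Collapsing to one row per day matters for the later steps, because counting days only works if each failing day appears once.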

ROW_NUMBER() Function

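ROW_NUMBER() numbers the failing days of each table and expectation type in chronological order. On its own, this number only counts failing days; it does not tell us whether they are consecutive. Subtracting the row number (in days) from the run date does: within an unbroken run of days both values grow by one per row, so the difference stays constant, and it changes as soon as a day is skipped. We call this difference streak_id.

### Assign Row Number Using ROW_NUMBER()

SELECT
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    run_time,
    ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time) AS day_number,
    -- Consecutive days share one streak_id (PostgreSQL-style interval arithmetic assumed)
    run_time - (ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time)
        * INTERVAL '1 day') AS streak_id
FROM a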
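A row number by itself only counts failing days. One common way to find runs of consecutive days (often called "gaps and islands") is to subtract the row number from the date. The sketch below shows the idea on hypothetical sample rows; it assumes a PostgreSQL-style dialect and uses DATE values so that subtracting an integer yields a date. The column names day_number and streak_id are illustrative.

```sql
-- Hypothetical failing days for one test: three consecutive days, a gap, then two more.
WITH a (full_table_name, expectation_type, run_time) AS (
    VALUES
        ('dwh.orders', 'not_null', DATE '2024-12-01'),
        ('dwh.orders', 'not_null', DATE '2024-12-02'),
        ('dwh.orders', 'not_null', DATE '2024-12-03'),
        ('dwh.orders', 'not_null', DATE '2024-12-07'),
        ('dwh.orders', 'not_null', DATE '2024-12-08')
)
SELECT
    run_time,
    ROW_NUMBER() OVER w AS day_number,
    -- date minus integer is a date in PostgreSQL; the cast is needed
    -- because ROW_NUMBER() returns bigint
    run_time - (ROW_NUMBER() OVER w)::int AS streak_id
FROM a
WINDOW w AS (PARTITION BY full_table_name, expectation_type ORDER BY run_time)
ORDER BY run_time;
-- run_time    day_number  streak_id
-- 2024-12-01  1           2024-11-30
-- 2024-12-02  2           2024-11-30
-- 2024-12-03  3           2024-11-30
-- 2024-12-07  4           2024-12-03
-- 2024-12-08  5           2024-12-03
```

The first three days share one streak_id and the last two share another, so grouping by streak_id separates the two runs.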

Window Function

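The OVER() clause defines the window that ROW_NUMBER() works on: PARTITION BY full_table_name, expectation_type restarts the numbering for each table and expectation type, and ORDER BY run_time numbers the failing days from oldest to newest.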

HAVING Clause

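We group the rows by streak and use the HAVING clause to keep only streaks longer than two days. Grouping by run_time would put every day in its own group, so we group by streak_id (the run date minus the row number, in days) instead; COUNT(*) is then the number of consecutive failing days.

### Filter Consecutive Days Using HAVING Clause

GROUP BY full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    streak_id    -- run_time minus row number in days; constant within a streak
HAVING COUNT(*) > 2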
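As a rough illustration of how GROUP BY and HAVING can isolate runs longer than two days, the sketch below groups hypothetical sample rows by a date-minus-row-number key. It assumes a PostgreSQL-style dialect; streak_id, first_failed_day and last_failed_day are illustrative names, not columns of MY_TABLE.

```sql
-- Hypothetical failing days: a three-day run followed by a two-day run.
WITH a (full_table_name, expectation_type, run_time) AS (
    VALUES
        ('dwh.orders', 'not_null', DATE '2024-12-01'),
        ('dwh.orders', 'not_null', DATE '2024-12-02'),
        ('dwh.orders', 'not_null', DATE '2024-12-03'),
        ('dwh.orders', 'not_null', DATE '2024-12-07'),
        ('dwh.orders', 'not_null', DATE '2024-12-08')
),
numbered AS (
    SELECT
        a.*,
        -- Constant within a run of consecutive days
        run_time - (ROW_NUMBER() OVER (
            PARTITION BY full_table_name, expectation_type
            ORDER BY run_time))::int AS streak_id
    FROM a
)
SELECT
    full_table_name,
    expectation_type,
    MIN(run_time) AS first_failed_day,
    MAX(run_time) AS last_failed_day,
    COUNT(*)      AS consecutive_days
FROM numbered
GROUP BY full_table_name, expectation_type, streak_id
HAVING COUNT(*) > 2;
-- Returns one row: dwh.orders, not_null, 2024-12-01, 2024-12-03, 3
-- The two-day run (2024-12-07 to 2024-12-08) is filtered out.
```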

Step 2: Execute the Query

Now that we have the modified query, let’s execute it.

### Execute Modified Query

-- Assumes a PostgreSQL-style dialect (date_trunc, interval arithmetic) and that
-- meta and expectation_columns do not vary for a given table and expectation type.
WITH a AS (
    -- One row per failing test per day
    SELECT
        full_table_name,
        expectation_type,
        date_trunc('day', run_time) AS run_time,
        origin_type,
        meta,
        expectation_columns
    FROM MY_TABLE
    WHERE is_successful = false
        AND origin_type = 'DWH_TESTS'
    GROUP BY 1, 2, 3, 4, 5, 6
),
numbered AS (
    -- Consecutive days share the same streak_id: the day minus its row number
    SELECT
        a.*,
        run_time - (ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time)
            * INTERVAL '1 day') AS streak_id
    FROM a
)
SELECT
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    MAX(run_time) AS run_time,        -- last failing day of the streak
    COUNT(*) AS consecutive_days      -- length of the streak in days
FROM numbered
GROUP BY full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    streak_id
HAVING COUNT(*) > 2
ORDER BY run_time DESC;

Step 3: Interpret Results

After executing the query, you should see the failing tests that exceed the two-day threshold.

The final result has the following columns:

| full_table_name | expectation_type | meta | expectation_columns | origin_type | run_time | consecutive_days |

The consecutive_days column holds the count of failing days that the HAVING clause compares against the threshold of 2.

Step 4: Modify Query to Update Failing Tests

If you want to update the failing tests, you can use an UPDATE statement, either with the values filled in by hand as shown here or joined to the query that finds the streaks.

### Update Failing Tests Using UPDATE Statement

-- Find the tests that fail for more than two consecutive days with the
-- query from Step 2, then fill in the placeholder values below.
UPDATE MY_TABLE
SET is_successful = TRUE
WHERE full_table_name = '...'
    AND expectation_type = '...'
    AND meta = '...'
    AND origin_type = 'DWH_TESTS';

Note: This is a basic example and may need to be modified based on your specific use case.
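For the joined variant, one possible shape is PostgreSQL's UPDATE ... FROM, which joins the target table to the streak query. This is only a sketch under stated assumptions: it uses a hypothetical scratch table my_table_demo with a reduced set of columns instead of MY_TABLE, and other dialects spell the join differently.

```sql
-- Hypothetical scratch table standing in for MY_TABLE (PostgreSQL syntax assumed).
CREATE TEMP TABLE my_table_demo (
    full_table_name  text,
    expectation_type text,
    origin_type      text,
    run_time         timestamp,
    is_successful    boolean
);

INSERT INTO my_table_demo
    (full_table_name, expectation_type, origin_type, run_time, is_successful)
VALUES
    ('dwh.orders', 'not_null', 'DWH_TESTS', '2024-12-01 02:00', false),
    ('dwh.orders', 'not_null', 'DWH_TESTS', '2024-12-02 02:00', false),
    ('dwh.orders', 'not_null', 'DWH_TESTS', '2024-12-03 02:00', false),
    ('dwh.users',  'unique',   'DWH_TESTS', '2024-12-01 02:00', false);

WITH a AS (
    -- One row per failing test per day
    SELECT full_table_name, expectation_type, date_trunc('day', run_time) AS run_time
    FROM my_table_demo
    WHERE is_successful = false
        AND origin_type = 'DWH_TESTS'
    GROUP BY 1, 2, 3
),
numbered AS (
    SELECT
        a.*,
        run_time - (ROW_NUMBER() OVER (
            PARTITION BY full_table_name, expectation_type
            ORDER BY run_time) * INTERVAL '1 day') AS streak_id
    FROM a
),
streaks AS (
    -- Tests with a run of more than two consecutive failing days
    SELECT full_table_name, expectation_type
    FROM numbered
    GROUP BY full_table_name, expectation_type, streak_id
    HAVING COUNT(*) > 2
)
UPDATE my_table_demo AS t
SET is_successful = TRUE
FROM streaks AS s
WHERE t.full_table_name = s.full_table_name
    AND t.expectation_type = s.expectation_type
    AND t.origin_type = 'DWH_TESTS'
    AND t.is_successful = false;

-- Check the effect: the three dwh.orders rows are updated, dwh.users is not.
SELECT full_table_name, COUNT(*) FILTER (WHERE is_successful) AS now_successful
FROM my_table_demo
GROUP BY 1
ORDER BY 1;
-- dwh.orders  3
-- dwh.users   0
```

Like the hand-filled UPDATE, this touches every failing row of a flagged test, not only the rows inside the streak; narrow the WHERE clause if that is not what you want.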

Step 5: Test Query

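Test the query by running it against several small data sets whose expected result you know in advance: for example, a test that fails on more than two consecutive days, one whose failures are interrupted by a gap, and one that fails on only two days.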
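A small test fixture might look like the sketch below. It assumes a PostgreSQL-style dialect, and the table names t_streak, t_gap and t_short, the expectation type not_null, and the column is_flagged are hypothetical. Instead of filtering with HAVING, it lists every streak with its length so that each scenario can be checked by eye.

```sql
-- Hypothetical fixture with three scenarios:
--   t_streak: four consecutive failing days (should be flagged)
--   t_gap:    three failing days with a gap (should not be flagged)
--   t_short:  two consecutive failing days (should not be flagged)
WITH a (full_table_name, expectation_type, run_time) AS (
    VALUES
        ('t_streak', 'not_null', DATE '2024-12-01'),
        ('t_streak', 'not_null', DATE '2024-12-02'),
        ('t_streak', 'not_null', DATE '2024-12-03'),
        ('t_streak', 'not_null', DATE '2024-12-04'),
        ('t_gap',    'not_null', DATE '2024-12-01'),
        ('t_gap',    'not_null', DATE '2024-12-02'),
        ('t_gap',    'not_null', DATE '2024-12-04'),
        ('t_short',  'not_null', DATE '2024-12-01'),
        ('t_short',  'not_null', DATE '2024-12-02')
),
numbered AS (
    SELECT
        a.*,
        run_time - (ROW_NUMBER() OVER (
            PARTITION BY full_table_name, expectation_type
            ORDER BY run_time))::int AS streak_id
    FROM a
)
SELECT
    full_table_name,
    MIN(run_time) AS first_failed_day,
    MAX(run_time) AS last_failed_day,
    COUNT(*)      AS consecutive_days,
    COUNT(*) > 2  AS is_flagged
FROM numbered
GROUP BY full_table_name, expectation_type, streak_id
ORDER BY full_table_name, first_failed_day;
-- full_table_name  first_failed_day  last_failed_day  consecutive_days  is_flagged
-- t_gap            2024-12-01        2024-12-02       2                 false
-- t_gap            2024-12-04        2024-12-04       1                 false
-- t_short          2024-12-01        2024-12-02       2                 false
-- t_streak         2024-12-01        2024-12-04       4                 true
```

The t_gap case is the useful one: it has three failing days in total, so a query that only counts failing days would flag it, while a query that checks for consecutive days should not.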


Last modified on 2024-12-22