Understanding the Problem
The task is to build a table that records which tables are failing quality tests. The goal is to identify, for each table, runs of consecutive days on which the same test failed.
Background
To approach this problem, we need to understand the query provided and break it down into its components.
- Query Overview
  - The query uses a Common Table Expression (CTE) named “a” to filter tables with failed tests.
  - It groups the results by table name, expectation type, metadata, columns, origin type, and run time.
- The Issue
  - Currently, the query accumulates all rows where the test result is false, whether or not the failures fall on consecutive days.
Solution Approach
To solve this problem, we need to modify the query to keep track of consecutive days for each table.
Step 1: Identify the Requirements
We want to:
- Keep track of consecutive days for each table.
- Only include tables that have failed tests for more than two days consecutively.
Solution Breakdown
To achieve this, we’ll use a combination of window functions and aggregation techniques.
Partitions by Full Table Name and Expectation Type
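We partition the data by full table name and expectation type so that consecutive days are tracked separately for each unique combination of these two columns. Before that, the CTE below keeps only failed DWH_TESTS runs and truncates run_time to the day, so each failing day appears once per group; the partitioning itself happens in the window function of the next step.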
-- CTE: one row per table, expectation type, and day with a failed test
WITH a AS (
    SELECT
        full_table_name,
        expectation_type,
        date_trunc('day', run_time) AS run_time,
        origin_type,
        meta,
        expectation_columns
    FROM MY_TABLE
    WHERE is_successful = false
      AND origin_type = 'DWH_TESTS'
    GROUP BY 1, 2, 3, 4, 5, 6
)
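As a minimal sketch of what the CTE does, the snippet below runs an equivalent query against an in-memory SQLite database from Python. SQLite is only a stand-in for the warehouse: `date(run_time)` takes the place of `date_trunc('day', run_time)`, the table is cut down to five columns, and the sample rows (`sales.orders`, `not_null`) are hypothetical.

```python
import sqlite3

# Hypothetical sample data: two failed runs on the same day, one failed run
# on the next day, and one successful run that must be filtered out.
rows = [
    ("sales.orders", "not_null", "2024-01-01 08:00:00", "DWH_TESTS", 0),
    ("sales.orders", "not_null", "2024-01-01 20:00:00", "DWH_TESTS", 0),
    ("sales.orders", "not_null", "2024-01-02 08:00:00", "DWH_TESTS", 0),
    ("sales.orders", "not_null", "2024-01-03 08:00:00", "DWH_TESTS", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (full_table_name TEXT, expectation_type TEXT, "
    "run_time TEXT, origin_type TEXT, is_successful INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)

# Assumption: date(run_time) stands in for date_trunc('day', run_time).
failed_days = conn.execute(
    """
    SELECT full_table_name, expectation_type, date(run_time) AS run_day
    FROM my_table
    WHERE is_successful = 0
      AND origin_type = 'DWH_TESTS'
    GROUP BY full_table_name, expectation_type, date(run_time)
    ORDER BY run_day
    """
).fetchall()

print(failed_days)
```

The two failed runs on 2024-01-01 collapse into a single row, which is what lets the later steps count days rather than individual test runs.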
ROW_NUMBER() Function
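We use the ROW_NUMBER() function to number the failing days within each partition in chronological order. On its own, the row number counts failing days; it does not restart when there is a gap between dates.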
-- Assign a row number using ROW_NUMBER()
SELECT
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    run_time,
    ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time) AS consecutive_days
FROM a
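To see what ROW_NUMBER() produces when the failing days are not adjacent, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). The table `a` mimics the CTE output with only three columns, and the dates are hypothetical; note the gap between 2024-01-02 and 2024-01-05.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (full_table_name TEXT, expectation_type TEXT, run_time TEXT)"
)
# Hypothetical failing days with a gap after the second day.
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [
        ("sales.orders", "not_null", "2024-01-01"),
        ("sales.orders", "not_null", "2024-01-02"),
        ("sales.orders", "not_null", "2024-01-05"),
    ],
)

numbered = conn.execute(
    """
    SELECT
        run_time,
        ROW_NUMBER() OVER (
            PARTITION BY full_table_name, expectation_type
            ORDER BY run_time
        ) AS consecutive_days
    FROM a
    ORDER BY run_time
    """
).fetchall()

print(numbered)
```

The third row receives the number 3 even though it is not adjacent to the previous day, so the column is best read as "n-th failing day" rather than "length of the current streak".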
Window Function
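We use the OVER() clause to define the window over which the row number function is applied. PARTITION BY restarts the numbering for each table and expectation type, and ORDER BY run_time numbers the days chronologically.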
-- Define the window over which ROW_NUMBER() is applied
ROW_NUMBER() OVER (
    PARTITION BY full_table_name, expectation_type  -- restart numbering per table and expectation type
    ORDER BY run_time                               -- number the failing days chronologically
) AS consecutive_days
HAVING Clause
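We use the HAVING clause to keep only groups whose highest row number is greater than 2, in line with the requirement of more than two failing days. Because run_time is part of the GROUP BY, each group contains a single day, so the filter keeps the third and later failing days for each table and expectation type; it does not check that those days are adjacent on the calendar.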
-- Filter on the row number using the HAVING clause
GROUP BY
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    run_time
HAVING MAX(consecutive_days) > 2
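Row numbers alone do not reveal gaps between dates. A common complement, which is not part of the query in this article, is the "gaps and islands" technique: subtracting the row number (in days) from each date yields the same value for every day in an unbroken streak, and that value can serve as a grouping key. The sketch below assumes one row per failing day, as the CTE produces, and runs on SQLite from Python; `island_key`, `first_day`, `last_day`, and the sample data are hypothetical names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (full_table_name TEXT, expectation_type TEXT, run_time TEXT)"
)
# Hypothetical data: a three-day streak, a gap, then a two-day streak.
days = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06"]
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [("sales.orders", "not_null", day) for day in days],
)

streaks = conn.execute(
    """
    WITH numbered AS (
        SELECT
            full_table_name,
            expectation_type,
            run_time,
            ROW_NUMBER() OVER (
                PARTITION BY full_table_name, expectation_type
                ORDER BY run_time
            ) AS rn
        FROM a
    ),
    islands AS (
        SELECT
            full_table_name,
            expectation_type,
            run_time,
            -- Date minus row number: constant within a run of adjacent days.
            date(run_time, '-' || rn || ' days') AS island_key
        FROM numbered
    )
    SELECT
        full_table_name,
        expectation_type,
        MIN(run_time) AS first_day,
        MAX(run_time) AS last_day,
        COUNT(*) AS consecutive_days
    FROM islands
    GROUP BY full_table_name, expectation_type, island_key
    HAVING COUNT(*) > 2
    ORDER BY first_day
    """
).fetchall()

print(streaks)
```

Only the streak from 2024-01-01 to 2024-01-03 is returned; the two-day streak is dropped by the HAVING clause. In a warehouse dialect, the `date(...)` expression would be replaced by that engine's own date arithmetic.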
Step 2: Execute the Query
Now that we have the modified query, let’s execute it.
-- Execute the modified query
SELECT
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    run_time,
    ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time) AS consecutive_days
FROM (
    WITH a AS (
        SELECT
            full_table_name,
            expectation_type,
            date_trunc('day', run_time) AS run_time,
            origin_type,
            meta,
            expectation_columns
        FROM MY_TABLE
        WHERE is_successful = false
          AND origin_type = 'DWH_TESTS'
        GROUP BY 1, 2, 3, 4, 5, 6
    )
    SELECT
        full_table_name,
        expectation_type,
        meta,
        expectation_columns,
        origin_type,
        run_time,
        ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time) AS consecutive_days
    FROM a
) AS numbered  -- alias added because some engines require one for a subquery in FROM
GROUP BY
    full_table_name,
    expectation_type,
    meta,
    expectation_columns,
    origin_type,
    run_time
HAVING MAX(consecutive_days) > 2
ORDER BY run_time DESC
Step 3: Interpret Results
After executing the query, you should see one row per table, expectation type, and day that passed the HAVING filter.
The final result will look something like this:
| full_table_name | expectation_type | meta | expectation_columns | origin_type | run_time | consecutive_days |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | 1 |
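The consecutive_days column contains the row number assigned by the outer ROW_NUMBER() call. Window functions are evaluated after HAVING, so it numbers the rows that remain after the filter within each table and expectation type, starting at 1.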
Step 4: Modify Query to Update Failing Tests
If you want to update failing tests, you can run the SELECT to find the affected rows and then issue an UPDATE statement whose WHERE clause targets them.
-- Update failing tests: select the affected rows, then run the UPDATE
-- Select rows that fail the test for more than two consecutive days
SELECT
full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time,
ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type order by run_time) AS consecutive_days
FROM
(WITH a AS (
SELECT
full_table_name,
expectation_type,
date_trunc('day', run_time) AS run_time,
origin_type,
meta,
expectation_columns
FROM MY_TABLE
WHERE is_successful = false
AND origin_type = 'DWH_TESTS'
GROUP BY 1, 2, 3, 4, 5, 6
)
SELECT
full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time,
ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type order by run_time) AS consecutive_days
FROM a )
GROUP BY full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time
HAVING MAX(consecutive_days) > 2
ORDER BY run_time DESC;
-- Update failing tests using the UPDATE statement
UPDATE MY_TABLE
SET is_successful = TRUE
WHERE full_table_name = '...'
AND expectation_type = '...'
AND meta = '...'
AND origin_type = 'DWH_TESTS'
Note: This is a basic example and may need to be modified based on your specific use case.
Step 5: Test Query
Test the query by running it multiple times with different sets of data.
-- Run the query multiple times with different sets of data
-- Run query once
SELECT
full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time,
ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type order by run_time) AS consecutive_days
FROM
(WITH a AS (
SELECT
full_table_name,
expectation_type,
date_trunc('day', run_time) AS run_time,
origin_type,
meta,
expectation_columns
FROM MY_TABLE
WHERE is_successful = false
AND origin_type = 'DWH_TESTS'
GROUP BY 1, 2, 3, 4, 5, 6
)
SELECT
full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time,
ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type order by run_time) AS consecutive_days
FROM a )
GROUP BY full_table_name,
expectation_type,
meta,
expectation_columns,
origin_type,
run_time
HAVING MAX(consecutive_days) > 2
ORDER BY run_time DESC;
-- Run the same query again after loading a different set of data into MY_TABLE
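One way to run the query against several data sets is a small harness that loads each set into a fresh in-memory database. The sketch below uses Python's sqlite3 module with a simplified adaptation of the query: `meta` and `expectation_columns` are omitted, `date()` replaces `date_trunc()`, and a WHERE on the row number replaces the GROUP BY and HAVING pair. The helper names `run_query` and `failures` and the sample rows are hypothetical.

```python
import sqlite3

# Simplified SQLite adaptation of the query (see the assumptions above).
QUERY = """
WITH a AS (
    SELECT full_table_name, expectation_type, date(run_time) AS run_day
    FROM my_table
    WHERE is_successful = 0
      AND origin_type = 'DWH_TESTS'
    GROUP BY full_table_name, expectation_type, date(run_time)
),
numbered AS (
    SELECT
        full_table_name,
        expectation_type,
        run_day,
        ROW_NUMBER() OVER (
            PARTITION BY full_table_name, expectation_type
            ORDER BY run_day
        ) AS consecutive_days
    FROM a
)
SELECT full_table_name, expectation_type, run_day, consecutive_days
FROM numbered
WHERE consecutive_days > 2
ORDER BY run_day DESC
"""


def run_query(rows):
    """Load rows into a fresh in-memory database and run QUERY against them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (full_table_name TEXT, expectation_type TEXT, "
        "run_time TEXT, origin_type TEXT, is_successful INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def failures(days):
    """Build one failed DWH_TESTS run per day for a single hypothetical test."""
    return [
        ("sales.orders", "not_null", f"{day} 08:00:00", "DWH_TESTS", 0)
        for day in days
    ]


three_days = failures(["2024-01-01", "2024-01-02", "2024-01-03"])
two_days = failures(["2024-01-01", "2024-01-02"])

print(run_query(three_days))  # the third failing day is returned
print(run_query(two_days))    # no rows: only two failing days
```

Keeping each data set in its own database means one run cannot leak rows into the next, which makes the expected result of each case easy to state.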
There is no single numerical answer to this problem; the outcome is the result set of the SQL query and how you interpret it.
Last modified on 2024-12-22