Validating Schema Changes Using SQL Set Operations in Snowflake

Introduction
A few weeks ago I wrote about using a similar set-based method to validate data after a change to our transformation code, where we needed to confirm the data was identical before and after the change.
A related use case arises when we add tables and columns to the models our BI tool or other downstream systems access. In this case we may add tables and columns freely, but we must never delete or rename them unless we coordinate with the downstream engineers.
This blog post discusses a practical SQL approach for comparing multiple tables, views, dynamic tables, etc., across two environments (typically an integration environment being tested against production before the changes are deployed) to ensure that we will not break the BI tool or other downstream systems. The SQL for this use case assumes that the production schema and the target schema have the same names in different databases. In our naming pattern these are DEV_EDW_DB and PRD_EDW_DB.
For instance, the schema we use for our first data mart is ANALYTICS_BASE.
The Use Case
Imagine you have model changes, maybe many of them, that need to be deployed before the end of the year. Your BI engineers have all taken end-of-year vacation, so you don’t have them on standby for the deployment, and you really need to make sure you don’t break the BI reporting layer. As with data comparison, tools and frameworks are available for schema comparison. However, a simple SQL approach can be highly effective, especially in environments like Snowflake, where both data and metadata can be queried directly with minimal setup.
The SQL Query: Set-Based Comparison
The SQL method utilizes MINUS and UNION ALL operations to identify any discrepancies between two data sets. Here’s a basic template:
select 'A-B' as set_name, set_a.* from dev_db.schema.table as set_a
minus
select 'A-B' as set_name, set_b.* from prd_db.schema.table as set_b
union all
select 'B-A' as set_name, set_b.* from prd_db.schema.table as set_b
minus
select 'B-A' as set_name, set_a.* from dev_db.schema.table as set_a;
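A note on evaluation order: MINUS (Snowflake’s synonym for EXCEPT) and UNION ALL have equal precedence and are evaluated left to right, so without parentheses the template above is actually evaluated as ((A-B) union all B) minus A. It still returns the right rows, because the distinct set_name literals keep the two halves from cancelling each other out, but if you prefer to make the intent explicit, the same template can be written with parentheses:
( select 'A-B' as set_name, set_a.* from dev_db.schema.table as set_a
minus
select 'A-B' as set_name, set_b.* from prd_db.schema.table as set_b )
union all
( select 'B-A' as set_name, set_b.* from prd_db.schema.table as set_b
minus
select 'B-A' as set_name, set_a.* from dev_db.schema.table as set_a );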
Now we will modify the template to query the information_schema.columns view, and we will use meaningful set names to help us think about the result set(s) we see.
This query helps identify two types of discrepancies:
- Columns present in DEV but not in PRD (DEV-PRD).
- Columns present in PRD but not in DEV (PRD-DEV).
For this example we would expect results in the DEV-PRD set, assuming we added tables and/or columns. But if we get results in the PRD-DEV set, we know we deleted or renamed a column (a rename shows up as two rows: the new name in DEV-PRD and the old name in PRD-DEV). Since we have said this is NOT our intent, we can go back to our transformation code (Coalesce or dbt) and fix our model before we test again.
Practical Example
To demonstrate, let’s say you are comparing the ANALYTICS_BASE schema between the DEV_EDW_DB and PRD_EDW_DB databases.
To compare multiple schemas, change the four occurrences of:
table_schema = 'ANALYTICS_BASE'
to something like:
table_schema in ( 'ANALYTICS_BASE', 'ANALYTICS_INT', 'ANALYTICS_STG')
WITH diff AS (
select 'DEV-PRD' set_name, table_schema, table_name, column_name, data_type
from dev_edw_db.information_schema.columns as set_a
where table_catalog = 'DEV_EDW_DB' and table_schema = 'ANALYTICS_BASE'
minus
select 'DEV-PRD' set_name, table_schema, table_name, column_name, data_type
from prd_edw_db.information_schema.columns as set_b
where table_catalog = 'PRD_EDW_DB' and table_schema = 'ANALYTICS_BASE'
union all
select 'PRD-DEV' set_name, table_schema, table_name, column_name, data_type
from prd_edw_db.information_schema.columns as set_b
where table_catalog = 'PRD_EDW_DB' and table_schema = 'ANALYTICS_BASE'
minus
select 'PRD-DEV' set_name, table_schema, table_name, column_name, data_type
from dev_edw_db.information_schema.columns as set_a
where table_catalog = 'DEV_EDW_DB' and table_schema = 'ANALYTICS_BASE'
)
select * from diff
order by 1, 3, 4;
The result set will display columns in DEV but not in PRD (DEV-PRD), followed by columns in PRD but not in DEV (PRD-DEV).
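If you want to turn this check into a deployment gate, only the PRD-DEV half matters, since additions (DEV-PRD) are allowed under our contract with the downstream systems. Here is a minimal sketch (the breaking_changes alias is just an illustrative name) that counts the rows you must never see:
with diff as (
-- columns in PRD that are missing from DEV: deletes or renames
select 'PRD-DEV' set_name, table_schema, table_name, column_name, data_type
from prd_edw_db.information_schema.columns
where table_catalog = 'PRD_EDW_DB' and table_schema = 'ANALYTICS_BASE'
minus
select 'PRD-DEV' set_name, table_schema, table_name, column_name, data_type
from dev_edw_db.information_schema.columns
where table_catalog = 'DEV_EDW_DB' and table_schema = 'ANALYTICS_BASE'
)
select count(*) as breaking_changes
from diff;
A result of zero means nothing was deleted or renamed; anything greater than zero should stop the deployment until the model is fixed.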
Advantages of This Approach
- Simplicity: No additional setup or complex frameworks are required.
- Readability: SQL is easy to read and modify, making it accessible to data engineers and analysts.
- Scalability: Works with large data sets, leveraging Snowflake’s performance optimizations.
Limitations and Considerations
Note that this example compares the data_type column but not attributes such as length, precision, or scale, so some type changes that could cause issues may go undetected. If desired, add more columns from information_schema.columns as needed.
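For example, here is a sketch of the first branch extended with the standard sizing and nullability attributes from information_schema.columns. Repeat the same column list in all four branches so the MINUS and UNION ALL operands line up:
select 'DEV-PRD' set_name, table_schema, table_name, column_name, data_type,
character_maximum_length, numeric_precision, numeric_scale, is_nullable
from dev_edw_db.information_schema.columns
where table_catalog = 'DEV_EDW_DB' and table_schema = 'ANALYTICS_BASE'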
Conclusion
This SQL pattern is an effective way to validate changes to your target schema(s). It provides a quick and reliable method to ensure that code changes have not introduced unintended schema changes that might break your BI layer. Incorporate this technique into your development and testing workflows to limit that risk.