Our Snowflake Framework

Design Principles
Least Privilege Principle: The idea that a user of a system should be given the minimum access needed to conduct the work at hand efficiently.
Consistency: Consistent patterns and naming standards are easier to comprehend and quicker to work with than inconsistent patterns, where you have to remember the variations in the “rules,” or worse, where there are no real rules and you must look up how something is done each time you work on it. Consistency improves developer productivity and reduces fatigue as well as errors.
Consistency also allows us to extend the pattern to new circumstances, when needed, with minimal effort.
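To make the least privilege principle concrete, here is a minimal sketch of a read-only grant in Snowflake. The role, schema, and user names are hypothetical examples chosen for illustration; only the database name anticipates the naming pattern described later in this section.

```sql
-- Minimal least-privilege sketch. ANALYST_READ_ROLE, the REPORTING schema,
-- and user JSMITH are hypothetical names used only for illustration.
CREATE ROLE IF NOT EXISTS ANALYST_READ_ROLE;

-- Grant only what is needed to query one schema, nothing more.
GRANT USAGE ON DATABASE PRD_EDW_DB TO ROLE ANALYST_READ_ROLE;
GRANT USAGE ON SCHEMA PRD_EDW_DB.REPORTING TO ROLE ANALYST_READ_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA PRD_EDW_DB.REPORTING TO ROLE ANALYST_READ_ROLE;

-- The user sees only what the role exposes.
GRANT ROLE ANALYST_READ_ROLE TO USER JSMITH;
```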
Our Invented Namespaces
Raw Data
When we bring data into Snowflake we follow the ELT pattern: we extract data, load it, and then transform it. Because of this, it is useful to have a portion of our naming pattern that lets us quickly identify and keep separate data that is in a “raw” state. By raw data we mean data that looks as much as possible like the data from the source system. To make it easy to identify, we include the word RAW in the database name. Because we may have multiple systems loading raw data, we create a separate database, with an accompanying service account user and roles, for each program or set of programs under a particular span of control. Some examples of raw data namespaces:
Fivetran Raw – This “application” space is used to store raw data ingested by Fivetran. At a minimum there is a separate schema for each source system.
Custom Raw – This “application” space is used to store raw data ingested by a custom or manually coded process. If there were multiple disparate groups of coders loading raw data, we would recommend breaking “Custom Raw” into more distinct namespaces.
Matillion Raw – Breaking data ingested by Matillion into its own namespace is a recent enhancement of our pattern. Doing this will provide a better separation of concerns in the ELT workflow.
Streaming Raw – Streaming data refers to a continuous flow of data that is generated and ingested in real time from various sources, such as IoT devices, sensors, social media feeds, logs, and other data-producing systems.
An important callout here is that we want to break these up by area of responsibility or control. We want to limit the blast radius of any one service account user connecting to Snowflake. So in a larger environment we may wish to have multiple namespaces for each of the above sources of data; in a large enterprise or organization we might break this up based on the team or group managing the process.
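As a rough sketch of what one such namespace might look like when provisioned, consider Fivetran’s raw space below. The role, user, and privilege choices are illustrative assumptions, not a prescribed setup.

```sql
-- Illustrative sketch: one raw database per ingestion tool, with its own role
-- and service account. FIVETRAN_LOADER_ROLE and FIVETRAN_SVC_USER are
-- hypothetical names.
CREATE DATABASE IF NOT EXISTS PRD_FIVETRAN_RAW_DB;

CREATE ROLE IF NOT EXISTS FIVETRAN_LOADER_ROLE;
GRANT USAGE ON DATABASE PRD_FIVETRAN_RAW_DB TO ROLE FIVETRAN_LOADER_ROLE;
GRANT CREATE SCHEMA ON DATABASE PRD_FIVETRAN_RAW_DB TO ROLE FIVETRAN_LOADER_ROLE;

-- The service account can write raw data into this database and nothing else,
-- limiting the blast radius if its credentials are ever compromised.
CREATE USER IF NOT EXISTS FIVETRAN_SVC_USER DEFAULT_ROLE = FIVETRAN_LOADER_ROLE;
GRANT ROLE FIVETRAN_LOADER_ROLE TO USER FIVETRAN_SVC_USER;
```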
Data Applications
In a data reporting environment, we often want to break our data into subsets in order to make them more manageable. I will refer to these subsets here as “applications.” You could also think of them as “modules” of a larger application in the corporate data ecosystem.
Example Data Applications:
Enterprise Data Warehouse (EDW) – Integrated and subject-oriented data with history. There is a joke among data warehouse professionals, where you ask someone: How is your data warehouse project going? Answer: Great, we have three of them. In case you don’t know, it’s funny because it is best practice for a company to have only one data warehouse.
Operational Data Store (ODS) – Similar to a data warehouse in that we have integrated, subject-oriented data sets, but with only the data needed for operational queries. An example is an integrated database tracking the status of seats on all current and future flights. To sell tickets, we have to track in near real time the status of every seat as people book, rebook, and cancel their tickets, and as airlines cancel or reroute flights and flight crews and deal with the fallout of everything involved in the process.
Environments
We need multiple “environments” in which to develop and test software that should be kept separate from our production environment. Before Snowflake, we would sometimes do this with separate servers, databases, schemas, or various combinations designed to keep the production environment safe from unintended consequences during the development and test process. In Snowflake we have a feature called “zero-copy cloning”. This allows us to make a copy of any table, schema, or database without having to actually copy the data. It is accomplished as a part of Snowflake’s secret sauce, but essentially it is done with pointers to blocks of data stored on S3.
Zero-copy cloning makes capabilities available that were never practical before. We can now develop and test against an exact copy of a production database. If a test fails or corrupts our dev/test database, we can drop it and re-clone to get another fresh copy of production.

Around 2014, I was the data architect on a project for a company that used Oracle as its data warehouse database. The production database was on one coast of the United States and the DEV and UAT servers were on the other coast. Refreshing the UAT data from production took approximately two weeks. The data had to be exported from Oracle to disk (and you needed enough space on the server to do this), compressed, and sent by SCP or a similar protocol to the data center on the other coast. The compressed files then had to be uncompressed and loaded into the UAT instance of Oracle, and the security grants had to be verified and re-applied as needed. By the time it could be used, the data was at least two weeks behind the production database, and the process had consumed at least 40 hours of the DBA’s time. Because of the expense of this process, it was only practical to do about once a quarter.
Compare that to Snowflake’s zero-copy clone command, where the DBA or engineer issues a single command and the clone is available in a few seconds, with no extra storage, compute, or time lag needed.
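As a sketch, assuming the database names from the table in the next section, that single command looks like this:

```sql
-- Refresh the development warehouse from production with a zero-copy clone.
-- No data is physically copied; the clone is available almost immediately.
CREATE OR REPLACE DATABASE DEV_EDW_DB CLONE PRD_EDW_DB;
```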
Putting It Together
So now we have a variety of factors by which we break up the namespace of our Snowflake account: raw data or applications, combined with environment. We will work through our examples with two environments, Development (DEV) and Production (PRD), but the pattern can easily be extended to include Integration (INT), User Acceptance Test (UAT), or even an arbitrary environment created for a particular team or purpose.
We will also simplify our example to one ingestion program, in this case Fivetran, and one additional application, the Enterprise Data Warehouse (EDW).
Each of these namespaces is kept together and managed together by putting it in its own database:
| Database Name | Use/Description | Comment |
|---|---|---|
| PRD_FIVETRAN_RAW_DB | Production data ingested by Fivetran. | |
| DEV_FIVETRAN_RAW_DB | Data ingested by Fivetran with custom connectors we are building. | Only needed if we are developing custom connectors. |
| PRD_MATILLION_RAW_DB | Production data ingested by production Matillion. | |
| DEV_MATILLION_RAW_DB | Data ingested by development Matillion jobs. | |
| PRD_HISTORY_DB | The standard location for snapshot (SCD) data. | This data is special because it is not idempotent. |
| DEV_HISTORY_DB | Used while developing and testing snapshots. | |
| PRD_ODS_DB | The production version of our operational data store is kept in this database. | |
| DEV_ODS_DB | The development version of our operational data store is kept in this database. | |
| PRD_EDW_DB | The production version of our enterprise data warehouse is stored here. | |
| DEV_EDW_DB | Our enterprise data warehouse development space. | |
| USER_SANDBOX_DB | A place for power users and data scientists to create their own tables and views. | Can be divided by creating a separate schema for each user and, if needed, other schemas for sharing working objects. |
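Finally, as a sketch of the sandbox comment in the last row, the per-user schemas might be carved out as shown below; JSMITH, SHARED, and JSMITH_ROLE are hypothetical names used only for illustration.

```sql
-- Hypothetical sketch of per-user sandbox schemas. JSMITH, SHARED, and
-- JSMITH_ROLE are example names, not part of the naming standard.
CREATE DATABASE IF NOT EXISTS USER_SANDBOX_DB;

-- One schema per power user, plus an optional shared schema for collaboration.
CREATE SCHEMA IF NOT EXISTS USER_SANDBOX_DB.JSMITH;
CREATE SCHEMA IF NOT EXISTS USER_SANDBOX_DB.SHARED;

GRANT USAGE ON DATABASE USER_SANDBOX_DB TO ROLE JSMITH_ROLE;
GRANT ALL ON SCHEMA USER_SANDBOX_DB.JSMITH TO ROLE JSMITH_ROLE;
GRANT USAGE ON SCHEMA USER_SANDBOX_DB.SHARED TO ROLE JSMITH_ROLE;
```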