Why we love Snowflake

Introduction
The Snowflake Data Platform is a once-in-a-generation breakthrough in data warehouse, analytics, and reporting database technology. It is transformative and disruptive, and it will continue to change how we think of data analytics platforms (aka data warehousing) and the processing of data, large and small. Since it is a cloud-native Platform as a Service (PaaS), it can be used efficiently for data of any size, and you only pay for what you use!
Many of the DBA/admin tasks required for other databases are either no longer necessary or are significantly reduced in Snowflake. However, there is one significant area in Snowflake that continues to require as much effort as any prior technology: account organization and security. The purpose of this article is to address how we organize and name things in a Snowflake account, as well as how to manage the roles, privileges and associated grants within the account so that the data is secure and developers are productive.
The Snowflake documentation site does an excellent job of documenting Snowflake’s security features. Many users have discovered the near infinite power of Snowflake with its many breakthrough features. However, when they first log in and encounter the great potential that blank canvas offers, it can be a bit daunting, to say the least.
Additionally, many database professionals carry loads of prior experience. This can be a double-edged sword. On the one hand, this experience translates well into an easy navigational experience of the database. On the other, some habits need to be left at the door. For instance, we used to separate our workloads into separate servers and/or instances of a database. Someone new to Snowflake might want to create separate “accounts” for their development and production workloads. But to do that would thwart, or at least complicate, the use of the zero-copy cloning feature to replicate production data for testing and development.
Note the first “half” of this article covers some of the Snowflake basics. The database naming framework we use starts at Our Framework/Snowflake Jumpstart.
Snowflake “Locations”
Cloud Provider
Snowflake can be hosted on any of the three major cloud providers: AWS, GCP, or Azure. Each provider features different availability zones as well as slightly different pricing models (cost of storage).
Since the introduction of Snowflake Organizations, Snowflake users have been able to link accounts between cloud providers. For instance, if a company sets up an account on AWS, they can also set up an account within the same organization but with another provider (GCP/Azure), or within the same provider in the same or a different region.
Cloud Provider Region
Snowflake is available in multiple regions across all three providers. Snowflake customers need to consider the requirements of the project to decide which Cloud Provider best supports their Snowflake implementation. Things to consider are data center location, integration with other cloud services from the same provider, as well as legal obligations to keep data in a particular geographic area.
AWS currently (as of May 2022) provides the most availability options, so that potential Snowflake users can select the location closest to their data delivery requirements.
With the recently introduced Organizations feature, organization admins can set up new accounts in different regions that are connected under the same organization. This feature can be used as a data protection measure (copying data between regions to ensure uptime, for instance) or to ensure faster connection times based on distance to/from provider regions.
Snowflake Editions
Standard
The Standard edition provides all of the base features, with certain limitations, such as only one day of Time Travel and no Multi-Cluster warehouse feature.
Enterprise
- Multi-Cluster warehouse: more advanced compute scaling control
- Time-Travel up to 90 days for use-cases where data protection is essential
- Materialized views for query optimization
Business Critical
- HIPAA and PII data security
- Tri-Secret Secure
- Virtual Private Snowflake
- Dedicated customer-specific virtual servers
Snowflake Organizations
A Snowflake Organization is a set of Snowflake accounts that are managed together under one “organization” for billing purposes. Once you have set up an initial account in Snowflake, the ACCOUNTADMIN or SECURITYADMIN can grant the ORGADMIN role, which can then use a simple CREATE ACCOUNT command to create more accounts in any supported region of any cloud provider.
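A minimal sketch of that flow follows. The user name, account name, admin credentials, and region are all hypothetical placeholders, not values from this article:

```sql
-- Run as ACCOUNTADMIN: grant the ORGADMIN role to an administrator
-- ("jdoe" is a hypothetical user name).
GRANT ROLE ORGADMIN TO USER jdoe;

-- Run as ORGADMIN: create a second account in the same organization.
-- Account name, credentials, edition, and region are illustrative.
CREATE ACCOUNT dev_account
  ADMIN_NAME     = admin
  ADMIN_PASSWORD = 'ChangeMe123!'
  EMAIL          = 'admin@example.com'
  EDITION        = ENTERPRISE
  REGION         = aws_us_east_1;
```

Both accounts are then billed together under the one organization.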
Snowflake Account
A Snowflake account organizes a set of Snowflake objects on a given platform and region. When you first sign up for Snowflake you get an “account”. If you wish, you can use the organizational features to create multiple linked accounts that will be billed as a group.
Snowflake Storage
Storage in Snowflake is categorized in a couple of different ways.
Data storage costs are calculated based on the average daily total storage at the cloud provider’s storage rate (typically between $23 and $40/TB per month, depending on contract type) per month. Data in Snowflake is automatically compressed by a factor of 4x to 8x, which reduces data storage costs relative to the total amount of data being imported into Snowflake.
There is data associated with databases, with internal stages, and with Fail-safe.
Note, external stages are also storage but are not managed inside of Snowflake and hence not billed by Snowflake.
- Database storage consists of data in tables, as well as the Time Travel history of that data.
- Stage data consists of the storage used by files in locations (“stages”) managed by Snowflake. A stage can be thought of as a portion of an AWS S3/GCS bucket or the Azure equivalent. Stages are further subdivided into named stages, created with the CREATE STAGE DDL command, and implicit stages associated with each Snowflake user or table.
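The three kinds of stages can be sketched as follows; the stage, table, and file names are illustrative:

```sql
-- Named stage, created explicitly with DDL:
CREATE STAGE my_stage;

-- Upload a local file (via SnowSQL or another client) to each kind of stage:
PUT file:///tmp/data.csv @my_stage;   -- named stage
PUT file:///tmp/data.csv @~;          -- implicit user stage
PUT file:///tmp/data.csv @%my_table;  -- implicit table stage

-- List the files in each:
LIST @my_stage;
LIST @~;
LIST @%my_table;
```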
Database and Schemas
Snowflake handles all storage of table data independently of the database and schema an object is in; databases and schemas are simply “namespaces”, that is, ways to organize objects inside Snowflake.
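In practice, the namespace shows up as the fully qualified name database.schema.object. A quick sketch, with illustrative names:

```sql
-- Databases and schemas are namespaces, not physical containers:
CREATE DATABASE analytics_db;
CREATE SCHEMA analytics_db.staging;
CREATE TABLE analytics_db.staging.orders (id NUMBER, amount NUMBER);

-- The fully qualified name locates the object regardless of session context:
SELECT * FROM analytics_db.staging.orders;
```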
Snowflake Compute AKA Warehouses
Warehouse Size and Cost
In Snowflake a compute cluster is known as a “warehouse”. These warehouses (compute clusters) come in t-shirt sizes from XS up to 6XL. Each step up in size doubles the compute resources of the cluster and doubles the hourly credit consumption as well.
Size | XS | S | M | L | XL | 2XL | 3XL | 4XL | 5XL | 6XL |
---|---|---|---|---|---|---|---|---|---|---|
Credits/hr | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
As a general rule, if the data we are processing is big enough, we will halve the run time of the job when we double the size of the warehouse. Up to a point, the cost for processing that load stays constant, but the processing time can be significantly shorter. This feature alone makes things possible that were never possible with on-premises computers, and it requires a change of mindset to fully utilize.
This is HUGE. We have as much power as we need, when we need it, for as long as we need it. The minute we don’t need it, we stop paying for it!
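A quick sketch of how that cost trade-off works, with an illustrative warehouse name:

```sql
-- A Medium warehouse (4 credits/hr) that takes 2 hours for a job costs
-- about the same 8 credits as a Large (8 credits/hr) finishing in 1 hour,
-- assuming the workload is big enough to scale with the extra compute.
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'MEDIUM';

-- Double the compute (and the credit rate) to roughly halve the runtime:
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';
```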
Changing the size of a warehouse or using a different warehouse of the size you want
Generally, it is much better to use a warehouse of the size you need than to change the size of a warehouse on the fly, for a couple of reasons. If you change the size of a warehouse that is in use, it will stop accepting new jobs until the longest-running job on that warehouse has completed; only then will it change size and start processing at the new size. Whereas if you switch to a larger warehouse before you submit any queries, processing begins on the larger warehouse without delay, while the existing warehouse finishes the queries in its queue and then automatically suspends if no additional queries are sent to it.
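The two approaches look like this in a session; both warehouse names are illustrative:

```sql
-- Preferred: switch the session to a larger warehouse that already exists.
-- New queries start on it immediately; the old warehouse drains its queue
-- and auto-suspends when idle.
USE WAREHOUSE transform_wh_xl;

-- Alternative: resize in place. As noted above, the warehouse finishes
-- its running jobs before the new size takes effect.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';
```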
Auto-Suspend, Auto-Resume
Snowflake warehouses can be configured to suspend after a set period of inactivity following the last completed query. This helps protect against the cost of running warehouses with no active jobs. CDC recommends (until project requirements dictate otherwise) a default auto-suspend setting of 60 seconds to prevent unnecessary credit expenditure from idling warehouses.
As a project matures, the auto-suspend duration should be revisited for one very important reason: the warehouse cache. As long as a warehouse remains active, data read by its queries stays cached at the warehouse layer, which speeds up subsequent queries that touch similar data. This affects the cost/benefit analysis of keeping the warehouse active: at a certain point, the time saved on repeated queries crosses over the cost of the credits used to keep the warehouse running, and it may be a cost saving to keep the warehouse up so it retains its cache for the next similar query.
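The settings above can be sketched as follows; the warehouse name and the 600-second figure are illustrative choices, not recommendations from this article:

```sql
-- 60-second auto-suspend keeps idle cost low; auto-resume restarts the
-- warehouse automatically on the next query.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60     -- seconds of inactivity before suspending
  AUTO_RESUME    = TRUE;

-- Later, if cache reuse justifies the idle cost, lengthen the window:
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 600;
```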
Cost Tracking & Allocation Vs. Absolute Lowest Spend
Generally speaking, Snowflake costs are in two buckets—compute and storage. If you have an “on-demand” deal, the storage cost is about $40/TB/Month. If you have a capacity contract the storage drops to $23/TB/month. Compute is calculated by runtime of the warehouse aka compute cluster.
Sometimes, we need to know and attribute how much compute is being consumed by various groups of users. One way to do that is by giving that group of users a set of dedicated warehouses for the group, e.g., Finance, Sales, Marketing, R&D, etc. Snowflake reporting makes it easy to then add the total credits for the warehouses assigned to each group and then multiply by the contracted rate per credit, then assign the appropriate amount to each group.
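A sketch of that roll-up, using the ACCOUNT_USAGE share (which requires the IMPORTED PRIVILEGES grant shown later in this article); the $3.00/credit rate is a hypothetical contracted price:

```sql
-- Credits consumed per warehouse over the last month, with an estimated
-- dollar cost at a hypothetical contracted rate of $3.00 per credit.
SELECT warehouse_name,
       SUM(credits_used)        AS credits,
       SUM(credits_used) * 3.00 AS est_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```

With one warehouse (or a small set of warehouses) per group, each row maps directly to a team’s chargeback.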
If, however, we are running a lot of batch jobs and want to keep our costs as low as possible, we will instead attempt to keep a smaller number of appropriately-sized warehouses running at as close to capacity as possible.
To say it a little differently, using more warehouses by groups of users or other breakouts makes it easy to allocate costs. However, it comes at the risk of some of the warehouses operating at less than full capacity for a portion of the time they are running, which can raise the total cost.
It is generally not cost-effective to perform in-depth analysis on this issue until the amount you are spending reaches a point where the cost savings from optimizing the process justify the cost of analysis and tuning.
Snowflake Security Management
For all of the advancements Snowflake brings to bear as a new paradigm of data warehousing database engine, the one thing that remains to be done is to define and implement the database security inside of the Snowflake database.
In many ways this is very similar to how we would do it in any modern database. We believe this one area can make or break the security and agility of a Snowflake environment. Done well, the initial security setup provides a framework for multiple teams of developers to work in the same Snowflake account securely and efficiently.
Roles
Defined
Oxford Languages, the dictionary used by Google, reveals this English language definition for “role:”
The function assumed or part played by a person or thing in a particular situation. “She greeted us all in her various roles of mother, friend, and daughter.”
In organizations, we may play multiple roles. More frequently, a role can end up including many “sub-roles.” As Shakespeare said:
All the world’s a stage, And all the men and women merely players; They have their exits and their entrances; And one man in his time plays many parts, …
Snowflake Roles
In Snowflake, Roles are “first class” objects.
- Roles own objects – see “Object Ownership” below.
- Roles can be granted privileges.
- Roles can be granted to other roles. This can be used to add layers and/or matrices of privileges, as we will see.
- Roles can be granted to USERS.
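These four properties can be sketched together; all role, database, and user names are illustrative:

```sql
-- Roles can be created and granted privileges:
CREATE ROLE analyst_read;
GRANT USAGE ON DATABASE analytics_db TO ROLE analyst_read;

-- Roles can be granted to other roles, building a hierarchy:
CREATE ROLE analyst_full;
GRANT ROLE analyst_read TO ROLE analyst_full;  -- analyst_full inherits read access

-- Rolling custom roles up to SYSADMIN is a common pattern:
GRANT ROLE analyst_full TO ROLE SYSADMIN;

-- Finally, roles are granted to users:
GRANT ROLE analyst_read TO USER jdoe;
```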
Default roles that come with Snowflake:
ACCOUNTADMIN - The account administrator role is the most powerful role in the system. This role alone is responsible for configuring parameters at the account level. Users with the ACCOUNTADMIN role can view and operate on all objects in the account, can view and manage Snowflake billing and credit data, and can stop any running SQL statements. We work to minimize the need to use this role. It is best practice for any user with this role to use multi-factor authentication (MFA).
SYSADMIN - The system administrator role includes the privileges to create warehouses, databases, and all database objects (schemas, tables, etc.).
SECURITYADMIN – The security administrator role includes the global MANAGE GRANTS privilege to grant or revoke privileges on objects in the account. The USERADMIN role is a child of this role in the default access control hierarchy. Because users with this role can grant ACCOUNTADMIN to any user, it is best practice for any user with this role to use MFA.
USERADMIN - The user administrator role was introduced in April 2020 to let accounts separate the management of users and roles from the SECURITYADMIN role, if desired.
ORGADMIN - The organization administrator system role is responsible for managing operations at the organization level.
Note: to minimize the use of the ACCOUNTADMIN role (and hence the need to log in with MFA), there are a couple of grants we will typically issue from ACCOUNTADMIN to SYSADMIN:
```sql
GRANT MONITOR USAGE ON ACCOUNT TO ROLE SYSADMIN;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE SYSADMIN;
```
Role Based Access Control - RBAC
A quick Google search of “role based access control” returned 1.2 million results in 0.86 seconds.
For the purposes of this article, the first paragraph of the Wikipedia article will suffice. It was credited to several conferences and IEEE papers.
“In computer systems security, role-based access control (RBAC) or role-based security is an approach to restricting system access to authorized users. … Role-based access control (RBAC) is a policy-neutral access-control mechanism defined around roles and privileges. The components of RBAC such as role-permissions, user-role and role-role relationships make it simple to perform user assignments. A study by NIST has demonstrated that RBAC addresses many needs of commercial and government organizations. RBAC can be used to facilitate administration of security in large organizations with hundreds of users and thousands of permissions.” [Emphasis mine.]
Row and Column Level Security
Snowflake has the features required to manage data security at the row level, the column level, or both. This article will not explore those features; here we need to establish a minimum viable level of security and role-based management so developers can start building with Snowflake. Since the more detailed row- and column-level requirements frequently get fleshed out later in the process of building a data platform, we will address them in a future document.