A new product offering from Snowflake Computing (Snowflake.net) promises to be a game changer in the cloud data warehouse space. With some clients easily reaching an 80% reduction in cost while maintaining the same or better SLAs, we need to be looking at this product closely.

Built from scratch on top of the Amazon Web Services (AWS) cloud infrastructure, it has so many of the features I've longed for as a data warehouse builder.

  1. Separates storage from compute: Snowflake completely separates storage from compute, allowing you to add compute resources based on demand and to separate your workloads by type, for instance one set of compute resources for your main production users and a separate set for your ETL jobs. In true AWS fashion you can turn resources on and off, or resize them as required, and you only pay for what you are using.
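    A rough sketch of what that looks like in Snowflake SQL is below; the warehouse names, sizes, and settings are hypothetical, not a recommendation.

    ```sql
    -- One warehouse for interactive production users (names and sizes are illustrative).
    CREATE WAREHOUSE IF NOT EXISTS prod_users_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND   = 300      -- suspend after 5 idle minutes so you stop paying
      AUTO_RESUME    = TRUE;

    -- ...and a separate warehouse for ETL jobs, sized independently.
    CREATE WAREHOUSE IF NOT EXISTS etl_wh
      WAREHOUSE_SIZE = 'XLARGE'
      AUTO_SUSPEND   = 60
      AUTO_RESUME    = TRUE;

    -- Resize on demand for a heavy load window, shrink back down afterward,
    -- or suspend the compute entirely when it is not needed.
    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XXLARGE';
    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';
    ALTER WAREHOUSE etl_wh SUSPEND;
    ```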
  2. Clone: This feature is a game changer. Built into the Snowflake feature set is the ability to make a copy of a table, a schema or a database using the CLONE keyword. The copy is managed in metadata, so it does not double the storage: it points to the table, schema or database as of the point in time the command was issued, and subsequent updates to that data are stored as deltas. This has the potential to let you use a rapidly created clone of production data for QA and test environments, as sketched in the example at the end of this item. Getting a close copy of production into the QA or test database has been the bane of database developers for my entire career and before.
    At one client site the process of refreshing QA was taking between one and two weeks. It looked something like this: the DBA did a full export, compressed the data files, and copied them across the country to the QA data center. Then, in the QA environment: uncompress the files, drop all indexes, truncate or drop all tables, load the production data into the QA database, re-add all indexes, analyze the tables, test, then apply masking algorithms to the sensitive data. We were lucky if that got done in two weeks, and during those two weeks production loads were continuing, so the QA data drifted farther and farther from the data in production.
    At another client site they didn't even try and had developers working in production. The deadline for each day's development and testing was mid-afternoon, when the next set of batch jobs kicked off, giving a whole new meaning to "I don't always test my code, but when I do, I do it in production."
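    A minimal sketch of cloning in Snowflake SQL, assuming hypothetical database and table names:

    ```sql
    -- Zero-copy clone of an entire production database for QA.
    CREATE DATABASE qa_db CLONE prod_db;

    -- A clone can also be taken as of a point in time (here, one hour ago)
    -- by combining CLONE with Time Travel.
    CREATE TABLE qa_db.public.orders_snapshot
      CLONE prod_db.public.orders AT (OFFSET => -3600);
    ```

    Sensitive columns would still need masking in the clone, but the copy itself is a metadata operation rather than a weeks-long export and reload.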
  3. Native SQL: Some competing products are SQL-like (Google Data) or need an entire ecosystem of products (and learning curves, or hard-to-find resources) to get to the point of using something like SQL (Hadoop). Hadoop has its place in certain scenarios, but it is not the right solution for general data warehouse needs, yet so many people have been jumping on the Hadoop bandwagon. See my post on resisting the Hadoop Imperative.
    The Snowflake pitch is that it uses standard SQL and requires no new learning curve to start writing queries.
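    For instance, an ordinary ANSI-style query like the one below (table and column names are made up for illustration) runs as-is:

    ```sql
    -- Monthly sales by region: nothing Snowflake-specific required.
    SELECT c.region,
           DATE_TRUNC('month', o.order_date) AS order_month,
           SUM(o.amount)                     AS total_sales
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= '2015-01-01'
    GROUP  BY c.region, DATE_TRUNC('month', o.order_date)
    ORDER  BY order_month, c.region;
    ```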
  4. Time Travel built in: This feature allows you to see your data as of a point in time. Need to see what the data looked like before the last batch cycle, or at the beginning of the week? No problem. Snowflake automatically allows querying the data as of a point in time. By default it keeps one day of history, but that can be set to as much as 90 days.
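    A short sketch of what those queries look like, with hypothetical table names and timestamps:

    ```sql
    -- Query a table as it looked at a specific moment.
    SELECT *
    FROM   orders AT (TIMESTAMP => '2015-06-01 00:00:00'::TIMESTAMP_LTZ);

    -- Or relative to now, e.g. as of 24 hours ago (the offset is in seconds).
    SELECT *
    FROM   orders AT (OFFSET => -60*60*24);

    -- Extend the retention window beyond the one-day default.
    ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;
    ```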