The really agile way to build Data Analytics Platforms!

By Taylor Fulton

Learning dbt has completely changed my process!

Introduction & problem description

While recently reflecting on how I would support a team learning to build a new data warehouse in the cloud, I realized that my process is completely different now that I have been using dbt on several projects over the last couple of years. In a nutshell: after years of building data warehouses and data marts, and of reading books and articles about how to be agile, I have finally experienced working as an agile team, with tools that support and encourage an agile process, and with the people and knowledge that let it all come together, producing results at an ever-increasing velocity!

In this article, I will describe the old way of building data marts and data warehouses and compare that to our recent experiences with Snowflake and dbt, and how we have taken what we learned using dbt with data ingested by Fivetran and generalized it to use with Matillion, custom code or other tools. What we learned informed and altered how we think about, design, and manage our Snowflake environment. By carefully organizing data into environments and applications we empower our teams to be as agile as possible while still maintaining good security and software engineering practices.

Over my ~25 years of developing data reporting systems (data warehouses, data marts, operational data stores, etc.), I noticed, to my chagrin, that this corner of software development was one of the last to embrace modern software engineering practices such as data modeling, scripted database deployments (with a backout plan), CI/CD, unit tests, data tests, and even modern source code control. On way too many engagements I would arrive to find that the team had no data model, no source code control, and was doing large manual database deployments without even logging what happened in the deployment.

Until fairly recently, we would design our target data models in our data modeling tool of choice and then attempt to shoehorn all of the source-to-target specification information into a series of tabs in an Excel spreadsheet.

The data model itself could take weeks, if not months, of work as we painstakingly attempted to create a perfect set of star schemas in Erwin (or another data modeling tool), with naming standards, data type standards, and documentation standards, all backed by a formal model review process. We frequently did all of this in what I now see as a huge knowledge vacuum. We then dumped all of this target metadata into an Excel spreadsheet and started the laborious process of adding the source table, column, and table join information, along with any transformations we wanted the ETL developer to apply to the data. It was a mind-numbing and error-prone process.

On one team at a large financial services company, we, as project architects, had to certify in blood that this was the true and final version of the specification before the developers would accept it and start work. This "offshore" team would then run the mind-numbing, error-prone process in reverse, turning the spreadsheet into a program that would load the target data model. After that, they would turn the QA process over to a totally separate team of people, who would look at the spreadsheet and attempt to make sure that the program was doing what the spreadsheet said it should do.

After the QA team signed off, we would let the BI tool team do their part: connect to the data and build some reports before the formal user acceptance test phase started.

At this point, it was frequently six to nine months (sometimes longer) since I had started working on the project, and all too often the request had already sat in the backlog for another three to six months before being prioritized. User Acceptance Testing frequently would not allow for changes to requirements that had been worked out months earlier, so the users finally saw some semblance (or not) of what they had requested all those months ago. Frequently it was no longer what they needed or wanted.

In my opinion, this was a sincere, best-efforts approach to scaling data warehouse development while trying to control costs. Unfortunately, from my perspective, it produced sub-optimal results: very slow time to market, with brittle, slow processes producing brittle, slow systems.

Another approach

Flash forward: it is now 2022!

On the data engineering side we have a whole new set of really great tools; Snowflake, Matillion, Fivetran, dbt, Coalesce, Dataedo, SqlDBM and others. On the BI side, we have a mix of new and old names; Sigma, Looker, Tableau, MS Power BI, and ThoughtSpot just to name a few.

But no matter the toolset, when we do what we always did, we still get what we always got. Our processes and methods, the way we used our tools, and the way we worked together (or not) still amounted to a waterfall process, albeit with shorter iterations. I will say it again: even as we called our teams Agile, used Jira, and renamed our project managers "Scrum Masters," we were still doing an iterative waterfall process. Working this way was much slower, and hence much more expensive, than it needed to be.

New Formula

Velocity ⟷ agile team + agile tools + agile process

A corollary might be: Velocity ⇔ knowledge + skill + tools + teamwork

In the “Agile” world, velocity is the term used to describe the rate at which a team is able to chew through work in a backlog of “user stories” while maintaining an acceptable level of quality.

I assert that on an agile team, titles are much less important than skills, knowledge, motivation, and the ability to work as a team.

Here are a few distinctions I have been seeing lately that I think are helpful when thinking about roles on the project team:

Subject Matter Expert (SME) - Depending on the size and complexity of the organization, there are frequently many subject matter experts, and it is critical to find the appropriate one(s) for each project or subject area/domain. These are typically experts on source systems and source system data, as well as on a particular business unit's current processes and procedures. A sufficiently knowledgeable SME is a critical success factor for any data project.

Data Engineer

  • Very strong in the tools, techniques, and processes for ingesting and transforming data.
  • Sometimes has less knowledge on the business side of things.
  • Frequently does not work with the BI tools.

Analytics Engineer

  • Very strong on the business and analytics side of things
  • Knows or learns, the data very well
  • Strong on the BI reporting and tool(s)
  • May be less strong on the ingestion and transformation tools and interfaces, such as API programming.
  • Helpful to know or learn agile star schema design

Data Modeler

  • Skilled with the team’s data modeling tool.
  • Able to lead discussions with business users to understand and document how the business sees itself, and the data it uses to perform its role in the overall hierarchy of the organization.
  • Able to document high-level top-down logical models from working with the business users, and reverse engineer and organize bottom-up models of the source system data as well as the as-built data model.
  • Agile star schema/dimensional design

Data flow and process description

The “dbt” process:

1. Load Raw Data (Data Engineers)

Get RAW data into production as a separate workstream. This lets the Data Engineers work somewhat independently, since they only need access to the source system and to be told which sets of data to ingest. The data should be loaded as close as feasible to how it exists in the source system: the same object/table names and the same attribute/column names, with very few exceptions. One such exception is making sure we don't end up with lower- or mixed-case names in the Snowflake raw databases. While the naming of the tables and columns should stay as close as feasible to the source system, we do need to take care in how we name the schemas where we will be putting the data. Case in point: many organizations use Salesforce. If an organization has multiple instances of Salesforce with different purposes, we should pick a name for each schema that reflects the purpose of that Salesforce instance. So if the only thing in the instance is Affiniquest, we should name the schema something like SFDC_AFFINIQUEST. This team should also be responsible for developing the process that ensures the data from the source matches the data loaded into the RAW database(s).
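
To make the schema naming point concrete, here is a minimal Snowflake sketch of landing that Salesforce instance. The RAW database name, the column list, and the _LOADED_AT audit column are illustrative assumptions rather than a prescribed convention:

```sql
-- Illustrative only: land source objects with names kept close to the source,
-- in upper case so Snowflake does not store quoted, mixed-case identifiers.
CREATE DATABASE IF NOT EXISTS RAW;
CREATE SCHEMA IF NOT EXISTS RAW.SFDC_AFFINIQUEST;  -- named for the instance's purpose

CREATE TABLE IF NOT EXISTS RAW.SFDC_AFFINIQUEST.ACCOUNT (
    ID          VARCHAR,
    NAME        VARCHAR,
    CREATEDDATE TIMESTAMP_NTZ,
    ISDELETED   BOOLEAN,
    _LOADED_AT  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()  -- simple load audit column (assumption)
);

-- A simple reconciliation check against the source extract's row count:
SELECT COUNT(*) FROM RAW.SFDC_AFFINIQUEST.ACCOUNT;
```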

2. Document/Model the RAW data (Analytics Engineers)

  • Pull together vendor and organizational documentation
  • Reverse engineer the raw database schemas into the modeling tool and add the appropriate PKs, AKs, and FKs. The exception is when the existing documentation is sufficient; sometimes it is, but many times it is either very high level or out of date.

3. Code generate stage layer objects (Analytics Engineer/Data Engineer)

  • Source configuration yml file

  • Stage sql model for each raw table, with limited transformations (see the sketch after this list):

    • trim(all string columns)
    • Rename ID fields to TABLE_NAME_ID.
    • Logic for is_most_recent_record flag if required
    • Logic to deduplicate raw data if required
  • Stage .yml configuration file for each table created above.

    • Configure test for unique and not null for the primary key column(s)
    • Optionally configure tests for relationships.
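
As a concrete illustration, here is a minimal sketch of the kind of staging model this step produces, assuming the RAW.SFDC_AFFINIQUEST.ACCOUNT table from step 1 and a source named sfdc_affiniquest declared in the source configuration yml file; all names are illustrative:

```sql
-- models/staging/sfdc_affiniquest/stg_sfdc_affiniquest__account.sql (illustrative)
with source as (

    select * from {{ source('sfdc_affiniquest', 'account') }}

),

renamed as (

    select
        trim(id)    as account_id,    -- rename ID to TABLE_NAME_ID
        trim(name)  as account_name,  -- trim all string columns
        createddate as created_date,
        isdeleted   as is_deleted,
        _loaded_at,
        row_number() over (
            partition by id
            order by _loaded_at desc
        ) = 1       as is_most_recent_record  -- flag, if required

    from source

),

deduplicated as (

    select * from renamed
    where is_most_recent_record  -- deduplicate the raw data, only if required

)

select * from deduplicated
```

The matching stage .yml file would then declare the unique and not null tests on account_id described above.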

4. Iteratively design and Build Data Mart objects (Analytics Engineer, SME & Data Modeler)

  • Populate the data mart bus matrix document

    • Analyze and document a list of candidate fact tables noting likely source tables.
    • Analyze and document a list of candidate dimensions noting likely source tables
  • Start with a fact or dim and write the SQL to produce the desired dataset.

    • The sources for this SQL need to be the staging layer objects built in the prior step.
    • Alias the source column name to the standards-based target column name
    • Cast the data type to the appropriate standard data type
    • The SQL must be written in CTE (common table expression) syntax; this is critical for reducing the time it takes to convert each query to use the Jinja syntax required by dbt.
  • Review the SQL and the data produced

    • Names follow standards
    • Data types follow standards
    • Data values are as expected
  • Port this SQL to dbt (see the sketch after this list)

    • Tablename.sql - translate source names to jinja refs
    • Tablename.yml - code-generator assisted; add minimum tests
  • dbt run -s tablename && dbt test -s tablename

  • Fix any issues

  • Move on to the next table
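
Here is a sketch of what one such model might look like once ported to dbt, written in CTE style with Jinja refs; it assumes the staging model from step 3, and the names are illustrative rather than a prescription:

```sql
-- models/marts/dim_account.sql (illustrative)
with accounts as (

    select * from {{ ref('stg_sfdc_affiniquest__account') }}

),

final as (

    select
        account_id,                                  -- standards-based name carried from staging
        account_name,
        cast(created_date as date) as created_date,  -- cast to the standard data type
        is_deleted
    from accounts

)

select * from final
-- then: dbt run -s dim_account && dbt test -s dim_account
```

Because the staging layer already handled trimming, renaming, and deduplication, the mart model can stay focused on shaping the data for analytics.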

5. After each star schema is completed

  • Build it in the test environment (see the command sketch after this list)
  • Iteratively pull new facts and dimensions into the data analytics tool (Analytics Engineer)
    • Prep the presentation layer as appropriate for the BI tool
    • Sanity check the data
    • Make some reports
    • Validate the reports and the data
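
As one possible way to build a completed star schema in the test environment with dbt, assuming profiles.yml defines separate test and prod targets, and using hypothetical model names dim_account and fct_account_activity:

```
# Build the new star schema and everything it depends on against the test target,
# running both the models and their configured tests.
dbt build --select +dim_account +fct_account_activity --target test
```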

6. As each subset of models (tables) is tested and approved

  • Migrate ELT to production
  • Migrate BI objects to production

Vive La Différence

The data, the whole data, and nothing but the data!

So let's talk about this process and why I think it yields improved velocity. In the old days we would frequently start with a process that forbade our engineers from seeing the actual production data of our systems. To develop anything we needed some data, and the practice back in the day was to try to synthesize it. The problem is that it is hard to synthesize realistic data sets at scale: data is messy, even when there are no outright data quality issues.
