Modern Data Stack Essentials: Tools and Strategies for Data-Driven Success

In this blog post, I’ll guide you through the key elements that constitute the Modern Data Stack. We’ll delve into their roles, significance, and share insights and recommendations drawn from our experience at Cloud Data Consulting.
The evolution towards cloud-native SaaS offerings has revolutionized the technology landscape, marking a pivotal shift in the way companies approach their technology stack. While on-premise solutions retain their importance for specific security and compliance needs, the vast array of modern cloud solutions brings unparalleled convenience and flexibility to the table.
Against this backdrop, the “Modern Data Stack” emerges as a pivotal ensemble of technologies and services, crafted specifically for the cloud, to empower and enhance data analytics initiatives. At Cloud Data Consulting, our expertise lies in crafting robust data warehouses, guiding our clients through the intricacies of business intelligence, data analytics, and machine learning programs. Drawing on our extensive experience, we’ve identified what we believe to be the crème de la crème of tools and services essential for any cutting-edge data strategy.
Basic Concepts
In this first section, “Basic Concepts,” I break down the core components of a data analytics project. These are the essential technologies required for a project’s success.
Data Warehouse
At its core, the Data Warehouse is the fulcrum of any data analytics program. Your application data needs to hang out somewhere: this is where raw application data is loaded (or ingested), transformed, and managed.
Cloud platforms offer scalable processing power, handle complex transformations efficiently, and accommodate a wide variety of data types and sizes, all of which is often challenging in traditional on-premise setups.
CDC recommends Snowflake for its unparalleled flexibility in scaling compute and storage independently. It is competitively priced, available on all major cloud providers, and constantly updated.
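To make this concrete, here is a minimal Python sketch of querying a cloud warehouse using the official snowflake-connector-python package; the account, warehouse, database, and table names are placeholders rather than real CDC resources.
```python
# Minimal sketch: querying Snowflake from Python with the official connector.
# Requires: pip install snowflake-connector-python
# All connection values below are placeholders for illustration only.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],   # e.g. "xy12345.us-east-1"
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",                  # placeholder warehouse
    database="ANALYTICS",                      # placeholder database
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # A simple aggregate over a hypothetical raw orders table.
    cur.execute("SELECT COUNT(*) FROM RAW_ORDERS")
    print("row count:", cur.fetchone()[0])
finally:
    conn.close()
```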
The “EL” of “ELT” (Extract, Load, Transform)
The E and L components of ELT represent how data travels from the source system (system of record) to the data warehouse. Essentially, they serve as a bridge between the source system and the data warehouse, importing/ingesting the source data and loading it into the data warehouse where it becomes ready for transformation. This process is integral to the Modern Data Stack and begins to address questions about data freshness: How often does the data need to be updated?
CDC recommends Matillion and Fivetran as complementary, and occasionally overlapping, ELT tools. Whereas Fivetran offers effortless connectivity and data ingestion management with its pre-built connectors, Matillion provides flexibility with its support for custom connectors.
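Dedicated ELT tools handle connectors, scheduling, and schema drift for you, but a hand-rolled sketch helps show the shape of the “E” and “L” work. The following Python snippet pulls records from a hypothetical JSON API and appends them, untransformed, to a raw landing table; the endpoint, table, and credentials are placeholders, not how Fivetran or Matillion work internally.
```python
# Minimal hand-rolled "EL" sketch: extract records from a (hypothetical) JSON API
# and load them into a raw landing table in the warehouse. Tools like Fivetran or
# Matillion manage this for you; this only illustrates the shape of the work.
# Requires: pip install requests snowflake-connector-python
import os

import requests
import snowflake.connector

API_URL = "https://api.example.com/v1/orders"  # hypothetical source system endpoint

def extract():
    """Pull raw order records from the source system."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return response.json()  # assume a list of {"id": ..., "amount": ...} dicts

def load(conn, records):
    """Append raw records, untransformed, into a RAW landing table."""
    rows = [(r["id"], r["amount"]) for r in records]
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO RAW_ORDERS (ID, AMOUNT) VALUES (%s, %s)",
        rows,
    )

if __name__ == "__main__":
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="ANALYTICS",
        schema="RAW",
    )
    try:
        load(conn, extract())
    finally:
        conn.close()
```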
Transformation
The “T” in ELT represents Transformation, and due to its importance, it gets its own section. Raw data, the data ingested from the source system as-is, has not yet been changed, modified, or transformed into an optimized model built for end-user applications (such as BI, data analytics, or machine learning).
CDC recommends dbt (Data Build Tool) as our preferred transformation tool. dbt lets developers use “SQL on steroids” (SQL with macro/function utility) to perform sequential, step-wise transformations from the raw layer to the stage layer, intermediate layer, and finally the analytics layer, where the resultant star schema (an analytics-optimized data model) serves data to the end-user application.
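dbt compiles its models into ordered SQL that runs inside the warehouse. The sketch below is not dbt itself, just a plain-Python illustration of the layered pattern: each step builds the next layer from the one before it, using hypothetical table and column names.
```python
# Sketch of the layered raw -> stage -> analytics pattern as plain sequential SQL.
# In practice dbt generates, orders, and tests SQL like this for you; the table and
# column names here are hypothetical.
import os

import snowflake.connector

STEPS = [
    # Stage layer: light cleanup and typing of the raw landing data.
    """
    CREATE OR REPLACE TABLE STG_ORDERS AS
    SELECT ID AS ORDER_ID,
           CAST(AMOUNT AS NUMBER(12, 2)) AS ORDER_AMOUNT
    FROM RAW_ORDERS
    WHERE ID IS NOT NULL
    """,
    # Analytics layer: an aggregate fact table ready for BI consumption.
    """
    CREATE OR REPLACE TABLE FCT_DAILY_ORDERS AS
    SELECT CURRENT_DATE AS LOAD_DATE,
           COUNT(*) AS ORDER_COUNT,
           SUM(ORDER_AMOUNT) AS TOTAL_AMOUNT
    FROM STG_ORDERS
    """,
]

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="ANALYTICS",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    for sql in STEPS:  # run each transformation step in order
        cur.execute(sql)
finally:
    conn.close()
```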
BI Tools
Many businesses develop data analytics programs with BI in mind. BI tools allow for easy aggregation and visualization of data from the data warehouse. While there are many BI tools in the marketplace, not all are created equal, and they can differ widely in their core data philosophies (how they architect the underlying data model, semantic layer, or dataset structure).
CDC recommends Looker, Tableau and PowerBI as primary candidates for BI tools as they are widely adopted, have large communities for support, and have strong market penetration.
Complementing Concepts
Now that the core concepts are covered, we can move on to the complementing concepts: the workflows, practices, and tools that support, enhance, and add resilience to the concepts discussed above.
Source Code Control
Source code control is a method, or series of practices, that allows a software project to progress incrementally while preserving the integrity of the project.
CDC recommends Git, hosted on GitHub. That being said, any source code control is better than none. While there are competitors, GitHub offers the tooling, support, and ease of use that make it the industry standard for a reason.
Several BI tools, including Looker and PowerBI (preview), offer source code control solutions that enhance/protect the BI development workflow. Again, CDC strongly advocates for making use of these capabilities.
Orchestration and Automation
While it is possible to manually initiate every step required to move data from a source system into the data warehouse and through to a BI tool, the goal is to orchestrate and automate these processes so that manual intervention is not required, reducing overhead and producing a more reliable, resilient system. This involves configuring the source system, ELT tools, data warehouse, and BI tools so that when one process finishes, it triggers a message or call for the next in the series to begin. Many tools offer ways to establish schedules and triggers for the functions they provide, and there are also tools built specifically for managing the communication between tools, servers, databases, and so on.
CDC has had great success leveraging GitHub Actions as a low-to-no-cost solution for orchestration due to its accessible API, programmatic flexibility, industry adoption, and supportive community.
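One pattern we find useful, sketched below: when an upstream step (say, the ELT load) finishes, it fires a workflow_dispatch call to GitHub’s REST API so the next workflow starts automatically. The organization, repository, and workflow file names are placeholders.
```python
# Sketch: triggering a downstream GitHub Actions workflow when an upstream step
# finishes, using the workflow_dispatch event of the GitHub REST API.
# The repo, workflow file name, and token are placeholders for illustration.
# Requires: pip install requests
import os

import requests

OWNER = "example-org"          # placeholder GitHub organization
REPO = "data-pipeline"         # placeholder repository
WORKFLOW_FILE = "run_dbt.yml"  # placeholder workflow file in .github/workflows/

def trigger_workflow(ref: str = "main") -> None:
    """Ask GitHub Actions to start the named workflow on the given branch."""
    url = (
        f"https://api.github.com/repos/{OWNER}/{REPO}"
        f"/actions/workflows/{WORKFLOW_FILE}/dispatches"
    )
    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": ref},
        timeout=30,
    )
    response.raise_for_status()  # GitHub returns 204 No Content on success

if __name__ == "__main__":
    # e.g. call this after the ELT load completes so transformations start automatically
    trigger_workflow()
```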
Master Data Management
MDM is the process of formalizing the business practices surrounding how a company handles its data. The topic of MDM could serve as its own blog post (and will in the future), but at its core it constitutes defining the processes, governance, policies, standards, and tools that systematically define and manage the critical data of an organization. Often, this begins by simply discussing how the company currently handles its data (what happens when, who does what) and documenting those processes.
As data volumes grow and projects increase in complexity, leveraging the structure and foundation that a defined MDM program provides becomes imperative.
CDC recommends Non-Invasive Data Governance by Robert S. Seiner for an accessible and actionable primer on the topic.
CI/CD: Continuous Integration/Continuous Deployment
Humans are fallible and, even with the best of intentions, make mistakes. This can be especially problematic if faulty code is promoted to production, causing an outage or loss of service. Thankfully, programming practices have been developed that make it easier to ensure code is production-ready. To offer a more concrete definition, CI/CD refers to automating the processes of 1) integrating code changes and 2) testing those code changes before (and while) they are promoted to the production environment. This concept is the logical next step after implementing source code control practices and makes a project more robust and resilient against unanticipated or accidental downtime.
CDC’s recommendation is to use whatever purpose-built CI/CD tooling integrates with your project stack, covering automated testing of processes and communication triggers in the case of failures or exceptions.
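As a toy illustration of the kind of gate a CI job might run before code is promoted, the script below executes dbt’s test suite and fails the pipeline if any test fails; it assumes dbt is installed and that the script runs from the dbt project directory.
```python
# Toy CI gate: run the project's dbt tests and fail the pipeline if any test fails.
# Assumes dbt is installed and this script runs from the dbt project directory;
# a real CI/CD setup would wire this into the platform's own job definitions.
import subprocess
import sys

def main() -> int:
    result = subprocess.run(["dbt", "test"], check=False)
    if result.returncode != 0:
        print("dbt tests failed; blocking promotion to production.", file=sys.stderr)
        # In a fuller setup, send a notification (Slack, email, etc.) here.
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```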
Infrastructure as Code (IaC)
Infrastructure as Code, broadly defined, is scripting the deployment of computing infrastructure through code rather than physically managing hardware or manually setting up cloud resources. This offers many advantages: 1) cloud infrastructure can be set up, managed, and destroyed considerably faster thanks to automation, 2) projects can be easily scaled and customized through code modifications, and 3) each environment can be custom-built for the project’s requirements. Of course, development time is needed to build an IaC library, but the upfront cost is usually worth the expense, especially if the infrastructure will be built multiple times.
CDC’s recommendation is Terraform. HashiCorp supports a wide range of vendors, including the major cloud providers AWS, Azure, and GCP. It’s worth mentioning that CDC uses Terraform to deploy both our Snowflake Security Framework and our AWS Matillion Jumpstart package.
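The infrastructure definitions themselves live in Terraform’s HCL files; the sketch below simply shows how the init/plan/apply lifecycle can be scripted so an environment is stood up (and torn down) repeatably. The infrastructure directory name is a placeholder.
```python
# Sketch: scripting the Terraform lifecycle (init -> plan -> apply) so an environment
# can be stood up repeatably. The real infrastructure definitions live in the .tf files
# inside the (placeholder) "infra/" directory; this only automates the CLI calls.
import subprocess

INFRA_DIR = "infra"  # placeholder directory containing *.tf files

def run(args: list[str]) -> None:
    """Run a Terraform command in the infrastructure directory, raising on failure."""
    subprocess.run(["terraform", *args], cwd=INFRA_DIR, check=True)

if __name__ == "__main__":
    run(["init", "-input=false"])
    run(["plan", "-input=false", "-out=tfplan"])
    run(["apply", "-input=false", "tfplan"])
    # run(["destroy", "-auto-approve"])  # tear the environment back down when finished
```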