Understanding dbt project structure for quality assurance
In this guide, we take a close look at data quality assurance using dbt (data build tool). Data quality is paramount in analytics and decision-making: to ensure the reliability, accuracy, and consistency of your data models, you need a robust testing framework and a well-organized project structure.
Here are the key files and directories you’ll be working with in a dbt project (a typical layout is sketched after the list):
- profiles.yml: Located in the ~/.dbt/ or %USERPROFILE%\.dbt\ directory, this file contains your database connection settings. It allows you to set up multiple profiles for different projects or environments.
- models: This directory contains your data models, i.e. your SQL transformation files. Each file defines a single model, which dbt materializes as a table, view, or materialized view.
- macros: Macros are reusable pieces of SQL code that can be referenced in your models. You can store generic tests here or in the tests/generic folder.
- snapshots: Contains snapshot files that define how to capture the state of specific tables in your database over time.
- tests: The directory in which you store test SQL files (singular tests) for your data models. These tests help ensure data quality and consistency.
- seeds: Seeds are CSV files containing static raw data. dbt loads these files into tables in your specified schema. Seeds can provide sample data for testing your dbt models or other data processing logic.
- analyses: The analyses directory contains ad-hoc SQL files for exploring data and performing analysis; dbt compiles them but does not execute them.
- target: A directory created automatically when you run commands such as dbt run or dbt compile. It contains the compiled SQL for your models and is useful when debugging the pipeline.
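Putting this together, a typical layout might look like the sketch below; the project name and the individual file names are hypothetical (profiles.yml lives outside the project, in ~/.dbt/):

```
my_dbt_project/
├── dbt_project.yml          # main project configuration
├── packages.yml             # optional package dependencies
├── models/
│   ├── schema.yml           # model properties and generic tests
│   └── stg_orders.sql       # an example transformation model
├── macros/
├── snapshots/
├── tests/
│   └── assert_no_negative_amounts.sql   # an example singular test
├── seeds/
│   └── country_codes.csv
├── analyses/
└── target/                  # created automatically by dbt run / dbt compile
```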
By understanding the key files and directories in your dbt project, you can effectively organize, manage, and scale your data transformation processes while ensuring data quality in your project.
Overview of dbt’s testing framework
dbt’s testing framework is designed to ensure data quality and consistency by validating the data within your models. It provides built-in tests, as well as the ability to create custom tests tailored to your specific data requirements. The testing framework is an essential component of any dbt project as it promotes trust in your data and helps identify issues early in the development process.
dbt’s testing framework includes the following components:
Generic Tests:
These are predefined tests that validate the structure of your data. dbt ships with four of them, but you can create and add more. The built-in four are:
- unique: Ensures that a specified column has unique values.
- not_null: Checks that a specified column does not contain null values.
- accepted_values: Validates that a column contains only specified values.
- relationships: Ensures that foreign key relationships between tables are consistent.
You can configure generic tests in the schema.yml file associated with your models.
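As an illustrative sketch, here is how these four tests might be declared in a schema.yml file; the model and column names are hypothetical:

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```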
Custom Data Tests:
Custom data tests allow you to define your own SQL queries to test specific data requirements not covered by generic tests. These tests are written in individual SQL files and stored in the tests directory of your dbt project. When creating custom data tests, ensure the SQL query returns zero rows for a successful test or one or more rows for a failed test.
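For example, a singular test asserting that no order has a negative amount might look like the sketch below (the model and column names are hypothetical); any rows the query returns are reported as failures:

```sql
-- tests/assert_no_negative_amounts.sql
-- Zero rows returned means the test passes; returned rows are failures.
select
    order_id,
    amount
from {{ ref('orders') }}
where amount < 0
```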
Test Configuration:
dbt lets you configure tests by setting severity levels, adjusting error thresholds, or even disabling specific tests. These configurations can be defined in the dbt_project.yml file or directly within the schema.yml file for individual tests.
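As a sketch, severity and thresholds can be attached to an individual test like this (the model, column, and threshold values are hypothetical):

```yaml
# models/schema.yml -- thresholds are illustrative
version: 2

models:
  - name: orders
    columns:
      - name: status
        tests:
          - not_null:
              config:
                severity: error
                warn_if: ">10"     # warn when more than 10 rows fail
                error_if: ">100"   # error only when more than 100 rows fail
```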
Test Execution:
To execute tests in dbt, use the dbt test command. This command runs all the tests defined in your project, including generic and custom data tests. The results are displayed in the console, indicating the success or failure of each test, along with any relevant error messages.
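For instance, on recent dbt versions you can run the whole suite or narrow the selection (the model name below is hypothetical):

```bash
dbt test                               # run every test in the project
dbt test --select orders               # run only tests attached to the orders model
dbt test --select test_type:singular   # run only singular (custom) tests
```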
Test Documentation:
dbt’s testing framework also integrates with other features, such as documentation. When you generate documentation for your project, test information is included, providing a comprehensive overview of the quality checks performed on your data models.
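Generating and browsing the documentation, tests included, takes two commands:

```bash
dbt docs generate   # compile the project and build the documentation artifacts
dbt docs serve      # browse models, columns, and their tests locally
```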
By integrating data tests into your development workflow, dbt’s testing framework empowers you to actively safeguard the reliability and accuracy of your data models. This proactive approach ensures that potential data issues are identified and rectified early in the development process, preventing inaccuracies and inconsistencies from proliferating through your data pipeline. As a result, you can trust that your data models consistently produce high-quality, dependable insights crucial for informed decision-making.
Tips for setting up your testing environment
Setting up a testing environment for your dbt project is crucial to ensure data quality and integrity. Here are some tips to help you create an efficient and effective testing environment:
- Use separate targets in profiles.yml for development and production: dbt supports multiple targets within a single profile to promote the use of separate development and production environments (a sample profiles.yml is sketched after this list).
- Use the ref() macro whenever possible: dbt’s own documentation highlights it as the most important macro. It is used to reference other models and lets dbt document data lineage. Additionally, because ref() is resolved against the active target, you can test changes simply by pointing the target at a testing database (see the model sketch after this list).
- Use dbt seeds: dbt seeds allow you to load CSV files into your database, which can be helpful for creating sample data sets for testing. You can configure seed files in your dbt_project.yml and use the dbt seed command to load data into your database.
- Begin with Generic Tests: Start by implementing the built-in generic tests provided by dbt, such as unique, not_null, accepted_values, and relationships. These tests cover essential data validation requirements and help you maintain the overall structure and integrity of your data models.
- Implement your own data tests: Create tests for your models to validate the data’s quality and consistency. dbt offers two types of tests: generic and singular. Generic tests validate the structure of your data and are highly reusable, while singular tests let you write specific SQL queries against your data. A singular test can be promoted to a generic one, so it is often helpful to write it first, check that it works, and only then promote it.
- Prioritize critical data attributes: Focus on testing the most critical aspects of your data, such as key business metrics, important relationships between tables, and mandatory fields. Prioritizing these attributes ensures that the most vital aspects of your data are accurate and reliable without consuming many additional resources.
- Organize and structure your tests: Organize your tests by creating separate directories for schema tests, column value tests, etc. This structure makes it easier to navigate and manage your tests, as well as understand the coverage of your data models.
- Configure test severity and thresholds: Adjust the severity levels and error thresholds of your tests to suit your specific needs. For instance, you might configure certain tests as warnings and others as errors. Customizing these settings helps differentiate issues that require immediate attention from ones that can be addressed later.
- Use Continuous Integration (CI): Incorporate continuous integration tools, such as GitHub Actions, GitLab CI/CD, or Jenkins, to automatically run your tests whenever changes are pushed to your code repository. This practice ensures that data tests are consistently executed and helps identify issues early in the development process.
- Perform incremental testing: To improve testing efficiency, consider incremental tests that only validate new or modified data instead of re-testing the entire dataset. You can implement this kind of testing by adding conditions to your SQL queries that target only new or modified records. Additionally, you can tag your tests and run only those with the specified tags when you want to test just part of the system (the tag configuration is sketched after this list).
- Document your setup: Provide values for the “description” key wherever possible. Good documentation helps future stakeholders, such as data analysts or engineers, to easily understand the purpose of models and extend them when appropriate.
- Review and update tests regularly: Regularly review and update your data tests to ensure they remain relevant and effective. As your data models evolve, so should your tests.
- Monitor test results: Keep an eye on the test results to identify and address any issues or patterns in your data. Monitoring will help you maintain high-quality data in your project.
- Use limit: There is rarely a need to save all failed records to a table. If 2 billion rows fail, it is not efficient to save them all again; usually a handful of records is enough for debugging. Use the limit config on tests that might fail with lots of records (see the last sketch after this list).
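To illustrate the first tip, a profiles.yml with separate development and production targets might look like this sketch; the profile name, hosts, schemas, and credentials are hypothetical, and Postgres is just an example adapter:

```yaml
# ~/.dbt/profiles.yml -- all names and credentials are illustrative
my_project:
  target: dev                # default target
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: prod-db.internal
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: analytics
      threads: 8
```

You can then run against production explicitly with dbt run --target prod, while day-to-day development defaults to the dev target.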
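The ref() tip is illustrated by the hypothetical model below: because both references go through ref(), dbt records the lineage and resolves the tables against whichever target is active:

```sql
-- models/customer_orders.sql -- model names are illustrative
select
    c.customer_id,
    count(o.order_id) as order_count
from {{ ref('customers') }} as c
left join {{ ref('orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id
```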
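Finally, tags and a failure limit can be combined on a single test, supporting the incremental-testing and limit tips; the tag name and threshold below are hypothetical:

```yaml
# models/schema.yml -- tag name and limit are illustrative
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                tags: ['critical']
                limit: 100   # return/store at most 100 failing rows
```

With this in place, dbt test --select tag:critical runs only the tagged tests.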
By following these tips, you can set up a robust testing environment that helps ensure the quality and integrity of your dbt project, allowing you to build and maintain reliable, accurate, and valuable data models.
Community-made packages
The dbt community has created several packages that extend the built-in testing capabilities and help improve data quality in your projects. These packages offer additional tests, macros, and utilities to help you effectively manage your testing process. Some popular community-made testing packages include:
dbt-utils: The dbt-utils package is a collection of macros and tests that can be used across different projects. It includes tests for handling more complex scenarios, such as testing whether a combination of columns is unique across a table or asserting that a column has values in a specified range. You can find the package on GitHub.
dbt-expectations: Inspired by the Great Expectations Python library, this package provides a suite of additional data tests to expand the built-in test functionality of dbt. It covers a wide range of data quality checks, such as string length tests, date and timestamp validations, and aggregate checks. The package is available on GitHub.
dbt-date: The dbt-date package is a collection of date-related macros designed to simplify working with date and time data in dbt projects. It includes macros for generating date ranges and creating date dimensions. It is a useful and readable abstraction that can help you create new tests for datetime fields in your models, as well as the models themselves. You can find the package on GitHub.
dq-tools: The dq-tools package’s purpose is to provide an easy way to store test results and visualize them in a BI dashboard. The dashboard focuses on the six KPIs mentioned in the previous article: accuracy, consistency, completeness, timeliness, validity, and uniqueness. This package can be found on GitHub.
dbt-meta-testing: The dbt-meta-testing package is a tool for meta-testing your dbt project. It asserts test and documentation coverage. You can find the package on GitHub.
dbt-checkpoint: The dbt-checkpoint package provides a set of pre-commit hooks for dbt projects, which can automatically check, for example, that your models have descriptions and tests before changes are committed. You can find it on GitHub.
To use these packages in your dbt project, add them as dependencies in your packages.yml file and run dbt deps to download and install them. Once installed, you can use the additional tests, macros, and utilities they provide in your projects.
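For instance, such a packages.yml might look like the sketch below; the version numbers are illustrative, so check each repository for the current release:

```yaml
# packages.yml -- versions are illustrative
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1
```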
By leveraging community-made testing packages, you can enhance the testing capabilities of your dbt project, ensuring data quality and consistency throughout your data transformation processes.
Summary
dbt’s testing framework ensures data quality and consistency by providing built-in tests, custom tests, test configuration, test execution, and test documentation. Implementing data tests in the development process ensures data models remain reliable and accurate.
When setting up a testing environment you should: use separate targets for development and production; use the ref() macro and dbt seeds; begin with generic tests and implement your own; prioritize critical data attributes; organize and structure tests; configure test severity and thresholds; use continuous integration; perform incremental testing; limit stored failures; document the setup; review and update tests regularly; and finally, monitor test results.
Community-made testing packages, such as dbt-utils, dbt-expectations, dbt-date, dq-tools, dbt-meta-testing, and dbt-checkpoint, provide additional tests, macros, and utilities that enhance dbt’s testing capabilities, ensuring data quality and consistency throughout data transformation processes.
All content in this blog is created exclusively by technical experts specializing in Data Consulting, Data Insight, Data Engineering, and Data Science. Our aim is purely educational, providing valuable insights without marketing intent.