Data Vault 3.0 – The summary

In the second part of this article series about Data Vault I talked about data modelling and architecture. Now I return with a quick look at naming conventions and a summary of the whole topic. It is a great opportunity to learn something new or simply refresh your knowledge of Data Vault.

Naming conventions

As we have already seen, a Data Vault consists of a multitude of tables with different structures and purposes. With hundreds of such objects in the warehouse, working with them becomes impossible unless we establish the right naming rules.

Below is a sample set of prefixes for Data Vault objects:

Layer  Data Vault object      Name prefix
-----  ---------------------  --------------------
RDV    Hub                    H_
RDV    Satellite              S_
RDV    Multiactive satellite  SM_
RDV    Relational link        L_
RDV    Hierarchical link      LH_
RDV    Non-hierarchical link  LT_
BDV    Hub                    BH_
BDV    Satellite              BS_
BDV    Multiactive satellite  BSM_
BDV    Relational link        BL_
BDV    Hierarchical link      BLH_
BDV    Non-hierarchical link  BLT_
Other  PIT                    PIT_
Other  Bridge                 BR_
Other  View                   V_<DV_object_prefix>

In addition to prefixes, it is worth standardizing the naming of related objects, such as the satellites around a common hub, and the naming of links. Technical and business columns should also be named consistently; a dictionary of abbreviations and a dictionary of column prefixes and suffixes help to keep this in check.
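As an illustration, the conventions above could translate into DDL along these lines. This is only a sketch: the table names (H_CUSTOMER, S_CUSTOMER_DETAILS), the business columns, and the technical columns (hash key, load timestamp, record source, hdif) are hypothetical examples following common Data Vault practice, not a prescribed standard.

```sql
-- Hub: one row per business key, named with the H_ prefix.
CREATE TABLE H_CUSTOMER (
    CUSTOMER_HK   CHAR(32)     NOT NULL,  -- hash key derived from the business key
    CUSTOMER_BK   VARCHAR(50)  NOT NULL,  -- business key from the source system
    LOAD_DTS      TIMESTAMP    NOT NULL,  -- load timestamp
    REC_SRC       VARCHAR(50)  NOT NULL,  -- record source
    PRIMARY KEY (CUSTOMER_HK)
);

-- Satellite: descriptive attributes of the hub, historized by load
-- timestamp, named with the S_ prefix.
CREATE TABLE S_CUSTOMER_DETAILS (
    CUSTOMER_HK    CHAR(32)     NOT NULL,  -- points back to H_CUSTOMER
    LOAD_DTS       TIMESTAMP    NOT NULL,
    HDIF           CHAR(32)     NOT NULL,  -- hash of the payload columns, used for change detection
    CUSTOMER_NAME  VARCHAR(100),
    SEGMENT        VARCHAR(20),
    REC_SRC        VARCHAR(50)  NOT NULL,
    PRIMARY KEY (CUSTOMER_HK, LOAD_DTS)
);
```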

Recap

If you’ve made it this far, you should already have a rough idea of what Data Vault is, how to create it, and what its advantages are. In my opinion, to use the methodology correctly you also need to be aware of its disadvantages, so that you can prepare to mitigate them. For me, the fundamental disadvantage of Data Vault is the multiplicity of tables in the model and the difficulty of connecting them. Suppose we want to write a cross-sectional query that retrieves data from three business hubs, and that we need data from two satellites attached to each of those hubs; that is already nine tables. On top of that come the links between the hubs, and if the links carry satellites of their own, those must be included as well (say, two links and two link satellites), bringing the total to 9 + 4 = 13 tables to join.
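To make the join sprawl tangible, here is roughly what such a cross-sectional query starts to look like. Every table name is hypothetical, following the prefixes from the section above, and the "current row" filtering on each satellite's LOAD_DTS is omitted for brevity.

```sql
-- Three hubs, two satellites each, plus the links between them:
-- the join list grows quickly.
SELECT hc.CUSTOMER_BK, ho.ORDER_BK, hp.PRODUCT_BK,
       scd.CUSTOMER_NAME, soh.ORDER_DATE
FROM H_CUSTOMER           hc
JOIN S_CUSTOMER_DETAILS   scd ON scd.CUSTOMER_HK = hc.CUSTOMER_HK
JOIN S_CUSTOMER_CONTACT   scc ON scc.CUSTOMER_HK = hc.CUSTOMER_HK
JOIN L_CUSTOMER_ORDER     lco ON lco.CUSTOMER_HK = hc.CUSTOMER_HK
JOIN H_ORDER              ho  ON ho.ORDER_HK     = lco.ORDER_HK
JOIN S_ORDER_HEADER       soh ON soh.ORDER_HK    = ho.ORDER_HK
JOIN S_ORDER_STATUS       sos ON sos.ORDER_HK    = ho.ORDER_HK
JOIN L_ORDER_PRODUCT      lop ON lop.ORDER_HK    = ho.ORDER_HK
JOIN H_PRODUCT            hp  ON hp.PRODUCT_HK   = lop.PRODUCT_HK;
-- ...and the two product satellites and any link satellites are still missing.
```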


This multiplicity of tables creates challenges in several areas:

  • Performance
  • Difficulty in writing SQL queries against the model
  • Difficulty in documenting the model

Of course, each of these points can be addressed, but doing so requires additional work that one should be aware of.
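For instance, the PIT and Bridge objects from the prefix table above are the standard Data Vault answer to the first two points. A PIT (point-in-time) table pre-computes, for every hub key and snapshot date, which satellite rows were valid at that moment, so consumers join on plain equality instead of resolving history themselves. A minimal sketch, with a hypothetical PIT_CUSTOMER maintained by the loading framework (all table and column names are illustrative):

```sql
-- The PIT table stores, per customer and snapshot date, the LOAD_DTS of
-- the satellite rows valid at that moment; the query becomes simple
-- equi-joins instead of per-satellite history resolution.
SELECT hc.CUSTOMER_BK, scd.CUSTOMER_NAME, scc.EMAIL
FROM PIT_CUSTOMER pit
JOIN H_CUSTOMER         hc  ON hc.CUSTOMER_HK  = pit.CUSTOMER_HK
JOIN S_CUSTOMER_DETAILS scd ON scd.CUSTOMER_HK = pit.CUSTOMER_HK
                           AND scd.LOAD_DTS    = pit.S_DETAILS_LOAD_DTS
JOIN S_CUSTOMER_CONTACT scc ON scc.CUSTOMER_HK = pit.CUSTOMER_HK
                           AND scc.LOAD_DTS    = pit.S_CONTACT_LOAD_DTS
WHERE pit.SNAPSHOT_DATE = DATE '2024-01-31';
```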


The fragmentation of tables is, on the one hand, the disadvantage I mentioned above, but on the other hand it has its advantages. For data warehouses with many consumers, many sources, and many critical processes, fragmentation helps to minimize the impact of errors in data feeding. For example, suppose we read a small dictionary from a CSV file and use it to calculate a column in a Data Vault satellite. If the file does not arrive, or arrives with an error, only that one satellite in the warehouse fails to load.

The rest of the data warehouse, and the processes built on it, keep working correctly. With a different modelling approach, one that produces broad tables, a problem with one small element can block the feed of one of the most important warehouse tables and delay the most critical processes. Fragmentation also makes loading more timely: we store data as soon as it appears. There are no situations where we wait for data from, say, five sources that are then combined in ETL and stored together; in that approach the ETL can only start after all the inputs have arrived, so the write is delayed by the waiting time, unlike in Data Vault.

Fragmentation also helps when a data warehouse is developed by many independent teams that release their own changes. Data Vault is very "agile": the finer granularity of data and feeding processes means fewer dependencies between teams. It looks completely different when the model contains critical, broad tables modified by many teams; there, conflicts arise easily, and the effort required for integration and regression testing is much greater.


How do you effectively manage a Data Vault model? I don’t want to prescribe when to create a new satellite and under what rules, because in my opinion that must be tailored to the company and to how the data warehouse is meant to evolve. I would, however, like to highlight the elements that must be addressed if the development of a Data Vault model consisting of hundreds of tables is not to fail.

First of all, a production process should be defined that establishes the rules for developing the data warehouse, from the moment a data requirement appears, through implementation, to maintenance. I will not go into details here, as that is a topic for a separate article, but I will stress a few points: the model must be properly documented, the rules for development (adding new tables to the model) should be defined, object and column naming should be consistent, and a framework should be created to automate the feeding of DV objects (calculating keys, hdif, partitioning, etc.). It is also best for such a fragmented model to be anchored at a more general level: the company should maintain a high-level Corporate Data Model with which the fragmented model must remain consistent (we always model downwards: CDM -> Data Vault Model).
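The key and hdif calculations mentioned above usually come down to deterministic hashing over normalized inputs. Here is a minimal sketch of what such framework-generated logic might produce, assuming a SQL dialect that provides MD5 and CONCAT_WS (PostgreSQL, for example); the staging table STG_CUSTOMER, its columns, and the 'CRM_EXPORT' source label are hypothetical.

```sql
-- Derive the hub hash key from the normalized business key, and the
-- hashdiff (hdif) from the satellite payload columns. Trimming,
-- upper-casing, and a fixed delimiter keep the hashes deterministic
-- across loads and source systems.
SELECT
    MD5(UPPER(TRIM(stg.CUSTOMER_BK)))          AS CUSTOMER_HK,
    MD5(CONCAT_WS('||',
        COALESCE(TRIM(stg.CUSTOMER_NAME), ''),
        COALESCE(TRIM(stg.SEGMENT), '')))      AS HDIF,
    CURRENT_TIMESTAMP                          AS LOAD_DTS,
    'CRM_EXPORT'                               AS REC_SRC
FROM STG_CUSTOMER stg;
```

Loading a satellite then reduces to inserting rows whose HDIF differs from the most recent row for the same hash key, which is exactly the kind of repetitive logic worth generating from metadata rather than hand-writing.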


The Data Vault model is oriented around the business view of data, not around source systems. Business concepts are usually stable, while IT systems live and change much more often. If we want a consistent model that does not change when the IT system underneath is replaced, then Data Vault is the right choice. Is it recommended for every organization, though? Definitely not. If the company does not plan to integrate dozens or hundreds of data sources and does not run dozens or hundreds of critical processes, then Data Vault is unnecessary, and the overhead required to prepare a proper solution can be significant. The larger the planned data warehouse, the more certain the return on investment (ROI). ROI increases when:

  • the number of source systems is large
  • source systems change frequently
  • the number of planned critical processes is significant
  • we plan to develop the model in many independent teams


So is Data Vault right for you? Answering that question requires a thorough understanding of your business needs and strategy, as well as knowledge of the advantages and weaknesses of Data Vault. After reading our Data Vault series, however, you should be much better equipped to start answering it.


This concludes the third and final part of our series of articles about Data Vault and its implementation. However, if you are curious about expert opinions and insights on data science, integrating data engineering solutions, and synergizing technological and business strategy during data transformation – you are in luck!

Our experts create comprehensive and informative articles about the data analytics business, so tune in to our site and the social media channels linked below so you don’t miss any valuable content.


And if you have additional questions about data – let’s talk about it!


All content in this blog is created exclusively by technical experts specializing in Data Consulting, Data Insight, Data Engineering, and Data Science. Our aim is purely educational, providing valuable insights without marketing intent.