Data Vault: Best Practices for Creating Hubs - Part 2
- Published by Luc Durand
This article is the second in a series of blog posts explaining the Data Vault approach to data warehouse modeling.
In the first article, we presented the Data Vault approach as an alternative to normalized subject-area modeling and star modeling. We then described the three types of entities of this approach (hubs, links, and satellites), which fall into two groups:
- The structural entities: hubs, which identify the business concepts used and shared by one or several sectors of the organisation, and links, which connect business concepts by linking at least two hubs. Figure 1 shows an example of a Data Vault model: the Parking and Employee entities are hubs, and the Parking Employee entity is a link.
- The descriptive entities: satellites, which describe the business concepts (hubs) and the links, as well as their usage contexts.
Figure 1 Example of a Data Vault Model
There were two main ideas to remember:
- Structural data, which is relatively stable, is separated from descriptive/contextual data, which changes more often. Separating what is stable from what is unstable is generally an important principle of good architecture.
- Data from source systems is kept intact. Loading from the sources is quick: the data is rearranged into hubs, links, and satellites, but it is not transformed, which makes it possible to rebuild an exact image of the source data as of any given time. A Data Vault warehouse is therefore a raw data warehouse.
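To make the second idea concrete, here is a minimal sketch of how one source row could be rearranged into hub, link, and satellite rows without changing any value. The function name, the table names, the field names, and the sample data are all hypothetical and chosen only to echo the Parking/Employee example; this is an illustration under those assumptions, not a reference loader.

```python
from datetime import datetime, timezone


def split_source_row(row, record_source, load_time):
    """Rearrange one source row into hub, link and satellite rows.

    Values are copied as-is: nothing is cleaned, cast or derived.
    All table and field names here are hypothetical.
    """
    # Traceability fields attached to every row that is loaded.
    trace = {"record_source": record_source, "load_time": load_time}
    return {
        # Hubs: only the natural key of each business concept.
        "hub_employee": {"employee_code": row["employee_code"], **trace},
        "hub_parking": {"parking_number": row["parking_number"], **trace},
        # Link: connects the two hubs through their keys.
        "link_parking_employee": {
            "employee_code": row["employee_code"],
            "parking_number": row["parking_number"],
            **trace,
        },
        # Satellite: descriptive data, attached to the hub it describes.
        "sat_employee": {
            "employee_code": row["employee_code"],
            "employee_name": row["employee_name"],
            **trace,
        },
    }


# Hypothetical source row; the untidy name is kept exactly as received.
source_row = {
    "employee_code": "E042",
    "employee_name": "  ada LOVELACE ",
    "parking_number": "P-17",
}
rows = split_source_row(
    source_row, "parking_system", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```

Because the satellite keeps the name exactly as it arrived, the original source row can be reassembled from the hub, link, and satellite rows.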
Now, let us look more closely at each type of entity, beginning with the hub.
A hub contains a natural key that identifies one and only one occurrence of a business concept (at least, that is what we hope for!). A natural key is a visible key used by the organisation to identify an occurrence of a concept. For example, the employee’s code is used to identify a single employee across different contexts and processes: human resources, parking management, incidents, pay, the financial system, etc. The employee’s code is therefore the connection (communication) point for the “employee” concept between the various business units, hence the name hub.
A natural key is different from the source systems’ internal identifiers that are, in principle, invisible and specific to the system that generates the values.
In an ideal world, each concept would be given the same key in every business unit, and that key would be unique. In practice, this is far from always the case. For example, incident management could use its own natural key for the different resources involved.
Here are the criteria a good hub should meet. A good hub:
- represents only one concept;
- does not contain any descriptive data elements (example: an employee’s name). The hub’s satellites contain the descriptive part;
- does not contain any relationships. Links hold the relationships between hubs, and each satellite holds the reference from a description to the concept (hub) it describes;
- contains, ideally, a unique natural key composed of at least one data element that identifies the concept (example: the parking number). If the same natural key value appears in two sources but does not refer to the same occurrence of the concept, the occurrences must be told apart by combining the natural key with the name of the source the data comes from;
- always contains at least two pieces of traceability information: the source the data comes from and the moment the data was loaded into the hub;
- is associated with at least one satellite that describes it.
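The criteria above can be sketched in a few lines of code. The following is a minimal in-memory illustration, assuming hypothetical names throughout (`Hub`, `HubEntry`, `qualify_by_source`, and the sample sources). It also assumes that a key already present in the hub is not loaded again and keeps its first source and load time; that is a common convention, not something stated in this article.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HubEntry:
    # A hub row holds the natural key and traceability data only:
    # no descriptive attributes and no relationships.
    natural_key: str
    record_source: str
    load_time: datetime


class Hub:
    """In-memory sketch of a hub for a single business concept."""

    def __init__(self, concept, qualify_by_source=False):
        self.concept = concept
        # Set to True when the same key value can mean different
        # occurrences in different sources.
        self.qualify_by_source = qualify_by_source
        self._entries = {}

    def _identity(self, natural_key, record_source):
        if self.qualify_by_source:
            return (record_source, natural_key)
        return (natural_key,)

    def load(self, natural_key, record_source, load_time):
        # Assumption: only unseen keys are inserted; an existing key
        # keeps the source and load time of its first arrival.
        identity = self._identity(natural_key, record_source)
        if identity not in self._entries:
            self._entries[identity] = HubEntry(
                natural_key, record_source, load_time
            )
        return self._entries[identity]

    def __len__(self):
        return len(self._entries)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Same employee code in two sources, same employee: one hub entry.
employees = Hub("employee")
employees.load("E042", "hr", t0)
employees.load("E042", "payroll", t0)

# Same key value in two sources, different occurrences: two entries.
resources = Hub("resource", qualify_by_source=True)
resources.load("R1", "incidents", t0)
resources.load("R1", "facilities", t0)
```

In this sketch `len(employees)` is 1 and `len(resources)` is 2: the source name becomes part of the identity only when the natural key alone cannot be trusted to be unique.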
In the next article, we will continue describing the types of entities in a Data Vault model.