In today’s interconnected world, with the ease of communication and transportation, setting up various data centers across different regions is remarkably simple for companies. However, it comes with the challenge of accessing and unifying the scattered data. Data Warehousing overcomes this challenge.
A Data Warehouse solution retrieves data from multiple sources into a centralized location, creating a unified, organized view of the data. It also eliminates duplicates, producing a cleaner data state. Examples of known Data Warehouse solutions encompass platforms like Azure Data Factory, Oracle, IBM Infosphere, Amazon Redshift, Snowflake, and Teradata.
For example, imagine a multinational retail company operating in several regions. The company gathers extensive sales data from its distributed stores, online platforms, and customer feedback channels. The scattered data across different regions, in different data forms, and across multiple databases makes it challenging to gain insights on the spot.
The company decided to implement a Data Warehouse solution to tackle this challenge. The Data Warehouse hosts all the data from the various data sites in one destination and structures the data, making it easier to analyze. Now, the company owners and data analysts can easily retrieve all the data, analyze and visualize it to gain more precise insights, capture patterns in the data, and make the right decisions.
Deploying Data Warehousing Solutions
Where are these solutions deployed?
In the past, Data Warehouse solutions were hosted on on-site server computers. These servers extracted data from various sources, cleaned it, and prepared it for a relational database. The solution also included features to manage and maintain the database. Nowadays, companies deploy Data Warehouse solutions in the cloud.
In a traditional on-premises Data Warehouse setup, customers need to purchase both hardware and software solutions, which poses challenges for scalability and maintenance. In contrast, in a cloud-based setup, the cloud provider manages these critical solution elements.
History of Records
Are historical records also stored?
Within a Data Warehouse, we maintain comprehensive records of data versions. The records archive enables better trend analysis, which helps business owners make more informed decisions.
A data record typically contains static and dynamic dimensions; for instance, the product cost dimension may change over time. The dynamic dimension is known as a slowly changing dimension. These dimensions maintain a history of the changes that occur over time. Keeping a historical archive can facilitate better analysis. For example, financial analysts can examine sales trends and patterns across different periods.
Five types of slowly changing dimensions include:
- Keeping the original, overwriting.
- Adding a new row.
- Introducing a new dimension.
- Utilizing a history table.
- Implementing a combined approach.
We’ll briefly explain some of these types. The “keeping the original:” type refers to an unchanged dimension, such as the date of birth or the social security number. In the “overwriting” type, users can change a dimension’s value, losing the previous value. The “adding a new row” type adds a fresh row alongside the existing row, thus preserving the old record while incorporating the updated values in the fresh one.
Data Warehousing Three-Tier Architecture
Let’s delve into the architectural aspect!
A Data Warehouse solution usually follows a three-tier architecture. The three-tier architecture is made up of three layers:
A relational database management system (RDMS) typically implements the Data Warehouse server, constituting the bottom tier. The tier incorporates Extraction, Transformation, and Loading (ETL) tools to extract data from external sources or operational databases (an operational database is one in which data is updated in real-time). Extraction programs include data gateways.
Additionally, ETL tools undergo cleaning, transforming, and loading the data into the Data Warehouse.
Remember the retail company we previously discussed? Now, imagine having to query numerous relational databases across multiple regions, merge the results, and query the combined data. Such a task can be very tedious for a technical engineer. This is where the middle tier plays a crucial role.
The middle tier consists of an online analytical processing (OLAP) server. The OLAP server stores the data from the bottom tier in a multidimensional format. The resulting multidimensional data constitutes a data hypercube. Users can use complex queries to answer intricate questions regarding the data.
The top tier comprises a front end crafted to display the results returned by the middle tier.
Source-driven versus Destination-driven
Data from the source regions can be sent to the Data Warehouse in two different ways. In one way, known as the source-driven architecture, data is sent continually or periodically from the source regions. Alternatively, the Data Warehouse solution requests data from the source regions in the destination-driven architecture.
In this article, we provided an overview of data warehousing and its primary purpose: retrieving data from diverse sources and placing it in a central repository. Moreover, we gave a real-world practical application of Data Warehousing and highlighted the shift toward cloud-based deployments. Then, we described the three-tier architecture and the difference between source-driven and destination-driven architecture designs.