Data warehouse: what it is, examples and use cases of DWH
April 5, 2023
What is a data warehouse? It means much more than a database: a DWH is a storage and analysis system for big data, which supports a company’s decision-making processes in order to obtain better guidelines for thefuture. Let’s study the structure ofa data warehouse and how it works, to learn about the advantages, disadvantages and applications ofthis technology.
Data warehouse: what is it and meaning of DWH
Understanding what a data warehouse (DW or DWH) is and its meaning is essential for any business. It is the set of tools used to manage the continuous flow of information in which we are immersed, in order to extract value from it. More precisely, a DWH is composed of a historical archive of data, structured on several levels, which can be analysed through specific queries in order to support strategic decisions.
This process is called business intelligence (BI), a concept formulated by the company IBM, which also coined the term data warehouse. In 1988, in fact, it was researchers Barry Devlin and Paul Murphy who named this technology to build a data warehouse (database) and the tools to examine it. A DWH is the only source of factual truth for a company, based on which to make optimal choices for the future.
The definition of a data warehouse, therefore, involves big data, i.e. information of great volume, variety and speed, which is incompatible with traditional analysis and archiving systems. In fact, a DWH is first and foremost a huge collection of structured data, information with pre-defined characteristics, aggregated from sources inside and outside the company.
In particular, William Inmon explains what a data warehouse is, through 4 key characteristics of this system:
- Subject-oriented : DWHs follow a sectoral logic, considering data belonging only to subjects of interest to the company. In practice, they are subject-oriented archives, organised in specific areas, such as products, customers, sales, etc;
- Integrated: a data warehouse extracts information from different sources, so it is necessary to take care of consistency between them, cross-referencing the data in search of inconsistencies. This integration is usually performed before archiving, in the data transformation phase;
- Variable over time : a DWH serves as a historical archive for the company, thus referring to a very long period, in the order of years. All changes in the information, relating to each individual item, are therefore recorded, creating new instances without overwriting the data;
- Non-volatile: access to stored information is ‘read-only‘ because, given the historical value of a data warehouse, it is not necessary to modify or update it, only new information can be added.
A comprehensive definition of a data warehouse, therefore, could be:
A corporate archiving and analysis system to support decision-making processes, based on a historical collection of information, permanent and organised by subject, consulting different sources.
Data warehouse types: layers and components
Another way to understand what a data warehouse is is to look at its architecture, its components and the stages of its operation.
Based on the information path, we can first distinguish between two types of DWH. The first is organised following, in order, the operations of data extraction, transformation and loading. The second, on the other hand, reverses the last two stages of the process, resulting in a technically different system.
Considering the first type of data warehouse, the ETL (Extract, Transform, Load) process starts with the retrieval of information from the selected sources by querying the respective databases. The raw data, thus extracted, are collected in a kind of ‘waiting room’ for the data warehouse, called the staging area.
After that, the transformation of information takes place: essentially, it is ‘translated’ into a single language that conforms to precise designations, codifications and units of measurement. In addition, incomplete data, errors and corrupt or redundant information are integrated or eliminated. This process is also referred to as data cleaning or cleansing and its meaning for data warehouse is as important as the subsequent ‘loading’ phase.
This processed and ‘cleaned’ data is then recorded in the operational data store (ODS), where it will be further organised before being moved to the actual data warehouse ‘memory’. This database contains the maximum detail of information, but can be divided into subgroups: data marts, limited sets of data, which relate to a particular branch of a company, such as marketing or customer service.
Here, the big data is organised in a star pattern: given a certain element, the fact table lists its essential characteristics, which are then made explicit in the dimension tables. In practice, the facts are summary codes, to which the dimensions give a broader context.
For example, a customer would be represented like this in a data warehouse, facts precede dimensions:
- Residence ID: region, province and city
- ID Age: day, month, year (of birth)
- Employment ID: role, salary, seniority
If the dimensions were further divided into sub-tables, one for each attribute, the star pattern would expand into a further branching snowflake pattern.
These types of databases, which link information in hierarchical structures, are called relational. Each of them is associated with a catalogue: a kind of registry for metadata, i.e. additional information that explains the nature of the information it accompanies; for instance, it may indicate its provenance or date of extraction.
The ETL process, however, is not enough to explain what a data warehouse is: the finely processed data still has to be analysed by business intelligence tools. At this point, therefore, the various databases of the data warehouse are questioned: queries are submitted to them, formulated according to the Structured Query Language (SQL). In practice, queries are directives based on facts and dimensions, which explore the database in search of the best-matched data. By exploiting specific data marts, solutions are obtained in a short time, but the level of detail of the general archive is sacrificed.
In any case, the tools for analysing a data warehouse are OLAPs (On-Line Analytical Processing): based on star schemas, they construct so-called multidimensional cubes, snapshots of the database at a certain point in time. The answers to queries are then visualised in written reports, graphs or dashboards for presentation to the end user.
According to the ETL model, therefore, we could explain what a data warehouse is in 4 levels or layers:
- Transformation: extraction and integration (cleaning)
- Preparation and storage: DWH database and data mart
- Analysis: OLAP tools
- Presentation: written or graphic
The meaning of a data warehouse also includes the ELT (Extract, Load, Transform) type, which loads raw data directly into the central database, transforming it only when needed for analysis. In the ELT process, in particular, there are no staging areas or data marts: data are both stored and processed in the single DWH database. The ELT model requires fewer operations and is therefore faster and cheaper, and also supports unstructured data (such as images, video or audio). However, it could lead to privacy and security breaches: for instance, if no pre-processing is carried out, sensitive data cannot be ‘blacked out’ at the beginning of the process.
Advantages, disadvantages and uses of a data warehouse
If you have a more practical approach, the meaning of data warehouse might be clearer through its pros and cons.
First of all, it provides qualitative data: aggregating different sources provides comprehensive information, but it is the processing phase that makes it reliable, eliminating errors and normalising the data. Furthermore, querying a single, curated and integrated database, instead of consulting individual sources, speeds up insight operations. The main advantage of a data warehouse, in this respect, is the support for decision-making: future strategies and plans can be organised based on data, thus increasing the probability of success. This, finally, represents a competitive advantage, compared to companies that do not make informed choices with a DWH.
Data warehouses, however, also have some disadvantages:
- Creating and maintaining a DWH system requires a lot of time and constant effort, and specific resources must be devoted to preserving its information value.
- Human intervention in data processing can generate inconsistencies in the data. Given the immensity of databases, it is difficult to detect errors, and this may lead to biased results, reducing the usefulness of the data warehouse.
- A historical archive needs a lot of space to cover several years. This may lead to an increase in costs, given the exponential growth of information generated by the Internet. Cloud storage solutions, however, can be an alternative to proprietary hardware, providing storage at economical prices.
- Scalability is a challenge for any technology: simultaneous access of several users to a data warehouse could cause congestion and slow down operations. However, distributing traffic among several data marts could be a solution.
Having understood the meaning of DWH and what a data warehouse is, we can look at some applications. In general, two types of systems arise from a DWH, representing its two purposes:
- Decision support system (DSS) – the scheme for solving specific problems
- Executive/Enterprise Information System (EIS) – the infrastructure that allows information to spread, without any particular objectives.
Turning to practice, the first use case is the already explored one of analysis and reporting, the so-called ‘management control’. However, data warehouses can also be used for risk management and fraud detection: their statistical power, given by various sources, allows in-depth monitoring, from which potential vulnerabilities of the company can be identified. In addition, the application to marketing is the best known, because the data warehouse contains useful information for organising campaigns, customer conversion strategies or product sales, especially if the activities are channelled through the Internet.
As anticipated, DWH represents a source of knowledge for companies, especially in the case of structured data. Their information power, however, is enhanced by data lakes containing unstructured data (such as those from social networks), so the two types of databases are often integrated in a single data lakehouse.
Finally, data warehouses are crucial in the design of new products and services: analysis tools, in this case, are complemented by simulation environments called ‘sandboxes’ (not the metaverse) where it is possible to test, build solutions or apply new methods.In conclusion, data scientists are well acquainted with what a data warehouse is, its operational meaning and possible uses, but it is possible to combine their analysis contribution with artificial intelligence, so as to automate certain processes through machine learning.