Preface
A data warehouse is a single centralized unit that collects and stores large volumes of data from several data sources. Its primary purpose is to serve as a single place where business users can gain insights into the past and current operations of the enterprise.
Unfortunately, many businesses run into issues getting their data warehouses to a position where they consistently push out reliable, useful data for reporting and analysis. So what problems could be causing this?
Lack of complete understanding - if developers misinterpret end-user requirements, or the business fails to make its expectations clear, the developers may build a data warehouse the business can't use, or one that contains data that doesn't bring end users closer to making better business decisions.
Lacking or scarce budget - any high-quality data warehousing project is a costly endeavour from a resource perspective, even when done correctly.
Weak or impractical governance - even a high-quality data warehouse will slowly decline towards its inevitable end if the governance framework in place is fragile or out of touch with the unique challenges of the team or department that uses it. For example, if a business user is temporarily granted read-write access to make small updates to a data warehouse workflow to cover for a sick colleague, they are free to make continuous workflow adjustments (e.g. manually adding cheques to an accounts receivable table daily) that bypass the agreed QA protocols, resulting in declining warehouse quality.
Data sources
Structured data - this is any data with rows and columns, i.e. tabular datasets. Each row represents a single record of an event and each column is a characteristic/attribute describing the record (e.g. name, age, location, nationality, gender). This can be stored in and extracted from relational database systems.
Semi-structured data - this is data that carries some organisational structure (e.g. tags, keys or delimiters) but doesn't conform to a fixed relational schema. Data formats under this category include CSV, JSON, XML, and TXT files. This is commonly stored in data lakes built on cloud object storage (e.g. Azure Blob Storage, Amazon S3, Google Cloud Storage).
Unstructured data - this is data that holds no set shape or structure (i.e. doesn't possess typical rows and columns). Data formats under this category include PDFs, images, audio, and emails. These formats can also be stored in data lakes.
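To make the three shapes concrete, here's a minimal Python sketch that reads one hypothetical example of each; the file names and fields are made up for illustration.

```python
import csv
import json

# Structured: tabular rows and columns, e.g. an export from a relational table
with open("customers.csv", newline="") as f:   # hypothetical file
    rows = list(csv.DictReader(f))             # each row is a record, each key a column

# Semi-structured: self-describing keys/values, nesting allowed, no fixed schema
with open("orders.json") as f:                 # hypothetical file
    orders = json.load(f)                      # e.g. a list of nested order documents

# Unstructured: no inherent rows/columns - raw bytes until parsed by a specialised tool
with open("invoice.pdf", "rb") as f:           # hypothetical file
    blob = f.read()                            # stored as-is, typically in a data lake
```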
Types of data warehouses
There are two types of data warehouses:
Traditional data warehouse
Cloud data warehouse
A traditional data warehouse is a centralized repository for storing and maintaining data on a company's servers.
A cloud data warehouse is a centralized repository that stores and maintains data on a cloud provider’s servers.
Without going into excessive detail in this post, here are the key differences between the two:
| Attributes | Traditional data warehouse | Cloud data warehouse |
| --- | --- | --- |
| Deployment | On-premise | Cloud-based |
| Ownership | Organisation | Cloud provider |
| Cost | High upfront costs, plus ongoing maintenance costs | Pay-as-you-go model with minimal upfront costs |
| Scalability | Limited | Effectively unlimited |
| Flexibility | Rigid and limited - requires high IT involvement | Highly flexible with minimal IT involvement required |
| Latency | High latency due to on-prem infrastructure | Lower latency due to direct access to cloud resources |
| Security | Maintained by the company's own compliance measures | Maintained by third-party cloud providers with robust security and compliance measures |
As the feature comparison table above implies, cloud data warehouses are often the better option compared to traditional warehouses, as they address many of the real-world challenges businesses face today that traditional warehouses cannot.
Because a cloud data warehouse is hosted and maintained on a third party's servers, it's easier to scale the organization's warehouse when data volumes increase: the pay-as-you-go model is cheaper and more flexible, the organization doesn't need to invest in hardware or maintain it over time, and the security and compliance measures undergo rigorous testing to account for many sophisticated cybersecurity threats.
However, there are scenarios where the traditional data warehouse trumps the cloud warehouse. For example, if a company has specific regulatory compliance requirements that cloud providers may not be able to satisfy, traditional data warehouses may be the preferred option to guarantee compliance. Also, if a company is situated in a location with relatively low connectivity, the cloud option would serve more as an obstacle than an asset, so a traditional warehouse once again would be the better alternative.
So a cloud data warehouse is NOT always the correct option for a business, and neither is a traditional data warehouse the "old way of doing things". Leading with the tool is never a good idea - it's solely about understanding the unique circumstances and challenges of a business and then picking the right option after.
Pipelines
A pipeline is a tool that runs a set of sequential data tasks. Each data pipeline is made up of:
data source
processing steps
target destination
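In code terms, those three parts can be as simple as a chain of functions. This is a minimal, hypothetical sketch rather than a production framework; the records and destinations are stand-ins.

```python
def extract_from_source():
    # data source: pull records from a file, API or database (hard-coded here for illustration)
    return [{"id": 1, "amount": "100.50"}, {"id": 2, "amount": "75.00"}]

def process(records):
    # processing steps: one or more transformations applied in sequence
    return [{**r, "amount": float(r["amount"])} for r in records]

def load_to_destination(records):
    # target destination: a warehouse table, file or data lake path (printed here for illustration)
    print(f"loaded {len(records)} records")

load_to_destination(process(extract_from_source()))
```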
Here are the main types of pipelines for processing data in the data warehouse:
ETL - extract, transform, load
ELT - extract, load, transform
1. ETL
From a data warehouse perspective, an ETL pipeline is a specific pipeline for collecting data from a source, manipulating the data and then loading it into the data warehouse.
Here is the main sequence of steps for ETL pipelines:
Extract - pull or collect raw data from the source
Transform - clean and shape the raw data into a format useful for the target audience
Load - push or upload clean data into a destination easy for the target audience to access
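Here's a minimal ETL sketch in Python, using pandas and SQLite as a stand-in for the warehouse; the file, column and table names are hypothetical.

```python
import sqlite3

import pandas as pd

# Extract - pull raw data from the source (a CSV export in this hypothetical example)
raw = pd.read_csv("sales_export.csv")

# Transform - clean and shape the data before it ever touches the warehouse
clean = (
    raw.dropna(subset=["order_id"])  # drop incomplete records
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
       .rename(columns=str.lower)
)

# Load - push the cleaned data into the destination table
with sqlite3.connect("warehouse.db") as conn:  # SQLite stands in for the warehouse
    clean.to_sql("sales", conn, if_exists="append", index=False)
```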
2. ELT
In an ELT pipeline data is collected from the source and then loaded into the data warehouse before transformation activities are applied to the data.
It's becoming more popular to extract and load raw data into the data warehouse first and perform the transformation activities there, because of the wide range of data manipulation features available once the data lands in modern cloud warehouses (e.g. Redshift, Snowflake, BigQuery). In a future post, I'll dig into some of these features in more detail.
Here is the main sequence of steps for ELT pipelines:
Extract - pull or collect raw data from the source
Load - push or upload the raw data into the data warehouse
Transform - clean and shape the data inside the warehouse into a format useful for the target audience
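And the equivalent ELT sketch: the raw data is loaded first, and the transformation runs as SQL inside the warehouse. SQLite again stands in for the warehouse, and the file and table names are hypothetical.

```python
import sqlite3

import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Extract + Load - land the raw data in the warehouse untouched
    pd.read_csv("sales_export.csv").to_sql("raw_sales", conn, if_exists="append", index=False)

    # Transform - clean and shape the data using the warehouse's own SQL engine
    conn.executescript("""
        DROP TABLE IF EXISTS clean_sales;
        CREATE TABLE clean_sales AS
        SELECT order_id,
               DATE(order_date)     AS order_date,
               CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE order_id IS NOT NULL;
    """)
```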
Tools for building data pipelines
Regardless of what pipeline is used, there are generally two types of tools for creating pipelines:
Proprietary tools - these are programs and platforms owned and maintained by commercial enterprises, e.g. ADF, SSIS, Fivetran, Snowpipe, Informatica, Talend. Using proprietary tools backed by trusted users and communities can save the time of building pipelines from scratch, but carries the inherent risk of locking your workflows into a single vendor.
Custom tools - these are pipelines developed directly by engineers using coding languages and frameworks like Python, Spark, Kafka, Java etc. From scratch, engineers can build cost-saving pipelines by combining technical expertise with open-source libraries that target the end user's requirements, but this may require a significant amount of time to execute.
ETL or ELT: Which should I use?
Short answer: it depends. It's like asking which cutlery is better for eating your roast potatoes and chicken soup with - knife, fork or spoon?
Loosely speaking,
if the data needs to be manipulated and shaped in a certain manner before it meets the main target location based on business requirements or system-related constraints, use an ETL pipeline.
If the loading destination contains appropriate features for manipulating the data that are simple and/or meet your project’s data expectations, use an ELT pipeline.
The reason I said "loosely speaking" is that real-world scenarios will often demand you combine the two types of pipelines across hundreds or thousands of data sources and destinations, depending on the complexity of the projects at hand.
Layers
Raw layer
This is the first stage of the data warehouse and serves as the location where the raw data is dumped once extracted from the source systems.
At this level, the raw data is preserved in its original state, so no transformation or integration activities occur here; the data is an identical representation of what is found in the sources.
This means it's very possible (and likely) that the raw data contains dirty or sensitive data, such as corrupted columns, personally identifiable information (PII), spelling errors, bad characters and incorrect calculations, among many other examples.
Here are some of the activities to consider when building the raw layer:
Decide on the data sources that contain the data useful for your end users
Pick out the sensitive data to create masking solutions around
Move source data into raw tables
Add event logging to the source-to-raw operation
Add measurement and monitoring tools to comb out any performance bottlenecks
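Here's a minimal sketch of a source-to-raw load with basic event logging, assuming a hypothetical source CSV and a SQLite database standing in for the raw layer.

```python
import logging
import sqlite3

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("source_to_raw")

def load_source_to_raw(source_path: str, raw_table: str) -> None:
    log.info("starting extract from %s", source_path)
    df = pd.read_csv(source_path)                  # no cleaning - preserve the source as-is
    log.info("extracted %d rows", len(df))

    with sqlite3.connect("warehouse.db") as conn:  # stand-in for the raw layer
        df.to_sql(raw_table, conn, if_exists="append", index=False)
    log.info("loaded %d rows into %s", len(df), raw_table)

load_source_to_raw("crm_customers.csv", "raw_customers")  # hypothetical source and table
```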
Getting the raw layer correct is vital for ensuring the data warehouse has the quality and security required to provide reliable data to your end users.
Staging layer
The staging layer is the second stage of the data warehouse, where the data is prepared for cleaning and consolidation.
Here are some of the activities that need to be conducted for the raw-to-staging load operation:
Load raw tables into staging tables
Highlight the sensitive personally identifiable information (PII) for encrypting, masking or dropping
Specify the data quality rules and the QA checks
Specify the data transformation strategy for the semantic layer
Add event logging to track the progress on each of the tasks in the load operation
Run validation & verification tests on the load operation
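As an illustration of a raw-to-staging step, here's a minimal sketch that hashes a PII column and runs a simple data quality check before loading the staging table. The column and table names are hypothetical, and SQLite stands in for the warehouse.

```python
import hashlib
import sqlite3

import pandas as pd

with sqlite3.connect("warehouse.db") as conn:  # stand-in for the warehouse
    stg = pd.read_sql("SELECT * FROM raw_customers", conn)

    # Mask PII: replace the email column with a one-way hash before it moves downstream
    stg["email"] = stg["email"].apply(
        lambda v: hashlib.sha256(str(v).encode()).hexdigest()
    )

    # Data quality rule: every record must carry a customer_id
    missing_ids = stg["customer_id"].isna().sum()
    if missing_ids:
        raise ValueError(f"QA check failed: {missing_ids} rows missing customer_id")

    stg.to_sql("stg_customers", conn, if_exists="replace", index=False)
```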
Tips
- Create acceptance criteria made up of small staging tasks that manage each part of the data load operation, as in the example table below. Each staging task must be completed before advancing to the pipeline's next phase
| task_no | dwh_layer | task | task_type | completion_status |
| --- | --- | --- | --- | --- |
| DWH-78 | STAGING | load customers_tbl | load | completed |
| DWH-79 | STAGING | load orders_tbl | load | in-progress |
| DWH-80 | STAGING | run error-checking process | data_verification_check | not started |
| DWH-81 | STAGING | specify entity relationships for the semantic layer | erd | not started |
So the staging layer is a temporary location where the data is prepared for the heavy-lifting transformations. Once the acceptance criteria are satisfied, the semantic layer is the next stage to focus on.
Semantic layer
This is the layer where the data is cleaned and consolidated into one central unit that serves as a single version of truth (SVOT). This is achieved by applying a series of data transformation and integration techniques, including a stage where business rules and requirements are converted into transformation logic, either through custom pipelines or through low-code/no-code tools and interfaces, depending on the resources available to the organization. The aim is for the data to reflect a consistent view of real-world events even as the warehouse scales over time.
The semantic layer also covers the data governance piece, with a focus on implementing solid policies, processes and practices around:
Data access - what custom roles are created and granted to each team and user, and what are the privileges granted to each role (e.g. read, write, list, delete)?
Data ownership - which teams and users own what data assets? Who are the main players responsible for maintaining the data assets?
Data security - what actions are in place to ensure the data is secure from potential threats?
Data quality - what stress tests do we run to confirm the data moved and processed at each stage has integrity (is complete, accurate and relevant)? What metrics are important in communicating this?
Data observability - how is my data performing at the "dwh_layer" stage? Which table contains the "col_name" field in the "db_name" database?
Here are some of the activities conducted in building a semantic layer:
Apply the data transformation strategy, i.e. clean the data (add filters, normalize tables, remove bad characters, set up calculated columns, concatenate fields)
Create a common data model (a standard schema that consolidates all data tables into a single structure)
Apply dimensional modelling techniques (arranging data tables into fact and dimension tables)
Design the data governance framework and docs
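As an illustration of the dimensional modelling step, here's a small, hypothetical sketch that splits a cleaned staging table into a dimension table and a fact table using pandas; the columns are made up for the example.

```python
import pandas as pd

# A cleaned staging table (hypothetical columns) holding one row per order
stg_orders = pd.DataFrame({
    "order_id":      [101, 102, 103],
    "customer_id":   [1, 2, 1],
    "customer_name": ["Ada", "Grace", "Ada"],
    "amount":        [250.0, 99.5, 40.0],
})

# Dimension table: one row per customer, describing who they are
dim_customer = (
    stg_orders[["customer_id", "customer_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Fact table: one row per order event, keeping measures plus keys to the dimensions
fact_orders = stg_orders[["order_id", "customer_id", "amount"]]
```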
Tips
If privileged users have access to data that isn't useful for their reporting needs, or unprivileged users don't have access to the data they do need, this will inevitably create data silos (a situation where users build data and reporting workflows that aren't accessible to the rest of the business, e.g. on their local drives).
Here are a few considerations for mitigating this:
Specify each team and user’s responsibilities
Use RBAC (role-based access control) - this is a system for managing access by creating a set of custom roles, each carrying specific privileges. Each department, team or user is mapped to the roles that grant the system rights they need to perform their duties comfortably. If more or less access is required, a request can be sent to the system admin responsible for reassigning roles (see the sketch after this list).
Add SSOs (single-sign-on) where possible and appropriate
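Here's a minimal sketch of how RBAC grants might be generated from a role-to-privilege map. The role, schema and privilege names are hypothetical, and the exact GRANT syntax varies by warehouse, so treat this as illustrative rather than a definitive implementation.

```python
# Map each custom role to the privileges it needs on a given schema (hypothetical roles)
ROLE_PRIVILEGES = {
    "finance_analyst": ["SELECT"],                      # read-only reporting
    "data_engineer":   ["SELECT", "INSERT", "UPDATE"],  # builds and maintains pipelines
    "warehouse_admin": ["ALL PRIVILEGES"],              # full control, kept to a small group
}

def grant_statements(schema: str) -> list[str]:
    """Generate GRANT statements for each role (syntax differs per warehouse)."""
    return [
        f"GRANT {', '.join(privs)} ON SCHEMA {schema} TO ROLE {role};"
        for role, privs in ROLE_PRIVILEGES.items()
    ]

for stmt in grant_statements("analytics"):
    print(stmt)
```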
So remember: the semantic layer covers the processes behind enriching, integrating and governing the data, as well as the policies and practices around data security, quality, access, ownership and observability.
Data warehouse layer
Also known as the User Access Layer (UAL), this is the layer where the single source of truth is exposed to the appropriate end users. The goal of this layer is to display the cleaned data to the downstream teams for them to analyse.
The data can be delivered as data marts (aggregated/summarised tables) for users to perform querying activities. In some architectures, this also serves as the presentation layer, where visual reports can be created on top of the tables from the data marts.
If (and only if) the previous layers are planned and executed correctly, this layer should be the zone where the "self-service BI reporting" dream becomes a reality, because all that needs to be done is for users to plug their BI/reporting tools into the data marts and create the data stories needed to drive stakeholders to meaningful decision-making.
Technical users can use languages like SQL, Python or Spark (among other options) to communicate with the data warehouse, asking the data marts questions (via queries and commands) to get the answers they're after and investigate the summaries created.
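For example, a technical user might pull a summary straight from a data mart with a few lines of Python; the mart and column names here are hypothetical, and SQLite stands in for the warehouse connection.

```python
import sqlite3

import pandas as pd

with sqlite3.connect("warehouse.db") as conn:  # stand-in for the warehouse connection
    monthly_sales = pd.read_sql(
        """
        SELECT strftime('%Y-%m', order_date) AS month,
               SUM(amount)                   AS total_sales
        FROM fact_orders
        GROUP BY month
        ORDER BY month
        """,
        conn,
    )

print(monthly_sales.head())
```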
Presentation layer
This is the layer where informative reports and notifications are built. I refer to it as a visualization channel because it's the portal where you can generate as many visualizations as are valuable to end users.
As in the data warehouse layer, technical users can query the data marts with languages like SQL, Python or Spark (among other options), and then turn the answers into visuals that communicate the results engagingly and powerfully.
Non-technical users can use BI tools to interact with the data warehouse, run queries for pulling the data they need and build visualizations from the same platforms.
Alerts can also be created at this layer to notify the right teams and users of specific events occurring (i.e. event-based triggers). For example, if the unpaid sales invoices hit a certain figure, an email notification can be sent to Accounts Receivable to chase them up.
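Here's a minimal sketch of that kind of event-based alert, assuming a hypothetical unpaid-invoices table and a stand-in notification helper; a real implementation would use the alerting features of the orchestration or BI tool in place.

```python
import sqlite3

UNPAID_THRESHOLD = 50_000  # hypothetical figure agreed with Accounts Receivable

def notify_accounts_receivable(total: float) -> None:
    # Stand-in for an email/Teams/Slack notification sent by your alerting tool
    print(f"ALERT: unpaid invoices have reached {total:,.2f}")

with sqlite3.connect("warehouse.db") as conn:  # stand-in for the warehouse
    (total_unpaid,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM fact_invoices WHERE paid = 0"
    ).fetchone()

if total_unpaid >= UNPAID_THRESHOLD:
    notify_accounts_receivable(total_unpaid)
```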
Visualization tools normally used at this layer include Power BI, Tableau, Looker and Qlik.
Here are some of the visuals you can create at this layer:
Dashboards
Scorecards
Quarterly reports
KPI reports
Closing notes
There are plenty of other points crucial to building a scalable, successful data warehouse that I haven't covered in this article: deep dives on the warehouse itself (e.g. performance, availability, partitioning, indexing), more advanced topics (orchestration tools, historical vs incremental load approaches, distributed computing), the resources needed to build one (e.g. how many engineers are really needed to build a performant and scalable data warehouse?), and even buzz-wordy topics (e.g. what is a data lakehouse and how do you build one?).
I'll go into more depth on each of these when I get a bit of downtime. In the meantime, feel free to contact me if you have any questions or want me to clarify a few points, and I'll be happy to get back to you as soon as I can!
Feel free to reach out via my handles: LinkedIn | Email | Twitter