SOLID principles in data engineering - Part 2
The Object Oriented Programming (OOP) version
Preface 📖
Disclaimer: This article assumes you have a proficient level of Python programming knowledge. If you get stuck in certain areas of the article, feel free to contact me directly through the handles at the bottom of this post and I'll be happy to walk you through anything that needs more light shed on it 🌞
In part 1, I broke down what SOLID principles are and included simplified real-world examples of what each one looks like from a developer’s standpoint.
This time, I aim to show you an example of a real data pipeline built using these design principles with Python.
Pipeline architecture📜
I will be using a web scraper I built for another project, designed to scrape football data from a website, transform it into a Pandas dataframe, and then load it into the cloud or a local environment.
You can find the source code in my GitHub repository here.
Pipeline breakdown🔎
Traditional data engineering approach
What does it look like?
The traditional data engineering mindset would break down the pipeline architecture pieces into:
Extractor > Transformer > Loader
...or most commonly referred to as...
Extract > Transform > Load
Here's the role of each object in the ETL pipeline when processing data:
The Extractor object performs the extract (E)
The Transformer object performs the transform (T)
The Loader object performs the load (L)
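Here's a minimal sketch of what that traditional, all-in-one version roughly looks like. It's an illustrative reconstruction rather than the actual source code, and the file name and bucket are placeholders:

```python
import logging

import boto3
import pandas as pd


class PremLeagueScraper:
    """One class doing everything: configuration, logging, scraping and loading."""

    def __init__(self, url: str, bucket: str):
        self.url = url
        self.bucket = bucket

        # Logging is configured directly inside the scraper
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

        # Concrete AWS client created directly inside the scraper
        self.s3_client = boto3.client("s3")

    def run(self) -> None:
        self.logger.info("Scraping %s", self.url)

        # Extract + transform fused into one step
        df = pd.read_html(self.url)[0]

        # Load hard-wired to a single file name and a single destination
        df.to_csv("prem_league_table.csv", index=False)
        self.s3_client.upload_file("prem_league_table.csv", self.bucket, "prem_league_table.csv")
        self.logger.info("Done")
```

Everything (logging, configuration, extraction, transformation and loading) lives inside one class, which is exactly what the next section calls out.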
What’s wrong with this?
Although it successfully scrapes the football data it was designed to process, this version violates the SOLID principles in many ways:
❌ The program is open for modification but closed for extension, meaning it would be difficult to upgrade or improve the code over time as requirements change
❌ Having the bot perform many duties like logging, configuration, and web scraping tightly couples the code, which means changes to one part of the program may cause unintended consequences in other parts of the code
❌ The program’s functionalities depend directly on concrete implementations of libraries like boto3, logging, and pandas, which means changes to one code block may cause a ripple effect across other parts of the code. This implies modifications have to be applied in multiple places
This isn’t an exhaustive list of the challenges with this approach, but these problems make it hard for us to test, extend, or refactor the codebase over time, which is a necessity for any production-grade application.
Software engineering approach
How do I do this?
Refactoring the pipeline requires approaching it with a software engineering mindset, which means diving a few layers deeper than the traditional data engineering breakdown made earlier.
To comply with SOLID principles, we would need to break the pipeline into classes that
are employed to perform one job only (Single responsibility)
can have functionality added to them over time but never removed (Open/closed principle)
can be swapped with any child classes created from them without any code breaking (Liskov substitution)
only have methods that all child classes can use (Interface segregation)
depend on abstractions only (Dependency inversion)
What have I done?
So here are the classes I came up with:
ILogger - to log messages to a file and the console
Config - to manage environment variables
IWebPageLoader - to load the football URL in a browser
IPopUpHandler - to close the annoying cookie box displayed when you first enter a webpage
IDataExtractor - to scrape the football data from the HTML elements in the webpage
IDataTransformer - to transform the scraped data into a Pandas data frame
IFileUploader - to upload the transformed data into the cloud or local machine
Implementation🎬
1. Logger📝
(GitHub source code for the Logger object can be found here to follow along)
A. What does it do?
The Logger object deals with logging events linked to the scraped content processed in the dataflow. It is expressed in my codebase as ILogger, an abstract class that contains 5 abstract methods for logging messages at the different severity levels: debug, info, warning, error and critical.
The 1st-level child classes created include:

FileLogger - for recording messages to a log file
ConsoleLogger - for streaming messages to the console

The ConsoleLogger contains 2nd-level child classes of its own, which include:

ColouredConsoleLogger - for streaming messages to the console in coloured format
NonColouredConsoleLogger - for streaming messages to the console with no colour formatting
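To make this hierarchy easier to picture, here's a simplified sketch of how it could be written with Python's abc module. The class and method names follow the ones above, but the method bodies are illustrative rather than copied from the repository:

```python
from abc import ABC, abstractmethod
import logging


class ILogger(ABC):
    """Abstract logger: one abstract method per severity level."""

    @abstractmethod
    def debug(self, message: str) -> None: ...

    @abstractmethod
    def info(self, message: str) -> None: ...

    @abstractmethod
    def warning(self, message: str) -> None: ...

    @abstractmethod
    def error(self, message: str) -> None: ...

    @abstractmethod
    def critical(self, message: str) -> None: ...


class FileLogger(ILogger):
    """1st-level child: records messages to a log file."""

    def __init__(self, filename: str = "scraper.log"):
        self._logger = logging.getLogger("file_logger")
        self._logger.setLevel(logging.DEBUG)
        self._logger.addHandler(logging.FileHandler(filename))

    def debug(self, message: str) -> None: self._logger.debug(message)
    def info(self, message: str) -> None: self._logger.info(message)
    def warning(self, message: str) -> None: self._logger.warning(message)
    def error(self, message: str) -> None: self._logger.error(message)
    def critical(self, message: str) -> None: self._logger.critical(message)


class ConsoleLogger(ILogger):
    """1st-level child: streams messages to the console (still abstract)."""


class NonColouredConsoleLogger(ConsoleLogger):
    """2nd-level child: plain console output (ColouredConsoleLogger mirrors this with colour codes)."""

    def debug(self, message: str) -> None: print(f"DEBUG: {message}")
    def info(self, message: str) -> None: print(f"INFO: {message}")
    def warning(self, message: str) -> None: print(f"WARNING: {message}")
    def error(self, message: str) -> None: print(f"ERROR: {message}")
    def critical(self, message: str) -> None: print(f"CRITICAL: {message}")
```

Because every logger ultimately depends on ILogger, any part of the pipeline that asks for an ILogger can be handed a FileLogger or a console logger without changing a line of its own code.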
B. How does it satisfy SOLID principles?
Here’s how the Logger object harmonizes with the SOLID principles:
Single responsibility (SRP) ✅ - Each parent and child class has one specific logging duty
Open-closed principle (OCP) ✅ - No classes need to be modified, but we can add more functionality if the need arises
Liskov substitution (LSP) ✅ -
The FileLogger class can be substituted for the parent ILogger class
The ConsoleLogger class can be substituted for the parent ILogger class
The ColouredConsoleLogger class can be substituted for its parent ConsoleLogger class
The NonColouredConsoleLogger class can be substituted for its parent ConsoleLogger class
Interface segregation principle (ISP) ✅ - All child classes at every level share the same methods of their parent class without implementing methods they do not require
Dependency inversion principle (DIP) ✅ -
The FileLogger concrete class depends on the ILogger abstract class
The ConsoleLogger class depends on the ILogger abstract class
The ColouredConsoleLogger class depends on the ConsoleLogger abstract class
The NonColouredConsoleLogger class depends on the ConsoleLogger abstract class
No class is dependent on any concrete classes in the Logger object
2. Config⚙️
(GitHub source code for the Config object can be found here to follow along)
A. What does it do?
The Config class is a simple object for storing private credentials for accessing key environments like the main AWS S3 bucket and the local machine in use. It also contains a WRITE_FILES_TO_CLOUD flag for specifying whether files processed in the pipeline should be persisted to the cloud (True means ‘yes please’, False means ‘no’). Once set to False, files will be persisted locally in the format specified under the Data Loader section.
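As a rough illustration, the class might look something like the sketch below; apart from WRITE_FILES_TO_CLOUD, the attribute names here are assumptions rather than the exact ones in the repository:

```python
import os


class Config:
    """Single home for the pipeline's environment settings."""

    # Flip to False to persist output files to the local machine instead of S3
    WRITE_FILES_TO_CLOUD: bool = os.getenv("WRITE_FILES_TO_CLOUD", "True") == "True"

    # Illustrative credential/target settings pulled from environment variables
    S3_BUCKET: str = os.getenv("S3_BUCKET", "")
    S3_REGION: str = os.getenv("S3_REGION", "eu-west-2")
    LOCAL_TARGET_PATH: str = os.getenv("LOCAL_TARGET_PATH", "./output")
```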
B. How does it satisfy SOLID principles?
To be honest, it's a simple class that doesn't need derived classes anytime soon. The only principle it satisfies is the Single Responsibility Principle (SRP), because there's only one reason for it to change: modifying the configuration settings for persisting the scraping bot's output.
3. Webpage Loader🌐
(GitHub source code for the Webpage Loader object can be found here to follow along)
A. What does it do?
The Webpage Loader object uses the Selenium web driver to load the URL specified to it. In my code, it's expressed as IWebPageLoader, an abstract interface for other webpage loaders to depend on.

A 1st-level child class named WebPageLoader is created, which inherits the load_page abstract method, ready to be implemented in its child class.

A 2nd-level child class named PremLeagueTableWebPageLoader is created from WebPageLoader. The load_page method is implemented at this level to load the webpage containing the Premier League table.
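Here's a condensed sketch of that hierarchy, assuming Selenium is the driver behind it (the bodies are illustrative, not the exact source code):

```python
from abc import ABC, abstractmethod

from selenium import webdriver


class IWebPageLoader(ABC):
    """Abstract interface for anything that loads a web page."""

    @abstractmethod
    def load_page(self, url: str) -> None:
        ...


class WebPageLoader(IWebPageLoader):
    """1st-level child: holds the Selenium driver, leaves load_page to its children."""

    def __init__(self, driver: webdriver.Chrome):
        self.driver = driver

    @abstractmethod
    def load_page(self, url: str) -> None:
        ...


class PremLeagueTableWebPageLoader(WebPageLoader):
    """2nd-level child: loads the page containing the Premier League table."""

    def load_page(self, url: str) -> None:
        self.driver.get(url)
```

A loader for another league's table would simply be another WebPageLoader child with its own load_page implementation.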
B. How does it satisfy SOLID principles?
Single responsibility (SRP) ✅ - Each parent and child class is responsible for one simple duty in the Webpage Loader object
Open-closed principle (OCP) ✅ - If we need to add webpage loaders for other European leagues, we don't need to alter existing code; we can easily add more interfaces
Liskov substitution (LSP) ✅ -
The WebPageLoader class can be substituted for the parent IWebPageLoader class
The PremLeagueTableWebPageLoader class can be substituted for its parent WebPageLoader class
Interface segregation principle (ISP) ✅ - Each child class uses the same load_page method of its parent class
Dependency inversion principle (DIP) ✅ -
The WebPageLoader abstract class depends on the IWebPageLoader abstract class
The PremLeagueTableWebPageLoader class depends on the WebPageLoader abstract class
No class is dependent on any concrete classes in the Webpage Loader object
4. Popup Handler🪟
(GitHub source code for the Popup Handler object can be found here to follow along)
A. What does it do?
The Popup Handler is designed to close popup windows appearing in the browser about cookies, ads or subscription services we're probably not interested in. This is expressed through the IPopUpHandler abstract class, which contains a close_popup abstract method.

Similar to the hierarchy of the Webpage Loader classes, it contains a 1st-level child class named PopUpHandler, which inherits from the IPopUpHandler class and leaves the close_popup abstract method unimplemented.

A 2nd-level child class named PremLeagueTablePopUpHandler implements the close_popup method to close the popup boxes in the browser.
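The sketch below mirrors that structure; the element selector is a placeholder, since the real one depends on the target website:

```python
from abc import ABC, abstractmethod

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class IPopUpHandler(ABC):
    """Abstract interface for closing browser popups."""

    @abstractmethod
    def close_popup(self) -> None:
        ...


class PopUpHandler(IPopUpHandler):
    """1st-level child: stores the driver, leaves close_popup to its children."""

    def __init__(self, driver):
        self.driver = driver

    @abstractmethod
    def close_popup(self) -> None:
        ...


class PremLeagueTablePopUpHandler(PopUpHandler):
    """2nd-level child: dismisses the cookie banner on the Premier League page."""

    def close_popup(self) -> None:
        try:
            # The element ID below is a placeholder, not the real selector
            self.driver.find_element(By.ID, "cookie-accept").click()
        except NoSuchElementException:
            pass  # no popup on screen, nothing to close
```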
B. How does it satisfy SOLID principles?
Single responsibility (SRP) ✅ - Each parent and child class in the Popup Handler object contains only one unit of work each
Open-closed principle (OCP) ✅ - Splitting the subclasses into smaller interfaces allows us to add new behaviours without modifying existing code
Liskov substitution (LSP) ✅ -
The PopUpHandler class can be substituted for the parent IPopUpHandler class
The PremLeagueTablePopUpHandler class can be substituted for its parent PopUpHandler class
Interface segregation principle (ISP) ✅ - Each child class uses the same close_popup method of its parent class
Dependency inversion principle (DIP) ✅ -
The PopUpHandler abstract class depends on the IPopUpHandler abstract class
The PremLeagueTablePopUpHandler concrete class depends on the PopUpHandler abstract class
No class is dependent on any concrete classes in the Popup Handler object
5. Data Extractor🧪
(GitHub source code for the Data Extractor object can be found here to follow along)
A. What does it do?
The Data Extractor is the first object in the program dealing with the actual data to be processed. Its interface is named IDataExtractor, and it is responsible for scraping the football data from the internet. IDataExtractor is an abstract class that contains an abstract method called scrape_data.
It consists of child classes at the following levels:

1st level - TableStandingsDataExtractor: an abstract class for a table that contains the scraped data relating to each football team's ranking in the league. This depends on the abstraction of the IDataExtractor class.
2nd level - PremLeagueTableStandingsDataExtractor: the concrete class for scraped data about each Premier League team's position in the standings table. This is where the scrape_data method is implemented to perform the scraping activity.
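A trimmed-down sketch of that hierarchy might look like this (the row-parsing logic is illustrative, not the repository's exact implementation):

```python
from abc import ABC, abstractmethod

from selenium.webdriver.common.by import By


class IDataExtractor(ABC):
    """Abstract interface for anything that scrapes football data."""

    @abstractmethod
    def scrape_data(self):
        ...


class TableStandingsDataExtractor(IDataExtractor):
    """1st-level child: extractor for league standings tables."""

    def __init__(self, driver):
        self.driver = driver

    @abstractmethod
    def scrape_data(self):
        ...


class PremLeagueTableStandingsDataExtractor(TableStandingsDataExtractor):
    """2nd-level child: pulls each row of the Premier League table."""

    def scrape_data(self) -> list[list[str]]:
        rows = self.driver.find_elements(By.TAG_NAME, "tr")
        # Each table row becomes a list of its cell values
        return [
            [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
            for row in rows
        ]
```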
B. How does it satisfy SOLID principles?
Single responsibility (SRP) ✅ - Each parent and child class in the Data Extractor object is assigned one duty to handle
Open-closed principle (OCP) ✅ - The code is designed in a way that lets us add expected behaviours without editing or removing what's already there
Liskov substitution (LSP) ✅ -
The TableStandingsDataExtractor class can be substituted for the parent IDataExtractor class
The PremLeagueTableStandingsDataExtractor class can be substituted for its parent TableStandingsDataExtractor class
Interface segregation principle (ISP) ✅ - Each child class uses the same scrape_data method of its parent class, ensuring no class depends on methods it does not need to use
Dependency inversion principle (DIP) ✅ -
The TableStandingsDataExtractor abstract class depends on the IDataExtractor abstract class
The PremLeagueTableStandingsDataExtractor concrete class depends on the TableStandingsDataExtractor abstract class
No class is dependent on any concrete classes in the Data Extractor object
6. Data Transformer🔄
(GitHub source code for the Data Transformer object can be found here to follow along)
A. What does it do?
The Data Transformer object is tasked with cleaning the data and converting it into a dataframe. It is expressed as the IDataTransformer interface, with one abstract method named transform_data.

Its only direct child class is named TableStandingsDataTransformer, a derived class inheriting from the IDataTransformer interface that is responsible for transforming scraped data related to the table standings of different football teams.

The PremierLeagueTableStandingsDataTransformer is a concrete class that extends the TableStandingsDataTransformer class and implements the transform_data method.
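Here's an illustrative sketch of that chain; the column names are assumptions for the example rather than the exact ones used in the repository:

```python
from abc import ABC, abstractmethod

import pandas as pd


class IDataTransformer(ABC):
    """Abstract interface for turning scraped data into a dataframe."""

    @abstractmethod
    def transform_data(self, raw_data) -> pd.DataFrame:
        ...


class TableStandingsDataTransformer(IDataTransformer):
    """1st-level child: transformer for standings tables (still abstract)."""


class PremierLeagueTableStandingsDataTransformer(TableStandingsDataTransformer):
    """2nd-level child: builds the Premier League standings dataframe."""

    # Column names below are illustrative, not taken from the repository
    COLUMNS = ["position", "team", "played", "won", "drawn", "lost", "points"]

    def transform_data(self, raw_data: list[list[str]]) -> pd.DataFrame:
        df = pd.DataFrame(raw_data, columns=self.COLUMNS)
        return df.dropna(how="all")  # drop any fully empty rows scraped by mistake
```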
B. How does it satisfy SOLID principles?
Single responsibility (SRP) ✅ - Each class in the Data Transformer is responsible for only one task
Open-closed principle (OCP) ✅ - If we need to add other data transformers to process other football statistics, we can easily add more interfaces without altering existing code
Liskov substitution (LSP) ✅ -
The TableStandingsDataTransformer class can be substituted for the parent IDataTransformer class
The PremierLeagueTableStandingsDataTransformer class can be substituted for its parent TableStandingsDataTransformer class
Interface segregation principle (ISP) ✅ - Each child class uses the same transform_data method of its parent class
Dependency inversion principle (DIP) ✅ -
The TableStandingsDataTransformer abstract class depends on the IDataTransformer abstract class
The PremierLeagueTableStandingsDataTransformer class depends on the TableStandingsDataTransformer abstract class
No class is dependent on any concrete classes in the Data Transformer object
7. Data Loader💾
(GitHub source code for the Data Loader object can be found here to follow along)
A. What does it do?
The Data Loader object is designed to upload clean data into the target environment (cloud or local machine) configured by the main user. The primary interface is named IFileUploader, an abstract class with one abstract method named upload_file.
Here is how the child classes are split:
1st level
S3FileUploader - for uploading files to Amazon S3 buckets (an extension of IFileUploader)
LocalFileUploader - for uploading files to target local folders (an extension of IFileUploader)

2nd level
S3CSVFileUploader - for uploading CSV files to Amazon S3 buckets (an extension of S3FileUploader)
LocalCSVFileUploader - for uploading CSV files to target local folders (an extension of LocalFileUploader)

3rd level
PremierLeagueTableS3CSVUploader - for uploading CSV files containing football data on the Premier League table standings to Amazon S3 buckets (an extension of S3CSVFileUploader)
PremierLeagueTableLocalCSVUploader - for uploading CSV files containing football data on the Premier League table standings to target local folders (an extension of LocalCSVFileUploader)

The 3rd-level classes contain the upload_file implementations.
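Showing the full tree would get long, so here's a sketch of just the S3 CSV branch; the Local* branch mirrors it with a folder path in place of the S3 client, and the object key below is a placeholder:

```python
from abc import ABC, abstractmethod

import boto3
import pandas as pd


class IFileUploader(ABC):
    """Abstract interface for persisting the transformed data."""

    @abstractmethod
    def upload_file(self, df: pd.DataFrame) -> None:
        ...


class S3FileUploader(IFileUploader):
    """1st level: uploaders that target an Amazon S3 bucket."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.s3_client = boto3.client("s3")

    @abstractmethod
    def upload_file(self, df: pd.DataFrame) -> None:
        ...


class S3CSVFileUploader(S3FileUploader):
    """2nd level: S3 uploaders that write CSV files (still abstract)."""


class PremierLeagueTableS3CSVUploader(S3CSVFileUploader):
    """3rd level: uploads the Premier League standings CSV to S3."""

    def upload_file(self, df: pd.DataFrame) -> None:
        # Object key is a placeholder, not necessarily the one used in the repo
        body = df.to_csv(index=False).encode("utf-8")
        self.s3_client.put_object(Bucket=self.bucket, Key="prem_league_table.csv", Body=body)
```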
B. How does it satisfy SOLID principles?
Single responsibility (SRP) ✅ - The 3rd level classes have the responsibility of uploading files to their respective destinations
Open-closed principle (OCP) ✅ - We can incorporate additional file formats like JSON and text files by adding more small interfaces without modifying any existing code
Liskov substitution (LSP) ✅ -
All 2nd-level classes can be substituted for their 1st-level counterparts
All 3rd-level classes can be substituted for their 1st and 2nd-level counterparts
Interface segregation principle (ISP) ✅ - The IFileUploader contains a single method, upload_file, which is implemented by all concrete classes at the 3rd level and is the only method they need
Dependency inversion principle (DIP) ✅ -
All 1st-level child classes depend on the IFileUploader class, where the IFileUploader class is an abstract base class
All 2nd-level child classes depend on the 1st-level classes, where the 1st-level child classes are all abstract base classes
All 3rd-level child classes depend on the 2nd-level classes, where the 2nd-level child classes are all abstract base classes
No class is dependent on any concrete classes in the Data Loader object
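To round things off, here's a hypothetical composition root showing how these objects could be wired together purely through their abstractions. It assumes the classes sketched throughout this article (plus a PremierLeagueTableLocalCSVUploader mirroring the S3 one), so treat it as an outline of the idea rather than the repository's actual main_scraper.py:

```python
from selenium import webdriver

# Assumes the ILogger, IWebPageLoader, IPopUpHandler, IDataExtractor,
# IDataTransformer, IFileUploader classes and Config sketched above are in scope.


def run_pipeline() -> None:
    # Concrete classes are chosen in one place; everything else sees abstractions
    driver = webdriver.Chrome()
    logger: ILogger = NonColouredConsoleLogger()

    loader: IWebPageLoader = PremLeagueTableWebPageLoader(driver)
    popup_handler: IPopUpHandler = PremLeagueTablePopUpHandler(driver)
    extractor: IDataExtractor = PremLeagueTableStandingsDataExtractor(driver)
    transformer: IDataTransformer = PremierLeagueTableStandingsDataTransformer()

    # The config flag decides which uploader gets injected
    uploader: IFileUploader = (
        PremierLeagueTableS3CSVUploader(Config.S3_BUCKET)
        if Config.WRITE_FILES_TO_CLOUD
        else PremierLeagueTableLocalCSVUploader(Config.LOCAL_TARGET_PATH)
    )

    logger.info("Starting pipeline")
    loader.load_page("https://...")  # URL intentionally elided
    popup_handler.close_popup()
    raw_data = extractor.scrape_data()
    df = transformer.transform_data(raw_data)
    uploader.upload_file(df)
    logger.info("Pipeline finished")
    driver.quit()
```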
Resources📚
Here are the quick references to the links used in the article:
My old data engineering approach - https://github.com/sdw-online/web-to-databricks-pipeline/blob/main/scraper/main_scraper.py
My new data engineering approach (with SOLID) - https://github.com/sdw-online/football_web_scraper_2023/blob/stephen-dev-branch-01/scraper/main_scraper.py
You can find other blog posts at https://stephendavidwilliams.com/
Conclusion🏁
We have now explored how SOLID principles can be applied to data processing applications using Python. These design principles force us to carefully consider each component of our data workloads, which is a core tenet of the software engineering lifecycle.
Ensuring our codebase satisfies all 5 SOLID principles will safeguard our code from unintended alterations or surprises.
Feel free to reach out via my handles: LinkedIn | Email | Twitter