SOLID principles in data engineering - Part 2

The Object Oriented Programming (OOP) version

Preface 📖

Disclaimer: This article assumes you have a proficient level of Python programming knowledge. If you do get stuck in certain areas of the article, feel free to contact me directly through the handles at the bottom of this post and I'll be happy to walk you through areas you need me to shed light on 🌞

In part 1, I broke down what the SOLID principles are and included simplified real-world examples of what each one looks like from a developer’s standpoint.

This time, I aim to show you an example of a real data pipeline built using these design principles with Python.

Pipeline architecture📜

I will be using a web scraper I built for another project designed to scrape football data from a website, transform it into a Pandas dataframe, and then load it into the cloud or local environment.

You can find the source code in my GitHub repository here.

Pipeline breakdown🔎

Traditional data engineering approach

What does it look like?

The traditional data engineering mindset would break down the pipeline architecture pieces into:

Extractor > Transformer > Loader

...or, as it is more commonly known...

Extract > Transform > Load

Here's the role of each object in the ETL pipeline when processing data:

  • The Extractor object performs the extract (E)

  • The Transformer object performs the transform (T)

  • The Loader object performs the load (L)

What’s wrong with this?

Although it successfully scrapes the football data it was designed to process, this version violates the SOLID principles in many ways:

  • ❌ This program is open for modification but closed for extension (the reverse of what the open-closed principle asks for), meaning it would be difficult to upgrade or improve the code over time

  • ❌ Having the bot perform many duties like logging, configuration, and web scraping tightly couples the code which means changes to one part of the program may cause unintended consequences on other parts of the code

  • ❌ The program’s functionality depends directly on concrete implementations of libraries like boto3, logging, and pandas, so a change to how one of them is used can ripple through other parts of the code and has to be applied in multiple places

This isn’t an exhaustive list of the challenges faced with this approach, but these problems make it hard for us to test, extend, or refactor the codebase over time, which is a necessity for any production-grade application.

Software engineering approach

How do I do this?

Refactoring the pipeline requires approaching it with a software engineering mindset, which means diving a few layers deeper than the traditional data engineering breakdown shown earlier.

To comply with SOLID principles, we would need to break the pipeline into classes that

  • are employed to perform one job only (Single responsibility)

  • can have functionality added over time by extending them (for example, through new child classes) without modifying the existing, working code (Open-closed principle)

  • can be swapped for any child classes created from them without any code breaking (Liskov substitution)

  • only expose methods that every child class actually needs (Interface segregation)

  • depend on abstractions only (Dependency inversion)

What have I done?

So here are the classes I came up with:

  • ILogger - to log messages to a file and the console

  • Config - to manage environment variables

  • IWebPageLoader - to load the football URL in a browser

  • IPopUpHandler - to close the annoying cookie box displayed when you first enter a webpage

  • IDataExtractor - to scrape the football data from the HTML elements in the webpage

  • IDataTransformer - to transform the scraped data into a Pandas data frame

  • IFileUploader - to upload the transformed data into the cloud or local machine
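To make the list above a little more concrete, here is a minimal sketch of how these interfaces could be wired together. The method names (load_page, close_popup, scrape_data, transform_data, upload_file) are the ones used later in this article, but the method arguments, the run_pipeline function and the Stub classes are assumptions I made up for illustration, not the code in my repository. ILogger and Config are left out to keep it short.

```python
from abc import ABC, abstractmethod


class IWebPageLoader(ABC):
    @abstractmethod
    def load_page(self, url): ...


class IPopUpHandler(ABC):
    @abstractmethod
    def close_popup(self): ...


class IDataExtractor(ABC):
    @abstractmethod
    def scrape_data(self): ...


class IDataTransformer(ABC):
    @abstractmethod
    def transform_data(self, rows): ...


class IFileUploader(ABC):
    @abstractmethod
    def upload_file(self, records, file_name): ...


def run_pipeline(url, loader, popup_handler, extractor, transformer, uploader):
    """Hypothetical orchestration: it only relies on the abstract method names."""
    loader.load_page(url)
    popup_handler.close_popup()
    rows = extractor.scrape_data()
    records = transformer.transform_data(rows)
    return uploader.upload_file(records, "league_table")


# Tiny in-memory stand-ins so the sketch runs without a browser or AWS.
class StubLoader(IWebPageLoader):
    def load_page(self, url):
        self.url = url


class StubPopUpHandler(IPopUpHandler):
    def close_popup(self):
        self.closed = True


class StubExtractor(IDataExtractor):
    def scrape_data(self):
        return [["1", "Team A"], ["2", "Team B"]]  # placeholder rows


class StubTransformer(IDataTransformer):
    def transform_data(self, rows):
        return [{"position": int(pos), "team": team} for pos, team in rows]


class StubUploader(IFileUploader):
    def upload_file(self, records, file_name):
        return f"{file_name}: {len(records)} rows"


result = run_pipeline(
    "https://example.com/table",  # placeholder URL
    StubLoader(),
    StubPopUpHandler(),
    StubExtractor(),
    StubTransformer(),
    StubUploader(),
)
print(result)
```

Because run_pipeline never names a concrete class, any of the stubs can be replaced by a real implementation without touching the function itself.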

Implementation🎬

1. Logger📝

(GitHub source code for the Logger object can be found here to follow along)

A. What does it do?

The Logger object deals with logging events linked to the scraped content processed in the dataflow. It is expressed as ILogger in my codebase, an abstract class that contains 5 abstract methods, one for each severity level: debug, info, warning, error and critical.

The 1st-level child classes are:

  • FileLogger - for recording messages to a log file

  • ConsoleLogger - for streaming messages to the console

The ConsoleLogger has 2nd-level child classes of its own:

  • ColouredConsoleLogger - for streaming messages to the console in coloured format

  • NonColouredConsoleLogger - for streaming messages to the console with no colour formatting
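The console branch of this hierarchy could be sketched roughly as follows. This is a simplified illustration built on Python's standard logging module rather than a copy of my repository: the line_format hook, the constructor arguments and the announce function are assumptions made for the example, and FileLogger is left out for brevity.

```python
import logging
import sys
from abc import ABC, abstractmethod


class ILogger(ABC):
    """Abstract logger with one abstract method per severity level."""

    @abstractmethod
    def debug(self, message: str) -> None: ...

    @abstractmethod
    def info(self, message: str) -> None: ...

    @abstractmethod
    def warning(self, message: str) -> None: ...

    @abstractmethod
    def error(self, message: str) -> None: ...

    @abstractmethod
    def critical(self, message: str) -> None: ...


class ConsoleLogger(ILogger):
    """Shared console plumbing; child classes only decide the line format."""

    def __init__(self, name: str, stream=None) -> None:
        self._logger = logging.getLogger(name)
        self._logger.setLevel(logging.DEBUG)
        self._logger.propagate = False  # avoid duplicate lines via the root logger
        if not self._logger.handlers:  # attach the handler only once per name
            handler = logging.StreamHandler(sys.stdout if stream is None else stream)
            handler.setFormatter(logging.Formatter(self.line_format()))
            self._logger.addHandler(handler)

    @abstractmethod
    def line_format(self) -> str:
        """Hypothetical hook: return the logging format string."""

    def debug(self, message: str) -> None:
        self._logger.debug(message)

    def info(self, message: str) -> None:
        self._logger.info(message)

    def warning(self, message: str) -> None:
        self._logger.warning(message)

    def error(self, message: str) -> None:
        self._logger.error(message)

    def critical(self, message: str) -> None:
        self._logger.critical(message)


class NonColouredConsoleLogger(ConsoleLogger):
    def line_format(self) -> str:
        return "%(levelname)s | %(message)s"


class ColouredConsoleLogger(ConsoleLogger):
    def line_format(self) -> str:
        # Wrap each line in ANSI green, then reset the terminal colour.
        return "\033[32m%(levelname)s | %(message)s\033[0m"


def announce(logger: ILogger) -> None:
    """Works with any ILogger child, which is the Liskov substitution idea."""
    logger.info("Scraping started")


announce(NonColouredConsoleLogger("plain-demo"))
announce(ColouredConsoleLogger("colour-demo"))
```

The announce function only knows about ILogger, so swapping one console logger for the other (or for a file logger) needs no change to the calling code.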

B. How does it satisfy SOLID principles?

Here’s how the Logger object harmonizes with the SOLID principles:

  • Single responsibility (SRP) ✅ - Each parent and child class has one specific logging duty

  • Open-closed principle (OCP) ✅ - No existing class needs to be modified, but we can add more functionality through new child classes if the need arises

  • Liskov substitution (LSP) ✅ -

    • The FileLogger class can be substituted for the parent ILogger class

    • The ConsoleLogger class can be substituted for the parent ILogger class

    • The ColouredConsoleLogger class can be substituted for its parent ConsoleLogger class

    • The NonColouredConsoleLogger class can be substituted for its parent ConsoleLogger class

  • Interface segregation principle (ISP) ✅ - All child classes at every level share the same methods of their parent class without implementing methods they do not require

  • Dependency inversion principle (DIP) ✅ -

    • The FileLogger concrete class depends on the ILogger abstract class

    • The ConsoleLogger class depends on the ILogger abstract class

    • The ColouredConsoleLogger class depends on the ConsoleLogger abstract class

    • The NonColouredConsoleLogger class depends on the ConsoleLogger abstract class

    • No class is dependent on any concrete classes in the Logger object

2. Config⚙️

(GitHub source code for the Config object can be found here to follow along)

A. What does it do?

The Config class is a simple object for storing private credentials for accessing key environments like the main AWS S3 bucket and the local machine in use. It also contains a WRITE_FILES_TO_CLOUD flag for specifying whether files processed in the pipeline should be persisted to the cloud (True means ‘yes please’, False means ‘no’). Once set to False, files will be persisted locally in the format specified under the Data Loader section.
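A class like this could look roughly like the sketch below. Only the WRITE_FILES_TO_CLOUD flag comes from my description above. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS variable names, while S3_BUCKET, LOCAL_TARGET_PATH, the field names and the from_env helper are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Treat common truthy strings from environment variables as True."""
    return value.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class Config:
    aws_access_key_id: str
    aws_secret_access_key: str
    s3_bucket: str
    local_target_path: str
    write_files_to_cloud: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Config":
        # Fall back to the real process environment when no mapping is passed.
        env = os.environ if env is None else env
        return cls(
            aws_access_key_id=env.get("AWS_ACCESS_KEY_ID", ""),
            aws_secret_access_key=env.get("AWS_SECRET_ACCESS_KEY", ""),
            s3_bucket=env.get("S3_BUCKET", ""),  # hypothetical variable name
            local_target_path=env.get("LOCAL_TARGET_PATH", "."),  # hypothetical
            write_files_to_cloud=_as_bool(env.get("WRITE_FILES_TO_CLOUD", "False")),
        )


config = Config.from_env({"WRITE_FILES_TO_CLOUD": "True", "S3_BUCKET": "my-bucket"})
print(config.write_files_to_cloud, config.s3_bucket)
```

Keeping the environment lookup in one place means the rest of the pipeline never touches os.environ directly.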

B. How does it satisfy SOLID principles?

To be honest, it's a simple class that doesn't need derived classes anytime soon. The only principle it satisfies is the Single Responsibility principle (SRP), because it has only one reason to change: modifying the configuration settings for persisting the scraping bot’s output.

3. Webpage Loader🌐

(GitHub source code for the Webpage Loader object can be found here to follow along)

A. What does it do?

The Webpage Loader object uses the Selenium web driver to load the URL specified to it. In my code, it’s expressed as IWebPageLoader, an abstract interface for other webpage loaders to depend on.

A 1st-level child class named WebPageLoader inherits the load_page abstract method without implementing it, leaving the implementation to its own child class.

A 2nd level child class named PremLeagueTableWebPageLoader is created from WebPageLoader. The load_page method is implemented at this level to load the webpage containing the Premier League table.
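The three levels described above could be sketched like this. To keep the example runnable without a browser, I use a made-up FakeDriver in place of a real Selenium web driver (the only call relied on is the driver's get method, which Selenium drivers provide). The constructor argument and the URL are placeholders, not the ones from my repository.

```python
from abc import ABC, abstractmethod


class IWebPageLoader(ABC):
    @abstractmethod
    def load_page(self, url: str) -> None: ...


class WebPageLoader(IWebPageLoader):
    """Still abstract: stores the driver and leaves load_page to child classes."""

    def __init__(self, driver) -> None:
        self.driver = driver


class PremLeagueTableWebPageLoader(WebPageLoader):
    def load_page(self, url: str) -> None:
        # With a real Selenium driver, get(url) navigates the browser to the URL.
        self.driver.get(url)


class FakeDriver:
    """Hypothetical stand-in for a Selenium web driver."""

    def __init__(self) -> None:
        self.visited = []

    def get(self, url: str) -> None:
        self.visited.append(url)


driver = FakeDriver()
loader = PremLeagueTableWebPageLoader(driver)
loader.load_page("https://example.com/premier-league/table")  # placeholder URL
print(driver.visited)
```

Adding a loader for another league would mean adding one more child of WebPageLoader, with no edits to the classes shown here.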

B. How does it satisfy SOLID principles?

  • Single responsibility (SRP) ✅ - Each parent and child class is responsible for one simple duty in the Webpage Loader object

  • Open-closed principle (OCP) ✅ - If we need to add webpage loaders for other European leagues, we don’t need to alter existing code; we can simply add new loader classes

  • Liskov substitution (LSP) ✅ -

    • The WebPageLoader class can be substituted for the parent IWebPageLoader class

    • The PremLeagueTableWebPageLoader class can be substituted for its parent WebPageLoader class

  • Interface segregation principle (ISP) ✅ - Each child class uses the same load_page method of its parent class

  • Dependency inversion principle (DIP) ✅ -

    • The WebPageLoader abstract class depends on the IWebPageLoader abstract class

    • The PremLeagueTableWebPageLoader class depends on the WebPageLoader abstract class

    • No class is dependent on any concrete classes in the Webpage Loader object

4. Popup Handler🪟

(GitHub source code for the Popup Handler object can be found here to follow along)

A. What does it do?

The Popup Handler is designed to close popup windows appearing in the browser about cookies, ads or subscription services we’re probably not interested in. This is expressed through the IPopUpHandler abstract class, which contains a close_popup abstract method.

Similar to the hierarchy of the Webpage Loader classes, it contains a 1st-level child class named PopUpHandler, which inherits from the IPopUpHandler class and likewise leaves the close_popup abstract method unimplemented.

A 2nd-level child class named PremLeagueTablePopUpHandler implements the close_popup method to close the popup boxes in the browser.
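Here is a rough sketch of that hierarchy. The XPath selector and the Fake classes are made up for illustration, and skipping the click when no popup is found is my own assumption about sensible behaviour. With a real Selenium driver, find_elements returns an empty list when nothing matches, and the string "xpath" is the value behind By.XPATH.

```python
from abc import ABC, abstractmethod


class IPopUpHandler(ABC):
    @abstractmethod
    def close_popup(self) -> None: ...


class PopUpHandler(IPopUpHandler):
    """Still abstract: stores the driver and leaves close_popup to child classes."""

    def __init__(self, driver) -> None:
        self.driver = driver


class PremLeagueTablePopUpHandler(PopUpHandler):
    COOKIE_BUTTON_XPATH = "//button[@id='accept-cookies']"  # hypothetical selector

    def close_popup(self) -> None:
        # An empty list means the popup never appeared, so there is nothing to click.
        buttons = self.driver.find_elements("xpath", self.COOKIE_BUTTON_XPATH)
        if buttons:
            buttons[0].click()


class FakeButton:
    """Hypothetical stand-in for a Selenium web element."""

    def __init__(self) -> None:
        self.clicked = False

    def click(self) -> None:
        self.clicked = True


class FakeDriver:
    """Hypothetical stand-in for a Selenium web driver."""

    def __init__(self, buttons) -> None:
        self._buttons = buttons

    def find_elements(self, by, value):
        return self._buttons


button = FakeButton()
PremLeagueTablePopUpHandler(FakeDriver([button])).close_popup()
print(button.clicked)
```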

B. How does it satisfy SOLID principles?

  • Single responsibility (SRP) ✅ - Each parent and child class in the Popup Handler object contains only one unit of work each

  • Open-closed principle (OCP) ✅ - Splitting the subclasses into smaller interfaces allows us to add new behaviours to the platform without modifying existing code

  • Liskov substitution (LSP) ✅ -

    • The PopUpHandler class can be substituted for the parent IPopUpHandler class

    • The PremLeagueTablePopUpHandler class can be substituted for its parent PopUpHandler class

  • Interface segregation principle (ISP) ✅ - Each child class uses the same close_popup method of its parent class

  • Dependency inversion principle (DIP) ✅ -

    • The PopUpHandler abstract class depends on the IPopUpHandler abstract class

    • The PremLeagueTablePopUpHandler concrete class depends on the PopUpHandler abstract class

    • No class is dependent on any concrete classes in the Popup Handler object

5. Data Extractor🧪

(GitHub source code for the Data Extractor object can be found here to follow along)

A. What does it do?

The Data Extractor is the first object in the program dealing with the actual data to be processed. Its interface is named IDataExtractor, and is responsible for scraping the football data from the internet. The IDataExtractor is an abstract class that contains an abstract method called scrape_data.

This consists of children classes at the

  • 1st level - TableStandingsDataExtractor: an abstract class for a table that contains the scraped data relating to each football team’s ranking in the league. This is dependent on the abstraction of the IDataExtractor class.

  • 2nd level - PremLeagueTableStandingsDataExtractor: the concrete class for scraped data about each Premier League team’s position in the standings table. This is where the scrape_data method is implemented to perform the scraping activity.
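The extractor levels above could be sketched as follows. The XPath selectors, the return type (a list of rows, each a list of cell texts) and the Fake classes are assumptions for the example, and the team names are placeholders. The find_elements method and the text attribute mirror what Selenium drivers and elements provide.

```python
from abc import ABC, abstractmethod
from typing import List


class IDataExtractor(ABC):
    @abstractmethod
    def scrape_data(self) -> List[List[str]]: ...


class TableStandingsDataExtractor(IDataExtractor):
    """Still abstract: stores the driver and leaves scrape_data to child classes."""

    def __init__(self, driver) -> None:
        self.driver = driver


class PremLeagueTableStandingsDataExtractor(TableStandingsDataExtractor):
    ROW_XPATH = "//table//tbody/tr"  # hypothetical selector
    CELL_XPATH = "./td"  # hypothetical selector

    def scrape_data(self) -> List[List[str]]:
        rows = self.driver.find_elements("xpath", self.ROW_XPATH)
        return [
            [cell.text for cell in row.find_elements("xpath", self.CELL_XPATH)]
            for row in rows
        ]


class FakeCell:
    def __init__(self, text: str) -> None:
        self.text = text


class FakeRow:
    def __init__(self, cells) -> None:
        self._cells = [FakeCell(text) for text in cells]

    def find_elements(self, by, value):
        return self._cells


class FakeDriver:
    """Hypothetical stand-in for a Selenium web driver showing a two-row table."""

    def __init__(self, rows) -> None:
        self._rows = [FakeRow(cells) for cells in rows]

    def find_elements(self, by, value):
        return self._rows


extractor = PremLeagueTableStandingsDataExtractor(
    FakeDriver([["1", "Team A", "9"], ["2", "Team B", "7"]])
)
scraped = extractor.scrape_data()
print(scraped)
```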

B. How does it satisfy SOLID principles?

  • Single responsibility (SRP) ✅ - Each parent and child class in the Data Extractor object is assigned one duty to handle

  • Open-closed principle (OCP) ✅ - The code is designed so that we can add new behaviours without editing or removing what’s already in the code

  • Liskov substitution (LSP) ✅ -

    • The TableStandingsDataExtractor class can be substituted for the parent IDataExtractor class

    • The PremLeagueTableStandingsDataExtractor class can be substituted for its parent TableStandingsDataExtractor class

  • Interface segregation principle (ISP) ✅ - Each child class uses the same scrape_data method of its parent class, ensuring no class depends on methods it does not need to use

  • Dependency inversion principle (DIP) ✅ -

    • The TableStandingsDataExtractor abstract class depends on the IDataExtractor abstract class

    • The PremLeagueTableStandingsDataExtractor concrete class depends on the TableStandingsDataExtractor abstract class

    • No class is dependent on any concrete classes in the Data Extractor object

6. Data Transformer🔄

(GitHub source code for the Data Transformer object can be found here to follow along)

A. What does it do?

The Data Transformer object is tasked with cleaning the data and converting it into a dataframe. It is expressed as the IDataTransformer interface, with one abstract method named transform_data.

Its only direct child class is TableStandingsDataTransformer, a derived class that inherits from the IDataTransformer interface and is responsible for transforming scraped data related to the table standings of different football teams.

The PremierLeagueTableStandingsDataTransformer is a concrete class that extends the TableStandingsDataTransformer class and implements the transform_data method.
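Here is a simplified sketch of the transformer hierarchy. To keep it free of third-party dependencies, it stops at a list of cleaned records instead of building the Pandas dataframe (passing such a list to pd.DataFrame would be the last step). The column names, the cleaning rules and the input shape are assumptions for illustration, not the ones in my repository.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Union

Record = Dict[str, Union[int, str]]


class IDataTransformer(ABC):
    @abstractmethod
    def transform_data(self, rows: List[List[str]]) -> List[Record]: ...


class TableStandingsDataTransformer(IDataTransformer):
    """Still abstract: a common parent for every league's table transformer."""


class PremierLeagueTableStandingsDataTransformer(TableStandingsDataTransformer):
    COLUMNS = ("position", "team", "played", "points")  # hypothetical columns
    INTEGER_COLUMNS = ("position", "played", "points")

    def transform_data(self, rows: List[List[str]]) -> List[Record]:
        records: List[Record] = []
        for row in rows:
            cleaned = [cell.strip() for cell in row]
            if len(cleaned) != len(self.COLUMNS):
                continue  # skip malformed rows rather than guessing their layout
            record: Record = dict(zip(self.COLUMNS, cleaned))
            for column in self.INTEGER_COLUMNS:
                record[column] = int(record[column])
            records.append(record)
        return records


transformer = PremierLeagueTableStandingsDataTransformer()
records = transformer.transform_data([[" 1 ", "Team A", "3", "9"], ["incomplete row"]])
print(records)
```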

B. How does it satisfy SOLID principles?

  • Single responsibility (SRP) ✅ - Each class in the Data Transformer is responsible for only one task

  • Open-closed principle (OCP) ✅ - If we need to add other data transformers to process other football statistics, we can easily add more interfaces without altering existing code

  • Liskov substitution (LSP) ✅ -

    • The TableStandingsDataTransformer class can be substituted for the parent IDataTransformer class

    • The PremierLeagueTableStandingsDataTransformer class can be substituted for its parent TableStandingsDataTransformer class

  • Interface segregation principle (ISP) ✅ - Each child class uses the same transform_data method of its parent class

  • Dependency inversion principle (DIP) ✅ -

    • The TableStandingsDataTransformer abstract class depends on the IDataTransformer abstract class

    • The PremierLeagueTableStandingsDataTransformer class depends on the TableStandingsDataTransformer abstract class

    • No class is dependent on any concrete classes in the Data Transformer object

7. Data Loader💾

(GitHub source code for the Data Loader object can be found here to follow along)

A. What does it do?

The Data Loader object is designed to upload clean data into the target environment (cloud or local machine) configured by the main user. The primary interface is named IFileUploader, an abstract class with one abstract method named upload_file.

Here is how the child classes are split:

  • 1st level

    • S3FileUploader - for uploading files to Amazon S3 buckets (an extension of IFileUploader)

    • LocalFileUploader - for uploading files to target local folders (an extension of IFileUploader)

  • 2nd level

    • S3CSVFileUploader - for uploading CSV files to Amazon S3 buckets (an extension of S3FileUploader)

    • LocalCSVFileUploader - for uploading CSV files to target local folders (an extension of LocalFileUploader)

  • 3rd level

    • PremierLeagueTableS3CSVUploader - for uploading CSV files containing football data on the Premier League table standings to Amazon S3 buckets (an extension of S3CSVFileUploader)

    • PremierLeagueTableLocalCSVUploader - for uploading CSV files containing football data on the Premier League table standings to target local folders (an extension of LocalCSVFileUploader)

The 3rd-level classes contain the upload_file implementations.
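The uploader levels above could be sketched like this. The upload_file arguments, the to_csv_text helper, the build_uploader function and FakeS3Client are assumptions made for the example. The S3 branch expects an injected client exposing put_object (as a boto3 S3 client does), so the sketch runs without AWS credentials.

```python
import csv
import io
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


def to_csv_text(records) -> str:
    """Serialise a non-empty list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


class IFileUploader(ABC):
    @abstractmethod
    def upload_file(self, records, file_name: str) -> str: ...


class LocalFileUploader(IFileUploader):
    def __init__(self, target_dir) -> None:
        self.target_dir = Path(target_dir)


class LocalCSVFileUploader(LocalFileUploader):
    extension = ".csv"


class PremierLeagueTableLocalCSVUploader(LocalCSVFileUploader):
    def upload_file(self, records, file_name: str) -> str:
        self.target_dir.mkdir(parents=True, exist_ok=True)
        path = self.target_dir / f"{file_name}{self.extension}"
        # newline="" stops Python translating the csv module's own line endings.
        with path.open("w", newline="", encoding="utf-8") as handle:
            handle.write(to_csv_text(records))
        return str(path)


class S3FileUploader(IFileUploader):
    def __init__(self, client, bucket: str) -> None:
        self.client = client  # e.g. a boto3 S3 client, injected by the caller
        self.bucket = bucket


class S3CSVFileUploader(S3FileUploader):
    extension = ".csv"


class PremierLeagueTableS3CSVUploader(S3CSVFileUploader):
    def upload_file(self, records, file_name: str) -> str:
        key = f"{file_name}{self.extension}"
        body = to_csv_text(records).encode("utf-8")
        self.client.put_object(Bucket=self.bucket, Key=key, Body=body)
        return f"s3://{self.bucket}/{key}"


class FakeS3Client:
    """Hypothetical stand-in that records uploads instead of calling AWS."""

    def __init__(self) -> None:
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


def build_uploader(write_files_to_cloud, client, bucket, target_dir) -> IFileUploader:
    """Pick the destination from the WRITE_FILES_TO_CLOUD flag."""
    if write_files_to_cloud:
        return PremierLeagueTableS3CSVUploader(client, bucket)
    return PremierLeagueTableLocalCSVUploader(target_dir)


sample = [{"position": 1, "team": "Team A"}]  # placeholder data
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = build_uploader(False, None, "", tmp_dir).upload_file(sample, "prem_league_table")
    print(Path(saved_path).name)
```

The caller only ever holds an IFileUploader, so switching between the cloud and the local machine is a matter of which child class gets built.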

B. How does it satisfy SOLID principles?

  • Single responsibility (SRP) ✅ - The 3rd level classes have the responsibility of uploading files to their respective destinations

  • Open-closed principle (OCP) ✅ - We can incorporate additional file formats like JSON and text files by adding more small interfaces without modifying any existing code

  • Liskov substitution (LSP) ✅ -

    • All 2nd-level classes can be substituted for their 1st-level counterparts

    • All 3rd-level classes can be substituted for their 1st and 2nd-level counterparts

  • Interface segregation principle (ISP) ✅ - The IFileUploader contains a single method, upload_file, which is the only method the concrete 3rd-level classes need to implement

  • Dependency inversion principle (DIP) ✅ -

    • All 1st level child classes depend on the IFileUploader class, where the IFileUploader class is an abstract base class

    • All 2nd-level child classes depend on the 1st-level classes, where the 1st-level child classes are all abstract base classes

    • All 3rd-level child classes depend on the 2nd-level classes, where the 2nd-level child classes are all abstract base classes

    • No class is dependent on any concrete classes in the Data Loader object

Resources📚

You can find other blog posts at https://stephendavidwilliams.com/

Conclusion🏁

We have now explored how SOLID principles can be applied to data processing applications using Python. These design principles force us to carefully consider each component in the data workloads, which is a core tenet of the software engineering lifecycle.

Checking our codebase against all 5 SOLID principles helps safeguard our code from unintended side effects and surprises when it changes.

Feel free to reach out via my handles: LinkedIn | Email | Twitter