Functional programming in data engineering with Python - Part 1

Photo by Chris Ried on Unsplash

Preface 📖

This is an introduction to a series on functional programming in data engineering using Python. Here I lay out some of the fundamental concepts and tools found in functional programming using Python code.

What is functional programming?🎛️

Functional programming is a declarative type of programming used to build bug-resistant programs and applications through the use of functions.

In other words, it's a computing paradigm that emphasizes the use of pure functions and immutable data structures for mitigating side effects instead of specifying the steps on how to perform tasks (imperative programming).

What is a function?🔄

A function is an object that turns inputs into outputs.

You can have functions that perform simple operations where they simply convert inputs to outputs. You can also have more sophisticated functions that are linked with other functions in a complex system. In such cases, functions can be:

  • An input (parameter for another function)

  • An output (result from another function)

Also, functions:

  • are treated as first-class citizens (first-class functions)

  • can be passed into other functions as arguments or returned from other functions (higher-order functions)

  • can be connected with other functions to form new ones (function composition)

  • should be designed to work with other functions (reusability)
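To make these properties concrete, here is a minimal sketch. The names shout, apply_twice and make_greeter are hypothetical and exist only for illustration:

```python
# Hypothetical helper names, used only to illustrate the properties listed above

def shout(text):
    return text.upper() + "!"

# First-class: a function can be assigned to a variable like any other value
announce = shout

# Higher-order: a function can accept another function as an argument...
def apply_twice(func, value):
    return func(func(value))

# ...or build and return a new function
def make_greeter(greeting):
    def greet(name):
        return f"{greeting}, {name}"
    return greet

say_hello = make_greeter("Hello")

print(announce("data"))            # DATA!
print(apply_twice(shout, "data"))  # DATA!!
print(say_hello("Sam"))            # Hello, Sam
```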

Core principles of functional programming🌟

  • monoids

  • immutability

  • recursion

  • function composition

  • dependency injection

  • currying

  • referential transparency

  • lazy evaluations

Monoids✨

A monoid consists of a set of values, a binary operation that combines two of those values, and an identity element. From a functional programming perspective, a monoid is a type whose values can be combined with one another while satisfying these properties:

  • the output shares the same type as its input arguments (closure)

  • a neutral element that doesn’t change the answer when combined with other input arguments is present (identity)

  • the values can be grouped in any way and still return the same answer, as long as their order is kept (associativity)

Irrespective of the internal operations in the function, closure requires that the result shares the same type as the input arguments; the operation qualifies as a monoid only when identity and associativity hold as well. A monoid’s main task is combining data of the same type like integers, strings, lists etc. For example, a function that adds two amounts of currency in $ and returns an amount in $ satisfies closure, and with $0 as the identity element it forms a monoid.

Properties of Monoids🔑

Remember, three main properties make a monoid:

  • Closure - you take type X and return an output type of X; the input’s type is the same as the output’s type

  • Identity - you combine a value with a neutral value (such as 0 or an empty string) and the result remains the same

  • Associativity - you can group the values however you like (without reordering them), but you will still get the same answer

Let’s break down each property:

  • Closure

Closure refers to the binary operation’s ability to produce an output that is the same type as the input arguments of a function. In other words, a function satisfies closure if its combination operation creates a result that shares the same type as its inputs.

A binary operation is an operation that combines exactly two values to produce another value, such as addition, multiplication, subtraction or division.

Here are examples that fail to qualify as monoids:

❌Mathematical example:

10 / 8 = 1.25

This example involves an integer dividing another integer but the result is a float type

❌Text example:

"Add" + 20 = #TypeError

A string cannot be concatenated with an integer - the result will be a TypeError

❌Code example:

def add_items(a, b):
    return a + b

result = add_items(10.25, 20)
print(result)
### Output ###

# 30.25

The add_items function doesn’t qualify as a monoid either because there’s no guarantee the input parameters will share the same data type: we can easily pass a float and an integer into add_items, which returns a float.

Here are examples that qualify as monoids:

✅Mathematical example:

2 + 2 = 4

An integer plus an integer results in an integer

✅String example:

"Hello " + "friend" = "Hello friend"

A string concatenated with another string outputs a string

✅Code example:

def add_items(a: int, b: int) -> int:
    if not isinstance(a, int) or not isinstance(b, int):
        raise TypeError("Only integers can be inserted into the function")
    return a + b

result = add_items(10.25, 20)
print(result)
### Output ###

# Traceback (most recent call last):
# File "<string>", line 6, in <module>
# File "<string>", line 3, in add_items
# TypeError: Only integers can be inserted into the function

The add_items operation only accepts and returns integers, and raises an exception if any other data type is passed into the function, so it satisfies the closure rule for monoids.

  • Identity🆔

This states that the operation must have a neutral value that leaves any other value unchanged when the two are combined. In short, a monoid must have an identity element.

An identity element (or neutral element) is a value that, when combined with any other value using the monoid’s operation, returns that other value unchanged. So in an equation like 3 * 1 = 3, the value 1 is the identity element.

Note: The identity element depends on the binary operation selected for the monoid. Although 1 is the identity element for multiplication, it is not the identity element for addition because 3 + 1 does not equal 3; for addition, the identity element is 0.

Examples of this include:

10 + 0 = 10

The identity element here is 0, because the result of 10 remains unchanged

"Hello Sam" + "" = "Hello Sam"

The empty double quotes used here are the identity element as the output of the operation is still "Hello Sam"

[2, 4, 6] + [] = [2, 4, 6]

This concatenation operation uses the empty square brackets as the identity element and the results remain the same.
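One practical consequence of the identity examples above, sketched here under the assumption that values are combined with functools.reduce: the identity element makes a natural starting value, so combining an empty collection still returns a sensible result. The combine_all helper is a hypothetical name used for illustration only.

```python
from functools import reduce
import operator

# Hypothetical helper: fold a collection using a binary operation and its identity element
def combine_all(operation, identity, values):
    return reduce(operation, values, identity)

print(combine_all(operator.add, 0, [10, 20, 30]))        # 60
print(combine_all(operator.mul, 1, [2, 3, 4]))           # 24
print(combine_all(operator.add, "", ["Hello", " Sam"]))  # Hello Sam
print(combine_all(operator.add, [], [[2, 4], [6]]))      # [2, 4, 6]

# With no values at all, the identity element is returned unchanged
print(combine_all(operator.add, 0, []))                  # 0
```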

  • Associativity🔁

This states that the monoid’s operation must be associative. Associative means the operation returns the same answer no matter how you group the values, as long as their order stays the same.

For example, (2 + 3) + 4 is the same as 2 + (3 + 4), so addition is associative and satisfies this rule.

Here’s a code example of this:

def multiply_items(x, y):
    return x * y


multiply_numbers_1 = multiply_items(multiply_items(3, 4), 5)     # (3 * 4) * 5
multiply_numbers_2 = multiply_items(3, multiply_items(4, 5))     # 3 * (4 * 5)

print(f'1st approach: {multiply_numbers_1}')
print(f'2nd approach: {multiply_numbers_2}')
### Output ###

# 1st approach: 60
# 2nd approach: 60

This code multiplies two arguments passed into the multiply_items function. By multiplying the numbers 3, 4 and 5 with two different groupings, (3 * 4) * 5 and 3 * (4 * 5), and getting 60 both times, we demonstrate how multiplication meets the associativity rule for monoids.
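Not every operation passes this check. As a contrast, the sketch below uses a hypothetical is_associative_for helper to compare both groupings for a single triple of values. Passing for one triple does not prove associativity in general, but a single failure is enough to disprove it:

```python
import operator

# Hypothetical helper: compare (a op b) op c with a op (b op c) for one triple of values
def is_associative_for(operation, a, b, c):
    return operation(operation(a, b), c) == operation(a, operation(b, c))

print(is_associative_for(operator.mul, 3, 4, 5))   # True:  (3 * 4) * 5 == 3 * (4 * 5)
print(is_associative_for(operator.sub, 3, 4, 5))   # False: (3 - 4) - 5 != 3 - (4 - 5)
```

Subtraction fails here because (3 - 4) - 5 gives -6 while 3 - (4 - 5) gives 4, so integers under subtraction do not form a monoid.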

Like Scott Wlaschin often says in his talks, once you understand monoids you begin to see them everywhere!

Immutability🧱

An object is immutable if it cannot be changed or altered under any circumstance. In functional programming, immutability is when an object’s state is unable to change once it has been created or initialized. So under this form of programming, we only create new objects instead of modifying existing ones.

Using immutable objects helps to avoid unintended side effects like changing the state of objects outside the functions used.

Let’s take a bank statement as an example - transactions cannot be modified or deleted once they are recorded onto a bank statement, you can only add new transactions to record changes to existing ones in the form of corrections, adjustments or reversals.

Types of data

  1. Mutable data structures🔀

Common examples of mutable data structures include:

  • lists

  • dictionaries

  • sets

  2. Immutable data structures🔒

Common examples of immutable data structures include:

  • strings

  • tuples

  • frozensets
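As a quick illustration of the difference, the sketch below tries the same item assignment on a list and on a tuple. The variable names are made up for this example:

```python
mutable_dates = ['2023-01-01', '2023-02-01']
immutable_dates = ('2023-01-01', '2023-02-01')

# A list can be changed in place
mutable_dates[0] = '2023/01/01'
print(mutable_dates)    # ['2023/01/01', '2023-02-01']

# A tuple rejects item assignment with a TypeError
try:
    immutable_dates[0] = '2023/01/01'
except TypeError as error:
    print(f'TypeError: {error}')

# "Changing" a tuple means building a new one; the original is untouched
updated_dates = ('2023/01/01',) + immutable_dates[1:]
print(immutable_dates)  # ('2023-01-01', '2023-02-01')
print(updated_dates)    # ('2023/01/01', '2023-02-01')
```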

Examples

Mutable operation

Here is an example of a function that modifies a mutable data type in place:

# Create function for standardizing a sequence of dates 
def change_date_formats(dates):
    for i, date in enumerate(dates):
        dates[i] = date.replace('-', '/')
    return dates

# Run the operations
list_of_dates = ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']
transformed_dates = change_date_formats(list_of_dates)

# Display results
print(f'Old dates: {list_of_dates} ')
print(f'New dates: {transformed_dates} ')
### Output ###

# Old dates: ['2023/01/01', '2023/02/01', '2023/03/01', '2023/04/01'] 
# New dates: ['2023/01/01', '2023/02/01', '2023/03/01', '2023/04/01']

The old dates have been overwritten with the new results by the change_date_formats function.

Immutable operation

Here is an approach that doesn't modify the existing list (remember, it's recommended you use immutable data structures in general, especially for parallel computing activities):

# Create function for standardizing a sequence of dates 
def change_date_formats(dates):
    new_dates = [date.replace('-', '/') for date in dates]
    return new_dates

# Run the operations
list_of_dates = ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']
transformed_dates = change_date_formats(list_of_dates)

# Display results
print(f'Old dates: {list_of_dates} ')
print(f'New dates: {transformed_dates} ')
### Output ###

# Old dates: ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'] 
# New dates: ['2023/01/01', '2023/02/01', '2023/03/01', '2023/04/01']

The change_date_formats function uses a list comprehension to perform the transformation job on each date while recreating them as new dates once the - characters are replaced with /.

However, although our approach is correct, our new output is still in a list format and can be accidentally modified or take up more memory space than we need. The severity of this becomes more apparent once the data grows larger.

The best approach is to use an immutable data type, like tuples:

# Create function for standardizing a sequence of dates 
def change_date_formats(dates):
    new_dates = tuple(date.replace('-', '/') for date in dates)
    return new_dates

# Run the operations
tuples_of_dates = ('2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01')
transformed_dates = change_date_formats(tuples_of_dates)

# Display results
print(f'Old dates: {tuples_of_dates} ')
print(f'New dates: {transformed_dates} ')
### Output ###

# Old dates: ('2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01') 
# New dates: ('2023/01/01', '2023/02/01', '2023/03/01', '2023/04/01')

Now that the results are in a tuple, the dates cannot be modified in place by the user or the program, and a tuple typically uses slightly less memory than a list holding the same items.

Recursion♻️

Recursion is when a function keeps calling itself until it accomplishes its main goal. Functions that do this are called recursive functions.

A purely recursive function replaces for and while loops with calls to itself, each one working on a smaller piece of the problem.

Examples

Iterative function🚫

An iterative function loops through a sequence of elements and applies an operation to each element within the sequence.

Here’s an example of an iterative function that multiplies a list of numbers:

def multiply_numbers(numbers):
    total = 1
    for number in numbers:
        total *= number
    return total 

list_of_numbers = [1, 2, 3]
print(multiply_numbers(list_of_numbers))
### Output ###

# 6

This function doesn’t qualify as a recursive function because it uses a for loop to achieve the multiplication operation.

Recursive function✔️

This is what the recursive version looks like:

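def multiply_numbers(numbers):
    # Base case: an empty sequence returns 1, the identity element for multiplication
    if not numbers:
        return 1
    # Recursive case: multiply the first number by the product of the rest
    return numbers[0] * multiply_numbers(numbers[1:])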

list_of_numbers = [1, 2, 3]
print(multiply_numbers(list_of_numbers))
### Output ###

# 6

However, there is a cost to using recursion: each recursive call adds overhead, and Python limits how deep recursive calls can go, so in many scenarios we trade fast performance for simple and readable functions. Like any trade-off, you should perform a cost-benefit analysis to determine whether this is worth it for your distinctive use cases.

So use:

  • recursion for easy-to-read functions

  • iteration for speedy functions
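One concrete cost worth seeing is Python's recursion depth limit. The sketch below uses two hypothetical functions, count_down and count_down_iteratively, and reads the limit from sys.getrecursionlimit() rather than assuming a value, since it depends on the environment:

```python
import sys

# Recursive version: every call adds a new frame to the call stack
def count_down(n):
    if n == 0:
        return 0
    return count_down(n - 1)

# Iterative version: a loop reuses the same frame
def count_down_iteratively(n):
    while n > 0:
        n -= 1
    return n

limit = sys.getrecursionlimit()
print(count_down(50))  # 0

# Going deeper than the interpreter's limit raises RecursionError
try:
    count_down(limit + 100)
except RecursionError:
    print('RecursionError: input is too deep for recursion')

# The loop handles the same input without trouble
print(count_down_iteratively(limit + 100))  # 0
```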

Function composition🔗

Function composition occurs when you create a new function by combining multiple smaller functions. This is an effective way of abstracting complex behaviours from several functions into one simple function that acts as an API.

Imagine flying from one country to another. As a flight passenger, all you care about is just getting to your destination in a safe and relaxed manner - many complex activities are occurring during your entire flight to make that possible that are abstracted (hidden) from you.

Examples

Without function composition🚫

Let’s create some functions that perform simple data cleaning activities like removing whitespace, converting data to lowercase and removing any dollar signs:

def strip_whitespace(data):
    return data.strip()

def use_lowercase(data):
    return data.lower()

def remove_dollar_sign(data):
    return data.replace('$', '')

dummy_data              =   "  $100.00  "
no_space_data           =   strip_whitespace(dummy_data)
lowercase_data          =   use_lowercase(no_space_data)
no_dollar_data          =   remove_dollar_sign(lowercase_data)

print(no_dollar_data)
### Output ###

# 100.00

This example doesn’t comply with function composition because the functions are applied sequentially without forming a new function in the process.

With function composition✔️

You can implement composition by creating helper functions. Helper functions are functions that assist in chaining functions together.

Here’s an example:

from functools import reduce

# Create data cleaning functions
def strip_whitespace(data):
    return data.strip()

def use_lowercase(data):
    return data.lower()

def remove_dollar_sign(data):
    return data.replace('$', '')

# Create helper function
def compose_data_cleaning_functions(*functions):
    def compose(x):
        return reduce(lambda v, f: f(v), functions, x)
    return compose

# Use helper function to combine cleaning functions into one new function
transform_data           =   compose_data_cleaning_functions(strip_whitespace, use_lowercase, remove_dollar_sign)
dummy_data               =   "  $100.00  "

cleaned_data            =   transform_data(dummy_data)
print(cleaned_data)
### Output ###

# 100.00

We’ve now been able to implement composition by creating a helper function called compose_data_cleaning_functions to allow us to create an API called transform_data, which performs the complex hidden data cleaning jobs behind the scenes.

Dependency injections💉

A dependency injection occurs when a resource or behaviour is passed into a function instead of being hard-coded into one.

Examples

Without dependency injection🚫

Let’s see which football team in the Premier League our program believes is the best:

team_name = 'Manchester United'

def display_message():
    print(f'{team_name} is the best team in the Premier League!')

display_message()
### Output ###

# Manchester United is the best team in the Premier League!

This code contradicts the dependency injection rule because the function’s internal operations are coupled to the team_name global variable located outside the function. So although the team_name variable is a dependency for the function in this context, it isn’t “injected” into it via an input parameter.

With dependency injection✔️

def display_message(team_name):
    print(f'{team_name} is the best team in the Premier League!')

team_name = 'Manchester United'
display_message(team_name)
### Output ###

# Manchester United is the best team in the Premier League!

This method is flexible enough to have any team names passed into it, so if another team ends up performing better in the Premier League than Manchester United (highly unlikely), then we can easily pass in their names without interfering with the function’s internal code.
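The same idea applies to behaviour, not just values. In the sketch below, a hypothetical build_report_name function receives its "clock" as an argument (get_today) instead of calling date.today() internally, so the caller decides where the date comes from:

```python
from datetime import date

# The clock is injected as a function, so the caller decides where "today" comes from
def build_report_name(prefix, get_today):
    return f'{prefix}_{get_today().isoformat()}.csv'

# Production-style call: inject the real system date
print(build_report_name('sales', date.today))

# Test-style call: inject a fixed date so the result is predictable
fixed_today = lambda: date(2023, 1, 1)
print(build_report_name('sales', fixed_today))   # sales_2023-01-01.csv
```

Injecting the clock this way is one option among several; the point is only that the function no longer hard-codes how the date is obtained.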

Currying🍛

Currying is turning a function that takes in multiple arguments into a sequence of nested functions that each take one argument.

By converting a multiple-argument function to a hierarchy of single-argument functions, we can create more modular and reusable code that makes it easier to create partial functions for different use cases.

Examples

Without currying🚫

def make_breakfast(toasts, sausages, eggs):
  return f"Breakfast made with {toasts} pieces of toasts, {sausages} sausages and {eggs} scrambled eggs"

my_breakfast = make_breakfast(3, 4, 3)
print(my_breakfast)
### Output ###

# Breakfast made with 3 pieces of toasts, 4 sausages and 3 scrambled eggs

This example isn’t considered to be currying because the make_breakfast function takes 3 input arguments at once.

With currying✔️

def make_breakfast(toasts):
    def add_sausages(sausages):
        def add_eggs(eggs):
            return f"Breakfast made with {toasts} pieces of toasts, {sausages} sausages and {eggs} scrambled eggs"
        return add_eggs
    return add_sausages

my_breakfast = make_breakfast(3)(4)(3)
print(my_breakfast)
### Output ###

# Breakfast made with 3 pieces of toasts, 4 sausages and 3 scrambled eggs

By splitting the make_breakfast function into single-argument functions nested inside it, it’s easier to see the inputs required to make the function work. We can also reuse the make_breakfast function to create partial functions for other bespoke uses.

def make_breakfast(toasts):
    def add_sausages(sausages):
        def add_eggs(eggs):
            return f"Breakfast made with {toasts} pieces of toasts, {sausages} sausages and {eggs} scrambled eggs"
        return add_eggs
    return add_sausages

make_breakfast_with_2_toasts = make_breakfast(2)
my_breakfast = make_breakfast_with_2_toasts(6)(4)

print(my_breakfast)
### Output ###

# Breakfast made with 2 pieces of toasts, 6 sausages and 4 scrambled eggs

In this example, we created a partially applied function, make_breakfast_with_2_toasts, by calling make_breakfast with 2. This fixes the number of toasts at 2 while still letting us specify any number of sausages and eggs we want in our breakfast. This demonstrates the reusability and flexibility of curried functions for various use cases.
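If rewriting a function as nested single-argument functions feels heavy, Python's standard library offers functools.partial, which fixes some arguments of an ordinary multi-argument function. This is partial application rather than currying in the strict sense, but it gives a similar kind of reuse. A minimal sketch, reusing the uncurried make_breakfast from earlier:

```python
from functools import partial

def make_breakfast(toasts, sausages, eggs):
    return f"Breakfast made with {toasts} pieces of toasts, {sausages} sausages and {eggs} scrambled eggs"

# Fix the first argument; the remaining two are supplied later in a single call
make_breakfast_with_2_toasts = partial(make_breakfast, 2)

my_breakfast = make_breakfast_with_2_toasts(6, 4)
print(my_breakfast)
# Breakfast made with 2 pieces of toasts, 6 sausages and 4 scrambled eggs
```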

Referential transparency🔍

Referential transparency is when a function call can be swapped with the value it returns without any changes in the program’s behaviour. The same inputs must always return the same output under any circumstance.

A good example of referential transparency is the use of a dictionary. In the Cambridge Dictionary, the term courage is defined as “the ability to control your fear in a dangerous or difficult situation”. This definition will always be mapped to courage if we keep checking it in this dictionary.

Examples

Without referential transparency🚫

Here’s a code example of what life looks like without referential transparency:

# Set discount factor
discount_factor = 0.5

# Define discount function
def apply_discount(item, price, location):
    global discount_factor
    if location == "USA" and item == "shoes":
        price *= discount_factor
        discount_factor += 0.1
    return price

# Add items
item = "shoes"
price = 30
location = "USA"

# Display results 
discounted_price_1 = apply_discount(item, price, location)
discounted_price_2 = apply_discount(item, price, location)

print(f'Discounted price 1: {discounted_price_1} ')
print(f'Discounted price 2: {discounted_price_2} ')
### Output ###

# Discounted price 1: 15.0 
# Discounted price 2: 18.0

This violates referential transparency because the apply_discount function does not depend only on its inputs - besides the item, price and location arguments, it also reads and modifies the discount_factor global variable. This means calling the function with the same input values may generate different results depending on the current value of discount_factor.

The discount_factor value increases by 0.1 each time the discount is applied, which means the same inputs return different results on each call, and thus the discounted prices are 15.0 and 18.0 respectively.

With referential transparency✔️

Here’s a refactored version that satisfies referential transparency this time:

# Set discount factor
discount_factor = 0.5

# Define discount function
def apply_discount(item, price, location, discount_factor):
    if location == "USA" and item == "shoes":
        price *= discount_factor
    return price

# Add items
item = "shoes"
price = 30
location = "USA"

# Display results 
discounted_price_1 = apply_discount(item, price, location, discount_factor)
discounted_price_2 = apply_discount(item, price, location, discount_factor)

print(f'Discounted price 1: {discounted_price_1} ')
print(f'Discounted price 2: {discounted_price_2} ')
### Output ###

# Discounted price 1: 15.0 
# Discounted price 2: 15.0

The apply_discount function now depends on all inputs instead of any external/global variables influencing its internal operations. The discount_factor is now an input argument itself to guarantee the function relies on the same input parameters to return the same results every time it is executed.
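A simple way to see what this buys us is to substitute the call with the value it returns. The sketch below repeats the refactored function and compares both forms; the variable names total_with_call and total_with_value are made up for illustration:

```python
def apply_discount(item, price, location, discount_factor):
    if location == "USA" and item == "shoes":
        price *= discount_factor
    return price

# Using the function call directly...
total_with_call = apply_discount("shoes", 30, "USA", 0.5) + apply_discount("shoes", 30, "USA", 0.5)

# ...gives the same answer as substituting the value each call returns
total_with_value = 15.0 + 15.0

print(total_with_call == total_with_value)  # True
```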

Lazy evaluation🥱

Lazy evaluation is a technique that delays computing a value until it is actually needed, instead of computing everything up front.

Examples

Eager evaluation🚫

The opposite of lazy evaluation is eager evaluation, which is when an expression is “eager” to be evaluated - in other words, all of its results are computed immediately, whether or not they end up being used.

A code example would look like this:

# Create a function that triples the numbers in a list
def triple_numbers(numbers):
    return [f'{number} * 3 = {number * 3}' for number in numbers]

# Create the list of numbers 
list_of_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10]
tripled_numbers = triple_numbers(list_of_numbers)

# Display the results 
for eager_result in tripled_numbers:
    print(eager_result)
### Output ###

# 1 * 3 = 3
# 2 * 3 = 6
# 3 * 3 = 9
# 4 * 3 = 12
# 5 * 3 = 15
# 6 * 3 = 18
# 7 * 3 = 21
# 8 * 3 = 24
# 9 * 3 = 27
# 10 * 3 = 30

This would not qualify as lazy evaluation because triple_numbers uses a list comprehension, which computes every result and stores the whole list in memory before the for loop prints anything.

Lazy evaluation✔️

This is what the lazy evaluation version looks like:

# Create a generator that triples the numbers in a list
def triple_numbers(numbers):
    for number in numbers:
        yield f'{number} * 3 = {number * 3}'

# Create the list of numbers 
list_of_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10]
tripled_numbers = triple_numbers(list_of_numbers)

# Display the results 
for lazy_result in tripled_numbers:
    print(lazy_result)
### Output ###

# 1 * 3 = 3
# 2 * 3 = 6
# 3 * 3 = 9
# 4 * 3 = 12
# 5 * 3 = 15
# 6 * 3 = 18
# 7 * 3 = 21
# 8 * 3 = 24
# 9 * 3 = 27
# 10 * 3 = 30

By replacing the list comprehension with a generator that uses yield, calling triple_numbers only creates a generator object; each result is computed only when the for loop asks for the next value.
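The benefit shows up when only part of the data is needed. In the sketch below, a variant of triple_numbers prints a message each time it computes a value (an addition made purely for demonstration), and itertools.islice takes just the first three results from a million-item range:

```python
from itertools import islice

def triple_numbers(numbers):
    for number in numbers:
        # Printed only when a value is actually computed
        print(f'computing {number} * 3')
        yield number * 3

# Nothing is computed at this point; we only have a generator object
tripled_numbers = triple_numbers(range(1, 1_000_001))

# Only the first three values are ever calculated
first_three = list(islice(tripled_numbers, 3))
print(first_three)  # [3, 6, 9]
```

Only three "computing" messages appear, even though the input has a million numbers.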

Types of functions used in functional programming

Deterministic functions🎯

A deterministic function returns the same output every time it receives the same input. Because the result for a specific input argument never changes, we can rely on these types of functions.

Deterministic functions are like calculators - if you add 1 + 1 into a calculator, the answer returned should always be 2.

Now, let's look at the code examples:

Examples

Non-deterministic functions🎲🚫

A function is non-deterministic if it doesn’t return the same output every time you pass the same input into it. In other words, the output can differ between calls and is unpredictable, even when you feed it the same input arguments.

Here is an example of a non-deterministic function:

import random

# Create a function that cleans data
def clean_data(data):
    return data.strip().lower()

# Add the input number to a random number between -100 and 100
def add_random_number(data):
    random_number = random.uniform(-100, 100)
    return data + random_number

# Run the operations
data                =   "  100.00   "
transformed_data    =   float(clean_data(data))
random_result       =   add_random_number(transformed_data)

# Display results
print(random_result)

This code defines a clean_data function that removes whitespace from the data and converts the input data to lowercase. Then the add_random_number function adds a different random number to the input provided every time it is executed, therefore making the operation a non-deterministic one.

Running this 3 times returned these results…

### Output ###

# 107.06291086940836
# 125.20529145203881
# 5.629407901961244

The problem with them is obvious - they often give us unreliable outputs. There are use cases where this is necessary (like experimentation, testing etc). But for production-grade applications that require predictable outcomes, these functions fall short of the mark.

Deterministic functions✔️

# Clean the data
def clean_data(data):
    return data.strip().lower()

# Add the input number to a fixed number
def add_fixed_numbers(transformed_data, fixed_number):
    return transformed_data + fixed_number

# Run the operations
data                =   "  100.00   "
transformed_data    =   float(clean_data(data))
fixed_result        =   add_fixed_numbers(transformed_data, 50)

# Display results
print(fixed_result)

Running this code three times returns these outputs:

### Output ###

# 150.0
# 150.0
# 150.0

By using add_fixed_numbers instead of the add_random_number function, the number added to the input number is always the same, making the add_fixed_numbers operation a deterministic one.

Pure functions👼

A pure function is a deterministic function with no side effects. No matter how many times it runs, it consistently returns the same result for the same input argument(s).

Having no side effects means it doesn’t alter any external state (it does not change values outside of itself).

The deterministic code example above also serves as an example of a pure function: add_fixed_numbers always returns the same value for the same input arguments and changes nothing outside of itself.
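To contrast the two, here is a small sketch with hypothetical function names. The impure version changes the list it receives (a side effect the caller can observe), while the pure version returns a new tuple and leaves its input alone:

```python
# Impure: appends to the list it was given, so the caller's data changes (a side effect)
def add_transaction_in_place(transactions, amount):
    transactions.append(amount)
    return transactions

# Pure: builds and returns a new tuple, leaving the input untouched
def add_transaction(transactions, amount):
    return transactions + (amount,)

ledger_list = [100, 200]
add_transaction_in_place(ledger_list, 50)
print(ledger_list)    # [100, 200, 50]

ledger_tuple = (100, 200)
new_ledger = add_transaction(ledger_tuple, 50)
print(ledger_tuple)   # (100, 200)
print(new_ledger)     # (100, 200, 50)
```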

Higher-order functions🔝

A higher-order function can either take a function as an input or return a function as an output.

See the code example that satisfies function composition as an example of a custom higher-order function. The compose_data_cleaning_functions function is a higher-order function that takes in multiple functions as input parameters.

Python’s higher-order functions include the following (map and filter are built in, while reduce is imported from the functools module):

  • map

  • filter

  • reduce

Map🗺️

The map function takes a function and applies it to every item in an iterable (e.g. lists, tuples).

Here’s a code example of applying map on a list:

convert_to_caps     =   lambda name_in_lowercase: name_in_lowercase.title()
old_list_of_names   =   ['amy', 'chris','rachel' ,'terry', 'abraham']

new_list_of_names   =   list(map(convert_to_caps, old_list_of_names))
print(new_list_of_names)
### Output ###

# ['Amy', 'Chris', 'Rachel', 'Terry', 'Abraham']

The map function takes the convert_to_caps function as the first argument and applies it to each name in the iterable object named old_list_of_names, capitalizing the first letter of each name.

Here’s another example of applying map on a tuple this time:

# Define the data transformation function
def transform_employee_data(employee):
    name, age, department = employee
    return (name.title(), age, department.upper())

# Create the employee data 
employee_data  = (
    ('john smith', 32, 'sales'),
    ('amy holloway', 27, 'business intelligence'),
    ('ryan bakerwood', 45, 'operations')
    )

# Apply the transformation 
transformed_employee_data = tuple(map(transform_employee_data, employee_data))
print(transformed_employee_data)
### Output ###

#   (('John Smith', 32, 'SALES'), 
#   ('Amy Holloway', 27, 'BUSINESS INTELLIGENCE'), 
#   ('Ryan Bakerwood', 45, 'OPERATIONS'))

The map function takes the transform_employee_data function and applies it to each employee in the employee_data object, converting each name to title case and each department name to uppercase.

Filter🔍

The filter operation is used to select (or remove) a subset of data from an iterable. It takes a function with a condition, applies it to the iterable and returns the iterable’s elements that meet the function’s condition.

Here’s a list example:

list_of_numbers             =   [-2, 6, -24, -928, 13, 83, 401]
only_positive_numbers       =   lambda positive_number: positive_number > 0
positive_numbers            =   list(filter(only_positive_numbers, list_of_numbers))

print(positive_numbers)
### Output ###

# [6, 13, 83, 401]

This code employs the filter function to keep only the positive numbers from a given list of numbers, list_of_numbers. This is accomplished by using a lambda function named only_positive_numbers, which evaluates whether a number is greater than 0. The filter operation applies the lambda function to each element in the list_of_numbers and only retains the numbers that satisfy the condition. The resulting numbers are then stored in the positive_numbers variable as a new list.

Now here’s a tuple example:

def filter_country_data(country):
    country_name, population_in_millions = country
    return population_in_millions >= 500

# Create the country data 
country_data_in_millions  = (
('China', 1402), 
('India', 1366), 
('United States', 329), 
('Indonesia', 270), 
('Brazil', 212), 
('Pakistan', 205), 
('Nigeria', 201), 
('Bangladesh', 168), 
('Russia', 144), 
('Japan', 127))

# Apply the filter
filtered_country_data = tuple(filter(filter_country_data, country_data_in_millions))
print(filtered_country_data)
### Output ###

# (('China', 1402), ('India', 1366))

This code uses population data by country to keep only the countries with a population of at least 500 million, using the filter operation to apply the filter_country_data function to each tuple in the country_data_in_millions iterable.

Reduce➗

The reduce function takes a function and an iterable and reduces the iterable to a single cumulative value. This is done by applying the function to the first two elements, then applying it again to that result and the next element, repeating the process until only one cumulative value remains.

Let’s observe a string-based example:

from functools import reduce

random_strings = ("I", "love", "pizza", "and", "orange", "juice", "!")
concat_operation = lambda x, y: x + " " + y
sentence = reduce(concat_operation, random_strings)
print(sentence)
### Output ###

# I love pizza and orange juice !

Here we use the reduce function to apply the concat_operation to the first two elements in the random_strings tuple, then to the running result and each remaining element. The operation concatenates the elements and separates them with a space. The result is then stored in the sentence variable.

Now let’s see a numerical example:

from functools import reduce

def calculate_net_profit(net_profit, transaction):
    gross_profit, expenses, tax = transaction
    return net_profit + gross_profit - expenses - tax


list_of_transactions = [(2000, 500, 300), (3000, 750, 450), (1000, 250, 150), (5000, 1250, 750)]
net_profit = reduce(calculate_net_profit, list_of_transactions, 1000)

print(net_profit)
### Output ###

# 7600

The reduce function folds the list of transactions into a single net profit figure. On each step, the custom calculate_net_profit function receives the running net_profit and the next transaction, unpacks the transaction tuple into gross_profit, expenses and tax, then adds the gross profit to the running total and subtracts the expenses and tax.

The calculate_net_profit function is applied to each element in list_of_transactions in turn. The third argument to reduce, 1000, is the initial value of net_profit (chosen purely for demonstrative purposes).
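To see the intermediate values that reduce passes from one step to the next, one option is itertools.accumulate, which yields every running total instead of only the last one. This is a minimal sketch that assumes Python 3.8 or later, since the initial keyword was added in that version.

```python
from functools import reduce
from itertools import accumulate

def calculate_net_profit(net_profit, transaction):
    gross_profit, expenses, tax = transaction
    return net_profit + gross_profit - expenses - tax

list_of_transactions = [(2000, 500, 300), (3000, 750, 450), (1000, 250, 150), (5000, 1250, 750)]

# accumulate yields the starting value followed by every intermediate total
running_totals = list(accumulate(list_of_transactions, calculate_net_profit, initial=1000))
print(running_totals)
# [1000, 2200, 4000, 4600, 7600]

# The last running total is the value reduce returns
final_total = reduce(calculate_net_profit, list_of_transactions, 1000)
print(final_total == running_totals[-1])
# True
```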

First-order functions🔢🔡

A first-order function is a function that operates on plain data such as numbers, lists and strings. It doesn’t take functions as arguments or return a function as its output. First-order functions are basically what many consider to be normal functions.

Higher-order functions can do the same, except they can take functions as arguments, return them as outputs, or both.
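The distinction can be sketched with three small functions. The names add_bonus, apply_to_all and make_adder are hypothetical and chosen only for this illustration.

```python
def add_bonus(salary):
    # First-order: takes a number and returns a number
    return salary + 1000

def apply_to_all(func, values):
    # Higher-order: takes a function as an argument
    return [func(value) for value in values]

def make_adder(amount):
    # Higher-order: returns a function as its output
    def add(value):
        return value + amount
    return add

salaries = [35000, 45000]
with_bonus = apply_to_all(add_bonus, salaries)
add_500 = make_adder(500)

print(with_bonus)       # [36000, 46000]
print(add_500(35000))   # 35500
```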

Examples of built-in first-order functions in Python are:

  • zip

  • sorted

  • enumerate

  • any

  • all

Zip🤐

The zip function joins elements that share the same index position from different iterables into one new iterable in tuple format.

employees       =   ['Brian Jackson', 'Melissa Hammersmith', 'Connor Shaw']
salaries        =   [35000, 45000, 55000]

employee_data   =   list(zip(employees, salaries))
bonuses         =   [(employee, salary * 0.1) for employee, salary in employee_data]

print(bonuses)
### Output ###

# [('Brian Jackson', 3500.0), ('Melissa Hammersmith', 4500.0), ('Connor Shaw', 5500.0)]

This example demonstrates the use of the zip function to combine the employees and salaries lists based on their index positions, which creates a new list, employee_data. A list comprehension then calculates a 10% bonus from each employee’s salary.

Using the zip function simplifies the process of combining multiple lists and makes the code more readable and easier to understand.
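One behaviour to keep in mind is that zip stops at the shortest iterable. The sketch below assumes a scenario, not taken from the example above, where one salary is missing, and contrasts zip with itertools.zip_longest from the standard library.

```python
from itertools import zip_longest

employees = ['Brian Jackson', 'Melissa Hammersmith', 'Connor Shaw']
salaries = [35000, 45000]  # deliberately one value short (assumed scenario)

# zip stops at the shortest iterable, so 'Connor Shaw' is silently dropped
truncated = list(zip(employees, salaries))

# zip_longest keeps every element and fills the gaps with a placeholder
padded = list(zip_longest(employees, salaries, fillvalue=None))

print(truncated)
# [('Brian Jackson', 35000), ('Melissa Hammersmith', 45000)]
print(padded)
# [('Brian Jackson', 35000), ('Melissa Hammersmith', 45000), ('Connor Shaw', None)]
```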

Sorted📚

The sorted function takes an iterable as an input and creates a new list sorted in ascending order. Although any iterable (mutable or immutable) can be passed into the sorted function, the output is always in list format.

Note: The sorted function becomes a higher-order function once the key function is taken as an input argument. Here’s an example of this:

action_movies   =   (
        {'title':'Inception', 'year': 2010},
        {'title':'Rush Hour', 'year': 1998},
        {'title':'Avengers: Endgame', 'year': 2019},
        {'title':'Bad Boyz', 'year': 1995},
        {'title':'John Wick', 'year': 2014}
    )

year_key        =   lambda x: x["year"]
sorted_movies   =   sorted(action_movies, key=year_key)

print(sorted_movies)
### Output ###

# [{'title': 'Bad Boyz', 'year': 1995}, 
# {'title': 'Rush Hour', 'year': 1998}, 
# {'title': 'Inception', 'year': 2010}, 
# {'title': 'John Wick', 'year': 2014}, 
# {'title': 'Avengers: Endgame', 'year': 2019}]

This code uses the sorted function to order a tuple of dictionaries containing different action films by release year in ascending order. The key argument takes in the lambda function, year_key, which extracts the year from each dictionary. The result is a new list, even though the input is a tuple.
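As a variation, the key can also be built with operator.itemgetter from the standard library, and reverse=True flips the order. This is a small sketch using a shortened copy of the movie data.

```python
from operator import itemgetter

action_movies = (
    {'title': 'Inception', 'year': 2010},
    {'title': 'Rush Hour', 'year': 1998},
    {'title': 'John Wick', 'year': 2014},
)

# itemgetter('year') behaves like lambda x: x['year']
newest_first = sorted(action_movies, key=itemgetter('year'), reverse=True)
titles = [movie['title'] for movie in newest_first]
print(titles)
# ['John Wick', 'Inception', 'Rush Hour']

# The input tuple is left untouched; sorted always returns a new list
print(type(newest_first).__name__, action_movies[0]['title'])
# list Inception
```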

Enumerate🧮

The enumerate operation supplies each value in an iterable with an index. This is useful for monitoring the number of iterations in a loop operation.

sales = [2000, 1500, 4600, 39000, 6500, 800]
sorted_sales = sorted(sales, reverse=True)

for rank, amount in enumerate(sorted_sales, start=1):
    print(f'Rank {rank}: ${amount} ')
### Output ###

# Rank 1: $39000 
# Rank 2: $6500 
# Rank 3: $4600 
# Rank 4: $2000 
# Rank 5: $1500 
# Rank 6: $800

The example here uses enumerate to iterate through a list of ordered sale amounts, sorted_sales, with their corresponding index values next to each amount. This saves us the hassle of manually incrementing using a counter variable.

So we’re still running a normal for loop but with an index added to each value to see where they are ranked in the iteration.
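For comparison, here is a sketch of the manual counter that enumerate replaces. Both versions produce the same (rank, amount) pairs.

```python
sales = [2000, 1500, 4600, 39000, 6500, 800]
sorted_sales = sorted(sales, reverse=True)

# Manual version: the counter has to be created and incremented by hand
manual_ranking = []
rank = 1
for amount in sorted_sales:
    manual_ranking.append((rank, amount))
    rank += 1

# enumerate version: the counter is handled for us
enumerate_ranking = list(enumerate(sorted_sales, start=1))

print(manual_ranking == enumerate_ranking)  # True
print(enumerate_ranking[0])                 # (1, 39000)
```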

Any❓

The any function is an operation applied to an iterable to check whether at least one element in the iterable is true. If there is at least one element in the iterable that is true, the operation will return True.

It will return False if the iterable is empty or all the elements in the iterable are false.

inventory  = [
    {'name': 'notebooks', 'quantity': 120},
    {'name': 'pencils', 'quantity': 140},
    {'name': 'highlighters', 'quantity': 233},
    {'name': 'sticky-notes', 'quantity': 56},
    ]

minimum_stock_required = 100

any_low_stock_in_inventory = any(item["quantity"] <  minimum_stock_required for item in inventory)

print(any_low_stock_in_inventory)
### Output ###

# True

The inventory variable is a list of dictionaries that list stock items and their quantities. The any function checks each stock item to see whether its quantity is less than the minimum_stock_required. If the program identifies any item lower than the minimum threshold defined in minimum_stock_required, it will return True, indicating that some items require restocking. Here it returns True because sticky-notes has a quantity of 56.

All✅

The all function is another built-in function applied to an iterable that returns True only if all elements in the iterable are true or the iterable is empty.

This will return False if any element in the iterable is false.

employees_trained = (
    {'name': 'Shannon', 'passed': True},
    {'name': 'Rhys', 'passed': True},
    {'name': 'Jimmy', 'passed': False},
    {'name': 'Emma', 'passed': True},
    {'name': 'Ben', 'passed': True},

)

all_employees_passed = all(employee['passed'] for employee in employees_trained)
print(all_employees_passed)
### Output ###

# False

The employees_trained variable is a tuple of dictionaries, each holding an employee name and a flag indicating whether they passed or failed the course. The all function iterates through the tuple to check whether each employee passed the course (i.e. it returns True only if the passed flag is True for every employee in employees_trained).

In this case, it returns False because the program identified at least one employee who didn’t pass the course in this round.
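Two details of any and all are easy to check in code: how they treat an empty iterable, and the fact that they stop as soon as the answer is known. The helper is_low_stock below is hypothetical, and it deliberately records its inputs (a side effect) purely to make the short-circuiting visible.

```python
# Empty iterables: any has nothing true to find, all has nothing false to find
print(any([]))  # False
print(all([]))  # True

checked = []

def is_low_stock(quantity):
    # Hypothetical helper that records which quantities were inspected
    checked.append(quantity)
    return quantity < 100

quantities = [120, 56, 140, 233]
result = any(is_low_stock(q) for q in quantities)

print(result)   # True
print(checked)  # [120, 56] (evaluation stopped at the first low quantity)
```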

Closures🔐

A closure is a function nested inside another function that remembers the variables it uses from the enclosing function’s scope, even after the enclosing function has finished running.

The code example used earlier for function composition is also an example of a closure. The compose function is a nested function that captures the functions passed into its outer function, compose_data_cleaning_functions, and applies each function (f) to the current value (v) in the reduce operation.
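A smaller, self-contained sketch of a closure is shown below. The names make_threshold_filter and is_above_threshold are hypothetical and used only for illustration.

```python
def make_threshold_filter(threshold):
    # threshold is a local variable of the outer function
    def is_above_threshold(value):
        # The inner function still sees threshold after the outer call returns
        return value >= threshold
    return is_above_threshold

is_large_population = make_threshold_filter(500)
populations = [1402, 1366, 329, 270]
large = list(filter(is_large_population, populations))

print(large)
# [1402, 1366]

# The captured value is stored in the function's __closure__ cells
print(is_large_population.__closure__[0].cell_contents)
# 500
```

Each call to make_threshold_filter produces an independent function with its own captured threshold, so several filters with different thresholds can coexist.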

Partially-applied functions🎛️

A function is “applied”, or considered a total function in this post, if it’s given all of its compulsory arguments in one go. A function is “partially applied” if it is given only a subset of its compulsory arguments, with the rest to be supplied later through a new function.

In other words, a partially-applied function takes some of its mandatory parameters and returns a new function that accepts the remaining ones.

Examples

Total function🚫

Here’s what a total function (or applied function) can look like:

def transform_data(data, strip_whitespace, use_lowercase):
    if strip_whitespace:
        data = data.strip()
    if use_lowercase:
        data = data.lower()
    return data

dummy_data = "  My FRIENDS love to TAKE WALKS in the park.       "
clean_data = transform_data(dummy_data, True, True)

print(clean_data)
### Output ###

# my friends love to take walks in the park.

All the compulsory arguments have been supplied to the transform_data function once it’s called, which makes it a total function.

Partially-applied function✔️

from functools import partial

def transform_data(data, strip_whitespace, use_lowercase):
    if strip_whitespace:
        data = data.strip()
    if use_lowercase:
        data = data.lower()
    return data

dummy_data = "  My FRIENDS love to TAKE WALKS in the park. "
# Pre-fill the two flags now; the data argument is supplied later
clean_data = partial(transform_data, strip_whitespace=True, use_lowercase=True)
final_data = clean_data(dummy_data)

print(final_data)
### Output ###

# my friends love to take walks in the park.

Partial functions🧩

A partial function is a new function built from an existing one with some of its input arguments already filled in. Partial functions are useful for creating specialised versions of a base function for bespoke use cases without rewriting it.

Note: Partial applications are not to be confused with currying.
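The difference can be sketched as follows: partial application fixes some arguments and accepts the rest in one call, whereas currying turns a function into a chain of calls that each take exactly one argument. The curried version below is hand-written for illustration (curried_calculate_profit is a hypothetical name), since Python’s standard library has no built-in curry helper.

```python
from functools import partial

def calculate_profit(revenue, cost, tax_rate):
    return (revenue - cost) * (1 - tax_rate)

# Partial application: fix tax_rate, then supply the rest in a single call
profit_at_20_percent = partial(calculate_profit, tax_rate=0.2)

# Currying (hand-written sketch): one argument per call
def curried_calculate_profit(revenue):
    def with_cost(cost):
        def with_tax_rate(tax_rate):
            return calculate_profit(revenue, cost, tax_rate)
        return with_tax_rate
    return with_cost

partial_result = profit_at_20_percent(1_000_000, 50_000)
curried_result = curried_calculate_profit(1_000_000)(50_000)(0.2)

print(partial_result)  # 760000.0
print(curried_result)  # 760000.0
```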

Examples

Without partial functions🚫

# Create function to calculate profit
def calculate_profit(revenue, cost, tax_rate):
    return (revenue - cost) * (1 - tax_rate)


# Specify financial constants
revenue     =   1_000_000
cost        =   50_000
tax_rate    =   0.2

# Calculate and display profit 
profit      =   calculate_profit(revenue, cost, tax_rate)
print(profit)
### Output ###

# 760000.0

This code example requires all three input parameters (revenue, cost and tax_rate) to be passed directly to the calculate_profit function on every call, so no partial function is involved.

With partial functions✔️

from functools import partial

# Create function to calculate profit
def calculate_profit(revenue, cost, tax_rate):
    return (revenue - cost) * (1 - tax_rate)


# Specify financial constants
revenue     =   1_000_000
cost        =   50_000
tax_rate    =   0.2

# Create partial function 
calculate_profit_with_tax = partial(calculate_profit, tax_rate=tax_rate)

# Calculate and display profit
profit      =   calculate_profit_with_tax(revenue, cost)
print(profit)
### Output ###

# 760000.0

By using the partial function from the functools module, we create a new function called calculate_profit_with_tax with the tax_rate input parameter already filled in, so only the revenue and cost input parameters need to be passed when calling it.
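A partial object also keeps a reference to its base function and the arguments that were pre-filled, and a keyword supplied at call time overrides the stored one. The sketch below reuses calculate_profit; the 0.5 tax rate is an arbitrary value chosen for illustration.

```python
from functools import partial

def calculate_profit(revenue, cost, tax_rate):
    return (revenue - cost) * (1 - tax_rate)

calculate_profit_with_tax = partial(calculate_profit, tax_rate=0.2)

# A partial object remembers the base function and the pre-filled arguments
print(calculate_profit_with_tax.func.__name__)  # calculate_profit
print(calculate_profit_with_tax.keywords)       # {'tax_rate': 0.2}

# A keyword supplied at call time overrides the pre-filled one
profit_at_half_tax = calculate_profit_with_tax(1_000_000, 50_000, tax_rate=0.5)
print(profit_at_half_tax)  # 475000.0
```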

Conclusion🏁

In this blog post, we’ve touched on the foundational concepts that underpin functional programming in general. Because each of these deserves its own dedicated blog post for more in-depth exploration, I’ll begin introducing new posts where I perform technical deep dives into the tools and techniques mentioned here from a data engineering standpoint. This will involve illustrating their applications through real-world data engineering scenarios to help you better understand their practical use in data projects.

Feel free to reach out via my handles: LinkedIn | Email | Twitter