Loops I Did it Again – Avoiding for-loops with pandas and numpy

Python is a very popular language amongst data scientists due to the extensive list of libraries available to us. The libraries pandas and numpy in particular are extremely useful when solving data science and machine learning problems.

However, Python also comes with some disadvantages. As an interpreted language (code is executed statement by statement at run time rather than being compiled to machine code ahead of time), it is typically significantly slower than compiled languages such as C or Java. With the rise of big data, more data is becoming available for data scientists to work with. This means it is important to find ways of keeping code as efficient as possible, thereby avoiding any unnecessary computation time. One way of doing this is by making use of the in-built optimised routines in pandas and numpy.

Throughout this blog post we will see different examples of Python code that execute the same task, but with considerably different run times.

import pandas as pd
import numpy as np
import warnings

from timeit import Timer

warnings.filterwarnings("ignore")

We will investigate four ways of calculating the total cost, when given a dataframe consisting of the following two columns:

  • the number of units
  • the price per unit

For the purpose of this example, we will randomly generate 10,000 integer values for each column using the numpy.random module. It is important to remember that in practice, data scientists tend to work with much larger datasets, often comprising millions of rows, which would make the following variations in performance even more dramatic.

# setting the seed to ensure we get the same results
np.random.seed(1000)

# creating a dataframe with two columns and 10,000 rows, where the values have been randomly generated
df = pd.DataFrame({'number_of_units':np.random.randint(0,100,size=10000),
                    'price_per_unit':np.random.randint(0,1000,size=10000)})

# printing out the top 5 rows of our dataframe
df.head()

We will write four different functions which all carry out the same task of calculating the total cost. These functions are as follows:

  1. calculating the total cost using a for-loop
  2. calculating the total cost using a list comprehension
  3. calculating the total cost making use of vectorization
  4. calculating the total cost using the in-built numpy dot product

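def for_loop():
    # start with an empty column, then fill it in one row at a time
    df['cost_of_items'] = pd.Series()
    for i in range(len(df)):
        # note: this is chained indexing (df[col].iloc[i] = ...), which pandas warns about;
        # with Copy-on-Write (the default from pandas 3.0) the write no longer updates df,
        # and df.loc[i, 'cost_of_items'] = ... is the supported form
        df['cost_of_items'].iloc[i] = df['price_per_unit'].iloc[i] * df['number_of_units'].iloc[i]

    total_cost = sum(df['cost_of_items'])
    return total_cost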

def list_comprehension():
    cost_of_items = [price*num for price, num in zip(df['price_per_unit'], df['number_of_units'])]
    total_cost = sum(cost_of_items)
    return total_cost

def vectorized():
    total_cost = sum(df['price_per_unit'] * df['number_of_units'])
    return total_cost

def dot_product():
    total_cost = np.dot(df['price_per_unit'], df['number_of_units'])
    return total_cost
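
Before timing anything, it is worth confirming that the different approaches really do return the same total. The following is a minimal sketch of such a check, assuming the same seed and dataframe as above; the names `via_comprehension`, `via_vectorized` and `via_dot` are introduced here purely for illustration. The for-loop version is left out simply because it takes so long to run.

```python
import numpy as np
import pandas as pd

# rebuild the same example dataframe so that this snippet runs on its own
np.random.seed(1000)
df = pd.DataFrame({'number_of_units': np.random.randint(0, 100, size=10000),
                   'price_per_unit': np.random.randint(0, 1000, size=10000)})

# three of the approaches from above, written inline
via_comprehension = sum([price * num for price, num in zip(df['price_per_unit'], df['number_of_units'])])
via_vectorized = sum(df['price_per_unit'] * df['number_of_units'])
via_dot = np.dot(df['price_per_unit'], df['number_of_units'])

# all three should give exactly the same integer total
assert via_comprehension == via_vectorized == via_dot
print("All three approaches agree on a total cost of %d" % via_dot)
```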

We can calculate the time it takes to execute each function once using the timeit module from the standard library.

computation_time_for_loop = Timer(for_loop).timeit(1)
computation_time_list_comprehension = Timer(list_comprehension).timeit(1)
computation_time_vectorized = Timer(vectorized).timeit(1)
computation_time_dot_product = Timer(dot_product).timeit(1)
 
print("Computation time is %0.9f using for-loop"%computation_time_for_loop)
print("Computation time is %0.9f using list comprehension"%computation_time_list_comprehension)
print("Computation time is %0.9f using vectorization"%computation_time_vectorized)
print("Computation time is %0.9f using numpy dot product"%computation_time_dot_product)

Computation time is 148.951093500 using for-loop
Computation time is 0.002266300 using list comprehension
Computation time is 0.000865100 using vectorization
Computation time is 0.000057100 using numpy dot product
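
A single run of a very fast function can be noisy, because the measurement is easily affected by whatever else the machine is doing. One possible refinement, sketched below, is to use `timeit.repeat` to time several batches of calls and keep the fastest batch. The batch sizes (5 batches of 100 calls) are arbitrary choices made for this illustration, and the dataframe is rebuilt so the snippet runs on its own.

```python
from timeit import repeat

import numpy as np
import pandas as pd

np.random.seed(1000)
df = pd.DataFrame({'number_of_units': np.random.randint(0, 100, size=10000),
                   'price_per_unit': np.random.randint(0, 1000, size=10000)})

def vectorized():
    return sum(df['price_per_unit'] * df['number_of_units'])

def dot_product():
    return np.dot(df['price_per_unit'], df['number_of_units'])

calls_per_batch = 100
for func in (vectorized, dot_product):
    # time 5 batches of 100 calls each and keep the fastest batch
    batches = repeat(func, repeat=5, number=calls_per_batch)
    best_per_call = min(batches) / calls_per_batch
    print("Best time per call is %0.9f seconds using %s" % (best_per_call, func.__name__))
```

The same pattern would work for the other functions, although the for-loop version is slow enough that a single run is probably all you would want.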

As expected, the for-loop took by far the most time, at approximately 149 seconds! Much of that cost is likely to come from reading and writing the dataframe one element at a time through .iloc, rather than from the loop construct itself. A for-loop can be, and often is, replaced with a list comprehension, and this example shows how valuable that can be: the computation time goes from 149 seconds down to roughly 0.002 seconds.
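
To separate the cost of the loop itself from the cost of indexing into the dataframe on every iteration, one could write a loop that pulls the columns out into plain Python lists first. The function `plain_loop` below is a hypothetical variant written for illustration only; it is not one of the four functions timed above, so no timing is claimed for it here.

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
df = pd.DataFrame({'number_of_units': np.random.randint(0, 100, size=10000),
                   'price_per_unit': np.random.randint(0, 1000, size=10000)})

def plain_loop():
    # convert each column to a list once, instead of indexing the dataframe per row
    prices = df['price_per_unit'].tolist()
    units = df['number_of_units'].tolist()

    total_cost = 0
    for i in range(len(prices)):
        total_cost += prices[i] * units[i]
    return total_cost

# the result should match the numpy dot product
print(plain_loop() == np.dot(df['price_per_unit'], df['number_of_units']))
```

Timing this with `Timer(plain_loop).timeit(1)` alongside the other functions would show how much of the difference is down to the dataframe access pattern.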

For-loops are a good place to start, especially as a beginner, as they can be more intuitive and readable. However, refactoring them into list comprehensions can make your code much more efficient, and often condenses it down to a single line.

This example also shows that it is important to use vectorization when possible. We can see that the vectorized function is executed in less than half the time of a list comprehension. This might seem minor here, as we are considering computation times of less than a second, but when working with considerably more data, this change can go a long way. It is also worth noting that the code for the vectorized version is much closer to the plain mathematical statement of the problem.
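
One detail worth noting is that the `vectorized` function above multiplies the columns in a vectorized way, but then adds up the result with Python's built-in `sum`, which iterates over the Series one element at a time. A variant that keeps the reduction inside pandas as well might look like the sketch below; `vectorized_sum_method` is a name made up for this illustration and was not part of the timings above.

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
df = pd.DataFrame({'number_of_units': np.random.randint(0, 100, size=10000),
                   'price_per_unit': np.random.randint(0, 1000, size=10000)})

def vectorized():
    # element-wise multiplication is vectorized, the built-in sum is not
    return sum(df['price_per_unit'] * df['number_of_units'])

def vectorized_sum_method():
    # both the multiplication and the summation stay inside pandas/numpy
    return (df['price_per_unit'] * df['number_of_units']).sum()

print(vectorized() == vectorized_sum_method())
```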

Most importantly, by using numpy's in-built dot product, our code becomes more than 15 times quicker again!
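
The speed-ups quoted in this post can be reproduced directly from the printed timings. The short sketch below does the arithmetic, using the four values copied from the output above; bear in mind that these come from a single run on one machine, so the exact ratios will differ elsewhere.

```python
# timings reported above, in seconds (copied from the printed output)
timings = {
    'for-loop': 148.951093500,
    'list comprehension': 0.002266300,
    'vectorization': 0.000865100,
    'numpy dot product': 0.000057100,
}

# compare each approach with the one listed immediately before it
names = list(timings)
for slower, faster in zip(names, names[1:]):
    ratio = timings[slower] / timings[faster]
    print("%s is %.1f times quicker than %s" % (faster, ratio, slower))

comprehension_over_vectorized = timings['list comprehension'] / timings['vectorization']
vectorized_over_dot = timings['vectorization'] / timings['numpy dot product']
```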

In conclusion, it is worth investing time in refactoring code. Making use of the in-built operations in the numpy library can improve the speed of the code by orders of magnitude. By doing this, not only are we speeding up our code, we are also writing much shorter pieces of code. This helps when it comes to finding bugs.

By Holly Jones
