Python is a very popular language amongst data scientists due to the extensive list of libraries available to us. The libraries `pandas` and `numpy` in particular are extremely useful when solving data science and machine learning problems.

However, Python also comes with some disadvantages. As Python is an interpreted language (code is executed line by line), it is significantly slower than compiled languages such as C and Java. With the rise of big data, more data is becoming available for data scientists to work with. This means it is important to find ways of keeping code as efficient as possible, thereby avoiding any unnecessary computation time. One way of doing this is by making use of the in-built optimised routines in `pandas` and `numpy`.

Throughout this blog we will see different examples of Python code which will execute the same task, but with considerably different run times.

```python
import pandas as pd
import numpy as np
import warnings
from timeit import Timer

warnings.filterwarnings("ignore")
```

We will investigate four ways of calculating the total cost, when given a dataframe consisting of the following two columns:

- the number of units
- the price per unit

We will randomly generate 10,000 integer values for each column using the `numpy.random` module for the purpose of this example. It is important to remember that in practice, data scientists tend to work with much larger datasets, often comprising millions of rows, which would make the following variations in performance even more dramatic.

```python
# setting the seed to ensure we get the same results
np.random.seed(1000)

# creating a dataframe with two columns and 10,000 rows,
# where the values have been randomly generated
df = pd.DataFrame({'number_of_units': np.random.randint(0, 100, size=10000),
                   'price_per_unit': np.random.randint(0, 1000, size=10000)})

# printing out the top 5 rows of our dataframe
df.head()
```

We will write four different functions which all carry out the same task of calculating the total cost. These functions are as follows:

- calculating the total cost using a **for-loop**
- calculating the total cost using a **list comprehension**
- calculating the total cost making use of **vectorization**
- calculating the total cost using the in-built `numpy` **dot product**

```python
def for_loop():
    # build the cost column one row at a time with explicit indexing
    df['cost_of_items'] = pd.Series(dtype=float)
    for i in range(len(df)):
        df['cost_of_items'].iloc[i] = df['price_per_unit'].iloc[i] * df['number_of_units'].iloc[i]
    total_cost = sum(df['cost_of_items'])
    return total_cost

def list_comprehension():
    # build the list of row costs in a single expression
    cost_of_items = [price * num for price, num in zip(df['price_per_unit'], df['number_of_units'])]
    total_cost = sum(cost_of_items)
    return total_cost

def vectorized():
    # multiply the two columns element-wise in one vectorized operation
    total_cost = sum(df['price_per_unit'] * df['number_of_units'])
    return total_cost

def dot_product():
    # delegate the whole multiply-and-sum to numpy's dot product
    total_cost = np.dot(df['price_per_unit'], df['number_of_units'])
    return total_cost
```

We can calculate the time it takes to execute each function once using the `timeit` library.
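As an aside, a single `timeit(1)` run can be noisy, since it is affected by whatever else the machine is doing at that moment. The `Timer.repeat` method runs the whole measurement several times so the minimum (the least disturbed run) can be taken. A minimal sketch, using a made-up toy workload rather than the functions from this post:

```python
from timeit import Timer

def square_all():
    # a toy workload: square the first 1,000 integers
    return [x * x for x in range(1000)]

# execute the function 100 times per measurement, repeat the
# measurement 5 times, and keep the fastest as the best estimate
times = Timer(square_all).repeat(repeat=5, number=100)
best = min(times)
```

Taking the minimum over repeats is the approach the `timeit` documentation itself suggests for reducing measurement noise.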

```python
computation_time_for_loop = Timer(for_loop).timeit(1)
computation_time_list_comprehension = Timer(list_comprehension).timeit(1)
computation_time_vectorized = Timer(vectorized).timeit(1)
computation_time_dot_product = Timer(dot_product).timeit(1)

print("Computation time is %0.9f using for-loop" % computation_time_for_loop)
print("Computation time is %0.9f using list comprehension" % computation_time_list_comprehension)
print("Computation time is %0.9f using vectorization" % computation_time_vectorized)
print("Computation time is %0.9f using numpy" % computation_time_dot_product)
```

```
Computation time is 148.951093500 using for-loop
Computation time is 0.002266300 using list comprehension
Computation time is 0.000865100 using vectorization
Computation time is 0.000057100 using numpy
```

As expected, the for-loop took by far the most time, at approximately 149 seconds! A for-loop can be, and often is, replaced with a list comprehension, and this example shows how valuable that can be: the computation time drops from 149 seconds to roughly 0.002 seconds.

For-loops are a good place to start, especially as a beginner, as they can be more intuitive and readable. However, by refactoring these into list comprehensions, your code suddenly becomes much more efficient and condensed down to one line.
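To make the refactoring pattern concrete, here is a small standalone sketch (with hypothetical `prices` and `units` lists, not the dataframe from this post) of the same calculation written both ways:

```python
prices = [10, 20, 30]
units = [1, 2, 3]

# for-loop version: build the result list step by step
costs_loop = []
for price, num in zip(prices, units):
    costs_loop.append(price * num)

# list-comprehension version: one line, same result
costs_comp = [price * num for price, num in zip(prices, units)]

assert costs_loop == costs_comp == [10, 40, 90]
```

The comprehension removes the repeated `append` calls, which is part of why it runs faster than the explicit loop.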

This example also shows that it is important to use vectorization when possible. We can see that the vectorized function is executed in less than half the time of a list comprehension. This might seem minor here, as we are considering computation times of less than a second, but when working with considerably more data, this change can go a long way. It is also worth noting that the code for the vectorized version is much closer to the plain mathematical statement of the problem.
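As a small illustration of that last point, here is a sketch on a tiny made-up dataframe (three rows, invented values) showing how the vectorized expression mirrors the mathematical statement "total = sum of price times units":

```python
import pandas as pd

small = pd.DataFrame({'number_of_units': [1, 2, 3],
                      'price_per_unit': [10, 20, 30]})

# whole columns are multiplied element-wise with no explicit loop,
# and the expression reads like the underlying maths
total = (small['price_per_unit'] * small['number_of_units']).sum()
```

Using the pandas `.sum()` method on the resulting column, rather than Python's built-in `sum`, keeps the whole computation inside pandas' optimised routines.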

Most importantly, by making use of the `numpy` package's in-built vectorized operations, our code becomes more than 15 times quicker than that!
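The dot product is exactly the sum of element-wise products, so `np.dot` computes the same quantity as the vectorized version, only entirely in compiled code. A minimal sketch with small invented arrays:

```python
import numpy as np

prices = np.array([10, 20, 30])
units = np.array([1, 2, 3])

# np.dot computes sum(prices[i] * units[i]) without any Python-level loop
total = np.dot(prices, units)

# equivalent in value to the explicit element-wise multiply-and-sum
assert total == (prices * units).sum()
```

Because the multiply and the sum happen in a single compiled routine, there is no intermediate pandas or Python object to build, which is where the extra speed-up comes from.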

In conclusion, it is worth investing time in refactoring code. Making use of the in-built operations in the `numpy` library can improve the speed of the code by orders of magnitude. By doing this, not only are we speeding up our code, we are also writing much shorter pieces of code, which helps when it comes to finding bugs.

**By Holly Jones**