The pandas library is a very powerful, simple, and flexible tool for data analysis and manipulation in Python. I have been using pandas for quite a while now, and there is a big difference between how I do things now and how I did them when I started. There are many routes one can take to achieve the same outcome in pandas. But then there is the “pandas” way of doing things, which is usually one of the most efficient routes to the desired outcome.
This blog is intended for those who are just starting out with pandas, or for those who are just curious. I will cover some of the useful pandas methods that I have used extensively while doing data analysis and manipulation. They saved me, and still save me, a lot of time while working with dataframes.
The version of the pandas library used in this blog is: v1.4.2
The pandas methods that I will cover in this blog relate to the following operations:
- Manipulating the column names
- Manipulating the dtypes of variables
- Using `nunique`, `unique`, `value_counts`
- Using `loc` and `iloc` to get slices of the data
- Using `groupby` and `pivot_table` to group data
Making a dummy dataframe:
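A minimal sketch of such a dummy dataframe, loosely modeled on the well-known drinks dataset; the country names and serving numbers below are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the drinks dataset used throughout this blog;
# note the space in "beer servings", which section 1 will clean up
drinks = pd.DataFrame({
    "country": ["Algeria", "Angola", "France", "Germany", "Canada"],
    "beer servings": [25, 217, 127, 346, 240],
    "continent": ["Africa", "Africa", "Europe", "Europe", "North America"],
})
print(drinks.head())
```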
1. Manipulating the column names:
If you want to replace the spaces in all the column names by an underscore (_) or else just remove them (“”), a quick and efficient way of doing it is:
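One way to do this is through the `.str` accessor on `df.columns`; a small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"country name": ["Algeria"], "beer servings": [25]})

# Replace spaces in all column names with an underscore
# (pass "" instead of "_" to remove the spaces entirely)
df.columns = df.columns.str.replace(" ", "_")
print(df.columns.tolist())
```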
If you want to rename one or multiple columns:
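A sketch using `rename()` with an {old name: new name} mapping (the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Algeria"], "beer_servings": [25]})

# rename() only touches the columns listed in the mapping
df = df.rename(columns={"beer_servings": "beer"})
```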
Or you can overwrite all of the column names at once:
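For example, by assigning a list directly to `df.columns` (the list must have exactly one entry per column):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Overwrite every column name at once
df.columns = ["country", "beer_servings"]
```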
You can also add prefix and/or suffix to the column names. This sometimes makes it easier to work with data from multiple datasets, especially when they have similar column names.
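A sketch using `add_prefix()` and `add_suffix()`; the prefix and suffix strings are made up:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Algeria"], "beer_servings": [25]})

# Both methods return a copy with every column name modified
prefixed = df.add_prefix("drinks_")
suffixed = df.add_suffix("_2022")
```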
2. Manipulating the dtypes of variables:
Changing the dtype of one column:
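For example, with `astype()` on a single column:

```python
import pandas as pd

df = pd.DataFrame({"beer_servings": [25, 127]})

# astype() returns a copy converted to the requested dtype
df["beer_servings"] = df["beer_servings"].astype("float64")
```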
Changing the dtypes of two or more columns at once:
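This can be done by passing a {column: dtype} dict to `astype()`:

```python
import pandas as pd

df = pd.DataFrame({"beer_servings": [25], "continent": ["Africa"]})

# Convert several columns in one call
df = df.astype({"beer_servings": "float64", "continent": "category"})
```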
Or we can also specify the dtypes of the variables while making the dataframe itself:
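A sketch: the `DataFrame` constructor's `dtype` parameter applies a single dtype to all columns; for per-column dtypes, `astype()` can be chained right after construction:

```python
import pandas as pd

# One dtype applied to every column at construction time
df = pd.DataFrame({"beer_servings": [25, 127]}, dtype="float64")
```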
While we are still talking about dtypes, one little trick to convert date columns to the datetime format directly while reading .csv files is to use the `parse_dates` parameter. At the same time, the `dtype` parameter can be used to specify the dtype(s) of other column(s):
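A sketch using an in-memory CSV (via `io.StringIO`) as a stand-in for a file on disk; the column names are made up:

```python
import io
import pandas as pd

csv = io.StringIO("date,beer_servings\n2022-01-01,25\n2022-01-02,127\n")

# parse_dates converts the listed columns to datetime64 while reading;
# dtype sets the dtypes of other columns in the same call
df = pd.read_csv(csv, parse_dates=["date"], dtype={"beer_servings": "float64"})
```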
3. Using nunique, unique, value_counts:
Checking the number of unique values in the column `continent`:
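For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# nunique() counts distinct values (NaN excluded by default)
n = df["continent"].nunique()
```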
Checking the unique values in the column `continent`:
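For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# unique() returns the distinct values in order of appearance
vals = df["continent"].unique()
```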
Counting the number of values for each category in the column `continent`:
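For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# value_counts() tallies each category, most frequent first
counts = df["continent"].value_counts()
```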
If we want the results for each category to be expressed as the proportion of the data that they represent, we must set the `normalize` parameter to `True`:
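For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe", "Europe"]})

# normalize=True returns proportions instead of raw counts
shares = df["continent"].value_counts(normalize=True)
```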
4. Using loc and iloc to get slices of the data:
To get slices of the data from the dataframe, we can use the `loc` and `iloc` methods. The `loc` method uses the column/row labels, while the `iloc` method uses the column/row integer positions.
Using `loc`, we will get all the countries from the continent Africa:
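A sketch, using a small made-up dataframe in the same shape as the drinks data:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France", "Angola"],
    "beer_servings": [25, 127, 217],
    "continent": ["Africa", "Europe", "Africa"],
})

# A boolean condition selects the rows; ":" keeps every column
africa = df.loc[df["continent"] == "Africa", :]
```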
Note: The result tables shown below are only the `head` of the actual result tables, to save space.
We can take a slice of the columns as well. For e.g., if we are interested in the `country`, `beer_servings`, and `continent` columns only, and the rows where the `continent` is Africa:
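For example (the `wine_servings` column below is made up, just to show that it gets dropped):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France"],
    "beer_servings": [25, 127],
    "wine_servings": [0, 370],
    "continent": ["Africa", "Europe"],
})

# Rows where continent is Africa, and only three of the columns
subset = df.loc[df["continent"] == "Africa",
                ["country", "beer_servings", "continent"]]
```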
We can use masks to achieve the same thing as above, but in a slightly neater way (which can also increase the readability of the code):
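For example, naming the row and column selections first keeps the `loc` call short and readable:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France"],
    "beer_servings": [25, 127],
    "continent": ["Africa", "Europe"],
})

# Build the row mask and column list separately, then pass them to loc
mask = df["continent"] == "Africa"
cols = ["country", "beer_servings", "continent"]
subset = df.loc[mask, cols]
```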
Personally, I only use the `iloc` method on the rare occasions when I want to get rows/columns/values at exact positions (for e.g., the first value of the second column). I find the `loc` method more intuitive. But I am sure other people may prefer `iloc` over `loc`. So here is how to achieve the same outcome as above, but using the `iloc` method:
Note: The `.index` of the relevant subset is used for the row slice.
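A sketch of that approach; note that using the filtered index as positions only works here because the dataframe has the default `RangeIndex`, where labels and positions coincide:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France", "Angola"],
    "beer_servings": [25, 127, 217],
    "continent": ["Africa", "Europe", "Africa"],
})

# Take the row positions from the index of the boolean-filtered subset,
# and the columns by their integer positions
rows = df[df["continent"] == "Africa"].index
subset = df.iloc[rows, [0, 1, 2]]
```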
5. Using groupby and pivot table to manipulate data:
Using `groupby` to group the data by `continent`, and then perform one operation (for e.g., one of `mean`, `sum`, `min`, `max`, `count`) on the `beer_servings` column:
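For example, taking the `mean` per group on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# Group by continent, then average beer_servings within each group
means = df.groupby("continent")["beer_servings"].mean()
```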
Using `groupby` to group the data by `continent`, and then perform multiple operations (for e.g., all of `count`, `min`, `max`, `mean`) on the `beer_servings` column:
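For example, with `agg()`, which applies several operations at once and returns one result column per operation:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# One row per continent, one column per aggregation
stats = df.groupby("continent")["beer_servings"].agg(
    ["count", "min", "max", "mean"])
```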
Note: The `groupby` results can be plotted as well. We can do so for either just the `mean` of the `beer_servings` column:
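A sketch, assuming matplotlib is installed (the non-interactive `Agg` backend is used here only so the example runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# Plot the per-continent mean of beer_servings as a bar chart
ax = df.groupby("continent")["beer_servings"].mean().plot(kind="bar")
```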
Or for the `mean` of all the columns in the dataframe (wherever applicable):
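For example, averaging every numeric column per group and plotting one set of bars per column (again assuming matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
    "wine_servings": [0, 45, 370],
})

# mean() averages each numeric column within each group;
# plot() then draws one group of bars per column
ax = df.groupby("continent").mean(numeric_only=True).plot(kind="bar")
```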
Now I’ll use the titanic dataset to make a very basic `pivot_table`:
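A sketch using a tiny hand-made stand-in for the titanic data (the four rows below are made up):

```python
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [1, 1, 3, 3],
    "survived": [0, 1, 1, 0],
})

# pivot_table() groups by the index/columns pair and aggregates the values;
# here: mean survival rate per sex and passenger class
table = pd.pivot_table(titanic, values="survived",
                       index="sex", columns="pclass", aggfunc="mean")
```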
6. BONUS: List Comprehension and Dictionary Comprehension:
Although these are not pandas methods, lists and dictionaries are used extensively while working with data, and list and dictionary comprehensions are concise, efficient ways of building them.
List comprehension can be used in cases where a `for` loop is used to generate values that are appended to a new list:
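For example, squaring each number in a list the long way:

```python
# A plain for loop that appends each squared value to a new list
nums = [1, 2, 3, 4]
squares = []
for n in nums:
    squares.append(n ** 2)
```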
We can do the above in one line of code using list comprehension:
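For example:

```python
# Same result in one line with a list comprehension
nums = [1, 2, 3, 4]
squares = [n ** 2 for n in nums]
```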
Dictionary comprehension can be used in cases where a `for` loop is used to generate keys and/or values that are added to a new dictionary:
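For example, building a {number: square} mapping the long way:

```python
# A plain for loop that adds one key/value pair per iteration
nums = [1, 2, 3]
squares = {}
for n in nums:
    squares[n] = n ** 2
```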
Again, we can do the above in one line of code using dictionary comprehension:
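For example:

```python
# Same dictionary in one line with a dict comprehension
nums = [1, 2, 3]
squares = {n: n ** 2 for n in nums}
```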
Conclusion:
These were some of the pandas methods that I use most of the time when I need to do these types of operations on a dataset. I always try to do things in the best way that I can, because I believe that ‘good practice’ goes a long way. It matters even more when working with big datasets, where the efficiency of the code can affect the overall runtime significantly. Most of these methods also have other parameters beyond what I have covered in this blog; do check them out if you are interested in making the most of the methods. I used only a few of them to keep the size and content of the blog reasonable.
Finally, I thank you all for taking the time to read my blog. If you feel you have a better or more efficient solution than one I mentioned, please feel free to add a comment with your solution(s), because we always keep learning.
Cheers All!
By Parwez Diloo
References:
Check the official pandas documentation here.
For some useful pandas videos, I highly recommend this channel on YouTube.