The pandas library is a powerful, simple, and flexible tool for data analysis and manipulation in Python. I have been using pandas for quite a while now, and there is a big difference between how I do things now and how I did them when I started. There are usually many ways to achieve the same outcome in pandas. But then there is the “pandas” way of doing things, which is usually one of the most efficient ways to get there.
This blog is intended for those who are just starting with pandas, or for those who are just curious. I will cover some of the useful pandas methods that I have used extensively for data analysis and manipulation. They have saved me, and still save me, a lot of time when working with dataframes.
The version of the pandas library used in this blog is: v1.4.2
The pandas methods that I will cover in this blog deal with the operations below:
- Manipulating the column names
- Manipulating the dtypes of variables
- Using nunique, unique, value_counts
- Using loc and iloc to get slices of the data
- Using groupby and pivot table to group data
Making a dummy dataframe:
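A minimal sketch of such a dataframe. The column names and values are made up for illustration and are not the data used in the original post:

```python
import pandas as pd

# Hypothetical placeholder data: two columns with spaces in their names,
# one numeric column stored as text, and one integer column.
df = pd.DataFrame(
    {
        "first name": ["Ann", "Bob", "Cleo"],
        "last name": ["Lee", "Ray", "Diaz"],
        "age": ["31", "45", "28"],
        "score": [88, 92, 79],
    }
)
print(df)
```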
1. Manipulating the column names:
If you want to replace the spaces in all the column names with an underscore (_), or just remove them (“”), a quick and efficient way of doing it is:
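One way to do this, sketched on a made-up dataframe (the column names are placeholders), is the string accessor on df.columns:

```python
import pandas as pd

df = pd.DataFrame({"first name": ["Ann"], "last name": ["Lee"], "age": [31]})

# Build both variants from the original names.
underscored = df.columns.str.replace(" ", "_", regex=False)
stripped = df.columns.str.replace(" ", "", regex=False)

# Assign one of them back to rename the columns in place.
df.columns = underscored
print(list(df.columns))
```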
If you want to rename one or multiple columns:
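A short sketch using rename with a mapping of old names to new names. The dataframe and the new names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"first name": ["Ann"], "last name": ["Lee"], "age": [31]})

# rename returns a new dataframe; columns missing from the mapping are untouched.
one_renamed = df.rename(columns={"age": "age_years"})
several_renamed = df.rename(columns={"first name": "first", "last name": "last"})
print(list(one_renamed.columns))
print(list(several_renamed.columns))
```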
Or you can overwrite all of the column names at once:
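A sketch on a made-up dataframe. The new list must contain exactly one name per existing column, in order:

```python
import pandas as pd

df = pd.DataFrame({"first name": ["Ann"], "last name": ["Lee"], "age": [31]})

# Assigning a list to df.columns replaces every column name at once.
df.columns = ["first", "last", "age_years"]
print(list(df.columns))
```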
You can also add a prefix and/or a suffix to the column names. This sometimes makes it easier to work with data from multiple datasets, especially when they have similar column names.
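As a sketch, add_prefix and add_suffix each return a new dataframe with the renamed columns. The prefix, suffix, and data below are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"units": [3, 5], "price": [9.5, 4.0]})

# Neither call modifies df; each returns a renamed copy.
prefixed = df.add_prefix("sales_")
suffixed = df.add_suffix("_2021")
print(list(prefixed.columns))
print(list(suffixed.columns))
```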
2. Manipulating the dtypes of variables:
Changing the dtype of one column:
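A minimal sketch with astype, assuming a made-up column of numbers stored as text:

```python
import pandas as pd

df = pd.DataFrame({"age": ["31", "45", "28"], "score": [88, 92, 79]})

# Convert the text column to 64-bit integers and assign it back.
df["age"] = df["age"].astype("int64")
print(df.dtypes)
```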
Changing the dtypes of two or more columns at once:
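For several columns, astype also accepts a dictionary mapping column names to dtypes. The data here is again a placeholder:

```python
import pandas as pd

df = pd.DataFrame({"age": ["31", "45", "28"], "score": [88, 92, 79]})

# One call converts both columns; the result is a new dataframe.
df = df.astype({"age": "int64", "score": "float64"})
print(df.dtypes)
```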
Or we can also specify the dtypes of the variables while making the dataframe itself:
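Two possible ways to do this, sketched on made-up numbers: the constructor's dtype argument applies a single dtype to every column, while typed Series allow a different dtype per column.

```python
import pandas as pd

# One dtype for every column, passed through the constructor's dtype argument.
all_float = pd.DataFrame({"age": [31, 45], "score": [88, 92]}, dtype="float64")

# A different dtype per column, by building each column as a typed Series.
mixed = pd.DataFrame(
    {
        "age": pd.Series([31, 45], dtype="int32"),
        "score": pd.Series([88, 92], dtype="float64"),
    }
)
print(all_float.dtypes)
print(mixed.dtypes)
```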
While we are still talking about dtypes, one little trick to convert date columns to the datetime format directly while reading .csv files is to use the parse_dates parameter. The dtype parameter can also be used to specify the dtype(s) of other column(s):
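A self-contained sketch. To avoid depending on a file on disk, it reads a small made-up CSV from an in-memory string; with a real file you would pass its path instead:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "order_id,order_date,amount\n1,2022-01-05,10.5\n2,2022-02-17,20\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["order_date"],  # parsed straight into datetime64
    dtype={"order_id": "int32", "amount": "float64"},
)
print(df.dtypes)
```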
3. Using nunique, unique, value_counts:
Checking the number of unique values in the column
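For instance, on a made-up continent column (the values are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"continent": ["Africa", "Asia", "Africa", "Europe", "Africa", "Asia"]}
)

# nunique counts the distinct values (missing values are ignored by default).
n_continents = df["continent"].nunique()
print(n_continents)
```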
Checking the unique values in the column
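A sketch on the same kind of made-up column. unique returns the distinct values in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame(
    {"continent": ["Africa", "Asia", "Africa", "Europe", "Africa", "Asia"]}
)

continents = list(df["continent"].unique())
print(continents)
```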
Counting the number of values for each category in the column
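A sketch with value_counts on a made-up column. The result is sorted from the most to the least frequent category:

```python
import pandas as pd

df = pd.DataFrame(
    {"continent": ["Africa", "Asia", "Africa", "Europe", "Africa", "Asia"]}
)

counts = df["continent"].value_counts()
print(counts)
```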
If we want the result for each category to be expressed as the share of the data that it represents, we must set the normalize parameter to True (the values returned are fractions between 0 and 1; multiply them by 100 to get percentages):
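A sketch on a made-up column. The multiplication by 100 is my own addition, for turning the fractions into percentages:

```python
import pandas as pd

df = pd.DataFrame(
    {"continent": ["Africa", "Asia", "Africa", "Europe", "Africa", "Asia"]}
)

# Fractions of the rows that fall in each category; they sum to 1.
shares = df["continent"].value_counts(normalize=True)
percentages = shares * 100
print(percentages.round(1))
```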
4. Using loc and iloc to get slices of the data:
To get slices of the data from the dataframe, we can use the loc and iloc methods. The loc method uses the column/row labels, while the iloc method uses the column/row integer positions.
Using loc, we will get all the countries from the continent Africa:
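A minimal sketch, assuming a small stand-in dataframe with country, continent, and life_exp columns. The column names are assumptions and the numbers are invented, not real statistics:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

# A boolean condition inside loc keeps only the rows where it is True.
africa = df.loc[df["continent"] == "Africa"]
print(africa)
```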
Note: The results tables shown below are only the head of the actual results tables. This has been done to save space.
We can take a slice of the columns as well. For example, if we are interested in only some of the columns (continent among them), and only in the rows where the continent is Africa:
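A sketch of a row-and-column slice with loc. The choice of country and continent as the columns of interest is an assumption, as is the stand-in data (invented numbers):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

# First argument selects the rows, second argument selects the columns by label.
subset = df.loc[df["continent"] == "Africa", ["country", "continent"]]
print(subset)
```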
We can use masks to achieve the same thing as above, but in a slightly neater way (which can also increase the readability of the code):
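A sketch of the mask approach on stand-in data (assumed column names, invented numbers). The condition is stored in a named variable and reused:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

# The mask is a boolean Series with one True/False entry per row.
is_africa = df["continent"] == "Africa"
columns_of_interest = ["country", "continent"]

subset = df.loc[is_africa, columns_of_interest]
print(subset)
```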
Personally, I only use the iloc method on rare occasions, when I want the rows/columns/values at exact positions (e.g. the first value of the second column). I find the loc method more intuitive. But I am sure other people may prefer iloc. So here is how to achieve the same outcome as above using iloc, where the .index of the relevant subset is used for the row slice.
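A sketch of the iloc version on stand-in data (assumed column names, invented numbers). Note that feeding .index to iloc only works here because the dataframe has the default 0, 1, 2, ... index, so the row labels coincide with the row positions:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

is_africa = df["continent"] == "Africa"

# Row positions: labels of the matching rows (equal to positions with a default index).
row_positions = df[is_africa].index
# Column positions: look up where each wanted column sits.
col_positions = [df.columns.get_loc(name) for name in ["country", "continent"]]

subset = df.iloc[row_positions, col_positions]
print(subset)
```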
5. Using groupby and pivot table to manipulate data:
Using groupby to group the data by continent, and then perform one operation (e.g. count) on a column:
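A minimal sketch with a single aggregation. The life_exp column and the data are stand-ins (invented numbers), and count could be swapped for another aggregation such as mean:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

# Group the rows by continent, pick one column, apply one aggregation.
counts = df.groupby("continent")["life_exp"].count()
print(counts)
```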
Using groupby to group the data by continent, and then perform multiple operations (e.g. several aggregations at once, mean among them) on a column:
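A sketch with agg, which takes a list of aggregation names and returns one column per aggregation. The particular set (min, max, mean) and the stand-in data are assumptions:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
    }
)

summary = df.groupby("continent")["life_exp"].agg(["min", "max", "mean"])
print(summary)
```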
The groupby results can be plotted as well. We can do so either for the mean of just one column, or for the mean of all the columns in the dataframe (wherever applicable):
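A sketch of the two results that would be plotted, on stand-in data (assumed column names, invented numbers). Passing numeric_only=True skips the text column when averaging everything:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Kenya", "Ghana", "Japan", "Spain", "Egypt"],
        "continent": ["Africa", "Africa", "Asia", "Europe", "Africa"],
        "life_exp": [60.0, 62.0, 80.0, 81.0, 70.0],  # invented values
        "pop_millions": [50.0, 30.0, 125.0, 47.0, 100.0],  # invented values
    }
)

# Mean of a single column per continent (a Series).
mean_life_exp = df.groupby("continent")["life_exp"].mean()

# Mean of every numeric column per continent (a DataFrame).
mean_all = df.groupby("continent").mean(numeric_only=True)

print(mean_life_exp)
print(mean_all)
```

Calling .plot(kind="bar") on either result, for example mean_life_exp.plot(kind="bar"), then draws the bar chart. This needs matplotlib to be installed.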
Now I’ll use the titanic dataset to make a very basic pivot table:
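A sketch of a basic pivot table. To stay self-contained it uses a tiny hand-made table with titanic-style columns (sex, pclass, survived) rather than the real dataset, so the rows below are invented:

```python
import pandas as pd

# Invented rows that only mimic the layout of the titanic data.
titanic_like = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
        "pclass": [1, 1, 2, 3, 3, 3, 1, 3],
        "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    }
)

# Rows: sex. Columns: passenger class. Cells: mean of survived (survival rate).
pivot = titanic_like.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)
print(pivot)
```

Combinations with no rows (here, male passengers in class 2) show up as NaN.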
6. BONUS: List Comprehension and Dictionary Comprehension:
Although these are not pandas methods, lists and dictionaries are used extensively while working with data. List and dictionary comprehensions are easy and efficient ways of building lists and dictionaries.
List comprehension can be used in cases where a for loop is used to generate values that are to be appended to a new list:
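For example, a loop of this kind (the squares are just an illustrative choice):

```python
# Build a list of squares by appending inside a for loop.
squares = []
for n in range(1, 6):
    squares.append(n ** 2)

print(squares)
```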
We can do the above in one line of code using list comprehension:
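A sketch of the one-line form, using squares of 1 to 5 as an illustrative case:

```python
# The expression comes first, then the loop that feeds it.
squares = [n ** 2 for n in range(1, 6)]

print(squares)
```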
Dictionary comprehension can be used in cases where a for loop is used to generate keys and/or values that are to be added to a new dictionary:
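For example, a loop that maps each name to its length (the names are placeholders):

```python
names = ["country", "continent", "year"]

# Fill the dictionary one key at a time inside a for loop.
name_lengths = {}
for name in names:
    name_lengths[name] = len(name)

print(name_lengths)
```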
Again, we can do the above in one line of code using dictionary comprehension:
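A sketch of the one-line form, mapping placeholder names to their lengths:

```python
names = ["country", "continent", "year"]

# key: value comes first, then the loop that feeds it.
name_lengths = {name: len(name) for name in names}

print(name_lengths)
```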
These were some of the pandas methods that I use most often for these types of operations on a dataset. I always try to do things in the best way that I can, because I believe that ‘good practice’ goes a long way. This matters even more with big datasets, where the efficiency of the code can significantly affect the overall runtime. Most of these methods also have parameters other than the ones covered in this blog, so do check them out if you want to make the most of the methods. I used only a few of them to keep the size and content of the blog reasonable.
Finally, I thank you all for taking the time to read my blog. I’ll be very interested to hear if some of these can be done in even better ways. If you feel you have a better or more efficient solution than the one I mentioned in the blog, please feel free to add a comment with your solution(s), because we always keep learning.
By Parwez Diloo
Check the official pandas documentation here.
For some useful pandas videos, I highly recommend this channel on YouTube.