The pandas library is a very powerful, simple, and flexible tool for data analysis and manipulation in Python. I have been using pandas for quite a while now, and there is a big difference between how I do things now and how I did them when I started. There are many routes one can take to achieve the same outcome in pandas. But then there is the “pandas” way of doing things, which is usually one of the most efficient routes to the desired outcome.
This blog is intended for those who are just starting out with pandas, or for those who are just curious. I will cover some of the useful pandas methods that I have used extensively while doing data analysis and manipulation. They saved me, and still save me, a lot of time while working with dataframes.
The version of the pandas library used in this blog is: v1.4.2
The pandas methods that I will cover in this blog relate to the following operations:
- Manipulating the column names
- Manipulating the dtypes of variables
- Using `nunique`, `unique`, `value_counts`
- Using `loc` and `iloc` to get slices of the data
- Using `groupby` and `pivot_table` to group data
Making a dummy dataframe:
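A minimal sketch of such a dummy dataframe, loosely modeled on the well-known drinks dataset; the country names and serving numbers below are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the drinks dataset used throughout this blog;
# note the space in "beer servings", which section 1 will clean up
drinks = pd.DataFrame({
    "country": ["Algeria", "Angola", "France", "Germany", "Canada"],
    "beer servings": [25, 217, 127, 346, 240],
    "continent": ["Africa", "Africa", "Europe", "Europe", "North America"],
})
print(drinks.head())
```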
1. Manipulating the column names:
If you want to replace the spaces in all the column names by an underscore (_) or else just remove them (“”), a quick and efficient way of doing it is:
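One way to do this is through the `.str` accessor on `df.columns`; a small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"country name": ["Algeria"], "beer servings": [25]})

# Replace spaces in all column names with an underscore
# (pass "" instead of "_" to remove the spaces entirely)
df.columns = df.columns.str.replace(" ", "_")
print(df.columns.tolist())
```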
If you want to rename one or multiple columns:
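A sketch using `rename()` with an {old name: new name} mapping (the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Algeria"], "beer_servings": [25]})

# rename() only touches the columns listed in the mapping
df = df.rename(columns={"beer_servings": "beer"})
```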
Or you can overwrite all of the column names at once:
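For example, by assigning a list directly to `df.columns` (the list must have exactly one entry per column):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Overwrite every column name at once
df.columns = ["country", "beer_servings"]
```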
You can also add prefix and/or suffix to the column names. This sometimes makes it easier to work with data from multiple datasets, especially when they have similar column names.
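A sketch using `add_prefix()` and `add_suffix()`; the prefix and suffix strings are made up:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Algeria"], "beer_servings": [25]})

# Both methods return a copy with every column name modified
prefixed = df.add_prefix("drinks_")
suffixed = df.add_suffix("_2022")
```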
2. Manipulating the dtypes of variables:
Changing the dtype of one column:
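For example, with `astype()` on a single column:

```python
import pandas as pd

df = pd.DataFrame({"beer_servings": [25, 127]})

# astype() returns a copy converted to the requested dtype
df["beer_servings"] = df["beer_servings"].astype("float64")
```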
Changing the dtypes of two or more columns at once:
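This can be done by passing a {column: dtype} dict to `astype()`:

```python
import pandas as pd

df = pd.DataFrame({"beer_servings": [25], "continent": ["Africa"]})

# Convert several columns in one call
df = df.astype({"beer_servings": "float64", "continent": "category"})
```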
Or we can also specify the dtypes of the variables while making the dataframe itself:
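A sketch: the `DataFrame` constructor's `dtype` parameter applies a single dtype to all columns; for per-column dtypes, `astype()` can be chained right after construction:

```python
import pandas as pd

# One dtype applied to every column at construction time
df = pd.DataFrame({"beer_servings": [25, 127]}, dtype="float64")
```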
While we are still talking about dtypes, one little trick to convert date columns to the datetime format directly while reading .csv files is to use the `parse_dates` parameter. At the same time, the `dtype` parameter can be used to specify the dtype(s) of other column(s):
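A sketch using an in-memory CSV (via `io.StringIO`) as a stand-in for a file on disk; the column names are made up:

```python
import io
import pandas as pd

csv = io.StringIO("date,beer_servings\n2022-01-01,25\n2022-01-02,127\n")

# parse_dates converts the listed columns to datetime64 while reading;
# dtype sets the dtypes of other columns in the same call
df = pd.read_csv(csv, parse_dates=["date"], dtype={"beer_servings": "float64"})
```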
3. Using nunique, unique, value_counts:
Checking the number of unique values in the column `continent`:
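For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# nunique() counts distinct values (NaN excluded by default)
n = df["continent"].nunique()
```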
Checking the unique values in the column `continent`:
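For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# unique() returns the distinct values in order of appearance
vals = df["continent"].unique()
```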
Counting the number of values for each category in the column `continent`:
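For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe"]})

# value_counts() tallies each category, most frequent first
counts = df["continent"].value_counts()
```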
If we want the results for each category to be expressed as the proportion of the data that they represent, we must set the `normalize` parameter to `True`:
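For example:

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Africa", "Africa", "Europe", "Europe"]})

# normalize=True returns proportions instead of raw counts
shares = df["continent"].value_counts(normalize=True)
```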
4. Using loc and iloc to get slices of the data:
To get slices of the data from the dataframe, we can use the `loc` and `iloc` methods. The `loc` method uses the column/row labels, while the `iloc` method uses the column/row integer positions.
Using `loc`, we will get all the countries from the continent Africa:
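A sketch, using a small made-up dataframe in the same shape as the drinks data:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France", "Angola"],
    "beer_servings": [25, 127, 217],
    "continent": ["Africa", "Europe", "Africa"],
})

# A boolean condition selects the rows; ":" keeps every column
africa = df.loc[df["continent"] == "Africa", :]
```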
Note: The result tables shown below are only the `head` of the actual result tables, to save space.
We can take a slice of the columns as well. For e.g., if we are interested in the `country`, `beer_servings`, and `continent` columns only, and the rows where the `continent` is Africa:
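For example (the `wine_servings` column below is made up, just to show that it gets dropped):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France"],
    "beer_servings": [25, 127],
    "wine_servings": [0, 370],
    "continent": ["Africa", "Europe"],
})

# Rows where continent is Africa, and only three of the columns
subset = df.loc[df["continent"] == "Africa",
                ["country", "beer_servings", "continent"]]
```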
We can use masks to achieve the same thing as above, but in a slightly neater way (which can also increase the readability of the code):
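For example, naming the row and column selections first keeps the `loc` call short and readable:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France"],
    "beer_servings": [25, 127],
    "continent": ["Africa", "Europe"],
})

# Build the row mask and column list separately, then pass them to loc
mask = df["continent"] == "Africa"
cols = ["country", "beer_servings", "continent"]
subset = df.loc[mask, cols]
```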
Personally, I only use the `iloc` method on the rare occasions when I want to get rows/columns/values at exact positions (for e.g., the first value of the second column). I find the `loc` method more intuitive. But I am sure other people may prefer `iloc` over `loc`. So here is how to achieve the same outcome as above, but using the `iloc` method:
Note: The `.index` of the relevant subset is used for the row slice.
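A sketch of that approach; note that using the filtered index as positions only works here because the dataframe has the default `RangeIndex`, where labels and positions coincide:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Algeria", "France", "Angola"],
    "beer_servings": [25, 127, 217],
    "continent": ["Africa", "Europe", "Africa"],
})

# Take the row positions from the index of the boolean-filtered subset,
# and the columns by their integer positions
rows = df[df["continent"] == "Africa"].index
subset = df.iloc[rows, [0, 1, 2]]
```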
5. Using groupby and pivot table to manipulate data:
Using `groupby` to group the data by `continent`, and then perform one operation (for e.g., one of `mean`, `sum`, `min`, `max`, `count`) on the `beer_servings` column:
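For example, taking the `mean` per group on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# Group by continent, then average beer_servings within each group
means = df.groupby("continent")["beer_servings"].mean()
```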
Using `groupby` to group the data by `continent`, and then perform multiple operations (for e.g., all of `count`, `min`, `max`, `mean`) on the `beer_servings` column:
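For example, with `agg()`, which applies several operations at once and returns one result column per operation:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# One row per continent, one column per aggregation
stats = df.groupby("continent")["beer_servings"].agg(
    ["count", "min", "max", "mean"])
```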
Note: The `groupby` results can be plotted as well. We can do so for either just the `mean` of the `beer_servings` column:
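A sketch, assuming matplotlib is installed (the non-interactive `Agg` backend is used here only so the example runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
})

# Plot the per-continent mean of beer_servings as a bar chart
ax = df.groupby("continent")["beer_servings"].mean().plot(kind="bar")
```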
Or for the `mean` of all the columns in the dataframe (wherever applicable):
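For example, averaging every numeric column per group and plotting one set of bars per column (again assuming matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe"],
    "beer_servings": [25, 217, 127],
    "wine_servings": [0, 45, 370],
})

# mean() averages each numeric column within each group;
# plot() then draws one group of bars per column
ax = df.groupby("continent").mean(numeric_only=True).plot(kind="bar")
```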
Now I’ll use the titanic dataset to make a very basic `pivot_table`:
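A sketch using a tiny hand-made stand-in for the titanic data (the four rows below are made up):

```python
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [1, 1, 3, 3],
    "survived": [0, 1, 1, 0],
})

# pivot_table() groups by the index/columns pair and aggregates the values;
# here: mean survival rate per sex and passenger class
table = pd.pivot_table(titanic, values="survived",
                       index="sex", columns="pclass", aggfunc="mean")
```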
6. BONUS: List Comprehension and Dictionary Comprehension:
Although these are not pandas methods, lists and dictionaries are used extensively while working with data, and list and dictionary comprehensions are concise, efficient ways of building them.
List comprehension can be used in cases where a `for` loop is used to generate values that are appended to a new list:
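For example, squaring each number in a list the long way:

```python
# A plain for loop that appends each squared value to a new list
nums = [1, 2, 3, 4]
squares = []
for n in nums:
    squares.append(n ** 2)
```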
We can do the above in one line of code using list comprehension:
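For example:

```python
# Same result in one line with a list comprehension
nums = [1, 2, 3, 4]
squares = [n ** 2 for n in nums]
```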
Dictionary comprehension can be used in cases where a `for` loop is used to generate keys and/or values that are added to a new dictionary:
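For example, building a {number: square} mapping the long way:

```python
# A plain for loop that adds one key/value pair per iteration
nums = [1, 2, 3]
squares = {}
for n in nums:
    squares[n] = n ** 2
```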
Again, we can do the above in one line of code using dictionary comprehension:
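For example:

```python
# Same dictionary in one line with a dict comprehension
nums = [1, 2, 3]
squares = {n: n ** 2 for n in nums}
```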
Conclusion:
These were some of the pandas methods that I use most of the time when I need to do these types of operations on a dataset. I always try to do things in the best way that I can, because I believe that ‘good practice’ goes a long way. It matters even more when working with big datasets, where the efficiency of the code can affect the overall runtime significantly. Most of these methods also have other parameters beyond what I have covered in this blog; do check them out if you are interested in making the most of the methods. I used only a few of them to keep the size and content of the blog reasonable.
Finally, I thank you all for taking the time to read my blog. If you feel you have a better or more efficient solution than one I mentioned, please feel free to add a comment with your solution(s), because we always keep learning.
Cheers All!
By Parwez Diloo
References:
Check the official pandas documentation here.
For some useful pandas videos, I highly recommend this channel on YouTube.