We all love and cherish pandas. However, there is a new bear in town: Polars. It’s written in Rust and, though heavily influenced by pandas, claims to be better and faster. Let’s evaluate these two bears in terms of syntax readability and performance using a simple DataFrame example.
I downloaded a basic world cities dataset as a CSV file and will use it to demonstrate the capabilities of both libraries.
First, let’s read the CSV:
Pandas
import pandas as pd

df = pd.read_csv(
    "worldcities.csv",
    usecols=["city", "population", "country"],
    dtype={"population": "float32"},
).dropna(subset=["population"])
Polars
import polars as pl

df = pl.read_csv(
    "worldcities.csv",
    columns=["city", "population", "country"],
    dtypes={"population": pl.Float32},  # newer Polars releases call this parameter schema_overrides
).drop_nulls(subset=["population"])
The two are almost identical in syntax, and both are readable. The read times, however, differ noticeably:
Pandas read_csv: 0.07 seconds
Polars read_csv: 0.01 seconds
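The post does not show how these timings were taken. Here is a minimal sketch of one way to measure them, assuming a hypothetical helper `best_of` (not part of either library) built on the standard library's `time.perf_counter`. The stand-in workload is only there so the snippet runs without the CSV file:

```python
import time


def best_of(fn, repeats=5):
    """Call fn `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than a single run.
    return min(timings)


# Stand-in workload; replace the lambda with one of the read_csv calls above,
# e.g. best_of(lambda: pd.read_csv("worldcities.csv")).
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} seconds")
```

Taking the best of several runs is a design choice: the first read often includes one-off costs such as a cold file cache, so a single measurement can exaggerate the gap.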
Now let’s perform some operations:
I want the following from this dataset:
- For each country, find the 5 most populous cities, considering only cities with a population greater than a million
- For each country, get the total and the average population of those selected cities
This to me is a pretty realistic example. Let’s do it in Pandas first:
threshold = 1_000_000
df_pd = (
    df.query("population > @threshold")  # query can reference Python variables with @ in front
    .sort_values("population", ascending=False)  # pandas has ascending=True as default
    .groupby("country")
    .head(5)  # head on a groupby keeps the first 5 rows of each group (a filter, not an aggregation)
    .groupby("country")
    .agg(population_sum=("population", "sum"), population_mean=("population", "mean"))  # named aggregations create the new columns
    .sort_values("population_sum", ascending=False)
    .reset_index()  # reset_index is needed because groupby moves the group key into the index
)
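To make the `groupby(...).head(...)` step concrete, here is a small sketch that runs the same pandas chain on made-up data. The city names, the populations, and the `top_n = 2` setting are assumptions chosen to keep the output short; they are not taken from worldcities.csv:

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset.
toy = pd.DataFrame(
    {
        "city": ["A1", "A2", "A3", "B1", "B2", "C1"],
        "country": ["A", "A", "A", "B", "B", "C"],
        "population": [5_000_000, 3_000_000, 2_000_000, 4_000_000, 500_000, 900_000],
    }
)

top_n = 2  # the article uses 5; 2 keeps the toy output small
threshold = 1_000_000

top_cities = (
    toy.query("population > @threshold")
    .sort_values("population", ascending=False)
    .groupby("country")
    .head(top_n)  # keeps the first top_n rows per country; rows are filtered, not aggregated
)
print(top_cities)  # A1, B1 and A2 survive; A3 is the third city of country A and is dropped

summary = (
    top_cities.groupby("country")
    .agg(population_sum=("population", "sum"), population_mean=("population", "mean"))
    .sort_values("population_sum", ascending=False)
    .reset_index()
)
print(summary)  # country A: sum 8,000,000, mean 4,000,000; country B: sum and mean 4,000,000
```

Country C disappears entirely because its only city falls below the threshold, which is also what happens to small countries in the real dataset.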
And now with Polars:
df_pl = (
    df.filter(pl.col("population") > threshold)  # filter is similar to query but you can use variables directly
    .sort("population", descending=True)  # polars also sorts ascending by default; its keyword is descending, not ascending
    .group_by("country")
    .head(5)  # same as pandas
    .group_by("country")
    .agg(
        pl.sum("population").name.suffix("_sum"),
        pl.mean("population").name.suffix("_mean"),
    )  # this can be done in many ways in polars but I chose this to demonstrate the suffix usage which is neat
    .sort(
        "population_sum",
        descending=True,
    )
    .with_row_index()  # polars does not use an index! so we need to add one manually if needed
)
The outputs:
Pandas:
Polars:
And wow, what a beautiful output from Polars! Pretty CLI output is the way to my heart.
In performance, Polars beats pandas yet again:
Pandas: 0.007008 seconds
Polars: 0.003381 seconds
Conclusion: In this post we saw example usages of the pandas and Polars libraries. If performance is a bottleneck in your pandas operations, Polars is 100% worth a try! The output is wonderful, and having no index/multiindex means fewer errors for the user. There is a great section in the Polars documentation on migrating from pandas, which is worth a read if you’re willing to make the switch.