Polars – the new bear in town

We all love and cherish pandas. However, there is a new bear in town: Polars. It’s written in Rust and, though heavily influenced by pandas, claims to be better and faster. Let’s compare the two bears on syntax readability and performance using a simple DataFrame example.

I downloaded a basic world cities dataset as a CSV file and will use it to demonstrate the capabilities of both libraries.

First, let’s read the CSV:

Pandas

import pandas as pd

df = pd.read_csv(
    "worldcities.csv",
    usecols=["city", "population", "country"],
    dtype={"population": "float32"},
).dropna(subset=["population"])

Polars

import polars as pl

df = pl.read_csv(
    "worldcities.csv",
    columns=["city", "population", "country"],
    dtypes={"population": pl.Float32},  # newer Polars releases rename this parameter to schema_overrides
).drop_nulls(subset=["population"])

The two are almost identical in syntax, and both are readable. The performance, however, is not identical:

Pandas read_csv: 0.07 seconds
Polars read_csv: 0.01 seconds
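
The post does not show how these timings were taken. Below is a minimal sketch of one way to measure them with `time.perf_counter`. The helper name `time_call` and the stand-in workload are hypothetical, chosen so the snippet runs without the data file:

```python
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload so the sketch runs without any CSV file.
seconds, total = time_call(lambda: sum(range(100_000)))
print(f"stand-in workload: {seconds:.6f} seconds")
```

Swapping the lambda for `lambda: pd.read_csv("worldcities.csv")`, or the Polars equivalent, would time the reads above. Taking the best of several runs reduces noise from disk caching and other processes.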

Now let’s perform some operations:

I want the following from this dataset:

  • For each country, find the 5 most populous cities among those with a population greater than one million
  • For each country, get the total and average population of the selected cities
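
To pin down what these two requirements mean before reaching for either library, here is a small plain-Python sketch on made-up rows. The city names, countries, and populations are hypothetical, and `summarize` is an illustrative helper, not part of pandas or Polars:

```python
from collections import defaultdict

# Hypothetical toy rows: (city, country, population). Values are made up.
rows = [
    ("X1", "Xland", 5_000_000),
    ("X2", "Xland", 3_000_000),
    ("X3", "Xland", 2_000_000),
    ("X4", "Xland", 500_000),    # not above the threshold, so it is dropped
    ("Y1", "Yland", 1_500_000),
    ("Y2", "Yland", 900_000),    # not above the threshold, so it is dropped
]


def summarize(rows, threshold=1_000_000, top_n=5):
    """Per country: sum and mean population of its top_n cities above threshold."""
    by_country = defaultdict(list)
    for _city, country, population in rows:
        if population > threshold:
            by_country[country].append(population)

    summary = {}
    for country, populations in by_country.items():
        top = sorted(populations, reverse=True)[:top_n]
        summary[country] = {
            "population_sum": sum(top),
            "population_mean": sum(top) / len(top),
        }
    return summary


# top_n=2 keeps the toy example small; the post itself uses 5.
result = summarize(rows, top_n=2)
print(result)
```

The order of operations matters: the threshold filter runs first, then the top-N selection per country, and only then the aggregation.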

To me, this is a pretty realistic example. Let’s do it in Pandas first:

threshold = 1_000_000
df_pd = (
    df.query("population > @threshold")  # query can reference Python variables with an @ prefix
    .sort_values("population", ascending=False)  # pandas sorts ascending by default
    .groupby("country")
    .head(5)  # head on a groupby keeps the first 5 rows of each group
    .groupby("country")
    .agg(population_sum=("population", "sum"), population_mean=("population", "mean"))  # named aggregation creates the new columns
    .sort_values("population_sum", ascending=False)
    .reset_index()  # groupby puts the group key in the index, so move it back to a column
)

And now with Polars:

df_pl = (
    df.filter(pl.col("population") > threshold)  # filter is similar to query, but variables can be used directly
    .sort("population", descending=True)  # polars also sorts ascending by default, so ask for descending
    .group_by("country")
    .head(5)  # same as pandas
    .group_by("country")
    .agg(
        pl.sum("population").name.suffix("_sum"),
        pl.mean("population").name.suffix("_mean"),
    )  # this can be done in many ways in polars, but I chose this to demonstrate the suffix usage, which is neat
    .sort(
        "population_sum",
        descending=True,
    )
    .with_row_index()  # polars has no index, so add a row-number column explicitly if you need one
)

The outputs:

Pandas: [output image]

Polars: [output image]

And wow, what a beautiful output from Polars! Pretty CLI output is the way to my heart.

And in performance, Polars beats Pandas yet again:

Pandas: 0.007008 seconds
Polars: 0.003381 seconds

Conclusion: In this post we saw example usages of the Pandas and Polars libraries. If performance is a bottleneck in your pandas operations, Polars is 100% worth a try! The output is wonderful, and no index/multiindex means fewer errors for the user. The Polars documentation has a great section on migrating from Pandas, which is worth a read if you’re willing to make the switch.

By Zeynep Bicer
