If you use Python for your data work, you've most likely experienced the frustration of waiting several minutes for a Pandas operation to finish.
Everything seems fine at first, but as your dataset grows and your workflows become more complex, your laptop suddenly feels like it's preparing for liftoff.
A few months ago, I worked on a project analyzing e-commerce transactions that contained over 3 million rows of data.
It was an eye-opening experience, mostly because simple groupby operations that usually take seconds suddenly stretched into minutes.
That's when I realized that while Pandas is great, it isn't always enough.
In this article, we'll explore modern alternatives to Pandas, such as Polars and DuckDB, and see how they simplify and speed up processing of large datasets.
For the sake of clarity, let me be upfront about a few things before we begin.
This article is not a deep dive into memory management in Rust, nor is it a declaration that Pandas is obsolete.
Instead, it's a practical, hands-on guide. You'll find real-world examples, personal experiences, and actionable insights into workflows that can save you time and sanity.
Why Pandas feels slow
While working on that e-commerce project, I remember dealing with CSV files that were over 2 GB. Pandas filters and aggregations often took several minutes to complete.
During that time, I stared at the screen, wishing I could grab a cup of coffee or binge-watch a few episodes of a show while the code was running.
The main issues I encountered were speed, memory, and workflow complexity.
We all know that large CSV files consume a lot of RAM, often more than my laptop can comfortably handle. On top of that, chaining multiple transformations makes the code difficult to maintain and slow to run.
Polars and DuckDB address these challenges in different ways.
Built in Rust, Polars uses multithreaded execution to process large datasets efficiently.
DuckDB, on the other hand, is designed for analytics and executes SQL queries without loading everything into memory.
Basically, each one has its own superpower: Polars is the speedster and DuckDB is the memory wizard.
And the best part? Both integrate seamlessly with Python, allowing you to improve your workflow without completely rewriting it.
Setting up the environment
Before you start coding, make sure your environment is ready. For consistency, we used Pandas 2.2.0, Polars 0.20.0, and DuckDB 1.9.0.
Pinning versions keeps things reproducible when you follow tutorials or share code.
pip install pandas==2.2.0 polars==0.20.0 duckdb==1.9.0
Import the libraries in Python:
import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")
As an example, we'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. Similar datasets can be downloaded from Kaggle, or you can generate synthetic data.
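If you prefer to generate the data yourself, here is a minimal sketch that writes a synthetic sales.csv with the columns used throughout this article. The column names, value ranges, and row count are assumptions for illustration, so adjust them to match your own data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # number of synthetic rows; increase to stress-test

# Assumed schema matching the columns referenced later in the article
synthetic = pd.DataFrame({
    "order_id": np.arange(n),
    "product_id": rng.integers(1, 500, n),
    "region": rng.choice(["Europe", "Asia", "North America"], n),
    "country": rng.choice(["Germany", "France", "Japan", "USA", "Brazil"], n),
    "segment": rng.choice(["Retail", "Wholesale", "Online"], n),
    "revenue": rng.uniform(5, 500, n).round(2),
    "amount": rng.integers(1, 300, n),
    "date": pd.Timestamp("2024-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
})
synthetic.to_csv("sales.csv", index=False)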
Loading data
Loading data efficiently sets the tone for the rest of your workflow. I remember a project that had almost 5 million rows in a CSV file.
Pandas did the trick, but the load times were long, and reloading repeatedly during testing was a pain.
It was one of those moments where I wished my laptop had a "fast forward" button.
Switching to Polars and DuckDB changed everything: suddenly I could access and manipulate the data almost instantly, and honestly, testing and iterating became much more enjoyable.
For pandas:
df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))
For Polars:
df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))
With DuckDB:
con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))
DuckDB lets you query CSVs directly without loading the entire dataset into memory, making it much easier to work with large files.
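For example, you can pull a row count or just a couple of columns straight from the file. This small sketch assumes the sales.csv file described above and the con connection created earlier.
# Only the requested columns and rows are read from the file
row_count = con.execute("SELECT COUNT(*) FROM 'sales.csv'").fetchone()[0]
preview = con.execute("SELECT country, revenue FROM 'sales.csv' LIMIT 5").df()
print(row_count)
print(preview)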
Filtering data
The problem here is that filtering with Pandas can be slow when dealing with millions of rows. Once, I needed to analyze European transactions in a large sales dataset, and Pandas took several minutes, which slowed down the analysis.
For pandas:
filtered_pd = df_pd[df_pd.region == "Europe"]
Polars is faster and can handle multiple filters efficiently.
filtered_pl = df_pl.filter(pl.col("region") == "Europe")
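As a small illustration of combining several conditions in one pass (the revenue threshold here is just an assumed example value):
# Multiple conditions evaluated together in a single filter call
filtered_multi_pl = df_pl.filter(
    (pl.col("region") == "Europe") & (pl.col("revenue") > 100)
)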
DuckDB uses SQL syntax:
filtered_duck = con.execute("""
SELECT *
FROM 'sales.csv'
WHERE region = 'Europe'
""").df()
You can now filter large datasets in seconds instead of minutes, giving you more time to focus on the insights that really matter.
Aggregating large datasets quickly
Aggregation is often where Pandas starts to feel sluggish. Imagine you need to calculate total revenue by country for a marketing report.
For pandas:
agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()
For Polars:
agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())
For DuckDB:
agg_duck = con.execute("""
SELECT country, SUM(revenue) AS total_revenue
FROM 'sales.csv'
GROUP BY country
""").df()
I remember running this aggregation on a dataset of 10 million rows. Pandas took nearly half an hour. Polars completed the same job in less than a minute.
It was a relief, like finishing a marathon and realizing I could still use my legs.
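Your numbers will depend on your hardware and data, so it's worth timing each approach on your own machine. Here is a minimal sketch using time.perf_counter, assuming the DataFrames and connection from the earlier snippets:
import time

start = time.perf_counter()
df_pd.groupby("country")["revenue"].sum()
print(f"Pandas: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
df_pl.group_by("country").agg(pl.col("revenue").sum())
print(f"Polars: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
con.execute("SELECT country, SUM(revenue) FROM 'sales.csv' GROUP BY country").df()
print(f"DuckDB: {time.perf_counter() - start:.2f}s")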
Joining large datasets
Combining datasets is one of those tasks that sounds easy until you actually get into the data.
In real-world projects, data typically lives in multiple sources, so you need to join them on a shared column, such as customer ID.
I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.
Each file was large enough on its own, but combining them felt like trying to force two puzzle pieces together while my laptop begged for forgiveness.
In Pandas, it was so time-consuming that I started timing the joins the way people time their microwave popcorn.
Spoiler: the popcorn won every time.
Polars and DuckDB were a lifesaver.
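The snippets below join the sales data to a second, country-level dataset. As an assumed setup step (matching the 'pop.csv' file referenced in the DuckDB query), we load that file into both libraries first:
# Assumed setup: a second dataset sharing the "country" column
pop_df_pd = pd.read_csv("pop.csv")
pop_df_pl = pl.read_csv("pop.csv")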
For pandas:
merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")
For Polars:
merged_pl = df_pl.join(pop_df_pl, on="country", how="left")
For DuckDB:
merged_duck = con.execute("""
SELECT *
FROM 'sales.csv' s
LEFT JOIN 'pop.csv' p
USING (country)
""").df()
Large datasets that would previously freeze my workflow now combine smoothly and efficiently.
Lazy evaluation in Polars
One thing I didn't appreciate early in my data science journey was how much time is wasted performing transformations line by line.
Polars takes a different approach here, using a technique called lazy evaluation. Basically, it waits until the transformation definitions are complete before performing any work.
It inspects the entire pipeline, determines the most efficient path, and runs everything at once.
It's like having a friend who listens to your whole order before heading to the kitchen, instead of one who takes each instruction individually and keeps going back and forth.
This TDS article explains lazy evaluation in detail.
The flow looks like this.
Pandas:
df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")
Polars lazy mode:
import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
    .sort("amount")
)
result = df_lazy.collect()
The first time I used lazy mode, I found it strange that I didn't see results right away. But when I finally ran .collect(), the speed difference was obvious.
Lazy evaluation doesn't magically solve every performance problem, but it provides a level of efficiency that Pandas was never designed for.
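If you're curious what Polars plans to run before you call .collect(), you can print the optimized query plan; a small sketch:
# Show the optimized plan Polars will execute for the lazy pipeline above
print(df_lazy.explain())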
Conclusion and key takeaways
There's no need to wrestle with your tools when working with large datasets.
With Polars and DuckDB, we've seen that the problem isn't always the data. In some cases, it's the tools we use to handle it.
If there's one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your dataset starts to exceed its limits.
Polars gives you speed and smart execution, and DuckDB lets you query huge files as if they were small. Together, they make working with big data more manageable and less tiring.
If you'd like to learn more about the ideas explored in this tutorial, the official Polars and DuckDB documentation is a great place to start.

