The use of big data in all disciplines is growing.
Usually, we load datasets directly in R (or Stata, or whatever).
This means that we are limited by the amount of RAM on our computer.
You want to handle more data? Get more RAM.
But we can’t necessarily have all the RAM we need.
And we don’t necessarily need all this RAM in the first place.
polars
polars is a recent DataFrame library that is available for several languages:
Built to have a syntax similar (though not identical) to Python Pandas.
Built to be very fast and memory-efficient thanks to several mechanisms.
Eager evaluation: all operations are run line by line, in order, and directly applied on the data. This is the way we’re used to.
Lazy evaluation: operations are only run when we call a specific function at the end of the chain, usually called collect().
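For instance, here is a minimal sketch of both modes in Python (the file and column names are hypothetical):

```python
import polars as pl

# Eager: read_csv() loads the whole file in RAM and each
# operation runs immediately, in order.
df = pl.read_csv("data.csv")
res_eager = df.sort("country").filter(pl.col("year") > 2000)

# Lazy: scan_csv() returns a LazyFrame; nothing is computed
# until collect() is called at the end of the chain.
res_lazy = (
    pl.scan_csv("data.csv")
    .sort("country")
    .filter(pl.col("year") > 2000)
    .collect()
)
```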
When dealing with large data, it is better to use lazy evaluation.
Why is it better to use lazy evaluation?
The code below takes some data, sorts it by a variable, and then filters it based on a condition:
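A sketch of such a query (hypothetical file and column names; the filter matches the query plan shown at the end of these slides):

```python
import polars as pl

df = pl.read_csv("data.csv")

result = (
    df
    .sort("country")  # sort all the rows first...
    .filter(pl.col("country").is_in(["France", "Italy"]))  # ...then keep only a few of them
)
```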
Do you see what could be improved here?
The problem lies in the order of operations: sorting data is much slower than filtering it, so we should filter first and sort the (much smaller) result.
Let’s test with a dataset of 50M observations and 10 columns:
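A sketch of how this comparison could be run, on synthetic data (actual timings depend on the machine):

```python
import time

import numpy as np
import polars as pl

# Synthetic dataset: 50M rows (only two of the columns are shown here).
n = 50_000_000
rng = np.random.default_rng(1)
df = pl.DataFrame({
    "country": rng.choice(["France", "Italy", "Spain", "Japan"], size=n),
    "year": rng.integers(1950, 2023, size=n),
})

start = time.perf_counter()
df.sort("country").filter(pl.col("country") == "France")
print(f"sort then filter: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
df.filter(pl.col("country") == "France").sort("country")
print(f"filter then sort: {time.perf_counter() - start:.1f}s")
```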
There is probably a ton of suboptimal code in our scripts.
But it’s already hard enough to write scripts that work and are reproducible; we don’t want to spend even more time trying to optimize them.
Let polars do this automatically.
Workflow:
Start by scanning the data, e.g. with scan_csv() or scan_parquet(). This only returns the schema of the data: the column names and their types (character, integer, …).
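In Python, this first step could look like (hypothetical file name):

```python
import polars as pl

# scan_parquet() doesn't load the data in RAM: it returns a
# LazyFrame whose schema we can already inspect.
lf = pl.scan_parquet("data.parquet")
print(lf.schema)  # e.g. {'country': String, 'year': Int64, ...}
```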
Then, add collect() at the end of the code to execute it. polars doesn’t directly execute the code: before that, it does a lot of optimizations to be sure that we don’t run inefficient operations.
Examples of optimizations:
Predicate pushdown: apply filters as early as possible, ideally while reading the data.
Projection pushdown: read only the columns that are actually needed.
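We can see both versions of a query plan with the explain() method of a LazyFrame in the Python package (file and column names are hypothetical):

```python
import polars as pl

lf = (
    pl.scan_parquet("data.parquet")
    .sort("country")
    .filter(pl.col("country").is_in(["France", "Italy"]))
)

print(lf.explain(optimized=False))  # the plan as written
print(lf.explain())                 # the plan after optimization
```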
Calling collect() doesn’t start computations right away.
First, polars scans the code to ensure there are no schema errors, i.e. it checks that we don’t do “forbidden” operations.
For instance, doing pl.col("gdp") > "France" would be an error: we can’t compare a number to a character.
In this case, that would return:
polars.exceptions.ComputeError: cannot compare string with numeric data
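A minimal sketch that triggers this error (the exact exception type and message may vary across polars versions):

```python
import polars as pl

lf = pl.LazyFrame({"gdp": [1.0, 2.5]}).filter(pl.col("gdp") > "France")

# The error is only raised when we ask for execution:
lf.collect()  # polars.exceptions.ComputeError
```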
It is possible that the collected dataset is still too big for our RAM.
If that happens, the Python or R session will crash.
Streaming is a way to run the code on the data in batches, to avoid using all the memory at the same time.
Polars takes care of splitting the data and runs the code piece by piece.
Using this is extremely simple: instead of calling collect(), we call collect(streaming = TRUE).
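In Python, a sketch of the same idea, with hypothetical file and column names (recent versions of the Python package spell this collect(engine="streaming")):

```python
import polars as pl

result = (
    pl.scan_parquet("data.parquet")
    .filter(pl.col("country").is_in(["France", "Italy"]))
    .group_by("country")
    .agg(pl.col("var1").mean())
    .collect(streaming=True)  # run the query in batches
)
```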
Warning
For now, this is still an early, incomplete feature of polars.
Not all operations can be run in streaming mode.
We’re used to a few file formats: CSV, Excel, .dta. Polars can read most of them (.dta is not possible for now).
When possible, use the Parquet format (.parquet).
Pros: files are much smaller than CSVs thanks to compression; storage is columnar, so only the columns we need are read; column types are stored in the file.
Cons: not human-readable; a Parquet file can’t be opened in a text editor or in Excel.
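For example, a one-off conversion from CSV to Parquet, after which we always scan the Parquet file (hypothetical paths):

```python
import polars as pl

# sink_parquet() writes the result of a lazy query to disk
# without loading everything in RAM.
pl.scan_csv("data.csv").sink_parquet("data.parquet")

# Later: only the needed columns and row groups are read.
df = (
    pl.scan_parquet("data.parquet")
    .filter(pl.col("country") == "France")
    .collect()
)
```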
UK Census data
About 40M observations, 110+ variables
If you’re interested:
Getting a bigger computer shouldn’t necessarily be our first reflex for handling large data.
There are several other tools available: arrow, DuckDB, Spark, and others.
Do you use any of them? How do you deal with large datasets?
Slides: https://github.com/etiennebacher/handling-large-data
Typos, comments: https://github.com/etiennebacher/handling-large-data/issues
Using the previous example, the non-optimized query looks like this:
FILTER col("country").is_in([Series]) FROM
  SORT BY [col("country")]
    DF ["country", "year", "var1", "var2"]; PROJECT */10 COLUMNS; SELECTION: None
And the optimized query looks like this: the filter was pushed down into the scan itself (the SELECTION field), so the rows are discarded before being sorted.
SORT BY [col("country")]
  DF ["country", "year", "var1", "var2"]; PROJECT */10 COLUMNS; SELECTION: col("country").is_in([Series])