The use of big data in economics is growing:
Usually, we load datasets directly in R (or Stata, or whatever).
This means that we are limited by the amount of RAM on our computer.
You want to handle more data? Get more RAM.
But we can’t necessarily have all the RAM we need.
And we don’t necessarily need all this RAM in the first place.
polars is a recent DataFrame library that is available for several languages:
Built to have a syntax similar (though not identical) to Python's pandas.
Built to be very fast and memory-efficient thanks to several mechanisms.
Eager evaluation: all operations are run line by line, in order, and directly applied on the data. This is the way we’re used to.
Lazy evaluation: operations are only run when we call a specific function at the end of the chain, usually called collect().
When dealing with large data, it is better to use lazy evaluation.
Why is it better to use lazy evaluation?
The code below takes some data, sorts it by a variable, and then filters it based on a condition:
Do you see what could be improved here?
The problem lies in the order of operations: sorting is much more expensive than filtering, so we sort millions of rows only to discard most of them right after. Filtering first leaves far less data to sort.
Let’s test with a dataset of 2M observations and 2 columns:
We gained a lot of time just by reordering these two operations.
There is probably a ton of suboptimal code in our scripts.
But it's already hard enough to write scripts that work and are reproducible; we don't want to spend even more time optimizing them.
Let polars do this automatically.
Workflow:
This only returns the schema of the data: the column names and their types (character, integers, …).
We add collect() at the end of the query to execute it. polars doesn't directly execute the code: before that, it performs a lot of optimizations to make sure we don't run inefficient operations.
Examples of optimizations: predicate pushdown (apply filters as early as possible) and projection pushdown (read only the columns that are needed).
Using the previous example, the non-optimized query looks like this:
FILTER col("country").is_in([Series]) FROM
SORT BY [col("country")]
DF ["country", "year"]; PROJECT */2 COLUMNS; SELECTION: "None"
Calling collect() doesn't start computations right away. First, polars scans the query to ensure there are no schema errors, e.g. doing pl.col("gdp") > "France".
In this specific case, that would return:
polars.exceptions.ComputeError: cannot compare string with numeric data
It is possible that the collected dataset is still too big for our RAM.
In this case, the R / Python session will crash.
Streaming is a way to run the query on batches of data, so that the whole dataset never has to be in memory at once.
Using this is extremely simple: instead of calling collect(), we call collect(streaming = TRUE).
Warning
This is still an early, incomplete feature of polars.
If you’re interested:
These projects are evolving rapidly; please report any bugs on GitHub!
Getting bigger computers shouldn’t necessarily be the first reflex to handle large data.
There are several other tools available: arrow, DuckDB, Spark, and others.
Do you use any of them? How do you deal with large datasets?