If you prefer a prettier rendering of the code and dataframes displayed in this article, you are welcome to read the notebook at my GitHub repo.
How fast is Polars for realistic data processing jobs?
The dataframe library Pandas has long been a
workhorse for data scientists, ML engineers, and other data professionals.
It has a broad, convenient, user-friendly API for transforming and
summarizing data, which allows us to quickly discern insights from our data
and engineer data pipelines for production. But how quickly, exactly? A
new dataframe library, Polars, claims to be much faster at processing data than Pandas and other common tools.
I was curious about how much faster Polars might be not only on benchmarks but also as part of realistic data workflows. So, I created some data to be processed by Pandas or Polars. Since file-reading is typically among the slower parts of a data pipeline, I stored the data in either the older, ubiquitous comma-separated-values (CSV) format or the newer, column-based Parquet format. Additionally, Polars can be used from either Python or Rust, so comparing the workflow in those two languages would be useful. Finally, Polars jobs can be processed eagerly or lazily, so we can compare both execution approaches in our workflow.
To summarize, we’ll compare:
- Pandas vs. Polars
- Python Polars vs. Rust Polars
- CSV vs. Parquet
- Polars Eager vs. Polars Lazy
For each comparison pair in the list above, the second variation is the one I expected to be faster. That is, I expected Polars to be faster than Pandas, Rust to be faster than Python, Parquet to be faster than CSV, and Lazy to be faster than Eager.
Creating the Data
import pandas as pd
from pathlib import Path
def get_synthetic_data_path() -> Path:
    """
    Returns path to directory where synthetic data is stored
    """
    path = Path.home() / 'data' / 's01_synthetic_data'
    path.mkdir(parents=True, exist_ok=True)
    return path
def load_key_table(dir_path: Path, file_extension: str) -> pd.DataFrame:
    """
    Load table with key column that joins to the main data set
    """
    filename_pattern = '*table' + file_extension
    file_list = list(dir_path.glob(filename_pattern))
    # read only the first matching file, with the reader that matches the extension
    if file_extension == '.parquet':
        df_key = pd.read_parquet(file_list[0])
    else:
        df_key = pd.read_csv(file_list[0])
    return df_key
I created a data set of 100 tables, each with 100,000 rows and 20 columns (lines 88-90). The 100 CSV files occupied 2.0 GB of space on my hard drive, while the 100 Parquet files occupied 1.0 GB. The first 5 columns of each table were composed of randomly selected characters; the next 5 columns, of random integers; and the final 10 columns, of random floats. An example is shown below.
input_filepath = Path.home() / 'data' / 's01_synthetic_data' / 'table_0.csv'
df_0 = pd.read_csv(input_filepath)
df_0
 | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | j | p | x | r | s | 5 | 15 | 19 | 10 | 15 | 0.646004 | 0.120232 | 0.026551 | 0.023453 | 0.606443 | 0.813186 | 0.640565 | 0.994134 | 0.774991 | 0.667491 |
1 | d | q | n | q | y | 4 | 7 | 14 | 26 | 4 | 0.340173 | 0.745175 | 0.876945 | 0.768881 | 0.019672 | 0.084978 | 0.371513 | 0.688281 | 0.210810 | 0.159730 |
2 | f | d | e | z | d | 16 | 14 | 5 | 10 | 24 | 0.619300 | 0.152788 | 0.403224 | 0.767596 | 0.952308 | 0.008318 | 0.759884 | 0.464614 | 0.127938 | 0.333332 |
3 | i | s | n | l | o | 6 | 6 | 15 | 14 | 1 | 0.940586 | 0.722262 | 0.107942 | 0.853040 | 0.312398 | 0.713029 | 0.285081 | 0.382372 | 0.121249 | 0.883717 |
4 | a | f | t | v | g | 12 | 12 | 18 | 14 | 26 | 0.286733 | 0.168195 | 0.636159 | 0.949421 | 0.897232 | 0.723068 | 0.425204 | 0.565747 | 0.250786 | 0.822446 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
99995 | d | u | f | f | b | 3 | 12 | 24 | 8 | 15 | 0.751882 | 0.719684 | 0.486673 | 0.138891 | 0.303526 | 0.540719 | 0.628315 | 0.664754 | 0.593231 | 0.546143 |
99996 | a | v | n | u | m | 5 | 23 | 12 | 4 | 14 | 0.292462 | 0.791517 | 0.425860 | 0.455843 | 0.272213 | 0.944931 | 0.695094 | 0.910036 | 0.168693 | 0.293627 |
99997 | i | q | f | d | y | 23 | 26 | 5 | 15 | 11 | 0.331877 | 0.852896 | 0.193597 | 0.249045 | 0.952547 | 0.816969 | 0.718859 | 0.467703 | 0.826881 | 0.969302 |
99998 | z | s | m | l | q | 7 | 22 | 17 | 9 | 17 | 0.195640 | 0.450491 | 0.017788 | 0.992169 | 0.743146 | 0.460716 | 0.626521 | 0.556164 | 0.913457 | 0.121041 |
99999 | d | g | o | q | p | 2 | 13 | 20 | 15 | 8 | 0.899047 | 0.602126 | 0.022941 | 0.596927 | 0.673683 | 0.409312 | 0.694958 | 0.277388 | 0.645721 | 0.676891 |
100000 rows × 20 columns
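The script that actually generated these files lives in the repo. As a rough, illustrative sketch, a single table could be produced along the following lines; the random seed, integer range, and output filenames here are assumptions and may differ from the repo's script.

import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# 5 columns of random lowercase letters, 5 of random integers, 10 of random floats
letters = list(string.ascii_lowercase)
char_cols = {c: rng.choice(letters, size=n_rows) for c in 'ABCDE'}
int_cols = {c: rng.integers(1, 27, size=n_rows) for c in 'FGHIJ'}   # range assumed
float_cols = {c: rng.random(n_rows) for c in 'KLMNOPQRST'}

df = pd.DataFrame({**char_cols, **int_cols, **float_cols})
df.to_csv('table_0.csv', index=False)
df.to_parquet('table_0.parquet', index=False)   # requires pyarrow or fastparquet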
All 100 tables like this example above have the same column names and the same data types for their corresponding columns. After loading all these data tables, the first data processing step will be to vertically concatenate all these tables into a single, 10-million-row table (i.e., 100 tables x 100,000 rows = 10,000,000 rows).
I created an additional key table with 26 rows and 4 columns (lines 50-72). The first key column contains every letter of the alphabet, and the other 3 columns contain randomly selected integers.
synthetic_data_path = get_synthetic_data_path()
file_extension = '.csv'
df_key = load_key_table(synthetic_data_path, file_extension)
df_key
 | key | 0 | 1 | 2 |
---|---|---|---|---
0 | a | -5 | -2 | -2 |
1 | b | -4 | -4 | -4 |
2 | c | -2 | -4 | -3 |
3 | d | -5 | -5 | -2 |
4 | e | -5 | -5 | -5 |
5 | f | -3 | -2 | -5 |
6 | g | -3 | -5 | -4 |
7 | h | -2 | -3 | -3 |
8 | i | -4 | -2 | -5 |
9 | j | -3 | -3 | -3 |
10 | k | -3 | -3 | -4 |
11 | l | -4 | -2 | -2 |
12 | m | -5 | -5 | -3 |
13 | n | -5 | -5 | -4 |
14 | o | -4 | -4 | -4 |
15 | p | -4 | -2 | -3 |
16 | q | -5 | -4 | -2 |
17 | r | -2 | -3 | -4 |
18 | s | -4 | -2 | -3 |
19 | t | -2 | -3 | -3 |
20 | u | -4 | -4 | -3 |
21 | v | -5 | -2 | -5 |
22 | w | -2 | -5 | -4 |
23 | x | -5 | -5 | -5 |
24 | y | -2 | -5 | -3 |
25 | z | -4 | -3 | -3 |
The key table is shown above. It is used in the second data processing step, when it is joined on its key column to a column of characters in the 10-million-row table.
Data Processing Steps
Only 3 columns from each of the 100 CSV or Parquet files are used: a column of characters, a column of integers, and a column of floats. An example is shown below.
filename_pattern = 'table_*' + file_extension
file_list = list(synthetic_data_path.glob(filename_pattern))
col_list = ['A', 'I', 'P']
df_list = [pd.read_csv(e, usecols=col_list) for e in file_list]
df_list[0].head()
 | A | I | P |
---|---|---|---
0 | j | 10 | 0.813186 |
1 | d | 26 | 0.084978 |
2 | f | 10 | 0.008318 |
3 | i | 14 | 0.713029 |
4 | a | 14 | 0.723068 |
We’ve already mentioned the first two data processing steps after reading the tables: vertically concatenating the 100 tables and then joining the key table to the resulting, concatenated table. We then drop the key column, because it is no longer needed after the join. These 3 steps are shown below.
The vertically concatenated table is shown below.
df_all_01 = pd.concat(df_list, axis=0).reset_index(drop=True)
df_all_01
 | A | I | P |
---|---|---|---
0 | j | 10 | 0.813186 |
1 | d | 26 | 0.084978 |
2 | f | 10 | 0.008318 |
3 | i | 14 | 0.713029 |
4 | a | 14 | 0.723068 |
… | … | … | … |
9999995 | d | 8 | 0.540719 |
9999996 | a | 4 | 0.944931 |
9999997 | i | 15 | 0.816969 |
9999998 | z | 9 | 0.460716 |
9999999 | d | 15 | 0.409312 |
10000000 rows × 3 columns
The joined table is shown below.
df_all_02 = df_all_01.merge(df_key, left_on='A', right_on='key', how='left')
df_all_02
 | A | I | P | key | 0 | 1 | 2 |
---|---|---|---|---|---|---|---
0 | j | 10 | 0.813186 | j | -3 | -3 | -3 |
1 | d | 26 | 0.084978 | d | -5 | -5 | -2 |
2 | f | 10 | 0.008318 | f | -3 | -2 | -5 |
3 | i | 14 | 0.713029 | i | -4 | -2 | -5 |
4 | a | 14 | 0.723068 | a | -5 | -2 | -2 |
… | … | … | … | … | … | … | … |
9999995 | d | 8 | 0.540719 | d | -5 | -5 | -2 |
9999996 | a | 4 | 0.944931 | a | -5 | -2 | -2 |
9999997 | i | 15 | 0.816969 | i | -4 | -2 | -5 |
9999998 | z | 9 | 0.460716 | z | -4 | -3 | -3 |
9999999 | d | 15 | 0.409312 | d | -5 | -5 | -2 |
10000000 rows × 7 columns
The joined table without the key column is shown below.
df_all_03 = df_all_02.drop('key', axis=1)
df_all_03.head()
 | A | I | P | 0 | 1 | 2 |
---|---|---|---|---|---|---
0 | j | 10 | 0.813186 | -3 | -3 | -3 |
1 | d | 26 | 0.084978 | -5 | -5 | -2 |
2 | f | 10 | 0.008318 | -3 | -2 | -5 |
3 | i | 14 | 0.713029 | -4 | -2 | -5 |
4 | a | 14 | 0.723068 | -5 | -2 | -2 |
The next step is to group the rows of the 10-million-row table by the column of characters and calculate the means of the remaining 5 columns (2 columns from the 100 tables and 3 columns from the key table) for each group. The resulting table is shown below.
df_all_04 = df_all_03.groupby('A').mean()
df_all_04
A | I | P | 0 | 1 | 2 |
---|---|---|---|---|---
a | 13.591212 | 0.506935 | -5.0 | -2.0 | -2.0 |
b | 13.538183 | 0.502054 | -4.0 | -4.0 | -4.0 |
c | 13.341660 | 0.501150 | -2.0 | -4.0 | -3.0 |
d | 13.498158 | 0.499589 | -5.0 | -5.0 | -2.0 |
e | 13.415902 | 0.491097 | -5.0 | -5.0 | -5.0 |
f | 13.643302 | 0.496962 | -3.0 | -2.0 | -5.0 |
g | 13.517366 | 0.500657 | -3.0 | -5.0 | -4.0 |
h | 13.407870 | 0.497103 | -2.0 | -3.0 | -3.0 |
i | 13.425826 | 0.501525 | -4.0 | -2.0 | -5.0 |
j | 13.305504 | 0.494456 | -3.0 | -3.0 | -3.0 |
k | 13.476422 | 0.497540 | -3.0 | -3.0 | -4.0 |
l | 13.433053 | 0.496376 | -4.0 | -2.0 | -2.0 |
m | 13.467360 | 0.506149 | -5.0 | -5.0 | -3.0 |
n | 13.544786 | 0.503874 | -5.0 | -5.0 | -4.0 |
o | 13.314315 | 0.502663 | -4.0 | -4.0 | -4.0 |
p | 13.231397 | 0.499731 | -4.0 | -2.0 | -3.0 |
q | 13.546316 | 0.497826 | -5.0 | -4.0 | -2.0 |
r | 13.639637 | 0.504420 | -2.0 | -3.0 | -4.0 |
s | 13.570135 | 0.503811 | -4.0 | -2.0 | -3.0 |
t | 13.609737 | 0.496497 | -2.0 | -3.0 | -3.0 |
u | 13.288251 | 0.500405 | -4.0 | -4.0 | -3.0 |
v | 13.569074 | 0.496215 | -5.0 | -2.0 | -5.0 |
w | 13.516598 | 0.503870 | -2.0 | -5.0 | -4.0 |
x | 13.331851 | 0.497673 | -5.0 | -5.0 | -5.0 |
y | 13.828782 | 0.504892 | -2.0 | -5.0 | -3.0 |
z | 13.411154 | 0.499781 | -4.0 | -3.0 | -3.0 |
Finally, the means of the groups are themselves averaged for each column, and the results are printed.
print('means\n', df_all_04.mean())
means
 I    13.479379
P     0.500125
0    -3.692308
1    -3.538462
2    -3.500000
dtype: float64
In summary, our steps are:
1. Read 3 columns from each of the 100 tables
2. Read the "key" table
3. Vertically concatenate the 100 tables
4. Join the "key" table to the concatenated table
5. Drop the "key" column
6. Group by the column of characters and calculate the group means
7. Calculate the means of the group means for each column
8. Print the final set of means
These steps provide us with a set of tasks that we might see in a typical data pipeline. While the data values are randomly selected and thus not meaningful, they do offer a mix of characters/strings, integers, and floats, much like what we might find in a lot of data sets.
These steps are implemented for each of the processing approaches that we want to compare.
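As an illustration of how the other variations implement these same steps, here is a rough sketch of the pipeline in Python Polars with lazy execution. It is not the repo's exact code: the key-table filename is assumed, and it targets recent Polars versions (which spell the grouping method group_by). Note that Polars drops the right-hand join key from the output, so step 5 needs no explicit drop here.

import polars as pl
from pathlib import Path

data_path = Path.home() / 'data' / 's01_synthetic_data'

# Steps 1-2: lazily scan the 3 needed columns of each table, plus the key table
lazy_frames = [
    pl.scan_csv(f).select(['A', 'I', 'P'])
    for f in sorted(data_path.glob('table_*.csv'))]
key_lf = pl.scan_csv(data_path / 'key_table.csv')   # filename assumed

# Steps 3-8: one query plan; nothing executes until collect()
result = (
    pl.concat(lazy_frames)
    .join(key_lf, left_on='A', right_on='key', how='left')
    .group_by('A')
    .agg(pl.exclude('A').mean())      # per-group means
    .select(pl.exclude('A').mean())   # means of the group means
    .collect())
print(result)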
Data Processing Variations
As a reminder, we want to compare:
- Pandas vs. Polars
- Python Polars vs. Rust Polars
- CSV vs. Parquet
- Polars Eager vs. Polars Lazy
To do so, we need to implement the data processing steps for each variation that we want to compare: Python Pandas loading CSV files, Python Polars loading Parquet files, and so on.
Here is a list of all the variations, each of which is implemented in its own subdirectory of the repo's src directory:
path = Path.cwd().parent / 'src'
dir_list = list(path.glob('s02_*'))
var_list = [e.name[4:] for e in dir_list]
var_list.sort()
for e in var_list:
    print(e)
python_pandas_csv
python_pandas_parquet
python_polars_eager_csv
python_polars_eager_parquet
python_polars_lazy_csv
python_polars_lazy_parquet
rust_polars_eager_csv
rust_polars_eager_parquet
rust_polars_lazy_csv
rust_polars_lazy_parquet
Data Processing Timing
To measure how long each data processing variation required, I ran each variation 50 times (line 54) and used the lowest recorded time. Specifically, each measured time was the sum of the reported user time and system time. During the timing runs, my machine wasn’t running any web browsers, multimedia players, or other processor-intensive applications, and I watched the system monitor to verify that no unexpected processes were using substantial system resources.
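For concreteness, here is one way the minimum user+system time of a child process could be captured from Python. This is only a sketch of the idea, not the repo's actual timing harness; the command at the end is hypothetical, and resource.RUSAGE_CHILDREN is Unix-specific.

import resource
import subprocess

def min_user_plus_system(cmd: list[str], n_runs: int = 50) -> float:
    """Run cmd n_runs times; return the lowest user+system CPU time in seconds."""
    best = float('inf')
    for _ in range(n_runs):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        subprocess.run(cmd, check=True, capture_output=True)
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        cpu_time = ((after.ru_utime - before.ru_utime)
                    + (after.ru_stime - before.ru_stime))
        best = min(best, cpu_time)
    return best

# example call (hypothetical entry point):
# min_user_plus_system(['python', 'src/s02_python_polars_lazy_parquet/main.py'])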
Timing Results
Pandas vs. Polars
First we’ll compare processing the data in Pandas dataframes versus in Polars dataframes.
The barplot above shows the processing times for Pandas (blue) and Polars (orange). The gray bars are for Rust, which doesn’t use Pandas, so they aren’t relevant for our comparison.
When we compare “like” to “like”, i.e., when we compare the Python-Pandas-CSV blue bar to the two Python-Polars-CSV orange bars (for Eager and Lazy), it is clear that the Polars processing times are lower than the Pandas times. Likewise, the corresponding comparison for the Parquet bars shows that Polars is faster than Pandas. This confirms what we anticipated from the published benchmarks.
Python Polars vs. Rust Polars
Our next comparison is between Python-Polars and Rust-Polars.
Python-Polars (blue) is faster than Rust-Polars (orange). Since Rust is a compiled language and has a reputation for speed, this is quite surprising.
First, did I do something wrong? That’s entirely possible, since Rust and Polars are fairly new to me. I did make sure that Rust compilation times were excluded from the measurements: after compiling, I ran the Rust programs a few times manually to verify that they were optimized and speedy, and during the measurement runs I called Cargo with the “release” and “frozen” flags (line 51).
Are they fair comparisons? My purpose was not to write the Python and Rust versions so that they did the exact same lower-level computations. After all, Python and Rust are different languages: I expect that they will do things somewhat differently “under the hood”, even if I write very similar-looking code. My purpose was to write hopefully-straightforward Pythonic and Rustacean code that implemented the same algorithmic ideas without awkward gymnastics to make them exactly the “same”. I think my implementations did that reasonably well. Plus, most of the heavy computational lifting is done by the same library Polars, so that may limit the impact of my particular Pythonic or Rustacean choices. So, I think the comparisons are fair, though someone else might produce fair comparisons that are rather different from mine.
Why is Rust-Polars slower than Python-Polars? Let’s look at the timing measurements in more detail.
path = Path.cwd().parent / 'results'
table_filepath = path / 'min_times_df.csv'
df = pd.read_csv(table_filepath)
df['ratio'] = (df['user_system'] / df['elapsed']).round(2)
df
 | program | user | system | elapsed | user_system | ratio |
---|---|---|---|---|---|---
0 | s02_python_pandas_csv | 7.44 | 2.11 | 8.51 | 9.84 | 1.16 |
1 | s02_python_pandas_parquet | 2.33 | 1.55 | 2.50 | 4.03 | 1.61 |
2 | s02_python_polars_eager_csv | 5.43 | 0.77 | 0.70 | 6.31 | 9.01 |
3 | s02_python_polars_eager_parquet | 1.99 | 0.55 | 0.50 | 2.78 | 5.56 |
4 | s02_python_polars_lazy_csv | 5.39 | 0.57 | 0.73 | 6.37 | 8.73 |
5 | s02_python_polars_lazy_parquet | 2.02 | 0.59 | 0.51 | 2.80 | 5.49 |
6 | s02_rust_polars_eager_csv | 10.18 | 0.85 | 1.46 | 11.28 | 7.73 |
7 | s02_rust_polars_eager_parquet | 6.12 | 0.66 | 1.21 | 7.00 | 5.79 |
8 | s02_rust_polars_lazy_csv | 8.28 | 0.87 | 0.95 | 9.42 | 9.92 |
9 | s02_rust_polars_lazy_parquet | 4.14 | 0.78 | 0.68 | 5.13 | 7.54 |
Notice that the sum of the user and system times exceeds the elapsed time for both Python-Polars and Rust-Polars. This suggests that Polars is parallelizing its workload (much of which is “embarrassingly parallel”). The user+system-to-elapsed ratios are sometimes higher for Rust-Polars than for Python-Polars, specifically for the Lazy variations. Does Rust-Polars incur more overhead for its parallelization than Python-Polars, even though the workload isn’t heavy enough to justify that extra cost? That’s an interesting hypothesis. But the differences in the ratios aren’t that large, and small fluctuations in the small elapsed-time denominators could alter the ratios substantially. Additionally, the ratios for the Eager cases are either higher for Python or virtually identical, so we would need a better explanation, one that accounts for both the Eager and Lazy comparisons between the languages.
A deeper investigation might involve line-profiling the Python and Rust variations of each comparison to locate the most time-consuming operations. Absent that closer look, I would recommend not just assuming (as I did) that one language would be faster than another for a particular use case: if optimization for speed is that important, it’s worth testing the alternatives to know for sure.
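On the Python side, a first pass at that kind of investigation could use the standard library's cProfile before reaching for a true line profiler. In the sketch below, run_pipeline is just a placeholder for whichever variation's entry point is being examined.

import cProfile
import pstats

# profile one full run and keep the ten most expensive call paths
cProfile.run('run_pipeline()', 'pipeline.prof')   # run_pipeline is a placeholder
stats = pstats.Stats('pipeline.prof')
stats.sort_stats('cumulative').print_stats(10)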
CSV vs. Parquet
Next we compare reading data from CSV files to reading from Parquet files.
This comparison turned out as expected: loading data from Parquet files (orange) is faster than reading CSV files (blue). And that was true for every comparison pair: Python-Pandas, Python-Polars-Eager, Python-Polars-Lazy, Rust-Polars-Eager, and Rust-Polars-Lazy.
One feature of Parquet files that surely helps here is that they store data by column, so values that are adjacent within a column are also adjacent in the file. By contrast, CSV files store data row by row. When only a subset of columns is needed, as in our pipeline, a Parquet reader can load just those columns and skip the rest of the file, while a CSV reader still has to scan and parse every row in full.
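Both Pandas and Polars expose this column selection directly. As a small example (using the Parquet counterpart of the table shown earlier):

import pandas as pd
from pathlib import Path

parquet_path = Path.home() / 'data' / 's01_synthetic_data' / 'table_0.parquet'

# only the three requested column chunks are read from the Parquet file
df_subset = pd.read_parquet(parquet_path, columns=['A', 'I', 'P'])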
Polars Eager vs. Polars Lazy
Finally we compare eager versus lazy execution in Polars.
There’s no discernible difference between eager (blue) and lazy (orange) Polars execution with the Python code. This would make me question whether I completely botched coding the Lazy implementation, except that for Rust, Lazy is indeed faster than Eager for both CSV and Parquet. That’s reassuring; I must’ve done something right.
Why is Lazy faster than Eager for Rust but not for Python? Perhaps the Rust compiler can optimize across the boundary between Polars’s lazy execution engine and the surrounding Rust code in ways that the Python interpreter cannot. A similar improvement in Python might require a lot of manual investigation and optimization effort.
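For readers less familiar with the eager/lazy distinction, here is a minimal illustration of the two execution modes in Python Polars, using one of the synthetic tables. The lazy version only builds a query plan until collect() is called, which lets the optimizer push the column selection down into the file scan.

import polars as pl
from pathlib import Path

csv_path = Path.home() / 'data' / 's01_synthetic_data' / 'table_0.csv'

# eager: the whole file is parsed immediately, then the columns are selected
df_eager = pl.read_csv(csv_path).select(['A', 'I', 'P'])

# lazy: a query plan is built and optimized; the CSV is read only at collect()
df_lazy = pl.scan_csv(csv_path).select(['A', 'I', 'P']).collect()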
Conclusion
For the most part, the results were as I expected: Polars is faster than Pandas, Parquet is faster than CSV, and Lazy is faster than Eager.
The major exception was that Python-Polars was faster than Rust-Polars, when I had expected Rust to be faster than Python. On the one hand, a computer scientist might scoff at the difference because it’s less than an order of magnitude; in fact, Rust is more than twice as slow as Python in only one case, and even there the factor is only about 2.5:
# pair each Python-Polars variation (df rows 2-5) with its Rust counterpart (rows 6-9)
rust_to_python_ratios = [
    (df['program'].iloc[i],
     df['program'].iloc[i + 4],
     df['user_system'].iloc[i + 4] / df['user_system'].iloc[i])
    for i in range(2, 6)]
ratio_df = pd.DataFrame(rust_to_python_ratios)
ratio_df.iloc[:, -1] = ratio_df.iloc[:, -1].round(2)
ratio_df
 | 0 | 1 | 2 |
---|---|---|---
0 | s02_python_polars_eager_csv | s02_rust_polars_eager_csv | 1.79 |
1 | s02_python_polars_eager_parquet | s02_rust_polars_eager_parquet | 2.52 |
2 | s02_python_polars_lazy_csv | s02_rust_polars_lazy_csv | 1.48 |
3 | s02_python_polars_lazy_parquet | s02_rust_polars_lazy_parquet | 1.83 |
On the other hand, a factor of 2 or 3 can provide important time savings in real-world applications, so optimizing for such a difference might be worthwhile.
Further investigation should focus on whether some minor adjustments to the Rust code would greatly improve its performance or whether Python has a general advantage for this type of data processing with Polars.