# Useful patterns for Polars, the Python dataframe library
`polars` is a relatively new library for processing columnar data, or dataframes. It's similar to pandas and R's dplyr, but it's based on Apache Arrow, which enables a much more convenient interface. It uses expressions to describe calculations; the expressions are parallelized and executed within a Rust runtime, enabling an impressive speed gain. This paradigm also allows it to do lazy computations on datasets too large to fit into memory.

Polars's API is very flexible and powerful. This post notes down some common useful patterns when using Polars. Also check the official user guide and API reference.
## Batching
Here, batching refers to aggregating entries (rows) of the frame into lists of the same length (except possibly the last batch). `polars` has good support for Arrow's `List` datatype, which can easily be converted into native Python lists, making it suitable for piping the batches into other Python libraries.
```python
import polars as pl


def batch(expr: pl.Expr, batch_size: int) -> pl.Expr:
    """
    Batch the expression into `pl.List`s of length `batch_size` each.

    The last batch may have fewer than `batch_size` elements.

    Args:
        expr (pl.Expr): The expression to batch.
        batch_size (int): The batch size.

    Returns:
        pl.Expr: The batched expression.
    """
    return (
        # First, declare we are aggregating groups into `pl.List`s
        expr.implode()
        # Then, start an "inline" grouping
        .over(
            # Group over a batch-index column
            pl.int_range(pl.len()).floordiv(batch_size),
            # Do not broadcast the list back to each row
            mapping_strategy="explode",
        )
    )
```
For example, given a frame with a column `a` of `i64`s created from `range(6)`, `batch(pl.col("a"), 3)` will result in a `List<Int64>` column with two rows, `[0, 1, 2]` and `[3, 4, 5]`.
## Lazy Sampling
`LazyFrame` is missing the `DataFrame.sample` API, and it's nowhere near landing soon, as this operation can be computationally heavy. However, if the sample size is small, it's possible to spend linear time scanning each row and filter based on whether the row index falls within an externally drawn sample.
```python
import numpy as np
import polars as pl


def sample(
    df: pl.LazyFrame,
    *,
    size: int | None = None,
    fraction: float | None = None,
    seed: int | None = None,
    row_index_name: str = "row_index",
) -> pl.LazyFrame:
    if not ((size is None) ^ (fraction is None)):
        raise ValueError("One and only one of `size` or `fraction` must be specified.")
    # First, scan the whole LazyFrame to see how many rows we have.
    # If you are sure that the LazyFrame is homogeneous,
    # you can probably just sample from the head.
    height: int = df.select(pl.len()).collect().item()
    if size is None:
        size = int(height * fraction)  # type: ignore
    # Use numpy to draw a sample of the row indices
    samples = set(np.random.default_rng(seed).choice(height, size=size, replace=False))
    return (
        df.with_row_index(row_index_name)
        .filter(pl.col(row_index_name).is_in(samples))
        # Short-circuit once we already have the right size;
        # this makes the best-case time O(m^2)
        .head(size)
        .drop(row_index_name)
    )
```
## Advanced UDFs
It's hard to avoid user-defined functions (UDFs) during data processing. Python UDFs are, sadly, very slow and unparallelizable. However, there are cases where UDFs can be accelerated.
Polars uses the Apache Arrow memory layout, which is the same as the array or tensor layout used in NumPy, PyTorch, JAX, or Numba JIT functions. Polars natively supports `map_batches`, which accepts any function `shape[batch, ...] -> shape[batch, ...]`. This covers all the vectorized functions from these libraries, such as NumPy ufuncs. If your use case is to transform one single column containing homogeneous values (i.e. the column itself constitutes a tensor), then this API is a good fit.
Sadly, real-world data can (and often does) turn out heterogeneous. Texts, images, and even arbitrary binaries can differ in size across rows. If the computation is implemented in native code and releases the Python Global Interpreter Lock (GIL), we can resort to multithreading-based parallelism to compute in batches. For example, OpenAI's tiktoken implements the LLM tokenization logic in Rust, and its batching is simply powered by the Python standard library's `ThreadPoolExecutor` under `concurrent.futures`, similar to the following generalization:
```python
from concurrent.futures import ThreadPoolExecutor

import polars as pl

THREADPOOL_SIZE = 8  # tune to your workload and core count


def multithreaded_scalar_calculation(series: pl.Series) -> pl.Series:
    with ThreadPoolExecutor(THREADPOOL_SIZE) as executor:
        return pl.Series(executor.map(
            some_function_that_releases_gil,
            series,
        ))
```
For Polars, you may reuse the Polars threads if your GIL-releasing function operates on Python objects (scalars, lists, or dictionaries): just feed it to `map_elements` with `strategy="threading"`. Notice that this API casts each element to a corresponding Python object, which can be heavy. Besides, it is still unstable at the time of writing (2025-08-09).
So far, we have only covered unary UDFs. If you want to pass multiple columns as multiple arguments of a polyadic UDF, you have to pass them as a `pl.Struct`. If you are using `map_elements`, the UDF will receive a Python dictionary. If you are using `map_batches`, the UDF will receive a Polars Series wrapped over an Arrow `StructArray`. This structure is efficient, as it uses pointers underneath. We can cast the series to the underlying Arrow array and feed the member arrays to vectorized functions or thread pools. Use `arr.field(<col_name>)` to get an individual array. For example:
```python
def polyadic_udf(series: pl.Series):
    arr = series.to_arrow()
    return your_vectorized_function(
        np.asarray(arr.field("your-col-0"), copy=False),
        np.asarray(arr.field("your-col-1"), copy=False),
    )
```
Another common use case is polyadic UDFs on array/matrix/tensor columns with heterogeneous shapes across rows. In this case, each cell is backed by Arrow's `ListScalar` underneath, so use `.values` to access its underlying memory.
```python
def polyadic_udf_on_tensors(series: pl.Series):
    arr = series.to_arrow()
    with ThreadPoolExecutor(THREADPOOL_SIZE) as e:
        return pl.Series(e.map(
            your_polyadic_nogil_function,
            *[
                # One generator per struct field, yielding the per-row tensors
                (np.asarray(val.values, copy=False) for val in arr.field(field.name))
                # A pyarrow `StructType` is iterable over its fields
                for field in arr.type
            ],
        ))
```