Useful patterns for the Python dataframe library Polars.
polars is a relatively new library for processing columnar data, or dataframes. It's similar to pandas and R's dplyr, but it's based on Apache Arrow and provides a much more convenient interface. It uses expressions to describe calculations, and the expressions are parallelized and executed within a Rust runtime, enabling an impressive speed gain. This paradigm also allows it to do lazy computations on datasets too large to fit into memory.
Polars' API is very flexible and powerful. This post notes down some common, useful patterns for working with Polars.
Also check the official user guide and reference.
Batching
Here, batching refers to aggregating entries (rows) of the frame into lists of the same length (except the last batch, which may be shorter). polars has good support for Arrow's List datatype. A List column can be converted into native Python lists easily, which makes it suitable for piping the batches into other Python libraries.
import polars as pl


def batch(expr: pl.Expr, batch_size: int) -> pl.Expr:
    """
    Batch the expression into `pl.List` of `batch_size` length each.
    The last batch may have fewer than `batch_size` elements.

    Args:
        expr (pl.Expr): The expression to batch.
        batch_size (int): The batch size.

    Returns:
        pl.Expr: The batched expression.
    """
    return (
        # First, declare we are aggregating groups into `pl.List`s
        expr.implode()
        #
        # Then, start an "inline" grouping
        .over(
            # Group over a batch index: row number floor-divided by batch_size
            pl.int_range(pl.len()).floordiv(batch_size),
            # Do not broadcast the list back to each row
            mapping_strategy="explode",
        )
    )
For example, given a frame whose column a holds i64 values created from range(6), batch(pl.col("a"), 3) will result in a List<Int64> column with two rows, [0, 1, 2] and [3, 4, 5].
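The grouping key in the expression, the row index floor-divided by the batch size, is what decides which rows end up in the same list. As a rough illustration of that idea only, here is a plain-Python sketch that does not use Polars at all; the helper names batch_indices and batch_values are hypothetical and exist just for this example.

```python
from itertools import groupby


def batch_indices(n_rows: int, batch_size: int) -> list[int]:
    # Hypothetical helper: the batch index of each row, i.e. row number // batch_size.
    return [i // batch_size for i in range(n_rows)]


def batch_values(values: list, batch_size: int) -> list[list]:
    # Hypothetical helper: pair each value with its batch index,
    # then collect consecutive values that share the same index.
    keyed = zip(batch_indices(len(values), batch_size), values)
    return [
        [value for _, value in group]
        for _, group in groupby(keyed, key=lambda pair: pair[0])
    ]


print(batch_indices(7, 3))           # [0, 0, 0, 1, 1, 1, 2]
print(batch_values(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```

With seven rows and a batch size of 3, the last batch holds only one element, which mirrors the "except the last batch" caveat above.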