Polars Cheatsheet

Useful patterns for Polars, a Python dataframe library.

Polars is a relatively new library for processing columnar data, or dataframes. It is similar to pandas and R's dplyr, but it is based on Apache Arrow and provides a much more convenient interface. Calculations are described as expressions, which are parallelized and executed within a Rust runtime, enabling an impressive speed gain. This paradigm also allows lazy computation on datasets too large to fit into memory.

Polars' API is very flexible and powerful. This post notes down some common, useful patterns for working with it.

Note

Also check the official user guide and reference.

Batching

Here, batching refers to aggregating the entries (rows) of a frame into lists of the same length (the last batch may be shorter). Polars has good support for Arrow's List datatype. A List column converts easily into native Python lists, which makes it suitable for piping batches into other Python libraries.


import polars as pl


def batch(expr: pl.Expr, batch_size: int) -> pl.Expr:
    """
    Batch the expression into `pl.List`s of `batch_size` elements each.
    The last batch may have fewer than `batch_size` elements.

    Args:
        expr (pl.Expr): The expression to batch.
        batch_size (int): The batch size.

    Returns:
        pl.Expr: The batched expression.
    """
    return (
        # First, declare that we are aggregating each group into a `pl.List`
        expr.implode()
        # Then, start an "inline" grouping
        .over(
            # Group over a batch index: row number // batch_size
            pl.int_range(pl.len()).floordiv(batch_size),
            # Do not broadcast the list back to each row
            mapping_strategy="explode",
        )
    )

For example, given a frame whose column a holds the i64 values from range(6), batch(pl.col("a"), 3) results in a List<Int64> column with two rows, [0, 1, 2] and [3, 4, 5].