Introduction to Polars: A Fast and Efficient Data Processing Library

By Yangming Li

Background

Polars is a high-performance data processing library, implemented in Rust with Python bindings, designed to handle large datasets with efficiency and ease. Whether you're working with small tables or large-scale data, Polars provides fast, expressive methods for transforming, analyzing, and processing data.

Data Types and Structures

Core Data Types

  • Numeric Types:
    • Signed integers
    • Unsigned integers
    • Floating-point numbers
    • Decimals
  • Nested Types:
    • Lists
    • Structs
    • Arrays
  • Temporal Types:
    • Dates
    • Times
    • Datetimes
    • Time deltas

Core Data Structures

Series

A one-dimensional, homogeneous collection of data elements: every value in a Series shares the same data type.


import polars as pl

s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)

DataFrame

A two-dimensional, heterogeneous data structure: an ordered collection of named Series, where each Series forms a column and may have its own data type.


df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [1.65, 1.80, 1.75]
})

Working with Data

Missing Values and Special Types


import polars as pl
import numpy as np

# Creating series with null values
s_null = pl.Series("nulls", [1, None, 3, None, 5])
print("Series with null values:")
print(s_null)

# Creating series with NaN values
s_nan = pl.Series("nans", [1.0, np.nan, 3.0, np.nan, 5.0])
print("\nSeries with NaN values:")
print(s_nan)

# Checking null vs NaN
print("\nNull count:", s_null.null_count())
print("NaN count:", s_nan.is_nan().sum())

Advanced Series Operations


# Different ways to create series
s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
s3 = pl.Series("floats", [1.1, 2.2, 3.3, 4.4, 5.5])

print("Series types:")
print(f"s1 dtype: {s1.dtype}")
print(f"s2 dtype: {s2.dtype}")
print(f"s3 dtype: {s3.dtype}")

# Series operations
print("\nBasic operations:")
print("Sum:", s1.sum())
print("Mean:", s1.mean())
print("Standard deviation:", s1.std())

Complex DataFrame Operations


from datetime import date

# Create a sample DataFrame
df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # kg
        "height": [1.56, 1.77, 1.65, 1.75],  # m
    }
)

print("Original DataFrame:")
print(df)

# Inspection methods
print("\nFirst 2 rows:")
print(df.head(2))

print("\nDataFrame summary:")
print(df.describe())

print("\nSchema information:")
print(df.schema)

Advanced Transformations


# Calculate BMI and age-related statistics
result = df.with_columns([
    pl.col("weight")
    .div(pl.col("height").pow(2))
    .alias("bmi"),
    
    pl.col("birthdate")
    .dt.year()
    .alias("birth_year"),
    
    (2024 - pl.col("birthdate").dt.year())
    .alias("age")  # approximate age, using 2024 as the reference year
])

print("Enhanced DataFrame:")
print(result)

# Filtering and grouping
filtered = result.filter(
    (pl.col("bmi") > 20) & 
    (pl.col("age") < 40)
)

print("\nFiltered results:")
print(filtered)

# Group by decade of birth
decades = result.group_by(
    (pl.col("birth_year") // 10 * 10)
    .alias("decade")
).agg([
    pl.col("bmi").mean().alias("avg_bmi"),
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")  # row count per group
])

print("\nDecade-wise statistics:")
print(decades)

Data Manipulation Techniques


# Adding calculated columns
enhanced_df = df.with_columns([
    # Calculate BMI
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
    
    # Extract year from birthdate
    pl.col("birthdate").dt.year().alias("birth_year"),
    
    # Create a categorical column
    pl.when(pl.col("weight") > 70)
    .then(pl.lit("heavy"))      # wrap literals in pl.lit; bare strings
    .otherwise(pl.lit("light")) # are interpreted as column names
    .alias("weight_category")
])

print("Enhanced DataFrame with calculations:")
print(enhanced_df)

# Complex filtering
filtered_df = enhanced_df.filter(
    (pl.col("bmi") > 20) &
    (pl.col("birth_year") > 1980)
)

print("\nFiltered results:")
print(filtered_df)

Conclusion

Polars offers a powerful alternative to traditional data processing libraries, combining high performance with an intuitive API. Its efficient memory usage and fast execution make it particularly suitable for large-scale data processing tasks.
