Background
Polars is a high-performance data processing library designed to handle large datasets with efficiency and ease. Whether you're working with small or large-scale data, Polars provides fast and expressive methods for transforming, analyzing, and processing data.
Data Types and Structures
Core Data Types
- Numeric Types:
  - Signed integers
  - Unsigned integers
  - Floating-point numbers
  - Decimals
- Nested Types:
  - Lists
  - Structs
  - Arrays
- Temporal Types:
  - Dates
  - Times
  - Datetimes
  - Time deltas
Core Data Structures
Series
One-dimensional, homogeneous collection of data elements.
import polars as pl
s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)
DataFrame
Two-dimensional, heterogeneous data structure containing named Series.
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [1.65, 1.80, 1.75]
})
Working with Data
Missing Values and Special Types
import polars as pl
import numpy as np
# Creating series with null values
s_null = pl.Series("nulls", [1, None, 3, None, 5])
print("Series with null values:")
print(s_null)
# Creating series with NaN values
s_nan = pl.Series("nans", [1.0, np.nan, 3.0, np.nan, 5.0])
print("\nSeries with NaN values:")
print(s_nan)
# Checking null vs NaN
print("\nNull count:", s_null.null_count())
print("NaN count:", s_nan.is_nan().sum())
Advanced Series Operations
# Different ways to create series
s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
s3 = pl.Series("floats", [1.1, 2.2, 3.3, 4.4, 5.5])
print("Series types:")
print(f"s1 dtype: {s1.dtype}")
print(f"s2 dtype: {s2.dtype}")
print(f"s3 dtype: {s3.dtype}")
# Series operations
print("\nBasic operations:")
print("Sum:", s1.sum())
print("Mean:", s1.mean())
print("Standard deviation:", s1.std())
Complex DataFrame Operations
from datetime import date
# Create a sample DataFrame
df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # kg
        "height": [1.56, 1.77, 1.65, 1.75],  # m
    }
)
print("Original DataFrame:")
print(df)
# Inspection methods
print("\nFirst 2 rows:")
print(df.head(2))
print("\nDataFrame summary:")
print(df.describe())
print("\nSchema information:")
print(df.schema)
Advanced Transformations
# Calculate BMI and age-related statistics
result = df.with_columns([
    pl.col("weight")
        .truediv(pl.col("height").pow(2))
        .alias("bmi"),
    pl.col("birthdate")
        .dt.year()
        .alias("birth_year"),
    (2024 - pl.col("birthdate").dt.year())
        .alias("age")  # approximate age, assuming the current year is 2024
])
print("Enhanced DataFrame:")
print(result)
# Filtering and grouping
filtered = result.filter(
    (pl.col("bmi") > 20) &
    (pl.col("age") < 40)
)
print("\nFiltered results:")
print(filtered)
# Group by decade of birth
decades = result.group_by(
    (pl.col("birth_year") // 10 * 10)
    .alias("decade")
).agg([
    pl.col("bmi").mean().alias("avg_bmi"),
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")  # number of rows per group
])
print("\nDecade-wise statistics:")
print(decades)
Data Manipulation Techniques
# Adding calculated columns
enhanced_df = df.with_columns([
    # Calculate BMI
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
    # Extract year from birthdate
    pl.col("birthdate").dt.year().alias("birth_year"),
    # Create a categorical column; literal strings must be wrapped in pl.lit,
    # otherwise "heavy" would be interpreted as a column name
    pl.when(pl.col("weight") > 70)
    .then(pl.lit("heavy"))
    .otherwise(pl.lit("light"))
    .alias("weight_category")
])
print("Enhanced DataFrame with calculations:")
print(enhanced_df)
# Complex filtering
filtered_df = enhanced_df.filter(
    (pl.col("bmi") > 20) &
    (pl.col("birth_year") > 1980)
)
print("\nFiltered results:")
print(filtered_df)
Conclusion
Polars offers a powerful alternative to traditional data processing libraries, combining high performance with an intuitive API. Its efficient memory usage and fast execution make it particularly suitable for large-scale data processing tasks.