This section demonstrates essential number operations available in PySpark for manipulating numerical data within DataFrames. These functions are crucial for data cleaning, transformation, and analysis in big data pipelines.
PySpark provides a rich set of mathematical functions in the `pyspark.sql.functions` module (conventionally imported as `F`) that can be applied to DataFrame columns. Below are examples of common number operations.
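The snippets below assume a running SparkSession and the functions module imported as `F`. A minimal setup might look like this; the `price` column and its values are hypothetical, chosen only to illustrate the examples:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('number-ops-demo').getOrCreate()

# Hypothetical sample data for the examples that follow
df = spark.createDataFrame([(19.99,), (-3.5,), (7.25,)], ['price'])
```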
The `F.round()` function rounds a column to a specified number of decimal places, using HALF_UP rounding (ties round away from zero). `F.floor()` rounds down to the nearest integer and `F.ceil()` rounds up; both return long integers, while `F.round()` preserves a floating-point result.
```python
# Round to 0 decimal places
df = df.withColumn('price_rounded', F.round('price', 0))

# Floor to nearest integer
df = df.withColumn('price_floored', F.floor('price'))

# Ceiling to nearest integer
df = df.withColumn('price_ceiled', F.ceil('price'))
```
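Applied to the hypothetical `price` values from the setup above, the three functions behave as follows (output shown approximately as comments):

```python
df.select(
    'price',
    F.round('price', 0).alias('rounded'),
    F.floor('price').alias('floored'),
    F.ceil('price').alias('ceiled'),
).show()
# +-----+-------+-------+------+
# |price|rounded|floored|ceiled|
# +-----+-------+-------+------+
# |19.99|   20.0|     19|    20|
# | -3.5|   -4.0|     -4|    -3|
# | 7.25|    7.0|      7|     8|
# +-----+-------+-------+------+
```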
`F.abs()` returns the absolute value of a column. `F.pow(x, y)` raises x to the power of y and always returns a double; both arguments may be column names, Column expressions, or numeric literals.
```python
# Absolute value
df = df.withColumn('absolute_value', F.abs('numeric_column'))

# X raised to power Y
df = df.withColumn('exponential_growth', F.pow('base_value', 'exponent_value'))
```
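A quick sketch of both functions on made-up values, reusing the column names from the snippet above:

```python
growth_df = spark.createDataFrame(
    [(-42.0, 2.0, 10.0), (3.5, 10.0, 3.0)],
    ['numeric_column', 'base_value', 'exponent_value'],
)

growth_df.select(
    F.abs('numeric_column').alias('absolute_value'),
    F.pow('base_value', 'exponent_value').alias('exponential_growth'),
).show()
# |absolute_value|exponential_growth|
# |          42.0|            1024.0|   # abs(-42.0); 2.0 ** 10.0
# |           3.5|            1000.0|   # abs(3.5);  10.0 ** 3.0
```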
`F.least(*cols)` returns the smallest value from a list of columns, compared row by row, and `F.greatest(*cols)` returns the largest. Both require at least two columns; note these are row-wise comparisons, unlike the aggregates `F.min()` and `F.max()`, which operate down a single column.
```python
# Select smallest value out of multiple columns
df = df.withColumn('least_value', F.least('column_a', 'column_b', 'column_c'))

# Select largest value out of multiple columns
df = df.withColumn('greatest_value', F.greatest('column_a', 'column_b', 'column_c'))
```
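One behavior worth calling out: both functions skip null values and return null only when every input is null. A small sketch with hypothetical data:

```python
scores_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (5.0, None, 4.0)],
    ['column_a', 'column_b', 'column_c'],
)

scores_df.select(
    F.least('column_a', 'column_b', 'column_c').alias('least_value'),
    F.greatest('column_a', 'column_b', 'column_c').alias('greatest_value'),
).show()
# |least_value|greatest_value|
# |        1.0|           3.0|
# |        4.0|           5.0|   # the null in column_b is skipped
```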
For more advanced operations and detailed explanations, refer to the following resources:
- PySpark SQL Functions Documentation
- MDN Web Docs: Math object (for a conceptual overview of operations such as round, floor, ceil, abs, and pow)