This section demonstrates essential number operations available in PySpark for manipulating numerical data within DataFrames. These functions are crucial for data cleaning, transformation, and analysis in big data pipelines.
PySpark provides a rich set of mathematical functions in the `pyspark.sql.functions` module (conventionally imported as `F`) that can be applied to DataFrame columns. Below are examples of common number operations.
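The snippets below assume a running SparkSession and the functions module imported as `F`. A minimal setup might look like this; the `price` column and its values are hypothetical, chosen only to illustrate the examples:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('number-ops-demo').getOrCreate()

# Hypothetical sample data for the examples that follow
df = spark.createDataFrame([(19.99,), (-3.5,), (7.25,)], ['price'])
```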
The `F.round()` function rounds a column to a specified number of decimal places, using HALF_UP rounding (ties round away from zero). `F.floor()` rounds down to the nearest integer and `F.ceil()` rounds up; both return long integers, while `F.round()` preserves a floating-point result.
```python
# Round to 0 decimal places
df = df.withColumn('price_rounded', F.round('price', 0))

# Floor to nearest integer
df = df.withColumn('price_floored', F.floor('price'))

# Ceiling to nearest integer
df = df.withColumn('price_ceiled', F.ceil('price'))
```
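Applied to the hypothetical `price` values from the setup above, the three functions behave as follows (output shown approximately as comments):

```python
df.select(
    'price',
    F.round('price', 0).alias('rounded'),
    F.floor('price').alias('floored'),
    F.ceil('price').alias('ceiled'),
).show()
# +-----+-------+-------+------+
# |price|rounded|floored|ceiled|
# +-----+-------+-------+------+
# |19.99|   20.0|     19|    20|
# | -3.5|   -4.0|     -4|    -3|
# | 7.25|    7.0|      7|     8|
# +-----+-------+-------+------+
```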
`F.abs()` returns the absolute value of a column. `F.pow(x, y)` raises x to the power of y and always returns a double; both arguments may be column names, Column expressions, or numeric literals.
```python
# Absolute value
df = df.withColumn('absolute_value', F.abs('numeric_column'))

# X raised to power Y
df = df.withColumn('exponential_growth', F.pow('base_value', 'exponent_value'))
```
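A quick sketch of both functions on made-up values, reusing the column names from the snippet above:

```python
growth_df = spark.createDataFrame(
    [(-42.0, 2.0, 10.0), (3.5, 10.0, 3.0)],
    ['numeric_column', 'base_value', 'exponent_value'],
)

growth_df.select(
    F.abs('numeric_column').alias('absolute_value'),
    F.pow('base_value', 'exponent_value').alias('exponential_growth'),
).show()
# |absolute_value|exponential_growth|
# |          42.0|            1024.0|   # abs(-42.0); 2.0 ** 10.0
# |           3.5|            1000.0|   # abs(3.5);  10.0 ** 3.0
```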
`F.least(*cols)` returns the smallest value from a list of columns, compared row by row, and `F.greatest(*cols)` returns the largest. Both require at least two columns; note these are row-wise comparisons, unlike the aggregates `F.min()` and `F.max()`, which operate down a single column.
```python
# Select smallest value out of multiple columns
df = df.withColumn('least_value', F.least('column_a', 'column_b', 'column_c'))

# Select largest value out of multiple columns
df = df.withColumn('greatest_value', F.greatest('column_a', 'column_b', 'column_c'))
```
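One behavior worth calling out: both functions skip null values and return null only when every input is null. A small sketch with hypothetical data:

```python
scores_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (5.0, None, 4.0)],
    ['column_a', 'column_b', 'column_c'],
)

scores_df.select(
    F.least('column_a', 'column_b', 'column_c').alias('least_value'),
    F.greatest('column_a', 'column_b', 'column_c').alias('greatest_value'),
).show()
# |least_value|greatest_value|
# |        1.0|           3.0|
# |        4.0|           5.0|   # the null in column_b is skipped
```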
For more advanced operations and detailed explanations, refer to the following resources:
- PySpark SQL Functions Documentation
- MDN Web Docs: Math object (for a conceptual overview of operations such as round, floor, ceil, abs, and pow)