This module provides a simple tool for anonymizing a dataset using PySpark. Given a spark.sql.dataframe
with relevant metadata mondrian_privacy_preserver generates an anonymized spark.sql.dataframe
. This provides following privacy preserving techniques for the anonymization.
- K Anonymity
- L Diversity
- T Closeness
This also include Differential Privacy.
Note: Only works with PySpark
Jupyter notebook for each of the following modules are included.
- Mondrian Based Anonymity (Single User Anonymization included)
- Clustering Based Anonymity
- Differential Privacy
- Python
Python versions above Python 3.6 and below Python 3.8 are recommended. The module is developed and tested on: Python 3.7.7 and pip 20.0.2. (It is better to avoid Python 3.8 as it has some compatibility issues with Spark)
- PySpark
Spark 2.4.5 is recommended.
- Java
Java 8 is recommended. Newer versions of java are incompatible with Spark.
The module is developed and tested on:
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
*Requirements for submodules are given in the relevant section.
Use pip install spark_privacy_preserver
to install library
Clone the repository to your PC and run pip install .
to build and install the package.
Usage of each module is described in the relevant section.
You'll need to construct a schema to get the anonymized spark.sql.dataframe
dataframe.
You need to consider the column names and thier data types to construct this. Output of functions of the Mondrian and Clustering Anonymization is described in thier relevant sections.
Following code snippet shows how to construct an example schema.
from spark.sql.type import *
#age, occupation - feature columns
#income - sensitive column
schema = StructType([
StructField("age", DoubleType()),
StructField("occupation", StringType()),
StructField("income", StringType()),
])
- PySpark 2.4.5. You can easily install it with
pip install pyspark
. - PyArrow. You can easily install it with
pip install pyarrow
. - Pandas. You can easily install it with
pip install pandas
.
The spark.sql.dataframe
you get after anonymizing will always contain a extra column count
which indicates the number of similar rows.
Return type of all the non categorical columns will be string
You need to always consider the count column when constructing the schema. Count column is an integer type column.
from spark_privacy_preserver.mondrian_preserver import Preserver #requires pandas
#df - spark.sql.dataframe - original dataframe
#k - int - value of the k
#feature_columns - list - what you want in the output dataframe
#sensitive_column - string - what you need as senstive attribute
#categorical - set -all categorical columns of the original dataframe as a set
#schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
'occupation',
'race',
'sex',
'hours-per-week',
'income')
categorical = set((
'occupation',
'sex',
'race'
))
feature_columns = ['age', 'occupation']
sensitive_column = 'income'
your_anonymized_dataframe = Preserver.k_anonymize(df,
k,
feature_columns,
sensitive_column,
categorical,
schema)
This function provides a simple way to anonymize a dataset which has an user identification attribute without grouping the rows.
This function doesn't return a dataframe with the count variable as above function. Instead it returns the same dataframe, k-anonymized. Return type of all the non categorical columns will be string.
User attribute column must not be given as a feature column and its return type will be same as the input type.
Function takes exact same parameters as the above function. To use this method to anonymize the dataset, instead of calling k_anonymize
, call k_anonymize_w_user
.
Same as the K Anonymity, the spark.sql.dataframe
you get after anonymizing will always contain a extra column count
which indicates the number of similar rows.
Return type of all the non categorical columns will be string
You need to always consider the count column when constructing the schema. Count column is an integer type column.
from spark_privacy_preserver.mondrian_preserver import Preserver #requires pandas
#df - spark.sql.dataframe - original dataframe
#k - int - value of the k
#l - int - value of the l
#feature_columns - list - what you want in the output dataframe
#sensitive_column - string - what you need as senstive attribute
#categorical - set -all categorical columns of the original dataframe as a set
#schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
'occupation',
'race',
'sex',
'hours-per-week',
'income')
categorical = set((
'occupation',
'sex',
'race'
))
feature_columns = ['age', 'occupation']
sensitive_column = 'income'
your_anonymized_dataframe = Preserver.l_diversity(df,
k,
l,
feature_columns,
sensitive_column,
categorical,
schema)
This function provides a simple way to anonymize a dataset which has an user identification attribute without grouping the rows.
This function doesn't return a dataframe with the count variable as above function. Instead it returns the same dataframe, l-diversity anonymized. Return type of all the non categorical columns will be string.
User attribute column must not be given as a feature column and its return type will be same as the input type.
Function takes exact same parameters as the above function. To use this method to anonymize the dataset, instead of calling l_diversity
, call l_diversity_w_user
.
Same as the K Anonymity, the spark.sql.dataframe
you get after anonymizing will always contain a extra column count
which indicates the number of similar rows.
Return type of all the non categorical columns will be string
You need to always consider the count column when constructing the schema. Count column is an integer type column.
from spark_privacy_preserver.mondrian_preserver import Preserver #requires pandas
#df - spark.sql.dataframe - original dataframe
#k - int - value of the k
#l - int - value of the l
#feature_columns - list - what you want in the output dataframe
#sensitive_column - string - what you need as senstive attribute
#categorical - set -all categorical columns of the original dataframe as a set
#schema - spark.sql.types StructType - schema of the output dataframe you are expecting
df = spark.read.csv(your_csv_file).toDF('age',
'occupation',
'race',
'sex',
'hours-per-week',
'income')
categorical = set((
'occupation',
'sex',
'race'
))
feature_columns = ['age', 'occupation']
sensitive_column = 'income'
your_anonymized_dataframe = Preserver.t_closeness(df,
k,
t,
feature_columns,
sensitive_column,
categorical,
schema)
This function provides a simple way to anonymize a dataset which has an user identification attribute without grouping the rows.
This function doesn't return a dataframe with the count variable as above function. Instead it returns the same dataframe, t-closeness anonymized. Return type of all the non categorical columns will be string.
User attribute column must not be given as a feature column and its return type will be same as the input type.
Function takes exact same parameters as the above function. To use this method to anonymize the dataset, instead of calling t_closeness
, call t_closeness_w_user
.
This function provides a simple way to anonymize a given user in a dataset. Even though this doesn't use the mondrian algorithm, function is included in the mondrian_preserver
. User identification attribute and the column name of the user identification atribute is needed as parameters.
This doesn't return a dataframe with count variable. Instead this returns the same dataframe, anonymized for the given user. Return type of user column and all the non categorical columns will be string.
from spark_privacy_preserver.mondrian_preserver import Preserver #requires pandas
#df - spark.sql.dataframe - original dataframe
#k - int - value of the k
#user - name, id, number of the user. Unique user identification attribute.
#usercolumn_name - name of the column containing unique user identification attribute.
#sensitive_column - string - what you need as senstive attribute
#categorical - set -all categorical columns of the original dataframe as a set
#schema - spark.sql.types StructType - schema of the output dataframe you are expecting
#random - a flag by default set to false. In a case where algorithm can't find similar rows for given user, if this is set to true, slgorithm will randomly select rows from dataframe.
df = spark.read.csv(your_csv_file).toDF('name',
'age',
'occupation',
'race',
'sex',
'hours-per-week',
'income')
categorical = set((
'occupation',
'sex',
'race'
))
sensitive_column = 'income'
user = 'Jon'
usercolumn_name = 'name'
random = True
your_anonymized_dataframe = Preserver.anonymize_user(df,
k,
user,
usercolumn_name,
sensitive_column,
categorical,
schema,
random)
Differential privacy is one of the data preservation paradigms similar to K-Anonymity, T-Closeness and L-Diversity. It alters each value of actual data in a dataset according to specific constraints set by the owner and produces a differentially-private dataset. This anonymized dataset is then released for public utilization.
ε-differential privacy is one of the methods in differential privacy. Laplace based ε-differential privacy is applied in this library. The method states that the randomization should be according the epsilon (ε) (should be >0) value set by data owner. After randomization a typical noise is added to the dataset. It is calibrated according to the sensitivity value (λ) set by the data owner.
Other than above parameters, a third parameter delta (δ) is added into the mix to increase accuracy of the algorithm. A scale is computed from the above three parameters and a new value is computed.
scale = λ / (ε - log(1 - δ))
random_number = random_generator(0, 1) - 0.5
sign = get_sign(random_number)
new_value = value - scale Ă— sign Ă— log(1 - 2 Ă— mod(random_number))
In essence above steps mean that a laplace transform is applied to the value and a new value is randomly selected according to the parameters. When the scale becomes larger, the deviation from original value will increase.
Make sure the following Python packages are installed:
- PySpark:
pip install pyspark==2.4.5
- PyArrow:
pip install pyarrow==0.17.1
- IBM Differential Privacy Library:
pip install diffprivlib==0.2.1
- MyPy:
pip install mypy==0.770
- Tabulate:
tabulate==0.8.7
Step by step procedure on how to use the module with explanation is given in the following notebook: differential_privacy_demo.ipynb
- Create a Spark Session. Make sure to enable PyArrow configuration.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local') \
.appName('differential_privacy') \
.config('spark.some.config.option', 'some-value') \
.getOrCreate()
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
- Create a Spark DataFrame (sdf).
Generate an sdf with random values. It is better to manually specify the schema of sdf so as to avoid any TypeErrors.
Here I will generate an sdf with 3 columns: 'Numeric', 'Rounded_Numeric', 'Boolean' and 10,000 rows to show 3 ways of using DPLib.
from random import random, randint, choice
from pyspark.sql.types import *
# generate a row with random numbers of range(0, 100000) and random strings of either 'yes' or 'no'
def generate_rand_tuple():
number_1 = randint(0, 100000) + random()
number_2 = randint(0, 100000) + random()
string = choice(['yes', 'no'])
return number_1, number_2, string
data = [generate_rand_tuple() for _ in range(100000)]
schema = StructType([
StructField('Number', FloatType()),
StructField('Rounded_Number', DoubleType()),
StructField('Boolean', StringType())
])
sdf = spark.createDataFrame(data=data, schema=schema)
sdf.show(n=5)
- Setup and configure DPLib
DPLib can work with numbers and binary strings. To anonymize a number based column, you have to setup the column category as 'numeric'. To anonymize a string based column, you have to setup the column category as 'boolean'.
3.1 Initializing the module
The module takes in 3 optional parameters when initializing: Spark DataFrame, epsilon and delta. Module can also be initialized without any parameters and they can be added later.
from spark_privacy_preserver.differential_privacy import DPLib
epsilon = 0.00001
delta = 0.5
sensitivity = 10
# method 1
dp = DPLib(global_epsilon=epsilon, global_delta=delta, sdf=sdf)
dp.set_global_sensitivity(sensitivity=sensitivity)
# method 2
dp = DPLib()
dp.set_sdf(sdf=sdf)
dp.set_global_epsilon_delta(epsilon=epsilon, delta=delta)
dp.set_global_sensitivity(sensitivity=sensitivity)
Note: The reason behind the word global in above functions
Suppose the user want to anonymize 3 columns of a DataFrame with same epsilon, delta and sensitivity and another column with different parameters. Now all the user has to do is to set up global parameters for 3 columns and local parameters for 4th column.
This will simplify when multiple columns of a DataFrame have to be processed with same parameters.
3.2 Configuring columns
User can configure columns with column specific parameters. Column specific parameters will be given higher priority over global parameters if explicitly specified.
parameters that can be applied to method set_column():
- column_name: name of column as string -> compulsory
- category: category of column. can be either 'numeric' or 'boolean' -> compulsory
- epsilon: column specific value -> optional
- delta: column specific value -> optional
- sensitivity: column specific value -> optional
- lower_bound: set minimum number a column can have. can only be applied to category 'numeric' -> optional
- upper_bound: set maximum number a column can have. can only be applied to category 'numeric' -> optional
- label1: string label for a column. can only be applied to category 'binary' -> optional
- label2: string label for a column. can only be applied to category 'binary' -> optional
- round: value by which to round the result. can only be applied to category 'numeric' -> optional
dp.set_column(column_name='Number',
category='numeric')
# epsilon, delta, sensitivity will be taken from global parameters and applied.
dp.set_column(column_name='Rounded_Number',
category='numeric',
epsilon=epsilon * 10,
sensitivity=sensitivity * 10,
lower_bound=round(sdf.agg({'Rounded_Number': 'min'}).collect()[0][0]) + 10000,
upper_bound=round(sdf.agg({'Rounded_Number': 'max'}).collect()[0][0]) - 10000,
round=2)
# epsilon, sensitivity will be taken from user input instead of global parameters
# delta will be taken from global parameters.
dp.set_column(column_name='Boolean',
category='boolean',
label1='yes',
label2='no',
delta=delta if 0 < delta <= 1 else 0.5)
# sensitivity will be taken from user input instead of global parameters
# epsilon will be taken from global parameters.
# 'boolean' category does not require delta
3.2.1 To view existing configuration for the class, use following method
dp.get_config()
3.2.2 To drop a column or to drop all columns use the drop_column() method. To drop all columns use '*' as input parameter
dp.drop_column('Rounded_Number', 'Number')
dp.drop_column('*')
3.3 Executing
# gets first 20 rows of DataFrame before anonymizing and after anonymizing
sdf.show()
dp.execute()
dp.sdf.show()
As you can see, there is a clear difference between original DataFrame and anonymized DataFrame.
-
Column 'Number' is anonymized but the values are not bound to a certain range. The algorithm produces the result with maximum precision as it can achieve.
-
Column 'Rounded_Number' is both anonymized and bounded to the values set by user. As you can see, the values never rise above upper bound and never become lower than lower bound. Also they are rounded to 2nd decimal place as set.
-
Column 'Boolean' undergoes through a mechanism that randomly decides to flip to the other binary value or not, in order to satisfy differential privacy.
- PySpark 2.4.5. You can easily install it with
pip install pyspark
- PyArrow
pip install pyarrow
- Pandas
pip intall pandas
- kmodes
pip install kmodes
Only recommend if there are more catogorical columns, than numerical column. if there are more numerical column, then modrian algorithm is recommended.
It is recommended to use 5 <= k <= 20 to minimize the data loss, if your data set is small better to use a small k value
he spark.sql.dataframe you get after anonymizing will always contain a extra column count which indicates the number of similar rows. Return type of all the non categorical columns will be string
In Clustering base Anonymizer you can choose how the how to initialize the cluster centroids.
- 'fcbg' = This method return cluster centroids weight on the probability of row's column values appear in dataframe. Default Value.
- 'rsc' = This method will choose centroids weight according to the column that has most number of unique values.
- 'random = Return cluster centroids in randomly.
just enter the center_type= 'fcbg'
to use fcbg, default is fcbg
You can also decide the clustering method.
- default is special method
- kmodes method
if you want to use default dont enter anything to attribute mode=
, else if you want to use the kmodes method mode= 'kmode'
if you have a huge data amount default is recommended.
you can also decide the return mode. If this value equal to return_mode=''equal
; K anonymization will done with equal member clusters. Default value is 'Not_Equal'
Not equal is often run fast, but could be data lossy. equal is vice versa.
Below is a full example:
from pyspark.sql.types import *
from pyspark.sql.functions import PandasUDFType, lit, pandas_udf
from clustering_preserver import Kanonymizer
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from gv import init
from anonymizer import Anonymizer
df = spark.read.format('csv').option("header", "true").option("inferSchema", "true").load("reduced_adult.csv")
schema = StructType([
StructField("age", StringType()),
StructField("workcalss", StringType()),
StructField("education_num", StringType()),
StructField("matrital_status", StringType()),
StructField("occupation", StringType()),
StructField("sex", StringType()),
StructField("race", StringType()),
StructField("native_country", StringType()),
StructField("class", StringType())
])
QI = ['age', 'workcalss','education_num', 'matrital_status', 'occupation', 'race', 'sex', 'native_country']
SA = ['class']
CI = [1,3,4,5,6,7]
k_df = Anonymizer.k_anonymize(
df, schema, QI, SA, CI, k=10, mode='', center_type='random', return_mode='Not_equal', iter=1)
k_df.show()
This method is recommended only for k anonymized dataframe. Input anonymized dataframe will group into similar k clusters and clusters that have not L number of distinct sensitive attributes will be suspressed. Recommended small number of l to minimum the data loss. Default value is l = 2.
## k_df - K anonymized spark dataframe
## schema - output spark dataframe schema
## QI - Quasi Identifiers. Type list
## SA = Sensitive attributes . Type list
QI = ['column1', 'column2', 'column3']
CI = [1, 2]
SA = ['column4']
schema = StructType([
StructField("column1", StringType()),
StructField("column2", StringType()),
StructField("column3", StringType()),
StructField("column4", StringType()),
])
l_df = Anonymizer.l_diverse(k_df,schema, QI,SA, l=2)
l_df.show()
This method is recommended only for k anonymized dataframe. Input anonymized dataframe will group into similar k clusters and clusters that not have sensitive attribute distribution according to t value will be suspressed. t should be in between 0 and 1. Larger value of t to minimum the data loss. Default value is t = 0.2.
## k_df - K anonymized spark dataframe
## schema - output spark dataframe schema
## QI - Quasi Identifiers. Type list
## SA = Sensitive attributes . Type list
QI = ['column1', 'column2', 'column3']
CI = [1, 2]
SA = ['column4']
schema = StructType([
StructField("column1", StringType()),
StructField("column2", StringType()),
StructField("column3", StringType()),
StructField("column4", StringType()),
])
t_df = Anonymizer.t_closer(
k_df,schema, QI, SA, t=0.3, verbose=1)
t_df.show()