Detect semantically similar Python code using a fine-tuned GraphCodeBERT model.
This modified GraphCodeBERT model was fine-tuned for 11 hours on an A40 server using the PoolC (1fold) dataset, which contains over 6M pairs of semantically similar Python code snippets.
It was then used to predict the similarity of Python code snippets in the other folds of the PoolC dataset, as well as in the C4 dataset. In several experiments with balanced sampling, it achieved F1 scores above 0.96 on all datasets.
-
pip
In your virtual environment, run the following command to install the required packages:
    pip install -r requirements.txt
-
conda
To create a new conda environment called PythonCloneDetection with the required packages, run the following command (this may take a while to finish):
    conda env create -f environment.yml
The commands above install the CPU-only version of the pytorch package. Refer to PyTorch's official website for instructions on installing other versions of pytorch on your machine.
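To see which pytorch build ended up in your environment, a small check like the one below may help. It is only a sketch: the helper name describe_torch_build is made up for illustration and is not part of this repository. It relies on torch.version.cuda being None for builds without CUDA support, and on torch.cuda.is_available() to report whether a GPU can actually be used.

```python
import importlib.util


def describe_torch_build() -> str:
    """Return a short description of the installed pytorch build, if any.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Avoid an ImportError if pytorch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "pytorch is not installed in this environment"

    import torch

    cuda_build = torch.version.cuda  # None when built without CUDA support
    if cuda_build is None:
        return f"pytorch {torch.__version__} (built without CUDA)"
    if torch.cuda.is_available():
        return f"pytorch {torch.__version__} (CUDA {cuda_build}, GPU available)"
    return f"pytorch {torch.__version__} (CUDA {cuda_build}, no usable GPU detected)"


print(describe_torch_build())
```

If the output reports a build without CUDA and you want GPU inference, follow the instructions on PyTorch's website for your platform.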
-
Run
    python main.py --input <input_path> --output <output_path>
to run CloneClassifier on the csv file at <input_path> and save its predictions at <output_path>. For example:
    python main.py --input examples/c4.csv --output results/res.csv
The input of main.py is a csv file containing two columns named code1 and code2, where each row contains a pair of Python code snippets to be compared. The output csv file will have three columns named code1, code2, and predictions, where predictions indicates whether the two code snippets in the corresponding row are semantically similar.
-
Use the command
    python main.py --help
to see other optional arguments, including max_token_size, fp16, and per_device_eval_batch_size.
-
You could also import the CloneClassifier class from clone_classifier.py and use it in your own code, for example:

    import pandas as pd
    from clone_classifier import CloneClassifier

    classifier = CloneClassifier(
        max_token_size=512,
        fp16=False,  # set to True for faster inference if available
        per_device_eval_batch_size=8,
    )

    df = pd.read_csv("examples/c4.csv").head(10)
    res_df = classifier.predict(
        df[["code1", "code2"]],
        # save_path="results/res.csv"
    )
    print(res_df["predictions"] == df["similar"])
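The element-wise comparison above can be summarized as precision, recall, and F1, the metric quoted at the top of this README. The following is a minimal sketch, not the evaluation code used for the reported experiments: the function name precision_recall_f1 is hypothetical, and it assumes that both the predictions and the labels are 0/1 or boolean values where a truthy value means "similar". The sample values are made up for illustration.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predictions: Iterable, labels: Iterable) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (truthy = similar).

    Hypothetical helper for illustration; not part of this repository.
    """
    tp = fp = fn = 0
    for pred, label in zip(predictions, labels):
        pred, label = bool(pred), bool(label)
        if pred and label:
            tp += 1  # predicted similar, labeled similar
        elif pred and not label:
            fp += 1  # predicted similar, labeled not similar
        elif label and not pred:
            fn += 1  # predicted not similar, labeled similar

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up values for illustration only, not results from the model.
example_predictions = [1, 1, 0, 1, 0]
example_labels = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(example_predictions, example_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With the data frames from the example above, the call would look like precision_recall_f1(res_df["predictions"], df["similar"]), provided both columns hold 0/1 or boolean values.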
Distributed under the MIT License. See LICENSE for more information.