Neural search engine for discovering semantically similar Python repositories on GitHub.
Searching an indexed repository:
RepoSnipy is a neural search engine built with streamlit and docarray. You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.
It uses the RepoSim pipeline to create embeddings for Python repositories. We have created a vector dataset (stored as a docarray index) of over 9700 GitHub Python repositories that had a license and more than 300 stars as of 20 May 2023.
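To illustrate the idea of finding repositories that are semantically similar to a query, here is a minimal sketch of a nearest-neighbour lookup over embeddings. It is not the app's code: the real embeddings come from the RepoSim pipeline and are stored in a docarray index, while this sketch uses a plain dictionary, made-up repository names and three-dimensional toy vectors. Ranking by cosine similarity is also an assumption here, and `most_similar` is a hypothetical helper.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_embedding, index, top_k=3):
    """Return the top_k (name, score) pairs, most similar first.

    `index` is a plain dict mapping repository name -> embedding; it only
    stands in for the docarray index used by the app.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:top_k]]


# Toy data: names and vectors are invented for illustration only.
toy_index = {
    "owner/web-framework": [0.9, 0.1, 0.0],
    "owner/http-client": [0.8, 0.3, 0.1],
    "owner/ml-toolkit": [0.1, 0.2, 0.9],
}

results = most_similar([1.0, 0.2, 0.0], toy_index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With real data, the query embedding would be produced by the same pipeline that produced the indexed embeddings, so that the two live in the same vector space.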
Download the repository and install the required packages:
git clone https://github.com/RepoAnalysis/RepoSnipy
cd RepoSnipy
pip install -r requirements.txt
Then run the app on your local machine using:
streamlit run app.py
The evaluation script enumerates every pair of repositories in the dataset and calculates the cosine similarity between their embeddings. It also checks whether the two repositories share at least one topic (excluding python and python3). We then compare the similarity scores against the shared-topic labels and use the ROC AUC score to evaluate the embeddings' performance. The resulting dataframe, which contains the cosine similarity and topic similarity of all pairs, can be downloaded from here; it includes the evaluations of both code embeddings and docstring embeddings. The ROC AUC score is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
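The evaluation procedure described above can be sketched in a few lines. This is a simplified illustration, not the project's evaluation script: the record layout (`topics`, `embedding`), the toy repositories and the helper names are all assumptions, and the ROC AUC is computed by a small pure-Python pairwise comparison so that the example has no dependencies. On real data, `sklearn.metrics.roc_auc_score(labels, scores)` computes the same metric far more efficiently.

```python
from itertools import combinations
from math import sqrt

# Topics that nearly every repository in the dataset carries, so they are
# ignored when deciding whether two repositories share a topic.
IGNORED_TOPICS = {"python", "python3"}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def share_topic(topics_a, topics_b):
    """True if the repositories share a topic other than the ignored ones."""
    return bool((set(topics_a) & set(topics_b)) - IGNORED_TOPICS)


def roc_auc(labels, scores):
    """Probability that a positive pair outscores a negative pair.

    Ties count as half a win. This is O(P * N) and only meant for toy input.
    """
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(repos):
    """Score every repository pair and return the ROC AUC of the embeddings."""
    labels, scores = [], []
    for a, b in combinations(repos, 2):
        scores.append(cosine_similarity(a["embedding"], b["embedding"]))
        labels.append(share_topic(a["topics"], b["topics"]))
    return roc_auc(labels, scores)


# Toy records: names, topics and vectors are invented for illustration.
toy_repos = [
    {"name": "a/web1", "topics": ["python", "web"], "embedding": [1.0, 0.1]},
    {"name": "b/web2", "topics": ["python3", "web"], "embedding": [0.9, 0.2]},
    {"name": "c/ml1", "topics": ["python", "ml"], "embedding": [0.1, 1.0]},
    {"name": "d/ml2", "topics": ["ml"], "embedding": [0.2, 0.9]},
]

print(f"ROC AUC: {evaluate(toy_repos):.2f}")
```

A score of 0.5 would mean that cosine similarity carries no information about shared topics, while a score of 1.0 means every topic-sharing pair is ranked above every non-sharing pair.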
Distributed under the MIT License. See LICENSE for more information.
The model and the fine-tuning dataset used: