Gibbs Sampling Dirichlet Multinomial Mixture Model (GSDMM) in Short-Text Clustering

Computational Linguistics Final Project, Winter Semester 2019, University of Saarland
GSDMM Implementation as described in:
Yin, J. and Wang, J., 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering, SIGKDD.
The experiments vary the beta parameter on the Stack Overflow Titles dataset, which was made available by Kaggle.com and used in the paper by:
Xu, J., et al., 2015. Short Text Clustering via Convolutional Neural Networks, NAACL.
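
For readers unfamiliar with the model, the core of the collapsed Gibbs sampler in Yin and Wang (2014) is the conditional probability of assigning one document to each cluster. The following is a minimal sketch of that step, not the code in gsdmm.py; the function and variable names (cluster_log_probs, m_z, n_z, n_zw) are assumptions chosen for illustration, and the counts passed in are assumed to already exclude the document being resampled.

```python
import math
import random
from collections import Counter


def cluster_log_probs(doc, m_z, n_z, n_zw, n_docs, vocab_size, alpha, beta):
    """Unnormalized log-probability of assigning `doc` to each cluster.

    doc        -- list of tokens in the document being resampled
    m_z[z]     -- number of documents in cluster z (excluding doc)
    n_z[z]     -- number of word tokens in cluster z (excluding doc)
    n_zw[z][w] -- occurrences of word w in cluster z (excluding doc)
    """
    k = len(m_z)
    word_counts = Counter(doc)
    log_probs = []
    for z in range(k):
        # Popularity term: clusters holding more documents are more likely.
        lp = math.log(m_z[z] + alpha) - math.log(n_docs - 1 + k * alpha)
        # Similarity term: clusters that already contain the document's words.
        for word, count in word_counts.items():
            for j in range(count):
                lp += math.log(n_zw[z].get(word, 0) + beta + j)
        for i in range(len(doc)):
            lp -= math.log(n_z[z] + vocab_size * beta + i)
        log_probs.append(lp)
    return log_probs


def sample_cluster(log_probs, rng=random):
    """Draw a cluster index in proportion to exp(log_probs)."""
    top = max(log_probs)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_probs]
    return rng.choices(range(len(weights)), weights=weights)[0]
```

In this formula, a smaller beta penalizes clusters that lack the document's words more sharply, which is why beta is the parameter varied in the experiments.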

Project File Structure

GSDMM
- README.md
- gsdmm_noonPokaratsiriGoldstein.pdf: project report
- data: all corpus and label files are here
  - title_StackOverflow.txt
  - label_StackOverflow.txt
- logs: logs of run_gsdmm.py execution
  - run_gsdmm_{run_id}.log
- output: plots of gsdmm performance and representative words in clusters
  - cluster_per_iteration_at_different_beta.png
  - performance_at_different_beta.png
  - gsdmm_clusters_and_representative_words_{run_id}.out
- pickled: pickle files from run_gsdmm.py
  - predicted_{run_id}_freq_words_by_beta.pickle
  - predicted_{run_id}_labels_by_beta.pickle
  - predicted_{run_id}_num_clusters_by_it_per_beta_list.pickle
  - true_most_frequent_words_by_topic.pickle
- source_code: config file for default parameters and all source code files
  (execute run_gsdmm.py from this directory)
  - default_config.cfg: defines the default parameters for running the program
  - eval.py: calculates NMI, Homogeneity, and Completeness, and plots graphs
  - gsdmm.py: implements the GSDMM algorithm
  - preprocess.py: tokenizes and preprocesses the corpus file
  - run_gsdmm.py: the main program that runs the experiment
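
The three scores computed by eval.py compare predicted cluster labels against the true labels. The sketch below spells out their standard definitions using only the standard library; it is an illustration, not the project's code, which lists sklearn as a requirement and presumably relies on its implementations. The function names are assumptions, and NMI here uses arithmetic-mean normalization.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity(true, pred):
    """1.0 when every predicted cluster contains a single true class."""
    h = entropy(true)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(true, pred) / h


def completeness(true, pred):
    """1.0 when every true class falls into a single predicted cluster."""
    return homogeneity(pred, true)


def nmi(true, pred):
    """Mutual information normalized by the mean of the two entropies."""
    h_true, h_pred = entropy(true), entropy(pred)
    if h_true == 0 and h_pred == 0:
        return 1.0  # both labelings are a single group: a trivial perfect match
    mutual_info = h_true - conditional_entropy(true, pred)
    return 2 * mutual_info / (h_true + h_pred)
```

For example, splitting two true classes of two documents each into four singleton clusters stays perfectly homogeneous but loses completeness, which is the trade-off to watch when GSDMM keeps more clusters than there are true topics.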

Requirements

Python 3.7

numpy
scikit-learn (imported as sklearn)
matplotlib
nltk
tqdm

Instructions

cd to the source_code directory to execute the program.
python run_gsdmm.py -h displays all the command-line options.
Command-line options override the options in the default_config.cfg file.
python run_gsdmm.py runs the GSDMM experiments with the default values in the .cfg file.
The last run_id was 3; change to a different run_id to execute the full program.
The program outputs 2 plots (the plot titles are self-explanatory) and an output file showing the number of clusters predicted by GSDMM and the words in each cluster with their frequencies.
Running the program with an existing run_id simply loads the data from the pickled files and re-plots the 2 graphs.
Runtime: for K = 100 (starting with 100 clusters as an upper bound), the program takes approximately 1 hour per beta value in the experiment. For K = 50, each beta value takes approximately 30 minutes.
The default setting experiments with 5 beta values; therefore, the total runtime of the program is approximately 5-6 hours.
Please see the log file for runtime details; it includes timestamps from the last run.
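
Two behaviours described above can be sketched briefly: command-line options overriding the .cfg defaults, and a repeated run_id reusing pickled results instead of recomputing them. This is a hypothetical illustration, not the code in run_gsdmm.py. The section name, the option names (--run_id, --num_clusters, --betas) and the beta values are all assumptions; run python run_gsdmm.py -h for the real options.

```python
import argparse
import configparser
import os
import pickle

# Hypothetical stand-in for default_config.cfg. The keys and the beta
# values are illustrative only and are not the project's actual defaults.
DEFAULT_CFG = """
[gsdmm]
run_id = 3
num_clusters = 100
betas = 0.01 0.05 0.1 0.2 0.5
"""


def load_settings(argv=None, cfg_text=DEFAULT_CFG):
    """Read defaults from the config text, then let CLI flags override them."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    defaults = config["gsdmm"]

    parser = argparse.ArgumentParser(description="GSDMM experiment (sketch)")
    parser.add_argument("--run_id", type=int)
    parser.add_argument("--num_clusters", type=int)
    parser.add_argument("--betas", type=float, nargs="+")
    # Config values become the argparse defaults, so a flag given on the
    # command line takes precedence and an omitted flag falls back to the file.
    parser.set_defaults(
        run_id=int(defaults["run_id"]),
        num_clusters=int(defaults["num_clusters"]),
        betas=[float(b) for b in defaults["betas"].split()],
    )
    return parser.parse_args(argv)


def load_or_compute(path, compute):
    """Return the pickled result at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With this pattern, load_settings(["--num_clusters", "50"]) would change only K, and a second call to load_or_compute with the same path skips the expensive computation, which mirrors why reusing a run_id only re-plots the graphs.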
