ekramasif/SMS-Spam-Prediction-Using-BERT

SMS spam detection using Bidirectional Encoder Representations from Transformers (BERT)

Objective

To develop an efficient, translation-free neural architecture using BERT for SMS spam detection, one that captures the context of each SMS. Note that BERT is a Transformer-based model, not a recurrent one.

Classification is based on the content of the message rather than on the phone number the SMS was sent from.

Plan of Attack

  1. Data Acquisition
  2. Text Preprocessing
  3. Model
  4. Results

1. Data Acquisition

We collected a dataset from Kaggle that contains only English messages. We also added our own dataset, collected from real-world messages in three languages: English, Hindi, and Telugu. We manually labelled the data as SPAM or HAM.

The dataset consists of three columns: index, sms, and label, where label = {SPAM, HAM}.

The full dataset contains around 5214 records. It is unbalanced: about 92% of the messages are HAM and about 7% are SPAM. We therefore balanced it so that each label makes up 50% of the data.

index	label	sms
942	spam	How about getting in touch with folks waiting ...
2278	ham	Hmm...Bad news...Hype park plaza $700 studio t...
15	spam	XXXMobileMovieClub: To use your credit, click ...
2327	spam	URGENT! Your mobile number *************** WON...
5214	spam	Natalja (25/F) is inviting you to be her frien...
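
The README does not say how the 50/50 balance was reached. One common approach is to randomly downsample the majority class. The sketch below assumes that approach and uses only the Python standard library. The function name `balance_by_downsampling`, the toy rows, and the seed are illustrative assumptions, not the project's code.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(rows, label_key="label", seed=42):
    """Return a shuffled list with the same number of rows for every label.

    Each label group is randomly reduced to the size of the smallest group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 92 HAM rows and 8 SPAM rows.
rows = [{"index": i, "sms": f"message {i}", "label": "HAM"} for i in range(92)]
rows += [{"index": 92 + i, "sms": f"offer {i}", "label": "SPAM"} for i in range(8)]

balanced = balance_by_downsampling(rows)
print(Counter(row["label"] for row in balanced))
```

Downsampling discards most HAM messages. Oversampling SPAM or using class weights are alternatives if keeping all the data matters more.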

2. Text Preprocessing

We pass the raw text through a BERT preprocessing step and then through a BERT encoder to obtain the encoded text.
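
The README does not name the preprocessing model or its vocabulary. As a rough illustration of what BERT-style preprocessing typically produces, the toy sketch below splits words into WordPiece sub-tokens, adds `[CLS]` and `[SEP]`, and pads to a fixed length with an attention mask. `TOY_VOCAB`, `wordpiece`, `encode`, and `seq_length=8` are all hypothetical. A real BERT preprocessor uses a vocabulary of tens of thousands of entries and also normalizes text and splits punctuation.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab, seq_length=8):
    """Return (token ids, attention mask), both padded to seq_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[: seq_length - 1] + ["[SEP]"]  # truncate, then close

    ids = [vocab[token] for token in tokens]
    mask = [1] * len(ids)
    padding = seq_length - len(ids)
    return ids + [vocab["[PAD]"]] * padding, mask + [0] * padding


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "win": 4, "free": 5, "##bie": 6, "now": 7}

ids, mask = encode("Win freebie now", TOY_VOCAB)
print(ids)   # [2, 4, 5, 6, 7, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The encoder then consumes the ids and the mask. The mask tells it which positions hold real tokens and which are padding.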

3. Model

  Activation = 'sigmoid'
  Accuracy = 'accuracy'
  optimizer = 'adam'
  loss = 'binary_crossentropy'
  Precision = 'precision'
  Recall = 'recall'
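
To show how the sigmoid activation and the binary cross-entropy loss listed above fit together, here is a minimal pure-Python sketch. The logits and labels are made-up values. The encoding 1 = SPAM, 0 = HAM and the 0.5 decision threshold are assumptions, not something the README states.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical raw outputs of the final layer for three messages.
logits = [2.0, -1.5, 0.3]
labels = [1, 0, 1]  # assumed encoding: 1 = SPAM, 0 = HAM

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)
predictions = [1 if p >= 0.5 else 0 for p in probs]

print([round(p, 3) for p in probs], round(loss, 3), predictions)
```

The loss shrinks as the predicted probability for the true class approaches 1. During training, the optimizer ('adam' in the configuration above) adjusts the weights to reduce it.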

4. Results

                    precision    recall  f1-score   support

             0       0.95      0.91      0.93       187
             1       0.92      0.95      0.93       187

      accuracy                           0.93       374
     macro avg       0.93      0.93      0.93       374
  weighted avg       0.93      0.93      0.93       374
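
The README does not give the confusion matrix behind this report. The sketch below shows how precision, recall, F1, and accuracy are derived from confusion counts. The four counts are hypothetical: they are one possible set that is consistent with the rounded figures and the supports of 187 per class, not the project's actual results.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confusion counts (assumption, not taken from the source).
true_0_pred_0 = 171
true_0_pred_1 = 16
true_1_pred_0 = 9
true_1_pred_1 = 178

# For class 0, a "false positive" is a class-1 message predicted as 0.
p0, r0, f1_0 = class_metrics(tp=true_0_pred_0, fp=true_1_pred_0, fn=true_0_pred_1)
# For class 1, a "false positive" is a class-0 message predicted as 1.
p1, r1, f1_1 = class_metrics(tp=true_1_pred_1, fp=true_0_pred_1, fn=true_1_pred_0)

total = true_0_pred_0 + true_0_pred_1 + true_1_pred_0 + true_1_pred_1
accuracy = (true_0_pred_0 + true_1_pred_1) / total

for name, (p, r, f) in {"0": (p0, r0, f1_0), "1": (p1, r1, f1_1)}.items():
    print(f"class {name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f} on {total} messages")
```

Because both classes have the same support, the macro and weighted averages coincide, which is why the last two rows of the report are identical.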

Note: This is open source.