The goal is to develop an efficient, translation-free recurrent neural architecture using BERT for SMS spam detection, one that retains the context of each SMS.
Detection is based on the context of the message rather than on the contact number the SMS was sent from.
- Data Acquisition
- Text Preprocessing
- Model
- Results
We collected a dataset from Kaggle that contains only English messages. We also added our own dataset, collected from real-world messages in three languages: English, Hindi, and Telugu. We manually labelled the data as SPAM or HAM.
The dataset consists of three columns: index, sms, and label, where label = {SPAM, HAM}.
The full dataset contains around 5214 records. It is unbalanced: about 92% of the records are HAM messages and about 7% are SPAM messages. We therefore balanced it so that each label makes up 50% of the data.
index  label  sms
942    spam   How about getting in touch with folks waiting ...
2278   ham    Hmm...Bad news...Hype park plaza $700 studio t...
15     spam   XXXMobileMovieClub: To use your credit, click ...
2327   spam   URGENT! Your mobile number *************** WON...
5214   spam   Natalja (25/F) is inviting you to be her frien...
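The balancing step is not described in detail here, so the following is only a minimal sketch of one common approach: downsampling the majority class until both labels have the same number of records. The function name `balance_by_downsampling`, the `seed` value, and the toy records are assumptions made for illustration and are not taken from the project code or data.

```python
import random
from collections import defaultdict


def balance_by_downsampling(records, label_key="label", seed=42):
    """Return a shuffled list in which every class is cut to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible

    # Group the records by their label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # Keep as many records per class as the rarest class has.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)  # avoid long runs of a single label
    return balanced


# Toy data invented for illustration only: 92 HAM rows and 8 SPAM rows.
toy = [
    {"index": i, "sms": f"message {i}", "label": "HAM" if i < 92 else "SPAM"}
    for i in range(100)
]
balanced = balance_by_downsampling(toy)
counts = {lab: sum(r["label"] == lab for r in balanced) for lab in ("HAM", "SPAM")}
print(counts)  # {'HAM': 8, 'SPAM': 8}
```

Downsampling discards HAM messages; oversampling the SPAM class is an alternative if keeping every record matters more than training time.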
We use the BERT preprocessing model and the BERT encoder to process the text.
activation = 'sigmoid'
optimizer = 'adam'
loss = 'binary_crossentropy'
metrics = 'accuracy', 'precision', 'recall'
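The settings above could be wired together roughly as follows. This is a sketch under several assumptions, not the project's actual model: the two TF Hub handles are placeholders (the exact BERT checkpoints are not named here, and a multilingual checkpoint may suit Hindi and Telugu text better), the layer stack is reduced to a single sigmoid output on BERT's pooled output with no recurrent layer, and it assumes a TensorFlow 2 installation in which `hub.KerasLayer` works with `tf.keras`.

```python
# Training settings as listed above.
COMPILE_CONFIG = {
    "activation": "sigmoid",
    "optimizer": "adam",
    "loss": "binary_crossentropy",
}

# Assumed TF Hub handles; the exact checkpoints used by the project are not stated.
PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"


def build_model(preprocess_url=PREPROCESS_URL, encoder_url=ENCODER_URL):
    """Build and compile a BERT-based binary classifier for raw SMS strings."""
    # Imported lazily so the configuration above can be read without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text  # noqa: F401  (registers ops needed by the preprocessing model)

    # Raw text goes in; the preprocessing layer tokenizes and pads it for the encoder.
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="sms")
    encoder_inputs = hub.KerasLayer(preprocess_url, name="bert_preprocess")(text_input)
    encoder_outputs = hub.KerasLayer(
        encoder_url, trainable=False, name="bert_encoder"
    )(encoder_inputs)

    # "pooled_output" is one fixed-size vector per message.
    pooled = encoder_outputs["pooled_output"]
    output = tf.keras.layers.Dense(
        1, activation=COMPILE_CONFIG["activation"], name="spam_probability"
    )(pooled)

    model = tf.keras.Model(inputs=text_input, outputs=output)
    model.compile(
        optimizer=COMPILE_CONFIG["optimizer"],
        loss=COMPILE_CONFIG["loss"],
        metrics=[
            "accuracy",
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
        ],
    )
    return model
```

Calling `build_model()` downloads both Hub modules on first use; the returned model accepts a batch of raw strings and outputs one probability per message.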
              precision    recall  f1-score   support
           0       0.95      0.91      0.93       187
           1       0.92      0.95      0.93       187
    accuracy                           0.93       374
   macro avg       0.93      0.93      0.93       374
weighted avg       0.93      0.93      0.93       374
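As a quick consistency check on the report above, the f1-score of each class is the harmonic mean of its precision and recall. The helper `f1_score` below is written only for this illustration; the input numbers are the rounded values from the report, so the result is itself approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the report above.
reported = {0: (0.95, 0.91), 1: (0.92, 0.95)}

f1_by_class = {label: round(f1_score(p, r), 2) for label, (p, r) in reported.items()}
print(f1_by_class)  # {0: 0.93, 1: 0.93}
```

Both values agree with the f1-score column, and because the two classes have equal support (187 each), the macro and weighted averages coincide.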
Note: This is open source.