zefang-liu/phishing-email-dataset
A fine-tuned DistilBERT-based model for phishing email detection, trained on the Phishing Emails Dataset. This model is optimized for identifying spam and phishing emails with high accuracy.
The model extends DistilBERT with a custom classification head:
```python
import torch.nn as nn

class DistilBERTSpamClassifier(nn.Module):
    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert         # pretrained DistilBERT encoder
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)       # 768 = DistilBERT hidden size
        self.fc2 = nn.Linear(512, 2)         # 2 classes: non-spam, spam
        self.softmax = nn.LogSoftmax(dim=1)  # log-probabilities over classes
```
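The snippet above shows only the constructor. As a rough sketch of how these layers might be wired together, the forward pass below assumes the first-token (`[CLS]`) hidden state is fed through `fc1`, ReLU, dropout, and `fc2` before `LogSoftmax`. That ordering is an assumption, not something the model card states. `StubBackbone` is a hypothetical stand-in for the real DistilBERT encoder so the example runs without downloading weights.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class DistilBERTSpamClassifier(nn.Module):
    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        # ASSUMED wiring: the original forward() is not shown in the card.
        encoded = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = encoded.last_hidden_state[:, 0]  # first-token representation
        x = self.dropout(self.relu(self.fc1(cls_state)))
        return self.softmax(self.fc2(x))             # log-probabilities, shape (batch, 2)


class StubBackbone(nn.Module):
    """Hypothetical encoder that mimics the DistilBERT output shape (batch, seq, 768)."""

    def __init__(self, vocab_size=100, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


model = DistilBERTSpamClassifier(StubBackbone()).eval()
input_ids = torch.randint(0, 100, (2, 8))        # 2 fake sequences of 8 token ids
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    log_probs = model(input_ids, attention_mask)
print(log_probs.shape)
```

With the real encoder, `StubBackbone()` would be replaced by a loaded DistilBERT model; each output row holds log-probabilities, so exponentiating a row gives values that sum to 1.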
Evaluated on a test set of 3,021 samples, the model achieves the following per-class metrics:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Spam (0) | 0.98 | 0.98 | 0.98 | 1,870 |
| Spam (1) | 0.96 | 0.97 | 0.96 | 1,151 |
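For a single summary number, the per-class rows can be combined into macro and support-weighted averages. The sketch below does this with the figures from the table. Because those figures are rounded to two decimals, the results are approximations. The support-weighted recall is mathematically equal to overall accuracy, so it also gives a rough accuracy estimate.

```python
# Per-class figures copied from the table: (precision, recall, f1, support)
report = {
    "Non-Spam (0)": (0.98, 0.98, 0.98, 1870),
    "Spam (1)": (0.96, 0.97, 0.96, 1151),
}

total = sum(row[3] for row in report.values())  # total test samples


def macro_avg(idx):
    """Unweighted mean of one metric column across classes."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted_avg(idx):
    """Mean of one metric column weighted by class support."""
    return sum(row[idx] * row[3] for row in report.values()) / total


for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(idx):.3f} weighted={weighted_avg(idx):.3f}")
```

The supports add up to the 3,021 test samples, and the weighted F1 and weighted recall both come out at roughly 0.97 to 0.98.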
```bash
pip install transformers onnxruntime torch
```

```python
from transformers import DistilBertTokenizer
import onnxruntime as ort
import numpy as np

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Load ONNX model
session = ort.InferenceSession("path_to_model.onnx")

# Tokenize input
text = "Your example email text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

# Run inference
outputs = session.run(None, dict(inputs))[0]
prediction = np.argmax(outputs, axis=1)
print("Spam" if prediction[0] == 1 else "Non-Spam")
```
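The `argmax` above yields only a hard label. When a confidence score or an adjustable decision threshold is useful, the outputs can be turned into probabilities first. The sketch below applies a softmax, which gives the same result whether the exported graph emits raw logits or the log-probabilities produced by the `LogSoftmax` layer. Which of the two the ONNX file actually emits is not stated here. The helper names, the threshold, and the example values are illustrative and not part of the model.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the class axis."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classify(scores, spam_threshold=0.5):
    """Return (label, P(spam)) pairs for a (batch, 2) array of model outputs."""
    spam_prob = softmax(scores)[:, 1]
    labels = np.where(spam_prob >= spam_threshold, "Spam", "Non-Spam")
    return list(zip(labels.tolist(), spam_prob.tolist()))


# Made-up log-probabilities standing in for `outputs` from session.run
example_outputs = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
for label, p in classify(example_outputs):
    print(f"{label}: P(spam) = {p:.2f}")
```

Raising `spam_threshold` trades recall on the spam class for fewer false positives on legitimate mail.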
The model was fine-tuned on the Phishing Emails Dataset, which contains labeled email samples for spam and phishing detection.
Base model: distilbert/distilbert-base-uncased