---
language: en
license: apache-2.0
tags:
- resume-parsing
- named-entity-recognition
- ner
- bert
- information-extraction
- cv-parser
- token-classification
datasets:
- resume-corpus
- dataturks-resume-ner
- custom-training-data
metrics:
- f1
- precision
- recall
---
# Resume NER BERT v2 - Advanced Resume Information Extraction
A Named Entity Recognition (NER) model designed for extracting structured information from resumes and CVs. It achieves a **90.87% F1 score** and was trained on **22,542 resume samples** from multiple sources.
## Model Description
This model is fine-tuned to extract key information from resumes, including personal details, contact information, work experience, education, and skills. It uses a BERT-based architecture with a token-classification head, so each token in the input receives one label.
### Key Features:
- **High Accuracy**: 90.87% F1 score on comprehensive resume parsing
- **Comprehensive Coverage**: 25 BIO labels (12 entity types) covering all major resume sections
- **Large Training Dataset**: 22,542 samples from multiple sources
- **Production Ready**: Tested and optimized for real-world applications
- **Memory Efficient**: CPU-optimized with reasonable model size (431MB)
## Performance Metrics
| Metric | Score | Status |
|--------|-------|--------|
| **F1 Score** | **90.87%** | Excellent |
| **Precision** | 91.44% | High |
| **Recall** | 90.81% | High |
| **Training Loss** | 0.2604 | Low |
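For reference, the three scores above are related as shown in this minimal sketch. The counts are made up for illustration, and the card does not state whether its scores are micro-, macro-, or weighted-averaged, so this is not a reproduction of the reported numbers.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, NOT taken from this model's evaluation
p, r, f1 = precision_recall_f1(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two.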
## Label Schema
The model predicts **25 labels**: BIO (Beginning-Inside-Outside) tags for the 12 entity types below (one `B-` and one `I-` tag each) plus the `O` tag:
### Core Personal Information:
- **Name**: Person's full name (e.g., "John Smith", "Sarah Johnson")
- **Email Address**: Email contact information (e.g., "john.smith@gmail.com")
- **Phone**: Phone number (e.g., "(555) 123-4567", "+1-555-123-4567")
- **Location**: Geographic location (e.g., "San Francisco, CA", "New York")
### Professional Information:
- **Companies worked at**: Previous employers and organizations (e.g., "Google", "Microsoft", "Amazon")
- **Designation**: Job titles and roles (e.g., "Senior Software Engineer", "Data Scientist", "Marketing Manager")
- **Skills**: Technical and soft skills (e.g., "Python", "JavaScript", "Machine Learning", "Leadership")
- **Years of Experience**: Work experience duration (e.g., "5 years", "10+ years")
### Educational Information:
- **Degree**: Educational qualifications (e.g., "Bachelor's in Computer Science", "Master's in Data Science")
- **College Name**: Educational institutions (e.g., "Stanford University", "MIT", "Harvard")
- **Graduation Year**: Year of degree completion (e.g., "2020", "2018")
### Additional:
- **UNKNOWN**: Unclassified entities that don't fit other categories
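The count of 25 labels follows from the 12 entity types listed above. The sketch below builds such a label list. The exact label strings and their order are an assumption here; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity type names as listed in this card; the strings stored in the
# model config may be spelled or ordered differently (assumption).
ENTITY_TYPES = [
    "Name", "Email Address", "Phone", "Location",
    "Companies worked at", "Designation", "Skills", "Years of Experience",
    "Degree", "College Name", "Graduation Year", "UNKNOWN",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # 12 types * 2 tags + "O" = 25
```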
### BIO Tags:
- `B-` (Beginning): Start of an entity
- `I-` (Inside): Continuation of an entity
- `O` (Outside): Non-entity tokens
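To make the tagging scheme concrete, here is a small sketch that turns a tagged token sequence back into entities. The tokens and tags are written by hand for illustration and are not output from the model.

```python
def decode_bio(tokens, tags):
    """Group (token, tag) pairs into (label, text) entities."""
    entities = []
    current = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity
            if current:
                entities.append(current)
            current = {"label": tag[2:], "tokens": [token]}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["tokens"].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(e["label"], " ".join(e["tokens"])) for e in entities]


# Hand-written example, not model output
tokens = ["John", "Smith", "works", "at", "Google", "as", "a", "Data", "Scientist"]
tags = ["B-Name", "I-Name", "O", "O", "B-Companies worked at",
        "O", "O", "B-Designation", "I-Designation"]
print(decode_bio(tokens, tags))
```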
## Usage
### Using the Model Directly
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example resume text
text = "John Smith is a senior software engineer with 8 years of experience at Google. He has expertise in Python, JavaScript, and machine learning. Contact: john.smith@gmail.com"

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding=True
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Extract entities, merging WordPiece sub-tokens ("##...") back into words
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
special_tokens = {tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token}
entities = []
current_entity = None
for i, (token, pred) in enumerate(zip(tokens, predictions[0])):
    # Skip [CLS], [SEP], and [PAD] so they never end up in entity text
    if token in special_tokens:
        continue
    label = model.config.id2label[pred.item()]
    if token.startswith("##") and current_entity:
        # Continuation of the previous word: append without a space
        current_entity["text"] += token[2:]
    elif label.startswith("B-"):
        if current_entity:
            entities.append(current_entity)
        current_entity = {
            "text": token,
            "label": label[2:],  # Remove 'B-' prefix
            "start": i
        }
    elif label.startswith("I-") and current_entity:
        current_entity["text"] += " " + token
    elif label == "O":
        if current_entity:
            entities.append(current_entity)
        current_entity = None
if current_entity:
    entities.append(current_entity)

print("Extracted Entities:")
for entity in entities:
    print(f"- {entity['label']}: {entity['text']}")
```
### Using the Pipeline
```python
from transformers import pipeline

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model="yashpwr/resume-ner-bert-v2",
    aggregation_strategy="simple"
)

# Extract entities
text = "Sarah Johnson holds a Master's degree in Computer Science from Stanford University. Skills: Python, TensorFlow, SQL."
results = ner_pipeline(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.3f})")
```
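Downstream code usually wants one record per resume, not a flat list. The sketch below groups pipeline-style results by `entity_group`. The helper name `group_entities`, the `min_score` default, and the sample records are hypothetical; the records only mimic the shape the pipeline returns with `aggregation_strategy="simple"` and are not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Collect unique entity strings per label, dropping low-score hits."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] < min_score:
            continue
        value = item["word"].strip()
        if value not in grouped[item["entity_group"]]:
            grouped[item["entity_group"]].append(value)
    return dict(grouped)


# Hand-written records in the pipeline's output shape (illustrative only)
sample_results = [
    {"entity_group": "Name", "word": "Sarah Johnson", "score": 0.98},
    {"entity_group": "Skills", "word": "Python", "score": 0.91},
    {"entity_group": "Skills", "word": "SQL", "score": 0.42},
    {"entity_group": "Skills", "word": "Python", "score": 0.88},
]
print(group_entities(sample_results))
```

The threshold of 0.5 is arbitrary; a suitable value depends on how costly false positives are for the application.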
### Advanced Usage with Confidence Scores
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)


def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entities with confidence scores."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True,
        return_offsets_mapping=True
    )
    # The model's forward() does not accept offset_mapping, so remove it first
    offset_mapping = inputs.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
        probabilities = torch.softmax(outputs.logits, dim=2)

    entities = []
    current_entity = None
    for i, (pred, (start, end)) in enumerate(zip(predictions[0], offset_mapping)):
        label = model.config.id2label[pred.item()]
        confidence = probabilities[0][i][pred].item()

        # Skip special tokens, which have an empty (0, 0) offset
        if start == 0 and end == 0:
            continue

        if label.startswith('B-'):
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)
            current_entity = {
                'text': text[start:end],
                'label': label[2:],
                'start': start,
                'end': end,
                'confidence': confidence
            }
        elif label.startswith('I-') and current_entity:
            if label[2:] == current_entity['label']:
                # Extend the span and re-slice the original text, which keeps
                # the original spacing and avoids split sub-word pieces
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]
                # An entity is only as confident as its weakest token
                current_entity['confidence'] = min(current_entity['confidence'], confidence)
        elif label == 'O':
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)
            current_entity = None

    if current_entity and current_entity['confidence'] >= confidence_threshold:
        entities.append(current_entity)
    return entities


# Example usage
text = "Michael Brown is a marketing manager with 10 years of experience at Coca-Cola. Contact: michael.brown@marketing.com"
entities = extract_entities_with_confidence(text, confidence_threshold=0.3)
for entity in entities:
    print(f"{entity['label']}: '{entity['text']}' (confidence: {entity['confidence']:.3f})")
```
## Training Details
### Dataset Composition
- **Total Samples**: 22,542
- **Sources**:
- **Resume-Corpus Dataset**: 349 samples (structured resume data)
- **DataTurks Resume NER**: 420 samples (manually annotated resumes)
- **Custom Training Data**: 21,773 samples (rule-based extraction from conversation data)
- **Mehyaar Skills Dataset**: Integrated skills-focused data
### Training Configuration
- **Base Model**: `yashpwr/resume-ner-bert`
- **Learning Rate**: 3e-5
- **Batch Size**: 4 (effective: 32 with gradient accumulation)
- **Max Sequence Length**: 128 tokens
- **Epochs**: 1.0 (early stopping applied)
- **Device**: CPU (optimized for memory efficiency)
- **Gradient Accumulation Steps**: 8
- **Optimizer**: AdamW
- **Loss Function**: Cross-Entropy Loss
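The batch-size figures above can be checked with a little arithmetic. This sketch assumes, purely for illustration, that all 22,542 samples are used in a single pass; the card does not state a train/validation split, and how a trailing partial accumulation group is handled depends on the training loop.

```python
import math


def accumulation_schedule(num_samples, batch_size, accumulation_steps):
    """Return (effective_batch_size, micro_batches, optimizer_updates)."""
    effective_batch_size = batch_size * accumulation_steps
    micro_batches = math.ceil(num_samples / batch_size)
    # Assumption: a final partial group still triggers one optimizer update
    optimizer_updates = math.ceil(micro_batches / accumulation_steps)
    return effective_batch_size, micro_batches, optimizer_updates


effective, micro, updates = accumulation_schedule(
    num_samples=22_542, batch_size=4, accumulation_steps=8
)
print(effective, micro, updates)
```

Gradient accumulation keeps peak memory at the level of a batch of 4 while the optimizer sees gradients averaged over 32 samples, which is why it suits CPU training.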
### Training Process Improvements
The model was trained using a comprehensive pipeline that addressed several key challenges:
1. **Tokenization Consistency**: Used `bert-base-cased` throughout the pipeline to ensure consistent tokenization between training and inference
2. **Entity Extraction Enhancement**: Implemented proper character-to-token alignment using `return_offsets_mapping=True` for accurate text reconstruction
3. **Label Mapping**: Unified diverse label schemas into the DataTurks format (25 labels) for consistency
4. **Performance Optimization**: CPU-optimized training with memory efficiency and gradient accumulation
5. **Dataset Integration**: Successfully integrated 21,773 additional samples from conversation data using rule-based extraction
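The character-to-token alignment in step 2 can be sketched as follows. This is not the author's training code: `whitespace_offsets` is a hypothetical stand-in for the `(start, end)` pairs a fast tokenizer returns with `return_offsets_mapping=True`, and the annotation spans are invented for the example.

```python
import re


def whitespace_offsets(text):
    """Stand-in for a tokenizer offset mapping: one (start, end) per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(offsets, spans):
    """Convert character spans (start, end, label) into per-token BIO tags."""
    tags = ["O"] * len(offsets)
    for span_start, span_end, label in spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            # A token belongs to the span if it lies fully within it
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = f"I-{label}" if inside else f"B-{label}"
                inside = True
    return tags


text = "Sarah Johnson studied at Stanford University"
spans = [(0, 13, "Name"), (25, 44, "College Name")]  # invented annotations
offsets = whitespace_offsets(text)
print(list(zip([text[s:e] for s, e in offsets], spans_to_bio(offsets, spans))))
```

With a real WordPiece tokenizer the same loop applies, except that special tokens carry a `(0, 0)` offset and should be skipped or given an ignore label.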
## Use Cases
### Primary Applications
- **Recruitment Platforms**: Automated resume parsing and candidate screening
- **Resume Parsing Engines**: Extract structured data from unstructured resumes
- **Talent Analytics Tools**: Analyze candidate skills and experience patterns
- **Document Processing Pipelines**: Integrate with HR systems and ATS
- **ATS (Applicant Tracking Systems)**: Automated candidate data extraction and categorization
### Industry Sectors
- **Human Resources & Recruitment**: Streamline hiring processes
- **Technology & Software Development**: Technical skill assessment
- **Finance & Banking**: Compliance and background verification
- **Healthcare**: Medical credential verification
- **Education**: Academic credential processing
- **Government & Public Sector**: Public service recruitment
### Specific Use Cases
1. **Automated Resume Screening**: Filter candidates based on skills and experience
2. **Data Migration**: Convert legacy resume databases to structured formats
3. **Compliance Checking**: Verify educational and professional credentials
4. **Skill Gap Analysis**: Identify missing skills in candidate pools
5. **Market Research**: Analyze job market trends and skill demands
## Limitations
1. **Language**: Currently optimized for English resumes only
2. **Format**: Works best with text-based resumes (PDF conversion may be required)
3. **Domain**: Primarily trained on technology and business resumes
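4. **Length**: Trained with a maximum sequence length of 128 tokens, which the usage examples also apply; longer inputs are truncated, so long resumes should be split into chunks before inference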
5. **Accuracy**: 90.87% F1 score - may miss some entities in complex or non-standard formats
6. **Context**: Limited to resume-specific entities (not general NER)
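One way to work around the length limit is to split a long resume into overlapping windows and run the model on each. The sketch below splits on whitespace. The helper `chunk_words` and its defaults are hypothetical, and words are not WordPiece tokens, so the window size is a conservative guess that should be verified against the tokenizer.

```python
def chunk_words(text, max_words=80, overlap=20):
    """Split text into overlapping word windows.

    max_words is a rough proxy for the token limit (assumption): one word
    can expand into several WordPiece tokens.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 200-word input for demonstration
long_text = " ".join(f"w{i}" for i in range(200))
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

Entities found in the overlapping regions appear twice and need to be de-duplicated after inference.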
## Technical Requirements
### System Requirements
- **Python**: 3.8 or higher
- **PyTorch**: 1.9 or higher
- **Transformers**: 4.20 or higher
- **Memory**: 2GB+ RAM recommended
- **Storage**: 431MB model size
- **CPU**: Multi-core recommended for inference
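A quick way to compare an environment against the minimum versions above is sketched here. The helper names are hypothetical, and the version parsing is deliberately simple (leading integers only); for full PEP 440 handling a dedicated library such as `packaging` would be the safer choice.

```python
import re
from importlib import metadata

# Minimum versions as stated in this card
MINIMUMS = {"torch": "1.9", "transformers": "4.20"}


def parse_version(version):
    """Turn '4.20.1' or '2.1.0+cpu' into a tuple of leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = "ok" if ok else f"{installed} < {minimum}"
    return report


print(check_environment())
```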
### Dependencies
```bash
pip install transformers torch datasets scikit-learn numpy
```
### Installation
```bash
# Install required packages
pip install "transformers[torch]" datasets scikit-learn
# Or using conda
conda install pytorch transformers -c pytorch -c conda-forge
```
## License
This model is licensed under the Apache 2.0 License. This means you can:
- Use the model for commercial purposes
- Modify and distribute the model
- Use it in proprietary software
- Distribute modified versions
See the [LICENSE](LICENSE) file for complete details.
## Contributing
We welcome contributions from the community! Here's how you can contribute:
1. **Report Issues**: Create an issue for bugs or feature requests
2. **Submit Improvements**: Fork the repository and submit pull requests
3. **Share Datasets**: Contribute additional training data
4. **Documentation**: Help improve documentation and examples
5. **Testing**: Test the model on different resume formats
## Support
For questions, issues, or support:
- **GitHub Issues**: Create an issue on the model repository
- **Hugging Face Discussions**: Use the discussion tab on the model page
- **Email**: Contact through the Hugging Face profile
- **Documentation**: Check the model card and examples
## Acknowledgments
- **Base Model**: `yashpwr/resume-ner-bert` for the foundation architecture
- **Datasets**: Resume-Corpus, DataTurks, and custom training data contributors
- **Hugging Face**: For the transformers library and platform
- **Open Source Community**: For contributions and feedback
- **Research Community**: For advancing NER and information extraction techniques
## Model Evolution
### Version History
- **v1**: Initial release with basic resume parsing
- **v2**: Comprehensive model with 22,542 samples and 90.87% F1 score
### Future Improvements
- Multi-language support
- Enhanced entity types
- Better handling of complex resume formats
- Integration with document processing pipelines
- Real-time inference optimization
---
**Last Updated**: August 7, 2025
**Version**: v2
**Status**: Production Ready
**License**: Apache 2.0
**Repository**: https://huggingface.co/yashpwr/resume-ner-bert-v2