---
language: en
license: apache-2.0
tags:
- resume-parsing
- named-entity-recognition
- ner
- bert
- information-extraction
- cv-parser
- token-classification
datasets:
- resume-corpus
- dataturks-resume-ner
- custom-training-data
metrics:
- f1
- precision
- recall
---

# Resume NER BERT v2 - Advanced Resume Information Extraction

A Named Entity Recognition (NER) model fine-tuned to extract structured information from resumes and CVs. It reports a **90.87% F1 score** and was trained on **22,542 resume samples** drawn from multiple sources.

## 🎯 Model Description

This model has been extensively fine-tuned to extract key information from resumes including personal details, contact information, work experience, education, skills, and more. The model uses a BERT-based architecture with token classification capabilities, making it highly effective for resume parsing tasks.

### Key Features:
- **High Accuracy**: 90.87% F1 score on comprehensive resume parsing
- **Comprehensive Coverage**: 12 entity types (25 BIO labels) covering all major resume sections
- **Large Training Dataset**: 22,542 samples from multiple sources
- **Production Ready**: Tested and optimized for real-world applications
- **Memory Efficient**: CPU-optimized with reasonable model size (431MB)

## 📊 Performance Metrics

| Metric | Score | Status |
|--------|-------|--------|
| **F1 Score** | **90.87%** | ✅ Excellent |
| **Precision** | 91.44% | ✅ High |
| **Recall** | 90.81% | ✅ High |
| **Training Loss** | 0.2604 | ✅ Low |
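
This card does not say how the scores above were computed (entity-level or token-level, micro or macro averaging). Purely as an illustration, entity-level precision, recall, and F1 over exact-match spans could be computed as in the sketch below. The function name `span_prf` and the example spans are hypothetical and are not taken from the model's evaluation code.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (label, start, end) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up spans for illustration only
gold = [("Name", 0, 10), ("Companies worked at", 40, 46), ("Skills", 60, 66)]
predicted = [("Name", 0, 10), ("Skills", 60, 66), ("Skills", 70, 80), ("Location", 90, 95)]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Two of the four predicted spans match gold spans exactly, so precision is 2/4 and recall is 2/3 in this toy case.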

## 🏷️ Label Schema

The model recognizes **12 entity types**, encoded as **25 labels** using BIO (Beginning-Inside-Outside) tagging (a `B-` and an `I-` label per type, plus `O`):

### Core Personal Information:
- **Name**: Person's full name (e.g., "John Smith", "Sarah Johnson")
- **Email Address**: Email contact information (e.g., "john.smith@gmail.com")
- **Phone**: Phone number (e.g., "(555) 123-4567", "+1-555-123-4567")
- **Location**: Geographic location (e.g., "San Francisco, CA", "New York")

### Professional Information:
- **Companies worked at**: Previous employers and organizations (e.g., "Google", "Microsoft", "Amazon")
- **Designation**: Job titles and roles (e.g., "Senior Software Engineer", "Data Scientist", "Marketing Manager")
- **Skills**: Technical and soft skills (e.g., "Python", "JavaScript", "Machine Learning", "Leadership")
- **Years of Experience**: Work experience duration (e.g., "5 years", "10+ years")

### Educational Information:
- **Degree**: Educational qualifications (e.g., "Bachelor's in Computer Science", "Master's in Data Science")
- **College Name**: Educational institutions (e.g., "Stanford University", "MIT", "Harvard")
- **Graduation Year**: Year of degree completion (e.g., "2020", "2018")

### Additional:
- **UNKNOWN**: Unclassified entities that don't fit other categories

### BIO Tags:
- `B-` (Beginning): Start of an entity
- `I-` (Inside): Continuation of an entity
- `O` (Outside): Non-entity tokens
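
As a minimal sketch of how the scheme above fits together, the snippet below builds a BIO label list from the entity types listed in this card and decodes a hand-tagged word sequence. The helper names `build_bio_labels` and `decode_bio` are hypothetical, and the label order here is an assumption; the ordering in the model's `id2label` config may differ.

```python
# Entity types as listed in this card (ordering is an assumption)
ENTITY_TYPES = [
    "Name", "Email Address", "Phone", "Location",
    "Companies worked at", "Designation", "Skills", "Years of Experience",
    "Degree", "College Name", "Graduation Year", "UNKNOWN",
]

def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels

def decode_bio(words, tags):
    """Group (word, tag) pairs into (label, text) entities."""
    entities, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open
            if current:
                entities.append(current)
            current = [tag[2:], [word]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the open entity
            current[1].append(word)
        else:
            # 'O' or an I- tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(parts)) for label, parts in entities]

labels = build_bio_labels(ENTITY_TYPES)
words = ["John", "Smith", "works", "at", "Google"]
tags = ["B-Name", "I-Name", "O", "O", "B-Companies worked at"]
decoded = decode_bio(words, tags)
print(len(labels), decoded)
```

Twelve entity types give 12 × 2 + 1 = 25 labels, which matches the label count stated for the DataTurks format.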

## 🚀 Usage

### Using the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example resume text
text = "John Smith is a senior software engineer with 8 years of experience at Google. He has expertise in Python, JavaScript, and machine learning. Contact: john.smith@gmail.com"

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding=True
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

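# Extract entities
# Note: text is rebuilt from tokens, so spacing around punctuation may differ from the input
entities = []
current_entity = None
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

for i, (token, pred) in enumerate(zip(tokens, predictions[0])):
    # Skip special tokens such as [CLS], [SEP], and [PAD]
    if token in tokenizer.all_special_tokens:
        continue

    label = model.config.id2label[pred.item()]

    if token.startswith('##') and current_entity:
        # WordPiece continuation: attach to the previous piece without a space
        current_entity['text'] += token[2:]
    elif label.startswith('B-'):
        if current_entity:
            entities.append(current_entity)
        current_entity = {
            'text': token,
            'label': label[2:],  # Remove 'B-' prefix
            'start': i
        }
    elif label.startswith('I-') and current_entity:
        current_entity['text'] += ' ' + token
    elif label == 'O':
        if current_entity:
            entities.append(current_entity)
            current_entity = None

if current_entity:
    entities.append(current_entity)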

print("Extracted Entities:")
for entity in entities:
    print(f"- {entity['label']}: {entity['text']}")
```

### Using the Pipeline

```python
from transformers import pipeline

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model="yashpwr/resume-ner-bert-v2",
    aggregation_strategy="simple"
)

# Extract entities
text = "Sarah Johnson holds a Master's degree in Computer Science from Stanford University. Skills: Python, TensorFlow, SQL."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.3f})")
```
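
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `word`, `score`, `start`, and `end` keys. A minimal sketch of turning that list into one record per resume follows. The helper `group_entities`, the `min_score` cutoff, and the sample results are hypothetical and are not real model output.

```python
from collections import defaultdict

def group_entities(results, min_score=0.5):
    """Collect unique entity strings per entity group, dropping low-score hits."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] < min_score:
            continue
        value = item["word"].strip()
        if value not in grouped[item["entity_group"]]:
            grouped[item["entity_group"]].append(value)
    return dict(grouped)

# Hand-written stand-in for pipeline output (not produced by the model)
sample_results = [
    {"entity_group": "Name", "word": "Sarah Johnson", "score": 0.98, "start": 0, "end": 13},
    {"entity_group": "Skills", "word": "Python", "score": 0.91, "start": 92, "end": 98},
    {"entity_group": "Skills", "word": "TensorFlow", "score": 0.42, "start": 100, "end": 110},
    {"entity_group": "Skills", "word": "Python", "score": 0.88, "start": 120, "end": 126},
]

record = group_entities(sample_results, min_score=0.5)
print(record)
```

Deduplicating per group keeps repeated skills from inflating the record. Whether a 0.5 cutoff is appropriate depends on the application.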

### Advanced Usage with Confidence Scores

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entities with confidence scores."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True,
        return_offsets_mapping=True  # requires a fast tokenizer
    )
    # The model's forward pass does not accept offset_mapping, so pop it first
    offset_mapping = inputs.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
        probabilities = torch.softmax(outputs.logits, dim=2)

    entities = []
    current_entity = None

    for i, (pred, (start, end)) in enumerate(zip(predictions[0], offset_mapping)):
        label = model.config.id2label[pred.item()]
        confidence = probabilities[0][i][pred].item()

        # Skip special tokens, which have an empty (0, 0) offset
        if start == 0 and end == 0:
            continue

        if label.startswith('B-'):
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)

            current_entity = {
                'text': text[start:end],
                'label': label[2:],
                'start': start,
                'end': end,
                'confidence': confidence
            }

        elif label.startswith('I-') and current_entity:
            if label[2:] == current_entity['label']:
                # Slice the original text so subwords and punctuation keep their spacing
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]
                # An entity is only as confident as its weakest token
                current_entity['confidence'] = min(current_entity['confidence'], confidence)

        elif label == 'O':
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)
            # Always close the entity, even when it falls below the threshold
            current_entity = None

    if current_entity and current_entity['confidence'] >= confidence_threshold:
        entities.append(current_entity)

    return entities

# Example usage
text = "Michael Brown is a marketing manager with 10 years of experience at Coca-Cola. Contact: michael.brown@marketing.com"
entities = extract_entities_with_confidence(text, confidence_threshold=0.3)

for entity in entities:
    print(f"{entity['label']}: '{entity['text']}' (confidence: {entity['confidence']:.3f})")
```

## 📚 Training Details

### Dataset Composition
- **Total Samples**: 22,542
- **Sources**:
  - **Resume-Corpus Dataset**: 349 samples (structured resume data)
  - **DataTurks Resume NER**: 420 samples (manually annotated resumes)
  - **Custom Training Data**: 21,773 samples (rule-based extraction from conversation data)
  - **Mehyaar Skills Dataset**: Integrated skills-focused data

### Training Configuration
- **Base Model**: `yashpwr/resume-ner-bert`
- **Learning Rate**: 3e-5
- **Batch Size**: 4 (effective: 32 with gradient accumulation)
- **Max Sequence Length**: 128 tokens
- **Epochs**: 1.0 (early stopping applied)
- **Device**: CPU (optimized for memory efficiency)
- **Gradient Accumulation Steps**: 8
- **Optimizer**: AdamW
- **Loss Function**: Cross-Entropy Loss
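
The effective batch size follows from the per-step batch size and the number of gradient accumulation steps. The arithmetic below is only a sketch: it assumes all 22,542 samples were used for training with no held-out split, which this card does not state, and it counts a final partial accumulation window as one optimizer step.

```python
import math

per_step_batch_size = 4
gradient_accumulation_steps = 8
num_samples = 22_542  # assumption: every sample is used for training

# Gradients from 8 micro-batches of 4 are summed before each optimizer update
effective_batch_size = per_step_batch_size * gradient_accumulation_steps

# Forward/backward passes needed to see every sample once
micro_batches_per_epoch = math.ceil(num_samples / per_step_batch_size)

# Optimizer updates per epoch under the assumptions above
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / gradient_accumulation_steps)

print(effective_batch_size, micro_batches_per_epoch, optimizer_steps_per_epoch)
```

Accumulating gradients keeps peak memory at the level of a batch of 4 while the optimizer sees updates equivalent to a batch of 32, which suits the CPU training setup described here.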

### Training Process Improvements
The model was trained using a comprehensive pipeline that addressed several key challenges:

1. **✅ Tokenization Consistency**: Used `bert-base-cased` throughout the pipeline to ensure consistent tokenization between training and inference
2. **✅ Entity Extraction Enhancement**: Implemented proper character-to-token alignment using `return_offsets_mapping=True` for accurate text reconstruction
3. **✅ Label Mapping**: Unified diverse label schemas into the DataTurks format (25 labels) for consistency
4. **✅ Performance Optimization**: CPU-optimized training with memory efficiency and gradient accumulation
5. **✅ Dataset Integration**: Successfully integrated 21,773 additional samples from conversation data using rule-based extraction
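
The character-to-token alignment in step 2 can be illustrated without loading a model. In the sketch below, a regex splitter stands in for the tokenizer's offset mapping, and character-level annotations are projected onto tokens as BIO tags. The function names and the example annotation are hypothetical; the actual training pipeline is not published in this card.

```python
import re

def tokenize_with_offsets(text):
    """Regex stand-in for a tokenizer that returns (token, start, end) offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def align_spans_to_tokens(text, spans):
    """Project character-level (start, end, label) spans onto tokens as BIO tags."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if it lies fully inside it
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return [tok for tok, _, _ in tokens], tags

text = "Sarah Johnson studied at Stanford University."
spans = [(0, 13, "Name"), (25, 44, "College Name")]  # hand-written annotation

tokens, tags = align_spans_to_tokens(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With a real fast tokenizer, the `(start, end)` pairs would come from `return_offsets_mapping=True` instead of the regex, and special tokens with `(0, 0)` offsets would need to be skipped.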

## 🎯 Use Cases

### Primary Applications
- **Recruitment Platforms**: Automated resume parsing and candidate screening
- **Resume Parsing Engines**: Extract structured data from unstructured resumes
- **Talent Analytics Tools**: Analyze candidate skills and experience patterns
- **Document Processing Pipelines**: Integrate with HR systems and ATS
- **ATS (Applicant Tracking Systems)**: Automated candidate data extraction and categorization

### Industry Sectors
- **Human Resources & Recruitment**: Streamline hiring processes
- **Technology & Software Development**: Technical skill assessment
- **Finance & Banking**: Compliance and background verification
- **Healthcare**: Medical credential verification
- **Education**: Academic credential processing
- **Government & Public Sector**: Public service recruitment

### Specific Use Cases
1. **Automated Resume Screening**: Filter candidates based on skills and experience
2. **Data Migration**: Convert legacy resume databases to structured formats
3. **Compliance Checking**: Verify educational and professional credentials
4. **Skill Gap Analysis**: Identify missing skills in candidate pools
5. **Market Research**: Analyze job market trends and skill demands

## ⚠️ Limitations

1. **Language**: Currently optimized for English resumes only
2. **Format**: Works best with text-based resumes (PDF conversion may be required)
3. **Domain**: Primarily trained on technology and business resumes
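4. **Length**: Trained with a maximum sequence length of 128 tokens, and the usage examples truncate at `max_length=128`; longer resumes should be split into chunks, and BERT itself cannot process more than 512 tokens at once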
5. **Accuracy**: 90.87% F1 score - may miss some entities in complex or non-standard formats
6. **Context**: Limited to resume-specific entities (not general NER)
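
Because the training configuration lists a maximum sequence length of 128 tokens and the usage examples truncate at `max_length=128`, long resumes need to be split before inference. One possible approach is overlapping windows, sketched below. The helper `chunk_tokens`, the window size of 126 (128 minus `[CLS]` and `[SEP]`), and the overlap of 16 are assumptions, not settings documented for this model.

```python
def chunk_tokens(tokens, max_tokens=126, overlap=16):
    """Split a token list into overlapping windows of at most max_tokens items."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end of the sequence
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Integers stand in for the tokens of a 300-token resume
tokens = list(range(300))
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in the next window; duplicate predictions from the overlapping region then need to be merged. Hugging Face fast tokenizers offer similar behavior through `stride` together with `return_overflowing_tokens=True`.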

## 🔧 Technical Requirements

### System Requirements
- **Python**: 3.8 or higher
- **PyTorch**: 1.9 or higher
- **Transformers**: 4.20 or higher
- **Memory**: 2GB+ RAM recommended
- **Storage**: 431MB model size
- **CPU**: Multi-core recommended for inference

### Dependencies
```bash
pip install transformers torch datasets scikit-learn numpy
```

### Installation
```bash
# Install required packages
pip install transformers[torch] datasets scikit-learn

# Or using conda
conda install -c conda-forge pytorch transformers
```

## 📄 License

This model is licensed under the Apache 2.0 License. This means you can:
- Use the model for commercial purposes
- Modify and distribute the model
- Use it in proprietary software
- Distribute modified versions

See the [LICENSE](LICENSE) file for complete details.

## 🤝 Contributing

We welcome contributions from the community! Here's how you can contribute:

1. **Report Issues**: Create an issue for bugs or feature requests
2. **Submit Improvements**: Fork the repository and submit pull requests
3. **Share Datasets**: Contribute additional training data
4. **Documentation**: Help improve documentation and examples
5. **Testing**: Test the model on different resume formats

## 📞 Support

For questions, issues, or support:

- **GitHub Issues**: Create an issue on the model repository
- **Hugging Face Discussions**: Use the discussion tab on the model page
- **Email**: Contact through the Hugging Face profile
- **Documentation**: Check the model card and examples

## 🙏 Acknowledgments

- **Base Model**: `yashpwr/resume-ner-bert` for the foundation architecture
- **Datasets**: Resume-Corpus, DataTurks, and custom training data contributors
- **Hugging Face**: For the transformers library and platform
- **Open Source Community**: For contributions and feedback
- **Research Community**: For advancing NER and information extraction techniques

## 📈 Model Evolution

### Version History
- **v1**: Initial release with basic resume parsing
- **v2**: Comprehensive model with 22,542 samples and 90.87% F1 score

### Future Improvements
- Multi-language support
- Enhanced entity types
- Better handling of complex resume formats
- Integration with document processing pipelines
- Real-time inference optimization

---

**Last Updated**: August 7, 2025  
**Version**: v2  
**Status**: Production Ready ✅  
**License**: Apache 2.0  
**Repository**: https://huggingface.co/yashpwr/resume-ner-bert-v2