zaiffi committed
Commit dabc67a · 1 parent: 2480fdf

Create README.md with comprehensive documentation and add project images

Files changed (4):
1. README.md (+190 −3)
2. images/demo.png (added)
3. images/mainlayout1.png (added)
4. images/mainlayout2.png (added)

README.md CHANGED
@@ -1,3 +1,190 @@
- ---
- license: apache-2.0
- ---
# Mehfil-e-Sukhan: Har Lafz Ek Mehfil

![Main Layout 1](images/mainlayout1.png)
![Main Layout 2](images/mainlayout2.png)

## Roman Urdu Poetry Generation Model

A bidirectional LSTM neural network for generating Roman Urdu poetry, trained on a curated dataset of Urdu poetry in Latin script.

## Table of Contents

- [Overview](#overview)
- [Repository Structure](#repository-structure)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Data Processing](#data-processing)
- [Training Methodology](#training-methodology)
- [Text Generation Process](#text-generation-process)
- [Results and Performance](#results-and-performance)
- [Usage](#usage)
- [Interactive Demo](#interactive-demo)
- [Installation](#installation)
- [Future Improvements](#future-improvements)
- [License](#license)
- [Contact](#contact)

## Overview

Mehfil-e-Sukhan (meaning "Poetry Gathering" in Urdu) is a natural language generation model designed for Roman Urdu poetry creation. This repository contains the complete implementation: data preprocessing, tokenization, model architecture, training code, and inference utilities.

The model uses a bidirectional LSTM architecture trained on approximately 1,300 lines of Roman Urdu poetry to learn the patterns, rhythms, and stylistic elements of Urdu poetry written in Latin script.

## Repository Structure

The repository contains the following key files:

- `poetry_generation.ipynb`: Complete notebook with data preparation, model definition, training code, and generation utilities
- `model_weights.pth`: Trained model weights (243 MB)
- `urdu_sp.model`: SentencePiece tokenizer model (429 KB)
- `urdu_sp.vocab`: SentencePiece vocabulary file (181 KB)
- `all_texts.txt`: Preprocessed dataset used for training (869 KB)
- `requirements.txt`: Required Python packages
- `.gitattributes`: Git LFS tracking for large files

## Model Architecture

The poetry generation model uses a bidirectional LSTM:

- **Embedding Layer**: 512-dimensional embeddings
- **BiLSTM Layers**: 3 stacked bidirectional LSTM layers with 768 hidden units per direction
- **Dropout**: 0.2 dropout rate for regularization
- **Output Layer**: Linear projection to the vocabulary size (12,000 tokens)

This architecture was chosen to capture both preceding and following context in poetry lines, which helps maintain coherence and style in the generated text.
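In PyTorch, the layers above might be wired as follows. This is an illustrative reconstruction from the hyperparameters listed here, not the notebook's exact class, and details such as state handling may differ:

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Sketch of the architecture described above (hypothetical reconstruction)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=768,
                 num_layers=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, bidirectional=True, batch_first=True)
        # Forward and backward hidden states are concatenated: 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq, 2 * hidden_dim)
        return self.fc(outputs)                # (batch, seq, vocab_size)
```

With the 12,000-token vocabulary, the final projection alone holds 2 × 768 × 12,000 ≈ 18.4 M weights, consistent with the sizable `model_weights.pth` file.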
## Dataset

The model is trained on the Roman Urdu Poetry dataset, which contains approximately 1,300 lines of Urdu poetry written in Latin script (Roman Urdu). The dataset includes works from various poets and covers a range of poetic styles and themes.

Dataset source: [Roman Urdu Poetry Dataset on Kaggle](https://www.kaggle.com/datasets/mianahmadhasan/roman-urdu-poetry-csv)

## Data Processing

Raw poetry lines undergo several preprocessing steps:

1. **Diacritic Removal**: Unicode diacritics are normalized and removed
2. **Text Cleaning**: Excessive punctuation, symbols, and repeated spaces are eliminated
3. **Tokenization**: SentencePiece BPE (Byte Pair Encoding) tokenization with a vocabulary size of 12,000

This tokenization approach lets the model handle out-of-vocabulary words by breaking them into subword units, which is particularly important for Roman Urdu, where spelling variations are common.
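The first two cleaning steps can be sketched with the standard library; the exact regular expressions used in the notebook may differ, so treat this as illustrative:

```python
import re
import unicodedata

def clean_line(text: str) -> str:
    """Remove diacritics, stray symbols, and repeated spaces (illustrative)."""
    # NFKD decomposition splits accented letters into base char + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    no_diacritics = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that is not a word character, whitespace, or basic punctuation
    no_symbols = re.sub(r"[^\w\s,.!?'-]", " ", no_diacritics)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", no_symbols).strip()

print(clean_line("dil-e-nādāñ   tujhe huā kyā hai?"))
# dil-e-nadan tujhe hua kya hai?
```

The cleaned lines would then be fed to SentencePiece BPE training to produce `urdu_sp.model` and `urdu_sp.vocab`.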
## Training Methodology

The model was trained with the following parameters:

- **Train/Validation/Test Split**: 80% / 10% / 10%
- **Loss Function**: Cross-entropy with `ignore_index` set to the padding token
- **Optimizer**: Adam with learning rate 1e-3 and weight decay 1e-5
- **Learning Rate Schedule**: StepLR with step size 2 and gamma 0.5
- **Gradient Clipping**: Maximum norm of 5.0
- **Epochs**: 10 (sufficient for convergence at this dataset size)
- **Batch Size**: 64

Training runs on both CPU and GPU environments, with automatic device detection.
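Taken together, these settings translate into a loop along the following lines. This is a condensed sketch (the notebook's version also evaluates on the validation split each epoch); `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")` provides the automatic device detection mentioned above:

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, pad_id, device, epochs=10):
    """Minimal training loop using the hyperparameters listed above (sketch)."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # skip padding positions
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    model.to(device).train()
    for epoch in range(epochs):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)  # (batch, seq, vocab_size)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             targets.reshape(-1))
            loss.backward()
            # Clip exploding gradients, a common issue with stacked LSTMs
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step()  # halves the learning rate every 2 epochs
    return model
```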
## Text Generation Process

Poetry generation uses nucleus (top-p) sampling with adjustable parameters:

- **Temperature**: Controls randomness in token selection (default: 1.2)
- **Top-p (nucleus) sampling**: Limits selection to the smallest set of tokens whose cumulative probability exceeds the threshold (default: 0.85)
- **Formatting**: Output is automatically wrapped at 6 words per line for aesthetic presentation

This sampling approach balances creativity and coherence, allowing controlled variation in the output.
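A minimal version of the sampling step and the 6-words-per-line formatting might look like this (`nucleus_sample` and `format_poem` are illustrative names, not the notebook's functions):

```python
import torch

def nucleus_sample(logits, temperature=1.2, top_p=0.85):
    """Sample the next token id from the smallest set of tokens whose
    cumulative probability exceeds top_p (illustrative sketch)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first one that crosses the threshold
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(top_p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return sorted_ids[torch.multinomial(kept, 1)].item()

def format_poem(words, per_line=6):
    """Wrap a flat list of words at a fixed count per line."""
    lines = [" ".join(words[i:i + per_line])
             for i in range(0, len(words), per_line)]
    return "\n".join(lines)
```

Lowering `top_p` or `temperature` makes the output more deterministic; at very small `top_p` only the single most probable token survives the cutoff.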
![Demo Screenshot](images/demo.png)

## Results and Performance

The final model reaches a test loss of approximately 3.17, which is reasonable given the dataset size. The model demonstrates the ability to:

- Generate contextually relevant continuations from a seed word
- Maintain some aspects of Urdu poetic style in Roman script
- Produce text with thematic consistency

The limited dataset size (about 1,300 lines) does cause some repetitiveness in longer generations, which additional training data would likely reduce.

## Usage

To use the model for generating poetry:
```python
# BiLSTMLanguageModel and generate_poetry_nucleus are defined in the notebook
import torch
import sentencepiece as spm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.load("urdu_sp.model")

# Rebuild the BiLSTM model and load the trained weights
model = BiLSTMLanguageModel(vocab_size=sp.get_piece_size(),
                            embed_dim=512,
                            hidden_dim=768,
                            num_layers=3,
                            dropout=0.2).to(device)
model.load_state_dict(torch.load("model_weights.pth", map_location=device))
model.eval()

# Generate poetry from a seed word
start_word = "ishq"  # "love"
generated_poetry = generate_poetry_nucleus(model, sp, start_word,
                                           num_words=12,
                                           temperature=1.2,
                                           top_p=0.85)
print(generated_poetry)
```
## Interactive Demo

An interactive demo is available as a Streamlit application, which provides a user-friendly interface for generating Roman Urdu poetry with adjustable parameters:

[Mehfil-e-Sukhan Demo on HuggingFace Spaces](https://huggingface.co/spaces/zaiffi/Mehfil-e-Sukhan)

The Streamlit app allows users to:

- Enter a starting word or phrase
- Adjust the number of words to generate
- Control the creativity (temperature) and focus (top-p) parameters
- View the formatted poetry output in an elegant interface

## Installation

To set up the model locally:

1. Clone the repository
2. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Open and run `poetry_generation.ipynb` to explore the complete implementation

The required packages include:

- torch
- sentencepiece
- pandas
- scikit-learn
- numpy

## Future Improvements

Potential enhancements include:

1. **Expanded Dataset**: Growing the training data to many thousands of poetry lines for improved diversity and coherence
2. **Transformer Architecture**: Replacing the BiLSTM with a Transformer-based model for better long-range dependencies
3. **Style Control**: Adding mechanisms to control specific poetic styles or meters
4. **Multi-Language Support**: Extending the model to handle both Roman Urdu and Nastaliq script
5. **Fine-Tuning Options**: Exposing more parameters to control generation style and themes

## License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.

## Contact

- LinkedIn: [Muhammad Huzaifa Saqib](https://www.linkedin.com/in/muhammad-huzaifa-saqib-90a1a9324/)
- GitHub: [zaiffishiekh01](https://github.com/zaiffishiekh01)
- Email: [zaiffishiekh@gmail.com](mailto:zaiffishiekh@gmail.com)