zaiffi committed
Commit dabc67a · 1 parent: 2480fdf

Create README.md with comprehensive documentation and add project images

Files changed (4):
1. README.md (+190 −3)
2. images/demo.png (added)
3. images/mainlayout1.png (added)
4. images/mainlayout2.png (added)

README.md CHANGED
@@ -1,3 +1,190 @@
- ---
- license: apache-2.0
- ---
# Mehfil-e-Sukhan: Har Lafz Ek Mehfil

![Main Layout 1](images/mainlayout1.png)
![Main Layout 2](images/mainlayout2.png)

## Roman Urdu Poetry Generation Model

A bidirectional LSTM neural network for generating Roman Urdu poetry, trained on a curated dataset of Urdu poetry in Latin script.

## Table of Contents

- [Overview](#overview)
- [Repository Structure](#repository-structure)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Data Processing](#data-processing)
- [Training Methodology](#training-methodology)
- [Text Generation Process](#text-generation-process)
- [Results and Performance](#results-and-performance)
- [Usage](#usage)
- [Interactive Demo](#interactive-demo)
- [Installation](#installation)
- [Future Improvements](#future-improvements)
- [License](#license)
- [Contact](#contact)

## Overview

Mehfil-e-Sukhan (meaning "Poetry Gathering" in Urdu) is a natural language generation model designed for Roman Urdu poetry creation. This repository contains the complete implementation: data preprocessing, tokenization, model architecture, training code, and inference utilities.

The model uses a bidirectional LSTM architecture trained on approximately 1,300 lines of Roman Urdu poetry to learn the patterns, rhythms, and stylistic elements of Urdu poetry written in Latin script.

## Repository Structure

The repository contains the following key files:

- `poetry_generation.ipynb`: Complete notebook with data preparation, model definition, training code, and generation utilities
- `model_weights.pth`: Trained model weights (243 MB)
- `urdu_sp.model`: SentencePiece tokenizer model (429 KB)
- `urdu_sp.vocab`: SentencePiece vocabulary file (181 KB)
- `all_texts.txt`: Preprocessed dataset used for training (869 KB)
- `requirements.txt`: Required Python packages
- `.gitattributes`: Git LFS tracking for large files

## Model Architecture

The poetry generation model uses a bidirectional LSTM:

- **Embedding Layer**: 512-dimensional embeddings
- **BiLSTM Layers**: 3 stacked bidirectional LSTM layers with 768 hidden units per direction
- **Dropout**: 0.2 dropout rate for regularization
- **Output Layer**: Linear projection to the vocabulary size (12,000 tokens)

This architecture was chosen to capture both preceding and following context in poetry lines, which helps maintain coherence and style in the generated text.
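In PyTorch, the layers above might be wired as follows. This is an illustrative reconstruction from the hyperparameters listed here, not the notebook's exact class, and details such as state handling may differ:

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Sketch of the architecture described above (hypothetical reconstruction)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=768,
                 num_layers=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, bidirectional=True, batch_first=True)
        # Forward and backward hidden states are concatenated: 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq, 2 * hidden_dim)
        return self.fc(outputs)                # (batch, seq, vocab_size)
```

With the 12,000-token vocabulary, the final projection alone holds 2 × 768 × 12,000 ≈ 18.4 M weights, consistent with the sizable `model_weights.pth` file.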
## Dataset

The model is trained on the Roman Urdu Poetry dataset, which contains approximately 1,300 lines of Urdu poetry written in Latin script (Roman Urdu). The dataset includes works from various poets and covers a range of poetic styles and themes.

Dataset source: [Roman Urdu Poetry Dataset on Kaggle](https://www.kaggle.com/datasets/mianahmadhasan/roman-urdu-poetry-csv)

## Data Processing

Raw poetry lines undergo several preprocessing steps:

1. **Diacritic Removal**: Unicode diacritics are normalized and removed
2. **Text Cleaning**: Excessive punctuation, symbols, and repeated spaces are eliminated
3. **Tokenization**: SentencePiece BPE (Byte Pair Encoding) tokenization with a vocabulary size of 12,000

This tokenization approach lets the model handle out-of-vocabulary words by breaking them into subword units, which is particularly important for Roman Urdu, where spelling variations are common.
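The first two cleaning steps can be sketched with the standard library; the exact regular expressions used in the notebook may differ, so treat this as illustrative:

```python
import re
import unicodedata

def clean_line(text: str) -> str:
    """Remove diacritics, stray symbols, and repeated spaces (illustrative)."""
    # NFKD decomposition splits accented letters into base char + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    no_diacritics = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that is not a word character, whitespace, or basic punctuation
    no_symbols = re.sub(r"[^\w\s,.!?'-]", " ", no_diacritics)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", no_symbols).strip()

print(clean_line("dil-e-nādāñ   tujhe huā kyā hai?"))
# dil-e-nadan tujhe hua kya hai?
```

The cleaned lines would then be fed to SentencePiece BPE training to produce `urdu_sp.model` and `urdu_sp.vocab`.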
## Training Methodology

The model was trained with the following parameters:

- **Train/Validation/Test Split**: 80% / 10% / 10%
- **Loss Function**: Cross-entropy with `ignore_index` set to the padding token
- **Optimizer**: Adam with learning rate 1e-3 and weight decay 1e-5
- **Learning Rate Schedule**: StepLR with step size 2 and gamma 0.5
- **Gradient Clipping**: Maximum norm of 5.0
- **Epochs**: 10 (sufficient for convergence at this dataset size)
- **Batch Size**: 64

Training runs on both CPU and GPU environments, with automatic device detection.
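Taken together, these settings translate into a loop along the following lines. This is a condensed sketch (the notebook's version also evaluates on the validation split each epoch); `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")` provides the automatic device detection mentioned above:

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, pad_id, device, epochs=10):
    """Minimal training loop using the hyperparameters listed above (sketch)."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # skip padding positions
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    model.to(device).train()
    for epoch in range(epochs):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)  # (batch, seq, vocab_size)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             targets.reshape(-1))
            loss.backward()
            # Clip exploding gradients, a common issue with stacked LSTMs
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step()  # halves the learning rate every 2 epochs
    return model
```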
## Text Generation Process

Poetry generation uses nucleus (top-p) sampling with adjustable parameters:

- **Temperature**: Controls randomness in token selection (default: 1.2)
- **Top-p (nucleus) sampling**: Limits selection to the smallest set of tokens whose cumulative probability exceeds the threshold (default: 0.85)
- **Formatting**: Output is automatically wrapped at 6 words per line for aesthetic presentation

This sampling approach balances creativity and coherence, allowing controlled variation in the output.
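A minimal version of the sampling step and the 6-words-per-line formatting might look like this (`nucleus_sample` and `format_poem` are illustrative names, not the notebook's functions):

```python
import torch

def nucleus_sample(logits, temperature=1.2, top_p=0.85):
    """Sample the next token id from the smallest set of tokens whose
    cumulative probability exceeds top_p (illustrative sketch)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first one that crosses the threshold
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(top_p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return sorted_ids[torch.multinomial(kept, 1)].item()

def format_poem(words, per_line=6):
    """Wrap a flat list of words at a fixed count per line."""
    lines = [" ".join(words[i:i + per_line])
             for i in range(0, len(words), per_line)]
    return "\n".join(lines)
```

Lowering `top_p` or `temperature` makes the output more deterministic; at very small `top_p` only the single most probable token survives the cutoff.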
![Demo Screenshot](images/demo.png)

## Results and Performance

The final model reaches a test loss of approximately 3.17, which is reasonable given the dataset size. The model demonstrates the ability to:

- Generate contextually relevant continuations from a seed word
- Maintain some aspects of Urdu poetic style in Roman script
- Produce text with thematic consistency

The limited dataset size (about 1,300 lines) does cause some repetitiveness in longer generations, which additional training data would likely reduce.

## Usage

To use the model for generating poetry:
```python
# BiLSTMLanguageModel and generate_poetry_nucleus are defined in the notebook
import torch
import sentencepiece as spm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.load("urdu_sp.model")

# Rebuild the BiLSTM model and load the trained weights
model = BiLSTMLanguageModel(vocab_size=sp.get_piece_size(),
                            embed_dim=512,
                            hidden_dim=768,
                            num_layers=3,
                            dropout=0.2).to(device)
model.load_state_dict(torch.load("model_weights.pth", map_location=device))
model.eval()

# Generate poetry from a seed word
start_word = "ishq"  # "love"
generated_poetry = generate_poetry_nucleus(model, sp, start_word,
                                           num_words=12,
                                           temperature=1.2,
                                           top_p=0.85)
print(generated_poetry)
```
## Interactive Demo

An interactive demo is available as a Streamlit application, which provides a user-friendly interface for generating Roman Urdu poetry with adjustable parameters:

[Mehfil-e-Sukhan Demo on HuggingFace Spaces](https://huggingface.co/spaces/zaiffi/Mehfil-e-Sukhan)

The Streamlit app allows users to:

- Enter a starting word or phrase
- Adjust the number of words to generate
- Control the creativity (temperature) and focus (top-p) parameters
- View the formatted poetry output in an elegant interface

## Installation

To set up the model locally:

1. Clone the repository
2. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Open and run `poetry_generation.ipynb` to explore the complete implementation

The required packages include:

- torch
- sentencepiece
- pandas
- scikit-learn
- numpy

## Future Improvements

Potential enhancements include:

1. **Expanded Dataset**: Growing the training data to many thousands of poetry lines for improved diversity and coherence
2. **Transformer Architecture**: Replacing the BiLSTM with a Transformer-based model for better long-range dependencies
3. **Style Control**: Adding mechanisms to control specific poetic styles or meters
4. **Multi-Language Support**: Extending the model to handle both Roman Urdu and Nastaliq script
5. **Fine-Tuning Options**: Exposing more parameters to control generation style and themes

## License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.

## Contact

- LinkedIn: [Muhammad Huzaifa Saqib](https://www.linkedin.com/in/muhammad-huzaifa-saqib-90a1a9324/)
- GitHub: [zaiffishiekh01](https://github.com/zaiffishiekh01)
- Email: [zaiffishiekh@gmail.com](mailto:zaiffishiekh@gmail.com)