Armaggheddon committed on
Commit
6ae0d46
·
verified ·
1 Parent(s): 850f471

Initial commit

Files changed (38)
  1. .gitattributes +15 -0
  2. README.md +168 -3
  3. plots/class_distribution.jpg +3 -0
  4. plots/n_s_m_comparison/box_precision_per_label.png +3 -0
  5. plots/n_s_m_comparison/box_precision_percentage_improvement_per_label.png +0 -0
  6. plots/n_s_m_comparison/map50_95_per_label.png +3 -0
  7. plots/n_s_m_comparison/map50_95_percentage_improvement_per_label.png +0 -0
  8. plots/n_s_m_comparison/map50_per_label.png +0 -0
  9. plots/n_s_m_comparison/map50_percentage_improvement_per_label.png +0 -0
  10. plots/n_s_m_comparison/recall_per_label.png +0 -0
  11. plots/n_s_m_comparison/recall_percentage_improvement_per_label.png +0 -0
  12. plots/yolo11n_best/box_precision_per_label.png +0 -0
  13. plots/yolo11n_best/box_precision_percentage_improvement_per_label.png +0 -0
  14. plots/yolo11n_best/map50_95_per_label.png +0 -0
  15. plots/yolo11n_best/map50_95_percentage_improvement_per_label.png +0 -0
  16. plots/yolo11n_best/map50_per_label.png +0 -0
  17. plots/yolo11n_best/map50_percentage_improvement_per_label.png +0 -0
  18. plots/yolo11n_best/recall_per_label.png +0 -0
  19. plots/yolo11n_best/recall_percentage_improvement_per_label.png +0 -0
  20. plots/yolo11n_scores/box_precision_per_label.png +3 -0
  21. plots/yolo11n_scores/box_precision_percentage_improvement_per_label.png +0 -0
  22. plots/yolo11n_scores/map50_95_per_label.png +3 -0
  23. plots/yolo11n_scores/map50_95_percentage_improvement_per_label.png +0 -0
  24. plots/yolo11n_scores/map50_per_label.png +3 -0
  25. plots/yolo11n_scores/map50_percentage_improvement_per_label.png +0 -0
  26. plots/yolo11n_scores/recall_per_label.png +3 -0
  27. plots/yolo11n_scores/recall_percentage_improvement_per_label.png +0 -0
  28. runs/train4/confusion_matrix_normalized.png +3 -0
  29. runs/train4/results.png +3 -0
  30. runs/train5/confusion_matrix_normalized.png +3 -0
  31. runs/train5/results.png +3 -0
  32. runs/train6/confusion_matrix_normalized.png +3 -0
  33. runs/train6/results.png +3 -0
  34. runs/train9/confusion_matrix_normalized.png +3 -0
  35. runs/train9/results.png +3 -0
  36. yolo11m_doc_layout.pt +3 -0
  37. yolo11n_doc_layout.pt +3 -0
  38. yolo11s_doc_layout.pt +3 -0
.gitattributes CHANGED
@@ -33,3 +33,18 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ plots/class_distribution.jpg filter=lfs diff=lfs merge=lfs -text
37
+ plots/n_s_m_comparison/box_precision_per_label.png filter=lfs diff=lfs merge=lfs -text
38
+ plots/n_s_m_comparison/map50_95_per_label.png filter=lfs diff=lfs merge=lfs -text
39
+ plots/yolo11n_scores/box_precision_per_label.png filter=lfs diff=lfs merge=lfs -text
40
+ plots/yolo11n_scores/map50_95_per_label.png filter=lfs diff=lfs merge=lfs -text
41
+ plots/yolo11n_scores/map50_per_label.png filter=lfs diff=lfs merge=lfs -text
42
+ plots/yolo11n_scores/recall_per_label.png filter=lfs diff=lfs merge=lfs -text
43
+ runs/train4/confusion_matrix_normalized.png filter=lfs diff=lfs merge=lfs -text
44
+ runs/train4/results.png filter=lfs diff=lfs merge=lfs -text
45
+ runs/train5/confusion_matrix_normalized.png filter=lfs diff=lfs merge=lfs -text
46
+ runs/train5/results.png filter=lfs diff=lfs merge=lfs -text
47
+ runs/train6/confusion_matrix_normalized.png filter=lfs diff=lfs merge=lfs -text
48
+ runs/train6/results.png filter=lfs diff=lfs merge=lfs -text
49
+ runs/train9/confusion_matrix_normalized.png filter=lfs diff=lfs merge=lfs -text
50
+ runs/train9/results.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,168 @@
1
- ---
2
- license: mit
3
- ---
1
+ # YOLOv11 Document Layout
2
+
3
+ This page documents a study of document layout analysis using the YOLOv11 model family with the `ultralytics` library. The goal is to train and evaluate YOLO models that accurately detect document elements such as text blocks, tables, and figures. The models were fine-tuned on the DocLayNet dataset, which provides a rich variety of annotated document layouts.
4
+
5
+ The project involved fine-tuning three versions of the YOLOv11 family: the `n` (nano), `s` (small), and `m` (medium) models.
6
+
7
+ The final, recommended model, **yolo11n_doc_layout.pt (train4)**, offers the best balance of speed and localization quality.
8
+
9
+
10
+
11
+ ## 🚀 How to use
12
+
13
+ ### Installation
14
+
15
+ To run the model locally, ensure the necessary libraries are installed:
16
+
17
+ ```bash
18
+ pip install ultralytics huggingface_hub
19
+ ```
20
+
21
+ ### Inference Example
22
+
23
+ This Python snippet demonstrates how to load and run inference on the document layout analysis model:
24
+
25
+ ```python
26
+ from pathlib import Path
27
+ from huggingface_hub import hf_hub_download
28
+ from ultralytics import YOLO
29
+
30
+ DOWNLOAD_PATH = Path(__file__).parent / "models"
31
+
32
+ available_models = [
33
+     "yolo11n_doc_layout.pt",
34
+     "yolo11s_doc_layout.pt",
35
+     "yolo11m_doc_layout.pt",
36
+ ]
37
+
38
+ model_path = hf_hub_download(
39
+     repo_id="Armaggheddon/yolo11-document-layout",
40
+     filename=available_models[0],  # Change index for different models
41
+     repo_type="model",
42
+     local_dir=DOWNLOAD_PATH,
43
+ )
44
+
45
+ # Initialize the model from the downloaded path
46
+ model = YOLO(model_path)
47
+
48
+ # Load an image (replace 'path/to/your/document.jpg' with your file)
49
+ results = model('path/to/your/document.jpg')
50
+
51
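+ # Process and display results (the call returns a list with one Results object per image)
52
+ print(results[0].boxes)  # detected boxes with class ids and confidences
53
+ results[0].show()  # open the annotated image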
54
+ ```
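The detections can also be turned into plain Python records for downstream use. The helper below is a minimal sketch, not part of the repository: `to_records` and `min_conf` are hypothetical names, and it assumes a recent Ultralytics release in which each `Results` object exposes `boxes.xyxy`, `boxes.cls`, `boxes.conf`, and a `names` dictionary. It takes plain lists so it can be tried without loading a model.

```python
def to_records(xyxy, cls_ids, confs, names, min_conf=0.25):
    """Convert raw detection arrays into a list of dicts in rough reading order."""
    records = []
    for (x1, y1, x2, y2), cls_id, conf in zip(xyxy, cls_ids, confs):
        if conf < min_conf:
            continue  # drop low-confidence detections
        records.append({
            "label": names[int(cls_id)],
            "confidence": float(conf),
            "box": (float(x1), float(y1), float(x2), float(y2)),
        })
    # Sort by the top edge (y1), then the left edge (x1)
    records.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return records


# Illustrative values only; the class ids and names are made up for the demo.
demo_names = {0: "Text", 1: "Title"}
demo = to_records(
    xyxy=[[50, 300, 900, 600], [50, 40, 900, 120], [10, 10, 20, 20]],
    cls_ids=[0, 1, 0],
    confs=[0.91, 0.88, 0.10],
    names=demo_names,
)
print([r["label"] for r in demo])
```

With a real prediction, the inputs would likely come from `result.boxes.xyxy.tolist()`, `result.boxes.cls.tolist()`, `result.boxes.conf.tolist()`, and `result.names`.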
55
+
56
+
57
+ ## Dataset Overview: DocLayNet
58
+ The models were trained on the [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet), which contains a diverse collection of document images annotated with various layout elements. The dataset includes the following key labels:
59
+ - **Text:** Regular paragraphs.
60
+ - **Picture:** A graphic or photograph.
61
+ - **Caption:** Special text outside a picture or table that introduces this picture or table.
62
+ - **Section-header:** Any kind of heading in the text, except overall document title.
63
+ - **Footnote:** Typically small text at the bottom of a page, with a number or symbol that is referred to in the text above.
64
+ - **Formula:** Mathematical equation on its own line.
65
+ - **Table:** Material arranged in a grid alignment with rows and columns, often with separator lines.
66
+ - **List-item:** One element of a list, in a hanging shape, i.e., from the second line onwards the paragraph is indented more than the first line.
67
+ - **Page-header:** Repeating elements like page number at the top, outside of the normal text flow.
68
+ - **Page-footer:** Repeating elements like page number at the bottom, outside of the normal text flow.
69
+ - **Title:** Overall title of a document, (almost) exclusively on the first page and typically appearing in large font.
70
+
71
+ The DocLayNet page images are 1025x1025 pixels, so the training resolution was set to 1280x1280 to avoid downscaling them. Initial evaluations also showed that the default 640x640 resolution led to a significant drop in performance, especially for smaller elements like `footnote` and `caption`.
72
+
73
+ More information about the dataset and how it has been created can be found in the [DocLayNet Labeling Guide](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).
74
+
75
+ The dataset labels were mapped to the YOLO format through `doclaynet_to_yolo.py`, ensuring compatibility with the YOLO training pipeline.
76
+ The dataset class distribution is visualized in the image below.
77
+
78
+ ![Class Distribution](plots/class_distribution.jpg)
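The conversion script `doclaynet_to_yolo.py` is not shown on this page. As a rough illustration of the label mapping mentioned above, the sketch below applies the standard conversion from a COCO-style box (`[x_min, y_min, width, height]` in pixels, the format DocLayNet annotations use) to a YOLO label line (class id followed by a normalized center and size). The function names are hypothetical and the actual script may differ.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to
    YOLO (x_center, y_center, width, height), normalized to [0, 1]."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,
        (y_min + h / 2) / img_h,
        w / img_w,
        h / img_h,
    )


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one annotation as a line of a YOLO .txt label file."""
    xc, yc, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Illustrative numbers: a 200x100 box at (100, 50) on a 1000x1000 page
line = yolo_label_line(3, [100, 50, 200, 100], 1000, 1000)
print(line)
```

COCO category ids usually start at 1 while YOLO class ids are zero-based, so a real conversion would typically remap the ids as well.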
79
+ ---
80
+
81
+ ## The Contenders: Models in the Spotlight
82
+
83
+ The study experimented with three core model configurations, varying primarily by size:
84
+
85
+ - **YOLOv11n (train4)**
86
+ - **YOLOv11s (train5)**
87
+ - **YOLOv11m (train6)**
88
+
89
+ In practice, the performance gains from stepping up the model size are marginal, especially given the increased resource demands. This is most evident in the step from the `s` to the `m` model. With all results considered, the `n` (nano) model is the most efficient and effective choice for deployment, providing a good balance of speed and accuracy.
90
+
91
+ ### Training and Evaluation at a Glance
92
+
93
+ The plots below show the core convergence metrics (precision, recall, and mAP) over the course of training. The normalized confusion matrices show how accurately each model distinguishes between the document layout elements; a strong diagonal indicates robust classification. The table also includes `train9`, a second nano run that is compared with `train4` later on this page.
94
+
95
+ | Model | Training Metrics | Normalized Confusion Matrix |
96
+ | :---: | :---: | :---: |
97
+ | **`train4`** | <img src="runs/train4/results.png" alt="train4 results" height="200"> | <img src="runs/train4/confusion_matrix_normalized.png" alt="train4 confusion matrix" height="200"> |
98
+ | **`train5`** | <img src="runs/train5/results.png" alt="train5 results" height="200"> | <img src="runs/train5/confusion_matrix_normalized.png" alt="train5 confusion matrix" height="200"> |
99
+ | **`train6`** | <img src="runs/train6/results.png" alt="train6 results" height="200"> | <img src="runs/train6/confusion_matrix_normalized.png" alt="train6 confusion matrix" height="200"> |
100
+ | **`train9`** | <img src="runs/train9/results.png" alt="train9 results" height="200"> | <img src="runs/train9/confusion_matrix_normalized.png" alt="train9 confusion matrix" height="200"> |
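To make "normalized" concrete, here is a minimal sketch of column normalization. It assumes the layout used in Ultralytics confusion matrix plots, where rows are predicted classes and columns are true classes, so each column sums to 1 after normalization. The counts are invented for the example and `normalize_columns` is a hypothetical helper.

```python
def normalize_columns(matrix):
    """Divide every entry by its column sum so each column adds up to 1."""
    n_cols = len(matrix[0])
    col_sums = [sum(row[c] for row in matrix) for c in range(n_cols)]
    return [
        [row[c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for row in matrix
    ]


# Toy 2-class counts (made up): rows = predicted, columns = true
counts = [
    [90, 5],
    [10, 15],
]
normalized = normalize_columns(counts)
print(normalized)
```

Under this assumption, a diagonal entry is the fraction of ground-truth instances of a class that were predicted as that class.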
101
+
102
+
103
+ ## Results and Performance Showdown
104
+
105
+ ### Nano vs. Small vs. Medium Size Comparison
106
+
107
+ The plots below compare the performance of the three main models across key metrics for each document layout label.
108
+
109
+ | **mAP@50-95** (Strict Accuracy) | **mAP@50** (Standard Accuracy) |
110
+ | :---: | :---: |
111
+ | <img src="plots/n_s_m_comparison/map50_95_per_label.png" alt="mAP@50-95" height="200"> | <img src="plots/n_s_m_comparison/map50_per_label.png" alt="mAP@50" height="200"> |
112
+
113
+ | **Precision** (Box Quality) | **Recall** (Detection Coverage) |
114
+ | :---: | :---: |
115
+ | <img src="plots/n_s_m_comparison/box_precision_per_label.png" alt="Precision" height="200"> | <img src="plots/n_s_m_comparison/recall_per_label.png" alt="Recall" height="200"> |
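For readers less familiar with these metrics, the sketch below shows the two ingredients they build on: Intersection over Union (IoU) between a predicted and a ground-truth box, and precision and recall from match counts. mAP@50 counts a prediction as correct at IoU of at least 0.5, while mAP@50-95 averages over IoU thresholds from 0.5 to 0.95. The helper names and numbers are illustrative only, not taken from the evaluation code.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A prediction that is slightly too tall compared with the ground truth
overlap = iou((0, 0, 100, 100), (0, 0, 100, 80))
print(overlap)  # a match at the 0.5 threshold, but not at 0.95
print(precision_recall(tp=8, fp=2, fn=4))
```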
116
+
117
+ As anticipated, the larger models (`train5` and `train6`) generally achieve higher raw scores thanks to their greater capacity. However, the `train4` nano checkpoint is far smaller (about 5.6 MB, versus about 19.3 MB for `s` and 40.7 MB for `m`), which motivates the detailed analysis of the nano family below.
118
+
119
+ ### 🔬 In-Depth Analysis: The Nano Model Family Performance (`YOLOv11n`)
120
+
121
+ The nano models (`yolo11n`) are the most suitable candidates for real-world deployment due to their minimal resource consumption. This deeper analysis was conducted to find the optimal balance of speed and accuracy within this efficient family.
122
+
123
+ | **mAP@50-95** (Strict Accuracy) | **mAP@50** (Standard Accuracy) |
124
+ | :---: | :---: |
125
+ | <img src="plots/yolo11n_scores/map50_95_per_label.png" alt="mAP@50-95" height="200"> | <img src="plots/yolo11n_scores/map50_per_label.png" alt="mAP@50" height="200"> |
126
+
127
+ | **Precision** (Box Quality) | **Recall** (Detection Coverage) |
128
+ | :---: | :---: |
129
+ | <img src="plots/yolo11n_scores/box_precision_per_label.png" alt="Precision" height="200"> | <img src="plots/yolo11n_scores/recall_per_label.png" alt="Recall" height="200"> |
130
+
131
+ #### Justification for `train4` and `train9` Selection
132
+
133
+ The iterations **`train4` (yolo11n.4)** and **`train9` (yolo11n.9)** were selected for direct comparison because they represent two distinct, near-peak optimization strategies within the nano family:
134
+
135
+ * **`train9` (Highest Average Performer):** This iteration consistently exhibits the highest or near-highest scores across general categories, particularly in mAP50 and Recall (pink line). It represents the model that achieved the highest overall score based on simple optimization criteria.
136
+ * **`train4` (Localization Integrity Champion):** This iteration (red line) shows exceptional strength in specific, critical areas relating to bounding box quality. It directly competes with `train9` in Box Precision across several labels (e.g., `section-header`, `table`), suggesting a superior focus on accurate localization.
137
+
138
+ ---
139
+
140
+ ### The `train4` vs. `train9` Showdown: Quality Over Quantity
141
+
142
+ Although both nano models converged to nearly identical overall mAP scores and `train9` displayed a smoother training curve, **`train4` ultimately proved to be the better choice for production** due to its focus on localization accuracy.
143
+
144
+ The `train9` model's optimization path prioritized **detection coverage (Recall)**, which often sacrifices high-quality object boundaries, making it less reliable for tasks requiring data integrity.
145
+
146
+ The justification for selecting `train4` is rooted in its substantial gains in key quality metrics:
147
+
148
+ 1. **Superior Box Precision:** `train4` delivered highly accurate bounding boxes, evidenced by an improvement of over **9.0%** in Box Precision for the `title` category, and strong gains in `section-header` and `table`.
149
+ 2. **Maximized mAP Quality:** `train4` achieved a 2.4% improvement in mAP50 and a 2.05% improvement in mAP50-95 for the challenging `footnote` element. This demonstrates `train4`'s superior capability in reaching high Intersection over Union (IoU) quality thresholds.
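The page does not state how the percentage improvements were computed. One plausible reading, sketched here as an assumption, is a relative change of one run's per-label score over the other's. The function name and the scores below are made up for illustration and are not the values measured in this study.

```python
def percentage_improvement(candidate, baseline):
    """Relative change of candidate over baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Made-up per-label scores, NOT the values measured in this study
scores_a = {"title": 0.88, "footnote": 0.70}
scores_b = {"title": 0.80, "footnote": 0.70}
improvement = {
    label: round(percentage_improvement(scores_a[label], scores_b[label]), 2)
    for label in scores_a
}
print(improvement)
```

If the plots instead report absolute differences in percentage points, the formula would simply be `(candidate - baseline) * 100`.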
150
+
151
+ | Box Precision Improvement | mAP50 Improvement | mAP50-95 Improvement |
152
+ | :---: | :---: | :---: |
153
+ | <img src="plots/yolo11n_best/box_precision_percentage_improvement_per_label.png" alt="Box Precision Improvement"> | <img src="plots/yolo11n_best/map50_percentage_improvement_per_label.png" alt="mAP50 Improvement"> | <img src="plots/yolo11n_best/map50_95_percentage_improvement_per_label.png" alt="mAP50-95 Improvement"> |
154
+
155
+ In essence, **`train9` traded bounding box quality for detection quantity.** For a robust production model that outputs accurately located data, the high localization precision of **`train4`** makes it the unequivocally optimal and reliable choice.
156
+
157
+ ---
158
+
159
+ ## Summary
160
+
161
+ This project successfully demonstrates the advanced capabilities of YOLOv11 for document layout analysis. While larger models offer higher raw accuracy, the YOLOv11n model (`train4`) stands out, providing an excellent compromise between performance and efficiency. The detailed analysis underscores the critical importance of prioritizing **localization accuracy (precision)** over sheer detection coverage (recall) when deploying models for mission-critical data extraction tasks.
162
+
163
+ ---
164
+
165
+ ## More details, code, and examples
166
+ For full training scripts, dataset conversion utilities, and end-to-end examples, see the GitHub repository:
167
+
168
+ https://github.com/Armaggheddon/yolo11_doc_layout
plots/class_distribution.jpg ADDED

Git LFS Details

  • SHA256: b05c46b26d19fdd897bd1ee5414554ac1d68ed46f6cebac7b0b00e93b50e3d64
  • Pointer size: 131 Bytes
  • Size of remote file: 153 kB
plots/n_s_m_comparison/box_precision_per_label.png ADDED

Git LFS Details

  • SHA256: 0a1c27bbfb9276545f27cb427bab3d4b6d62e53a02cefb64717fd0e736538d1f
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
plots/n_s_m_comparison/box_precision_percentage_improvement_per_label.png ADDED
plots/n_s_m_comparison/map50_95_per_label.png ADDED

Git LFS Details

  • SHA256: 6c33b646bd04be3b54af5bddd32f507cd5f1640a448e7c8c016258a1d5c480b1
  • Pointer size: 131 Bytes
  • Size of remote file: 108 kB
plots/n_s_m_comparison/map50_95_percentage_improvement_per_label.png ADDED
plots/n_s_m_comparison/map50_per_label.png ADDED
plots/n_s_m_comparison/map50_percentage_improvement_per_label.png ADDED
plots/n_s_m_comparison/recall_per_label.png ADDED
plots/n_s_m_comparison/recall_percentage_improvement_per_label.png ADDED
plots/yolo11n_best/box_precision_per_label.png ADDED
plots/yolo11n_best/box_precision_percentage_improvement_per_label.png ADDED
plots/yolo11n_best/map50_95_per_label.png ADDED
plots/yolo11n_best/map50_95_percentage_improvement_per_label.png ADDED
plots/yolo11n_best/map50_per_label.png ADDED
plots/yolo11n_best/map50_percentage_improvement_per_label.png ADDED
plots/yolo11n_best/recall_per_label.png ADDED
plots/yolo11n_best/recall_percentage_improvement_per_label.png ADDED
plots/yolo11n_scores/box_precision_per_label.png ADDED

Git LFS Details

  • SHA256: 60720ea9e645af63add50aedc78999b4e1e1fdbd3027d8a346753dc4873ee0fc
  • Pointer size: 131 Bytes
  • Size of remote file: 157 kB
plots/yolo11n_scores/box_precision_percentage_improvement_per_label.png ADDED
plots/yolo11n_scores/map50_95_per_label.png ADDED

Git LFS Details

  • SHA256: 64304305ef1c2e558dbbccb56137aa9a5c4249a64bd15702b25012bbf5548587
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB
plots/yolo11n_scores/map50_95_percentage_improvement_per_label.png ADDED
plots/yolo11n_scores/map50_per_label.png ADDED

Git LFS Details

  • SHA256: c746d16b12eae87b999bcbcf7c2bcdc8a22498b1c9b54046d325eeee4644653e
  • Pointer size: 131 Bytes
  • Size of remote file: 133 kB
plots/yolo11n_scores/map50_percentage_improvement_per_label.png ADDED
plots/yolo11n_scores/recall_per_label.png ADDED

Git LFS Details

  • SHA256: e1a0304ffc15600759f9b40bf41b94bf82f29a4ecf84f57264c0311d10ae5450
  • Pointer size: 131 Bytes
  • Size of remote file: 130 kB
plots/yolo11n_scores/recall_percentage_improvement_per_label.png ADDED
runs/train4/confusion_matrix_normalized.png ADDED

Git LFS Details

  • SHA256: b99d387fdf158052d64c7c6e35749176514bd7b58cec529460e90f081e9085eb
  • Pointer size: 131 Bytes
  • Size of remote file: 280 kB
runs/train4/results.png ADDED

Git LFS Details

  • SHA256: eaeb323887ee428f6cf53a13c8b7f6d662038c0d871d01369a19d96558acae33
  • Pointer size: 131 Bytes
  • Size of remote file: 253 kB
runs/train5/confusion_matrix_normalized.png ADDED

Git LFS Details

  • SHA256: 61079e518bff592097e0cc5c57d0ab1d6291d335da38abb0cd820ebb9261e513
  • Pointer size: 131 Bytes
  • Size of remote file: 284 kB
runs/train5/results.png ADDED

Git LFS Details

  • SHA256: 676c62b5dcf31ad8b8efeab952af059180621b71f1f954c47746d2ac861c78d3
  • Pointer size: 131 Bytes
  • Size of remote file: 262 kB
runs/train6/confusion_matrix_normalized.png ADDED

Git LFS Details

  • SHA256: e72a99b4df5acd74c505e311c2ba572249ce5f92accfadfad34f3149669a09ac
  • Pointer size: 131 Bytes
  • Size of remote file: 285 kB
runs/train6/results.png ADDED

Git LFS Details

  • SHA256: d3c05e1fb41c9e17d64bcdcb709284ee2de2af712aad65080311ccc31a432a51
  • Pointer size: 131 Bytes
  • Size of remote file: 238 kB
runs/train9/confusion_matrix_normalized.png ADDED

Git LFS Details

  • SHA256: 921338f979ccd98bd8d0f0a0ceba24dc1b73a4972b03efbd3e936d962bbc5518
  • Pointer size: 131 Bytes
  • Size of remote file: 284 kB
runs/train9/results.png ADDED

Git LFS Details

  • SHA256: 9653f685b0543e583dccf4040157fdb1fc1c911bbc81a63047141b78889f7528
  • Pointer size: 131 Bytes
  • Size of remote file: 264 kB
yolo11m_doc_layout.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:258c36b6c7e2d96088221aa2d85222cea4b9b9438405708450516cc6d6661c38
3
+ size 40684588
yolo11n_doc_layout.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3629fc7abe8cca55ff490e16cccad7a100cbd814881163258815513e0a37881f
3
+ size 5630426
yolo11s_doc_layout.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b52e5d8775b2b6f0463ddab52722b50e4035576bcd5f6f64409e46d051f7c94b
3
+ size 19339482