---
library_name: transformers
tags:
- page
- classification
- timm
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using fine-tuned ViT - for sorting historical :bowtie: documents

### Goal: solve the task of sorting archival page images (for their further content-based processing)

**Scope:** Processing of images, training and evaluation of the ViT model, input file/directory processing, output of class 🏷️ (category) results for the top-N predictions, summarizing predictions into a tabular format, HF 😊 hub support for the model
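As a quick start, here is a minimal sketch of loading the default model from the HF 😊 hub [^1] 🔗 and printing its top-3 category 🏷️ predictions for a single page image; the file name `page.png` is a placeholder for your own input:

```python
from transformers import pipeline

# Load the fine-tuned page classifier from the HF hub.
classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Print the top-3 category labels with their confidence scores.
for pred in classifier("page.png", top_k=3):
    print(f"{pred['label']}: {pred['score']:.4f}")
```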
## Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories, but they differ in data annotations and base models. The latest `v5.3` is considered to be the default and can be found in the `main` branch of the HF 😊 hub [^1] 🔗

| Version | Base                             | Pages | PDFs      | Description                                                                          |
|--------:|----------------------------------|:-----:|:---------:|:-------------------------------------------------------------------------------------|
| `v2.0`  | `vit-base-patch16-224`           | 10073 | **3896**  | annotations with mistakes, more heterogeneous data                                   |
| `v2.1`  | `vit-base-patch16-224`           | 11940 | **5002**  | more diverse pages in each category, fewer annotation mistakes                       |
| `v2.2`  | `vit-base-patch16-224`           | 15855 | **5730**  | same data as `v2.1` + some restored pages from `v2.0`                                |
| `v3.2`  | `vit-base-patch16-384`           | 15855 | **5730**  | same data as `v2.2`, but a bit larger model base with higher resolution              |
| `v5.2`  | `vit-large-patch16-384`          | 15855 | **5730**  | same data as `v2.2`, but the largest model base with higher resolution               |
| `v1.2`  | `efficientnetv2_s.in21k`         | 15855 | **5730**  | same data as `v2.2`, but the smallest model base (CNN)                               |
| `v4.2`  | `efficientnetv2_l.in21k_ft_in1k` | 15855 | **5730**  | same data as `v2.2`, CNN base model smaller than the largest, may be more accurate   |
| `v2.3`  | `vit-base-patch16-224`           | 38625 | **37328** | data from the new annotation phase, more single-page documents used, transformer model |
| `v3.3`  | `vit-base-patch16-384`           | 38625 | **37328** | same data as `v2.3`, but a bit larger model base with higher resolution              |
| `v5.3`  | `vit-large-patch16-384`          | 38625 | **37328** | same data as `v2.3`, but the largest model base with higher resolution               |
| `v1.3`  | `efficientnetv2_m.in21k_ft_in1k` | 38625 | **37328** | same data as `v2.3`, but the smallest model base (CNN)                               |
| `v4.3`  | `regnety_160.swag_ft_in1k`       | 38625 | **37328** | same data as `v2.3`, CNN base model bigger than the smallest, may be more accurate   |

The base models compare as follows in size and input resolution:

| Base model                       | Parameters (M) | Resolution (px) | Revisions |
|----------------------------------|----------------|-----------------|-----------|
| `efficientnetv2_s.in21k`         | 48             | 300             | v1.2      |
| `efficientnetv2_m.in21k_ft_in1k` | 54             | 384             | v1.3      |
| `vit-base-patch16-224`           | 87             | 224             | v2.X      |
| `vit-base-patch16-384`           | 87             | 384             | v3.X      |
| `regnety_160.swag_ft_in1k`       | 84             | 224             | v4.3      |
| `vit-large-patch16-384`          | 305            | 384             | v5.X      |
| `regnety_640.seer`               | 281            | 384             | v6.3      |

The benchmark below compares the fine-tuned base models on the `vX.3` data; the `max_cat` column gives the per-category page cap applied to the training data, and the `Fold` column indicates the cross-validation fold that achieved the best score:

| Base Model                                 | Revision | max_cat | Best_Prec (%) | Best_Acc (%) | Fold | Note         |
|--------------------------------------------|----------|---------|---------------|--------------|------|--------------|
| **google/vit-base-patch16-224**            | **v2.3** | 14,000  | **98.79**     | **98.79**    | 5    | OK & Small   |
| **google/vit-base-patch16-384**            | **v3.3** | 14,000  | **98.92**     | **98.92**    | 2    | Good & Small |
| **google/vit-large-patch16-384**           | **v5.3** | 14,000  | **99.12**     | **99.12**    | 2    | Best & Large |
| microsoft/dit-base-finetuned-rvlcdip       | v9.3     | 14,000  | 98.71         | 98.72        | 3    |              |
| microsoft/dit-large-finetuned-rvlcdip      | v10.3    | 14,000  | 98.66         | 98.66        | 3    |              |
| microsoft/dit-large                        | v11.3    | 14,000  | 98.53         | 98.53        | 2    |              |
| timm/regnety_120.sw_in12k_ft_in1k          | v12.3    | 14,000  | 98.29         | 98.29        | 3    |              |
| **timm/regnety_160.swag_ft_in1k**          | **v4.3** | 14,000  | **99.17**     | **99.16**    | 1    | Best & Small |
| timm/regnety_640.seer                      | v6.3     | 14,000  | 98.79         | 98.79        | 5    | OK & Large   |
| timm/tf_efficientnetv2_l.in21k_ft_in1k     | v8.3     | 14,000  | 98.62         | 98.62        | 5    |              |
| **timm/tf_efficientnetv2_m.in21k_ft_in1k** | **v1.3** | 14,000  | **98.83**     | **98.83**    | 1    | Good & Small |
| timm/tf_efficientnetv2_s.in21k             | v7.3     | 14,000  | 97.90         | 97.87        | 1    |              |

## Model description 📇

![architecture.png](https://github.com/ufal/atrium-page-classification/blob/vit/architecture.png?raw=true)

🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

🔳 **Base** model repositories:

- Google's **vit-base-patch16-224**, **vit-base-patch16-384**, and **vit-large-patch16-384** [^2] [^6] [^7] 🔗
- timm's **regnety_160.swag_ft_in1k**, **efficientnetv2_s.in21k**, **efficientnetv2_m.in21k_ft_in1k**, and **efficientnetv2_l.in21k_ft_in1k** [^12] [^8] [^11] [^9] 🔗

### Data 📜

The dataset is provided under the Public Domain license and consists of **48,499** PNG images of pages from **37,328** archival documents. The source image files and their annotations can be found in the LINDAT repository [^10] 🔗.

Manual ✍️ annotation was performed beforehand and took some time ⌛. The categories 🪧 tabulated below were formed from different sources of archival documents originating from the years 1920-2020.

| Category        | Dataset 0   | Dataset 1    | Dataset 2    | Dataset 3     |
|-----------------|-------------|--------------|--------------|---------------|
| DRAW            | 1090 (9.1%) | 1368 (8.8%)  | 1472 (9.3%)  | 2709 (5.6%)   |
| DRAW_L          | 1091 (9.1%) | 1383 (8.9%)  | 1402 (8.8%)  | 2921 (6.0%)   |
| LINE_HW         | 1055 (8.8%) | 1113 (7.2%)  | 1115 (7.0%)  | 2514 (5.2%)   |
| LINE_P          | 1092 (9.1%) | 1540 (9.9%)  | 1580 (10.0%) | 2439 (5.0%)   |
| LINE_T          | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%)  |
| PHOTO           | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%)   |
| PHOTO_L         | 1087 (9.1%) | 1087 (7.0%)  | 1088 (6.9%)  | 2830 (5.8%)   |
| TEXT            | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW         | 1091 (9.1%) | 1092 (7.1%)  | 1092 (6.9%)  | 2008 (4.1%)   |
| TEXT_P          | 1083 (9.1%) | 1540 (9.9%)  | 1633 (10.3%) | 2312 (4.8%)   |
| TEXT_T          | 1081 (9.1%) | 1476 (9.5%)  | 1482 (9.3%)  | 3965 (8.2%)   |
| **Unique PDFs** | 5001        | 5694         | 5729         | 37328         |
| **Total Pages** | 11,940      | 15,482       | 15,854       | 48,499        |

The table above shows the category distribution for different model versions; the last column (`Dataset 3`) corresponds to the data of the latest `vX.3` models, which actually used only 14,000 pages of the `TEXT` category (the `max_cat` cap above), while the other columns cover all of the used samples. In each case, 80% of the samples served as the training 💪 set and 10% each as the development and test 🏆 sets. Early model versions used 90% of the data for training 💪 and the remaining 10% as both the development and test 🏆 set, due to the lack of annotated (manually classified) pages.

> [!NOTE]
> The disproportion of the categories 🪧 in both training and evaluation data is
> **NOT** intentional, but rather a result of the source data nature.
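For illustration, a minimal sketch of producing such an 80/10/10 split; the stratification by category and the synthetic path/label lists are assumptions here, not the exact procedure used:

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one path and one category label per page image;
# in practice these come from the manual annotation files.
paths = [f"page_{i:04d}.png" for i in range(1000)]
labels = [("TEXT", "DRAW", "PHOTO", "LINE_P")[i % 4] for i in range(1000)]

# 80% for training, then split the remaining 20% in half for dev and test,
# stratifying by category so each split keeps the label distribution.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```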
Training set sizes across model versions:

- **8950** images for `v2.0`
- **10745** images for `v2.1`
- **14565** images for `vX.2`
- **38625** images for `vX.3`

Plus, the evaluation sets:

- **1586** images (taken from the `v2.2` annotations)
- **4823** images (for the `vX.3` models)

### Categories 🏷️

|     Label | Description                                                                                                        |
|----------:|:-------------------------------------------------------------------------------------------------------------------|
| `DRAW`    | **📈 - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions**   |
| `DRAW_L`  | **📈📏 - drawings, etc., but presented within a table-like layout or including a legend formatted as a table**     |
| `LINE_HW` | **✏️📏 - handwritten text organized in a tabular or form-like structure**                                          |
| `LINE_P`  | **📏 - printed text organized in a tabular or form-like structure**                                                |
| `LINE_T`  | **📏 - machine-typed text organized in a tabular or form-like structure**                                          |
| `PHOTO`   | **🌄 - photographs or photographic cutouts, potentially with text captions**                                       |
| `PHOTO_L` | **🌄📏 - photos presented within a table-like layout or accompanied by tabular annotations**                       |
| `TEXT`    | **📰 - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements**            |
| `TEXT_HW` | **✏️📄 - only handwritten text in paragraph or block form (non-tabular)**                                          |
| `TEXT_P`  | **📄 - only printed text in paragraph or block form (non-tabular)**                                                |
| `TEXT_T`  | **📄 - only machine-typed text in paragraph or block form (non-tabular)**                                          |

![dataset_timeline.png](https://github.com/ufal/atrium-page-classification/blob/vit/dataset_timeline.png?raw=true)

#### Data preprocessing

During training, the following transforms were each applied randomly with a 50% chance (see the runnable sketch after this list):

* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
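A minimal sketch of combining these augmentations with `torchvision`, where the independent 50% chance is expressed via `transforms.RandomApply` (an assumption about how the probability was applied):

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation fires independently with a 50% chance.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    # Random sharpness factor in [0.5, 1.5]: below 1.0 softens, above 1.0 sharpens.
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    # Gaussian blur with a random radius of up to 2 px.
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```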
### Training Hyperparameters

The models were fine-tuned with the following settings (a mapping onto code is sketched after this list):

* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
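These names match the 🤗 `transformers.TrainingArguments` fields, so a minimal fine-tuning sketch could look as follows; the `train_dataset`, `eval_dataset`, and `compute_metrics` objects are assumed to be prepared elsewhere, and the base checkpoint here is just one of the bases listed above:

```python
from transformers import AutoModelForImageClassification, Trainer, TrainingArguments

# The 11 page categories described above.
labels = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T", "PHOTO",
          "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for an 11-class one
)

training_args = TrainingArguments(
    output_dir="vit-historical-page",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # assumed: preprocessed training images
    eval_dataset=eval_dataset,        # assumed: held-out development set
    compute_metrics=compute_metrics,  # assumed: accuracy computation
)
trainer.train()
```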
### Results 📊

| **Revision** | **Top-1 (%)** | **Top-3 (%)** |
|--------------|---------------|---------------|
| `v1.2`       | 97.73         | 99.87         |
| `v2.2`       | 97.54         | 99.94         |
| `v3.2`       | 96.49         | 99.94         |
| `v4.2`       | 97.73         | 99.87         |
| `v5.2`       | 97.86         | 99.87         |
| `v1.3`       | 96.81         | 99.78         |
| `v2.3`       | 98.79         | 99.96         |
| `v3.3`       | 98.92         | 99.98         |
| `v4.3`       | 98.92         | **100.0**     |
| `v5.3`       | **99.12**     | 99.94         |
| `v6.3`       | 98.79         | 99.94         |

**v2.2** Evaluation set accuracy (**Top-1**): **97.54%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1136_model_v220105p_conf_mat_TOP-1.png?raw=true)

**v3.2** Evaluation set accuracy (**Top-1**): **96.49%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1142_model_v320105p_conf_mat_TOP-1.png?raw=true)

**v5.2** Evaluation set accuracy (**Top-1**): **97.73%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1203_model_v520105p_conf_mat_TOP-1.png?raw=true)

**v1.2** Evaluation set accuracy (**Top-1**): **97.73%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250709-1831_model_v120106s_conf_mat_TOP-1.png?raw=true)

**v4.2** Evaluation set accuracy (**Top-1**): **97.86%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250709-1829_model_v120106l_conf_mat_TOP-1.png?raw=true)

**v1.3** Evaluation set accuracy (**Top-1**): **98.83%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1835_model_v13_conf_mat_TOP-1.png?raw=true)

**v2.3** Evaluation set accuracy (**Top-1**): **98.79%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1841_model_v23_conf_mat_TOP-1.png?raw=true)

**v3.3** Evaluation set accuracy (**Top-1**): **98.92%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1849_model_v33_conf_mat_TOP-1.png?raw=true)

**v4.3** Evaluation set accuracy (**Top-1**): **98.16%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1856_model_v43_conf_mat_TOP-1.png?raw=true)

**v5.3** Evaluation set accuracy (**Top-1**): **99.12%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1905_model_v53_conf_mat_TOP-1.png?raw=true)

**v6.3** Evaluation set accuracy (**Top-1**): **98.79%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1913_model_v63_conf_mat_TOP-1.png?raw=true)

#### Result tables

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v220105p_TOP-1_EVAL.csv) 🔗
- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1925_model_v220105p_TOP-3_EVAL.csv) 🔗
- **v3.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v320105p_TOP-1_EVAL.csv) 🔗
- **v3.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1927_model_v320105p_TOP-3_EVAL.csv) 🔗
- **v5.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v520105p_TOP-1_EVAL.csv) 🔗
- **v5.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1928_model_v520105p_TOP-3_EVAL.csv) 🔗
- **v1.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250709-1825_model_v120106s_TOP-1_EVAL.csv) 🔗
- **v1.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1924_model_v120106s_TOP-3_EVAL.csv) 🔗
- **v4.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250709-1823_model_v120106l_TOP-1_EVAL.csv) 🔗
- **v4.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1921_model_v120106l_TOP-3_EVAL.csv) 🔗
- **v1.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1825_5449_model_v13_TOP-1_EVAL.csv) 📎
- **v2.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1835_5449_model_v23_TOP-1_EVAL.csv) 📎
- **v3.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1841_5449_model_v33_TOP-1_EVAL.csv) 📎
- **v4.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1849_5449_model_v43_TOP-1_EVAL.csv) 📎
- **v5.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1856_5449_model_v53_TOP-1_EVAL.csv) 📎
- **v6.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1906_5449_model_v63_TOP-1_EVAL.csv) 📎

#### Table columns

The result tables use the following columns (a sketch of producing such a table follows the list):

- **FILE** - name of the file
- **PAGE** - number of the page
- **CLASS-N** - label of the TOP-N guessed category 🏷️
- **SCORE-N** - confidence score of the TOP-N guessed category 🏷️
- **TRUE** - actual label of the category 🏷️
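A minimal sketch of summarizing top-3 predictions into this tabular format; the `files` list with its page numbers is hypothetical, and in practice the FILE/PAGE values would be parsed from the input directory:

```python
import pandas as pd
from transformers import pipeline

classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Hypothetical (file, page) inputs standing in for a processed directory.
files = [("doc1_page1.png", 1), ("doc1_page2.png", 2)]

rows = []
for path, page in files:
    row = {"FILE": path, "PAGE": page}
    # Unpack the top-3 guesses into CLASS-N / SCORE-N columns.
    for n, pred in enumerate(classifier(path, top_k=3), start=1):
        row[f"CLASS-{n}"] = pred["label"]
        row[f"SCORE-{n}"] = round(pred["score"], 4)
    rows.append(row)

pd.DataFrame(rows).to_csv("model_TOP-3.csv", index=False)
```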
### Contacts 📧

For support write to: 📧 lutsai.k@gmail.com 📧

Official repository: UFAL [^3]

### Preprint 📖

For the full research background, check out our paper on arXiv: **[Page image classification for content-specific data processing](https://arxiv.org/abs/2507.21114)**

It covers everything from raw data exploration and dataset construction 🗂️, through benchmarking of multiple image classification approaches (Random Forest, EfficientNetV2, RegNetY, DiT, ViT, and CLIP), to system architecture and real-world results on historical collections from Prague ⛪ and Brno 🏛️.

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:**
    - fine-tuned ViT with a 224x224 [^2] 🔗 or 384x384 [^6] [^7] 🔗 input resolution
    - fine-tuned EffNetV2 with a 300x300 [^8] 🔗 or 384x384 [^9] 🔗 input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384
[^8]: https://huggingface.co/timm/tf_efficientnetv2_s.in21k
[^9]: https://huggingface.co/timm/tf_efficientnetv2_l.in21k_ft_in1k
[^10]: http://hdl.handle.net/20.500.12800/1-5959
[^11]: https://huggingface.co/timm/tf_efficientnetv2_m.in21k_ft_in1k
[^12]: https://huggingface.co/timm/regnety_160.swag_ft_in1k