---
library_name: transformers
tags:
- page
- classification
- timm
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using fine-tuned ViT - for sorting historical :bowtie: documents

### Goal: solve the task of sorting archival page images (for their further content-based processing)

**Scope:** Processing of images, training and evaluation of the ViT model, input file/directory processing, output of class 🏷️ (category) results for the top-N predictions, summarizing predictions into a tabular format, HF 😊 hub support for the model
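As a quick start, here is a minimal sketch of loading the default model from the HF 😊 hub [^1] 🔗 and printing its top-3 category 🏷️ predictions for a single page image; the file name `page.png` is a placeholder for your own input:

```python
from transformers import pipeline

# Load the fine-tuned page classifier from the HF hub.
classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Print the top-3 category labels with their confidence scores.
for pred in classifier("page.png", top_k=3):
    print(f"{pred['label']}: {pred['score']:.4f}")
```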
## Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories, but they differ in data annotations and base models. The latest `v5.3` is considered to be the default and can be found in the `main` branch of the HF 😊 hub [^1] 🔗

| Version | Base                             | Pages | PDFs      | Description                                                                          |
|--------:|----------------------------------|:-----:|:---------:|:-------------------------------------------------------------------------------------|
| `v2.0`  | `vit-base-patch16-224`           | 10073 | **3896**  | annotations with mistakes, more heterogeneous data                                   |
| `v2.1`  | `vit-base-patch16-224`           | 11940 | **5002**  | more diverse pages in each category, fewer annotation mistakes                       |
| `v2.2`  | `vit-base-patch16-224`           | 15855 | **5730**  | same data as `v2.1` + some restored pages from `v2.0`                                |
| `v3.2`  | `vit-base-patch16-384`           | 15855 | **5730**  | same data as `v2.2`, but a bit larger model base with higher resolution              |
| `v5.2`  | `vit-large-patch16-384`          | 15855 | **5730**  | same data as `v2.2`, but the largest model base with higher resolution               |
| `v1.2`  | `efficientnetv2_s.in21k`         | 15855 | **5730**  | same data as `v2.2`, but the smallest model base (CNN)                               |
| `v4.2`  | `efficientnetv2_l.in21k_ft_in1k` | 15855 | **5730**  | same data as `v2.2`, CNN base model smaller than the largest, may be more accurate   |
| `v2.3`  | `vit-base-patch16-224`           | 38625 | **37328** | data from the new annotation phase, more single-page documents used, transformer model |
| `v3.3`  | `vit-base-patch16-384`           | 38625 | **37328** | same data as `v2.3`, but a bit larger model base with higher resolution              |
| `v5.3`  | `vit-large-patch16-384`          | 38625 | **37328** | same data as `v2.3`, but the largest model base with higher resolution               |
| `v1.3`  | `efficientnetv2_m.in21k_ft_in1k` | 38625 | **37328** | same data as `v2.3`, but the smallest model base (CNN)                               |
| `v4.3`  | `regnety_160.swag_ft_in1k`       | 38625 | **37328** | same data as `v2.3`, CNN base model bigger than the smallest, may be more accurate   |

The base models compare as follows in size and input resolution:

| Base model                       | Parameters (M) | Resolution (px) | Revisions |
|----------------------------------|----------------|-----------------|-----------|
| `efficientnetv2_s.in21k`         | 48             | 300             | v1.2      |
| `efficientnetv2_m.in21k_ft_in1k` | 54             | 384             | v1.3      |
| `vit-base-patch16-224`           | 87             | 224             | v2.X      |
| `vit-base-patch16-384`           | 87             | 384             | v3.X      |
| `regnety_160.swag_ft_in1k`       | 84             | 224             | v4.3      |
| `vit-large-patch16-384`          | 305            | 384             | v5.X      |
| `regnety_640.seer`               | 281            | 384             | v6.3      |

The benchmark below compares the fine-tuned base models on the `vX.3` data; the `max_cat` column gives the per-category page cap applied to the training data, and the `Fold` column indicates the cross-validation fold that achieved the best score:

| Base Model                                 | Revision | max_cat | Best_Prec (%) | Best_Acc (%) | Fold | Note         |
|--------------------------------------------|----------|---------|---------------|--------------|------|--------------|
| **google/vit-base-patch16-224**            | **v2.3** | 14,000  | **98.79**     | **98.79**    | 5    | OK & Small   |
| **google/vit-base-patch16-384**            | **v3.3** | 14,000  | **98.92**     | **98.92**    | 2    | Good & Small |
| **google/vit-large-patch16-384**           | **v5.3** | 14,000  | **99.12**     | **99.12**    | 2    | Best & Large |
| microsoft/dit-base-finetuned-rvlcdip       | v9.3     | 14,000  | 98.71         | 98.72        | 3    |              |
| microsoft/dit-large-finetuned-rvlcdip      | v10.3    | 14,000  | 98.66         | 98.66        | 3    |              |
| microsoft/dit-large                        | v11.3    | 14,000  | 98.53         | 98.53        | 2    |              |
| timm/regnety_120.sw_in12k_ft_in1k          | v12.3    | 14,000  | 98.29         | 98.29        | 3    |              |
| **timm/regnety_160.swag_ft_in1k**          | **v4.3** | 14,000  | **99.17**     | **99.16**    | 1    | Best & Small |
| timm/regnety_640.seer                      | v6.3     | 14,000  | 98.79         | 98.79        | 5    | OK & Large   |
| timm/tf_efficientnetv2_l.in21k_ft_in1k     | v8.3     | 14,000  | 98.62         | 98.62        | 5    |              |
| **timm/tf_efficientnetv2_m.in21k_ft_in1k** | **v1.3** | 14,000  | **98.83**     | **98.83**    | 1    | Good & Small |
| timm/tf_efficientnetv2_s.in21k             | v7.3     | 14,000  | 97.90         | 97.87        | 1    |              |

## Model description 📇

![architecture.png](https://github.com/ufal/atrium-page-classification/blob/vit/architecture.png?raw=true)

🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

🔳 **Base** model repositories:

- Google's **vit-base-patch16-224**, **vit-base-patch16-384**, and **vit-large-patch16-384** [^2] [^6] [^7] 🔗
- timm's **regnety_160.swag_ft_in1k**, **efficientnetv2_s.in21k**, **efficientnetv2_m.in21k_ft_in1k**, and **efficientnetv2_l.in21k_ft_in1k** [^12] [^8] [^11] [^9] 🔗

### Data 📜

The dataset is provided under the Public Domain license and consists of **48,499** PNG images of pages from **37,328** archival documents. The source image files and their annotations can be found in the LINDAT repository [^10] 🔗.

Manual ✍️ annotation was performed beforehand and took some time ⌛. The categories 🪧 tabulated below were formed from different sources of archival documents originating from the years 1920-2020.

| Category        | Dataset 0   | Dataset 1    | Dataset 2    | Dataset 3     |
|-----------------|-------------|--------------|--------------|---------------|
| DRAW            | 1090 (9.1%) | 1368 (8.8%)  | 1472 (9.3%)  | 2709 (5.6%)   |
| DRAW_L          | 1091 (9.1%) | 1383 (8.9%)  | 1402 (8.8%)  | 2921 (6.0%)   |
| LINE_HW         | 1055 (8.8%) | 1113 (7.2%)  | 1115 (7.0%)  | 2514 (5.2%)   |
| LINE_P          | 1092 (9.1%) | 1540 (9.9%)  | 1580 (10.0%) | 2439 (5.0%)   |
| LINE_T          | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%)  |
| PHOTO           | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%)   |
| PHOTO_L         | 1087 (9.1%) | 1087 (7.0%)  | 1088 (6.9%)  | 2830 (5.8%)   |
| TEXT            | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW         | 1091 (9.1%) | 1092 (7.1%)  | 1092 (6.9%)  | 2008 (4.1%)   |
| TEXT_P          | 1083 (9.1%) | 1540 (9.9%)  | 1633 (10.3%) | 2312 (4.8%)   |
| TEXT_T          | 1081 (9.1%) | 1476 (9.5%)  | 1482 (9.3%)  | 3965 (8.2%)   |
| **Unique PDFs** | 5001        | 5694         | 5729         | 37328         |
| **Total Pages** | 11,940      | 15,482       | 15,854       | 48,499        |

The table above shows the category distribution for different model versions; the last column (`Dataset 3`) corresponds to the data of the latest `vX.3` models, which actually used only 14,000 pages of the `TEXT` category (the `max_cat` cap above), while the other columns cover all of the used samples. In each case, 80% of the samples served as the training 💪 set and 10% each as the development and test 🏆 sets. Early model versions used 90% of the data for training 💪 and the remaining 10% as both the development and test 🏆 set, due to the lack of annotated (manually classified) pages.

> [!NOTE]
> The disproportion of the categories 🪧 in both training and evaluation data is
> **NOT** intentional, but rather a result of the source data nature.
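For illustration, a minimal sketch of producing such an 80/10/10 split; the stratification by category and the synthetic path/label lists are assumptions here, not the exact procedure used:

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one path and one category label per page image;
# in practice these come from the manual annotation files.
paths = [f"page_{i:04d}.png" for i in range(1000)]
labels = [("TEXT", "DRAW", "PHOTO", "LINE_P")[i % 4] for i in range(1000)]

# 80% for training, then split the remaining 20% in half for dev and test,
# stratifying by category so each split keeps the label distribution.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```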
Training set sizes across model versions:

- **8950** images for `v2.0`
- **10745** images for `v2.1`
- **14565** images for `vX.2`
- **38625** images for `vX.3`

Plus, the evaluation sets:

- **1586** images (taken from the `v2.2` annotations)
- **4823** images (for the `vX.3` models)

### Categories 🏷️

|     Label | Description                                                                                                        |
|----------:|:-------------------------------------------------------------------------------------------------------------------|
| `DRAW`    | **📈 - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions**   |
| `DRAW_L`  | **📈📏 - drawings, etc., but presented within a table-like layout or including a legend formatted as a table**     |
| `LINE_HW` | **✏️📏 - handwritten text organized in a tabular or form-like structure**                                          |
| `LINE_P`  | **📏 - printed text organized in a tabular or form-like structure**                                                |
| `LINE_T`  | **📏 - machine-typed text organized in a tabular or form-like structure**                                          |
| `PHOTO`   | **🌄 - photographs or photographic cutouts, potentially with text captions**                                       |
| `PHOTO_L` | **🌄📏 - photos presented within a table-like layout or accompanied by tabular annotations**                       |
| `TEXT`    | **📰 - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements**            |
| `TEXT_HW` | **✏️📄 - only handwritten text in paragraph or block form (non-tabular)**                                          |
| `TEXT_P`  | **📄 - only printed text in paragraph or block form (non-tabular)**                                                |
| `TEXT_T`  | **📄 - only machine-typed text in paragraph or block form (non-tabular)**                                          |

![dataset_timeline.png](https://github.com/ufal/atrium-page-classification/blob/vit/dataset_timeline.png?raw=true)

#### Data preprocessing

During training, the following transforms were each applied randomly with a 50% chance (see the runnable sketch after this list):

* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
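A minimal sketch of combining these augmentations with `torchvision`, where the independent 50% chance is expressed via `transforms.RandomApply` (an assumption about how the probability was applied):

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation fires independently with a 50% chance.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    # Random sharpness factor in [0.5, 1.5]: below 1.0 softens, above 1.0 sharpens.
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    # Gaussian blur with a random radius of up to 2 px.
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```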
### Training Hyperparameters

The models were fine-tuned with the following settings (a mapping onto code is sketched after this list):

* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
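These names match the 🤗 `transformers.TrainingArguments` fields, so a minimal fine-tuning sketch could look as follows; the `train_dataset`, `eval_dataset`, and `compute_metrics` objects are assumed to be prepared elsewhere, and the base checkpoint here is just one of the bases listed above:

```python
from transformers import AutoModelForImageClassification, Trainer, TrainingArguments

# The 11 page categories described above.
labels = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T", "PHOTO",
          "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for an 11-class one
)

training_args = TrainingArguments(
    output_dir="vit-historical-page",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # assumed: preprocessed training images
    eval_dataset=eval_dataset,        # assumed: held-out development set
    compute_metrics=compute_metrics,  # assumed: accuracy computation
)
trainer.train()
```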
### Results 📊

| **Revision** | **Top-1 (%)** | **Top-3 (%)** |
|--------------|---------------|---------------|
| `v1.2`       | 97.73         | 99.87         |
| `v2.2`       | 97.54         | 99.94         |
| `v3.2`       | 96.49         | 99.94         |
| `v4.2`       | 97.73         | 99.87         |
| `v5.2`       | 97.86         | 99.87         |
| `v1.3`       | 96.81         | 99.78         |
| `v2.3`       | 98.79         | 99.96         |
| `v3.3`       | 98.92         | 99.98         |
| `v4.3`       | 98.92         | **100.0**     |
| `v5.3`       | **99.12**     | 99.94         |
| `v6.3`       | 98.79         | 99.94         |

**v2.2** Evaluation set accuracy (**Top-1**): **97.54%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1136_model_v220105p_conf_mat_TOP-1.png?raw=true)

**v3.2** Evaluation set accuracy (**Top-1**): **96.49%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1142_model_v320105p_conf_mat_TOP-1.png?raw=true)

**v5.2** Evaluation set accuracy (**Top-1**): **97.73%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250701-1203_model_v520105p_conf_mat_TOP-1.png?raw=true)

**v1.2** Evaluation set accuracy (**Top-1**): **97.73%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250709-1831_model_v120106s_conf_mat_TOP-1.png?raw=true)

**v4.2** Evaluation set accuracy (**Top-1**): **97.86%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20250709-1829_model_v120106l_conf_mat_TOP-1.png?raw=true)

**v1.3** Evaluation set accuracy (**Top-1**): **98.83%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1835_model_v13_conf_mat_TOP-1.png?raw=true)

**v2.3** Evaluation set accuracy (**Top-1**): **98.79%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1841_model_v23_conf_mat_TOP-1.png?raw=true)

**v3.3** Evaluation set accuracy (**Top-1**): **98.92%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1849_model_v33_conf_mat_TOP-1.png?raw=true)

**v4.3** Evaluation set accuracy (**Top-1**): **98.16%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1856_model_v43_conf_mat_TOP-1.png?raw=true)

**v5.3** Evaluation set accuracy (**Top-1**): **99.12%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1905_model_v53_conf_mat_TOP-1.png?raw=true)

**v6.3** Evaluation set accuracy (**Top-1**): **98.79%**

![TOP-1 confusion matrix](https://github.com/ufal/atrium-page-classification/blob/vit/result/plots/20251020-1913_model_v63_conf_mat_TOP-1.png?raw=true)

#### Result tables

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v220105p_TOP-1_EVAL.csv) 🔗
- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1925_model_v220105p_TOP-3_EVAL.csv) 🔗
- **v3.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v320105p_TOP-1_EVAL.csv) 🔗
- **v3.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1927_model_v320105p_TOP-3_EVAL.csv) 🔗
- **v5.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250701-1057_model_v520105p_TOP-1_EVAL.csv) 🔗
- **v5.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1928_model_v520105p_TOP-3_EVAL.csv) 🔗
- **v1.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250709-1825_model_v120106s_TOP-1_EVAL.csv) 🔗
- **v1.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1924_model_v120106s_TOP-3_EVAL.csv) 🔗
- **v4.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250709-1823_model_v120106l_TOP-1_EVAL.csv) 🔗
- **v4.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20250710-1921_model_v120106l_TOP-3_EVAL.csv) 🔗
- **v1.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1825_5449_model_v13_TOP-1_EVAL.csv) 📎
- **v2.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1835_5449_model_v23_TOP-1_EVAL.csv) 📎
- **v3.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1841_5449_model_v33_TOP-1_EVAL.csv) 📎
- **v4.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1849_5449_model_v43_TOP-1_EVAL.csv) 📎
- **v5.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1856_5449_model_v53_TOP-1_EVAL.csv) 📎
- **v6.3** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/vit/result/tables/20251020-1906_5449_model_v63_TOP-1_EVAL.csv) 📎

#### Table columns

The result tables use the following columns (a sketch of producing such a table follows the list):

- **FILE** - name of the file
- **PAGE** - number of the page
- **CLASS-N** - label of the TOP-N guessed category 🏷️
- **SCORE-N** - confidence score of the TOP-N guessed category 🏷️
- **TRUE** - actual label of the category 🏷️
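A minimal sketch of summarizing top-3 predictions into this tabular format; the `files` list with its page numbers is hypothetical, and in practice the FILE/PAGE values would be parsed from the input directory:

```python
import pandas as pd
from transformers import pipeline

classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Hypothetical (file, page) inputs standing in for a processed directory.
files = [("doc1_page1.png", 1), ("doc1_page2.png", 2)]

rows = []
for path, page in files:
    row = {"FILE": path, "PAGE": page}
    # Unpack the top-3 guesses into CLASS-N / SCORE-N columns.
    for n, pred in enumerate(classifier(path, top_k=3), start=1):
        row[f"CLASS-{n}"] = pred["label"]
        row[f"SCORE-{n}"] = round(pred["score"], 4)
    rows.append(row)

pd.DataFrame(rows).to_csv("model_TOP-3.csv", index=False)
```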
### Contacts 📧

For support write to: 📧 lutsai.k@gmail.com 📧

Official repository: UFAL [^3]

### Preprint 📖

For the full research background, check out our paper on arXiv: **[Page image classification for content-specific data processing](https://arxiv.org/abs/2507.21114)**

It covers everything from raw data exploration and dataset construction 🗂️, through benchmarking of multiple image classification approaches (Random Forest, EfficientNetV2, RegNetY, DiT, ViT, and CLIP), to system architecture and real-world results on historical collections from Prague ⛪ and Brno 🏛️.

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:**
    - fine-tuned ViT with a 224x224 [^2] 🔗 or 384x384 [^6] [^7] 🔗 input resolution
    - fine-tuned EffNetV2 with a 300x300 [^8] 🔗 or 384x384 [^9] 🔗 input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384
[^8]: https://huggingface.co/timm/tf_efficientnetv2_s.in21k
[^9]: https://huggingface.co/timm/tf_efficientnetv2_l.in21k_ft_in1k
[^10]: http://hdl.handle.net/20.500.12800/1-5959
[^11]: https://huggingface.co/timm/tf_efficientnetv2_m.in21k_ft_in1k
[^12]: https://huggingface.co/timm/regnety_160.swag_ft_in1k