---
language:
- en
tags:
- vision-language
- vqa
- text-to-image-evaluation
license: mit
---

# Tiny Random VQAScore Model

This is a tiny, randomly initialized version of the VQAScore architecture for educational and testing purposes.

## Model Architecture

- **Vision Encoder**: Tiny CNN + Transformer (64 hidden size)
- **Language Model**: Tiny Transformer (256 hidden size)
- **Multimodal Projector**: MLP with 256 → 128 → 64 → 1

## Usage

```python
from PIL import Image

from create_tiny_vqa_model import TinyVQAScore

# Load the model
model = TinyVQAScore(device="cpu")

# Score an image against a question
image = Image.open("your_image.jpg")
score = model.score(image, "What is shown in this image?")
print(f"VQA Score: {score}")
```

## Model Size

- **Parameters**: ~50K (vs. ~11B for the original XXL model)
- **Memory**: ~200KB (vs. ~22GB for the original XXL model)

## Disclaimer

This is a randomly initialized model for testing and educational purposes. It is not trained and will not produce meaningful VQA results.
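To make the projector shape concrete, here is a minimal NumPy sketch of a randomly initialized 256 → 128 → 64 → 1 MLP. The function names (`make_projector`, `project`), the ReLU/sigmoid choices, and the init scale are illustrative assumptions, not the actual code in `create_tiny_vqa_model`:

```python
import numpy as np

def make_projector(sizes=(256, 128, 64, 1), seed=0):
    """Randomly initialized MLP weights matching the projector shape above (illustrative)."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.02, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def project(layers, x):
    """Forward pass: ReLU between hidden layers, sigmoid on the final logit."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU (assumed activation)
    return 1.0 / (1.0 + np.exp(-x))  # squash the single logit to a [0, 1] score

layers = make_projector()
fused = np.zeros(256)  # stand-in for a fused image/text feature vector
score = project(layers, fused)
print(score.shape)  # (1,)
```

With these layer sizes the projector alone holds 256·128 + 128 + 128·64 + 64 + 64·1 + 1 = 41,217 weights, which is consistent with the ~50K parameter total quoted below.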