Reconstructive Visual Instruction Tuning
Paper: arXiv:2410.09575
Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, and it is trained with an additional image reconstruction objective to strengthen its multimodal comprehension.
If you are not using Linux, do NOT proceed.
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
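Before loading a checkpoint, it can help to confirm that the environment matches what the steps above set up. The following is a minimal sketch, not part of the Ross repository: `check_environment` is a hypothetical helper, and the package names it probes (`torch`, `flash_attn`, `ross`) are assumed to be the import names produced by the install commands.

```python
import importlib.util
import platform
import sys


def check_environment(packages=("torch", "flash_attn", "ross")):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: the package names are assumptions based on the
    install commands, not something the Ross repository ships.
    """
    problems = []

    # The README states that only Linux is supported.
    if platform.system() != "Linux":
        problems.append(f"unsupported OS: {platform.system()} (Linux expected)")

    # The conda environment is created with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python {found} found (3.10 expected)")

    # find_spec returns None for a top-level package that cannot be imported.
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks consistent with the install steps.")
```

Running the script inside the activated `ross` environment prints one warning per mismatch, which is usually quicker to read than an import traceback from the inference code.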
import torch
from PIL import Image

# Assumed import path for IMAGE_TOKEN_INDEX, mirroring the LLaVA-style package layout.
from ross.constants import IMAGE_TOKEN_INDEX
from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from ross.eval.run_llava import eval_model

# Load the tokenizer, model, and image processor from the released checkpoint.
model_path = "HaochenWang/ross-qwen2-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images expects a list of PIL images, so wrap the single image.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

# Tokenize the prompt, mapping the image placeholder to IMAGE_TOKEN_INDEX,
# and add a batch dimension.
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

# Decode the first (and only) sequence in the batch.
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
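The `temperature`, `top_k`, and `top_p` arguments passed to `generate` control which tokens remain eligible at each sampling step. The sketch below illustrates the standard filtering idea in pure Python on a toy logit vector. It is an illustration only: `filter_probs` is a hypothetical helper, not the implementation used by the model's `generate` method, and it ignores beam search.

```python
import math


def filter_probs(logits, temperature=0.8, top_k=20, top_p=0.7):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering.

    Hypothetical illustration of the usual sampling filters; not library code.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely token ids and renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a distribution.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    print(filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.7))
```

Lower `temperature` sharpens the distribution, while smaller `top_k` or `top_p` shrinks the candidate set, so the values in the snippet above (0.8, 20, 0.7) trade some diversity for more focused answers.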
If you find Ross useful for your research and applications, please cite using this BibTeX:
@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}