STIC: Enhancing LVLMs with Self-Training on Image Comprehension

1University of California, Los Angeles,
2University of California, Berkeley, 3Stanford University
* Equal Contribution

Framework overview of STIC, a two-stage self-training algorithm focusing on the image comprehension capability of the LVLMs.
In Stage 1, the base LVLM self-constructs its preference dataset for image description using well-designed prompts, poorly-designed prompts, and distorted images. In Stage 2, a small portion of the previously used supervised fine-tuning (SFT) data is recycled and infused with model-generated image descriptions to further fine-tune the base LVLM.

Introduction

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby equipping the model to perceive image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training that targets the unique visual perception and reasoning capabilities of LVLMs remains a challenge.

To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach targeted specifically at image comprehension. First, the model self-constructs a preference dataset for image descriptions from unlabeled images: preferred responses are generated with a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts.

We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies examine the individual components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training.

STIC: Self-constructed Preference Data

STIC specifically emphasizes the image comprehension self-training of LVLMs, where the model generates its own preference dataset focused on image description. The self-generated dis-preferred responses are obtained by gathering model responses from either (1) prompts likely to elicit inaccurate responses or (2) corrupted images. The preferred responses are collected via a detailed prompt that guides the model through a step-by-step image description process.
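As a concrete illustration of this data-construction step, here is a minimal Python sketch. The model interface (model.generate), the corruption function, and the example prompts are hypothetical placeholders, not the exact prompts or distortions used in STIC.

import random

GOOD_PROMPT = ("Describe the image step by step: the overall scene first, "
               "then the salient objects, their attributes, and their relations.")
BAD_PROMPTS = [
    "Briefly caption the image; it is fine to guess objects that may not be there.",
    "Describe the image in one short sentence without looking closely.",
]

def build_preference_pair(model, image, corrupt_image):
    # Preferred response: detailed, step-by-step description of the clean image.
    preferred = model.generate(image, GOOD_PROMPT)

    # Dis-preferred response: either a misleading prompt on the clean image,
    # or the well-designed prompt applied to a corrupted (e.g., blurred) image.
    if random.random() < 0.5:
        dispreferred = model.generate(image, random.choice(BAD_PROMPTS))
    else:
        dispreferred = model.generate(corrupt_image(image), GOOD_PROMPT)

    return {"prompt": GOOD_PROMPT, "chosen": preferred, "rejected": dispreferred}

Each unlabeled image thus yields one (chosen, rejected) pair without any human annotation.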

STIC: Two-stage Self-Training

We introduce STIC, a two-stage self-training algorithm designed to enhance image comprehension capabilities. In the first stage, the model self-constructs its preference dataset for image description; in the second stage, a small portion of the previously used supervised fine-tuning (SFT) data is infused with self-generated image descriptions for further fine-tuning.


During fine-tuning, we use a direct preference optimization (DPO) loss with an additional regularization term that explicitly emphasizes the preferred response. Lastly, we allow the model to self-improve its reasoning ability based on its own extracted image information by reusing a small amount of existing instruction fine-tuning data and appending its self-generated image descriptions to the prompts. We refer to this second stage as description-infused fine-tuning. Notably, STIC does not require pre-labeled information about the images, in contrast to recent works that rely on such information to construct vision-language preference data.
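For concreteness, one way to write such a regularized DPO objective is given below; the weighting coefficient lambda and the exact form of the extra term are illustrative assumptions rather than the precise objective reported in the paper. Here x denotes the image together with the description prompt, y_w and y_l the preferred and dis-preferred descriptions, and pi_ref the frozen base LVLM:

\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right] \;-\; \lambda\, \mathbb{E}_{(x,\,y_w)}\left[\log \pi_\theta(y_w \mid x)\right]

The second term adds an explicit log-likelihood reward on the preferred description, which is what "explicitly emphasizing the preferred response" refers to above.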

STIC: Main Results

To demonstrate the effectiveness of STIC, we conduct extensive experiments on seven vision-language benchmarks, including ScienceQA, TextVQA, ChartQA, LLaVA-Bench, MMBench, MM-Vet, and MathVista. These benchmarks encompass scientific reasoning, math reasoning, optical character recognition (OCR), and conversation capabilities based on vision inputs, spanning various image sources such as natural, chart, and text-rich images. We employ LLaVA-v1.6 as the primary base LVLM for our experiments and utilize 6,000 images from MSCOCO to construct the image description preference data.


STIC achieves consistent and significant performance improvements across these benchmarks, with an average accuracy gain of 4.0% over the base LVLM and a notable gain of 6.4% on ScienceQA. These results demonstrate the remarkable effectiveness of our image comprehension self-training approach in enhancing the visual perception capabilities of LVLMs.

STIC: t-SNE Visualization

To gain further insight into the effectiveness of STIC across different benchmarks, we conducted a t-SNE visualization analysis comparing the image distributions of MSCOCO, which we used for preference data construction, with those of four benchmarks: ScienceQA, TextVQA, MathVista, and ChartQA.
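A minimal sketch of this kind of analysis is given below, assuming the images are embedded with a frozen CLIP vision encoder (an assumption; the choice of feature extractor is illustrative) and then projected to 2-D with scikit-learn's t-SNE.

import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    # Encode a list of PIL images into CLIP image features.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return encoder.get_image_features(**inputs).numpy()

def plot_tsne(image_sets):
    # image_sets: dict mapping a dataset name (e.g., "MSCOCO", "ScienceQA") to a list of images.
    names, feats = zip(*[(name, embed(imgs)) for name, imgs in image_sets.items()])
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(np.vstack(feats))
    start = 0
    for name, f in zip(names, feats):
        block = coords[start:start + len(f)]
        plt.scatter(block[:, 0], block[:, 1], s=4, label=name)
        start += len(f)
    plt.legend()
    plt.savefig("tsne_image_distributions.png", dpi=200)

The degree to which the MSCOCO cluster overlaps a benchmark's cluster in such a plot is what we refer to as distributional overlap below.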

Our analysis revealed a general trend: the greater the overlap between the MSCOCO image distribution and that of a benchmark, the higher the performance gain achieved by STIC on that benchmark. This observation held true for ScienceQA and TextVQA, which exhibited substantial distributional overlap with MSCOCO and yielded the highest performance gains of 6.4% and 4.9%, respectively. Conversely, MathVista, with its diverse image types and limited overlap with MSCOCO, saw a more modest gain of 2.4%. Interestingly, ChartQA was an outlier, achieving a high gain of 5.1% despite minimal overlap with MSCOCO, suggesting that the improved image comprehension from STIC played a fundamental role in understanding and reasoning about the charts.

BibTeX

@article{deng2024enhancing,
  author = {Deng, Yihe and Lu, Pan and Yin, Fan and Hu, Ziniu and Shen, Sheng and Zou, James and Chang, Kai-Wei and Wang, Wei},
  title = {Enhancing Large Vision Language Models with Self-Training on Image Comprehension},
  journal = {arXiv preprint arXiv:2405.19716},
  year = {2024}
}