
Intro

Motivations

In this work

The Grounding Visual Illusion in Language (GVIL) Dataset

An example illusion from each category and the corresponding explanation.

The curated dataset contains over 1,600 data points across five types of illusions and supports four different evaluation protocols.

Example illustration for each task setup. Left: SameDiffQA. Right: RefQA, AttrQA, RefLoc.
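To make the four protocols concrete, here is a minimal sketch of how a single data point might be framed under each setup. The record fields and prompt wordings below are illustrative assumptions, not the dataset's actual schema or templates; please refer to the paper for the exact formats.

```python
# Hypothetical GVIL record (field names are assumptions, not the real schema).
record = {
    "image": "relativity_001.png",  # illusion image
    "illusion_type": "relativity",  # e.g., a size illusion induced by context
    "objects": ["left orange ball", "right orange ball"],
    "attribute": "size",
}

# SameDiffQA: directly probe whether the model perceives the illusion.
samediff_qa = "Are the left orange ball and the right orange ball the same size?"

# RefQA: refer to an object through the illusion-affected attribute.
ref_qa = "Which orange ball is bigger, the left one or the right one?"

# AttrQA: ask about the attribute of a specific object.
attr_qa = "Is the left orange ball bigger than the right orange ball?"

# RefLoc: ask the model to localize the object picked out by an
# illusion-dependent description (scored against bounding boxes).
ref_loc = "Locate the bigger orange ball."
```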

Experiments and Results

We evaluate four representative models, LLaVA, InstructBLIP, Unified-IO, and OFA, on the GVIL dataset. Please refer to our paper for more details.

How Well Do Models Recognize Illusions?


Under the SameDiffQA task, which specifically challenges the ability to recognize illusions, we find that models rarely perceive illusions the way humans do.

However, we do confirm a positive correlation between model scale and the ability to recognize illusions.

How Humanlike Are Models Under Illusion?


Here, we define the humanlike rate as the percentage of data points where the model's prediction matches the human's prediction.
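As a concrete illustration, here is a minimal sketch of this metric; the string-valued prediction format is a simplifying assumption.

```python
def humanlike_rate(model_preds, human_preds):
    """Percentage of data points where the model's prediction
    matches the (illusion-affected) human prediction."""
    assert len(model_preds) == len(human_preds) and model_preds
    matches = sum(m == h for m, h in zip(model_preds, human_preds))
    return 100.0 * matches / len(model_preds)

# Example: the model agrees with human perception on 3 of 4 points -> 75.0
print(humanlike_rate(["left", "same", "right", "left"],
                     ["left", "same", "right", "right"]))
```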

We find the answer to be nuanced: the models' performance varies significantly across task formulations. For example, the humanlike rate is quite low on the RefQA and AttrQA tasks but much higher on the RefLoc task.

Again, we confirm that larger models consistently align better with human perception.

Does Alignment Vary By Illusion Type?


Yes. We find that the humanlike rate varies significantly across illusion types: the color constancy illusion is the most challenging for models, while the humanlike rate is much higher for the perspective illusion.

A Preliminary Study on GPT-4 Vision

As a preliminary study, we also evaluate the recently released GPT-4 Vision model on the GVIL dataset. As of November 4, 2023, our access to the model is limited to the ChatGPT interface, allowing only qualitative analysis.

The model often succeeds on simpler cases like Examples 1 and 2 but struggles with hand-crafted ones such as Examples 3-6. We hypothesize that the model has been exposed to many illusion examples available online during training, but generalizes poorly to complex, unfamiliar scenarios.

Selected Cases

Example 1: GPT-4V gets everything right.

Example 2: GPT-4V refuses to give a straight answer, but its explanation is consistent and correct.

Example 3: GPT-4V recognizes the visual illusion, yet mistakenly suggests that the bottom bottle is bigger.

Example 4: GPT-4V recognizes the visual illusion, yet mistakenly suggests that the left solid circle is bigger.

Example 5: GPT-4V recognizes the visual illusion, yet mistakenly suggests that the right starred balloon is blue.

Example 6: GPT-4V recognizes the presence of the visual illusion and acknowledges that the two balls are actually the same size. However, it mistakenly suggests that the orange ball on the left looks smaller.

Example 7: GPT-4V's prediction is consistent with human perception.