The Vision, Language, and Learning Lab at the University of Virginia

Attention Mask Consistency (AMC)

Here we showcase our model for localizing arbitrary objects and phrases in images through tuning the ALBEF vision-language model with human-like explanations from the Visual Genome dataset. Particularly we finetune ALBEF with our recenlty proposed Attention Mask Consistency (AMC) objective which will appear at CVPR 2023: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. In this paper, we propose a max-margin fine-tuning objective that encourages gradient-based explanations to become soft-aligned with explanations provided by humans. This greatly improves a vision-language model's ability to "ground" or localize objects in images for arbitrary phrases. Try your own images and textual phrases below and see what it does.

Original Image

P("a man in blue shirt") = 0.65

Demo by Ziyan and Vicente

Gallery of examples

A group of people

A red flower

Tree branches in the background

A woman with long curly hair

A man wearing glasses

Some grass

Highlighting: Buildings on the background

Buildings in the background

A woman posing for a picture

A man posing for a picture

Technical Notes: The images uploded through this demo are not stored in our server or anywhere (not even temporarily), we only hold images on the server-side in volatile memory while they are being processed and return the resulting image as base64 encoded strings directly to the user's browser. Any images presented here as examples were not uploaded by a user but were images directly uploaded by our team. This is not a demo that aims to collect any data from users.