Attention Mask Consistency (AMC)

Here we showcase our model for localizing arbitrary objects and phrases in images, obtained by fine-tuning the ALBEF vision-language model with human-like explanations from the Visual Genome dataset. In particular, we fine-tune ALBEF with our recently proposed Attention Mask Consistency (AMC) objective, which will appear at CVPR 2023: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. In this paper, we propose a max-margin fine-tuning objective that encourages gradient-based explanations to become soft-aligned with explanations provided by humans. This greatly improves a vision-language model's ability to "ground", or localize, objects in images for arbitrary phrases. Try your own images and textual phrases below and see what it does.
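
As a rough illustration of what a max-margin objective of this kind can look like, the sketch below penalizes a heatmap whose average value inside a human-annotated region does not exceed its average value outside that region by at least a margin. It is a simplified hinge loss written for this page, not the exact AMC formulation from the paper; the function name `margin_consistency_loss`, the `margin` default, and the toy 2x2 heatmaps are all assumptions made for the example.

```python
from statistics import fmean


def margin_consistency_loss(heatmap, mask, margin=0.1):
    """Hinge penalty that is zero once the heatmap is, on average, at least
    `margin` higher inside the annotated region than outside it.

    heatmap: 2D list of non-negative explanation scores (for example from a
             gradient-based method); the values used below are made up.
    mask:    2D list of 0/1 flags marking the human-annotated region.
    """
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, mask):
        for score, flag in zip(heat_row, mask_row):
            (inside if flag else outside).append(score)
    if not inside or not outside:
        return 0.0  # nothing to compare when the mask is empty or covers everything
    return max(0.0, fmean(outside) - fmean(inside) + margin)


# Hypothetical toy inputs: the annotated region is the top row.
region = [[1, 1], [0, 0]]
aligned = [[0.9, 0.8], [0.1, 0.0]]      # heat concentrated inside the region
misaligned = [[0.1, 0.0], [0.9, 0.8]]   # heat concentrated outside the region

print(margin_consistency_loss(aligned, region))
print(margin_consistency_loss(misaligned, region))
```

In this toy setting the aligned heatmap incurs no penalty, while the misaligned one is penalized by roughly 0.9, which is the kind of signal a fine-tuning step would then try to reduce.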

Original Image

original image is here

P("a man in blue shirt") = 0.65

output image is here

Demo by Ziyan and Vicente

Gallery of examples

Highlighting: a group of people

A group of people

Highlighting: a red flower

A red flower

Highlighting: tree branches in the background

Tree branches in the background

Highlighting: A woman with long curly hair

A woman with long curly hair

Highlighting: A man wearing glasses

A man wearing glasses

Highlighting: Some grass

Some grass

Highlighting: Buildings in the background

Buildings in the background

Highlighting: A woman posing for a picture

A woman posing for a picture

Highlighting: A man posing for a picture

A man posing for a picture

Technical Notes: Images uploaded through this demo are not stored on our server or anywhere else, not even temporarily. We hold images in volatile memory on the server side only while they are being processed, and we return the resulting image as a base64-encoded string directly to the user's browser. The images presented here as examples were not uploaded by users; they were uploaded directly by our team. This demo does not aim to collect any data from users.
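
The in-memory handling described above can be sketched roughly as follows. This is not the demo's actual server code: the function name `respond_with_base64`, the `transform` callback standing in for the model, and the choice to wrap the result in a data URI are assumptions made for illustration. The point is only that the upload is read from a memory buffer, processed, and returned as a base64 string without ever being written to disk.

```python
import base64
import io


def respond_with_base64(upload: bytes, transform, mime_type: str = "image/png") -> str:
    """Process uploaded bytes entirely in memory and return a base64 data URI.

    `transform` is a hypothetical stand-in for the model call that turns the
    uploaded image bytes into the highlighted output image bytes.
    """
    # BytesIO keeps the upload in RAM; the buffer is released when the block exits.
    with io.BytesIO(upload) as buffer:
        result_bytes = transform(buffer.getvalue())
    encoded = base64.b64encode(result_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes and a placeholder transform, used only to exercise the function.
fake_upload = b"not-a-real-image"
uri = respond_with_base64(fake_upload, lambda raw: raw[::-1])
print(uri[:22])
```

A browser can display such a string directly as the source of an image element, so the server has no need to keep a copy of either the upload or the result.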