
Microsoft’s new vision-language model outranks humans at image captioning

The company will integrate the new model into Azure Cognitive Services to support vision-language tasks


Microsoft researchers have developed VinVL (visual features in vision-language), a new object-attribute detection model for image encoding.

Vision-language (VL) systems make it possible to search for images relevant to a text query (or vice versa). They also help describe an image’s content.

In most cases, these systems use two modules to achieve VL understanding: an image-encoding module that generates feature maps of an input image, and a vision-language fusion module that maps the encoded image and text into vectors in the same semantic space.
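To illustrate why a shared semantic space matters, here is a minimal Python sketch of text-to-image search by cosine similarity. It is not VinVL or OSCAR code: the three-dimensional vectors, file names, and the `rank_images` helper are all hypothetical stand-ins for what a real encoder and fusion module would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(query_vec, image_vecs):
    """Return (image name, score) pairs, most similar to the query first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical image embeddings (made up for illustration only).
IMAGE_VECS = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.2],
    "city_skyline.jpg": [0.1, 0.9, 0.3],
    "cat_on_sofa.jpg": [0.6, 0.2, 0.7],
}

# Hypothetical embedding of a text query such as "a dog playing outside".
QUERY_VEC = [0.8, 0.0, 0.3]

ranking = rank_images(QUERY_VEC, IMAGE_VECS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because text and images land in the same space, the same similarity function also works in the other direction: given an image vector, it can rank candidate captions.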

Microsoft’s new research focuses on improving the image-encoding module. When combined with VL fusion modules such as OSCAR and VIVO, Microsoft’s newest VL system performed strongly on the most competitive artificial intelligence (AI) benchmarks, including visual question answering (VQA), Microsoft COCO Image Captioning, and novel object captioning (nocaps).

The tech giant also highlighted that VinVL significantly surpasses human performance on the nocaps leaderboard for consensus-based image description evaluation (CIDEr). 

To achieve these results, Microsoft trained its VinVL object-attribute detection model on a large object detection dataset containing 2.49 million images annotated with 1,848 object classes and 524 attribute classes. Microsoft formed the dataset by merging four public object detection datasets: COCO, Open Images, Objects365, and Visual Genome (VG).


“We first pretrained an object detection model on the merged dataset, and then fine-tuned the model with an additional attribute branch on VG, making it capable of detecting both objects and attributes,” said Microsoft.

“Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes. As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments.”

Despite the promising results, Microsoft said its model is by no means close to human-level VL understanding.

Microsoft also announced that VinVL would be made available to the public. Additionally, the company will integrate VinVL into Azure Cognitive Services to power a wide range of Microsoft services, including Image Captioning in Office and LinkedIn, and Seeing AI.
