After ChatGPT, Microsoft working on AI model that takes images as cues

As the war over artificial intelligence (AI) chatbots heats up, Microsoft has unveiled Kosmos-1, a new AI model that can respond to visual cues or images in addition to text prompts or messages

IANS New Delhi

Last Updated: Mar 03 2023 | 8:31 PM IST

The multimodal large language model (MLLM) can handle a range of new tasks, including image captioning and visual question answering.

Kosmos-1 could pave the way for the next stage beyond ChatGPT's text prompts.

"A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context and follow instructions," said Microsoft's AI researchers in a paper.

The paper suggests that multimodal perception, or knowledge acquisition and "grounding" in the real world, is needed to move beyond ChatGPT-like capabilities to artificial general intelligence (AGI), reports ZDNet.

"More importantly, unlocking multimodal input greatly widens the applications of language models to more high-value areas, such as multimodal machine learning, document intelligence, and robotics," the paper read.

The goal is to align perception with LLMs, so that the models are able to see and talk.

Experimental results showed that Kosmos-1 achieves impressive performance on language understanding and generation, even when it is fed document images directly.

It also showed good results in perception-language tasks, including multimodal dialogue, image captioning and visual question answering, as well as in vision tasks such as image recognition with descriptions (specifying classification via text instructions).

"We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs," said the team.

--IANS

(Only the headline and picture of this report may have been reworked by the Business Standard staff; the rest of the content is auto-generated from a syndicated feed.)

First Published: Mar 03 2023 | 8:31 PM IST