NEW YORK – Slightly more than 10 months ago OpenAI’s ChatGPT was first released to the public. Its arrival ushered in an era of nonstop headlines about artificial intelligence and accelerated the development of competing large language models (LLMs) from Google, Meta and other tech giants.

Since that time, these chatbots have demonstrated an impressive capacity for generating text and code, albeit not always accurately. And now multimodal AIs, which can parse not only text but also images, audio and more, are on the rise.

OpenAI released a multimodal version of ChatGPT, powered by its LLM GPT-4, to paying subscribers for the first time last week, months after the company first announced these capabilities.

Google began incorporating image and audio features similar to those offered by the new GPT-4 into some versions of its LLM-powered chatbot, Bard, back in May. Meta, too, announced big strides in multimodality this past spring. Though still in its infancy, the burgeoning technology can perform a variety of tasks.

Scientific American tested out two different chatbots that rely on multimodal LLMs: a version of ChatGPT powered by the updated GPT-4 (dubbed GPT-4 with vision, or GPT-4V) and Bard, which is currently powered by Google’s PaLM 2 model. Both can hold hands-free vocal conversations using only audio, and both can describe scenes within images and decipher lines of text in a picture.

These abilities have myriad applications. In our test, using only a photograph of a receipt and a two-line prompt, ChatGPT accurately split a complicated bar tab and calculated the amount owed for each of four different people—including tip and tax. Altogether, the task took less than 30 seconds. Bard did nearly as well, but it interpreted one “9” as a “0,” thus flubbing the final total.
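The article does not reproduce the receipt itself, so the short Python sketch below only illustrates the kind of arithmetic involved in such a split: each person's pre-tax subtotal is scaled by an assumed tax rate and tip percentage. All line items, names and rates are hypothetical stand-ins, not figures from the actual test.

```python
# A minimal sketch of splitting a bar tab with tax and tip.
# All amounts, names and rates below are hypothetical examples.

items = {
    "Person 1": [14.00, 9.50],  # each person's drinks, in dollars
    "Person 2": [12.00],
    "Person 3": [9.50, 9.50],
    "Person 4": [16.00],
}
TAX_RATE = 0.08875  # assumed sales tax rate
TIP_RATE = 0.20     # assumed 20 percent tip on the pre-tax subtotal

for person, drinks in items.items():
    subtotal = sum(drinks)
    owed = subtotal * (1 + TAX_RATE + TIP_RATE)
    print(f"{person}: ${owed:.2f}")
```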

In another trial, when given a photograph of a stocked bookshelf, both chatbots offered detailed descriptions of the hypothetical owner’s supposed character and interests that were almost like AI-generated horoscopes. Both identified the Statue of Liberty from a single photograph, deduced that the image was snapped from an office in lower Manhattan and offered spot-on directions from the photographer’s original location to the landmark (though ChatGPT’s guidance was more detailed than Bard’s). And ChatGPT also outperformed Bard in accurately identifying insects from photographs.

Read more at Scientific American