Combining vision and language could hold the key to more capable AI – TechCrunch


Depending on the theory of intelligence you subscribe to, achieving human-level AI requires a system that can use multiple modalities (for example, sound, images, and text) to reason about the world. When shown an image of an overturned truck and a police car on a snowy highway, for instance, a human-level AI might infer that dangerous road conditions caused an accident. Or, running on a robot, when asked to grab a can of soda from the fridge, it would navigate around people, furniture, and pets to retrieve the can and place it within the requester’s reach.

Today’s AI falls short. But new research shows signs of encouraging progress, from robots that can work out the steps needed to fulfill basic commands (e.g., “get a water bottle”) to text-producing systems that learn from explanations. In this revived edition of Deep Science, our weekly series on the latest advances in AI and the wider scientific field, we cover work from DeepMind, Google, and OpenAI that makes strides toward systems that can, if not perfectly understand the world, solve narrow tasks such as generating images with impressive robustness.

OpenAI’s improved DALL-E, DALL-E 2, is arguably the most impressive project to emerge from the depths of an AI research lab. As my colleague Devin Coldewey writes, while the original DALL-E showed a remarkable ability to create images that fit almost any prompt (e.g., “a dog wearing a beret”), DALL-E 2 goes even further. The images it produces are much more detailed, and DALL-E 2 can intelligently replace a particular area in an image, for example inserting a table into a photograph of a marble floor, complete with appropriate reflections.


An example of the types of images that DALL-E 2 can generate.

DALL-E 2 got the most attention this week. But on Thursday, researchers at Google described an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech (VDTTS) in a post published on Google’s AI blog. VDTTS can generate realistic-sounding, lip-synced speech given nothing more than text and video frames of the person speaking.

The speech generated by VDTTS, while not a perfect replacement for recorded dialogue, is still quite good, with convincingly human expressiveness and timing. Google sees it one day being used in a studio to replace original audio that may have been recorded in noisy conditions.

Visual understanding is, of course, only one step toward more capable AI. Another component is language understanding, which lags behind in many respects, even setting aside AI’s well-documented toxicity and bias issues. In one stark example, a cutting-edge system from Google, Pathways Language Model (PaLM), memorized 40% of the data used to “train” it, according to a paper, resulting in PaLM plagiarizing text down to the copyright notices in code snippets.

Fortunately, DeepMind, the AI lab backed by Alphabet, is among those exploring techniques to address this. In a new study, DeepMind researchers examine whether AI language systems, which learn to generate text from many examples of existing text (think books and social media), could benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by determining whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence”) with explanations (e.g., “David’s eyes weren’t literal daggers, it’s a metaphor that is used to imply that David was looking fiercely at Paul.”) and evaluating the performance of different systems on them, the DeepMind team found that explanations do indeed improve the performance of the systems.

If DeepMind’s approach holds up to scrutiny within the academic community, it could one day be applied in robotics, forming the building blocks of a robot that can understand vague requests (e.g., “take out the trash”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project offers a glimpse into this future, albeit with significant limitations.

A collaboration between Robotics at Google and the Everyday Robotics team in Alphabet’s X lab, Do As I Can, Not As I Say seeks to condition an AI language system to suggest actions that are “feasible” and “contextually appropriate” for a robot, given an arbitrary task. The robot acts as the “hands and eyes” of the language system, while the system provides high-level semantic knowledge about the task. The theory is that the language system encodes a wealth of knowledge useful to the robot.


Image Credits: Robotics at Google

A system called SayCan selects which skill the robot should perform in response to a command, factoring in (1) the probability that a given skill is useful and (2) the probability of successfully executing that skill. For example, in response to someone saying “I spilled my Coke, can you bring me something to clean it up?,” SayCan can direct the robot to find a sponge, pick up the sponge, and bring it to the person who asked for it.
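
To make the selection rule above concrete, here is a minimal Python sketch. It assumes the two factors are combined by multiplying them and that the highest-scoring skill wins; the article only says both are taken into account. The skill names, the numeric scores, and the `SkillScore` and `select_skill` names are all hypothetical illustrations, not Google's code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SkillScore:
    """One candidate skill with two made-up scores in [0, 1]."""
    name: str
    usefulness: float   # assumed: language system's estimate that the skill helps with the command
    feasibility: float  # assumed: estimate that the robot can carry out the skill right now


def select_skill(candidates: list[SkillScore]) -> SkillScore:
    """Return the candidate with the highest usefulness * feasibility product."""
    if not candidates:
        raise ValueError("no candidate skills to choose from")
    return max(candidates, key=lambda s: s.usefulness * s.feasibility)


# Hypothetical scores for the command "I spilled my Coke, can you bring me
# something to clean it up?"
candidates = [
    SkillScore("find a sponge", usefulness=0.6, feasibility=0.9),
    SkillScore("find a vacuum", usefulness=0.7, feasibility=0.1),
    SkillScore("go to the person", usefulness=0.2, feasibility=0.95),
]

best = select_skill(candidates)
print(best.name)  # find a sponge
```

In this toy example the vacuum looks slightly more useful on paper, but because the robot is unlikely to manage it, the combined score favors the sponge. That is the intuition behind weighing what the language system says against what the robot can actually do.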

SayCan is limited by robotics hardware: on more than one occasion, the research team observed the robot they chose for their experiments accidentally dropping objects. Still, along with DALL-E 2 and DeepMind’s work on contextual understanding, it illustrates how AI systems, when combined, can inch us that much closer to a Jetsons-like future.
