A gallery of a variety of images that can be analyzed by Unified-IO.

Unified-IO is a Swiss Army AI for images and text

  • Mike Pearl

One AI; tons of jobs.

On June 17, the computer vision-centric PRIOR team at the Allen Institute for AI (which is, in the interest of full disclosure, pnw.ai’s supporting partner) published its work on a highly ambitious AI platform called Unified-IO, a tool intended to perform a huge number of increasingly commonplace language and image processing tasks, all in one tidy package. A blog post from the PRIOR team calls it “a significant milestone in the pursuit of a single, unified general purpose system capable of parsing and producing visual, linguistic and other structured data.”

The impression one gets from looking at the Unified-IO demo is that these are different models that have been somehow rigged together in one interface. For instance, an image is presented, and Unified-IO correctly identifies it as “a large clock in a large room with a glass ceiling.” A moment later in the demo, Unified-IO is asked to do more or less the opposite, and churn out a photo of a “small personal pizza with bacon and spinach.”

What emerges is slightly more recognizable (and edible-looking) than anything Dall-E Mini, the ultra-viral image generator that everyone toyed with on the internet this past month, can create. Then the system throws you for a loop by performing simple text-only tasks like paraphrase detection.

Yet these are all happening via Unified-IO’s “single streamlined transformer encoder-decoder architecture,” to quote the paper by the PRIOR team.

A display of the variety of tasks Unified-IO can perform from the paper. (The Allen Institute and University of Washington Seattle)

This isn’t the first time researchers have derived task-specific outputs from a single multi-task model. To cite one example, Jiasen Lu, Vedanuj Goswami, et al. published a paper on a similar process back in 2019. For what it’s worth, though, the PRIOR team created a benchmark called GRIT (General Robust Image Task Benchmark) for testing an AI’s ability to perform image processing tasks, and Unified-IO is “the first model to support all seven tasks in GRIT,” according to the paper.

How exactly does this work, in simplified terms? Well, AIs are “trained,” and as we all know, AI training involves an AI eating mountains and mountains of data: billions and billions of words and images, which are processed as “tokens” of information. But according to the paper, Unified-IO’s “broad unification is achieved by homogenizing every task’s output into a sequence of discrete tokens,” emphasis mine. So the “homogenized” tokens are agnostic in the sense that they don’t necessarily want to be an image or a word, and can instead be whatever the output needs them to be.
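For the curious, here is a rough idea of what "a sequence of discrete tokens" can look like in practice. This is a toy sketch and not the PRIOR team's code: the tiny word list, the number of location bins, the image codebook size, and every function name are assumptions made up for illustration. The point is only that words, box coordinates, and image patches can all be mapped to plain integers drawn from one shared vocabulary.

```python
# Toy sketch: three kinds of output, one shared integer vocabulary.
# All sizes and names here are illustrative assumptions, not Unified-IO's real values.

TEXT_VOCAB = ["<pad>", "a", "large", "clock", "pizza", "left", "right"]
NUM_LOCATION_BINS = 100  # assumed number of bins for a normalized coordinate
NUM_IMAGE_CODES = 16     # assumed size of an image-patch codebook

# Each modality gets its own contiguous range of IDs in the shared vocabulary.
TEXT_OFFSET = 0
LOCATION_OFFSET = TEXT_OFFSET + len(TEXT_VOCAB)
IMAGE_OFFSET = LOCATION_OFFSET + NUM_LOCATION_BINS
VOCAB_SIZE = IMAGE_OFFSET + NUM_IMAGE_CODES


def encode_text(words):
    """Map known words to their token IDs."""
    return [TEXT_OFFSET + TEXT_VOCAB.index(word) for word in words]


def encode_box(box):
    """Map a box (x0, y0, x1, y1), with each value in [0, 1], to four location tokens."""
    tokens = []
    for coordinate in box:
        # Quantize the coordinate into a bin, clamping 1.0 into the last bin.
        bin_index = min(int(coordinate * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
        tokens.append(LOCATION_OFFSET + bin_index)
    return tokens


def encode_image_codes(codes):
    """Map image-patch codebook indices to token IDs."""
    return [IMAGE_OFFSET + code for code in codes]


def describe(token):
    """Report which range of the shared vocabulary a token ID falls in."""
    if token < LOCATION_OFFSET:
        return ("text", TEXT_VOCAB[token - TEXT_OFFSET])
    if token < IMAGE_OFFSET:
        return ("location", token - LOCATION_OFFSET)
    return ("image", token - IMAGE_OFFSET)


caption = encode_text(["a", "large", "clock"])
box = encode_box((0.0, 0.25, 0.5, 1.0))
patches = encode_image_codes([0, 15])
print(caption + box + patches)
```

Once everything is a list of integers like this, a single decoder can emit a caption, a bounding box, or a grid of image patches through the same output layer, and which range a token falls in tells you how to turn it back into something a person can read or look at.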

The range of possibilities for Unified-IO includes the ability to identify what an image is of (the phrasing “an image of” was one key to Unified-IO’s programming), as well as what’s in a selected piece of an image. It can fill in color-coded sections of a blank image with image content of your choosing. It can spot the positions of people in an image. It can answer natural language questions about selected parts of an image like “What direction is the arrow pointing?”

An example of Unified-IO's image generation capabilities: pizza that — blessedly — looks like a pizza. (The Allen Institute)

It may not be able to do a version of every task you could ever ask a computer to do to an image or snippet of text, but it can undeniably do a lot.

The question is, can someone chop it down into a bite-sized version like Dall-E Mini, so the internet can use its wide array of functions to make a wide array of goofy jokes and japes about Seinfeld and The Muppets? If yes, then perhaps humanity’s most fervent needs will have been served by the hard work of these AI engineers.