Introducing #Imagen, a new text-to-image synthesis model from Google AI that can generate high-fidelity, photorealistic images from a deep level of language understanding.
Another month, another flood of weird and wonderful images generated by an artificial intelligence. In April, OpenAI showed off its new picture-making neural network, DALL-E 2, which could produce remarkable high-res images of almost anything it was asked to. It outstripped the original DALL-E in almost every way.
Now, just a few weeks later, Google Brain has revealed its own image-making AI, called Imagen. And it performs even better than DALL-E 2: it scores higher on a standard measure for rating the quality of computer-generated images, and the pictures it produced were preferred by a group of human judges.
“We’re living through the AI space race!” one Twitter user commented. “The stock image industry is officially toast,” tweeted another.
We are thrilled to announce Imagen, a text-to-image model with unprecedented photorealism and deep language understanding. Explore https://t.co/mSplg4FlsM and Imagen!
Chitwan Saharia (@Chitwan_Saharia), May 24, 2022
A large rusted ship stuck in a frozen lake. Snowy mountains and beautiful sunset in the background. #imagen pic.twitter.com/96Vfo2kXJz
Many of Imagen’s images are indeed jaw-dropping. At a glance, some of its outdoor scenes could have been lifted from the pages of National Geographic. Marketing teams could use Imagen to produce billboard-ready advertisements with just a few clicks.
But as OpenAI did with DALL-E, Google is going all in on cuteness. Both firms promote their tools with pictures of anthropomorphic animals doing adorable things: a fuzzy panda dressed as a chef making dough, a corgi sitting in a house made of sushi, a teddy bear swimming the 400-meter butterfly at the Olympics—and it goes on.
New @GoogleAI work:
Input: "Two meerkats sitting next to each other on top of a mountain and looking at the beautiful landscape. There is a mountain, a river lake, and fields of yellow flowers. There are hot air balloons in the sky." #imagen https://t.co/JEgyNrcJjl
Output: https://t.co/uj4urjnZPF pic.twitter.com/I1zx8ZARBl
Jeff Dean (@🏡) (@JeffDean), May 24, 2022
There’s a technical, as well as PR, reason for this. Mixing concepts like “fuzzy panda” and “making dough” forces the neural network to learn how to manipulate those concepts in a way that makes sense. But the cuteness hides a darker side to these tools, one that the public doesn’t get to see because it would reveal the ugly truth about how they are created.
Most of the images that OpenAI and Google make public are cherry-picked. We only see cute images that match their prompts with uncanny accuracy—that’s to be expected. But we also see no images that contain hateful stereotypes, racism, or misogyny. There is no violent, sexist imagery. There is no panda porn. And from what we know about how these tools are built—there should be.
Not a single human face depicted in the hundreds of pictures in the paper, haha. I guess that's one way to eliminate concerns over representation bias. https://t.co/tKX8khoTDR
mike cook (@mtrc), May 23, 2022
It’s no secret that large models, such as DALL-E 2 and Imagen, trained on vast numbers of documents and images taken from the web, absorb the worst aspects of that data as well as the best. OpenAI and Google explicitly acknowledge this.
Scroll down the Imagen website—past the dragon fruit wearing a karate belt and the small cactus wearing a hat and sunglasses—to the section on societal impact and you get this: “While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized [the] LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.”
It’s the same kind of acknowledgement that OpenAI made when it revealed GPT-3 in 2020: “internet-trained models have internet-scale biases.” And as Mike Cook, who researches AI creativity at Queen Mary University of London, has pointed out, the same admission appears in the ethics statements that accompanied Google’s large language model PaLM and OpenAI’s DALL-E 2. In short, these firms know that their models are capable of producing awful content, and they have no idea how to fix that.
For now, the solution is to keep them caged up. OpenAI is making DALL-E 2 available only to a handful of trusted users; Google has no plans to release Imagen.
That would be fine if these were simply proprietary tools. But these firms are pushing the boundaries of what AI can do, and their work shapes the kind of AI that all of us live with. They are creating new marvels, but also new horrors, and moving on with a shrug. When Google’s in-house ethics team raised problems with large language models in 2020, it sparked a fight that ended with two of its leading researchers being fired.
Large language models and image-making AIs have the potential to be world-changing technologies, but only if their toxicity is tamed. This will require a lot more research. There have been small steps toward opening these kinds of neural networks up for widespread study. A few weeks ago, Meta released a large language model to researchers, warts and all. And Hugging Face is set to release its open-source version of GPT-3 in the next couple of months.
For now, enjoy the teddies.