OpenAI recently released DALL-E 2, a more advanced version of DALL-E, an ingenious multimodal AI that can generate images based purely on text descriptions. DALL-E 2 does this by using advanced deep learning techniques that improve the quality and resolution of the generated images and offers further possibilities such as editing an existing image or creating new versions of it.
Many AI enthusiasts and researchers are tweeting about how good DALL-E 2 is at generating art and images from just a few words, but in this article I’d like to explore another application for this powerful text-to-image model: generating datasets to solve computer vision’s biggest challenges.
Caption: An image generated by DALL-E 2. “A rabbit detective sitting on a park bench and reading a newspaper in a Victorian setting.” Source: Twitter
The shortcomings of computer vision
Computer vision AI applications range from detecting benign tumors in CT scans to enabling self-driving cars. What they all have in common is the need for abundant data. One of the most prominent predictors of a deep learning algorithm’s performance is the size of the dataset on which it is trained. For example, the JFT dataset, an internal Google dataset used for training image classification models, consists of 300 million images and more than 375 million labels.
Consider how an image classification model works: A neural network transforms pixel colors into a vector of numbers that represents the image’s features, known as the “embedding” of the input. Those features are then mapped to the output layer, which contains a probability score for each class of images that the model should detect. During training, the neural network tries to learn the feature representations that best differentiate between the classes, e.g. a pointed ear for a Doberman versus a Poodle.
Ideally, the machine learning model would learn to generalize across different lighting conditions, angles and background environments. But more often than not, deep learning models learn the wrong representations. For example, a neural network might infer that blue pixels are a hallmark of the “frisbee” class because all the images of a Frisbee it saw during training were taken on the beach.
A promising way to address such shortcomings is to enlarge the training set, for example by adding more pictures of Frisbees with different backgrounds. Still, this exercise can prove to be a costly and lengthy undertaking.
First, you need to collect all the necessary samples, for example by searching online or by capturing new images. Next, you need to make sure that each class has enough samples to prevent the model from overfitting or underfitting on some of them. Finally, you need to label each image, indicating which image corresponds to which class. In a world where more data translates into a better performing model, these three steps act as a bottleneck to achieving state-of-the-art performance.
But even then, computer vision models are easily fooled, especially when attacked with adversarial examples. Guess what is another way to mitigate adversarial attacks? You guessed it right: more labeled, well-curated and diverse data.
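To make the embedding-to-probability step concrete, here is a minimal sketch in plain Python. The embedding, weights and class names are made-up toy values, not taken from any real model; a real network would learn them during training.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedding, weights, biases, class_names):
    """Map an embedding to one probability score per class (a linear output layer)."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(class_names, softmax(logits)))

# Toy values for illustration only.
class_names = ["doberman", "poodle", "dalmatian"]
embedding = [0.9, 0.1, 0.4]          # stand-in for the network's learned features
weights = [
    [1.0, 0.0, 0.5],                 # one row of weights per class
    [0.0, 1.0, 0.5],
    [0.2, 0.2, 1.0],
]
biases = [0.0, 0.0, 0.0]

scores = classify(embedding, weights, biases, class_names)
predicted = max(scores, key=scores.get)  # class with the highest probability
```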
Caption: OpenAI’s CLIP incorrectly classified an apple as an iPod due to a text label. Source: OpenAI
Enter DALL-E 2
Let’s take the example of a dog breed classifier and a class for which it is a bit harder to find images: Dalmatian dogs. Can we use DALL-E to solve our lack-of-data problem?
Consider applying the following techniques, all powered by DALL-E 2:
Vanilla use. Enter the class name as part of a text prompt to DALL-E and add the generated images to that class’s samples. For example, “A Dalmatian dog in the park chasing a bird.”
Different environments and styles. To improve the model’s ability to generalize, use prompts with different environments while keeping the same class. For example: “A Dalmatian dog on the beach chasing a bird.” The same goes for the style of the generated image, for example “A Dalmatian dog in the park chasing a bird in the style of a cartoon.”
Adversarial examples. Use the class name to create a dataset of adversarial examples. For example, “A Dalmatian-style car.”
Variations. One of the new features of DALL-E 2 is the ability to generate multiple variations of an input image. It can also take a second image and fuse the two together by combining the most prominent aspects of each. One can then write a script that feeds in all the existing images of the dataset to generate dozens of variations per class.
Inpainting. DALL-E 2 can also perform realistic edits on existing images, adding and removing elements while taking into account shadows, reflections and textures. This can be a strong data augmentation technique to further train and improve the underlying model.
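The prompt-based techniques above lend themselves to a small script. The sketch below only builds the text prompts and pairs each one with its class label; it does not call DALL-E. The function name `build_labeled_prompts` and the record layout are assumptions made for illustration.

```python
from itertools import product

# Environments and styles taken from the examples above; extend as needed.
ENVIRONMENTS = ["in the park", "on the beach"]
STYLES = ["", "in the style of a cartoon"]  # empty string means no style suffix

def build_labeled_prompts(class_name, action, environments, styles):
    """Return one record per (environment, style) pair, already labeled."""
    records = []
    for environment, style in product(environments, styles):
        parts = [f"A {class_name} dog", environment, action, style]
        prompt = " ".join(part for part in parts if part) + "."
        # The label is known up front, so no human labeling step is needed.
        records.append({"label": class_name, "prompt": prompt})
    return records

records = build_labeled_prompts("Dalmatian", "chasing a bird", ENVIRONMENTS, STYLES)
for record in records:
    print(record["label"], "|", record["prompt"])
```

Because the class name is part of every prompt, each generated image inherits its label from the record that produced it.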
Apart from generating more training data, the huge advantage of all the above techniques is that the newly generated images are already labeled, eliminating the need for human labeling.
While image-generating techniques such as generative adversarial networks (GANs) have been around for some time, DALL-E 2 stands out for its high-resolution 1024×1024 generations, its multimodal nature of converting text into images, and its strong semantic consistency, i.e. its understanding of the relationship between different objects in a given image.
Automate dataset creation with GPT-3 + DALL-E
The input of DALL-E is a text prompt of the image we want to generate. We can use GPT-3, a text-generating model, to generate dozens of textual prompts per class that will then be fed into DALL-E, which in turn will create dozens of images that will be stored per class.
For example, we can generate prompts containing different environments for which we want DALL-E to create images of dogs.
Caption: A GPT-3 generated prompt to be used as input to DALL-E. Source: author
Using this example and a template-like phrase such as “A [class_name] [gpt3_generated_actions],” we were able to feed DALL-E with the following prompt: “A Dalmatian lying on the ground.” This can be further optimized by fine-tuning GPT-3 to produce dataset captions, such as those in the OpenAI Playground example above.
To further increase confidence in the newly added samples, one can set a confidence threshold and keep only the generations whose ranking passes it, as each generated image is ranked by an image-to-text model called CLIP.
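The GPT-3 to DALL-E flow can be sketched as a small pipeline. Both models are passed in as plain functions, and the two `fake_*` functions below are hypothetical stand-ins rather than real API calls; in practice they would be replaced with whatever client code wraps each model.

```python
from collections import defaultdict

def fill_template(class_name, action):
    """Fill the 'A [class_name] [gpt3_generated_actions]' template."""
    return f"A {class_name} {action}."

def build_dataset(class_names, suggest_actions, generate_image):
    """Collect (prompt, image) pairs per class.

    suggest_actions(class_name) -> list of action phrases (the text model's role)
    generate_image(prompt)      -> an image object or path (the image model's role)
    """
    dataset = defaultdict(list)
    for class_name in class_names:
        for action in suggest_actions(class_name):
            prompt = fill_template(class_name, action)
            dataset[class_name].append((prompt, generate_image(prompt)))
    return dict(dataset)

# Hypothetical stand-ins so the sketch runs without any external service.
def fake_suggest_actions(class_name):
    return ["lying on the ground", "running on the beach"]

def fake_generate_image(prompt):
    return f"<image for: {prompt}>"

dataset = build_dataset(["Dalmatian"], fake_suggest_actions, fake_generate_image)
```

Injecting the two models as functions keeps the dataset logic testable on its own and independent of any particular API.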
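A minimal sketch of such a filter follows. It assumes each generation already carries a numeric CLIP-based score where higher means a better match to the prompt; the `Generation` record, the score values and the 0.3 threshold are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Generation:
    prompt: str
    image_id: str
    clip_score: float  # assumed: higher means a closer image-text match

def keep_confident(generations, threshold):
    """Keep only the generations whose score meets the threshold."""
    return [g for g in generations if g.clip_score >= threshold]

# Illustrative records with invented scores.
generations = [
    Generation("A Dalmatian lying on the ground.", "img_001", 0.34),
    Generation("A Dalmatian lying on the ground.", "img_002", 0.21),
    Generation("A Dalmatian lying on the ground.", "img_003", 0.30),
]

kept = keep_confident(generations, threshold=0.3)
kept_ids = [g.image_id for g in kept]
```

The right threshold depends on the dataset, so it is worth tuning against a small hand-checked sample.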
Limitations and Mitigations
If not used carefully, DALL-E can generate inaccurate images or images of narrow scope, excluding specific ethnic groups or ignoring traits in ways that could lead to bias. A simple example is a face detector trained only on images of men. In addition, the use of DALL-E generated images can pose a significant risk in specific domains such as pathology or self-driving cars, where the cost of a false negative is extreme.
DALL-E 2 still has some limitations, composition being one of them. It can be risky to rely on prompts that assume, for example, that objects are placed correctly.
Caption: DALL-E still struggles with some prompts. Source: Twitter
Ways to mitigate this include human sampling, where a human expert randomly selects samples and checks their validity. To optimize such a process, one can take an active-learning approach, where images with the lowest CLIP ranking for a given caption are prioritized for review.
DALL-E 2 is yet another exciting research result from OpenAI that opens the door to new types of applications. Generating massive datasets to address data, one of computer vision’s biggest bottlenecks, is just one example.
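The prioritization step can be sketched as a review queue that surfaces the lowest-scoring images first, up to a fixed review budget. The `(image_id, clip_score)` pair layout and the scores are assumptions for illustration.

```python
import heapq

def review_queue(scored_images, budget):
    """Return up to `budget` images with the lowest scores, lowest first.

    scored_images: iterable of (image_id, clip_score) pairs, where a lower
    score is assumed to mean a weaker match between image and caption.
    """
    return heapq.nsmallest(budget, scored_images, key=lambda item: item[1])

# Invented scores for illustration.
scored_images = [
    ("img_001", 0.34),
    ("img_002", 0.21),
    ("img_003", 0.30),
    ("img_004", 0.18),
]

to_review = review_queue(scored_images, budget=2)
```

With a budget of two, the reviewer sees `img_004` and then `img_002`, the two weakest matches, before anything else.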
OpenAI indicates it will release DALL-E sometime in the coming summer, most likely in a phased release with a pre-screening for interested users. Those who cannot wait, or cannot pay for this service, can tinker with open source alternatives such as DALL-E Mini (Interface, Playground repository).
While the business case for many DALL-E based applications will depend on the pricing and policies OpenAI puts in place for its API users, all of them are sure to make a big leap forward in image generation.
Sahar Mor has 13 years of engineering and product management experience focused on AI products. He is currently a Product Manager at Stripe, where he leads strategic data initiatives. He previously founded AirPaper, a document intelligence API powered by GPT-3, and was a founding Product Manager at Zeitgold (acquired by Deel), a B2B AI accounting software company where he built and scaled its human-in-the-loop product, and at Levity.ai, a no-code AutoML platform. He also worked as a technical manager at early stage startups and at the elite Israeli intelligence unit 8200.
This post, How DALL-E 2 Can Solve Major Computer Vision Challenges, was originally published at “https://venturebeat.com/2022/04/16/how-dall-e-2-could-solve-major-computer-vision-challenges/”