DALL-E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. It shows a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
Overview of a prompt in DALL-E
Like GPT-3, DALL-E is a transformer language model. It receives the text and the image together as a single stream of data containing up to 1280 tokens, and it is trained with maximum likelihood to generate all of the tokens, one after another. This training procedure lets DALL-E not only create an image from scratch but also regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
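To make the single-stream layout concrete, here is a minimal sketch of how the positions of a regenerated region could be located in the token stream. It assumes the split reported in OpenAI's DALL-E announcement (up to 256 text tokens followed by 1024 image tokens on a 32 x 32 grid in row-major order), which this section does not state. The function name `region_positions` is hypothetical and is not part of any DALL-E API.

```python
# Assumed layout (not stated in this section): up to 256 text tokens,
# then 1024 image tokens laid out as a 32 x 32 grid in row-major order.
TEXT_LEN = 256
GRID = 32
IMAGE_LEN = GRID * GRID
STREAM_LEN = TEXT_LEN + IMAGE_LEN  # 1280 tokens in total


def region_positions(top, left, grid=GRID, text_len=TEXT_LEN):
    """Return stream indices of the image tokens inside the rectangle
    that starts at grid cell (top, left) and extends to the bottom-right
    corner. These are the positions a sampler would regenerate; every
    other position would keep its original token."""
    if not (0 <= top < grid and 0 <= left < grid):
        raise ValueError("top and left must lie inside the grid")
    return [
        text_len + row * grid + col
        for row in range(top, grid)
        for col in range(left, grid)
    ]
```

Calling `region_positions(0, 0)` covers every image position, which corresponds to generating an image from scratch, while `region_positions(16, 16)` covers only the bottom-right quadrant.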
Image generation by DALL-E
DALL-E can produce plausible images for a wide variety of sentences that explore the compositional structure of language. The samples shown for each caption in the visuals are the top 32 of 512 generated candidates after reranking with CLIP.
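The reranking step amounts to scoring every candidate and keeping the best few. The sketch below illustrates that selection only. The `score` callable is a hypothetical stand-in for a CLIP image-text similarity, and `rerank_top_k` is an illustrative name rather than a function from any DALL-E or CLIP library.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rerank_top_k(
    candidates: Sequence[T],
    score: Callable[[T], float],
    k: int = 32,
) -> List[T]:
    """Return the k highest-scoring candidates, best first.

    `score` is a placeholder for a model-based similarity between a
    candidate image and the caption; higher means a better match.
    """
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:k]
```

With 512 candidates and the default `k=32`, this keeps the same proportion as the samples described above.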