After having all kinds of fun with Midjourney, here are two things I realized about the state of AI image generation:

1️⃣ Magic and failure are two sides of the same coin

Working with Midjourney is like working with an artistic savant who struggles with language comprehension.

Given the prompt, "a potato eating a potato", it draws the cutest potatoes. None of which are eating other potatoes.

Given the prompt, "burger and fries on a picnic table", you might be disappointed at the lack of fries.

After reading up on how diffusion models and text prompts work, I suspect it’s because the captions in the training data rarely spell out this kind of detail: who is doing what to what, or where things sit in the scene.

A cat wearing a tiny beret might just be labelled “French cat”. That’s what lets the trained model do what seems like magic, but it’s also what leads to its failures. The magic is that you can ask for a “French cat” and get something on point. The failure is that it can’t handle descriptions like “French cat looking off to the left and sitting next to a croissant”. (Out of 4 images, it scored 0 for French cat, 2 for left, and an “inappropriate” for croissant.)

Most, if not all, of the training data comes from the internet, and I think it’s rare for image captions there to contain specific details like these.

My takeaway is that you should write prompts based not on what you'd like to see, but rather, based on how someone might write a caption for it.
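
If you want to experiment with the caption-style idea yourself, here’s a minimal sketch using an open Stable Diffusion checkpoint via Hugging Face’s diffusers library (Midjourney has no public API, so this is a stand-in; the model ID and guidance_scale below are common defaults, not settings from my own runs):

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: any public Stable Diffusion checkpoint works here;
# this is just a widely used one.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# "What I'd like to see" phrasing vs. "how someone might caption it" phrasing.
prompts = [
    "a potato eating a potato",                           # wishlist-style
    "cute cartoon potato character, food illustration",   # caption-style
]

for prompt in prompts:
    image = pipe(prompt, guidance_scale=7.5).images[0]
    image.save(prompt.replace(" ", "_")[:40] + ".png")
```

The caption-style prompt isn’t “better” in any deep sense; it just reads more like the text the model actually saw next to images during training.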



2️⃣ Image and text generation play by different rules

You’ve probably heard of the astounding factual failures of ChatGPT and Bing. Those failures hurt because specifics matter in most writing, even fiction writing.

But there are plenty of applications for image generation where specifics DON’T matter.

Example: A brochure about heart health needs vibrant colors and happy, healthy people—it doesn’t matter if they’re facing left or right, taking a walk, or sitting at a restaurant.

Given the nature of generative AI and its current lack of real-world understanding, correctness is a huge problem. But not all applications require correctness. And we’re probably more likely to find such applications with images than with text.

Note: A counterpoint to this “image generation is more useful” vibe is that generated text is much easier to edit than generated images.