Creating Images and Videos with ChatGPT

Feature	GPT-4o	GPT-4o mini
Image Generation	Yes	No
Data Analysis	Yes	No
Advanced Features	Full Access	Limited
Message Limits	Higher	Lower

Next, let's explore ChatGPT's image generation capabilities, powered by DALL-E technology. This feature represents a significant leap in AI-powered visual content creation, though it comes with specific requirements and limitations you need to understand.

Image generation requires GPT-4o or the earlier GPT-4 model; GPT-3.5 lacks this capability entirely. If you exhaust your message quota, bing.com/create offers essentially the same DALL-E functionality as a fallback. This is particularly useful because the streamlined GPT-4o mini model cannot generate images at all.

The model hierarchy matters here. Mini versions sacrifice capabilities for speed and cost efficiency: they can't perform data analysis, create images, or handle complex multimodal tasks. You need the full GPT-4o model for image generation. When your GPT-4o messages are exhausted, you can fall back to the original GPT-4, though OpenAI's naming creates unnecessary confusion. The "4" model is the first generation, while "4o" (the "o" stands for "omni") is the later iteration, a counterintuitive scheme that trips up many users.
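The fallback order described above can be written down as a small lookup. This is only a sketch: the model labels, the `CAN_GENERATE_IMAGES` table, and the `pick_image_source` helper are hypothetical names chosen for illustration, not part of any OpenAI API, and the quota numbers are made up.

```python
# Illustrative labels only; these are not calls into any OpenAI API.
# Image capability per model, as described in the text above.
CAN_GENERATE_IMAGES = {
    "GPT-4o": True,
    "GPT-4": True,
    "GPT-4o mini": False,
}

# Order to try: the current full model first, then the original GPT-4.
PREFERENCE = ["GPT-4o", "GPT-4", "GPT-4o mini"]

# External fallback mentioned in the text for when quotas run out.
EXTERNAL_FALLBACK = "bing.com/create"


def pick_image_source(remaining_messages):
    """Return the first image-capable model with quota left, else the external fallback."""
    for model in PREFERENCE:
        if CAN_GENERATE_IMAGES[model] and remaining_messages.get(model, 0) > 0:
            return model
    return EXTERNAL_FALLBACK


# Hypothetical quota snapshot: GPT-4o is used up, GPT-4 still has messages.
print(pick_image_source({"GPT-4o": 0, "GPT-4": 12, "GPT-4o mini": 100}))
```

Note that the mini model is never chosen for images, however many messages it has left, which is the point the paragraph above makes.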

Let's demonstrate with a practical example. I'll generate an image of "puppies running in a backyard" to show you the process and potential results.

Here's something fascinating I keep saved in my account: an incredibly realistic image of a fluffy poodle with elaborate grooming, playing with a chew toy. The photorealism is remarkable—this puppy never existed. The image is entirely AI-generated, yet it captures emotional nuance and physical detail that rivals professional photography.

However, AI-generated images often contain subtle flaws. In this poodle image, there's an ambiguous brown object on the ground that could be interpreted as a toy or something less appealing. Fortunately, DALL-E includes editing capabilities. Using the brush tool, I can select problem areas and provide simple instructions like "remove the brown thing." The AI intelligently fills the space, maintaining visual coherence while eliminating unwanted elements.

Let's try something more complex: generating an image of skiers at a ski resort. This request reveals a critical aspect of how ChatGPT processes image prompts—it doesn't simply pass your words to DALL-E verbatim.

When I input "image of skiers at a ski area," ChatGPT dramatically expands this into: "A lively scene at a ski area with skiers of various ages and skill levels gliding down snowy slopes. The landscape includes tall, snow-covered pine trees, a clear blue sky, and a cozy ski lodge at the base of the mountain. Some skiers are wearing colorful winter gear. There are ski lifts in the background. The atmosphere is vibrant and energetic, capturing the joy of winter sports and a festive holiday vibe."

This automatic prompt enhancement can be helpful, but it may not align with your vision. Understanding this process allows you to craft more precise initial prompts or copy and modify the expanded version to better suit your needs.
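One way to "copy and modify" the expanded prompt is to drop the sentences you did not ask for before pasting it back in. The sketch below assumes a simple rule (remove any sentence that mentions an unwanted phrase); the helper name `trim_expanded_prompt` is hypothetical, and the sentence splitting is deliberately naive.

```python
import re

# The expanded prompt quoted above, copied from the ChatGPT response.
EXPANDED = (
    "A lively scene at a ski area with skiers of various ages and skill levels "
    "gliding down snowy slopes. "
    "The landscape includes tall, snow-covered pine trees, a clear blue sky, "
    "and a cozy ski lodge at the base of the mountain. "
    "Some skiers are wearing colorful winter gear. "
    "There are ski lifts in the background. "
    "The atmosphere is vibrant and energetic, capturing the joy of winter sports "
    "and a festive holiday vibe."
)


def trim_expanded_prompt(expanded, unwanted):
    """Drop every sentence that mentions an unwanted phrase (case-insensitive)."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", expanded.strip())
    kept = [
        s for s in sentences
        if not any(phrase.lower() in s.lower() for phrase in unwanted)
    ]
    return " ".join(kept)


# Example: keep the slopes and scenery, remove the lifts and the holiday mood.
print(trim_expanded_prompt(EXPANDED, ["ski lifts", "holiday"]))
```

The trimmed text can then be pasted back as the new prompt, ideally with an instruction to use it as written.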

The distinction between requesting "images" versus "photographs" proves crucial. When you use terms like "realistic" or "naturalistic," you're employing illustration vocabulary—language typically applied to paintings and drawings. Real photographs are never described as "realistic" because reality is their inherent nature. Instead, photographs are characterized by technical specifications: lens type, aperture settings, lighting conditions, and composition techniques.

Compare these approaches: requesting a "realistic dog playing with a chew toy" yields an illustration-style result, but specifying "photograph of a dog playing with a chew toy, shot with a 50mm lens at f/4, golden hour lighting" produces far more photographic results. A plausible explanation is that image models are trained on photographs whose captions and accompanying descriptions often mention exactly these technical details.

Let me refine our ski scene using photographic terminology: "Photo of three skiers at a ski area, shot with a 50mm lens at f/4, during midday sunlight." The results improve significantly, though you may need several iterations to achieve your exact vision—a process that can quickly consume your image generation quota.
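If you write many prompts of this kind, a tiny template keeps the photographic details consistent across iterations. The following is a minimal sketch; `photo_prompt` and its parameters are hypothetical names for illustration, and the output is ordinary text to paste into ChatGPT.

```python
def photo_prompt(subject, lens_mm=None, aperture=None, lighting=None, extras=()):
    """Compose a photograph-style prompt from a subject and optional camera details."""
    parts = [f"Photo of {subject}"]
    # Lens and aperture read most naturally as a single phrase when both are given.
    if lens_mm is not None and aperture is not None:
        parts.append(f"shot with a {lens_mm}mm lens at f/{aperture}")
    elif lens_mm is not None:
        parts.append(f"shot with a {lens_mm}mm lens")
    elif aperture is not None:
        parts.append(f"shot at f/{aperture}")
    if lighting:
        parts.append(lighting)
    # Any further details (composition, depth of field) are appended as given.
    parts.extend(extras)
    return ", ".join(parts)


# Rebuilds the ski prompt used above.
print(photo_prompt("three skiers at a ski area", lens_mm=50, aperture=4,
                   lighting="during midday sunlight"))
```

Changing one argument at a time (say, the lighting) makes it easier to see which detail actually altered the result, which matters when each attempt counts against your quota.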

Even improved images contain telltale AI artifacts when examined closely. Look for inconsistencies in mechanical objects, impossible geometries, or anatomical errors. Hands remain particularly challenging—missing or extra fingers are common giveaways. While casual viewers might not notice these flaws, they become apparent under scrutiny.

Certain concepts prove stubbornly resistant to modification. For instance, when generating images of "geeks" or "smart people," the AI invariably adds glasses, regardless of instructions to the contrary. I've attempted numerous approaches—specifying contact lenses, explicitly stating "no glasses," even trying reverse psychology—but the association remains unbreakable. This reflects deep-seated training data biases that current models cannot easily overcome.

Text generation within images remains problematic across all AI image generators, not just DALL-E. Most generated text appears as illegible gibberish rather than readable content. Until this limitation is resolved, avoid incorporating textual elements in your image requests.

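Video generation represents the next frontier: at a typical 30 frames per second, even a ten-second clip requires 300 images that stay coherent with one another. While companies are making progress, the ChatGPT workflow covered here produces static images only. OpenAI announced its video model, Sora, in February 2024 and released it to paying ChatGPT subscribers in December 2024 as a separate tool.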

For optimal results, follow these professional guidelines: Replace illustration vocabulary ("realistic," "naturalistic") with photography terms ("photograph," "shot with," specific lighting conditions). Be extremely specific about your requirements—dog breed, colors, positioning, environment, and technical specifications. Describe lighting conditions precisely: golden hour creates warm, cinematic effects while ceiling lights produce even, professional illumination.
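The first guideline can be checked mechanically before you spend a generation on a prompt. This is a rough sketch under the assumption that the only illustration-style words worth flagging are the two named above; `flag_illustration_terms` is a hypothetical helper, and you would extend the word list to suit your own habits.

```python
import re

# Illustration-style words named in the guidelines above.
ILLUSTRATION_TERMS = ("realistic", "naturalistic")


def flag_illustration_terms(prompt):
    """Return the illustration-vocabulary words found in a prompt, in listed order."""
    found = []
    for term in ILLUSTRATION_TERMS:
        # Whole-word, case-insensitive match, so "photorealistic" is not flagged.
        if re.search(rf"\b{re.escape(term)}\b", prompt, flags=re.IGNORECASE):
            found.append(term)
    return found


for prompt in (
    "realistic dog with chew toy",
    "Photograph of a golden retriever puppy with a rope toy, golden hour lighting",
):
    flagged = flag_illustration_terms(prompt)
    if flagged:
        print(f"Consider photography terms instead of {flagged}: {prompt}")
    else:
        print(f"No illustration vocabulary found: {prompt}")
```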

Consider this progression: "realistic dog with chew toy" produces an obvious illustration. "Photograph of a golden retriever puppy with a rope toy, shot with an 85mm lens at f/2.8, golden hour lighting, shallow depth of field" yields professional-quality photographic results. The difference lies entirely in prompt sophistication.

Art style specifications matter equally for illustrated content. Instead of generic requests, specify "oil painting in the style of the Dutch masters" or "minimalist line drawing with spot color." For photographs, experiment with different lighting scenarios—golden hour for warmth, overcast conditions for even tones, or dramatic side lighting for artistic effect.

Remember that ChatGPT serves primarily as an image generator, not an editor. The brush tool can touch up images that DALL-E has generated, but unlike Photoshop with integrated Adobe Firefly, which can intelligently modify existing photographs, DALL-E cannot directly edit photos you supply; it creates new images. You can upload reference images for style or color guidance, but for comprehensive image editing, professional tools remain necessary.

The key to mastering AI image generation lies in understanding how to communicate visually through text. The more precisely you can describe your vision using appropriate technical vocabulary, the closer your results will match your intentions. This skill becomes increasingly valuable as AI visual tools continue evolving throughout 2026 and beyond.
