Can ChatGPT “see”?

by Will Mayall, April 27, 2023

I used ChatGPT-4 to generate a simple picture as SVG code, and it performed exceptionally well. In a separate conversation, I asked ChatGPT-4 to identify and describe the elements of the image, which it did with remarkable accuracy.

The original image created by ChatGPT:

Although the training corpus may contain SVG code describing each element of this original image, the composition is still impressive. It is improbable, though not impossible, that any existing SVG code features this exact image.

In a new conversation, I asked ChatGPT to label the objects in the raw SVG code. Nothing in the code “hinted” at what the image contained. Remember that, from ChatGPT’s point of view, it had never “seen” the image before labeling it. The image with ChatGPT’s labels:

When asked to describe the image, ChatGPT responded:

Overall, the image appears to be a simple, abstract representation of a landscape, including mountains, ground, and a gray surface, possibly a road.

The outcome is striking. The shapes in the image are elementary and could symbolize many things. How did the AI determine one triangle to be a mountain and another a snowcap?

The results suggest that the AI “comprehends” the image as a whole: it appears to use the context of the surrounding shapes to interpret each one. This is akin to a blind person describing an image based on descriptions of the objects in it.

The Process I Used

For each step, I initiated a new conversation to (hopefully) eliminate the possibility that ChatGPT was using the history to figure out the image.

I made one to three requests and selected the best result. ChatGPT performed decently on every attempt, but I wanted the best outcome.

I asked ChatGPT to generate a landscape image using SVG, and it excelled.

After removing any "hints" from the SVG code, I asked ChatGPT to examine the SVG, label the objects, and describe the image.

Creating the image - (new chat)

WM: Use SVG to create an image of a mountain under a cloud-filled sky. Make the sky look like a sunset. Put snow on the mountain. Put foothills in front of the mountain. Put a rolling green meadow between a road and the foothills and mountain.

Code created by ChatGPT-4:

<svg width="800" height="600" xmlns="http://www.w3.org/2000/svg">
  <!-- Sunset background -->
  <defs>
    <linearGradient id="sky-gradient" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style="stop-color:#1E90FF" />
      <stop offset="100%" style="stop-color:#FFA500" />
    </linearGradient>
  </defs>
  <rect width="100%" height="100%" fill="url(#sky-gradient)" />
  <!-- Clouds -->
  <circle cx="150" cy="100" r="50" fill="white" />
  <circle cx="200" cy="80" r="60" fill="white" />
  <circle cx="250" cy="100" r="50" fill="white" />
  <!-- Mountain -->
  <polygon points="300,400 500,150 700,400" fill="#8B4513" />
  <!-- Snow on mountain -->
  <polygon points="500,150 470,220 530,220" fill="white" />
  <!-- Foothills -->
  <polygon points="200,400 350,300 500,400" fill="#A0522D" />
  <!-- Green Meadow -->
  <polygon points="0,500 800,500 800,400 0,400" fill="#32CD32" />
  <!-- Road -->
  <rect x="0" y="500" width="100%" height="100" fill="#696969" />
</svg>

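For the next two steps, I provided ChatGPT with a simplified version of the above SVG code. The comments are removed and the gradient id is renamed from "sky-gradient" to "gradient", so that nothing "hints" at what is in the image: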

<svg width="800" height="600" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <linearGradient id="gradient" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style="stop-color:#1E90FF" />
      <stop offset="100%" style="stop-color:#FFA500" />
    </linearGradient>
  </defs>
  <rect width="100%" height="100%" fill="url(#gradient)" />
  <circle cx="150" cy="100" r="50" fill="white" />
  <circle cx="200" cy="80" r="60" fill="white" />
  <circle cx="250" cy="100" r="50" fill="white" />
  <polygon points="300,400 500,150 700,400" fill="#8B4513" />
  <polygon points="500,150 470,220 530,220" fill="white" />
  <polygon points="200,400 350,300 500,400" fill="#A0522D" />
  <polygon points="0,500 800,500 800,400 0,400" fill="#32CD32" />
  <rect x="0" y="500" width="100%" height="100" fill="#696969" />
</svg>

I’ll refer to the above as: [SVG code for landscape image]
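The hint removal comes down to two mechanical changes: deleting the XML comments and renaming the descriptive gradient id. One way to reproduce that step is sketched below in Python. The function name `strip_hints` and the regex-based approach are assumptions made for illustration (the same edit can just as easily be done by hand), and the input is only an excerpt of the full SVG.

```python
import re

def strip_hints(svg, renames):
    """Remove XML comments and rename descriptive ids in SVG source text."""
    # Delete every <!-- ... --> comment, including multi-line ones.
    svg = re.sub(r"<!--.*?-->", "", svg, flags=re.DOTALL)
    # Swap descriptive identifiers (e.g. "sky-gradient") for neutral ones.
    for old, new in renames.items():
        svg = svg.replace(old, new)
    # Drop the lines left empty once the comments are gone.
    return "\n".join(line for line in svg.splitlines() if line.strip())

# Excerpt of the generated SVG, used here only as sample input.
original = """<svg width="800" height="600" xmlns="http://www.w3.org/2000/svg">
  <!-- Sunset background -->
  <defs>
    <linearGradient id="sky-gradient" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style="stop-color:#1E90FF" />
      <stop offset="100%" style="stop-color:#FFA500" />
    </linearGradient>
  </defs>
  <rect width="100%" height="100%" fill="url(#sky-gradient)" />
  <!-- Mountain -->
  <polygon points="300,400 500,150 700,400" fill="#8B4513" />
</svg>"""

stripped = strip_hints(original, {"sky-gradient": "gradient"})
print(stripped)
```

Note that the fill colors and coordinates are left untouched, so any clue they carry (white shapes near the top, green and gray bands at the bottom) is still available to the model.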

Adding Labels - (new chat)

WM: Using SVG, copy the below image and add labels for each object and area:
[SVG code for landscape image]

ChatGPT-4: I have added labels for each object and area in the SVG code below:

<svg width="800" height="600" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <linearGradient id="gradient" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style="stop-color:#1E90FF" />
      <stop offset="100%" style="stop-color:#FFA500" />
    </linearGradient>
  </defs>
  <rect width="100%" height="100%" fill="url(#gradient)" />
  <text x="10" y="20" fill="black" font-size="20">Sky</text>

  <circle cx="150" cy="100" r="50" fill="white" />
  <text x="120" y="170" fill="black" font-size="16">Cloud 1</text>

  <circle cx="200" cy="80" r="60" fill="white" />
  <text x="170" y="150" fill="black" font-size="16">Cloud 2</text>

  <circle cx="250" cy="100" r="50" fill="white" />
  <text x="220" y="170" fill="black" font-size="16">Cloud 3</text>

  <polygon points="300,400 500,150 700,400" fill="#8B4513" />
  <text x="520" y="300" fill="black" font-size="20">Mountain 1</text>

  <polygon points="500,150 470,220 530,220" fill="white" />
  <text x="450" y="250" fill="black" font-size="16">Snowcap</text>

  <polygon points="200,400 350,300 500,400" fill="#A0522D" />
  <text x="270" y="400" fill="black" font-size="20">Mountain 2</text>

  <polygon points="0,500 800,500 800,400 0,400" fill="#32CD32" />
  <text x="10" y="480" fill="black" font-size="20">Grass</text>

  <rect x="0" y="500" width="100%" height="100" fill="#696969" />
  <text x="10" y="570" fill="white" font-size="20">Ground</text>
</svg>

This SVG code includes labels for each object and area: Sky, Cloud 1, Cloud 2, Cloud 3, Mountain 1, Snowcap, Mountain 2, Grass, and Ground.

Describing the image - (new chat)

WM: Describe what this SVG image is of:
[SVG code for landscape image]

ChatGPT-4: This SVG image consists of various shapes and colors, creating a simple scene with geometric elements. Here's a description of the elements in the image:

A linear gradient background is created using the colors #1E90FF (a deep sky blue) and #FFA500 (an orange color) which transition vertically from top to bottom.

Three white circles are arranged horizontally, with the first circle having a center at (150, 100) and a radius of 50, the second circle centered at (200, 80) with a radius of 60, and the third circle centered at (250, 100) with a radius of 50.

A brown (#8B4513) triangle is formed using points (300, 400), (500, 150), and (700, 400). This shape resembles a mountain.

A white triangle is formed using points (500, 150), (470, 220), and (530, 220). This shape is placed atop the brown triangle, suggesting it could be a snowcap on the mountain.

A reddish-brown (#A0522D) triangle is formed using points (200, 400), (350, 300), and (500, 400). This shape may represent another mountain or hill.

A green (#32CD32) quadrilateral is created using points (0, 500), (800, 500), (800, 400), and (0, 400), which seems to represent grass or ground in the scene.

A gray (#696969) rectangle is placed at the bottom of the image, covering the entire width (800) and with a height of 100. This rectangle could represent a road or a solid surface.

Overall, the image appears to be a simple, abstract representation of a landscape, including mountains, ground, and a gray surface, possibly a road.
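The spatial relationships that ChatGPT mentions can be checked directly from the coordinates. The Python sketch below is a hypothetical helper, not part of the experiment. It parses the polygons and tests whether the white triangle shares its top vertex with the brown one, and whether the green band ends exactly where the gray rectangle begins. It says nothing about how the model reached its answer; it only shows that these cues are present in the raw code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Shapes copied from the stripped SVG (background and circles omitted).
svg = """<svg width="800" height="600" xmlns="http://www.w3.org/2000/svg">
  <polygon points="300,400 500,150 700,400" fill="#8B4513" />
  <polygon points="500,150 470,220 530,220" fill="white" />
  <polygon points="200,400 350,300 500,400" fill="#A0522D" />
  <polygon points="0,500 800,500 800,400 0,400" fill="#32CD32" />
  <rect x="0" y="500" width="100%" height="100" fill="#696969" />
</svg>"""

def parse_points(points_attr):
    """Turn an SVG points attribute like "300,400 500,150" into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]

def apex(points):
    """Return the topmost vertex; SVG y grows downward, so that is the smallest y."""
    return min(points, key=lambda pt: pt[1])

root = ET.fromstring(svg)
# Index the polygons by fill color, which is unique for each one here.
polygons = {p.get("fill"): parse_points(p.get("points"))
            for p in root.iter(SVG_NS + "polygon")}
rect = root.find(SVG_NS + "rect")

brown, white, green = polygons["#8B4513"], polygons["white"], polygons["#32CD32"]

# Cue 1: the white triangle sits on the peak of the brown triangle.
shares_apex = apex(white) == apex(brown)
# Cue 2: the green band's lower edge coincides with the top of the gray rectangle.
band_meets_rect = max(y for _, y in green) == int(rect.get("y"))

print("white triangle on brown peak:", shares_apex)
print("green band meets gray rectangle:", band_meets_rect)
```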

Version of article on Medium