Image-to-Text Generation for New Yorker Cartoons
How I Used Deep Learning to Create New Yorker Captions
No computer has ever won the New Yorker Cartoon Caption Contest, and for good reason: it’s hard. The New Yorker magazine runs a weekly contest where readers submit captions for an uncaptioned cartoon like the one below. The winning caption gets published in the magazine.

People have created New Yorker caption generators, but none actually use the image of the cartoon to generate the caption. Instead, they use previous captions to predict what a new one would be like [1].
Going directly from the cartoon image to a caption is hard for a few reasons:
- Cartoons usually depict situations and images that never occur in real life (a fish sitting at a bar?).
- Even if we had a model that recognized subjects in a cartoon, it’s difficult to convert that understanding into a witty caption.
I tackled this task as a learning exercise for image-to-text algorithms, and after a lot of trial and error, I created an image-to-text model that produced decent captions for New Yorker cartoons based on the image alone. Here’s how I did it.

Data
The first step was to get training data. NEXTml has a GitHub repository where you can access the raw cartoon images and candidate captions from 186 contests. Each contest has an average of ~5,000 candidate captions. After downloading this dataset, I split it into a training set and a test set of 127 and 59 images, respectively [2].
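Here’s a rough sketch of that split (see note [2] for the rule): cartoons that come with a textual description go into the training set, and the rest are held out for testing. The file name and the `description` field below are hypothetical stand-ins for however you store the downloaded data locally.

```python
# Rough sketch of the split described in note [2]; "contests.json" and the
# "description" field are hypothetical stand-ins for the local data layout.
import json

with open("contests.json") as f:        # one record per contest (hypothetical)
    contests = json.load(f)

# Cartoons that come with a textual description go to training (the description
# is used later to filter candidate captions); the rest form the test set.
train = [c for c in contests if c.get("description")]
test = [c for c in contests if not c.get("description")]
print(f"{len(train)} training cartoons, {len(test)} test cartoons")
```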
Model
I started with a deep image-to-text model from sgrvinod’s tutorial. Given an image, the model first encodes it into a set of feature vectors that describe what’s in the image. The model then takes the encoded image and translates it into words (“decoding”). While translating the encoded image into words, an attention layer tells the model which part of the image to focus on. The attention layer is particularly cool because it lets you see where the model’s focus is when it chooses each word.
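To make the encoder / attention / decoder structure concrete, here is a minimal PyTorch sketch of the three pieces. It loosely follows the shape of the model in sgrvinod’s tutorial but is not his exact code; the dimensions and layer choices are illustrative assumptions.

```python
# Minimal sketch of an image-to-text model with attention (illustrative, not
# sgrvinod's exact code). Dimensions assume 256x256 input images.
import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """Encode an image into a grid of feature vectors with a pretrained CNN."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet101(pretrained=True)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # drop pool + fc

    def forward(self, images):                     # (batch, 3, 256, 256)
        feats = self.cnn(images)                   # (batch, 2048, 8, 8)
        return feats.flatten(2).permute(0, 2, 1)   # (batch, 64 regions, 2048)

class Attention(nn.Module):
    """Score each image region against the decoder's current hidden state."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1))).squeeze(2)
        alpha = e.softmax(dim=1)                       # where the model "looks"
        context = (feats * alpha.unsqueeze(2)).sum(1)  # weighted sum of regions
        return context, alpha

class Decoder(nn.Module):
    """Generate the caption one word at a time, attending to the image."""
    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, feats, h, c):
        context, alpha = self.attention(feats, h)
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.fc(h), h, c, alpha   # logits over the vocabulary + attention map
```

The `alpha` weights returned at each decoding step are what make the attention visualizations possible: they tell you which image regions the model weighted most when choosing that word.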

I trained a model on the COCO2014 dataset of ~160k images and captions (“COCO model”). However, as shown below, the images and captions from COCO2014 look quite different from what we want to produce for the New Yorker. The images are real-life photos with very straightforward descriptions. No comedy in sight. The next step was to figure out: how do we take this pre-trained model and adapt it for New Yorker captions? This was the most difficult part.

Text Preprocessing
The answer to the concerns mentioned above ended up being “very careful preprocessing.”
Vocabulary Expansion
The first thing I needed to do was to expand the vocabulary of the COCO model. The New Yorker cartoons contained concepts like a “yeti” or “cavemen” that never showed up in the COCO2014 dataset. To get around this, I added words from the New Yorker dataset into the COCO model’s vocabulary and retrained the COCO model. This increased the vocabulary size from 9,490 words to 11,865 words.
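Concretely, the expansion amounts to adding frequent New Yorker words to the COCO word map and then resizing the model’s embedding and output layers to the new vocabulary size before retraining. A minimal sketch, assuming a word-to-index `word_map` like the one sgrvinod’s tutorial produces; the minimum-frequency cutoff here is an illustrative assumption, not the exact value I used.

```python
from collections import Counter

def expand_vocab(word_map, new_yorker_captions, min_freq=5):
    """Add frequent New Yorker words that the COCO word map has never seen.

    word_map: dict mapping word -> index (special tokens like <start>, <end>,
    <unk>, <pad> assumed to already be in it).
    new_yorker_captions: list of tokenized New Yorker captions.
    min_freq: illustrative frequency cutoff.
    """
    counter = Counter(tok for cap in new_yorker_captions for tok in cap)
    for word, count in counter.items():
        if count >= min_freq and word not in word_map:
            word_map[word] = len(word_map)   # append new indices at the end
    return word_map
```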
Caption Filtering
In the New Yorker dataset, the candidate captions for a cartoon are very different from each other. Some mention the object in the image, while others don’t. Captions also start with different words and have different grammatical structures:
“Whaddya mean you only serve fish on Friday”
“Every single school rejected me”
“I KNOW when something is watered down, pal.”
To make the image-to-text problem easier for the model, I filtered the candidate captions in two ways:
- Structure: The caption must start with a pronoun (e.g. “I”, “You”, “They”). This is to make the caption structure more predictable.
- Content: The caption must reference an object in the image. The reason for this is twofold. First, since we are going directly from image to caption, I wanted to force the model to use objects in the image to derive the caption. Second, having keywords (e.g. “fish”) show up in the caption made the mapping easier for the model to learn. To figure out what objects were in an image, I used a rough heuristic: I extracted the nouns from the cartoon’s textual description and required the caption to reference at least one or two of those nouns (a sketch of both filters follows below).
These filters reduced the average number of captions per cartoon from 5,000 to 10, but this was still more than enough data to train the model on [3].
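Here is a simplified sketch of the two filters. The pronoun set and the NLTK-based noun tagging stand in for the rough heuristics described above rather than reproducing them exactly.

```python
# Simplified sketch of the structure and content filters.
import nltk   # requires the "punkt" and "averaged_perceptron_tagger" resources

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def description_nouns(description):
    """Extract the nouns from the cartoon's textual description."""
    tokens = nltk.word_tokenize(description.lower())
    return {tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}

def keep_caption(caption, nouns, min_overlap=1):
    """Keep a caption only if it starts with a pronoun and mentions an object."""
    words = caption.lower().split()
    starts_with_pronoun = bool(words) and words[0].strip('"\'') in PRONOUNS
    mentions_object = len(set(words) & nouns) >= min_overlap
    return starts_with_pronoun and mentions_object
```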
Transfer Learning
After cleaning the data, I continued training the COCO model by giving it images and captions from the New Yorker dataset. The additional training meant that the model could perform small adjustments to its parameters to adapt to the new task. This technique is commonly known as transfer learning. Of note, I kept the learning rate on the higher side (1e-4 in this case) to force the model to learn a new “funny” syntactic structure and recognize objects in black-and-white images.
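A minimal sketch of what this fine-tuning step looks like, assuming checkpoints and a data loader in the style of sgrvinod’s tutorial. The checkpoint file name, the loader, and the simplified decoder call are stand-ins; only the 1e-4 learning rate comes from the setup described above.

```python
# Sketch of the transfer-learning step: load the COCO-pretrained model and keep
# training it on the filtered New Yorker data at a relatively high learning rate.
import torch
import torch.nn as nn

checkpoint = torch.load("BEST_checkpoint_coco.pth.tar")   # hypothetical file name
encoder, decoder = checkpoint["encoder"], checkpoint["decoder"]

params = list(decoder.parameters()) + list(encoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)   # kept on the higher side on purpose
criterion = nn.CrossEntropyLoss()

encoder.train()
decoder.train()
for epoch in range(10):                          # a few extra epochs of fine-tuning
    for images, captions, lengths in new_yorker_loader:   # hypothetical DataLoader
        # Simplified decoder interface: returns flattened word scores and targets.
        scores, targets = decoder(encoder(images), captions, lengths)
        loss = criterion(scores, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```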
Results
Here are some sample captions produced by the model on the test set.

A lot of the captions were 70% of the way there and just needed some tweaking to be caption-worthy (e.g. for the center bottom photo: “i’m sorry this is a social party” → “i’m sorry this is a digital party” or “anti-social party”). That being said, I was impressed by how transfer learning allowed the model to identify themes in the black-and-white cartoon images (e.g. “cave” for the cave painting, “chef” for the man walking into the kitchen). Maybe with a couple more tweaks, we’d have a competitive AI cartoon captioner on our hands.
Here is the model’s submission for this week’s New Yorker caption contest (contest #749).

Final Remarks
There are many ways that this model could be improved, but hopefully the techniques and learnings here can give you a head start in creating your own funny AIs. I’ll be playing around with this model more, and I’d love to hear any of your ideas on how we could improve it! Most of the modeling code was from sgrvinod’s tutorial, and you can find the code I used to preprocess the New Yorker captions in this Github repo.
Notes
[1] To be more precise, previous attempts trained a language model on past captions to generate new ones (think Markov models). Some neat examples include:
- Markov Model based on past finalists.
- Topic → RNN text generation.
- Text Description of Cartoon → RNN text generation.
[2] A technical note: I split the data based on whether the cartoon came with a textual description of its contents, which I used to help filter the captions for training. If the cartoon did not have a textual description, I placed it in the test set. Also, the README of the GitHub repository states there are 155 images, but the database actually has 184 after removing duplicates.
[3] The main data constraint for the image-to-text model in this case was the number of captioned images, not the number of captions per image. For comparison, COCO2014 had 5 captions per image.