How Does #DeepDream Work?
Do neural networks dream of electric dogs?
If you’ve been browsing the web recently, you might have stumbled on some strange-looking images, full of pieces of dog heads, eyes, legs and what look like buildings, sometimes superimposed on a normal photo, sometimes not. Although they can be nightmare-inducing (or perhaps because of that), they have gained a lot of popularity on the Internet. Often tagged #deepdream, they are made by a neural network that was trained on a huge set of categorized images and then set free to generate new ones. The network comes from Google Research, and its code is currently available on GitHub, spawning ever more home-made neural image generators.
It turns out, like many things on the Internet, it has something to do with cats.
In 1971, a British scientist named Sir Colin Blakemore raised a kitten in complete darkness, except for several hours a day in a small cage, where the kitten could only see black and white horizontal stripes. A month or so later, the kitten was introduced to the normal world. It reacted to light, but otherwise seemed to be blind. It didn’t follow moving objects, unless they made a sound. When a recording was taken from the kitten’s visual cortex, it turned out that its neurons reacted to horizontal lines, but not vertical, so the brain was unable to comprehend the complexity of the real world. Another kitten raised in a vertical stripe environment had a similar disability.
The research method is controversial and Sir Colin had his share of threats from angry animal defenders, but the result is interesting. The experiment tells us that the vision system, at least in cats, is something that develops after birth. The visual cortex of a kitten adapts to what the newborn eyes are exposed to and forms neurons that react to dominant basic patterns, like vertical or horizontal lines. From those basic patterns it can then make sense of a more complex image, infer depth and motion. This gives us hope and suggests a method for creating an artificial vision system.
Convolutional neural networks
A feed-forward artificial neural network, when used as a classifier, takes an array of values as input and tries to assign it to a category in response. A relatively small network can be trained to guess a person’s gender given two numbers: height and hair length. We collect some sample data, present the samples to the network, and gradually modify the network’s weights to minimize the error. The network should then find the general rule and give correct answers to samples it has never seen before. If well trained, it will be correct most of the time, wrong mostly in cases where people given only the same two numbers would probably be wrong too, and much faster than us, because people are not good with numbers. Estimating gender from a photo, in turn, is a much easier task for humans, but orders of magnitude harder for computers. An image, represented as a set of numerical pixel values, is a huge input space, impossible for such a network to process in reasonable time. And what if we don’t want to detect genders, but dog breeds, or recognize plants, or find cancer in x-rays? This is where convolutional neural networks come in.
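The height-and-hair-length setup above can be sketched as a single trainable neuron. This is a toy illustration with synthetic, made-up data (not the article's actual experiment), trained by plain gradient descent on the logistic loss:

```python
import numpy as np

# Toy sketch (synthetic data, purely illustrative): one "neuron" learns to
# guess a binary label from two numbers, height (cm) and hair length (cm).
rng = np.random.default_rng(0)
n = 200
height = np.where(rng.random(n) < 0.5, rng.normal(178, 7, n), rng.normal(165, 7, n))
label = (height > 171).astype(float)          # arbitrary synthetic labeling
hair = np.where(label == 1, rng.normal(8, 4, n), rng.normal(25, 10, n))
X = np.column_stack([height, hair])
X = (X - X.mean(0)) / X.std(0)                # normalize the inputs

w, b = np.zeros(2), 0.0
for _ in range(500):                          # gradient descent on log-loss
    p = 1 / (1 + np.exp(-(X @ w + b)))        # sigmoid activation
    grad = p - label                          # error signal
    w -= 0.1 * X.T @ grad / n                 # adjust weights to reduce error
    b -= 0.1 * grad.mean()

p = 1 / (1 + np.exp(-(X @ w + b)))
accuracy = ((p > 0.5) == (label == 1)).mean() # correct most of the time
```

After training, the two learned weights are the network's "general rule": a decision boundary in the height/hair-length plane.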
Convolutional neural networks are a breed of neural networks introduced by Kunihiko Fukushima back in 1980, under the name “Neocognitron”. You may stumble across the name “LeNet”, named after Yann LeCun, a researcher working with Facebook.
A kernel convolution is an image operation that, for each pixel, takes the pixels in its square neighborhood, calculates a weighted average of their values, and writes the result as the new value of that pixel. An equally-weighted average will produce a blurry image. Negative weights for the nearest neighbors will make the image look sharper. With properly selected weights, we can enhance vertical or horizontal lines, or lines at any angle. With a bigger convolution kernel (neighborhood), we can find curved lines, color gradients or simple patterns. This technique has been known and used in image analysis and manipulation for years. In convolutional neural networks, though, the weights in the convolution matrices can be trained using error back-propagation. This way, instead of each pixel being an input, we can have one neuron reacting only to vertical lines, another to horizontal lines, or angles, just like in a cat’s visual cortex.
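The operation described above can be written out directly. A minimal sketch (not the network's actual code) with three classic 3x3 kernels:

```python
import numpy as np

# For each pixel, take its 3x3 neighborhood and compute a weighted sum.
# Edge pixels reuse their nearest neighbors ("edge" padding).
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2,) * 2, (kw // 2,) * 2), mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

blur = np.ones((3, 3)) / 9.0                 # equal weights -> blurry image
sharpen = np.array([[ 0, -1,  0],            # negative nearest-neighbor
                    [-1,  5, -1],            # weights -> sharper image
                    [ 0, -1,  0]], dtype=float)
vertical = np.array([[-1, 0, 1],             # responds to vertical edges
                     [-1, 0, 1],
                     [-1, 0, 1]], dtype=float)
```

A convolutional layer learns kernels like these instead of having them hand-picked.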
A layer of a convolutional neural network consists of a number of such image-transforming neurons, each emphasizing a different aspect of the image. The image then becomes split into a number of feature channels: instead of the initial RGB, we get one channel per kernel. That is hardly a solution to the input size problem; we use pooling layers for that. A pooling layer takes non-overlapping square neighborhoods of pixels, finds the highest value in each, and returns that as the value of the neighborhood. Note that if we did this on the original image, we’d just get a badly pixelated, somewhat brighter miniature. But when a pooling layer’s input comes from a convolutional layer, its response means “this feature occurs somewhere in this area”, which is actually useful information. Pooling layers also make the network less sensitive to where in the image the features are.
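Max pooling is simple enough to show in a few lines (an illustrative sketch, assuming 2x2 non-overlapping neighborhoods):

```python
import numpy as np

# Replace each non-overlapping 2x2 block by its highest value, halving the
# resolution while keeping "this feature occurs somewhere in this area".
def max_pool(channel, size=2):
    h, w = channel.shape
    h, w = h - h % size, w - w % size        # drop any ragged edge rows/cols
    blocks = channel[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```

On a 4x4 input this returns a 2x2 output: one maximum per quadrant.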
“Park or Bird” problem by XKCD
Still, knowing that there are vertical or horizontal lines, gradients or edges doesn’t help us detect whether the photo contains a bird (see the famous “park or bird” problem, finally solved by Flickr). But we have made a step in the right direction, so why not take the next step: stack another set of convolutional neurons on top, and a pooling layer on top of that, creating a deep learning neural network. It turns out that, with enough layers, a network “gets” quite complex features. A face is, after all, a combination of eyes, nose and mouth, with a chance of ears and hair. We can then use these complex features as input for a regular feed-forward neural network and train it to return a category: bird, dog, building, electric guitar, school bus or pagoda.
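The stacking itself can be sketched end to end. This is a structural illustration only: the kernels and readout weights below are random placeholders, where a real network would learn them by back-propagation:

```python
import numpy as np

# Sketch of the stack: convolution -> pooling -> convolution -> pooling,
# then a plain feed-forward readout over the resulting "complex features".
def conv(img, k):
    p = k.shape[0] // 2
    pad = np.pad(img, p, mode="edge")
    return np.array([[np.sum(pad[i:i + k.shape[0], j:j + k.shape[1]] * k)
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def pool(img, s=2):
    h, w = img.shape[0] - img.shape[0] % s, img.shape[1] - img.shape[1] % s
    return img[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((16, 16))                  # a tiny stand-in "photo"
k1, k2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

x = pool(np.maximum(conv(image, k1), 0))      # layer 1: lines and edges
x = pool(np.maximum(conv(x, k2), 0))          # layer 2: combinations of them
features = x.ravel()                          # 4x4 map -> 16 feature values
W = rng.normal(size=(5, features.size))       # feed-forward readout:
scores = W @ features                         # one score per category
```

Each extra conv+pool pair shrinks the map and widens the receptive field, which is why deeper layers can respond to eyes, noses, and whole faces.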
In 2012, a convolutional neural network was trained on YouTube videos and allowed to freely self-organize categories for what it could see. It self-organized a category of cats (see, it all goes back to cats again). In 2014, convolutional networks, working together with recurrent neural networks trained on full sentences, learned to describe images in full sentences. Recurrent neural networks, though, are a topic for another time.
Source: Deep Fragment Embeddings for Bidirectional Image-Sentence Mapping by Andrej Karpathy
Dreaming deep
Since 2012, deep learning neural networks have been winning image analysis competitions, reaching near-human accuracy in labeling images. Initially, there was some resistance from the computer vision community. One complaint was that the networks were winning but, well, not showing their work. The leading methods at the time, when detecting a face, for example, would give exact positions of eyes and mouth and return various proportions of the purported face. A deep neural network would just detect faces with uncanny accuracy and not tell us how. While the inner workings of the first convolutional layer were well understood (lines, edges, gradients), it was impossible to look into how the second layer used the information given by the first. The layers above that were a mystery.
Researchers have been successful in fooling these deep neural networks by generating images with an evolutionary algorithm. Make a random noise image and check the network’s response. If the network thinks there’s a chance of a bus in the image, generate more images like it (offspring) and feed those to the network again. Whichever image increases the network’s certainty wins and gets to contribute to the next generation. What still looks like random noise to us will be interpreted as a bus by the network. If we also optimize the generated images to have the statistical properties of real samples, the noise turns into shapes and textures that we can identify. Such generated images tell us, for example, that stripes are a key feature of bees, that bananas are usually yellow, and that anemone fish have orange-white-black stripes.
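The loop just described can be sketched in a few lines. The real work queries a trained network for its class confidence; here a stand-in scoring function plays that part (it simply rewards images whose mean brightness approaches an arbitrary target), so only the evolutionary structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def network_confidence(img):
    # Placeholder for the trained network's "this is a bus" score.
    return -abs(img.mean() - 0.75)

# Start from random noise images and evolve toward higher confidence.
population = [rng.random((8, 8)) for _ in range(20)]
for generation in range(100):
    winners = sorted(population, key=network_confidence, reverse=True)[:5]
    population = [p + rng.normal(0, 0.05, p.shape)    # mutated offspring
                  for p in winners for _ in range(4)]

best = max(population, key=network_confidence)        # still noise to us
```

Against a real classifier, `best` would be an image the network calls a bus with high confidence while remaining meaningless to a human.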
Another approach is to feed a real image to the network, pick a layer, apply the winning transformation to each part of the image, and feed the result back to the network. For the first layer, as expected, the network will enhance the leading line directions in the image, giving it an impressionistic style of wavy brush strokes, dots or swirls. If we do that with a higher layer (reversing the process through the lower layers), we get an image painted with textures: fur, wood, feathers, scales, bricks, grass, waves, spaghetti or meatballs. Yet higher layers will turn any vertical lines into legs and draw eyes and noses on shapes that vaguely resemble faces (people do that too, by the way; it’s called pareidolia). Since there are a lot of animal photos in the training set, the generated images often contain animal parts, sometimes gruesomely distorted. Animal faces are superimposed on human faces, spiky leaves become bird beaks and everything seems a bit hairy. Our brains struggle to make sense of the network’s dreams. The images are disturbing because they remind us of something, but at the same time, not exactly. This makes them interesting, maybe unsettling, and viral on the Internet.
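This feedback loop is, at its core, gradient ascent on a layer's response. A miniature sketch (illustrative, not Google's code): a single random linear map stands in for "the network up to the chosen layer", so the gradient of the response has a closed form and no autograd library is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))                 # stand-in for the chosen layer
image = rng.normal(size=64) * 0.01            # start from a faint image
start_response = np.linalg.norm(W @ image)

for step in range(100):
    activations = W @ image                   # forward pass up to the layer
    # objective = 0.5 * ||activations||^2; its gradient w.r.t. the image:
    grad = W.T @ activations
    image += 0.01 * grad / (np.abs(grad).max() + 1e-8)  # normalized ascent step

final_response = np.linalg.norm(W @ image)    # the layer now responds strongly
```

In the real thing, `W` is a deep stack of convolutions and the gradient is obtained by back-propagating the layer's activations down to the pixels, which is what "reversing the process through the lower layers" refers to.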
What does that say about the history of art, if a lower layer paints a Van Gogh or Seurat picture, while a higher level layer reminds us of Picasso or Dali?
Left: Original photo by Zachi Evenor. Right: processed by Günther Noack, Software Engineer
Going deeper
Things get really interesting if we directly suggest to the network what it should see, by triggering the final classification layer. I will leave the explanation of how that trigger is passed back through the network to those who have done it. Instead of starting with a random image, it helps to take a real image, blur it a bit, zoom in, and let the network “enhance” it by drawing what it sees. When we repeat the process, we get a potentially infinite zooming sequence of bits and pieces of the network’s sample data.
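The blur-zoom-enhance loop itself is easy to sketch. Here a toy contrast boost stands in for the network's "enhance" step, since only the loop structure is being illustrated:

```python
import numpy as np

def blur(img):
    # Average each interior pixel with its 4 neighbors: a mild blur.
    out = img.copy()
    out[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] +
                       img[1:-1, :-2] + img[1:-1, 2:] + img[1:-1, 1:-1]) / 5
    return out

def zoom(img, crop=4):
    inner = img[crop:-crop, crop:-crop]       # crop the center...
    idx = (np.arange(img.shape[0]) * inner.shape[0]) // img.shape[0]
    return inner[np.ix_(idx, idx)]            # ...and resize it back up

def enhance(img):
    # Placeholder for the network pass that "draws what it sees".
    return np.clip((img - img.mean()) * 1.2 + img.mean(), 0.0, 1.0)

frame = np.random.default_rng(0).random((64, 64))
frames = [frame]
for _ in range(10):
    frame = enhance(zoom(blur(frame)))
    frames.append(frame)                      # a potentially endless sequence
```

Played back in order, `frames` is the endlessly-zooming dream sequence: each iteration dives deeper into whatever the "enhance" step drew in the previous one.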
This is how the Large Scale Deep Neural Network (LSD-NN) is able to hallucinate like that in real time. It was made by Jonas Degrave and team, known as 317070 on GitHub. Interestingly, this one was built before Google published the DeepDream code. 317070’s network hallucinates in a Twitch stream, where users can shout categories from the ImageNet database and the network will do its best to produce images that remind it of what the user suggested. The network doesn’t exactly draw the requested objects, but it gets the essence. When users ask for volcanoes, there’s smoke and lava. When users shout for pizza, there’s melting cheese. You can see sausages in a butcher shop and spiders on spiderwebs, but most of the time it’s a mesmerizing, colorful soup of textures and shapes. Really. Try it.
It may be called fun or nightmarish, but we learn a lot from making networks dream. We have found the equivalent of the first convolutional layer in cat brains. We have a model which confirms the theory that dreaming helps us remember (networks can be trained on their own generated sample data). We basically simulated pareidolia, perhaps we can infer some information about the mechanism behind schizophrenia.
As with chess and, more recently, Jeopardy, machines have crossed yet another threshold and taken over something we used to be better at. Remember when captchas were a good way to stop crawlers and bots from stealing your online data? Not anymore; now they are just a test of the bot’s computational ability. The viral images under the #deepdream hashtag are a sign that machine vision is becoming mainstream, and it’s time to accept that dreaming of electric sheep (or dogs) is just something androids do.