Introduction to Deep Neural Networks

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, to which all real-world data, be it images, sound, text or time series, must be translated.

Neural networks cluster and classify. You can think of them as a clustering and classification layer on top of data you store and manage. They group unlabeled data according to similarities among the example inputs, and they classify data when they have a labeled training set to work with.

As you think about a problem deep learning can solve, ask yourself: What categories do I care about? What can I act on? Those are labels: spam or not spam, good guy or bad guy, angry customer or happy customer. Then ask: Do I have data to accompany those labels?

For example, if you want to identify a group of people at risk for cancer, your training set would be a list of cancer patients along with all the data associated with their unique IDs, which could include everything from explicit features like age and smoking habits to raw data such as time series that track their motion, or logs of their behavior online, which likely indicate a great deal about lifestyle, habits and interests.

Searching for potential dating partners, future major-league superstars, a company’s most promising employees, or potential bad actors involves much the same process of constructing a training set by amassing vital stats, social graphs, raw text communications, click streams, etc.

Neural Network Elements

Deep learning is a name for a certain set of stacked neural networks composed of several node layers. A node is a place where computation happens, loosely patterned on the human neuron and firing when sufficient stimuli pass through. It combines input from the data with a set of coefficients, or weights, that either amplify or mute that input. These input-weight products are summed and the sum is passed through a node’s so-called activation function.

Here’s a diagram of what one node might look like.
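A minimal numerical sketch of the same computation may also help; the input values, weights, bias, and the choice of a sigmoid activation below are illustrative assumptions, not values taken from the text.

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.7, 0.1, 0.4])    # raw input signals from the data
    w = np.array([0.9, -0.3, 0.5])   # weights that amplify or mute each input
    b = 0.1                          # bias term

    # sum the input-weight products, then pass the sum through the activation
    output = sigmoid(np.dot(x, w) + b)
    print(output)                    # roughly 0.71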

A node layer is a row of those neuronlike switches that turn on or off as the input is fed through the net. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer receiving your data.
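To see how layers chain together, here is a hedged sketch of a tiny feedforward pass; the layer sizes and the random weights are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # illustrative layer sizes: 4 inputs -> 3 hidden nodes -> 2 output nodes
    sizes = [4, 3, 2]
    weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

    x = rng.normal(size=4)      # the initial input layer receives the raw data
    for W in weights:
        x = sigmoid(x @ W)      # each layer's output becomes the next layer's input
    print(x)                    # activations of the final layer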

Pairing adjustable weights with input features is how we assign significance to those features with regard to how the network classifies and clusters input.

Key Concepts of Deep Neural Networks

Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data passes in a multistep process of pattern recognition.

Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning.

In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.

This is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction. It makes deep-learning networks capable of handling very large, high-dimensional data sets with billions of parameters passing through nonlinear functions.

Above all, these nets are capable of discovering latent structures within unlabeled, unstructured data, which is the vast majority of data in the world. Another word for unstructured data is simply raw media; i.e. pictures, texts, video and audio recordings. Therefore, one of the problems deep learning solves best is processing and clustering the world’s raw, unlabeled media, discerning similarities and anomalies in data that no human has organized in a relational database or put a name to.

For example, deep learning can take a million images, and cluster them according to their similarities: cats in one corner, ice breakers in another, and in a third all the photos of your grandmother. This is the basis of so-called smart photo albums.

Now apply that same idea to other data types: Deep learning might cluster raw text such as emails or news articles. Emails full of angry complaints might cluster in one corner of the vector space, while satisfied customers, or spambot messages, might cluster in others. This is the basis of various messaging filters, and can be used in customer-relationship management (CRM). The same applies to voice messages.

With time series, data might cluster around normal/healthy behavior and anomalous/dangerous behavior. If the time series data is being generated by a smartphone, it will provide insight into users’ health and habits; if it is being generated by an auto part, it might be used to prevent catastrophic breakdowns.

Deep-learning networks perform automatic feature extraction without human intervention, unlike most traditional machine-learning algorithms. Given that feature extraction is a task that can take teams of data scientists years to accomplish, deep learning is a way to circumvent the chokepoint of limited experts. It augments the powers of small data science teams, which by their nature do not scale.

When training on unlabeled data, each node layer in a deep network learns features automatically by repeatedly trying to reconstruct the input from which it draws its samples, attempting to minimize the difference between the network’s guesses and the probability distribution of the input data itself. Restricted Boltzmann machines, for example, create so-called reconstructions in this manner.
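The sketch below is not an RBM; it substitutes a simpler autoencoder-style reconstruction (an assumption made to keep the example short) just to show what minimizing the gap between the network’s guess and its own input looks like in practice.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 8))        # 100 unlabeled examples, 8 features each

    # encode the input into 3 numbers, then try to rebuild the original 8
    W_enc = rng.normal(scale=0.1, size=(8, 3))
    W_dec = rng.normal(scale=0.1, size=(3, 8))
    lr = 0.01

    for step in range(500):
        H = X @ W_enc                    # compressed representation
        X_hat = H @ W_dec                # the network's guess at its own input
        err = X_hat - X                  # reconstruction error to be minimized
        grad_dec = H.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    print(np.mean(err ** 2))             # smaller than when the weights were random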

In the process, these networks learn to recognize correlations between certain relevant features and optimal results – they draw connections between feature signals and what those features represent, whether that is a full reconstruction of the input or, with labeled data, the correct label.

A deep-learning network trained on labeled data can then be applied to unstructured data, giving it access to much more input than machine-learning nets. This is a recipe for higher performance: the more data a net can train on, the more accurate it is likely to be. (Bad algorithms trained on lots of data can outperform good algorithms trained on very little.) Deep learning’s ability to process and learn from huge quantities of unlabeled data gives it a distinct advantage over previous algorithms.

Deep-learning networks end in an output layer: a logistic, or softmax, classifier that assigns a likelihood to a particular outcome or label. We call that predictive, but it is predictive in a broad sense. Given raw data in the form of an image, a deep-learning network may decide, for example, that the input data is 90 percent likely to represent a person.
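For instance, a softmax output layer turns whatever raw scores the last hidden layer produces into likelihoods that sum to one; the three labels and scores below are made-up values for illustration.

    import numpy as np

    def softmax(scores):
        # subtract the max for numerical stability, then normalize to sum to 1
        e = np.exp(scores - np.max(scores))
        return e / e.sum()

    # hypothetical raw scores from the last hidden layer for three labels
    scores = np.array([2.2, 0.3, -1.0])
    probs = softmax(scores)
    print(dict(zip(["person", "cat", "frying pan"], probs.round(2))))
    # e.g. {'person': 0.84, 'cat': 0.13, 'frying pan': 0.03}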

Example: Feedforward Networks

Our goal in using a neural net is to arrive at the point of least error as fast as possible. We are running a race. The starting line for the race is the state in which our weights are initialized, and the finish line is the state of those parameters when they are capable of producing accurate classifications and predictions.

The race itself involves many steps, and each of those steps resembles the others. Just like a runner, we will engage in a repetitive act over and over to arrive at the finish. Each step for a neural network involves a slight update in its weights, an incremental adjustment to their quantities.

A collection of weights, whether they are in their start or end state, is also called a model, because it is an attempt to model data’s relationship to ground-truth labels, to grasp its structure. Models normally start out bad and end up less bad, changing over time as the neural network updates its parameters.

This is because a neural network is born in ignorance. It does not know which weights and biases will translate the input best to make the correct guesses. It has to start out with a guess, and then try to make better guesses sequentially as it learns from its mistakes. (You can think of a neural network as a miniature enactment of the scientific method, testing hypotheses and trying again – only it is the scientific method with a blindfold on.)

Here is a simple explanation of what happens when a neural network learns (more precisely, we’ll discuss a feedforward neural net, the simplest architecture to explain.)

Input enters the network. The coefficients, or weights, map that input to a set of guesses the network makes at the end.

input * weight = guess

Weighted input results in a guess about what that input is. The network then takes its guess and compares it to a ground truth about the data, effectively asking an expert “Did I get this right?”

ground truth - guess = error

The difference between the network’s guess and the ground truth is its error. The network measures that error, and walks the error back over its model, adjusting weights to the extent that they contributed to the error.

error * weight's contribution to error = adjustment

The three pseudo-mathematical formulas above account for the three key functions of neural networks: scoring input, calculating loss and applying an update to the model – to begin the three-step process over again. A neural network is a corrective feedback loop, rewarding weights that support its correct guesses, and punishing weights that lead it to err.
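Here is the same three-step loop written out for a single weight; the toy data, the target rule, and the learning rate are assumptions chosen only to make the loop visible, not a description of any particular library.

    # a one-weight model fit to a made-up rule: the truth is always 2 * input
    inputs = [1.0, 2.0, 3.0]
    truths = [2.0, 4.0, 6.0]

    weight = 0.1      # the model is born in ignorance
    lr = 0.05         # learning rate: how large each incremental adjustment is

    for epoch in range(100):
        for x, truth in zip(inputs, truths):
            guess = x * weight             # input * weight = guess
            error = truth - guess          # ground truth - guess = error
            adjustment = lr * error * x    # scale the error by the input's contribution
            weight += adjustment           # reward or punish the weight accordingly

    print(weight)     # converges toward 2.0, the weight that produces the least error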

One commonly used optimization function that adjusts weights according to the error they caused is called “gradient descent.”

Gradient is another word for slope, and slope, in its typical form on an x-y graph, represents how two variables relate to each other: rise over run, the change in money over the change in time, etc. In this particular case, the slope we care about describes the relationship between the network’s error and a single weight; that is, how does the error vary as the weight is adjusted?

To put a finer point on it, which weight will produce the least error? Which one correctly represents the signals contained in the input data, and translates them to a correct classification? Which one can hear “nose” in an input image, and know that it should be labeled as a face and not a frying pan?

As a neural network learns, it slowly adjusts many weights so that they can map signal to meaning correctly. The relationship between network Error and each of those weights is a derivative, dE/dw, that measures the degree to which a slight change in a weight causes a slight change in the error.

Each weight is just one factor in a deep network that involves many transforms; the signal of the weight passes through activations and sums over several layers, so we use the chain rule of calculus to march back through the network’s activations and outputs and finally arrive at the weight in question, and its relationship to overall error.

The chain rule in calculus states that when a variable z depends on y, and y in turn depends on x, the derivatives multiply:

dz/dx = dz/dy * dy/dx

In a feedforward network, the relationship between the net’s error and a single weight will look something like this:

dError/dweight = dError/dactivation * dactivation/dweight

That is, given two variables, Error and weight, that are mediated by a third variable, activation, through which the weight is passed, you can calculate how a change in weight affects a change in Error by first calculating how a change in activation affects a change in Error, and how a change in weight affects a change in activation.

The essence of learning in deep learning is nothing more than that: adjusting a model’s weights in response to the error it produces, until you can’t reduce the error any more.
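As a sanity check on that chain-rule expression, the sketch below computes dE/dw analytically for one weight feeding a sigmoid activation and compares it against a numerical slope; the squared-error loss and the specific input, weight, and target values are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, w, target = 1.5, 0.8, 1.0            # one input, one weight, one ground-truth label

    def error(w):
        a = sigmoid(x * w)                   # activation sits between the weight and the Error
        return 0.5 * (target - a) ** 2       # squared-error loss

    # analytic chain rule: dE/dw = dE/da * da/dw
    a = sigmoid(x * w)
    dE_da = -(target - a)
    da_dw = a * (1 - a) * x
    analytic = dE_da * da_dw

    # numerical slope for comparison
    eps = 1e-6
    numerical = (error(w + eps) - error(w - eps)) / (2 * eps)
    print(analytic, numerical)               # the two values agree closely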

Logistic Regression

On a deep neural network of many layers, the final layer has a particular role. When dealing with labeled input, the output layer classifies each example, applying the most likely label. Each node on the output layer represents one label, and that node turns on or off according to the strength of the signal it receives from the previous layer’s input and parameters.

Each output node produces two possible outcomes, the binary output values 0 or 1, because an input variable either deserves a label or it does not. After all, there is no such thing as a little pregnant.

While neural networks working with labeled data produce binary output, the input they receive is often continuous. That is, the signals that the network receives as input will span a range of values and include any number of metrics, depending on the problem it seeks to solve.

For example, a recommendation engine has to make a binary decision about whether to serve an ad or not. But the input it bases its decision on could include how much a customer has spent on Amazon in the last week, or how often that customer visits the site.

So the output layer has to condense signals such as $67.59 spent on diapers, and 15 visits to a website, into a range between 0 and 1; i.e. a probability that a given input should be labeled or not.

The mechanism we use to convert continuous signals into binary output is called logistic regression. The name is unfortunate, since logistic regression is used for classification rather than regression in the linear sense that most people are familiar with. It calculates the probability that a set of inputs match the label.

Let’s examine this little formula:

F(x) = 1 / (1 + e^(-x))

For continuous inputs to be expressed as probabilities, they must output positive results, since there is no such thing as a negative probability. That’s why you see input as the exponent of e in the denominator – because exponents force our results to be greater than zero. Now consider the relationship of e’s exponent to the fraction 1/1. One, as we know, is the ceiling of a probability, beyond which our results can’t go without being absurd. (We’re 120% sure of that.)

As the input x that triggers a label grows, the expression e^(-x) shrinks toward zero, leaving us with the fraction 1/1, or 100%, which means we approach (without ever quite reaching) absolute certainty that the label applies. Input that correlates negatively with your output will have its value flipped by the negative sign on e’s exponent, and as that negative signal grows, the quantity e^(-x) becomes larger, pushing the entire fraction ever closer to zero.

With this layer, we can set a threshold above which an example is labeled 1, and below which it is not. You can set different thresholds as you prefer – a low threshold will increase the number of false positives, and a high one will increase the number of false negatives – depending on which side you would like to err.
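A minimal sketch of that squashing-and-thresholding step, assuming made-up weights, a bias, and a 0.5 cutoff:

    import numpy as np

    def logistic(z):
        # squashes any continuous signal into a value between 0 and 1
        return 1.0 / (1.0 + np.exp(-z))

    # continuous input signals: dollars spent last week, visits to the site
    x = np.array([67.59, 15.0])
    # hypothetical learned weights and bias (kept small so the dollar figure doesn't dominate)
    w = np.array([0.02, 0.1])
    b = -2.0

    prob = logistic(np.dot(x, w) + b)
    threshold = 0.5          # raise or lower this to trade false positives for false negatives
    label = 1 if prob > threshold else 0
    print(prob, label)       # roughly 0.70 -> label 1, so serve the ad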

Neural Networks & Artificial Intelligence

In some circles, neural networks are thought of as “brute force” AI, because they start with a blank slate and hammer their way through to an accurate model. They are effective, but to some eyes inefficient in their approach to modeling, which can’t make assumptions about functional dependencies between output and input. That said, gradient descent is not recombining every weight with every other to find the best match – its method of pathfinding shrinks the relevant weight space, and therefore the number of updates, by many orders of magnitude.

Introductory Resources

For people just getting started with deep learning, introductory tutorials and videos on feedforward networks provide an easy entrance to the fundamental ideas.
