Brief History of Machine Learning

My subjective ML timeline

Since the early days of science and technology, scientists following Blaise Pascal and von Leibniz have pondered whether a machine could be as intellectually capable as a human. Famous writers like Jules Verne dreamed of such artificial beings.

Pascal’s calculating machine, performing addition and subtraction – 1642

Machine Learning is one of the most important branches of AI and a very hot subject in both research and industry. Companies and universities devote many resources to advancing the field. Recent advances have produced very solid results on a range of tasks, comparable to or better than human performance (e.g., 98.98% accuracy on traffic-sign recognition, above the human level).

Here I would like to share a crude timeline of Machine Learning and point out some of its milestones; it is by no means complete. In addition, you should prepend “up to my knowledge” to any claim in the text.

The first step toward prevalent ML was taken by Hebb in 1949, based on a neuropsychological learning formulation now called Hebbian Learning theory. Put simply, it strengthens the connections between correlated nodes of a network, such as a Recurrent Neural Network (RNN): commonalities in the network’s activity are memorized and later serve as a kind of memory. Formally, the argument states that:

Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability.… When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased. [1]
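Just to make the idea concrete (my own toy sketch, not anything from Hebb’s text; the data and learning rate are made up), the Hebbian update simply strengthens a connection whenever the two units it links are active together:

```python
import numpy as np

# Toy Hebbian learning: a weight grows in proportion to correlated activity.
rng = np.random.default_rng(0)
patterns = rng.choice([0.0, 1.0], size=(50, 4))  # 50 binary activity patterns over 4 units
eta = 0.1                                        # learning rate (arbitrary)

W = np.zeros((4, 4))                             # connection strengths between units
for x in patterns:
    W += eta * np.outer(x, x)                    # "cells that fire together wire together"
np.fill_diagonal(W, 0.0)                         # ignore self-connections

print(W)  # pairs of units that were often co-active end up with the largest weights
```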

Arthur Samuel

In 1952, Arthur Samuel at IBM developed a program that played Checkers. The program was able to observe board positions and learn an implicit model that suggested better moves in later games. Samuel played many games against the program and observed that it played better over time.

With that program Samuel refuted the common belief that machines cannot go beyond their written code and learn patterns the way human beings do. He coined the term “machine learning,” which he defined as:

a field of study that gives computers the ability to learn without being explicitly programmed.

F. Rosenblatt

In 1957, Rosenblatt’s Perceptron was proposed, again with a neuroscientific background, and it is more similar to today’s ML models. It was a very exciting discovery at the time and was practically more applicable than Hebb’s idea. Rosenblatt introduced the Perceptron with the following lines:

The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms.[2]

Three years later, Widrow [4] formulated the Delta Learning rule, which was then used as a practical procedure for Perceptron training. It is also known as the Least Squares problem. The combination of those two ideas creates a good linear classifier. However, the excitement around the Perceptron was dampened by Minsky [3] in 1969. He posed the famous XOR problem and showed the inability of Perceptrons to handle such linearly inseparable data distributions. It was Minsky’s blow to the NN community; thereafter, NN research would lie dormant until the 1980s.

The XOR problem, a data configuration that is not linearly separable
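To see the problem in code (my own illustrative sketch, not from the cited papers), a single perceptron trained with a delta-rule-style update learns the linearly separable AND function but never gets XOR right:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, eta=0.1):
    """Single-layer perceptron trained with a delta-rule-style update."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1.0 if xi @ w + b > 0 else 0.0
            err = yi - pred
            w += eta * err * xi              # move the weights along the error
            b += eta * err
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
and_y = np.array([0, 0, 0, 1], dtype=float)   # linearly separable
xor_y = np.array([0, 1, 1, 0], dtype=float)   # not linearly separable

for name, y in [("AND", and_y), ("XOR", xor_y)]:
    w, b = train_perceptron(X, y)
    preds = (X @ w + b > 0).astype(float)
    print(name, "accuracy:", (preds == y).mean())  # AND -> 1.0, XOR never exceeds 0.75
```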

There was not much further progress until the Multi-Layer Perceptron (MLP) was suggested by Werbos [6] in 1981, together with an NN-specific Backpropagation (BP) algorithm, although the BP idea had been proposed earlier by Linnainmaa [5] in 1970 under the name “reverse mode of automatic differentiation”. BP is still the key ingredient of today’s NN architectures. With these new ideas, NN research accelerated again. In 1985–1986 NN researchers successively presented the idea of the MLP with practical BP training (Rumelhart, Hinton, Williams [7]; Hecht-Nielsen [8]).

From Hecht-Nielsen [8]
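For contrast with the perceptron sketch above, here is a minimal two-layer MLP trained with plain backpropagation (again my own illustration, not the authors’ code; the layer size, learning rate, and iteration count are arbitrary) that does solve XOR:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of sigmoid units.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the squared-error gradient through both layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())  # approaches [0, 1, 1, 0] (other seeds may need more iterations)
```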

At the other end of the spectrum, a very well-known ML algorithm was proposed by J. R. Quinlan [9] in 1986: Decision Trees, more specifically the ID3 algorithm. This was the starting point of another mainstream branch of ML. Moreover, ID3 was also released as software, and it found more real-life use cases thanks to its simple rules and clear inference, in contrast to the still black-box NN models.

After ID3, many different alternatives and improvements have been explored by the community (e.g., ID4, Regression Trees, CART, …), and it is still one of the active topics in ML.

From Quinlan [9]
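ID3 grows the tree by repeatedly splitting on the attribute with the highest information gain. A small sketch of that core computation (the weather attributes and labels below are toy data of my own, not Quinlan’s):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: which attribute best predicts whether we play tennis?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast", "sunny", "rain"]
windy   = ["no", "yes", "no", "no", "yes", "yes", "no", "yes"]
play    = ["no", "no", "yes", "yes", "no", "yes", "yes", "yes"]

print("gain(outlook):", round(information_gain(outlook, play), 3))
print("gain(windy):  ", round(information_gain(windy, play), 3))
# ID3 splits on whichever attribute yields the larger gain, then recurses on each branch.
```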

One of the most important ML breakthroughs was Support Vector Machines (Networks) (SVM), proposed by Vapnik and Cortes [10] in 1995, with very strong theoretical standing and empirical results. That was the time when the ML community split into two camps of NN and SVM advocates. The competition was not easy for the NN side after the kernelized version of SVM appeared around 2000 (I was not able to find the first paper on the topic): SVM got the better of NN models on many of the tasks they had occupied before. In addition, SVM was able to exploit the profound knowledge of convex optimization, generalization margin theory, and kernels against NN models. It therefore received a large push from different disciplines, leading to very rapid theoretical and practical improvements.

From Vapnik and Cortes [10]
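As a short usage sketch (assuming scikit-learn, which the text does not mention; the data and parameters are toy choices of mine), the kernel trick lets an SVM separate data that defeats a linear model:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like toy data: the two classes are not linearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sign(X[:, 0]) != np.sign(X[:, 1])).astype(int)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # implicit non-linear feature map

print("linear kernel accuracy:", round(linear.score(X, y), 2))  # close to chance
print("RBF kernel accuracy:   ", round(rbf.score(X, y), 2))     # close to 1.0
```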

NN research took another hit from Hochreiter’s thesis [40] in 1991 and Hochreiter et al. [11] in 2001, which showed the vanishing gradient problem: as BP propagates errors backwards through saturated NN units, the gradients shrink toward zero. Simply put, beyond a certain point it becomes futile to keep training the units, so deep NNs were very hard to train well with BP.
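A tiny numerical illustration of the effect (my own sketch, not from the cited works): the backpropagated gradient is a product of sigmoid derivatives, so it shrinks geometrically with depth, and much faster when the units are saturated:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gradient_through_layers(depth, pre_activation):
    """Product of sigmoid derivatives accumulated through `depth` layers."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(pre_activation)
        grad *= s * (1 - s)        # the sigmoid derivative is at most 0.25
    return grad

for z in (0.0, 4.0):               # z = 4 means the unit is nearly saturated
    print(f"pre-activation {z}:",
          [f"{gradient_through_layers(d, z):.2e}" for d in (2, 5, 10)])
# The gradient vanishes with depth, and far more quickly for saturated units.
```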

A little earlier, another solid ML model had been proposed by Freund and Schapire in 1997: a boosted ensemble of weak classifiers called AdaBoost. This work later earned its authors the Gödel Prize. AdaBoost trains a set of weak classifiers, each easy to train on its own, by giving more weight to hard instances. The model is still the basis of many different tasks such as face recognition and detection. It is also a realization of PAC (Probably Approximately Correct) learning theory. In general, the so-called weak classifiers are chosen as simple decision stumps (single decision tree nodes). They introduced AdaBoost as:

The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting…[11]
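A compact sketch of the AdaBoost loop with decision stumps as weak learners (my own illustrative implementation on toy data, not the authors’ code):

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the single-feature threshold classifier with the lowest weighted error (labels in {-1, +1})."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=20):
    w = np.full(len(y), 1.0 / len(y))                       # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # this stump's vote strength
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)                      # up-weight the hard (misclassified) instances
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2, 1, -1)      # toy non-linear target
print("training accuracy:", (predict(adaboost(X, y), X) == y).mean())
```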

Another ensemble model was explored by Breiman [12] in 2001: an ensemble of multiple decision trees, where each tree is grown on a random subset of the instances and each node is split using a random subset of the features. Owing to this nature, it is called Random Forests (RF). RF also has theoretical and empirical guarantees of endurance against over-fitting. While AdaBoost shows weakness against over-fitting and outlier instances in the data, RF is more robust against these caveats. (For more detail about RF, refer to my old post.) RF has also shown its success in many different tasks, such as Kaggle competitions.

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. [12]
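A usage sketch (assuming scikit-learn, which the text does not mention; the data set and parameters are illustrative) showing the two sources of randomness Breiman describes, bootstrapped samples and random feature subsets at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree is grown on a bootstrap sample of the instances
    oob_score=True,        # out-of-bag estimate of the generalization error
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag accuracy:", round(rf.oob_score_, 3))
print("cross-val accuracy: ", round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```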

As we come closer to today, a new era of NN called Deep Learning has commenced. This phrase simply refers to NN models with many wide successive layers. The third rise of NN began roughly in 2005 with the conjunction of many different discoveries, past and present, by recent mavens Hinton, LeCun, Bengio, Andrew Ng and other valuable older researchers. I list some of the important headings below (I guess I will dedicate a complete post to Deep Learning specifically); a small sketch combining a few of these ingredients follows the list:

  • GPU programming
  • Convolutional NNs [18][20][40]
    • Deconvolutional Networks [21]
  • Optimization algorithms
    • Stochastic Gradient Descent [19][22]
    • BFGS and L-BFGS [23]
    • Conjugate Gradient Descent [24]
    • Backpropagation [40][19]
  • Rectifier Units
  • Sparsity [15][16]
  • Dropout Nets [26]
    • Maxout Nets  [25]
  • Unsupervised NN models [14]
    • Deep Belief Networks [13]
    • Stacked Auto-Encoders [16][39]
    • Denoising NN models [17]
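Here is the promised sketch combining a few of these ingredients, convolutions, rectifier units, dropout, and SGD (written with PyTorch purely for illustration; the library choice, architecture, and hyper-parameters are my assumptions, not anything prescribed by the works above):

```python
import torch
import torch.nn as nn

# A small convolutional network for 28x28 grayscale images (MNIST-sized input).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional feature extractor
    nn.ReLU(),                                   # rectifier unit
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                             # dropout regularization
    nn.Linear(32 * 7 * 7, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random tensors standing in for a real mini-batch.
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()                                  # backpropagation through the whole stack
optimizer.step()
print("loss after one step:", loss.item())
```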

With the combination of all those ideas and others not listed, NN models are able to beat the state of the art at very different tasks such as Object Recognition, Speech Recognition, NLP, etc. However, it should be noted that this absolutely does not mean the end of the other ML streams. Even though Deep Learning success stories grow rapidly, there are many critiques directed at the training cost and the tuning of the exogenous parameters of these models. Moreover, SVM is still used more commonly owing to its simplicity. (So it is said, though it may cause a huge debate.)

Before finishing, I need to touch on one other relatively young ML trend. After the growth of the WWW and Social Media, a new term, BigData, emerged and wildly affected ML research. Because of the large-scale problems arising from BigData, many strong ML algorithms became useless for reasonable systems (not for the giant tech companies, of course). Hence, researchers came up with a new set of simple models, dubbed Bandit Algorithms [27–38] (more formally framed as Online Learning), that make learning easier and adaptable for large-scale problems.
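As one tiny member of this family (an illustrative sketch of my own, not a specific algorithm from the cited papers; the reward rates are made up), an epsilon-greedy multi-armed bandit learns online from one observation at a time with constant memory:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.02, 0.05, 0.03])  # hidden reward rate of each "arm" (toy values)
counts = np.zeros(3)
values = np.zeros(3)                       # running mean reward per arm
epsilon = 0.1

for t in range(10_000):
    # Explore with probability epsilon, otherwise exploit the best arm seen so far.
    arm = rng.integers(3) if rng.random() < epsilon else int(values.argmax())
    reward = float(rng.random() < true_rates[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print("estimated rates:", values.round(3), "-> best arm:", int(values.argmax()))
```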

I would like to conclude this infant sketch of ML history here. If you find something wrong (you should), insufficient, or unreferenced, please don’t hesitate to warn me in any manner.

References

[1] Hebb, D. O. The Organization of Behavior. New York: Wiley & Sons, 1949.

[2] Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.”  Psychological review 65.6 (1958): 386.

[3] Minsky, Marvin, and Seymour Papert. “Perceptrons.” (1969).

[4] Widrow, B., and Hoff, M. E. “Adaptive switching circuits.” (1960): 96–104.

[5] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor
expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.

[6] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th
IFIP Conference, 31.8 – 4.9, NYC, pages 762–770, 1981.

[7]  Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams.  Learning internal representations by error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE SCIENCE, 1985.

[8]  Hecht-Nielsen, Robert. “Theory of the backpropagation neural network.”  Neural Networks, 1989. IJCNN., International Joint Conference on. IEEE, 1989.

[9]  Quinlan, J. Ross. “Induction of decision trees.”  Machine learning 1.1 (1986): 81-106.

[10]  Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.”  Machine learning 20.3 (1995): 273-297.

[11]  Freund, Yoav, Robert Schapire, and N. Abe. “A short introduction to boosting.” Journal-Japanese Society For Artificial Intelligence 14.771-780 (1999): 1612.

[12]  Breiman, Leo. “Random forests.”  Machine learning 45.1 (2001): 5-32.

[13]  Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief nets.”  Neural computation 18.7 (2006): 1527-1554.

[14] Bengio, Lamblin, Popovici, Larochelle. “Greedy Layer-Wise Training of Deep Networks”, NIPS’2006.

[15] Ranzato, Poultney, Chopra, LeCun. “Efficient Learning of Sparse Representations with an Energy-Based Model”, NIPS’2006.

[16] Olshausen, B. A., and Field, D. J. “Sparse coding with an overcomplete basis set: a strategy employed by V1?” Vision Res. 1997;37(23):3311–25. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9425546.

[17] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. “Extracting and Composing Robust Features with Denoising Autoencoders”, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML‘08), pages 1096–1103, ACM, 2008.

[18]  Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

[19]  LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

[20] LeCun, Yann, and Yoshua Bengio. “Convolutional networks for images, speech, and time series.” The handbook of brain theory and neural networks 3361 (1995).

[21]  Zeiler, Matthew D., et al. “Deconvolutional networks.”  Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.

[22] S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In International Conference on Machine Learning (ICML ’06), 2006.

[23] Nocedal, J. (1980). “Updating Quasi-Newton Matrices with Limited Storage.” Mathematics of Computation 35 (151): 773–782. doi:10.1090/S0025-5718-1980-0572855-

[24] S. Yun and K.-C. Toh, “A coordinate gradient descent method for l1- regularized convex minimization,” Computational Optimizations and Applications, vol. 48, no. 2, pp. 273–307, 2011.

[25] Goodfellow, I., Warde-Farley, D., et al. “Maxout networks.” arXiv preprint arXiv:1302.4389, 2013. Available at: http://arxiv.org/abs/1302.4389. Accessed March 20, 2014.

[26] Wan L, Zeiler M. Regularization of neural networks using dropconnect. Proc …. 2013;(1). Available at: http://machinelearning.wustl.edu/mlpapers/papers/icml2013_wan13. Accessed March 13, 2014.

[27]  Alekh Agarwal ,  Olivier Chapelle ,  Miroslav Dudik ,  John Langford ,  A Reliable Effective Terascale Linear Learning System , 2011

[28]  M. Hoffman ,  D. Blei ,  F. Bach ,  Online Learning for Latent Dirichlet Allocation , in Neural Information Processing Systems (NIPS) 2010.

[29]  Alina Beygelzimer ,  Daniel Hsu ,  John Langford , and  Tong Zhang   Agnostic Active Learning Without Constraints  NIPS 2010.

[30]  John Duchi ,  Elad Hazan , and  Yoram Singer ,  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , JMLR 2011 & COLT 2010.

[31]  H. Brendan McMahan ,  Matthew Streeter ,  Adaptive Bound Optimization for Online Convex Optimization , COLT 2010.

[32]  Nikos Karampatziakis  and  John Langford ,  Importance Weight Aware Gradient Updates  UAI 2010.

[33]  Kilian Weinberger ,  Anirban Dasgupta ,  John Langford ,  Alex Smola ,  Josh Attenberg ,  Feature Hashing for Large Scale Multitask Learning , ICML 2009.

[34]  Qinfeng Shi ,  James Petterson ,  Gideon Dror ,  John Langford ,  Alex Smola , and  SVN Vishwanathan , Hash Kernels for Structured Data , AISTAT 2009.

[35]  John Langford ,  Lihong Li , and  Tong Zhang ,  Sparse Online Learning via Truncated Gradient , NIPS 2008.

[36]  Leon Bottou ,  Stochastic Gradient Descent , 2007.

[37]  Avrim Blum ,  Adam Kalai , and  John Langford   Beating the Holdout: Bounds for KFold and Progressive Cross-Validation . COLT99 pages 203-208.

[38]  Nocedal, J.  (1980). “Updating Quasi-Newton Matrices with Limited Storage”. Mathematics of Computation 35: 773–782.

[39] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.

[40] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
