ON THE ORIGIN OF DEEP LEARNING

Table 1: Major milestones that will be covered in this paper

Year | Contributor | Contribution
300 BC | Aristotle | introduced Associationism, started the history of humans' attempts to understand the brain
1873 | Alexander Bain | introduced neural groupings as the earliest models of neural networks, inspired the Hebbian Learning Rule
1943 | McCulloch & Pitts | introduced the MCP model, which is considered the ancestor of artificial neural models
1949 | Donald Hebb | considered the father of neural networks, introduced the Hebbian Learning Rule, which lays the foundation of modern neural networks
1958 | Frank Rosenblatt | introduced the first perceptron, which highly resembles the modern perceptron
1974 | Paul Werbos | introduced backpropagation
1980 | Teuvo Kohonen | introduced the Self Organizing Map
1980 | Kunihiko Fukushima | introduced the Neocognitron, which inspired the convolutional neural network
1982 | John Hopfield | introduced the Hopfield Network
1985 | Hinton & Sejnowski | introduced the Boltzmann Machine
1986 | Paul Smolensky | introduced Harmonium, which is later known as the Restricted Boltzmann Machine
1986 | Michael I. Jordan | defined and introduced the recurrent neural network
1990 | Yann LeCun | introduced LeNet, showed the possibility of deep neural networks in practice
1997 | Schuster & Paliwal | introduced the Bidirectional Recurrent Neural Network
1997 | Hochreiter & Schmidhuber | introduced LSTM, solved the problem of vanishing gradients in recurrent neural networks
2006 | Geoffrey Hinton | introduced Deep Belief Networks, also introduced the layer-wise pretraining technique, opened the current deep learning era
2009 | Salakhutdinov & Hinton | introduced Deep Boltzmann Machines
2012 | Geoffrey Hinton | introduced Dropout, an efficient way of training neural networks

...elaborate well enough on each of them. On the other hand, our paper aims to provide the background for readers to understand how these models were developed. Therefore, we emphasize the milestones and elaborate on those ideas to help build associations between them. In addition to the paths of classical deep learning models in (Schmidhuber, 2015), we also discuss recent deep learning work that builds on classical linear models. Another article that readers could read as a complement is (Anderson and Rosenfeld, 2000), where the authors conducted extensive interviews with well-known scientific leaders of the 90s on the topic of the history of neural networks.

2. From Aristotle to Modern Artificial Neural Networks

The study of deep learning and artificial neural networks originates from our ambition to build a computer system that simulates the human brain. Building such a system requires an understanding of the functionality of our cognitive system. Therefore, this paper traces all the way back to the origins of attempts to understand the brain and starts the discussion with Aristotle's Associationism around 300 B.C.

2.1 Associationism

"When, therefore, we accomplish an act of reminiscence, we pass through a certain series of precursive movements, until we arrive at a movement on which the one we are in quest of is habitually consequent. Hence, too, it is that we hunt through the mental train, excogitating from the present or some other, and from similar or contrary or coadjacent. Through this process reminiscence takes place.
For the movements are, in these cases, sometimes at the same time, sometimes parts of the same whole, so that the subsequent movement is already more than half accomplished."

This remarkable paragraph of Aristotle's is seen as the starting point of Associationism (Burnham, 1888). Associationism is a theory stating that the mind is a set of conceptual elements that are organized as associations between these elements. Inspired by Plato, Aristotle examined the processes of remembrance and recall and brought up four laws of association (Boeree, 2000):

• Contiguity: Things or events with spatial or temporal proximity tend to be associated in the mind.
• Frequency: The number of occurrences of two events is proportional to the strength of the association between these two events.
• Similarity: Thought of one event tends to trigger the thought of a similar event.
• Contrast: Thought of one event tends to trigger the thought of an opposite event.

Back then, Aristotle described the implementation of these laws in our mind as common sense. For example, the feel, the smell, or the taste of an apple should naturally lead to the concept of an apple, as common sense. Nowadays, it is surprising to see that these laws, proposed more than 2000 years ago, still serve as fundamental assumptions of machine learning methods. For example, samples that are near each other (under a defined distance) are clustered into one group; explanatory variables that frequently occur with response variables draw more attention from the model; similar/dissimilar data are usually represented with more similar/dissimilar embeddings in latent space.

Contemporaneously, similar laws were also proposed by Zeno of Citium, Epicurus, and St. Augustine of Hippo. The theory of associationism was later strengthened by a variety of philosophers and psychologists. Thomas Hobbes (1588-1679) stated that complex experiences were associations of simple experiences, which were in turn associations of sensations. He also believed that association exists by means of coherence and frequency as its strength factor. Meanwhile, John Locke (1632-1704) introduced the concept of "association of ideas". He separated the concepts of ideas of sensation and ideas of reflection, and he stated that complex ideas could be derived from a combination of these two kinds of simple ideas. David Hume (1711-1776) later reduced Aristotle's four laws to three: resemblance (similarity), contiguity, and cause and effect. He believed that whatever coherence the world seemed to have was a matter of these three laws. Dugald Stewart (1753-1828) extended these three laws with several other principles, among them an obvious one: accidental coincidence in the sounds of words. Thomas Reid (1710-1796) believed that no original quality of mind, other than habit, was required to explain the spontaneous recurrence of thinking. James Mill (1773-1836) emphasized the law of frequency as the key to learning, which is very similar to later stages of research.

Figure 1: Illustration of neural groupings in (Bain, 1873)

David Hartley (1705-1757), as a physician, was regarded as the one who made associationism popular (Hartley, 2013). In addition to the existing laws, he proposed the argument that memory could be conceived as smaller-scale vibrations in the same regions of the brain as the original sensory experience. These vibrations can link up to represent complex ideas and therefore act as a material basis for the stream of consciousness.
This idea potentially inspired the Hebbian Learning Rule, which will be discussed later in this paper and lays the foundation of neural networks.

2.2 Bain and Neural Groupings

Besides David Hartley, Alexander Bain (1818-1903) also contributed to the fundamental ideas of the Hebbian Learning Rule (Wilkes and Wade, 1997). In his book, Bain (1873) related the processes of associative memory to the distribution of activity of neural groupings (a term he used to denote neural networks at the time). He proposed a constructive mode of storage capable of assembling what was required, in contrast to the alternative, traditional mode of storage with prestored memories.

To further illustrate his ideas, Bain first described the computational flexibility that allows a neural grouping to function when multiple associations are to be stored. With a few hypotheses, Bain managed to describe a structure that highly resembles the neural networks of today: an individual cell summarizes the stimulation from other selected, linked cells within a grouping, as shown in Figure 1. The joint stimulation from a and b triggers X, stimulation from b and c triggers Y, and stimulation from a and c triggers Z. In his original illustration, a, b, and c stand for stimulations, and X, Y, and Z are the outcomes of cells.

With the establishment of how this associative structure of neural groupings can function as memory, Bain proceeded to describe the construction of these structures. He followed the directions of associationism and stated that relevant impressions of neural groupings must be made in temporal contiguity for a period, either on one occasion or on repeated occasions.

Further, Bain described the computational properties of neural groupings: connections are strengthened or weakened through experience via changes of the intervening cell-substance. Therefore, the induction of these circuits would be selected as comparatively strong or weak.

As we will see in the following section, Hebb's postulate highly resembles Bain's description, although nowadays we usually label this postulate as Hebb's rather than Bain's, according to (Wilkes and Wade, 1997). This omission of Bain's contribution may also be due to Bain's lack of confidence in his own theory: eventually, Bain was not convinced himself and doubted the practical value of neural groupings.

2.3 Hebbian Learning Rule

The Hebbian Learning Rule is named after Donald O. Hebb (1904-1985), since it was introduced in his work The Organization of Behavior (Hebb, 1949). Hebb is also seen as the father of neural networks because of this work (Didier and Bigand, 2011).

In 1949, Hebb stated the famous rule "Cells that fire together, wire together", which emphasized the activation behavior of co-fired cells. More specifically, in his book he stated:

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

This archaic paragraph can be rewritten in modern machine learning language as follows:

$\Delta w_i = \eta x_i y$    (1)

where $\Delta w_i$ stands for the change of the synaptic weight $w_i$ of Neuron i, whose input signal is $x_i$; $y$ denotes the postsynaptic response and $\eta$ denotes the learning rate. In other words, the Hebbian Learning Rule states that the connection between two units should be strengthened as the frequency of co-occurrences of these two units increases.
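To make Equation 1 concrete, here is a minimal sketch of the Hebbian update in Python (not from the paper; the single linear neuron, the toy input pattern, and the learning rate of 0.1 are assumptions made for the example):

```python
import numpy as np

def hebbian_update(w, x, eta=0.1):
    """One Hebbian step for a single linear neuron: delta_w_i = eta * x_i * y."""
    y = np.dot(w, x)          # postsynaptic response of the linear neuron
    return w + eta * x * y    # strengthen weights in proportion to co-activation

# Toy usage: repeatedly presenting the same pattern keeps growing the weights,
# which already hints at the instability discussed in the next paragraph.
w = np.array([0.1, 0.2, 0.0])
x = np.array([1.0, 0.5, -0.3])
for _ in range(5):
    w = hebbian_update(w, x)
    print(np.linalg.norm(w))  # the weight norm grows at every step
```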
Although the Hebbian Learning Rule is seen as laying the foundation of neural networks, its drawbacks are obvious when seen today: as co-occurrences keep appearing, the weights of the connections keep increasing, and the weights of a dominant signal will increase exponentially. This is known as the unstableness of the Hebbian Learning Rule (Principe et al., 1999). Fortunately, these problems did not affect Hebb's standing as the father of neural networks.

2.4 Oja's Rule and Principal Component Analyzer

Erkki Oja extended the Hebbian Learning Rule to avoid the unstableness property, and he also showed that a neuron following this updating rule approximates the behavior of a Principal Component Analyzer (PCA) (Oja, 1982). Long story short, Oja introduced a normalization term to rescue the Hebbian Learning Rule, and he further showed that his learning rule is simply an online update of a Principal Component Analyzer. We present the details of this argument in the following paragraphs.

Starting from Equation 1 and following the same notation, Oja wrote the Hebbian update as

$w_i^{t+1} = w_i^t + \eta x_i y$

where $t$ denotes the iteration. A straightforward way to avoid the explosion of the weights is to apply normalization at the end of each iteration, yielding

$w_i^{t+1} = \dfrac{w_i^t + \eta x_i y}{\left(\sum_{j=1}^{n} (w_j^t + \eta x_j y)^2\right)^{1/2}}$

where $n$ denotes the number of neurons. The above equation can be further expanded into the following form:

$w_i^{t+1} = \dfrac{w_i^t}{Z} + \eta y \left(\dfrac{x_i}{Z} - \dfrac{y\, w_i^t}{Z^{3}}\right) + O(\eta^2)$

where $Z = \left(\sum_{j=1}^{n} (w_j^t)^2\right)^{1/2}$. Further, two more assumptions are introduced: 1) $\eta$ is small, so $O(\eta^2)$ is approximately 0; 2) the weights are normalized, so $Z = \left(\sum_{j} (w_j^t)^2\right)^{1/2} = 1$. When these two assumptions are introduced back into the previous equation, Oja's rule is obtained:

$w_i^{t+1} = w_i^t + \eta y (x_i - y\, w_i^t)$    (2)

Oja took a step further to show that a neuron updated with this rule effectively performs Principal Component Analysis on the data. To show this, Oja first rewrote Equation 2 in the following form, under two additional assumptions (Oja, 1982):

$\dfrac{d\, w(t)}{d t} = C\, w(t) - \left(w(t)^{\top} C\, w(t)\right) w(t)$

where $C$ is the covariance matrix of the input $X$. He then proceeded to show this property using several conclusions from another work of his (Oja and Karhunen, 1985) and linked it back to PCA with the fact that the components found by PCA are eigenvectors, and the first component is the eigenvector corresponding to the largest eigenvalue of the covariance matrix. Intuitively, we can interpret this property with a simpler explanation: the eigenvectors of $C$ are the solution when we maximize the updating rule. Since the $w_i$ are the eigenvectors of the covariance matrix of $X$, the $w_i$ are the principal components.
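As an aside (not part of the paper's argument), the following sketch illustrates numerically that Equation 2 keeps the weight norm bounded and steers the weight vector toward the first principal direction; the two-dimensional correlated Gaussian data, the learning rate, and the sample size are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data whose first principal direction is roughly [1, 1] / sqrt(2)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 2.0], [2.0, 3.0]],
                            size=5000)

def oja_update(w, x, eta=0.01):
    """Oja's rule (Equation 2): w <- w + eta * y * (x - y * w)."""
    y = np.dot(w, x)
    return w + eta * y * (x - y * w)

w = rng.normal(size=2)
for x in X:
    w = oja_update(w, x)

print("weight norm:", np.linalg.norm(w))             # stays close to 1
print("learned direction:", w / np.linalg.norm(w))   # close to +/- [0.707, 0.707]
print("first eigenvector:", np.linalg.eigh(np.cov(X.T))[1][:, -1])
```

Unlike the plain Hebbian update, the subtractive term eta * y^2 * w keeps the norm of w from exploding, which is exactly the role of the normalization Oja introduced.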
Oja's learning rule concludes our story of the learning rules of the early-stage neural networks. Now we proceed to visit the ideas on neural models.

2.5 MCP Neural Model

While Donald Hebb is seen as the father of neural networks, the first model of the neuron can be traced back to six years before the publication of the Hebbian Learning Rule, when the neurophysiologist Warren McCulloch and the mathematician Walter Pitts speculated about the inner workings of neurons and modeled a primitive neural network with electrical circuits in their findings (McCulloch and Pitts, 1943). Their model, known as the MCP neural model, is a linear step function applied to a weighted linear combination of the inputs, and can be described as

$y = \begin{cases} 1, & \text{if } \sum_i x_i w_i \ge \theta \text{ and } z_j = 0,\ \forall j \\ 0, & \text{otherwise} \end{cases}$

where $y$ stands for the output, $x_i$ stands for the input signals, $w_i$ stands for the corresponding weights, $z_j$ stands for the inhibitory inputs, and $\theta$ stands for the threshold. The function is designed in such a way that the activity of any inhibitory input completely prevents excitation of the neuron at any time.

Despite the resemblance between the MCP neural model and the modern perceptron, they are still distinctly different in many aspects:

• The MCP neural model was initially built as electrical circuits. Later we will see that the study of neural networks has borrowed many ideas from the field of electrical circuits.
• The weights $w_i$ of the MCP neural model are fixed, in contrast to the adjustable weights of the modern perceptron. All the weights must be assigned by manual calculation.
• The idea of an inhibitory input is quite unconventional even seen today. It might be an idea worth further study in modern deep learning research.
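As a small illustration of the MCP unit (not from the original paper; the AND-gate weights, the threshold, and the single inhibitory line are assumptions chosen for the example), the step function with absolute inhibition can be sketched as:

```python
def mcp_neuron(x, w, z, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the threshold
    theta and no inhibitory input z_j is active; any active z_j vetoes firing."""
    if any(z):                                       # absolute inhibition
        return 0
    s = sum(wi * xi for wi, xi in zip(w, x))         # weighted sum of excitatory inputs
    return 1 if s >= theta else 0

# Toy usage: fixed, hand-assigned weights implement a logical AND of two inputs.
w, theta = [1, 1], 2
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mcp_neuron([a, b], w, z=[0], theta=theta))
# An active inhibitory input suppresses the output regardless of the excitation:
print(mcp_neuron([1, 1], w, z=[1], theta=theta))     # -> 0
```

The hand-assigned weights echo the second point above: there is no learning in this model, only manual wiring.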
2.6 Perceptron

With the success of the MCP neural model, Frank Rosenblatt further substantialized the Hebbian Learning Rule with the introduction of perceptrons (Rosenblatt, 1958). While theorists like Hebb were focusing on the biological system in the natural environment, Rosenblatt constructed an electronic device named the Perceptron, which was shown to have the ability to learn in accordance with associationism.

Rosenblatt (1958) introduced the perceptron in the context of the vision system, as shown in Figure 2(a). He introduced the rules of the organization of a perceptron as follows:

• Stimuli impact on a retina of sensory units, which respond in such a manner that the pulse amplitude or frequency is proportional to the stimulus intensity.
• Impulses are transmitted to the Projection Area (A_I). This projection area is optional.
• Impulses are then transmitted to the Association Area through random connections. If the sum of impulse intensities is equal to or greater than the threshold (θ) of this unit, then this unit fires.
• Response units work in the same fashion as those intermediate units.

Figure 2: Perceptrons: (a) a recreated illustration of the organization of the perceptron as in (Rosenblatt, 1958); (b) a typical perceptron nowadays, when A_I (the Projection Area) is omitted.

Figure 2(a) illustrates his explanation of the perceptron. From left to right, the four units are the sensory unit, projection unit, association unit, and response unit, respectively. The projection unit receives the information from the sensory unit and passes it on to the association unit. This unit is often omitted in other descriptions of similar models. With the omission of the projection unit, the structure resembles the structure of the perceptron in today's neural networks (as shown in Figure 2(b)): sensory units collect data, association units linearly add these data with different weights and apply a non-linear transform to the thresholded sum, then pass the results to response units.

One distinction between the early-stage neuron models and modern perceptrons is the introduction of non-linear activation functions (we use the sigmoid function as an example in Figure 2(b)). This originates from the argument that the linear threshold function should be softened to simulate biological neural networks (Bose et al., 1996), as well as from the consideration of computational feasibility when replacing the step function with a continuous one (Mitchell et al., 1997).

After Rosenblatt's introduction of the Perceptron, Widrow et al. (1960) introduced a follow-up model called ADALINE. However, the difference between Rosenblatt's Perceptron and ADALINE lies mainly in the algorithmic aspect. As the primary focus of this paper is neural network models, we skip the discussion of ADALINE.

2.7 Perceptron's Linear Representation Power

A perceptron is fundamentally a linear function of the input signals; therefore, it is limited to representing linear decision boundaries, such as those of the logical operations NOT, AND, and OR, but not XOR, where a more sophisticated decision boundary is required. This limitation was highlighted by Minsky and Papert (1969), who attacked the limitations of perceptrons by emphasizing that perceptrons cannot solve functions like XOR or NXOR. As a result, very little research was done in this area until about the 1980s.
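To see this limitation concretely, the following sketch (not from the paper) trains a threshold perceptron with the classic error-driven update; the learning rate, the number of epochs, and the truth-table data are assumptions chosen for the example. The unit fits AND perfectly but can never reproduce XOR:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    """Threshold perceptron with the error-driven update
    w <- w + eta * (target - prediction) * x, with the bias folded into w."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append a constant bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(Xb, y):
            pred = 1 if np.dot(w, xi) >= 0 else 0
            w += eta * (ti - pred) * xi
    return (Xb @ w >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print("AND:", train_perceptron(X, np.array([0, 0, 0, 1])))  # matches [0 0 0 1]
print("XOR:", train_perceptron(X, np.array([0, 1, 1, 0])))  # never matches [0 1 1 0]
```

Since the four XOR points are not linearly separable, no choice of weights and threshold can fit them, no matter how long the training runs; stacking such units into multiple layers, trained with backpropagation, is what eventually removed this limitation.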