File name: Statistical Learning Theory: A Tutorial
  Category: Machine Learning
  Development tool:
  File size: 402kb
  Downloads: 0
  Upload date: 2019-03-23
  Provider: yanghu******
Detailed description: In this article, we provide a tutorial overview of some aspects of statistical learning theory, which also goes by other names such as statistical pattern recognition, nonparametric classification and estimation, and supervised learning. We focus on the problem of two-class pattern classification. For example, if the features are the pixel values of an N×N color image, there are d = 3N² features. For even a modest size image, the dimension of the feature space can be quite large.

In most applications, the class to which an object belongs is not uniquely and definitively determined by its feature vector. There are some fundamental reasons for this. First, although it would be nice if the measured features captured all the properties of the object important for classification, this usually is not the case. Second, depending on the application and the specific measurements, the feature values may be noisy. For these reasons, a statistical formulation of the pattern recognition problem is very useful.

We assume that there are prior probabilities P(0) and P(1) of observing an object from each of the two classes. The feature vector x of the object is related to the class to which the object belongs. This is described through conditional probability densities p(x|0) and p(x|1). Equivalently, let y ∈ {0, 1} be the label associated with x (i.e., the actual class to which the object belongs). Then we can think of the pair (x, y) as being drawn from a joint distribution P(x, y). When we observe a feature vector x, our problem is to decide either 0 or 1. So a decision rule (or classification rule) can be thought of as a mapping c: R^d → {0, 1}, where c(x) indicates the decision when feature vector x is observed. Equivalently, we can think of a decision rule as a partition of R^d into two sets Ω₀ and Ω₁ that correspond to those feature vectors that get classified as 0 and 1, respectively.
For technical/mathematical reasons we do not consider the set of all possible subsets of R^d as potential decision rules, but rather restrict attention to decision rules that are measurable. Out of the enormous collection of all possible decision rules, we would like to select one that performs well. We make a correct decision if our decision c(x) is equal to the label y. A natural success criterion is the probability that we make a correct decision. We would like a decision rule that maximizes this probability, or equivalently, that minimizes the probability of error. For a decision rule c, the probability of error of c, denoted R(c), is given by

    R(c) = P(c(x) ≠ y) = P(0) P(Ω₁|0) + P(1) P(Ω₀|1)

In the ideal (and unusual) case where the underlying probabilistic structure is known, the solution to the classification problem is well known and is a basic result from statistics. On the assumption that the underlying probabilities P(0), P(1) and densities p(x|0), p(x|1) are all known, we can compute the so-called posterior probabilities P(0|x) and P(1|x) using Bayes' theorem. Specifically, the unconditional density p(x) is given by

    p(x) = P(0) p(x|0) + P(1) p(x|1)

and the posterior probabilities P(0|x) and P(1|x) are then given by

    P(0|x) = P(0) p(x|0) / p(x)        P(1|x) = P(1) p(x|1) / p(x)

Once we have the posterior probabilities, the best decision is simply to choose the class with the larger conditional probability: decide 0 if P(0|x) > P(1|x) and decide 1 if P(1|x) > P(0|x). If P(0|x) = P(1|x), then it doesn't matter what we decide. This classification rule is called Bayes decision rule, or sometimes simply Bayes rule.

Bayes decision rule is optimal in the sense that no other decision rule has a smaller probability of error. The error rate of Bayes decision rule, denoted by R*, is given by

    R* = ∫ min(P(0|x), P(1|x)) p(x) dx

Clearly 0 ≤ R* ≤ 1/2, but we can't say more without additional assumptions. Finding Bayes decision rule requires knowledge of the underlying distributions, but typically in applications these distributions are not known.
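The ideal-case computation above can be sketched in a few lines of code. This is a minimal illustration, not part of the original text: the priors and the one-dimensional Gaussian class-conditional densities (and their parameters) are hypothetical choices made only so the posteriors can be computed concretely.

```python
import math

P0, P1 = 0.6, 0.4  # priors P(0), P(1) -- illustrative values

def gaussian_pdf(x, mean, std):
    # Hypothetical class-conditional density p(x|class), chosen for the sketch
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posteriors(x):
    like0 = gaussian_pdf(x, mean=0.0, std=1.0)   # p(x|0)
    like1 = gaussian_pdf(x, mean=2.0, std=1.0)   # p(x|1)
    px = P0 * like0 + P1 * like1                 # p(x) = P(0)p(x|0) + P(1)p(x|1)
    return P0 * like0 / px, P1 * like1 / px      # P(0|x), P(1|x) by Bayes' theorem

def bayes_decide(x):
    # Bayes decision rule: choose the class with the larger posterior
    p0x, p1x = posteriors(x)
    return 0 if p0x >= p1x else 1

print(bayes_decide(-1.0))  # near the class-0 mean → 0
print(bayes_decide(3.0))   # near the class-1 mean → 1
```

Note that the decision depends only on which posterior is larger, so the common factor p(x) could be dropped; it is kept here to mirror the formulas in the text.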
In fact, in many applications, even the form of, or approximations to, the distributions are unknown. In this case we try to overcome this lack of knowledge by resorting to labeled examples. That is, we assume that we have access to examples (x₁, y₁), ..., (xₙ, yₙ) that are independent and identically distributed according to the unknown distribution P(x, y). Using these examples, we want to come up with a decision rule to classify a new feature vector x.

The term "supervised" learning arises from the fact that we assume we have access to training examples (x₁, y₁), ..., (xₙ, yₙ) that are properly labeled by a "supervisor" or "teacher". This contrasts with "unsupervised learning", in which many examples of objects are available but the class to which each object belongs is unknown. Other formulations such as semi-supervised learning, reinforcement learning, and related problems have also been widely studied, but in this paper we focus exclusively on the case of supervised learning (and specifically the case of two-class pattern classification). In the following sections, we describe a number of approaches and results for this learning problem.

3 Nearest Neighbor Rules

Perhaps the simplest decision rule one might come up with is to find in the training data the feature vector xᵢ that is closest to x, and then decide that x belongs to the same class as given by the label yᵢ. This decision rule is called the "nearest neighbor rule" (or NN rule, or 1-NN rule). Associated with each xᵢ is a region (called the Voronoi region) consisting of all points that are closer to xᵢ than to any other xⱼ. The NN rule simply classifies a feature vector x according to the label associated with the Voronoi region to which x belongs. Figure 1 illustrates Voronoi regions in two dimensions. Of course, in general, the feature vectors xᵢ are in a high-dimensional space.

Figure 1: Voronoi Regions

To measure the performance of the NN rule,
we consider the expected performance with respect to the new instance to be classified as well as the training examples. Let Rₙ denote the expected error rate of the NN rule after n training examples, and let R∞ denote the limit of Rₙ as n tends to infinity. It can be shown that

    R* ≤ R∞ ≤ 2R*(1 − R*)

A simpler (but looser) bound is R* ≤ R∞ ≤ 2R*.

A very useful idea is to let the number of neighbors used grow with n (the amount of data we have). That is, we can let k be a function of n, so that we use a kₙ-NN rule. We need kₙ → ∞ so that we use more and more neighbors as the amount of training data increases. But we should make sure that kₙ/n → 0, so that asymptotically the number of neighbors we use is a negligible fraction of the total amount of data. This will ensure that we use neighbors that get closer and closer to the observed feature vector x. For example, we might let kₙ = √n to satisfy both conditions.

It turns out that with any such kₙ (such that kₙ → ∞ and kₙ/n → 0 are satisfied), we get Rₙ → R* as n → ∞. That is, in the limit as the amount of training data grows, the performance of the kₙ-NN rule approaches that of the optimal Bayes decision rule. What is surprising about this result is that by observing data, but without knowing anything about the underlying distributions, asymptotically we can do as well as if we knew the underlying distributions completely. And this works without assuming the underlying distributions take on any particular form or satisfy any stringent conditions. In this sense, the kₙ-NN rule is called universally consistent, and is an example of truly nonparametric learning in that the underlying distributions can be arbitrary and we need no knowledge of their form. It was not known until the early 1970s whether universally consistent rules existed, and it was quite surprising when the kₙ-NN rule, along with some others, was shown to be universally consistent.
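The kₙ-NN rule just described can be sketched as follows. The training points and the Euclidean metric are illustrative assumptions; nothing here is prescribed by the text beyond the majority vote among the kₙ closest examples and the choice kₙ = √n.

```python
import math
from collections import Counter

def knn_classify(x, data, k):
    """Majority vote among the k training points closest to x.
    data is a list of (feature_vector, label) pairs."""
    neighbors = sorted(data, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set (hypothetical points, for illustration only)
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
         ((1.0, 1.0), 1), ((1.1, 0.9), 1), ((0.9, 1.1), 1)]

n = len(train)
k_n = max(1, int(math.sqrt(n)))   # k_n = sqrt(n): k_n -> inf and k_n/n -> 0
print(knn_classify((0.15, 0.1), train, k_n))  # → 0
print(knn_classify((1.05, 1.0), train, k_n))  # → 1
```

With k = 1 this is exactly the nearest neighbor (1-NN) rule from the previous section; growing k with n is what yields universal consistency.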
A number of such decision rules are known today, as we will discuss further in subsequent sections. However, universal consistency is not the end of the story. A critical issue is that of convergence rates. Many results are known on the convergence rates of the nearest neighbor rule and other rules. A fairly generic problem is that, except under rather stringent conditions, the rate of convergence for most methods is very slow in high-dimensional spaces. This is a facet of the so-called "curse of dimensionality". In many real applications the dimension can be extremely large, which bodes ill for many methods. Furthermore, it can be shown that there are no "universal" rates of convergence. That is, for any method, one can always find distributions for which the convergence rate is arbitrarily slow. This is related to so-called "No Free Lunch Theorems" that formalize the statement that there is no one method that can universally beat out all other methods. These results make the field continue to be exciting, and make the design of good learning algorithms and the understanding of their performance an important science and art. In the following sections, we will discuss some other methods useful in practice.

4 Kernel Rules

Rather than fixing the number of neighbors, we might consider fixing a distance h and taking a majority vote of the labels yᵢ corresponding to all examples xᵢ that fall within a distance h of x. If none of the xᵢ fall within distance h, or if there is a tie in the majority vote, we need some way to decide in these cases. One way to do this is simply to break ties randomly. As with nearest neighbor rules, this rule classifies a feature vector x ∈ R^d according to a majority vote among the labels of the training points xᵢ in the vicinity of x. However, while the nearest neighbor rule classifies x based on a specified number kₙ of training examples that are closest to x,
this rule considers all xᵢ that are within a fixed distance h of x. The rule we have been discussing is the simplest example of a very general class of rules called kernel rules. One way to precisely describe this rule is using a "kernel" (or "window function") defined by

    K(x) = 1 if ‖x‖ ≤ 1, and 0 otherwise

Figure 2: Basic Window Kernel

Define vote counts v₀(x) and v₁(x) as

    v₀(x) = Σᵢ I{yᵢ = 0} K((x − xᵢ)/h)
    v₁(x) = Σᵢ I{yᵢ = 1} K((x − xᵢ)/h)

The decision rule is then to decide class 0 if v₀(x) > v₁(x), decide class 1 if v₁(x) > v₀(x), and break ties randomly.

Writing the rule in this way naturally suggests that we might pick a different kernel function K(·). For example, it makes sense that training examples very close to x should have more influence (or a larger weight) in determining the classification of x than those which are farther away. Such a function is usually nonnegative and is often monotonically decreasing along rays starting from the origin. In addition to the window kernel, some other popular choices for kernel functions include:

    Triangular kernel: K(x) = (1 − ‖x‖) I{‖x‖ ≤ 1}
    Gaussian kernel: K(x) = e^(−‖x‖²)
    Cauchy kernel: K(x) = 1/(1 + ‖x‖^(d+1))
    Epanechnikov kernel: K(x) = (1 − ‖x‖²) I{‖x‖ ≤ 1}

Figure 3: Triangular Kernel
Figure 4: Gaussian Kernel

The positive number h is called the smoothing factor, or bandwidth, and is an important parameter of a kernel rule. If h is small, the rule gives large relative weight to points near x, and the decision is very "local"; for a large h, many more points are considered with fairly large weight, but these points can be farther from x. Hence h determines the amount of smoothing. In choosing a value for h, one confronts a similar kind of tradeoff as in choosing the number of neighbors for a nearest neighbor rule.

As we might expect, to get universal consistency we need to let the smoothing factor depend on the amount of data, so we let h = hₙ. To make sure we get "locality" (i.e., so that the training examples used get closer to x), we need to have lim_{n→∞} hₙ = 0.
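The window-kernel voting rule defined earlier in this section can be sketched as follows. The training points and the bandwidth value are hypothetical, chosen only for illustration; the kernel is the basic window kernel from the definition, and any of the other kernels listed could be passed in its place.

```python
import math
import random

def window_kernel(u):
    # Basic window kernel: K(u) = 1 if ||u|| <= 1, else 0
    return 1.0 if math.sqrt(sum(c * c for c in u)) <= 1.0 else 0.0

def kernel_classify(x, data, h, K=window_kernel):
    """Kernel rule: compare kernel-weighted vote counts v0(x) and v1(x)."""
    v0 = sum(K(tuple((xc - xic) / h for xc, xic in zip(x, xi)))
             for xi, y in data if y == 0)
    v1 = sum(K(tuple((xc - xic) / h for xc, xic in zip(x, xi)))
             for xi, y in data if y == 1)
    if v0 > v1:
        return 0
    if v1 > v0:
        return 1
    return random.randint(0, 1)   # break ties randomly

# Hypothetical training points and bandwidth
train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(kernel_classify((0.1, 0.1), train, h=0.5))  # → 0
```

Swapping in, say, a Gaussian kernel only requires passing a different function K; the vote-counting structure is unchanged.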
To make sure that the number of training examples used grows, we need to have lim_{n→∞} n·hₙ^d = ∞. These two conditions (hₙ → 0 and n·hₙ^d → ∞) are analogous to the conditions imposed on kₙ to get universal consistency. In addition to those two conditions, to show universal consistency we need certain fairly mild regularity conditions on the kernel function K(·). In particular, we need K(·) to be nonnegative, and over a small neighborhood of the origin K(·) is required to be larger than some fixed positive number δ > 0. The last requirement is more technical, but as a special case it is enough to require that K(·) be non-increasing with distance from the origin and have finite volume under the kernel function.

Figure 5: Cauchy Kernel
Figure 6: Epanechnikov Kernel

5 Multilayer Perceptrons

Neural networks (or neural nets, or artificial neural networks) are collections of (usually simple) processing units, each of which is connected to many other units. With tens or hundreds of units and several times that many connections, a network rapidly can get complicated. Understanding the behavior of even small networks can be difficult, especially if there can be "feedback" loops in the connectivity structure of the network (i.e., the output of a neuron can serve as input to other neurons which, via still others, come back and serve as inputs to the original neuron).

We'll focus on a special class of networks called multilayer perceptrons, in which the units are organized in a multilayer feedforward network (Figure 7).

Figure 7: Feedforward Network

That is, the units are organized in layers, with the outputs of neurons in one layer serving as the inputs to the neurons in the next layer. Because there are no feedback loops, the behavior of the network is simple to compute. The first layer receives the external inputs, which will be the features of the objects we wish to classify. The output of each unit in one layer is passed as an input to the units in the next layer.
The outputs of the last layer (called the output layer) give the final result computed by the network for the given input. The particular classification rule implemented by a network is determined by the specifics of the network architecture and the computation done by each neuron. A crucial part of the network computation is a set of parameters called weights. There is usually a real-valued weight associated with each connection between units. The weights are generally considered to be adjustable, while the rest of the network is usually thought of as fixed. Thus, we think of the classification rule implemented by the network as being determined by the weights, and this classification rule can be altered if we change the weights.

To solve a given pattern recognition problem with a neural net, we need a set of weights that results in a good classification rule. As we discussed previously, it is difficult to directly specify good decision rules in most practical situations, and this is made even more difficult by the complexity of the computation performed by the neural net. So how should we go about selecting the weights of the network? This is where learning comes in. Suppose we start with some initial set of weights. It is unlikely that the weights we start with will result in a very good classification rule. But, as before, we will assume we have a set of labeled examples (training data). If we use this data to "train" the network to perform well on this data, then perhaps the network will also "generalize" and provide a decision rule that works well on new data. The success of neural networks in many practical problems is due to an efficient way to train a given network to perform well on a set of training examples, together with the fact that in many cases the resulting decision rule does in fact generalize well.

Figure 8: Perceptron

In a multilayer perceptron, each single unit is called a perceptron (depicted in Figure 8) and operates as follows.
The feature vector (x₁, x₂, ..., x_d) is presented as the input, so that input xᵢ is connected to the unit with an associated weight wᵢ. The output a of the perceptron is given by

    a = sign(x₁w₁ + ... + x_d w_d)

where

    sign(u) = −1 if u < 0, and 1 otherwise

In general, a perceptron can have a nonzero threshold, but often for convenience a threshold of zero is used.
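The single-unit computation can be written directly from the formula above. The weight values here are arbitrary, chosen only to exercise both branches of the sign function:

```python
def sign(u):
    # sign(u) = -1 if u < 0, and 1 otherwise (matching the definition above)
    return -1 if u < 0 else 1

def perceptron_output(x, w):
    """Single perceptron with zero threshold: a = sign(x1*w1 + ... + xd*wd)."""
    return sign(sum(xi * wi for xi, wi in zip(x, w)))

w = (0.5, -1.0, 0.25)                          # hypothetical weights
print(perceptron_output((1.0, 0.2, 2.0), w))   # 0.5 - 0.2 + 0.5 = 0.8 → 1
print(perceptron_output((0.0, 1.0, 0.0), w))   # -1.0 → -1
```

A multilayer perceptron simply feeds outputs like these forward as inputs to the perceptrons in the next layer.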