File name: Algorithms for hyper-parameter optimization
  Category: Other
  Development tool:
  File size: 268 KB
  Downloads: 0
  Upload date: 2019-09-03
  Uploader: yangta******
Description: Algorithms for hyper-parameter optimization.pdf — a research paper describing the TPE procedure used in Bayesian hyper-parameter optimization.

The contribution of this work is two novel strategies for approximating f by modeling H: a hierarchical Gaussian Process and a tree-structured Parzen estimator. These are described in Section 3 and Section 4 respectively.

3 The Gaussian Process Approach (GP)

Gaussian Processes (GPs, [14]) have long been recognized as a good method for modeling loss functions in the model-based optimization literature [13]. GPs are priors over functions that are closed under sampling, which means that if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f given a sample H = {(x_i, f(x_i)) : 1 ≤ i ≤ n} of its values is also a GP, whose mean and covariance function are analytically derivable. GPs with generic mean functions can in principle be used, but it is simpler and sufficient for our purposes to consider only zero-mean processes. We do this by centering the function values in the considered data sets. Modelling e.g. linear trends in the GP mean leads to undesirable extrapolation in unexplored regions during SMBO [15].

The above-mentioned closedness property, along with the fact that GPs provide an assessment of prediction uncertainty incorporating the effect of data scarcity, makes the GP an elegant candidate for both finding a candidate x* (Figure 1, step 3) and fitting a model M_t (Figure 1, step 6). The runtime of each iteration of the GP approach scales cubically in |H| and linearly in the number of variables being optimized; however, the expense of the function evaluations f(x*) typically dominates even this cubic cost.

3.1 Optimizing EI in the GP

We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M in (1) is then the posterior GP given H. The EI function in (1) encapsulates a compromise between regions where the mean function is close to or better than y* and under-explored regions where the uncertainty is high.

EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin Hypercube search in higher dimensions. However, some information on the landscape of the EI criterion can be derived from simple computations [16]: 1) it is always non-negative and zero at training points from D, 2) it inherits the smoothness of the kernel k, which is in practice often at least once differentiable, and, notably, 3) the EI criterion is likely to be highly multi-modal, especially as the number of training points increases. The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget of EI evaluations. We borrow their approach here and go one step further. We keep the Estimation of Distribution (EDA, [17]) approach on the discrete part of our input space (categorical and discrete hyper-parameters), where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [18]) for the remaining part of our input space (continuous hyper-parameters). CMA-ES is a state-of-the-art gradient-free evolutionary algorithm for optimization on continuous domains, which has been shown to outperform the Gaussian search EDA. Notice that such a gradient-free approach allows non-differentiable kernels for the GP regression. We do not take on the use of mixtures in [16], but rather restart the local searches several times, starting from promising places. The use of tessellations suggested by [16] is prohibitive here, as our task often means working in more than 10 dimensions; thus we start each local search at the center of mass of a simplex with vertices randomly picked among the training points.
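As a concrete illustration of the EI-maximization step above, here is a minimal sketch, not the authors' implementation: it fits a GP surrogate with scikit-learn, evaluates the closed-form Expected Improvement for minimization, and maximizes it with a few randomly restarted local searches in place of the paper's EDA/CMA-ES combination. Function names, the Matern kernel choice, and the L-BFGS-B inner optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x, gp, y_best):
    """Closed-form EI for minimization under the GP posterior."""
    mu, sigma = gp.predict(np.atleast_2d(x), return_std=True)
    mu, sigma = mu[0], max(sigma[0], 1e-12)        # scalars; guard against zero std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next_point(X_obs, y_obs, bounds, n_restarts=20, rng=None):
    """One SMBO step: fit a GP on the history H, then maximize EI with restarts."""
    rng = np.random.default_rng(rng)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)                           # posterior GP given H
    y_best = float(np.min(y_obs))                  # y* = min f(x_i)
    best_x, best_ei = None, -np.inf
    for _ in range(n_restarts):                    # restarted local searches
        x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
        res = minimize(lambda x: -expected_improvement(x, gp, y_best),
                       x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_ei:
            best_x, best_ei = res.x, -res.fun
    return best_x

# Toy usage: sequentially minimize a 2-D quadratic starting from 5 random points.
if __name__ == "__main__":
    f = lambda x: float(np.sum((x - 0.3) ** 2))
    bounds = np.array([[0.0, 1.0], [0.0, 1.0]])
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(5, 2))
    y = np.array([f(x) for x in X])
    for _ in range(10):
        x_new = propose_next_point(X, y, bounds, rng=rng)
        X = np.vstack([X, x_new]); y = np.append(y, f(x_new))
    print("best x:", X[np.argmin(y)], "best y:", y.min())
```

The restarted-local-search loop is a stand-in for the paper's EDA/CMA-ES search and simplex-center restart heuristic; the GP fit and the EI formula are the parts this section actually describes.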
Finally, we remark that not all hyper-parameters are relevant for every point. For example, a DBN with only one hidden layer does not have parameters associated with a second or third layer. Thus it is not enough to place one GP over the entire space of hyper-parameters. We chose to group the hyper-parameters by common use in a tree-like fashion and place different independent GPs over each group. As an example, for DBNs, this means placing one GP over common hyper-parameters, including categorical parameters that indicate which conditional groups to consider, three GPs on the parameters corresponding to each of the three layers, and a few 1-dimensional GPs over individual conditional hyper-parameters, like ZCA energy (see Table 1 for DBN parameters).

4 Tree-structured Parzen Estimator Approach (TPE)

Anticipating that our hyper-parameter optimization tasks will mean high dimensions and small fitness evaluation budgets, we now turn to another modeling strategy and EI optimization scheme for the SMBO algorithm. Whereas the Gaussian-process based approach modeled p(y|x) directly, this strategy models p(x|y) and p(y).

Recall from the introduction that the configuration space X is described by a graph-structured generative process (e.g. first choose a number of DBN layers, then choose the parameters for each). The tree-structured Parzen estimator (TPE) models p(x|y) by transforming that generative process, replacing the distributions of the configuration prior with non-parametric densities. In the experimental section, we will see that the configuration space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. In these cases, the TPE algorithm makes the following replacements: uniform → truncated Gaussian mixture, log-uniform → exponentiated truncated Gaussian mixture, categorical → re-weighted categorical. Using different observations {x^(1), ..., x^(k)} in the non-parametric densities, these substitutions represent a learning algorithm that can produce a variety of densities over the configuration space X. The TPE defines p(x|y) using two such densities:

p(x|y) = ℓ(x)  if y < y*
         g(x)  if y ≥ y*

where ℓ(x) is the density formed by using the observations x^(i) whose corresponding loss f(x^(i)) was less than y*, and g(x) is the density formed by using the remaining observations. Whereas the GP-based approach favoured quite an aggressive y* (typically less than the best observed loss), the TPE algorithm depends on a y* that is larger than the best observed f(x), so that some points can be used to form ℓ(x). The TPE algorithm chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ.
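A minimal sketch of the TPE selection rule just described, for a single continuous hyper-parameter rather than the paper's full tree-structured configuration space: the history is split at the γ-quantile y* of the observed losses, ℓ(x) and g(x) are estimated with Parzen (kernel density) estimators, and the next candidate is the draw from ℓ that maximizes ℓ(x)/g(x). The use of scipy's gaussian_kde, the candidate-sampling scheme, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(x_obs, y_obs, gamma=0.15, n_candidates=64, rng=None):
    """Suggest the next value of one continuous hyper-parameter (sketch).

    Splits the history at the gamma-quantile y* of the losses, fits Parzen
    estimators l(x) (good points) and g(x) (the rest), and returns the
    candidate drawn from l that maximizes l(x) / g(x).
    """
    rng = np.random.default_rng(rng)
    x_obs, y_obs = np.asarray(x_obs, float), np.asarray(y_obs, float)

    y_star = np.quantile(y_obs, gamma)        # y*: gamma-quantile of observed losses
    good = x_obs[y_obs < y_star]              # observations with loss below y*
    bad = x_obs[y_obs >= y_star]              # remaining observations
    if len(good) < 2 or len(bad) < 2:         # too little data for the KDEs: explore
        return float(rng.uniform(x_obs.min(), x_obs.max()))

    l_density = gaussian_kde(good)            # Parzen estimator for l(x)
    g_density = gaussian_kde(bad)             # Parzen estimator for g(x)

    candidates = l_density.resample(n_candidates, seed=rng).ravel()
    scores = l_density(candidates) / np.maximum(g_density(candidates), 1e-12)
    return float(candidates[np.argmax(scores)])

# Toy usage: minimize f(x) = (x - 2)^2 over one hyper-parameter x in [-5, 5].
if __name__ == "__main__":
    f = lambda x: (x - 2.0) ** 2
    rng = np.random.default_rng(0)
    xs = list(rng.uniform(-5.0, 5.0, size=10))
    ys = [f(x) for x in xs]
    for _ in range(40):
        x_new = tpe_suggest(xs, ys, rng=rng)
        xs.append(x_new); ys.append(f(x_new))
    print("best x:", xs[int(np.argmin(ys))])
```

In the paper, EI under this model turns out to be maximized by points with a high ratio ℓ(x)/g(x), which is why the sketch ranks candidates by that ratio.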