### Workshop — Friday, June 24, 2016

#### Optimization Methods for the Next Generation of Machine Learning

**Abstract:** The future of optimization for machine learning, lies in the design of methods for nonconvex optimization problems, such as those arising through the use of deep neural networks. Nonconvex formulations lead to more powerful predictive models, but are much more complex in the sense that they result in much more challenging optimization problems. This workshop will bring together experts from the machine learning and optimization communities whose research focuses on the design of optimization methodologies that combine recent trends of optimization in machine learning—stochasticity, parallel and distributed computing, and second order information—but do so in nonconvex settings.

#### Organizers

Katya Scheinberg | Lehigh University |

Yoshua Bengio | University of Montreal |

Frank E. Curtis | Lehigh University |

Jorge Nocedal | Northwestern University |

#### Schedule

Below is an outline of the schedule. Please check back later for further details, which will be posted as they are developed. The poster spotlights will involve short presentations by a few selected poster contributors, which are to be determined.

**Slides are available! Click invited speaker’s name for link to slides.**

08:30–09:00 | Yoshua Bengio |

09:00–09:30 | Coralia Cartis |

09:30–10:00 | Poster Spotlights |

10:00–10:30 | Coffee Break |

10:30–11:00 | Elad Hazan |

11:00–11:30 | Josh Griffin |

11:30–02:00 | Lunch Break and Poster Session 1 |

02:00–02:30 | Leon Bottou |

02:30–03:00 | Raghu Pasupathy |

03:00–03:30 | Coffee Break |

03:30–04:00 | Ben Recht |

04:00–04:30 | Mark Schmidt |

04:30–06:00 | Poster Session 2 |

#### Speakers: Titles and Abstracts

Yoshua Bengio | University of Montreal |

Title: | From Curriculum Learning to Mollifying Networks |

Abstract: | Curriculum learning was originally motivated by the presumed difficulty due to local minima when training deep neural networks. More recently, several theoretical and empirical results seriously questioned the belief that local minima are actually the culprit. However, as deep learning research moved towards difficult tasks such as those involving some form of reasoning, it has become more and more clear that curriculum learning techniques could make a huge difference in our ability to train some of these networks. This suggests that there are some features of the optimization landscape that can make gradient-based optimization getting stuck around very poor solutions. This also motivated research in alternative ways of simplifying the optimization landscape in order to construct a continuation or annealing schedule gradually going from an easy to optimize objective to the target objective. We recently achieved impressive improvements on several deep architectures and tasks using different forms of noise injection. We call our latest such framework “mollifying networks”. As the amount of noise is decreased, the set of learnable functions goes from the identity to linear functions to gradually effectively deeper and more non-linear networks. Results on such noise-injected networks are presented on a range of tasks, starting from artificial ones like parity and Pentomino to language modelling, predicting the output of a short computer program, neural machine translation and caption generation. |

Leon Bottou | Facebook AI Research |

Title: | Three observations on diagonal second order methods applied to deep networks |

Abstract: | Using curvature information to speedup stochastic gradient algorithms often boils down to a contest between two constant factors: the computational overhead of the second order method versus the gain in the required number of iterations. Diagonal and quasi-diagonal approximations are attractive because one can hope to achieve a big part of the possible gains with very low computational overhead. We present a scheme where the computational overhead scales like the number of neurons rather than the number of weights. We then show that assuming that the curvature is diagonal implies that the function has a form that is not well suited for ordinary quasi-Newton method but is amenable to a Natural Gradient treatment. Empirical results confirm the value of this observation but also reveals new insight on batch normalization. This is joint work with Jean Lafond and Nicolas Vasilache. |

Coralia Cartis | University of Oxford |

Title: | Global rates of convergence of algorithms for nonconvex smooth optimization |

Abstract: | This talk addresses global rates of convergence and the worst-case evaluation complexity of methods for nonconvex optimization problems. We show that the classical steepest-descent and Newton’s methods for unconstrained nonconvex optimization under standard assumptions may both require a number of iterations and function evaluations arbitrarily close to the steepest-descent’s global worst-case complexity bound. This implies that the latter upper bound is essentially tight for steepest descent and that Newton’s method, as well as most methods we know and like such as line search and trust region, may be as slow as the steepest-descent method in the worst case. Then the cubic regularisation of Newton’s method (Griewank (1981); Nesterov & Polyak (2006)) is considered and shown to have improved – in fact, optimal – worst-case complexity amongst a wide class of second-order methods for nonconvex optimization. We end by discussing derivative-free, probabilistic variants of known methods, including of cubic regularisation, and show that the complexity bounds are preserved even if the local random models of the function are only occasionally accurate. This work is joint with Nick Gould (RAL, UK), Philippe Toint (Namur, Belgium) and Katya Scheinberg (Lehigh, USA). |

Joshua Griffin | SAS, Advanced Analytics R&D |

Title: | Unconventional iterative methods for nonconvex optimization in a matrix-free environment |

Abstract: | Recent work on Hessian-free methods for deep learning has sparked interest in iterative methods capable of handling the unique requirements of deep learning problems. It has been argued that classical iterative approaches are not viable for a number of compelling reasons to be high-lighted in the presentation along with additional comments based on hand on experience. This presentation will focus on lesser known iterative approaches that have similarly strong convergence properties as their classic counterparts, but in many respects are tailor made for the deep learning context. Further, we will highlight the fundamental (as well as lesser known) underlying properties of why iterative methods in general work, comparing and contrasting benefits of line-search versus trust-region approaches and how they are in fact, very similar in structure. Besides providing recommended approaches for both schools of thought, the information presented should be sufficient to readily build customized iterative approaches in house for specialized machine learning applications. |

Elad Hazan | Princeton University |

Title: | Second-order optimization for machine learning in linear time |

Abstract: | First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization due to their extremely efficient per-iteration computational cost. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. We will present a second-order stochastic method for optimization problems arising in machine learning that match the per-iteration cost of gradient descent, yet enjoy convergence properties of second-order optimization. This is joint work with Naman Agarwal and Brian Bullins |

Raghu Pasupathy | Purdue University |

Title: | Adaptive Sampling Recursions for Stochastic Optimization |

Abstract: | We consider the problem of continuous unconstrained optimization when unbiased estimates of the objective function, and possibly its derivative, are observable through a Monte Carlo simulation. The term simulation is broadly interpreted and could refer to the act of randomly drawing observations from a large database. We will argue that the “correct” method to solve such problems is to equip an established deterministic nonlinear programming method, e.g., line-search or trust-region optimization, with adaptively sampled estimators of the objects appearing within the search recursion. If designed correctly, such algorithms will generate iterates whose squared bias and variance are always in lock-step. We will present examples of such adaptively sampled recursions and focus on the detailed analysis of a simple version, proving that the famous Monte Carlo canonical rate (or the Robbins-Monro rate) is indeed achievable. Adaptive sampling, also known as sequential sampling in statistics, can be notoriously difficult to analyze in general. We will provide a fundamental theorem to aid the analysis of adaptively sampled optimization algorithms. |

Benjamin Recht | University of California, Berkeley |

Title: | What’s so hard about optimizing deep networks? |

Abstract: | When training large-scale deep neural networks for pattern recognition, hundreds of hours on clusters of GPUs are required to achieve state-of-the-art performance. Improved optimization algorithms could potentially enable faster industrial prototyping and make training contemporary models more accessible. In this talk, I will attempt to distill what exactly hinders such improvements. Using tools from nonlinear programming, approximation algorithms, convex optimization, and derivative free methods, I will explore several case studies that may elucidate the sources of difficulty. I will emphasize how many of the popularized notions of what make these optimization problems “hard” may not be impediments to performance. |

Mark Schmidt | University of British Columbia |

Title: | Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Lojasiewicz Condition |

Abstract: | In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Lojasiewicz inequality proposed in the same year, and it does not require strong-convexity or even convexity. In this work, we show that this much-older Polyak-Lojasiewicz (PL) inequality is actually weaker than the five main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of coordinate descent and stochastic gradient for many non-strongly-convex (and some non-convex) functions. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization. Along the way, we give new convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, a variation on resilient backpropagation, support vector machines, LASSO problems, and stochastic variance-reduced gradient methods. |

#### Poster Session

##### Titles (Spotlights, with Presenters)

Shashank Reddy | Fast Incremental Method for Nonconvex Optimization |

Brian Bullins | Second Order Stochastic Optimization in Linear Time |

Afrooz Jalilzadeh | A Variable Sample-Size Stochastic Approximation Scheme (VSSA) : Rate analysis and Complexity Trade-offs |

Jennifer Erway | Solving L-SR1 trust-region subproblems for nonconvex optimization |

Levent Sagun | Universality in halting time and its applications in optimization |

##### Titles

On an ADMM framework for the resolution of complementarity formulations of the l0-norm minimization problem: Preliminary work |

AIDE: Fast and Communication Efficient Distributed Optimization |

Optimizing binary autoencoders using auxiliary coordinates, with application to learning binary hashing |

Empirical Investigation on Second-order Integrators for Non-Convex Stochastic Optimization |

Faster Asynchronous SGD |

Parallel SGD: When does averaging help |

Large Scale Distributed Hessian-Free Optimization for Deep Neural Network |

Convergence Rate Analysis of a Stochastic Trust Region Method for Nonconvex Optimization |

A Trust Region Method with Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization |