Speakers
Day 1 - Wednesday, October 20th
I'll discuss some recent results (to be presented as an oral at NeurIPS) on the complexity of non-smooth non-convex optimization problems. This is a relatively unexplored area yet contains essentially all deep learning problems as a special case. Joint work with Guy Kornowski.
Jean-Philippe Vert, GoogleFraming RNN as a kernel method: A neural ODE approach
Building on the interpretation of a recurrent neural network (RNN) as a continuous-time neural differential equation, we show, under appropriate conditions, that the solution of a RNN can be viewed as a linear function of a specific feature set of the input sequence, known as the signature. This connection allows us to frame a RNN as a kernel method in a suitable reproducing kernel Hilbert space. As a consequence, we obtain theoretical guarantees on generalization and stability for a large class of recurrent networks. This is a joint work with Adeline Fermanian, Pierre Marion, and Gérard Biau.
Matus Telgarsky, UIUCNatural policy gradient is implicitly biased towards high entropy optimal policies
This talk will show that natural policy gradient and actor-critic, even when trained on a single trajectory, not only finds an optimal policy, but moreover is biased towards the unique maximum entropy policy. This implicit bias automatically prefers exploration, and thus the method need not feature any explicit exploration (e.g., epsilon-greedy), or other regularization or projection steps.
Surbhi Goel, MSRWhat functions do self-attention blocks prefer to represent?
Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. In this talk, we will focus on studying the inductive bias of self-attention blocks by rigorously establishing which functions and long-range dependencies they statistically represent. Our main result shows that bounded-norm Transformer layers can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis, built around the large body of work on provably learning sparse Boolean functions. Based on joint work with Benjamin L. Edelman, Sham Kakade and Cyril Zhang.Paolo Favaro, University of Bern.
Lechao Xiao, GoogleEigenspace Restructuring: a Principle of Space and Frequency in Neural Networks
Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the extremely complex nature of the problem, progress has been relatively slow. In this talk, through the lenses of infinite-width networks, a.k.a. neural kernels, we present one of such principles resulting from the hierarchical locality. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept frequency, which measures the complexity of interactions. We show that the topology from convolutional networks (CNNs) partitions the associated eigenspaces into finer subspaces. In addition to frequency, the partition also depends crucially on the concept space — the distance among interaction terms, defined via the length of a minimum spanning tree (MST) containing them. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering them to model a much richer class of interactions simultaneously, including long-range-low-frequency interactions, short-range-high-frequency interactions, and various interpolations and extrapolations between them. Finally, we show that increasing the depth of a CNN can improve the interpolation resolution and, therefore, the network's learnability.
Nicolo Cesa-Bianchi, MilanMultitask Online Mirror Descent
We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We show that the regret of MT-OMD gets better when the variance of the tasks, measured according to the geometry induced by the regularizer, decreases. This improves upon the bound obtained by running independent OMDs on each task. Our multitask extensions of Online Gradient Descent and Exponentiated Gradient, two important instances of OMD, are shown to enjoy closed-form updates, making them easy to use in practice. Joint work with: Pierre Laforgue (Univ. of Milan), Andrea Paudice (Univ. of Milan and IIT), and Massi Pontil (IIT).
Greg Valiant, StanfordNew Problems and Perspectives on Sampling, Learning, and Memory
Blind deconvolution has enjoyed quite a remarkable progress in the last few decades thanks to developments in optimization and machine learning. To a large extent today several algorithms allow to recover a sharp image from a blurry one without additional knowledge about the blur. This is a remarkable achievement given the extreme ill-posedness of this mathematical problem. Very interesting steps forward have been made in the last decade, when fundamental inconsistencies in the formulation, such as priors favoring the blurry solution, were exposed. This has led to the study of novel formulations that favor sharp over blurry images and result in state of the art performance with robustness to noise in real images. More recently, developments in deep learning have led to a fundamentally different approach to this problem, where enough data can adequately represent a realistic blur model and allow a neural network to learn how to remove blur from images. Approaches in deep learning have led to surprising results, where rather complex blur artifacts are removed effectively and efficiently. We give an account of the latest developments and show their strengths and weaknesses.
Steve Hanneke, TTICA Theory of Universal Learning
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy. In this work, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this work is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rates. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. Joint work with Olivier Bousquet, Shay Moran, Ramon van Handel, and Amir Yehudayoff, which appeared at STOC 2021.
Ellen Vitercik, CMUTheoretical Foundations of Data-Driven Algorithm Design
Algorithms often have tunable parameters that significantly impact performance metrics such as runtime and solution quality. For many algorithms used in practice, no parameter settings admit meaningful worst-case bounds, so the parameters are made available for the user to tune. Alternatively, parameters may be tuned implicitly within the proof of a worst-case approximation ratio or runtime bound. Worst-case instances, however, may be rare or nonexistent in practice. A growing body of research has demonstrated that data-driven algorithm design can lead to significant improvements in performance. This approach uses data from the application at hand to optimize algorithmic performance. In this talk, I will cover recent research that provides theoretical guarantees for data-driven algorithm design.
Sergei Vassilvitski, GoogleWarm start with Predictions
A recent line of research investigates how algorithms can be augmented with machine-learned predictions to overcome worst case lower bounds. This area has revealed interesting algorithmic insights into problems, with particular success in the design of competitive online algorithms. However, the question of improving algorithm running times with predictions has largely been unexplored. We take a first step in this direction by combining the idea of machine-learned predictions with the idea of ""warm-starting"" primal-dual algorithms. We consider one of the most important primitives in combinatorial optimization: weighted bipartite matching and its generalization to b-matching. We identify three key challenges when using learned dual variables in a primal-dual algorithm. First, predicted duals may be infeasible, so we give an algorithm that efficiently maps predicted infeasible duals to nearby feasible solutions. Second, once the duals are feasible, they may not be optimal, so we show that they can be used to quickly find an optimal solution. Finally, such predictions are useful only if they can be learned, so we show that the problem of learning duals for matching has low sample complexity. We validate our theoretical findings through experiments on both real and synthetic data. As a result we give a rigorous, practical, and empirically effective method to compute bipartite matchings. Joint work with Michael Dinitz, Sungjin Im, Thomas Lavastida, and Benjamin Moseley.
Wen Sun, CornellEfficient Reinforcement Learning via Representation Learning
Efficient Reinforcement Learning (RL) for large-scale Markov Decision Processes (MDP) with complex and high-dimensional data requires using rich function approximation. Prior works that tackle large-scale MDPs with rich function approximation often have computationally inefficient algorithms which cannot be easily implemented in a practical manner. We propose to open the black box of rich function approximators by explicitly learning a low-dimensional compact representation and simultaneously performing the usual RL procedures (e.g., explore and exploit) on top of the learned representation. Learning representation in RL is challenging: a good representation cannot be discovered without high-quality data that is relevant to the task that we are solving, while collecting high-quality data requires strategic exploration which itself may depend on the quality of the learned representations. We study the representation learning question in the context of low-rank MDP, where the ground true feature representation is unknown. We develop a model-based approach which iteratively learns an approximate factorization of the transition via maximum likelihood estimation using nonlinear function approximators, forms a reward bonus using the learned representation from the factorization, and then plans inside the learned model with the ground truth reward and the bonus. Our approach carefully balances the interplay between representation learning, exploration, and exploitation, and significantly improves the sample complexity of the prior state-of-art approach (Agarwal et.al.,20) for low-rank MDP. We further demonstrate that our new technique can be used for offline RL where the offline data does not have a great coverage. Specifically, we develop an offline RL algorithm for low-rank MDP which can compete against any policy as long as it is covered by the offline data. This is a joint work with Masatoshi Uehara (Cornell) and Xuezhou Zhang (Princeton).
Ashok Cutkosky, Boston UniversityOnline Learning with Hints
Typical online linear optimization algorithms consider purely adversarial cost vectors, but in reality we may be able to partially predict costs ahead of time. In this talk, we consider a setting in which we have some prediction or "hint" about each cost vector available ahead of time that may or may not be accurate. The goal is to design algorithms that improve regret from the typical sqrt(T) to only log(T) when the hints are accurate, while gracefully decaying to standard bounds when the hints are inaccurate. Accuracy is measured by correlation: a hint vector is accurate if it is well-correlated with the corresponding cost vector. These results improve prior work on optimistic online learning: the regret bound is better, and the condition for being an accurate hint is weaker. Amazingly, this same result holds (in expectation) even if we are only allowed to look at sqrt(T) hints.
Xinyi Chen, GoogleProvable Regret Bounds for Deep Online Learning and Control
The use of deep neural networks has been highly successful in reinforcement learning and control, although few theoretical guarantees for deep learning exist for these problems. There are two main challenges for deriving performance guarantees: a) control has state information and thus is inherently online, and b) deep networks are non-convex predictors for which online learning cannot provide provable guarantees in general. Building on the linearization technique for overparameterized neural networks, we derive provable regret bounds for efficient online learning with deep neural networks. Specifically, we show that over any sequence of convex loss functions, any low-regret algorithm can be adapted to optimize the parameters of a neural network such that it competes with the best net in hindsight. As an application of these results in the online setting, we obtain provable bounds for online episodic control with deep neural network controllers.
Praneeth Netrapalli, GoogleStreaming Estimation with Markovian Data: Limits and Algorithms
Standard results in machine learning theory provide near optimal algorithms for learning under the assumption of i.i.d. data in many problems of interest. However, data with temporal dependence occur often in practice in the analysis of time series, system identification and reinforcement learning. In these settings, data is often assumed to be derived from a mixing Markov process and it is important to learn-on-the-go. Currently, the theoretical analyses of many algorithms reduce learning from these data sets to learning from independent data by considering one in every mixing time number of samples and dropping the rest. In this talk, we will first see that this is in fact tight in the worst case and naively implementing SGD in these scenarios suffers from slow convergence due to dependencies present in the data. However, in well-specified cases, we can design algorithms to accurately unfurl this dependency structure in order to obtain SGD-style streaming algorithms with i.i.d. data like sample complexity. To this end, we consider the specific tasks of least squares regression with Markovian data and non-linear system identification, and introduce parallel SGD and SGD with Reverse Experience Replay (SGD-RER) which is a rigorous form of the popular heuristic Experience Replay used in practical RL. We then sketch an application to the widely used Q-learning algorithm.
Akshay Krishnamurthy, Microsoft Research NYCEfficient first-order contextual bandits
In this talk, I will describe a new contextual bandit algorithm, called FastCB, that achieves a form of instance-dependent guarantee known as a first-order regret bound, resolving a COLT 2017 open problem. The algorithm is simple, practical, accommodates generic function classes, and also performs quite well empirically. Key to our analysis is an information-theoretic quantity called the triangular divergence, which also yields new first-order guarantees for plug-in methods in the classical statistical learning framework.
Day 2 - Thursday, October 21st
I will briefly describe how the privacy of Federated Learning can be strengthened by composing complementary privacy techniques such as secure aggregation and differential privacy. In particular, I will present an end-to-end system which carefully discretizes the data and adds discrete Gaussian noise before performing secure aggregation. The sum of independent discrete Gaussians is not a discrete Gaussian, but we show that it is extremely close for the parameter regimes of interest. This is the basis of our differential privacy guarantee. I will conclude by showing experimental results that demonstrate that our solution is able to achieve a comparable accuracy to central differential privacy (which requires trusting the server in adding noise) with just 16 bits of precision per value.
Varun Kanade, University of OxfordThe Statistical Complexity of Early-Stopped Mirror Descent
We will talk about the implicit regularization properties of iterative gradient-based optimization algorithms. We will characterize the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to unregularized empirical risk for linear models and kernel methods. This is through identifying the intrinsic link between offset Rademacher complexities and potential-based convergence analysis of mirror descent. We get excess risk-guarantees for the path traced by mirror descent in terms of offset complexities for certain function classes. We can recover several recent results via rather short and elegant proofs, while also providing improvements in some settings.
Quanquan Gu, UCLABenign Overfitting of Constant-Stepsize SGD for Linear Regression
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. In this talk, I will discuss benign overfitting of constant-stepsize SGD in arguably the most basic setting: linear regression in the overparameterized regime. Our main results provide a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible. I will also reflect on a number of notable differences between the algorithmic regularization afforded by unregularized SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression. This is joint work with Difan Zou, Jingfeng Wu, Vladimir Braverman, and Sham M. Kakade.
Satyen Kale, GoogleA Deep Conditioning Treatment of Neural Networks
We study the role of depth in training randomly initialized overparameterized neural networks. We give a general result showing that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data. This result holds for arbitrary non-linear activation functions under a certain normalization. As applications of these general results, we provide a generalization of the results of Das et al. (2019) showing that learnability of deep random neural networks with a large class of non-linear activations degrades exponentially with depth. We also show how benign overfitting can occur in deep neural networks via the results of Bartlett et al. (2019b). Finally, we give experimental evidence that normalized versions of ReLU are a viable alternative to more complex operations like Batch Normalization in training deep neural networks.
Jascha Sohl-dickstein, GoogleLearned optimizers: why they're hard and why they're the future
The success of deep learning has hinged on learned functions dramatically outperforming hand-designed functions for many tasks. However, we still train models using hand designed parameter update rules acting on hand designed loss functions. I will argue that these hand designed components are typically mismatched to the desired model behavior, and that we can expect meta-learned update rules and loss functions to perform much better. I will discuss the pathologies and challenges that make meta-training learned optimizers difficult. I will then share progress we have made developing learned optimizers that outperform hand-designed optimizers.
Jon Kleinberg, Cornell (Keynote)Algorithmic Monoculture and Social Welfare
As algorithms are increasingly applied to screen applicants for high-stakes decisions in employment, education, lending, and other domains, concerns have been raised about the effects of "algorithmic monoculture", in which many decision-makers all rely on the same algorithm. This concern invokes analogies to agriculture, where a monocultural system runs the risk of severe harm from unexpected shocks. We present a set of basic models characterizing the potential risks from algorithmic monoculture, showing that monocultural convergence on a single algorithm by a group of decision-making agents, even when the algorithm is more accurate for any one agent in isolation, can reduce the overall quality of the decisions being made by the full collection of agents. Our results rely on minimal assumptions, and involve a combination of game-theoretic arguments about competing decision-makers with the development of a probabilistic framework for analyzing systems that use multiple noisy estimates of a set of alternatives. The talk is based on joint work with Manish Raghavan.
Peter Bartlett, GoogleAdversarial examples in deep networks
We consider ReLU networks of constant depth with independent gaussian parameters, and show that these networks compute functions that are very close to linear, so that a small step in the gradient direction leads to a large change of output. This result is for networks with constant depth, but we also show that some constraint on depth is necessary: there are suitably deep networks that, with constant probability, compute a function that is close to constant. Joint work with Sébastien Bubeck and Yeshwanth Cherapanamjeri
Aravindan Vijayaraghavan, NorthwesternAlgorithms for learning depth-2 neural networks with general ReLU activations
We present polynomial time algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. Prior guarantees for learning ReLU networks are for the special case when there are no bias terms in the ReLU activations. We also use these ideas to establish identifiability of the network parameters under minimal assumptions.
Ananda Theertha Suresh, GoogleLearning with user-level differential privacy
The classical setting of differential privacy assumes each participant contributes a single sample to the dataset and preserves privacy by noising the output in a way that is commensurate with the maximum contribution of a single example. However, in many practical applications such as federated learning, each participant can contribute multiple samples. In this setting, we study the fundamental problems of discrete distribution estimation, high-dimensional mean estimation, and empirical risk minimization. For all these problems, we provide polynomial time algorithms and information theoretic lower bounds.
Raman Arora, JHUMachine Unlearning via Algorithmic Stability
In this talk, I will focus on the problem of Machine UnLearning, aka data deletion in the context of machine learning. I will motivate a notion of algorithmic stability, Total Variation (TV) stability, as central to our goal. For convex risk minimization problems, I will present TV-stable algorithms based on noisy SGD and corresponding efficient unlearning algorithms based on constructing a (maximal) coupling of Markov chains for the noisy SGD procedure. I will conclude with a discussion about the trade-offs between accuracy and unlearning efficiency and extensions of our techniques to non-convex functions and differentially private settings.
Jamie Morgenstern, GoogleIndividualization, persuasion, and polarization
In this talk, I’ll give a brief overview of the connections and distinctions between the literature on multi-armed bandits or more general reinforcement learning frameworks and specialized models of when individuals’ preferences change as a function of content they are shown. I will outline several of the best known results for the equilibria of such systems, and present some new results describing when personalization (or a lack thereof) can lead to polarization.
Jon Schneider, GoogleStrategizing Against No-Regret Learners
How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium of the game. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly higher than the Stackelberg equilibrium utility. We provide a characterization of the optimal game-play for the player against a mean-based no-regret learner as a solution to a control problem. When the no-regret learner's strategy also guarantees him a no-swap regret, we show that the player cannot get anything higher than a Stackelberg equilibrium utility. (Based on joint work with Yuan Deng and Balasubramanian Sivan).
Naman Agarwal, GoogleA Regret Minimization Approach to Iterative Learning Control
In this work, we consider the setting of iterative learning control, or model-based policy learning in the presence of uncertain, time-varying dynamics. In this setting, we propose a new performance metric, planning regret, which replaces the standard stochastic uncertainty assumptions with worst case regret. Based on recent advances in non-stochastic control, I will present our proposed new iterative algorithm for minimizing planning regret that is more robust to model mismatch and uncertainty. We provide theoretical and empirical evidence that the proposed algorithm outperforms existing methods on several benchmarks.
Wouter Koolen, CWIA/B/n Testing with Control in the Presence of Subpopulations
We revisit the question: "which versions of my website are better than the control". We address this question in the active sequential testing framework, where we explicitly model the presence of subpopulations among the users. For three modes of interaction depending on what is observed about these subpopulations, we characterize the optimal sample complexity, develop matching algorithms and validate the methods experimentally.
Haipeng Luo, University of Southern California The Best of Both Worlds: Stochastic and Adversarial Episodic MDPs with Unknown Transition
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through T episodes, with the goal of achieving O(sqrt{T}) regret when the losses are adversarial and simultaneously O(polylog(T)) regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by (a fraction of) the regret itself in the stochastic setting, a key property to ensure O(polylog(T)) regret.
Chris Dann, GoogleAgnostic RL in Low-Rank MDPs with Rich Observations
There have been many recent advances on provably efficient RL in problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. In this talk, I will give an overview of our work on the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies Π that may not contain any near-optimal policy. We provide a new algorithm whose error compared to the best policy in class is bounded by the rank d of the underlying MDP. Specifically, our error bound scales polynomially in the horizon and number of actions, but grows exponentially with rank d. We also show that this exponential dependence on rank is unavoidable in the agnostic setting without making additional assumptions through a nearly matching lower bound.