Understanding Black-box Predictions via Influence Functions


This is a PyTorch reimplementation of influence functions from the ICML 2017 best paper, "Understanding Black-box Predictions via Influence Functions" by Pang Wei Koh and Percy Liang. To scale up influence functions to modern machine learning settings, the paper develops a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. The paper plots I_up,loss against variants that are missing these terms and shows that they are necessary for picking up the truly influential training points. Influence functions can of course also be used for data other than images.

Requirements: chainer v3 (it uses FunctionHook).

config is a dict which contains the parameters used to calculate the influence values.

Visualised, the output can look like this: the test image on the top left is the test image for which the influences were calculated.
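As a rough sketch, the config dict might look like the following. The key names and default values here are illustrative assumptions, not the library's exact API; check the repository for the real defaults.

```python
# Hypothetical config for the influence calculation. Key names are
# illustrative assumptions, not the library's exact API.
config = {
    "gpu": -1,                # -1 = CPU; otherwise a CUDA device id
    "recursion_depth": 5000,  # iterations of the stochastic s_test estimate
    "r_averaging": 10,        # independent s_test estimates to average
    "damp": 0.01,             # damping term added to the Hessian
    "scale": 25.0,            # scaling to keep the Neumann recursion stable
    "test_sample_num": 1,     # number of test points to explain
    "outdir": "outdir/",      # where grad_z / s_test results are written
}
```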
While one grad_z is used to estimate the influence of a single training sample, the s_test vector is computed once per test sample and then reused across the whole training set. For more details, please see "Understanding Black-box Predictions via Influence Functions".
How can we explain the predictions of a black-box model? We use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction. (Pang Wei Koh and Percy Liang. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885-1894, 2017.)

The most barebones way of getting the code to run uses config's default values for the influence function calculation; the resulting s_test vector is then combined with each training sample's gradient to calculate the influence. A Dockerfile with the dependencies of the original TensorFlow implementation can be found here: https://hub.docker.com/r/pangwei/tf1.1/.

The resulting dict structure looks similar to this: harmful is a list of numbers, which are the IDs of the training data samples that most increase the test loss. With these rankings you can, for example, compress your dataset slightly to the most influential images important for a given prediction.
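To make the s_test / grad_z bookkeeping concrete, here is a minimal self-contained numpy sketch for logistic regression on synthetic data (the function names and the damping term are my own additions, not the repository's API). It computes s_test = H^{-1} grad L(z_test) once, then dots it with each training sample's gradient to rank points.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_z(x, y, theta):
    # Gradient of L(z, theta) = log(1 + exp(-y * theta^T x)) for one sample.
    return -y * x * sigmoid(-y * np.dot(theta, x))

def influence_scores(X, y, theta, x_test, y_test, damp=0.01):
    n, p = X.shape
    # Hessian of the empirical risk for logistic regression (small p only).
    s = sigmoid(X @ theta)
    H = (X.T * (s * (1 - s))) @ X / n + damp * np.eye(p)
    # s_test = H^{-1} grad L(z_test); computed once, reused for all z_i.
    s_test = np.linalg.solve(H, grad_z(x_test, y_test, theta))
    # I_up,loss(z_i, z_test) = -s_test . grad L(z_i); positive = harmful
    # (upweighting z_i would increase the test loss).
    return np.array([-s_test @ grad_z(X[i], y[i], theta) for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))
theta = np.zeros(3)
for _ in range(200):  # plain gradient descent to rough convergence
    theta -= 0.5 * np.mean([grad_z(X[i], y[i], theta) for i in range(50)], axis=0)

scores = influence_scores(X, y, theta, X[0], y[0])
harmful = np.argsort(scores)[-5:]  # most positive: raise the test loss
helpful = np.argsort(scores)[:5]   # most negative: lower the test loss
```

Note that with z_test chosen as a training point, its own score is the (non-positive) self-influence -g^T H^{-1} g.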
Often we want to identify an influential group of training samples for a particular test prediction. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks. See more on this in the talk video at https://www.microsoft.com/en-us/research/video/understanding-black-box-predictions-via-influence-functions/.

In the implementation, grad_z, on the other hand, depends only on the training sample. Caching the grad_z vectors and reusing them across test points can speed up the calculation significantly, as no duplicate calculations take place.
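One of the dataset-debugging uses is flagging likely-mislabeled points by their self-influence I_up,loss(z_i, z_i) = -g_i^T H^{-1} g_i, whose magnitude is large for points the model strains to fit. A minimal numpy sketch on synthetic logistic-regression data (my own toy setup, not the paper's experiment):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def self_influence(X, y, theta, damp=0.01):
    # |I_up,loss(z_i, z_i)| = g_i^T H^{-1} g_i; large values flag points the
    # model strains to fit, e.g. mislabeled examples.
    n, p = X.shape
    s = sigmoid(X @ theta)
    H = (X.T * (s * (1 - s))) @ X / n + damp * np.eye(p)
    Hinv = np.linalg.inv(H)
    G = -(y * sigmoid(-y * (X @ theta)))[:, None] * X  # per-sample gradients
    return np.einsum("ip,pq,iq->i", G, Hinv, G)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = np.sign(X @ np.array([2.0, -1.0, 0.5, 0.0]))
y[:20] = -y[:20]                     # flip 10% of the labels
theta = np.zeros(4)
for _ in range(300):                 # gradient descent on the logistic loss
    g = -(y * sigmoid(-y * (X @ theta)))[:, None] * X
    theta -= 0.5 * g.mean(axis=0)

scores = self_influence(X, y, theta)
suspects = np.argsort(scores)[::-1][:20]  # highest self-influence first
```

On this toy data the flipped points receive much higher self-influence on average than the clean ones, so reviewing the top of the ranking uncovers label errors far faster than random inspection.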
We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. The reference implementation can be found here: link.

A presentation of the paper (Pang Wei Koh & Percy Liang; presented by Theo, Aditya, and Patrick) covers:
1. Influence functions: definitions and theory
2. Efficiently calculating influence functions
Notes on the method (ICML 2017 best paper; Pang Wei Koh and Percy Liang, Stanford):

Given a test point z_{test} = (x_{test}, y_{test}), we ask which training points most influence the model's prediction on it. Let the training set be z_1, \dots, z_n with z_i = (x_i, y_i), and let L(z, \theta) be the loss of sample z at parameters \theta. Empirical risk minimization (ERM) gives

\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta).

Upweighting a training point z by a small \epsilon yields the perturbed ERM problem

\hat{\theta}_{\epsilon,z} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta).

The influence of upweighting z on the parameters is

\mathcal{I}_{up,params}(z) = \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}),

where H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta) is the Hessian of the empirical risk. Chaining through the test loss:

\begin{aligned} \mathcal{I}_{up,loss}(z, z_{test}) &= \frac{dL(z_{test}, \hat\theta_{\epsilon,z})}{d\epsilon}\Big|_{\epsilon=0} \\ &= \nabla_\theta L(z_{test}, \hat\theta)^T \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} \\ &= \nabla_\theta L(z_{test}, \hat\theta)^T \mathcal{I}_{up,params}(z) \\ &= -\nabla_\theta L(z_{test}, \hat\theta)^T H^{-1}_{\hat\theta} \nabla_\theta L(z, \hat\theta). \end{aligned}

For intuition, consider logistic regression with p(y \mid x) = \sigma(y \theta^T x), where \sigma is the sigmoid and y \in \{-1, 1\}. The influence of training point z = (x, y) on the loss at z_{test} works out to

\mathcal{I}_{up,loss}(z, z_{test}) = -y_{test}\, y \cdot \sigma(-y_{test}\theta^T x_{test}) \cdot \sigma(-y\theta^T x) \cdot x_{test}^T H^{-1}_{\hat\theta} x.

The factor \sigma(-y\theta^T x) is large for poorly-fit points such as outliers, and x_{test}^T H^{-1}_{\hat\theta} x measures the similarity between x and x_{test} weighted by the inverse Hessian, which accounts for the resistance/variation of each direction in feature space.

Forming and inverting the Hessian H_{\hat\theta} costs O(np^2 + p^3) for n training points z_i and p parameters, which is infeasible for modern models. Both workarounds -- conjugate gradients and stochastic estimation -- rely only on Hessian-vector products (HVPs). Define

s_{test} = H^{-1}_{\hat\theta} \nabla_\theta L(z_{test}, \hat\theta),

so that \mathcal{I}_{up,loss}(z, z_{test}) = -s_{test} \cdot \nabla_{\theta} L(z, \hat\theta): once s_{test} is computed, the influence of each training point costs only one gradient. Conjugate gradients obtains s_{test} by solving

H_{\hat\theta}^{-1} v = \arg\min_{t} \frac{1}{2} t^T H_{\hat\theta} t - v^T t,

where each HVP inside CG costs O(np), so H^{-1} is never formed explicitly.
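The conjugate-gradient route can be sketched in a few lines of numpy for logistic regression, where an exact HVP costs O(np) without ever materializing H. This is an illustrative toy implementation (function names and the damping term are mine), not the paper's code:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def make_hvp(X, theta, damp=0.01):
    # Hessian-vector product for logistic regression, O(np) per call:
    # H v = (1/n) X^T diag(s(1-s)) (X v) + damp * v, without forming H.
    s = sigmoid(X @ theta)
    w = s * (1 - s)
    n = X.shape[0]
    return lambda v: X.T @ (w * (X @ v)) / n + damp * v

def conjugate_gradient(hvp, v, iters=100, tol=1e-10):
    # Minimizes (1/2) t^T H t - v^T t, i.e. solves H t = v.
    t = np.zeros_like(v)
    r = v - hvp(t)
    d = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        t += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return t

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
theta = rng.normal(size=5)
hvp = make_hvp(X, theta)
v = rng.normal(size=5)       # stands in for grad L(z_test)
s_test = conjugate_gradient(hvp, v)  # approximates H^{-1} v
```

In a neural-network setting the same loop applies, with the HVP supplied by double backpropagation (Pearlmutter's trick) instead of the closed form above.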
The stochastic estimation alternative exploits the Neumann series H^{-1} = \sum_{i=0}^{\infty} (I-H)^i, valid when the eigenvalues of H lie in (0, 1) (which can be arranged by scaling the loss). The partial sums

S_j = \sum_{i=0}^{j-1} (I-H)^i = \frac{I-(I-H)^j}{I-(I-H)} = \frac{I-(I-H)^j}{H}

satisfy \lim_{j \to \infty} S_j = H^{-1}, and obey the recursion S_j = I + (I-H) S_{j-1}. At each step we sample a single training point z_i and use \nabla_\theta^{2} L(z_i, \hat\theta) as an unbiased estimate of H, so each iteration of estimating S_j \cdot \nabla_\theta L(z_{test}, \hat\theta) requires only one HVP.

Experiments: on MNIST, the influence values computed this way track the actual change in test loss under leave-one-out retraining. On image data, comparing an Inception network against an RBF SVM shows that the SVM's influential training points are driven largely by raw similarity, while Inception's reflect class-discriminative features. The paper also demonstrates a visually-indistinguishable training-set attack on Inception, and uses the self-influence \mathcal{I}_{up,loss}(z_i, z_i) of each training point z_i to surface likely mislabeled examples: with 10% of training labels flipped, prioritizing points by self-influence uncovers the flips far faster than random inspection.

For efficiency, the implementation calculates the grad_z values for all images first and saves them to disk.
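The stochastic (Neumann-series) estimator of s_test can be sketched as follows, again for logistic regression on synthetic data. The unrolled recursion h_j = v + h_{j-1} - HVP(h_{j-1})/scale has fixed point scale * H^{-1} v, so we divide by scale at the end; the depth, scale, and averaging values are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_hvp(x, theta, v, damp=0.01):
    # HVP using one sample's logistic-loss Hessian, s(1-s) * x x^T,
    # as an unbiased estimate of the full Hessian H.
    s = sigmoid(x @ theta)
    return (s * (1 - s)) * x * (x @ v) + damp * v

def lissa(X, theta, v, depth=2000, scale=10.0, repeats=5, seed=0):
    # Stochastic Neumann-series estimate of s_test = H^{-1} v:
    #   h_j = v + (I - H/scale) h_{j-1},
    # whose fixed point is scale * H^{-1} v. Several independent chains
    # are averaged to reduce the variance of the single-sample HVPs.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = []
    for _ in range(repeats):
        h = v.copy()
        for _ in range(depth):
            x = X[rng.integers(n)]
            h = v + h - sample_hvp(x, theta, h) / scale
        estimates.append(h / scale)
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
theta = rng.normal(size=5)
v = rng.normal(size=5)       # stands in for grad L(z_test)
s_test = lissa(X, theta, v)  # noisy estimate of H^{-1} v
```

The scale must exceed the largest eigenvalue of H for the recursion to be a contraction; in practice it is tuned together with the damping term.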
