Faculty Pick is a section of the PACM/DSA newsletter where a faculty member can highlight a dashboard, news item or journal article that they believe would be interesting or useful for the students to read.
Sumanlata Ghosh is an adjunct faculty member in the Data Science and Analytics program. She selected an article that she thought PACM and DSA students might find interesting or useful: "Understanding deep learning requires rethinking generalization".
In the paper, the authors argue through experiments that explicit regularization may improve generalization when carefully tuned, but that it neither adequately explains why neural networks generalize nor is it the sole reason they do.
The authors also show, using linear models, how SGD acts as an implicit regularizer. In the early parts of the paper they demonstrate how regularization methods fail to prevent memorization when the training labels are randomized: with random labels, the network simply memorizes the data, i.e., SGD still fits the training set perfectly (a complete overfit, with 100% accuracy on the training data). Through experiments on the ImageNet dataset with the Inception architecture, the authors examine whether implicit and explicit regularizers help generalization and test performance. Among the explicit regularizers, data augmentation improves performance on the test set and thus generalization, and it proves more effective than other regularizers such as weight decay and dropout. They also note that a better architecture can improve generalization more than the methods above. Turning to implicit regularization, after training Inception both with and without batch normalization layers, the authors conclude that implicit and explicit regularization both help generalization, but neither is sufficient to fully explain it, since the networks continue to perform well even with the regularizers removed.
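To see the randomization test in miniature, the sketch below fits a small over-parameterized network to completely random labels on synthetic data. The data sizes, the MLP architecture, and the optimizer settings are illustrative assumptions, not the paper's ImageNet/Inception setup; the point is only that training accuracy climbs toward 100% even though the labels carry no signal.

```python
# Minimal sketch of the randomized-label experiment (synthetic data, small MLP).
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, k = 200, 32, 10                      # samples, input dim, classes (assumed sizes)
X = torch.randn(n, d)                      # random "images"
y = torch.randint(0, k, (n,))              # completely random labels

# Over-parameterized network: far more weights than training examples.
model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):                   # plain full-batch SGD updates
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

with torch.no_grad():
    train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy on random labels: {train_acc:.2f}")  # approaches 1.0
```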
Towards the end of the paper, the authors turn to linear models to show how SGD acts as an implicit regularizer and to look for parallel insights for neural networks. They discuss the curvature of the loss function around the minima that SGD reaches as a way to judge their quality (which minima generalize best). Through a mathematical argument they conclude that for training sets smaller than about a hundred thousand examples, SGD can fit any set of labels perfectly. Among all possible solutions, SGD generally converges to the one with minimum norm, which performs quite well for convex models. The authors conclude that large, complex models have the capacity to memorize the data completely, so reducing the complexity of these networks to improve generalization remains a challenge. Moreover, the reason optimization is easy differs from the true cause of generalization.
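As a rough companion to that linear-model argument, here is a small numpy sketch (with assumed sizes and learning rate, not code from the paper): when there are more dimensions than samples, the system Xw = y has infinitely many solutions, and SGD started from zero both fits the labels exactly and lands on the minimum-norm (pseudoinverse) solution.

```python
# SGD on an under-determined least-squares problem converges to the
# minimum-norm solution, matching the pseudoinverse.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                          # fewer samples than dimensions (assumed sizes)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)              # arbitrary (even random) labels can be fit exactly

w = np.zeros(d)                         # starting at zero keeps w in the row space of X
lr = 0.001
for epoch in range(500):
    for i in rng.permutation(n):        # one stochastic gradient step per sample
        grad = (X[i] @ w - y[i]) * X[i]
        w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y      # explicit minimum-norm least-squares solution
print(np.allclose(X @ w, y, atol=1e-3))       # SGD fits the training data
print(np.linalg.norm(w - w_min_norm))         # and recovers the min-norm solution
```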
Read the full article here.