Monday, November 28, 2016

New 'Deep Learning' book

Amazon link

At 800 pages, this will likely become the standard introduction to deep learning. It's expensive - £59.95 on Amazon - but you can read it for free (in a somewhat cumbersome way) as web pages here.

Update: MIT Press were shortsighted in banning a free PDF. There are copies all over the net (try GitHub). Here's a link, although the file is large (370 MB for an 800-page book).


Table of contents.

1 Introduction
1.1 Who Should Read This Book
1.2 Historical Trends in Deep Learning

Part I Applied Math and Machine Learning Basics

2 Linear Algebra
2.1 Scalars, Vectors, Matrices and Tensors
2.2 Multiplying Matrices and Vectors
2.3 Identity and Inverse Matrices
2.4 Linear Dependence and Span
2.5 Norms
2.6 Special Kinds of Matrices and Vectors
2.7 Eigendecomposition
2.8 Singular Value Decomposition
2.9 The Moore-Penrose Pseudoinverse
2.10 The Trace Operator
2.11 The Determinant
2.12 Example: Principal Components Analysis

3 Probability and Information Theory
3.1 Why Probability?
3.2 Random Variables
3.3 Probability Distributions
3.4 Marginal Probability
3.5 Conditional Probability
3.6 The Chain Rule of Conditional Probabilities
3.7 Independence and Conditional Independence
3.8 Expectation, Variance and Covariance
3.9 Common Probability Distributions
3.10 Useful Properties of Common Functions
3.11 Bayes’ Rule
3.12 Technical Details of Continuous Variables
3.13 Information Theory
3.14 Structured Probabilistic Models

4 Numerical Computation
4.1 Overflow and Underflow
4.2 Poor Conditioning
4.3 Gradient-Based Optimization
4.4 Constrained Optimization
4.5 Example: Linear Least Squares

5 Machine Learning Basics
5.1 Learning Algorithms
5.2 Capacity, Overfitting and Underfitting
5.3 Hyperparameters and Validation Sets
5.4 Estimators, Bias and Variance
5.5 Maximum Likelihood Estimation
5.6 Bayesian Statistics
5.7 Supervised Learning Algorithms
5.8 Unsupervised Learning Algorithms
5.9 Stochastic Gradient Descent
5.10 Building a Machine Learning Algorithm
5.11 Challenges Motivating Deep Learning

Part II Deep Networks: Modern Practices

6 Deep Feedforward Networks
6.1 Example: Learning XOR
6.2 Gradient-Based Learning
6.3 Hidden Units
6.4 Architecture Design
6.5 Back-Propagation and Other Differentiation Algorithms
6.6 Historical Notes

7 Regularization for Deep Learning
7.1 Parameter Norm Penalties
7.2 Norm Penalties as Constrained Optimization
7.3 Regularization and Under-Constrained Problems
7.4 Dataset Augmentation
7.5 Noise Robustness
7.6 Semi-Supervised Learning
7.7 Multi-Task Learning
7.8 Early Stopping
7.9 Parameter Tying and Parameter Sharing
7.10 Sparse Representations
7.11 Bagging and Other Ensemble Methods
7.12 Dropout
7.13 Adversarial Training
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models
8.1 How Learning Differs from Pure Optimization
8.2 Challenges in Neural Network Optimization
8.3 Basic Algorithms
8.4 Parameter Initialization Strategies
8.5 Algorithms with Adaptive Learning Rates
8.6 Approximate Second-Order Methods
8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks
9.1 The Convolution Operation
9.2 Motivation
9.3 Pooling
9.4 Convolution and Pooling as an Infinitely Strong Prior
9.5 Variants of the Basic Convolution Function
9.6 Structured Outputs
9.7 Data Types
9.8 Efficient Convolution Algorithms
9.9 Random or Unsupervised Features
9.10 The Neuroscientific Basis for Convolutional Networks
9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
10.1 Unfolding Computational Graphs
10.2 Recurrent Neural Networks
10.3 Bidirectional RNNs
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
10.5 Deep Recurrent Networks
10.6 Recursive Neural Networks
10.7 The Challenge of Long-Term Dependencies
10.8 Echo State Networks
10.9 Leaky Units and Other Strategies for Multiple Time Scales
10.10 The Long Short-Term Memory and Other Gated RNNs
10.11 Optimization for Long-Term Dependencies
10.12 Explicit Memory

11 Practical Methodology
11.1 Performance Metrics
11.2 Default Baseline Models
11.3 Determining Whether to Gather More Data
11.4 Selecting Hyperparameters
11.5 Debugging Strategies
11.6 Example: Multi-Digit Number Recognition

12 Applications
12.1 Large-Scale Deep Learning
12.2 Computer Vision
12.3 Speech Recognition
12.4 Natural Language Processing
12.5 Other Applications

Part III Deep Learning Research

13 Linear Factor Models
13.1 Probabilistic PCA and Factor Analysis
13.2 Independent Component Analysis (ICA)
13.3 Slow Feature Analysis
13.4 Sparse Coding
13.5 Manifold Interpretation of PCA

14 Autoencoders
14.1 Undercomplete Autoencoders
14.2 Regularized Autoencoders
14.3 Representational Power, Layer Size and Depth
14.4 Stochastic Encoders and Decoders
14.5 Denoising Autoencoders
14.6 Learning Manifolds with Autoencoders
14.7 Contractive Autoencoders
14.8 Predictive Sparse Decomposition
14.9 Applications of Autoencoders

15 Representation Learning
15.1 Greedy Layer-Wise Unsupervised Pre-training
15.2 Transfer Learning and Domain Adaptation
15.3 Semi-Supervised Disentangling of Causal Factors
15.4 Distributed Representation
15.5 Exponential Gains from Depth
15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
16.1 The Challenge of Unstructured Modeling
16.2 Using Graphs to Describe Model Structure
16.3 Sampling from Graphical Models
16.4 Advantages of Structured Modeling
16.5 Learning about Dependencies
16.6 Inference and Approximate Inference
16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
17.1 Sampling and Monte Carlo Methods
17.2 Importance Sampling
17.3 Markov Chain Monte Carlo Methods
17.4 Gibbs Sampling
17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
18.1 The Log-Likelihood Gradient
18.2 Stochastic Maximum Likelihood and Contrastive Divergence
18.3 Pseudolikelihood
18.4 Score Matching and Ratio Matching
18.5 Denoising Score Matching
18.6 Noise-Contrastive Estimation
18.7 Estimating the Partition Function

19 Approximate Inference
19.1 Inference as Optimization
19.2 Expectation Maximization
19.3 MAP Inference and Sparse Coding
19.4 Variational Inference and Learning
19.5 Learned Approximate Inference

20 Deep Generative Models
20.1 Boltzmann Machines
20.2 Restricted Boltzmann Machines
20.3 Deep Belief Networks
20.4 Deep Boltzmann Machines
20.5 Boltzmann Machines for Real-Valued Data
20.6 Convolutional Boltzmann Machines
20.7 Boltzmann Machines for Structured or Sequential Outputs
20.8 Other Boltzmann Machines
20.9 Back-Propagation through Random Operations
20.10 Directed Generative Nets
20.11 Drawing Samples from Autoencoders
20.12 Generative Stochastic Networks
20.13 Other Generation Schemes
20.14 Evaluating Generative Models
20.15 Conclusion
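
For a taste of the book's hands-on flavour, here is a minimal sketch of the idea behind section 6.1 (learning XOR with a small feedforward network). This is my own illustration in NumPy, not code from the book; the hidden-layer size, learning rate and iteration count are arbitrary choices.

    import numpy as np

    # A tiny feedforward network (one tanh hidden layer, sigmoid output)
    # trained on XOR by plain gradient descent on squared error.
    rng = np.random.default_rng(0)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for _ in range(5000):
        h = np.tanh(X @ W1 + b1)            # forward pass
        p = sigmoid(h @ W2 + b2)
        d_out = (p - y) * p * (1 - p)       # backprop of squared error
        d_hid = (d_out @ W2.T) * (1 - h**2)
        W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

    print(p.round(2))   # should end up close to [[0], [1], [1], [0]]

The point of the book's example is that a linear model cannot represent XOR at all, while one small hidden layer suffices.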


On Amazon, S. Matthews wrote this insightful review:
"This is, to invoke a technical reviewer cliché, a 'valuable' book. Read it and you will have a detailed and sophisticated practical understanding of the state of the art in neural networks technology. Interestingly, I also suspect it will remain current for a long time, because reading it I came to more and more of an impression that neural network technology (at least in the current iteration) is plateauing.

"Why? Because this book also makes very clear - is completely honest - that neural networks are a 'folk' technology (though they do not use those words): Neural networks work (in fact they work unbelievably well - at least, as Geoffrey Hinton himself has remarked, given unbelievably powerful computers), but the underlying theory is very limited and there is no reason to think that it will become less limited, and the lack of a theory means that there is no convincing 'gradient', to use an appropriate metaphor, for future development.

"A constant theme here is that 'this works better than that' for practical reasons not for underlying theoretical reasons. Neural networks are engineering, they are not applied mathematics, and this is very much, and very effectively, an engineer's book."


  1. Another interesting massive read! I have recently downloaded a lecture series on machine learning (from Stanford U.), largely because I suspect I may not believe the theory of "Probability" that is (I think) in use here. (My inspiration comes from QM Probability, plus a Neural Net researcher who discovered something odd about Neural Net probabilities - he has since been "bought up" by Google.) I want to check up on that.

    I guess there are plenty of opportunities for you to post aspects of your study of this too.

    1. Having now looked at parts of the book online, I can see why the reviewer sees no overall theory here. Indeed it has the flavour of an "homage to Connectionism/Hinton".

      Nevertheless there might be an overarching theory out there - the machine learning course I refer to above covers a range of topics (which this book omits as "inferior").

