A neural network is a composition of differentiable functions, typically alternating linear maps and nonlinearities, that maps inputs to outputs. It is trained by gradient descent on a loss function.
Single neuron
$y = \phi(w^\top x + b)$, where $x$ is the input vector, $w$ is a weight vector, $b$ is a bias, and $\phi$ is a nonlinear activation. Without the activation, the whole network would collapse to a single linear map.
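A minimal NumPy sketch of a single neuron, followed by a check of the collapse claim. The function names, the choice of sigmoid as the activation, and the example numbers are illustrative assumptions, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Return activation(w . x + b) for one input vector x."""
    return activation(np.dot(w, x) + b)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
# Pre-activation is 0.5*1 - 0.25*2 + 0 = 0, and sigmoid(0) = 0.5.
y = neuron(x, w, b=0.0)

# Collapse without activation: two stacked affine maps equal one affine map.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
stacked = W2 @ (W1 @ x + b1) + b2
merged = (W2 @ W1) @ x + (W2 @ b1 + b2)
collapses = np.allclose(stacked, merged)
```

The merged weights `W2 @ W1` and bias `W2 @ b1 + b2` reproduce the two-layer output exactly, which is why depth adds nothing without a nonlinearity in between.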
Layers
A layer applies a linear transform plus activation to a batch of inputs. Stack many layers → deep network. Each layer has its own weights; the network learns them by backpropagation.
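As a rough sketch, a layer acting on a batch can be written as one matrix product plus a broadcast bias, and a deep network as a loop over layers. The names `dense` and `forward`, the layer sizes, and the use of ReLU on every layer (including the last) are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dense(X, W, b, activation=relu):
    """X: (batch, n_in), W: (n_in, n_out), b: (n_out,). Returns (batch, n_out)."""
    return activation(X @ W + b)

def forward(X, params):
    """Apply each (W, b) pair in turn; every layer has its own weights."""
    h = X
    for W, b in params:
        h = dense(h, W, b)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input width, two hidden widths, output width
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
X = rng.normal(size=(5, 4))  # a batch of 5 inputs
out = forward(X, params)     # shape (5, 3)
```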
Activation functions
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$. The dominant choice: fast, doesn't saturate for positive inputs, simple gradient.
- Sigmoid: $\sigma(z) = 1 / (1 + e^{-z})$. Used for binary outputs. Saturates → vanishing gradients in deep nets.
- Tanh: $\tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})$. Like sigmoid but zero-centered. Used in some RNNs.
- Softmax: outputs sum to 1; used for multi-class outputs.
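The four activations above can be sketched in NumPy using their standard definitions. The max-subtraction in `softmax` is a common numerical-stability trick added here as an assumption, and `sigmoid_grad` is included only to illustrate saturation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z, axis=-1):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)            # negative entries are clipped to 0
probs = softmax(z)            # non-negative, sums to 1
grad_mid = sigmoid_grad(0.0)  # 0.25, the largest value the sigmoid gradient takes
grad_far = sigmoid_grad(10.0) # tiny: the sigmoid has saturated
```

Comparing `grad_mid` with `grad_far` shows the saturation problem: once inputs are large in magnitude, almost no gradient flows back through a sigmoid.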
Loss functions
- Cross-entropy for classification (matches softmax outputs)
- MSE for regression
- Custom losses for specific business objectives (rank loss, focal loss, etc.)
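A short sketch of the two standard losses. Here `cross_entropy` takes raw scores (logits) and integer class labels and applies a log-softmax internally. That interface is an assumption chosen for illustration, not something the notes specify.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error for regression."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(logits, labels):
    """logits: (batch, classes) raw scores; labels: (batch,) integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the correct class.
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Errors are 0 and 2, so the mean squared error is (0 + 4) / 2.
reg_loss = mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))

# Equal logits give a uniform softmax over 3 classes, so the loss is log(3).
clf_loss = cross_entropy(np.zeros((2, 3)), np.array([0, 2]))
```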
Training: gradient descent
Compute gradient of loss with respect to every weight (backpropagation = chain rule applied layer by layer). Update weights: $w \leftarrow w - \eta \nabla_w L$ for learning rate $\eta$.
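To make the procedure concrete, here is a small sketch of full-batch gradient descent with hand-written backpropagation for a network with one hidden tanh layer. The toy task (fitting sin on 64 points), the layer width, the learning rate, and the step count are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative assumption): learn y = sin(x).
X = rng.uniform(-2.0, 2.0, size=(64, 1))
y = np.sin(X)

# One hidden tanh layer with 8 units, then a linear output.
n_hidden = 8
W1 = rng.normal(scale=0.5, size=(1, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

eta = 0.05  # learning rate
losses = []
for step in range(500):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(np.mean(err ** 2))

    # Backward pass: chain rule from the loss back to the first layer.
    d_pred = 2.0 * err / len(X)   # dL/dpred for the mean squared error
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z = d_h * (1.0 - h ** 2)    # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ d_z
    db1 = d_z.sum(axis=0)

    # Gradient descent update: w <- w - eta * grad.
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2
```

Each backward line mirrors one forward line in reverse order, which is all "chain rule applied layer by layer" means in practice. Frameworks with automatic differentiation generate this backward pass for you.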
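Variants: SGD with momentum, Adam, AdamW. Momentum smooths gradients with a running average, Adam additionally normalizes each parameter's step by a running estimate of the squared gradient, and AdamW is Adam with weight decay decoupled from the gradient update. Adam is the safe default; AdamW for transformer-style nets.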
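The update rules can be sketched as single-step functions. The momentum form below follows one common convention (accumulate the raw gradient, then scale by the learning rate), and the hyperparameter defaults are typical values assumed for illustration. Setting `weight_decay > 0` in `adam_step` gives an AdamW-style decoupled decay.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Smooth the gradient with a running sum, then step along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step at iteration t (starting from 1).

    With weight_decay > 0 the decay term is applied directly to w,
    separately from the normalized gradient (AdamW-style).
    """
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step of each on f(w) = w^2 at w = 1, where the gradient is 2.
w0, g0 = 1.0, 2.0
w_mom, vel = sgd_momentum_step(w0, g0, velocity=0.0, lr=0.1)
w_adam, m1, v1 = adam_step(w0, g0, m=0.0, v=0.0, t=1, lr=0.1)
```

On the first step, Adam's normalized update is roughly `lr * sign(grad)` regardless of the gradient's magnitude, while the momentum step scales with the gradient itself.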
Why deep
Deep networks (many layers) can represent some functions exponentially more efficiently than shallow ones. Empirically they often generalize better, especially on data with hierarchical structure (images, text). The price: harder optimization, more compute, larger memory.