- Problem: building an expressive, tractable and scalable image model that can be used in downstream tasks like image generation, reconstruction and compression.
- [Link to the paper](https://arxiv.org/abs/1601.06759)
- Scan the image one row at a time and, within each row, one pixel at a time.
- Given the scanned content, predict the distribution over the possible values for the next pixel.
- The joint distribution over pixel values is factorised into a product of conditional distributions, casting image modelling as a sequence modelling problem (written out below).
- Parameters used in prediction are shared across all the pixel positions.
- Since each pixel is jointly determined by three values (one per colour channel), each channel is conditioned on the channels already generated for that pixel, in addition to all previous pixels.
- The conditional distributions are multinomial (with each channel variable taking one of 256 discrete values).
- This discrete representation is simpler and easier to learn than a continuous one.
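Written out, the factorisation (equations 1 and 2 in the paper) is:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$

$$p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \, p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \, p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})$$

where $\mathbf{x}$ is an $n \times n$ image read in raster-scan order and every factor on the right is a 256-way softmax.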
- Row LSTM: a unidirectional layer that processes the image row by row.
- Uses a one-dimensional convolution (kernel of size k × 1, k ≥ 3).
- Refer to Figure 2 in the paper.
- Weight sharing in the convolution ensures translation invariance of the computed features along each row.
- For the LSTM, the input-to-state component is computed for the entire two-dimensional input map and is then masked to include only the valid context.
- For the state-to-state component, refer to equation 3 in the paper (reproduced below).
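For reference, equation 3 (the Row LSTM step for row $i$) reads:

$$[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] = \sigma(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i)$$

$$\mathbf{c}_i = \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i$$

$$\mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{c}_i)$$

where $\mathbf{x}_i$ is row $i$ of the input map, $\mathbf{K}^{is}$ and $\mathbf{K}^{ss}$ are the input-to-state and state-to-state kernels, $\circledast$ denotes convolution, $\odot$ elementwise multiplication, and $\sigma$ is the sigmoid for the gates $\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i$ and tanh for the content $\mathbf{g}_i$.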
- Diagonal BiLSTM: a bidirectional layer that processes the image in a diagonal fashion.
- The input map is skewed by offsetting each row by one position with respect to the previous row, so that the diagonals of the image become columns (see the sketch after this group).
- Refer to Figure 3 in the paper.
- For both directions, the input-to-state component is a 1 × 1 convolution, while the state-to-state recurrent component is computed with a column-wise convolution with kernel size 2 × 1.
- The 2 × 1 kernel processes a minimal amount of information at each step, yielding a highly non-linear computation overall.
- The output map is skewed back by removing the offset positions.
- To prevent the layer from seeing future pixels, the right output map is shifted down by one row and added to the left output map.
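A minimal NumPy sketch of the skewing trick (function names are ours, not from the paper): shifting each row one position further right than the row above turns the diagonals of the image into columns, so a 2 × 1 column convolution can sweep along them.

```python
import numpy as np

def skew(x):
    """Skew an (H, W) map: shift row i right by i positions, giving an
    (H, H + W - 1) map whose columns are the diagonals of the input."""
    h, w = x.shape
    out = np.zeros((h, h + w - 1), dtype=x.dtype)
    for i in range(h):
        out[i, i:i + w] = x[i]
    return out

def unskew(x, w):
    """Invert skew(): drop the offset (padding) positions."""
    return np.stack([x[i, i:i + w] for i in range(x.shape[0])])

x = np.arange(12).reshape(3, 4)
assert np.array_equal(unskew(skew(x), x.shape[1]), x)
```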
- Residual connections are used to increase convergence speed and to propagate signals more directly through the network (a sketch follows).
- Refer to Figure 4 in the paper.
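A sketch of the residual block described for the PixelCNN variant; to keep the snippet self-contained, a plain 3 × 3 convolution stands in for the masked one used in the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """2h -> 1x1 conv -> h -> 3x3 conv -> h -> 1x1 conv -> 2h, plus an
    identity shortcut; in the paper the 3x3 convolution is masked."""
    def __init__(self, h):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(), nn.Conv2d(2 * h, h, kernel_size=1),
            nn.ReLU(), nn.Conv2d(h, h, kernel_size=3, padding=1),
            nn.ReLU(), nn.Conv2d(h, 2 * h, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)  # shortcut lets signals bypass the block
```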
- Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
- Mask A is applied to the first convolutional layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen.
- Mask B is applied to all subsequent input-to-state convolutional transitions and additionally allows connections from a colour channel to itself (see the sketch below).
- Refer to Figure 4 in the paper.
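A simplified PyTorch sketch of a masked convolution (class name ours). It implements only the spatial part of the masks, as valid for a single-channel image; the full masks in the paper additionally order the R, G, B channels within a pixel.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d that zeroes weights on pixels below the centre row and to the
    right of the centre in that row. Mask 'A' (first layer) also zeroes the
    centre weight; mask 'B' (later layers) keeps it."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # right of (or at) centre
        mask[kh // 2 + 1:, :] = 0                         # all rows below
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply after every weight update
        return super().forward(x)

first = MaskedConv2d('A', 1, 16, kernel_size=7, padding=3)   # mask A
later = MaskedConv2d('B', 16, 16, kernel_size=3, padding=1)  # mask B
```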
- PixelCNN: uses multiple convolutional layers that preserve the spatial resolution.
- Makes the receptive field large but, unlike the RNN variants, not unbounded.
- Masks are used to avoid seeing the future context.
- Faster than PixelRNN at training and evaluation time, since the convolutions can easily be parallelised.
- Multi-Scale PixelRNN: composed of one unconditional PixelRNN and one or more conditional PixelRNNs.
- The unconditional network generates a smaller s × s image, which is fed as input to the conditional PixelRNN (the full image is n × n, with n a multiple of s).
- The conditional PixelRNN is a standard PixelRNN whose layers are biased with an upsampled version of the s × s image.
- For upsampling, a convolutional network with deconvolution layers constructs an enlarged feature map of size c × n × n.
- For biasing, the c × n × n map is mapped to a 4h × n × n map (using an unmasked 1 × 1 convolution) and added to the input-to-state map of the corresponding layer (see the sketch below).
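A minimal sketch of this conditioning path, assuming an s × s input with c channels, an upsampling factor of n / s, and hidden size h. The paper uses a deconvolutional network; here it is compressed into a single transposed convolution, and all names are ours.

```python
import torch.nn as nn

class ConditioningBias(nn.Module):
    """Enlarge the s x s image to a c x n x n map with a transposed
    convolution, then map it to 4h x n x n with an unmasked 1x1
    convolution so it can be added to a layer's input-to-state map."""
    def __init__(self, c, h, scale):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(c, c, kernel_size=scale, stride=scale)
        self.to_bias = nn.Conv2d(c, 4 * h, kernel_size=1)

    def forward(self, small):            # small: (B, c, s, s)
        enlarged = self.upsample(small)  # (B, c, n, n) with n = s * scale
        return self.to_bias(enlarged)    # (B, 4h, n, n)
```

The bias has 4h channels because the input-to-state map feeds the four LSTM gates, each of size h.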
- Pixel values are dequantized by adding real-valued noise, and the log-likelihoods of the continuous and discrete models are compared.
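The dequantisation step amounts to something like the following sketch (the usual recipe of adding uniform noise to the integer pixel values; the exact noise and scaling conventions are an assumption here):

```python
import numpy as np

def dequantize(x):
    """Turn integer pixel values in {0, ..., 255} into real values by
    adding uniform noise, so a continuous density can assign them a
    log-likelihood comparable to the discrete model's."""
    return x + np.random.uniform(size=x.shape)
```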
- Update rule - RMSProp
- Batch size - 16 for MNIST and CIFAR-10, 32 (or 64) for ImageNet.
- Residual connections are as effective as skip connections; in fact, the two can be used together as well.
- PixelRNN outperforms other models on binary MNIST and CIFAR-10.
- For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the ordering of the three architectures by receptive-field size, which underlines the importance of a large receptive field.
- The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.