- Problem: building an expressive, tractable and scalable image model that can be used in downstream tasks like image generation, reconstruction and compression.
- [Link to the paper](https://arxiv.org/abs/1601.06759)
- Scan the image one row at a time and, within each row, one pixel at a time.
- Given the scanned content, predict the distribution over the possible values for the next pixel.
- The joint distribution over pixel values is factorised into a product of conditional distributions, casting image modelling as a sequence modelling problem (written out below).
- Parameters used in prediction are shared across all the pixel positions.
- Since each pixel is jointly determined by three values (one per colour channel), each channel is conditioned on the channels already generated for that pixel, in addition to all previous pixels.
- The conditional distributions are multinomial (with each channel variable taking one of 256 discrete values).
- This discrete representation is simpler and easier to learn than a continuous one.
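Written out, the factorisation (equations 1 and 2 in the paper) is:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$

$$p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \, p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \, p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})$$

where $\mathbf{x}$ is an $n \times n$ image read in raster-scan order and every factor on the right is a 256-way softmax.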
- Row LSTM: a unidirectional layer that processes the image row by row.
- Uses a one-dimensional convolution (kernel of size k × 1, k ≥ 3).
- Refer to Figure 2 in the paper.
- Weight sharing in the convolution ensures translation invariance of the computed features along each row.
- For the LSTM, the input-to-state component is computed for the entire two-dimensional input map and is then masked to include only the valid context.
- For the state-to-state component, refer to equation 3 in the paper (reproduced below).
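For reference, equation 3 (the Row LSTM step for row $i$) reads:

$$[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] = \sigma(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i)$$

$$\mathbf{c}_i = \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i$$

$$\mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{c}_i)$$

where $\mathbf{x}_i$ is row $i$ of the input map, $\mathbf{K}^{is}$ and $\mathbf{K}^{ss}$ are the input-to-state and state-to-state kernels, $\circledast$ denotes convolution, $\odot$ elementwise multiplication, and $\sigma$ is the sigmoid for the gates $\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i$ and tanh for the content $\mathbf{g}_i$.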
- Diagonal BiLSTM: a bidirectional layer that processes the image in a diagonal fashion.
- The input map is skewed by offsetting each row by one position with respect to the previous row, so that the diagonals of the image become columns (see the sketch after this group).
- Refer to Figure 3 in the paper.
- For both directions, the input-to-state component is a 1 × 1 convolution, while the state-to-state recurrent component is computed with a column-wise convolution with kernel size 2 × 1.
- The 2 × 1 kernel processes a minimal amount of information at each step, yielding a highly non-linear computation overall.
- The output map is skewed back by removing the offset positions.
- To prevent the layer from seeing future pixels, the right output map is shifted down by one row and added to the left output map.
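A minimal NumPy sketch of the skewing trick (function names are ours, not from the paper): shifting each row one position further right than the row above turns the diagonals of the image into columns, so a 2 × 1 column convolution can sweep along them.

```python
import numpy as np

def skew(x):
    """Skew an (H, W) map: shift row i right by i positions, giving an
    (H, H + W - 1) map whose columns are the diagonals of the input."""
    h, w = x.shape
    out = np.zeros((h, h + w - 1), dtype=x.dtype)
    for i in range(h):
        out[i, i:i + w] = x[i]
    return out

def unskew(x, w):
    """Invert skew(): drop the offset (padding) positions."""
    return np.stack([x[i, i:i + w] for i in range(x.shape[0])])

x = np.arange(12).reshape(3, 4)
assert np.array_equal(unskew(skew(x), x.shape[1]), x)
```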
- Residual connections are used to increase convergence speed and to propagate signals more directly through the network (a sketch follows).
- Refer to Figure 4 in the paper.
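A sketch of the residual block described for the PixelCNN variant; to keep the snippet self-contained, a plain 3 × 3 convolution stands in for the masked one used in the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """2h -> 1x1 conv -> h -> 3x3 conv -> h -> 1x1 conv -> 2h, plus an
    identity shortcut; in the paper the 3x3 convolution is masked."""
    def __init__(self, h):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(), nn.Conv2d(2 * h, h, kernel_size=1),
            nn.ReLU(), nn.Conv2d(h, h, kernel_size=3, padding=1),
            nn.ReLU(), nn.Conv2d(h, 2 * h, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)  # shortcut lets signals bypass the block
```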
- Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
- Mask A is applied to the first convolutional layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen.
- Mask B is applied to all subsequent input-to-state convolutional transitions and additionally allows connections from a colour channel to itself (see the sketch below).
- Refer to Figure 4 in the paper.
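A simplified PyTorch sketch of a masked convolution (class name ours). It implements only the spatial part of the masks, as valid for a single-channel image; the full masks in the paper additionally order the R, G, B channels within a pixel.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d that zeroes weights on pixels below the centre row and to the
    right of the centre in that row. Mask 'A' (first layer) also zeroes the
    centre weight; mask 'B' (later layers) keeps it."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # right of (or at) centre
        mask[kh // 2 + 1:, :] = 0                         # all rows below
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply after every weight update
        return super().forward(x)

first = MaskedConv2d('A', 1, 16, kernel_size=7, padding=3)   # mask A
later = MaskedConv2d('B', 16, 16, kernel_size=3, padding=1)  # mask B
```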
- PixelCNN: uses multiple convolutional layers that preserve the spatial resolution.
- Makes the receptive field large but, unlike the RNN variants, not unbounded.
- Masks are used to avoid seeing the future context.
- Faster than PixelRNN at training and evaluation time, since the convolutions can easily be parallelised.
- Multi-Scale PixelRNN: composed of one unconditional PixelRNN and one or more conditional PixelRNNs.
- The unconditional network generates a smaller s × s image, which is fed as input to the conditional PixelRNN (the full image is n × n, with n a multiple of s).
- The conditional PixelRNN is a standard PixelRNN whose layers are biased with an upsampled version of the s × s image.
- For upsampling, a convolutional network with deconvolution layers constructs an enlarged feature map of size c × n × n.
- For biasing, the c × n × n map is mapped to a 4h × n × n map (using an unmasked 1 × 1 convolution) and added to the input-to-state map of the corresponding layer (see the sketch below).
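A minimal sketch of this conditioning path, assuming an s × s input with c channels, an upsampling factor of n / s, and hidden size h. The paper uses a deconvolutional network; here it is compressed into a single transposed convolution, and all names are ours.

```python
import torch.nn as nn

class ConditioningBias(nn.Module):
    """Enlarge the s x s image to a c x n x n map with a transposed
    convolution, then map it to 4h x n x n with an unmasked 1x1
    convolution so it can be added to a layer's input-to-state map."""
    def __init__(self, c, h, scale):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(c, c, kernel_size=scale, stride=scale)
        self.to_bias = nn.Conv2d(c, 4 * h, kernel_size=1)

    def forward(self, small):            # small: (B, c, s, s)
        enlarged = self.upsample(small)  # (B, c, n, n) with n = s * scale
        return self.to_bias(enlarged)    # (B, 4h, n, n)
```

The bias has 4h channels because the input-to-state map feeds the four LSTM gates, each of size h.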
- Pixel values are dequantized by adding real-valued noise, and the log-likelihoods of the continuous and discrete models are compared.
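The dequantisation step amounts to something like the following sketch (the usual recipe of adding uniform noise to the integer pixel values; the exact noise and scaling conventions are an assumption here):

```python
import numpy as np

def dequantize(x):
    """Turn integer pixel values in {0, ..., 255} into real values by
    adding uniform noise, so a continuous density can assign them a
    log-likelihood comparable to the discrete model's."""
    return x + np.random.uniform(size=x.shape)
```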
- Update rule - RMSProp
- Batch size - 16 for MNIST and CIFAR-10, 32 (or 64) for ImageNet.
- Residual connections are as effective as skip connections; in fact, the two can be used together as well.
- PixelRNN outperforms other models on binary MNIST and CIFAR-10.
- For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the ordering of the three architectures by receptive-field size, which underlines the importance of a large receptive field.
- The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.