- The paper explores conditional image generation by adapting and improving the PixelCNN architecture.
- [Link to the paper](https://arxiv.org/abs/1606.05328)
- Models images pixel by pixel by decomposing the joint image distribution as a product of per-pixel conditionals (equation 1 in the paper, reproduced below).
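For an n x n image with pixels x_1, ..., x_{n^2} taken in raster-scan order, the factorization is:

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
```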
- PixelRNN uses two-dimensional LSTM layers while PixelCNN uses masked convolutional layers.
- PixelRNN gives better results but PixelCNN is faster to train.
- PixelRNN outperforms PixelCNN because every LSTM layer has access to the entire available context (a larger receptive field) and because LSTMs contain multiplicative units (gates) that allow modelling more complex interactions.
- To compensate for these, the paper uses deeper networks and gated activation units (equation 2 in the paper, reproduced below), respectively.
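For reference, the gated activation unit of equation 2, where \(\ast\) denotes convolution, \(\odot\) element-wise multiplication, \(\sigma\) the sigmoid, and k indexes the layer:

```latex
\mathbf{y} = \tanh(W_{k,f} \ast \mathbf{x}) \odot \sigma(W_{k,g} \ast \mathbf{x})
```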
- Stacking masked convolutions leads to a blind spot in the receptive field: some pixels above and to the right of the current pixel precede it in raster order but are never seen.
- These can be removed by combining two convolutional network stacks:
  - Horizontal stack - conditions on the pixels to the left in the current row.
  - Vertical stack - conditions on all rows above the current row.
- Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
- Residual connections are used in the horizontal stack but not in the vertical stack (where they did not seem to improve results in initial experiments); a sketch of one such layer follows.
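A minimal PyTorch sketch of one gated layer with the two stacks. This is an illustrative reimplementation, not the authors' code: the class, argument, and variable names are made up, and the padding-and-cropping trick is one common way to realize the causal shifts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPixelCNNLayer(nn.Module):
    """One gated layer combining a vertical stack (sees all rows above the
    current pixel) and a horizontal stack (sees pixels to its left in the
    current row). Sketch only; names and sizes are illustrative."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        k = kernel_size
        # Vertical stack: (k//2 + 1) x k kernel, made causal via padding below.
        self.v_conv = nn.Conv2d(channels, 2 * channels, (k // 2 + 1, k))
        # Horizontal stack: 1 x (k//2 + 1) kernel over the current row.
        self.h_conv = nn.Conv2d(channels, 2 * channels, (1, k // 2 + 1))
        self.v_to_h = nn.Conv2d(2 * channels, 2 * channels, 1)  # stack link
        self.h_res = nn.Conv2d(channels, channels, 1)  # 1x1 before the residual

    def forward(self, v, h):
        kh = self.v_conv.kernel_size[0]
        kw = self.h_conv.kernel_size[1]
        side = self.v_conv.kernel_size[1] // 2

        # Pad the top by the full kernel height and crop, so every output row
        # depends only on rows strictly above it.
        v_feat = self.v_conv(F.pad(v, (side, side, kh, 0)))[:, :, : v.size(2), :]
        # Pad the left so every output column depends only on columns up to and
        # including itself ('B'-type masking; the network's first layer would
        # shift one pixel further to exclude the current pixel, 'A'-type).
        h_feat = self.h_conv(F.pad(h, (kw - 1, 0, 0, 0)))
        # Every horizontal layer also takes the vertical stack's output as input.
        h_feat = h_feat + self.v_to_h(v_feat)

        # Gated activation units (equation 2): split the channels into a tanh
        # half and a sigmoid half and multiply element-wise.
        v_f, v_g = torch.chunk(v_feat, 2, dim=1)
        h_f, h_g = torch.chunk(h_feat, 2, dim=1)
        v_out = torch.tanh(v_f) * torch.sigmoid(v_g)
        h_gated = torch.tanh(h_f) * torch.sigmoid(h_g)

        # Residual connection on the horizontal stack only.
        return v_out, h + self.h_res(h_gated)
```

Stacking several such layers, with the image first passed through an 'A'-masked convolution to produce the initial v and h feature maps, gives the full network.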
- The model can also capture the conditional distribution of images given a high-level description of the image, represented as a latent vector h (equation 4 in the paper, reproduced below).
- This conditioning does not depend on the location of the pixel in the image.
- To make the conditioning depend on location as well, map h to a spatial representation s = m(h) (equation 5 in the paper).
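For reference, equation 4 adds h to both gates through linear projections V, and equation 5 replaces the V^T h terms with unmasked 1x1 convolutions over the spatial map s:

```latex
\mathbf{y} = \tanh(W_{k,f} \ast \mathbf{x} + V_{k,f}^{T}\mathbf{h}) \odot \sigma(W_{k,g} \ast \mathbf{x} + V_{k,g}^{T}\mathbf{h}) \tag{4}
\mathbf{y} = \tanh(W_{k,f} \ast \mathbf{x} + V_{k,f} \ast \mathbf{s}) \odot \sigma(W_{k,g} \ast \mathbf{x} + V_{k,g} \ast \mathbf{s}), \quad \mathbf{s} = m(\mathbf{h}) \tag{5}
```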
- PixelCNN auto-encoder: start with a traditional auto-encoder architecture, replace the deconvolutional decoder with a conditional PixelCNN, and train the network end-to-end; a minimal sketch follows.
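A minimal, runnable PyTorch sketch of the idea under simplifying assumptions: single-channel images, a toy convolutional encoder, and a single 'A'-masked convolution standing in for the full conditional PixelCNN decoder (all names and layer sizes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConvA(nn.Conv2d):
    """'A'-type masked convolution: each output position sees only pixels
    above it, plus pixels to its left in the same row."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__(in_ch, out_ch, k, padding=k // 2)
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2:] = 0  # current pixel and everything right of it
        mask[:, :, k // 2 + 1:, :] = 0   # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, padding=self.padding)

class PixelCNNAutoEncoder(nn.Module):
    """Encoder -> latent h; the decoder is a (heavily simplified) conditional
    PixelCNN predicting 256-way logits per pixel from the masked context
    plus h. Replaces the usual deconvolutional decoder."""
    def __init__(self, latent_dim=64, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.masked = MaskedConvA(1, hidden, 7)
        self.cond = nn.Linear(latent_dim, hidden)  # h enters as a per-channel bias
        self.logits = nn.Conv2d(hidden, 256, 1)    # 256 intensity levels per pixel

    def forward(self, x):
        h = self.encoder(x)                        # high-level description of x
        ctx = self.masked(x) + self.cond(h)[:, :, None, None]
        return self.logits(F.relu(ctx))            # (B, 256, H, W) logits

# End-to-end training with teacher forcing: predict each pixel of x from its
# masked context and the latent code h.
model = PixelCNNAutoEncoder()
x = torch.rand(8, 1, 32, 32)
loss = F.cross_entropy(model(x), (x[:, 0] * 255).long())
```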
- For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well, while taking much less time to train.
- When conditioning on ImageNet class labels, the log-likelihood did not improve much, but the visual quality of the generated samples improved significantly.
- The paper also includes sample images generated by conditioning on embeddings of human portraits and by a PixelCNN auto-encoder trained on ImageNet patches.