Notes for t-SNE paper

Visualizing Data using t-SNE

Introduction

  • A method to visualize high-dimensional datapoints in a 2- or 3-dimensional space.
  • Data visualization techniques like Chernoff faces and graph-based approaches only provide a representation of the data and leave its interpretation to the observer.
  • Dimensionality reduction techniques fail to retain both the local and the global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
  • The t-SNE approach converts the data into a matrix of pairwise similarities and visualizes this matrix.
  • Based on SNE (Stochastic Neighbor Embedding).
  • Link to paper

SNE

  • Given a set of datapoints x1, x2, ..., xn, the conditional probability p(j|i) is the probability that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centred at xi (with variance σi; the calculation of σi is described later).
  • Similarly, define q(j|i) as the conditional probability corresponding to the low-dimensional representations yi and yj (of xi and xj). The variance of the Gaussian in this case is set to 1/sqrt(2).
  • The argument is that if yi and yj correctly capture the similarity between xi and xj, then p(j|i) and q(j|i) should be equal. So the objective function to be minimized is the sum of Kullback-Leibler (KL) divergences between Pi and Qi, where Pi (Qi) is the conditional probability distribution given xi (yi).
  • Since KL divergence is not symmetric, mismatches are penalized unevenly: representing nearby datapoints by widely separated map points is expensive, while representing distant datapoints by nearby map points is cheap. The objective function therefore focuses on retaining the local structure.
  • The user specifies a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the σi that produces a Pi with the same perplexity as specified by the user (see the sketch after this list).
  • Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added in the early stages to perform simulated annealing.
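
A minimal NumPy sketch of how the conditional probabilities and the perplexity-driven binary search for σi might look. The function name, tolerance, and iteration cap are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def conditional_probs(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute the SNE conditional probabilities p(j|i).

    For each point x_i, sigma_i is found by binary search so that the entropy
    of P_i matches log(perplexity). Minimal sketch, not the paper's code.
    """
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    target_entropy = np.log(perplexity)

    for i in range(n):
        lo, hi = 1e-20, 1e20       # bounds on beta = 1 / (2 * sigma_i^2)
        beta = 1.0
        d = np.delete(sq_dists[i], i)          # distances to all j != i
        for _ in range(max_iter):
            p = np.exp(-d * beta)
            p /= p.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:       # distribution too flat -> larger beta
                lo = beta
                beta = beta * 2 if hi == 1e20 else (beta + hi) / 2
            else:                              # distribution too peaked -> smaller beta
                hi = beta
                beta = (beta + lo) / 2
        P[i, np.arange(n) != i] = p
    return P
```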

t-SNE (t-Distributed SNE)

Symmetric SNE

  • A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
  • Symmetric because the joint probabilities satisfy pij = pji and qij = qji for all i, j (the definitions follow this list).
  • In practice pij is set to (p(j|i) + p(i|j)) / 2n rather than being computed directly from a Gaussian, which ensures every datapoint contributes significantly to the cost function; this makes the method more robust to outliers, and the gradient expression is simpler.
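
For reference, the joint distributions used by symmetric SNE, as defined in the paper, are:

```latex
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},
\qquad
q_{ij} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}
              {\sum_{k \neq l} \exp\!\left(-\lVert y_k - y_l \rVert^2\right)}
```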

Crowding Problem

  • When we model a high-dimensional dataset in 2 (or 3) dimensions, the area available in the map for moderately distant datapoints is not nearly large enough compared with the area available for nearby datapoints, so it is difficult to segregate the two and gaps cannot form between natural clusters.
  • One way around this problem is UNI-SNE, but the optimization of its cost function is difficult.

t-SNE

  • Instead of a Gaussian, use a heavy-tailed distribution (the Student t-distribution with one degree of freedom) to convert distances into probabilities in the low-dimensional space. This way, a moderate distance in the high-dimensional space can be modeled by a larger distance in the low-dimensional space, which alleviates the crowding problem.
  • The Student t-distribution is an infinite mixture of Gaussians, and the density of a point under it can be computed very quickly because it does not involve an exponential.
  • The resulting cost function is easier to optimize; a sketch of the map affinities and the gradient follows this list.
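
A minimal NumPy sketch of the Student-t map affinities and the resulting gradient, dC/dyi = 4 Σj (pij − qij)(yi − yj)(1 + ||yi − yj||²)⁻¹, assuming the symmetrized joint probabilities P from the previous section; the function name and the numerical-stability constants are assumptions:

```python
import numpy as np

def tsne_grad(P, Y):
    """Student-t affinities q_ij (one degree of freedom) and the t-SNE gradient.

    P : (n, n) symmetrized joint probabilities in the high-dimensional space.
    Y : (n, d) current low-dimensional embedding (d is typically 2 or 3).
    """
    diff = Y[:, None, :] - Y[None, :, :]                   # pairwise y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum()                          # joint probabilities in the map
    Q = np.maximum(Q, 1e-12)
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^-1
    grad = 4.0 * np.einsum('ij,ijk->ik', (P - Q) * inv_dist, diff)
    cost = np.sum(P * np.log(np.maximum(P, 1e-12) / Q))    # KL(P || Q)
    return Q, grad, cost
```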

Optimization Tricks

Early Compression

  • At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
  • Implemented by adding an L2 penalty term to the cost function, proportional to the sum of the squared distances of the map points from the origin.

Early Exaggeration

  • In the early stages of optimization, scale up all the pij's (e.g. multiply them by 4). The qij's, which still sum to 1, are then too small to model the exaggerated pij's, so the optimization is pushed to model large pij's with fairly large qij's; the effect is that the natural clusters in the data form tight, widely separated clusters, creating a lot of empty space in the low-dimensional map. A sketch of how both tricks fit into the gradient-descent loop follows.
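
A rough sketch of how early exaggeration and early compression might slot into the gradient-descent loop, reusing the conditional_probs and tsne_grad helpers sketched above. The learning rate, exaggeration factor, penalty weight, and iteration counts are illustrative assumptions rather than the paper's exact settings (the 0.5-then-0.8 momentum schedule follows the paper):

```python
import numpy as np

def fit_tsne(X, n_components=2, n_iter=1000, learning_rate=100.0,
             exaggeration=4.0, exaggeration_iters=100,
             compression=1e-4, compression_iters=100,
             perplexity=30.0, seed=0):
    """Plain gradient-descent t-SNE loop with early exaggeration and
    early compression. Illustrative sketch; hyperparameters are assumptions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Symmetrized joint probabilities: p_ij = (p(j|i) + p(i|j)) / 2n.
    P_cond = conditional_probs(X, perplexity=perplexity)
    P = (P_cond + P_cond.T) / (2.0 * n)

    Y = 1e-4 * rng.standard_normal((n, n_components))   # small random initialization
    velocity = np.zeros_like(Y)

    for it in range(n_iter):
        # Early exaggeration: scale up P for the first iterations.
        P_eff = P * exaggeration if it < exaggeration_iters else P
        _, grad, _ = tsne_grad(P_eff, Y)
        # Early compression: L2 penalty pulling the map points toward the origin.
        if it < compression_iters:
            grad = grad + 2.0 * compression * Y
        momentum = 0.5 if it < 250 else 0.8              # momentum schedule from the paper
        velocity = momentum * velocity - learning_rate * grad
        Y = Y + velocity
    return Y
```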

t-SNE on large datasets

  • The space and time complexity is quadratic in the number of datapoints, so the method is infeasible for large datasets.
  • Select a random subset of the points (called landmark points) to display.
  • Construct a neighborhood graph over all the datapoints and, for each landmark point, define random walks that start at that landmark point and terminate as soon as they land on another landmark point.
  • p(j|i) is defined as the fraction of random walks starting at xi that finish at xj (both landmark points). This way, p(j|i) is not sensitive to "short-circuits" in the graph (caused by noisy datapoints). A Monte-Carlo sketch follows this list.
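
A rough sketch of the random-walk affinities, assuming a simple k-nearest-neighbour graph and explicit Monte-Carlo simulation of the walks for clarity; the graph construction, k, number of walks, and step cap are assumptions, not the paper's exact procedure:

```python
import numpy as np

def random_walk_affinities(X, landmarks, k=10, walks_per_landmark=1000,
                           max_steps=10_000, seed=0):
    """Estimate p(j|i) between landmark points as the fraction of random walks
    on a k-nearest-neighbour graph that start at landmark i and terminate as
    soon as they hit another landmark j. Monte-Carlo sketch only."""
    rng = np.random.default_rng(seed)
    # k-nearest-neighbour graph; each step moves to a uniformly chosen neighbour.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)
    neighbours = np.argsort(sq_dists, axis=1)[:, :k]

    landmark_index = {int(p): idx for idx, p in enumerate(landmarks)}
    counts = np.zeros((len(landmarks), len(landmarks)))

    for i, start in enumerate(landmarks):
        for _ in range(walks_per_landmark):
            node = int(start)
            for _ in range(max_steps):                   # cap to avoid endless walks
                node = int(rng.choice(neighbours[node]))
                if node in landmark_index and node != int(start):
                    counts[i, landmark_index[node]] += 1
                    break
    # Normalize each row so that row i gives p(j|i) over the landmark points.
    return counts / (counts.sum(axis=1, keepdims=True) + 1e-12)
```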

Advantages of t-SNE

  • The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
  • Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
  • The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
  • Random walk version of t-SNE takes care of "short-circuit" problem.

Limitations of t-SNE

  • It is unclear how t-SNE would perform on the general dimensionality reduction task, i.e. reducing to more than 3 dimensions. For such higher-dimensional targets, a Student t-distribution with more degrees of freedom is likely to be more appropriate.
  • t-SNE reduces the dimensionality of the data mainly based on local properties, which means it may fail on data with an intrinsically high-dimensional structure (curse of dimensionality).
  • The cost function of t-SNE is not convex, so several optimization parameters have to be chosen, and the constructed solution depends on these choices (different runs can give different results).