Possible new idea with tSNE

Dimensionality reduction is at the heart of molecular dynamics simulation analysis, because it reduces the high-dimensional data that the simulations produce. It is also used in building Markov State Models and in defining collective variables (CVs) to drive enhanced sampling simulations. Several dimensionality reduction techniques have been used over the years, such as Principal Component Analysis (PCA), non-linear PCA, tICA (time-lagged independent component analysis), auto-encoders etc. (see here).

PCA and tICA are the most commonly used algorithms for this purpose. Both are linear, however, and may struggle to give a good representation of non-linear data. tSNE (original site here), a non-linear method, can handle such data better (see here).

So instead of using tICA during Markov State Modeling, we can use tSNE for the dimensionality reduction step, followed by a clustering method such as K-means. I ran an initial test of PCA vs tSNE on one of my molecular dynamics trajectories; the results are shown in the figures below. Not surprisingly, tSNE separated the data more clearly than PCA.

pca_combined-1

Figure 1. Dimensionality reduction of my trajectory using PCA.

tsne_combined-1

Figure 2. Dimensionality reduction of my trajectory using tSNE.
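
To make the Markov State Model idea above a bit more concrete: once every frame has a cluster label (for example from K-means on the tSNE map), a transition matrix can be estimated by counting jumps between labels at a chosen lag time. The snippet below is only a minimal sketch in MATLAB. The label sequence, the lag of one frame and all variable names are made up for illustration, and plain row normalisation is the simplest possible estimator (it does not enforce detailed balance).

```matlab
% Hypothetical discrete trajectory: one cluster label per frame, in time order
labels  = [1 1 2 2 2 3 3 1 1 2];
nStates = max(labels);
lag     = 1;                      % lag time in frames (assumed value)

% Pair every frame with the frame 'lag' steps later
from = labels(1:end-lag);
to   = labels(1+lag:end);

% Count matrix: C(i,j) = number of observed i -> j transitions
C = accumarray([from(:) to(:)], 1, [nStates nStates]);

% Row-normalise the counts to get transition probabilities
% (implicit expansion, needs R2016b or newer; assumes every state is left at least once)
T = C ./ sum(C, 2);

disp(C)
disp(T)
```

If the frames come from several concatenated runs, the jumps across the joins are not real transitions and should be left out of the count.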

It is also possible to drive a simulation based on the tSNE-generated clusters. You can use MODE-TASK, a Python program with integrated functionality for PCA, kernel PCA and tSNE. Results for my system are shown below. Keep an eye on the island-like clusters in the tSNE projection.

pca_projection1_2

Figure 3. PCA using MODE-TASK.

kpca_projection1_2.png

Figure 4. Kernel PCA using MODE-TASK.

tsne_projection1_2

Figure 5. tSNE using MODE-TASK.

Please email me if you want to collaborate on this interesting project.

Analysis of MD trajectory using tSNE

Before starting with tSNE, it helps to read what tSNE is and how it compares with PCA. You can read about it here. Several implementations of tSNE are available here. A great introductory video on tSNE can be found here.

The dataset used in this tutorial can be accessed here (the trajectory is named combine_times_ca.dcd and the corresponding GRO file is prot_ca.gro; open them in VMD to cross-check).

% Download MATLAB R2017b

% Add the path of the .dcd file reader for MATLAB (download the package from here)

addpath('/home/sbhakat/matdcd-1.0')

% Give the path of your dcd file. In my case I am using a dcd file named combine_times_ca.dcd and reading atoms 1 to 331.

x=readdcd('/home/sbhakat/Plasmepsin_r1_r2_PCA/Gromacs_plmr2/Combine/combine_times_ca.dcd',1:331);

This will produce the following output

h =

struct with fields:

fid: 3
 endoffile: 42789396
 NSET: 10560
 ISTART: 0
 NSAVC: 1
 NAMNF: 0
 charmm: 1
 charmm_extrablock: 1
 charmm_4dims: 0
 DELTA: 1
 N: 331
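
Before going further it is worth checking what the coordinate matrix looks like. The sketch below uses a tiny made-up matrix instead of the real trajectory, and it assumes that each row is one frame with the coordinates stored as x1 y1 z1 x2 y2 z2 and so on. That layout is my reading of the matdcd reader and is not something the header output above shows, so please verify it on your own data.

```matlab
% Hypothetical stand-in for the matrix returned by readdcd:
% 4 frames, 3 atoms, each row laid out as x1 y1 z1 x2 y2 z2 ...
nFrames = 4;
nAtoms  = 3;
x = reshape(1:nFrames*3*nAtoms, 3*nAtoms, nFrames).';

% Rows are frames, columns are 3 coordinates per atom
[nRows, nCols] = size(x);
fprintf('%d frames, %d atoms\n', nRows, nCols/3);

% Turn the first frame into an nAtoms-by-3 table of [x y z]
frame1 = reshape(x(1,:), 3, []).';
disp(frame1)
```

With the header shown above (NSET 10560, N 331) and the same layout assumption, size(x) on the real data should come out as 10560 by 993.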

% Perform Principal Component Analysis (x(2:end,:) leaves out the first frame)

[pc, score, latent, tsquare] = pca(x(2:end,:));

% Plot the first two principal components

plot(score(:,1),score(:,2),'.')

% Label the plot

xlabel('PC1')
ylabel('PC2')

% A window will pop up with a PCA plot similar to the following

pca_combined
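
The third output of pca (called latent above) holds the variance along each principal component, so it can tell you how much of the motion the PC1/PC2 plot actually captures. Here is a small sketch on random stand-in data. The matrix X, the 90% threshold and the variable names are only assumptions for illustration; on the real trajectory you would use the latent you already have.

```matlab
rng default  % for reproducibility

% Hypothetical stand-in data: 500 "frames" by 30 "coordinates",
% scaled so that the first columns carry most of the variance
X = randn(500, 30) * diag(linspace(5, 0.1, 30));

[~, ~, latent] = pca(X);

% Percentage of the total variance carried by each component
explained    = 100 * latent / sum(latent);
cumExplained = cumsum(explained);

% Smallest number of components reaching 90% (assumed threshold)
nKeep = find(cumExplained >= 90, 1);

fprintf('PC1 + PC2 explain %.1f%% of the variance\n', cumExplained(2));
fprintf('%d components are needed to reach 90%%\n', nKeep);
```

The same numbers give a rough feeling for whether a value such as 50 for 'NumPCAComponents' in tsne keeps most of the variance of your system.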

% Continue the calculation in the same MATLAB session

rng default % for reproducibility

% Perform tSNE analysis with the Barnes-Hut algorithm

Y = tsne(x,'Algorithm','barneshut','NumPCAComponents',50);

% Produce the figure

figure
gscatter(Y(:,1),Y(:,2))
xlabel('tSNE1')
ylabel('tSNE2')

% It will produce something like the following

tsne_combined
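
A natural next step is to put labels on the islands in the tSNE map with K-means clustering. The following is a minimal sketch and not a tested protocol: the matrix Y is filled with three artificial blobs so that the snippet runs on its own, and k = 3 is an assumed value. In the session above you would skip the stand-in data, use the Y returned by tsne, and choose k for your own system.

```matlab
rng default  % for reproducibility

% Hypothetical stand-in for the tsne output Y (frames by 2):
% three well-separated blobs of 100 points each
Y = [randn(100, 2);
     randn(100, 2) + repmat([8  8], 100, 1);
     randn(100, 2) + repmat([8 -8], 100, 1)];

k = 3;  % number of clusters (assumed value)

% 'Replicates' reruns K-means from several starting points and keeps
% the solution with the lowest total within-cluster distance
[idx, C] = kmeans(Y, k, 'Replicates', 5);

% Colour the map by cluster label and mark the centroids
figure
gscatter(Y(:,1), Y(:,2), idx)
hold on
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2)
hold off
xlabel('tSNE1')
ylabel('tSNE2')
```

Keep in mind that island sizes and the distances between islands in a tSNE map are not always meaningful, so I would treat these labels as a starting point and check them against the structures in VMD.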

Reference

The initial part of the tutorial was inspired by this one.

Collaboration on the use of tSNE in molecular dynamics simulations is highly appreciated.