Principal Component Analysis Combined with Truncated-Newton Minimization for
Dimensionality Reduction of Chemical Databases
The similarity and diversity sampling problems are two challenging
optimization tasks that arise in chemical database analyses. As a first
step to their solution, we propose an efficient projection/refinement
protocol based on principal component analysis (PCA) and the
truncated-Newton minimization method implemented by our program package
TNPACK (PCA/TNPACK). We show here that PCA can provide the same initial
guess as the singular value decomposition (SVD) for the distance-geometry
optimization problem if each column of the database matrix has a mean of
zero. Hence, PCA/TNPACK is analogous to the
SVD/TNPACK projection/refinement protocol that we developed recently for
visualizing large chemical databases. Using PCA/TNPACK and the Merck MDDR
database (MDL Drug Data Report), we further investigate the
projection/refinement procedure with regard to the preservation of the
original clusters of chemical compounds, the accuracy of similarity and
diversity sampling of chemical compounds, and the potential application in
the study of structure-activity relationships. We also compare the
accuracy and efficiency of the PCA/TNPACK procedure to that of a global
optimization algorithm (simulated annealing, as implemented by the
program package SIMANN) in producing the projection mapping of the
database. Numerical results show that
the 2D PCA/TNPACK mapping can preserve the distance relationships of the
original database and is thus valuable as a first step in similarity and
diversity applications. All numerical tests were performed on the Merck
MDDR database and thus represent realistic cases encountered in practice
in the field of drug design.
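
The PCA/SVD equivalence stated above can be checked numerically. The following is a minimal NumPy sketch, not the PCA/TNPACK code itself: it assumes a database matrix whose rows are compounds and whose columns are descriptors, uses synthetic random data in place of the MDDR database, and the function names (pca_projection, svd_projection, align_signs) are hypothetical. It covers only the projection step; the TNPACK refinement is not shown.

```python
import numpy as np

def pca_projection(X, k=2):
    """Project rows of X onto the top-k eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)                    # PCA centers every column
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]              # eigenvectors of the k largest
    return Xc @ top

def svd_projection(X, k=2):
    """Project rows of X with a truncated SVD, without any centering."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def align_signs(A, B):
    """Flip columns of A so each has the same orientation as the column of B."""
    return A * np.sign(np.sum(A * B, axis=0))

# Synthetic stand-in for a database matrix (assumption: 50 compounds, 6 descriptors).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)

# Zero-mean columns: both projections agree up to the sign of each axis.
Y_pca = pca_projection(X_centered)
Y_svd = svd_projection(X_centered)
same = np.allclose(align_signs(Y_pca, Y_svd), Y_svd, atol=1e-8)

# Columns with a nonzero mean: the SVD projection no longer matches PCA.
X_shifted = X_centered + 10.0
Y_svd_shifted = svd_projection(X_shifted)
differs = not np.allclose(align_signs(Y_pca, Y_svd_shifted), Y_svd_shifted, atol=1e-8)

print("agree when columns have zero mean:", same)
print("differ when columns are shifted:", differs)
```

The sign alignment is needed because eigenvectors and singular vectors are each determined only up to a factor of -1, so the two 2D maps may be mirror images of each other along an axis.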