The International Arab Journal of Information Technology (IAJIT)



Encoding Gene Expression Using Deep Autoencoders for Expression Inference

Raju Bhukya
Gene expression of an organism contains all the information that characterises its observable traits. Researchers have invested abundant time and money to quantitatively measure such expressions in laboratories. Because these techniques are too expensive for wide use, the correlation between the expressions of certain genes was exploited to develop statistical solutions. Pioneered by the National Institutes of Health Library of Integrated Network-Based Cellular Signatures (NIH LINCS) program, expression inference techniques have seen many improvements over the years. The Deep Learning for Gene Expression (D-GEX) project at the University of California, Irvine approached the problem from a machine learning perspective, leading to a multi-layer feedforward neural network that infers target gene expressions from clinically measured landmark expressions. Still, the huge number of genes to be inferred from a limited set of known expressions vexed the researchers. Ignoring possible correlations between target genes, they partitioned the target genes randomly and built separate networks to infer their expressions. This paper proposes that the dimensionality of the target set can be virtually reduced using deep autoencoders: feedforward networks are trained to predict the coded representation of the target expressions, which the decoder then expands to full expression profiles. Despite the reconstruction error introduced by the autoencoder, the overall prediction error on the microarray-based Gene Expression Omnibus (GEO) dataset was reduced by 6.6% compared to D-GEX. An improvement of 16.64% was obtained on cross-platform normalized data obtained by combining the GEO dataset and the RNA-Seq-based 1000G dataset.
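The pipeline the abstract describes can be summarised in a short sketch: a deep autoencoder learns a compact code for the high-dimensional target expressions, and a feedforward regressor maps landmark expressions to that code, which the decoder then expands back to full target profiles. The following is a minimal illustration in Keras/TensorFlow, not the paper's exact configuration; the layer widths, the 3000-dimensional code, and the 943/9520 landmark/target split (taken from the D-GEX GEO setting) are assumptions made for the example.

```python
# Minimal sketch of the encode-predict-decode pipeline. Dimensions and
# layer sizes are illustrative assumptions, not the paper's reported setup.
from tensorflow import keras
from tensorflow.keras import layers

n_landmark, n_target, code_dim = 943, 9520, 3000  # illustrative sizes

# 1) Deep autoencoder: compress target expressions into a code vector.
enc_in = keras.Input(shape=(n_target,))
h = layers.Dense(6000, activation="relu")(enc_in)
code = layers.Dense(code_dim, activation="relu")(h)
h = layers.Dense(6000, activation="relu")(code)
dec_out = layers.Dense(n_target, activation="linear")(h)
autoencoder = keras.Model(enc_in, dec_out)
encoder = keras.Model(enc_in, code)
autoencoder.compile(optimizer="adam", loss="mse")

# 2) Feedforward regressor: landmark expressions -> code vector.
reg = keras.Sequential([
    keras.Input(shape=(n_landmark,)),
    layers.Dense(3000, activation="relu"),
    layers.Dense(code_dim, activation="linear"),
])
reg.compile(optimizer="adam", loss="mse")

# Training (X: landmark profiles, Y: matching target profiles):
#   autoencoder.fit(Y, Y, ...)
#   reg.fit(X, encoder.predict(Y), ...)

# 3) Standalone decoder, reusing the autoencoder's trained decoding layers.
decoder_in = keras.Input(shape=(code_dim,))
d = autoencoder.layers[-2](decoder_in)
decoded = autoencoder.layers[-1](d)
decoder = keras.Model(decoder_in, decoded)
# Inference: y_hat = decoder.predict(reg.predict(X_new))
```

The design point is that the regressor's output dimensionality is the code size rather than the full target set, so a single network can serve all target genes in place of the randomly partitioned sub-networks used in D-GEX.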


[1] Amilpur S. and Bhukya R., “EDeepSSP: Explainable Deep Neural Networks for Exact Splice Sites Prediction,” Journal of Bioinformatics and Computational Biology, vol. 18, no. 4, 2020.

[2] Arel I., Rose D., and Karnowski T., “Deep Machine Learning-A New Frontier in Artificial Intelligence Research,” IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, 2010.

[3] Bansal M., Belcastro V., Ambesi-Impiombato A., and Di Bernardo D., “How to Infer Gene Networks from Expression Profiles,” Molecular Systems Biology, vol. 3, no. 1, pp. 1-10, 2007.

[4] Baldi P., “Autoencoders, Unsupervised Learning, and Deep Architectures,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, Washington, pp. 37-50, 2012.

[5] Baldi P. and Sadowski P., “Understanding Dropout,” Advances in Neural Information Processing Systems, vol. 26, pp. 2814-2822, 2013.

[6] Bengio Y., “Learning Deep Architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.

[7] Bhukya R. and Ashok A., “Gene Expression Prediction Using Deep Neural Networks,” The International Arab Journal of Information Technology, vol. 17, no. 3, pp. 422-431, 2020.

[8] Bhukya R. and Sumit D., “Referential DNA Data Compression Using Hadoop Map Reduce Framework Using Deep Neural Networks,” The International Arab Journal of Information Technology, vol. 17, no. 2, pp. 207-214, 2020.

[9] Caruana R., Lawrence S., and Giles L., “Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping,” in Proceedings of 13th International Conference on Neural Information Processing Systems, Denver, pp. 402-408, 2000.

[10] Chen Y., Li Y., Narayan R., Subramanian A., and Xie X., “Gene Expression Inference with Deep Learning,” Bioinformatics, vol. 32, no. 12, pp. 1832-1839, 2016.

[11] Clevert D., Unterthiner T., and Hochreiter S., “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” arXiv preprint arXiv:1511.07289, pp. 1-14, 2015.

[12] Dasari C. and Bhukya R., “InterSSPP: Investigating Patterns Through Interpretable Deep Neural Networks for Accurate Splice Signal Prediction,” Chemometrics and Intelligent Laboratory Systems, vol. 206, 2020.

[13] De Sousa C., “An Overview on Weight Initialization Methods for Feedforward Neural Networks,” in Proceedings of the International Joint Conference on Neural Networks, Vancouver, pp. 52-59, 2016.

[14] Edgar R., Domrachev M., and Lash A., “Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository,” Nucleic Acids Research, vol. 30, no. 1, pp. 207-210, 2002.

[15] Géron A., Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O’Reilly Media, 2019.

[16] Glorot X. and Bengio Y., “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Sardinia, pp. 249-256, 2010.

[17] Glorot X., Bordes A., and Bengio Y., “Deep Sparse Rectifier Neural Networks,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, FL, pp. 315-323, 2011.

[18] Huang G., “Learning Capability and Storage Capacity of Two-Hidden-Layer Feedforward Networks,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 274-281, 2003.

[19] Ioffe S. and Szegedy C., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, pp. 448-456, 2015.

[20] Lamb J., Crawford E., Peck D., Modell J., Blat I., Wrobel M., Lerner J., Brunet J., Subramanian A., Ross K., Reich M., Hieronymus H., Wei G., Armstrong S., Haggarty S., Clemons P., Wei R., Carr S., Lander E., and Golub T., “The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease,” Science, vol. 313, no. 5795, pp. 1929-1935, 2006.

[21] Le Q., Ranzato M., Monga R., Devin M., Chen K., Corrado G., Dean J., and Ng A., “Building High-Level Features Using Large Scale Unsupervised Learning,” in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, pp. 8595-8598, 2012.

[22] Lecun Y., Bengio Y., and Hinton G., “Deep Learning,” Nature, vol. 521, pp. 436-444, 2015.

[23] Lin C., Jain S., Kim H., and Bar-Joseph Z., “Using Neural Networks for Reducing the Dimensions of Single-Cell RNA-Seq Data,” Nucleic Acids Research, vol. 45, no. 17, pp. 1-11, 2017.

[24] NIH LINCS Program, http://www.lincsproject.org/, Last Visited 2013.

[25] Pierson E. and Yau C., “ZIFA: Dimensionality Reduction for Zero-Inflated Single-Cell Gene Expression Analysis,” Genome Biology, vol. 16, no. 1, 2015.

[26] Rumelhart D., Hinton G., and Williams R., “Learning Representations by Back-Propagating Errors,” Nature, vol. 323, no. 6088, pp. 533-536, 1986.

[27] Senior A., Heigold G., Ranzato M., and Yang K., “An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, pp. 6724-6728, 2013.

[28] Vincent P., Larochelle H., Lajoie I., Bengio Y., and Manzagol P., “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.

Raju Bhukya received his B.Tech in Computer Science and Engineering from Nagarjuna University in 2003, his M.Tech in Computer Science and Engineering from Andhra University in 2005, and his Ph.D. in Computer Science and Engineering from the National Institute of Technology (NIT) Warangal in 2014. He is currently an Assistant Professor in the Department of Computer Science and Engineering at the National Institute of Technology, Warangal, Telangana, India. His research interests include Bioinformatics and Data Mining.