Continuous Distributed Representation of Biological Sequences and Their Applications in Bioinformatics
Halls department, Hall 2
Wednesday, 28 December 2016
11:30 - 12:30
Biophysical and biochemical principles govern biological sequences (e.g., DNA, RNA, and protein sequences) much as the grammar of a natural language determines the structure of clauses and sentences. This analogy motivates 'life language processing', i.e., treating biological sequences as the output of a certain language and adopting or developing language processing methods to perform analyses and predictions in that language. We propose two specific aims for life language processing: (1) Developing computational linguistics representation learning for biological sequences: the large gap between the number of known sequences (raw data) and the number of known functions/structures associated with these sequences (meta-data) encourages us to develop methods that can obtain prior knowledge from the existing sequences. Continuous vector representations of words, known as word vectors, have recently become popular in natural language processing (NLP) as an efficient unsupervised approach to representing semantic/syntactic units of text, which helps in downstream NLP tasks (e.g., machine translation, part-of-speech tagging, and information retrieval). In this work, we propose distributed vector representations of biological sequence segments (n-grams), called bio-vectors, trained with a skip-gram neural network. We propose an intrinsic evaluation of bio-vectors that measures the continuity of the underlying biophysical and biochemical properties (e.g., average mass, hydrophobicity, and charge). In addition to the intrinsic evaluation, as an extrinsic evaluation we employed this representation in the classification of 324,018 protein sequences belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% was obtained.
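As a rough illustration of how a sequence can be turned into skip-gram training data, the sketch below splits a sequence into n-grams and pairs each n-gram with its neighbours. The choice of n = 3, the splitting into n shifted lists of non-overlapping n-grams, the window size, the function names, and the toy sequence are all assumptions made for this example; the abstract does not specify them.

```python
def shifted_ngrams(sequence, n=3):
    """Split a sequence into n lists of non-overlapping n-grams.

    Each list starts at a different offset (0 .. n-1), so every
    reading frame of the sequence is covered. Assumed scheme.
    """
    return [
        [sequence[i:i + n] for i in range(offset, len(sequence) - n + 1, n)]
        for offset in range(n)
    ]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within +/- window positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    toy_protein = "MKVLAAGIV"  # hypothetical toy sequence
    for frame in shifted_ngrams(toy_protein, n=3):
        print(frame, skipgram_pairs(frame, window=1))
```

Pairs of this kind are what a skip-gram model consumes: it learns a vector for each n-gram by predicting the context n-gram from the center n-gram.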
In addition, using bio-vector representations instead of one-hot vector features in a Max-margin Markov Network (M3Net) improved sequence labeling accuracy from 73.84% to 74.99% for intron-exon prediction and from 82.4% to 89.8% for domain identification. (2) Performing computational linguistics comparison of genomic language variations: the purpose of this aim is to quantify the distance between the syntactic and semantic features of two genomic language variations, with various applications in comparative genomics. The training model of bio-vectors is analogous to neural probabilistic language modeling. Hence, such representations can characterize sequences in terms of their underlying biochemical and biophysical patterns. This makes the network of n-grams in this space an indirect representation of the underlying language model. Based on this observation, we propose a new quantitative measure of distance between genomic language variations, called word embedding language divergence, which is based on the divergence between the networks of n-grams in different genetic variations. We perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and 2 human subjects). Our results confirm a significant high-level difference between the genetic language model of humans/animals and that of plants. The proposed method is a step toward defining a new quantitative measure of similarity between genomic languages, with applications in the characterization and classification of sequences of interest.
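One possible reading of a divergence between networks of n-grams is sketched below. For each genomic language, the pairwise similarities between the embeddings of a shared n-gram vocabulary are normalized into a probability distribution, and the two distributions are then compared. The use of cosine similarity, the shift to non-negative weights, the Jensen-Shannon divergence, and the toy 2-dimensional embeddings are assumptions of this example and are not taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_distribution(embeddings, vocab):
    """Normalize pairwise n-gram similarities into a distribution."""
    weights = []
    for i, a in enumerate(vocab):
        for b in vocab[i + 1:]:
            # Shift cosine from [-1, 1] to [0, 2]; clamp rounding noise.
            weights.append(max(0.0, cosine(embeddings[a], embeddings[b]) + 1.0))
    total = sum(weights)
    return [w / total for w in weights]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, at most 1)."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical embeddings of the same three n-grams in two "languages".
lang_a = {"ATG": [1.0, 0.0], "GCC": [0.0, 1.0], "TAA": [1.0, 1.0]}
lang_b = {"ATG": [1.0, 0.0], "GCC": [1.0, 0.1], "TAA": [-1.0, 0.5]}
vocab = sorted(set(lang_a) & set(lang_b))

p = similarity_distribution(lang_a, vocab)
q = similarity_distribution(lang_b, vocab)
print(js_divergence(p, q))
```

Comparing similarity structure rather than raw vectors matters because embeddings trained separately for each language do not share a coordinate system, whereas pairwise relations between the same n-grams remain comparable.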
Ehsaneddin Asgari is a PhD candidate at the University of California, Berkeley. His PhD research explores the development of deep learning methods, specifically deep language processing methods, for performing analysis and prediction on genomic and metagenomic sequences. His research interests are in the areas of Bioinformatics and Natural Language Processing. He is a former researcher at the Deep Language Processing group at LMU, the Genesis group at the MIT Computer Science and Artificial Intelligence Laboratory, the Neuroscience Statistics Lab at the MIT Brain and Cognitive Science Department, ABB Corporate Research, the Audiovisual Communications Lab at EPFL, ADSC of UIUC, and Digital Media Lab. He received his M.Sc. degrees from UC Berkeley and EPFL and his B.Sc. from Sharif University of Technology in Computer Engineering.