TY - GEN
T1 - A sparse latent regression approach for integrative analysis of glycomic and glycotranscriptomic data
AU - Wang, Xuefu
AU - Li, Sujun
AU - Peng, Wenjing
AU - Mechref, Yehia
AU - Tang, Haixu
N1 - Funding Information:
This work is partially supported by NIH (1R01GM112490-01) and the Indiana University Initiative of Precision Health.
Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/8/20
Y1 - 2017/8/20
N2 - Glycomics and glycotranscriptomics have emerged as two key high-throughput approaches to interrogating the glycome within specific cells, tissues or organisms under specific conditions. Because glycotranscriptomic analysis utilizes the same experimental protocol as the whole-transcriptome sequencing (RNA-seq) that is commonly used in genomic research, glycotranscriptomic information can be conveniently extracted in silico for many biological samples from which RNA-seq data have been collected and made publicly available through large-scale projects such as The Cancer Genome Atlas (TCGA) project. However, glycomic data collection is constrained by specialized analytical tools that are less accessible to biological researchers. In this paper, we present a Bayesian sparse latent regression (BSLR) model for predicting quantitative glycan abundances from glycotranscriptomic data. The model is built using matched glycomic and glycotranscriptomic data collected in the same set of samples as training sets, and is then exploited to study the common properties of the training samples and to predict these properties (e.g., the glycan abundances) in similar samples from which only glycotranscriptomic data are available. The BSLR model assumes that the glycomic and the glycotranscriptomic abundances are both modulated by a small number of independent latent variables, and thus can be constructed using only a relatively small number of training samples. When tested on simulated data, our approach achieves satisfactory performance using only 10-20 training samples. We also tested our model on five cancer cell lines, and showed that the BSLR model can accurately predict the glycan abundances from the transcription levels of glycan synthetic genes. 
Furthermore, the predicted glycan abundances can distinguish the metastatic cell line that specifically targets the brain from the remaining breast cancer cell lines as well as a brain cancer cell line, with only slightly lower power than the observed glycan abundances in glycomic experiments, indicating that the BSLR prediction retains the variations of glycan abundances across different groups of samples from their glycotranscriptomic data.
AB - Glycomics and glycotranscriptomics have emerged as two key high-throughput approaches to interrogating the glycome within specific cells, tissues or organisms under specific conditions. Because glycotranscriptomic analysis utilizes the same experimental protocol as the whole-transcriptome sequencing (RNA-seq) that is commonly used in genomic research, glycotranscriptomic information can be conveniently extracted in silico for many biological samples from which RNA-seq data have been collected and made publicly available through large-scale projects such as The Cancer Genome Atlas (TCGA) project. However, glycomic data collection is constrained by specialized analytical tools that are less accessible to biological researchers. In this paper, we present a Bayesian sparse latent regression (BSLR) model for predicting quantitative glycan abundances from glycotranscriptomic data. The model is built using matched glycomic and glycotranscriptomic data collected in the same set of samples as training sets, and is then exploited to study the common properties of the training samples and to predict these properties (e.g., the glycan abundances) in similar samples from which only glycotranscriptomic data are available. The BSLR model assumes that the glycomic and the glycotranscriptomic abundances are both modulated by a small number of independent latent variables, and thus can be constructed using only a relatively small number of training samples. When tested on simulated data, our approach achieves satisfactory performance using only 10-20 training samples. We also tested our model on five cancer cell lines, and showed that the BSLR model can accurately predict the glycan abundances from the transcription levels of glycan synthetic genes. 
Furthermore, the predicted glycan abundances can distinguish the metastatic cell line that specifically targets the brain from the remaining breast cancer cell lines as well as a brain cancer cell line, with only slightly lower power than the observed glycan abundances in glycomic experiments, indicating that the BSLR prediction retains the variations of glycan abundances across different groups of samples from their glycotranscriptomic data.
KW - Bayesian model
KW - Biomarker discovery
KW - Glycomics
KW - MCMC sampling
KW - Sparse latent factor model
KW - Transcriptomics
UR - http://www.scopus.com/inward/record.url?scp=85031321466&partnerID=8YFLogxK
U2 - 10.1145/3107411.3107468
DO - 10.1145/3107411.3107468
M3 - Conference contribution
AN - SCOPUS:85031321466
T3 - ACM-BCB 2017 - Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
SP - 273
EP - 278
BT - ACM-BCB 2017 - Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
Y2 - 20 August 2017 through 23 August 2017
ER -