Monday, November 12, 2012

Identifying Cross-category Relations in Gene Ontology and Constructing Genome-specific Term Association Networks

Gene Ontology (GO) has been widely used in biological databases, annotation projects, and computational analyses. Although the three GO categories are structured as independent ontologies, the biological relationships across the categories are not negligible for biological reasoning and knowledge integration. However, the existing cross-category ontology term similarity measures are either developed by utilizing the GO data only or based on manually curated term name similarities, ignoring the fact that GO is evolving quickly and the gene annotations are far from complete. In this paper we introduce a new cross-category similarity measurement called CroGO, which incorporates genome-specific gene co-function network data. The performance study showed that our measurement outperforms the existing algorithms. We also generated genome-specific term association networks for yeast and human. An enrichment-based test showed that our networks are better than those generated by the other measures. The genome-specific term association networks constructed using CroGO provide a platform to enable a more consistent use of GO. In the networks, the frequently occurring MF-centered hubs indicate that a molecular function may be shared by different genes in multiple biological processes, or that a set of genes with the same functions may participate in distinct biological processes. Common subgraphs in multiple organisms also revealed conserved GO term relationships. Software and data are available online at www.msu.edu/~jinchen/CroGO

by Peng J, Chen J*, Wang Y*. BMC Bioinformatics (special issue for selected papers presented at the 11th Asia-Pacific Bioinformatics Conference) (* co-corresponding author) 2012, in press

Knowledge‐based Optimal Timepoint Sampling in High‐Throughput Temporal Experiments

Determining the best sampling rates (which maximize information yield and minimize cost) for time-series high-throughput gene expression experiments is a challenging optimization problem. Although existing approaches provide insight into the design of optimal sampling rates, our ability to utilize existing differential gene expression data to discover optimal timepoints is compelling. We present a new data-integrative model, Optimal Timepoint Selection (OTS), to address the sampling rate problem. Three experiments were run on two different datasets to test the performance of OTS, including iterative-online and top-up sampling approaches. In all of the experiments, OTS outperformed the best existing timepoint selection approaches, suggesting that it can optimize the distribution of a limited number of timepoints, potentially leading to better biological insights about the resulting gene expression patterns. OTS is available at www.msu.edu/~jinchen/OTS

by Rosa BA, Zhang J, Major I, Qin W and Chen J. Bioinformatics, 2012, 28(21):2773-2781

Detection and Decomposition: Treatment-induced Cyclic Gene Expression Disruption in High-throughput Time-series Datasets

Higher organisms possess many genes that cycle under normal conditions, allowing the organism to adapt to expected environmental conditions throughout the course of a day. However, treatment-induced disruption of regular cyclic gene expression patterns presents a significant challenge in novel gene discovery experiments because these disruptions can induce strong differential regulation events for genes that are not involved in an adaptive response to the treatment. To address this cycle disruption problem, we reviewed the state-of-the-art periodic pattern detection algorithms and a pattern decomposition algorithm (PRIISM), a knowledge-based Fourier analysis algorithm designed to distinguish the cyclic patterns from the remaining gene expression patterns, and discussed potential future improvements.

by Jiao Y, Rosa BA, Oh S, Montgomery BL, Qin W, Chen J. J Bioinform Comput Biol. 2012 Dec;10(6):1271002

Uncovering Arabidopsis membrane protein interactome enriched in transporters using mating-based split ubiquitin assays and classification models

High-throughput data are a double-edged sword; the benefit of a large amount of data comes with an associated cost of noise. To increase the reliability and scalability of high-throughput protein interaction data generation, we tested the efficacy of classification to enrich potential protein-protein interactions. We applied this method to identify interactions among Arabidopsis membrane proteins enriched in transporters. We validated our method with multiple retests. Classification improved the quality of the ensuing interaction network and was effective in reducing the search space and increasing the true positive rate. The final network of 541 interactions among 239 proteins (of which 179 are transporters) is the first protein interaction network enriched in membrane transporters reported for any organism. This network has similar topological attributes to other published protein interaction networks. It also extends and fills gaps in currently available biological networks in plants and allows building a number of hypotheses about processes and mechanisms involving signal-transduction and transport systems.

by Chen J, Lalonde S, Obrdlik P, Noorani Vatani A, Parsa SA, Vilarino C, Revuelta JL, Frommer WB, Rhee SY. Front Plant Sci. 2012;3:124

Draft genome sequence of Rubrivivax gelatinosus CBS


Rubrivivax gelatinosus CBS, a purple nonsulfur photosynthetic bacterium, can grow photosynthetically using CO and N2 as the sole carbon and nitrogen nutrients, respectively. R. gelatinosus CBS is of particular interest due to its ability to metabolize CO and yield H2. We present the 5-Mb draft genome sequence of R. gelatinosus CBS with the goal of providing genetic insight into the metabolic properties of this bacterium.

by Hu P, Lang J, Wawrousek K, Yu J, Maness PC, Chen J. J Bacteriol. 2012

Wireless Spectrum Occupancy Prediction with Partial Periodic Pattern Mining


Cognitive radio appears as a promising technology to allocate wireless spectrum between licensed and unlicensed users in an efficient way. The availability of spectrum holes strongly affects the throughput and delay of unlicensed users. Predictive methods for inferring the availability of spectrum holes can help to improve the spectrum extraction rate and reduce the collision rate. In this paper, a spectrum occupancy prediction model based on Partial Periodic Pattern Mining (PPPM) is introduced. The mining aims to identify frequent spectrum occupancy patterns that are hidden in the spectrum usage of a channel. The mined frequent patterns are then used to predict future channel states (i.e., busy or idle). Based on the prediction, unlicensed users will be able to make use of spectrum holes efficiently without introducing significant interference to licensed users. PPPM outperforms traditional Frequent Pattern Mining (FPM) by considering real patterns that do not repeat perfectly due to noise, sensing errors, and irregular behaviors. Using real-life network activities, we show a significant reduction in the miss rate of channel state prediction. With the proposed prediction mechanism, the performance of Dynamic Spectrum Access (DSA) is substantially improved.
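
The idea of a partial periodic pattern can be illustrated with a minimal Python sketch. It is a simplification for illustration only, not the PPPM algorithm from the paper: it assumes the period is already known, it keeps a state at a slot offset only when that state occurs there with at least a chosen confidence, and it marks every other offset with a wildcard "*". The function names, the `min_conf` threshold and the "B"/"I" (busy/idle) encoding are all hypothetical.

```python
from collections import Counter


def mine_partial_pattern(states, period, min_conf=0.8):
    """Return one symbol per offset in the period, or '*' where no state dominates.

    Assumes len(states) >= period so that every offset has at least one sample.
    """
    pattern = []
    for offset in range(period):
        column = states[offset::period]  # every observation at this offset
        state, count = Counter(column).most_common(1)[0]
        pattern.append(state if count / len(column) >= min_conf else "*")
    return pattern


def predict_next(states, pattern, fallback=None):
    """Predict the state of the next slot; return fallback at wildcard offsets."""
    symbol = pattern[len(states) % len(pattern)]
    return fallback if symbol == "*" else symbol


if __name__ == "__main__":
    history = "BBIB" "BBII" "BBIB" "BBIB" "BBII"  # B = busy, I = idle
    pattern = mine_partial_pattern(history, period=4)
    print(pattern, predict_next(history, pattern))
```

The wildcard is what makes the pattern partial: offsets where the channel behaves irregularly do not prevent the regular offsets from being used for prediction.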

by Huang P, Liu CJ, Xiao L, Chen J. MASCOTS’12

Sunday, May 06, 2012

Inferring the Regulatory Interaction Models of Transcription Factors in Transcriptional Regulatory Networks

Living cells are realized by complex gene expression programs that are moderated by regulatory proteins called transcription factors (TFs). The TFs control the differential expression of target genes in the context of transcriptional regulatory networks (TRNs), either individually or in groups. Deciphering how the TFs control the differential expression of a target gene in a TRN is challenging, especially when multiple TFs collaboratively participate in the transcriptional regulation. To unravel the roles of the TFs in the regulatory networks, we model the underlying regulatory interactions in terms of the TF-target interactions' directions (activation or repression) and their corresponding logical roles (necessary and/or sufficient). We design a set of constraints that relate gene expression patterns to regulatory interaction models, and develop TRIM (Transcriptional Regulatory Interaction Model Inference), a new hidden Markov model, to infer the models of TF-target interactions in large-scale TRNs of complex organisms. In addition, by training TRIM with wild-type time-series gene expression data, the activation timepoints of each regulatory module can be obtained. To demonstrate the advantages of TRIM, we applied it to the yeast TRN to infer the TF-target interaction models for individual TFs as well as pairs of TFs in collaborative regulatory modules. By comparing with TF knockout and other gene expression data, we were able to show that the performance of TRIM is clearly higher than that of DREM (the best existing algorithm). In addition, on an individual Arabidopsis binding network, we showed that the target genes' expression correlations can be significantly improved by incorporating the TF-target regulatory interaction models inferred by TRIM into the expression data analysis, which may introduce new knowledge in transcriptional dynamics and bioactivation.

by Sherine Awad, Nicholas Panchy, See-Kiong Ng, Jin Chen. Journal of Bioinformatics and Computational Biology. 2012. In Press

Frequency-based time-series gene expression recomposition using PRIISM

Circadian rhythm pathways influence the expression patterns of as much as 31% of the Arabidopsis genome through complicated interaction pathways, and have been found to be significantly disrupted by biotic and abiotic stress treatments, complicating treatment-response gene discovery methods due to clock pattern mismatches in the fold change-based statistics. The PRIISM (Pattern Recomposition for the Isolation of Independent Signals in Microarray data) algorithm outlined in this paper is designed to separate pattern changes induced by different forces, including treatment-response pathways and circadian clock rhythm disruptions. Using the Fourier transform, high-resolution time-series microarray data is projected to the frequency domain. By identifying the clock frequency range from the core circadian clock genes, we separate the frequency spectrum into different sections containing treatment-frequency (representing up- or down-regulation by an adaptive treatment response), clock-frequency (representing the circadian clock-disruption response) and noise-frequency components. Then, we project the components’ spectra back to the expression domain to reconstruct isolated, independent gene expression patterns representing the effects of the different influences. By applying PRIISM to a high-resolution time-series Arabidopsis microarray dataset under a cold treatment, we systematically evaluated our method using maximum fold change and principal component analyses. The results of this study showed that the ranked treatment frequency fold change results produce fewer false positives than the original methodology, and the 26-hour timepoint in our dataset was the best statistic for distinguishing the most known cold-response genes. In addition, six novel cold-response genes were discovered. PRIISM also provides gene expression data which represents only circadian clock influences, and may be useful for circadian clock studies.
PRIISM is a novel approach for overcoming the problem of circadian disruptions from stress treatments on plants. PRIISM can be integrated with any existing analysis approach on gene expression data to separate circadian-influenced changes in gene expression, and it can be extended to apply to any organism with regular oscillations in gene expression patterns across a large portion of the genome.
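
The frequency-domain separation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the PRIISM implementation: it uses a naive discrete Fourier transform from the standard library, and the inputs `clock_bins` and `noise_from` are hypothetical stand-ins for the clock frequency range that the paper derives from core circadian clock genes.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)); adequate for short series."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def decompose(expression, clock_bins, noise_from):
    """Split one gene's time series into treatment, clock and noise components.

    clock_bins: frequency indices assumed to carry the circadian signal.
    noise_from: frequency indices at or above this value are treated as noise.
    Every remaining index (including the mean, index 0) counts as treatment.
    """
    n = len(expression)
    spectrum = dft(expression)

    def band_of(k):
        f = min(k, n - k)  # fold the mirrored (negative) frequencies
        if f in clock_bins:
            return "clock"
        if f >= noise_from:
            return "noise"
        return "treatment"

    parts = {}
    for name in ("treatment", "clock", "noise"):
        masked = [c if band_of(k) == name else 0 for k, c in enumerate(spectrum)]
        parts[name] = inverse_dft(masked)  # project the band back to expression
    return parts


if __name__ == "__main__":
    # 48 hourly samples: index 2 corresponds to a 24-hour cycle.
    n = 48
    series = [5 + 2 * math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
    parts = decompose(series, clock_bins={2}, noise_from=6)
    print(round(parts["treatment"][0], 3), round(parts["clock"][0], 3))
```

Because the transform is linear, the three reconstructed components always sum back to the original series, so no signal is lost by the separation; only its attribution to a band is a modeling choice.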

by Bruce A. Rosa, Yuhua Jiao, Sookyung Oh, Beronda L. Montgomery, Wensheng Qin, Jin Chen. BMC Systems Biology. 2012. In Press

Draft genome sequence of Rubrivivax gelatinosus CBS

Rubrivivax gelatinosus CBS, a purple nonsulfur photosynthetic bacterium, can grow photosynthetically using CO and N2 as the sole carbon and nitrogen nutrients, respectively. R. gelatinosus CBS is of particular interest due to its ability to metabolize CO and yield H2. We present the 5-Mb draft genome sequence of R. gelatinosus CBS with the goal of providing genetic insight into the metabolic properties of this bacterium.

by Pingsha Hu, Juan Lang, Karen Wawrousek, Jianping Yu, Pin-Ching Maness, Jin Chen. Journal of Bacteriology. 2012. In Press

Wireless Spectrum Occupancy Prediction Based on Partial Periodic Pattern Mining

Cognitive radio appears as a promising technology to allocate wireless spectrum between licensed and unlicensed users in an efficient way. The availability of spectrum holes strongly affects the throughput and delay of unlicensed users. Predictive methods for inferring the availability of spectrum holes can help to improve channel utilization and reduce the collision rate. In this paper, a spectrum occupancy prediction method based on Partial Periodic Pattern Mining (PPPM) is introduced. The mining aims to identify frequent occupancy patterns that are hidden in the spectrum usage of a channel, and the mined frequent patterns are then used to predict future channel states. By further extending our three-state PPPM to N-state PPPM, the duration of high/low utilization on a channel is also predicted. The frequent patterns of channel utilization duration are critical in optimizing channel switching strategies. PPPM outperforms traditional Frequent Pattern Mining (FPM) by considering patterns that may not repeat perfectly due to noise, sensing errors, and irregular behaviors. Using real-life network activities, we show a significant reduction in the miss rate. In addition, we observed that distinguishing low utilization periods from high utilization periods and mining rules in the corresponding utilization periods significantly improves the prediction performance. With the prediction mechanism, we show that the performance of dynamic spectrum access is substantially improved. The high accuracy of duration prediction is also validated with data collected in the paging bands.

by Pei Huang, Chin-Jung Liu, Li Xiao, Jin Chen, Proceedings of IEEE / ACM 20th Intl Workshop on Quality of Service (IWQoS), Coimbra, Portugal, Jun. 2012

A membrane protein/signaling protein interaction network for Arabidopsis version AMPv2

Interactions between membrane proteins and the soluble fraction are essential for signal transduction and for regulating nutrient transport. To gain insights into the membrane-based interactome, 3,852 open reading frames (ORFs) out of a target list of 8,383 representing membrane and signaling proteins from Arabidopsis thaliana were cloned into a Gateway-compatible vector. The mating-based split ubiquitin system was used to screen for potential protein–protein interactions (pPPIs) among 490 Arabidopsis ORFs. A binary robotic screen between 142 receptor-like kinases (RLKs), 72 transporters, 57 soluble protein kinases and phosphatases, 40 glycosyltransferases, 95 proteins of various functions, and 89 proteins with unknown function detected 387 out of 90,370 possible PPIs. A secondary screen confirmed 343 (of 386) pPPIs between 179 proteins, yielding a scale-free network (r2 = 0.863). Eighty of 142 transmembrane RLKs tested positive, identifying 3 homomers, 63 heteromers, and 80 pPPIs with other proteins. Thirty-one out of 142 RLK interactors (including RLKs) had previously been found to be phosphorylated; thus interactors may be substrates for respective RLKs. None of the pPPIs described here had been reported in the major interactome databases, including potential interactors of G-protein-coupled receptors, phospholipase C, and AMT ammonium transporters. Two RLKs found as putative interactors of AMT1;1 were independently confirmed using a split luciferase assay in Arabidopsis protoplasts. These RLKs may be involved in ammonium-dependent phosphorylation of the C-terminus and regulation of ammonium uptake activity. The robotic screening method established here will enable a systematic analysis of membrane protein interactions in fungi, plants and metazoa.

by Lalonde S, Sero A, Pratelli R, Pilot G, Chen J, Sardi M, Parsa S, Kim DY, Acharya B, Stein E, Hu HC, Villiers F, Takeda K, Yang Y, Han Y, Schwacke R, Chiang W, Kato N, Loqué D, Assmann S, Kwak J, Schroeder J, Rhee S, Frommer W., Frontiers Plant Physiol, Vol.1 Num 24, pp. 1-14, 2010

Thursday, June 17, 2010

Computing gene expression data with a knowledge-based gene clustering approach

Computational analysis methods for gene expression data gathered in microarray experiments can be used to identify the functions of previously unstudied genes. While obtaining the expression data is not a difficult task, interpreting and extracting the information from the datasets is challenging. In this study, a knowledge-based approach, which identifies and saves important functional genes before filtering based on variability and fold change differences, was used to study light regulation. Two clustering methods were used to cluster the filtered datasets, and clusters containing a key light regulatory gene were located. The genes common to both of these clusters were identified, and the genes in the common cluster were ranked based on their coexpression with the key gene. This process was repeated for 11 key genes in 3 treatment combinations. The initial filtering method reduced the dataset size from 22,814 probes to an average of 1,134 genes, and the resulting common cluster lists contained an average of only 14 genes. These common cluster lists achieved higher gene enrichment scores than the two individual clustering methods. In addition, the filtering method increased the proportion of light-responsive genes in the dataset from 1.8% to 15.2%, and the cluster lists increased this proportion to 18.4%. The relatively short length of these common cluster lists compared to gene groups generated through typical clustering methods or co-expression networks narrows the search for novel functional genes while increasing the likelihood that they are biologically relevant.
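
The common-cluster step can be sketched as follows in Python. This is a hedged illustration rather than the authors' pipeline: it assumes each clustering method returns a list of gene sets, and it uses Pearson correlation as the coexpression measure, which the abstract does not specify. The function names and the toy gene identifiers in the example are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def common_cluster(key_gene, clusters_a, clusters_b, expression):
    """Genes that share a cluster with key_gene under both clustering methods,
    ranked from most to least correlated with the key gene."""
    in_a = next(c for c in clusters_a if key_gene in c)
    in_b = next(c for c in clusters_b if key_gene in c)
    shared = (set(in_a) & set(in_b)) - {key_gene}
    return sorted(
        shared,
        key=lambda g: pearson(expression[g], expression[key_gene]),
        reverse=True,
    )


if __name__ == "__main__":
    expression = {
        "KEY": [1, 2, 3, 4],
        "g1": [2, 4, 6, 8.5],
        "g2": [1, 3, 2, 4],
        "g3": [4, 3, 2, 1],
    }
    method_a = [{"KEY", "g1", "g2", "g3"}]
    method_b = [{"KEY", "g1", "g2"}, {"g3"}]
    print(common_cluster("KEY", method_a, method_b, expression))
```

Intersecting the two clusters is what keeps the candidate list short: a gene must be grouped with the key gene by both methods before it is ranked at all.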

by Bruce A. Rosa, Sookyung Oh, Beronda L. Montgomery, Jin Chen*, Wensheng Qin* (* co-corresponding authors), Int J Biochem Mol Biol 2010;1(1):51-68

Wednesday, September 24, 2008

Exploiting Domain Knowledge to Improve Biological Significance of Bi-clusters with Key Missing Genes

In an era of increasingly complex biological datasets, one of the key steps in gene functional analysis comes from clustering genes based on co-expression. Biclustering algorithms can identify gene clusters with local co-expressed patterns, which are more likely to define genes functioning together than global clustering methods. However, these algorithms are not effective in uncovering gene regulatory networks because the mined biclusters lack genes that may be critical in the function but may not be co-expressed with the clustered genes. In this project, we introduce a biclustering method called SKeleton Biclustering (SKB), which builds high quality biclusters from microarray data, creates relationships among the biclustered genes based on Gene Ontology annotations, and identifies genes that are missing in the biclusters. SKB thus defines inter-bicluster and intra-bicluster functional relationships. The delineation of functional relationships and incorporation of such missing genes may help biologists to discover biological processes that are important in a given study and provides clues for how the processes may be functioning together. Experimental results on yeast cell cycles and Arabidopsis cold-response microarray datasets show that, with SKB, the biological significance of the biclusters is considerably improved. 

--by Jin Chen, Liping Ji, Wynne Hsu, Kian-Lee Tan, Seung Rhee, ICDE, 2009

Tuesday, September 25, 2007

Molecular and cellular approaches for the detection of protein-protein interactions: Latest Techniques and Current Limitations

Homo- and heterotypic protein interactions are crucial for all levels of cellular function including architecture, regulation, metabolism, and signaling. Therefore, protein interaction maps represent essential components of post-genomic toolkits needed for understanding biological processes at a systems level. Over the past decade, a wide variety of methods have been developed to detect, analyze and quantify protein interactions, including surface plasmon resonance spectroscopy, NMR, yeast two hybrid screens, peptide tagging combined with mass spectrometry and fluorescence-based technologies. Fluorescence techniques range from colocalization of tags, which may be limited by the optical resolution of the microscope, to FRET-based methods that have molecular resolution and can also report on the dynamics and localization of the interactions within a cell. Proteins interact via highly evolved complementary surfaces with affinities that can vary over many orders of magnitude. Some of the techniques described in this review, such as surface plasmon resonance, provide detailed information on physical properties of these interactions, while others, such as two hybrid techniques and mass spectrometry, are amenable to high throughput analysis using robotics. In addition to providing an overview of these methods, this review emphasizes techniques that can be applied to determine interactions involving membrane proteins, including the split ubiquitin system and fluorescence-based technologies for characterizing hits obtained with high throughput approaches. Mass spectrometry-based methods are covered in a review by Thelen et al. In addition, we discuss the use of interaction data to construct interaction networks and as the basis for the exciting possibility of their usage for predicting interaction surfaces.

--by Sylvie Lalonde, Jin Chen, David W. Ehrhardt, Dominique Loqué, Seung Y. Rhee, Wolf B. Frommer, Plant Journal, 2008

Friday, January 19, 2007

Increasing Confidence of Protein-Protein Interactomes

High-throughput experimental methods, such as yeast-two-hybrid and phage display, have fairly high levels of false positives (and false negatives). Thus the list of protein-protein interactions detected by such experiments would need additional wet laboratory validation. It would be useful if the list could be prioritized in some way.

Advances in computational techniques for assessing the reliability of protein-protein interactions detected by such high-throughput methods are reviewed in this paper, with a focus on techniques that rely only on topological information of the protein interaction network derived from such high-throughput experiments. In particular, we discuss indices that are abstract mathematical characterizations of networks of reliable protein-protein interactions, e.g., “interaction generality” (IG), “interaction reliability by alternative pathways” (IRAP), and “functional similarity weighting” (FSWeight). We also present indices that are based on explicit motifs associated with true-positive protein interactions, e.g., “new interaction generality” (IG2) and “meso-scale motifs” (NeMoFinder).

--by Jin Chen, Hon Nian Chua, Wynne Hsu, Mong-Li Lee, See-Kiong Ng, Rintaro Saito, Wing-Kin Sung, Limsoon Wong, GIW keynote, 2006

Friday, October 13, 2006

Labeling network motifs in protein interactomes for protein function prediction

Biological networks such as the protein-protein interaction (PPI) network have been found to contain small recurring subnetworks at significantly higher frequencies than in random networks. Such network motifs are useful for uncovering structural design principles of complex biological networks. However, current network motif finding algorithms model the PPI network as a uni-labeled graph, discovering only unlabeled and thus relatively uninformative network motifs as a result.

Our objective is to exploit the currently available biological information that is associated with the vertices (the proteins) to capture not only the topological shapes of the motifs, but also the biological context in which they occur in the PPI networks for network motif applications. We present a method called LaMoFinder to label network motifs with Gene Ontology terms in a PPI network. We also show how the resulting labeled network motifs can be used to predict unknown protein functions. Experimental results showed that the labeled network motifs extracted are biologically meaningful and can achieve better performance than existing PPI topology-based methods for predicting unknown protein functions.

--by Jin Chen, Wynne Hsu, Mong Li Lee and See-Kiong Ng, ICDE, 2007