SESSION: Regular Papers
PReach: Reachability in Probabilistic Signaling Networks
Haitham Gabr,
Andrei Todor,
Helia Zandi,
Alin Dobra,
Tamer Kahveci
Pages: 3
DOI: 10.1145/2506583.2506586
Extracellular molecules trigger a response inside the cell by initiating a signal at special membrane receptors (i.e., sources), which is then transmitted to reporters (i.e., targets) through various chains of interactions among proteins. Understanding whether such a signal can travel from membrane receptors to reporters is essential in studying the cell's response to extracellular events. This problem is drastically complicated by the unreliability of the interaction data. In this paper, we develop a novel method, called PReach (Probabilistic Reachability), that precisely computes the probability that a signal can reach from a given collection of receptors to a given collection of reporters when the underlying signaling network is uncertain. This is a very difficult computational problem with no known polynomial-time solution. PReach represents each uncertain interaction as a bivariate polynomial and transforms the reachability problem into a polynomial multiplication problem. We introduce novel polynomial collapsing operators that associate polynomial terms with possible paths between sources and targets, as well as with the cuts that separate sources from targets. These operators significantly shrink the number of polynomial terms and thus the running time. PReach has much better time complexity than recent solutions for this problem. Our experimental results on real datasets demonstrate that this improvement yields orders-of-magnitude reductions in running time over the most recent methods.
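As a concrete statement of the problem this abstract describes, the following brute-force baseline computes the exact source-to-target reachability probability by enumerating all 2^|E| states of the uncertain edges. This sketches only the quantity PReach computes, not its polynomial-multiplication algorithm; the function name and graph encoding are invented for illustration:

```python
from itertools import product

def reachability_probability(n, edges, sources, targets):
    """Probability that some source reaches some target when each directed
    edge (u, v, p) is present independently with probability p.
    Brute force over all 2^|E| edge subsets -- exponential on purpose."""
    total = 0.0
    for present in product([False, True], repeat=len(edges)):
        prob = 1.0
        adj = {u: [] for u in range(n)}
        for keep, (u, v, p) in zip(present, edges):
            prob *= p if keep else 1.0 - p
            if keep:
                adj[u].append(v)
        # Depth-first search from all sources in the sampled subgraph.
        seen, stack = set(sources), list(sources)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if seen & set(targets):
            total += prob
    return total

# One receptor, one reporter, a single interaction of confidence 0.6.
print(reachability_probability(2, [(0, 1, 0.6)], [0], [1]))  # 0.6
```

Collapsing equivalent edge states, rather than enumerating them, is what makes PReach's polynomial representation tractable in practice.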
MoTeX: A word-based HPC tool for MoTif eXtraction
Solon P. Pissis,
Alexandros Stamatakis,
Pavlos Pavlidis
Pages: 13
DOI: 10.1145/2506583.2506587
Motivation: Identifying repeated factors that occur in a string of letters, or common factors that occur in a set of strings, represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motifs may correspond to functional elements in DNA, RNA, or protein molecules. Motifs may also correspond to whole loci whose sequences are highly similar because of recent duplication (e.g., transposable elements or recently duplicated genes). A DNA motif is a nucleic acid sequence that has a specific biological function, for instance encoding the DNA binding sites for a regulatory protein (transcription factor). Results: In this article, we introduce MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets. It uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. MoTeX comes in three flavors: a standard CPU version, an OpenMP-based version, and an MPI-based version. We show that MoTeX produces similar, and partially identical, results to current state-of-the-art tools with respect to accuracy, as quantified by statistical significance measures. Moreover, we show that it matches or outperforms competing tools in terms of runtime efficiency. The MPI-based version of MoTeX requires only one hour to process all human genes on 1056 processors, whereas current sequential programs require more than two months for this task. Availability: http://www.exelixis-lab.org/motex (open-source code)
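The fixed-length approximate string matching primitive mentioned in this abstract can be illustrated with a naive quadratic scan (MoTeX itself uses much faster algorithms and HPC parallelism; the function names here are invented):

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approximate_occurrences(text, motif, max_mismatch):
    """Start positions where `motif` occurs in `text` with at most
    `max_mismatch` substitutions (naive O(n*m) scan)."""
    m = len(motif)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], motif) <= max_mismatch]

# The motif ACGA occurs exactly once and approximately twice more.
print(approximate_occurrences("ACGTACGAACGT", "ACGA", 1))  # [0, 4, 8]
```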
Global Network Alignment In The Context Of Aging
Tijana Milenković,
Han Zhao,
Fazle E. Faisal
Pages: 23
DOI: 10.1145/2506583.2508968
Analogous to sequence alignment, network alignment (NA) can be used to transfer biological knowledge across species between conserved network regions. NA faces two algorithmic challenges: 1) which cost function to use to capture "similarities" between nodes in different networks, and 2) which alignment strategy to use to rapidly identify "high-scoring" alignments from among all possible alignments. We "break down" existing state-of-the-art methods that use both different cost functions and different alignment strategies, in order to evaluate each combination of their cost functions and alignment strategies. We find that a combination of the cost function of one method and the alignment strategy of another method beats the existing methods. Hence, we propose this combination as a novel, superior NA method. Then, since human aging is hard to study experimentally due to the long human lifespan, we use NA to transfer aging-related knowledge from well-annotated model species to the poorly annotated human between aligned network regions. By doing so, we produce novel aging-related information, which complements currently available information about aging that has been obtained mainly by sequence alignment, especially in human. To our knowledge, we are the first to use NA to learn more about aging.
Haplotype-based prediction of gene alleles using pedigrees and SNP genotypes
Yuri Pirola,
Gianluca Della Vedova,
Paola Bonizzoni,
Alessandra Stella,
Filippo Biscarini
Pages: 33
DOI: 10.1145/2506583.2506592
Computational methods for gene allele prediction have been proposed to substitute dedicated and expensive assays with cheaper in-silico analyses that operate on routinely collected data, such as SNP genotypes. Most of these methods are tailored to the needs and characteristics of human genetic studies, where they achieve good prediction accuracy. However, genomic analyses are becoming increasingly important in livestock species too. For livestock species, the underlying pedigree (usually quite large and complex) is generally known and available, yet this information is not fully exploited by current allele prediction methods. In this paper, we propose a new gene allele prediction method based on a simple but robust combinatorial formulation of the problem of discovering haplotype-allele associations. The inherent uncertainty of the haplotype inference process is reduced by taking into account the inheritance of gene alleles across the population pedigree while genotypes are phased. The accuracy of the method has been extensively evaluated on a representative real-world livestock dataset under several scenarios and choices of parameters. The median error rate ranged from 0.0537 to 0.0896, with an average of 0.0678; this is 21% better than another state-of-the-art prediction algorithm that does not use pedigree information. The experimental results support the validity of the proposed approach and, in particular, of the use of pedigree information in gene allele prediction.
A Semi-Supervised Learning Approach to Integrated Salient Risk Features for Bone Diseases
Hui Li,
Xiaoyi Li,
Murali Ramanathan,
Aidong Zhang
Pages: 42
DOI: 10.1145/2506583.2506593
Risk factor analysis and disease prediction require an understanding of the complicated and highly correlated relationships among numerous potential risk factors (RFs). Existing models for this purpose usually fix a small number of RFs based on expert knowledge. Although handcrafted RFs are usually statistically significant, the abandoned RFs might still contain valuable information for explaining a disease comprehensively. However, it is impossible to simply keep all RFs, so finding integrated risk features among numerous potential RFs becomes a particularly challenging task. Another major challenge for this task is the lack of sufficient labeled data and the missing values in the training data. In this paper, we focus on identifying the relationships between a bone disease and its potential risk factors by learning a deep graphical model in an epidemiologic study, for the purpose of predicting osteoporosis and bone loss. An effective risk factor analysis approach that delineates both observed and hidden risk factors behind a disease encapsulates the salient features and also provides a framework for two prediction tasks. Specifically, we first investigate an approach to show the salience of the integrated risk features, yielding more abstract and useful representations for prediction. Then we formulate the whole prediction problem as two separate tasks to evaluate our new representation of integrated features. Building on the success of the osteoporosis prediction, we further take advantage of the positive output and predict the progression trend of osteoporosis severity. We capture the characteristics of the data itself and the intrinsic relatedness between the two prediction tasks by constructing a deep belief network followed by a two-stage fine-tuning. Moreover, our proposed method produces stable and promising results without using any prior information. The superior performance on our evaluation metrics confirms the effectiveness of the proposed approach for extracting integrated salient risk features to predict bone diseases.
Color distribution can accelerate network alignment
Md Mahmudul Hasan,
Tamer Kahveci
Pages: 52
DOI: 10.1145/2506583.2506594
Aligning a query network to an arbitrarily large target network while ensuring a provable optimality guarantee is a computationally challenging task. To ensure confidence in the optimality of the alignment, existing methods often use an iterative randomization technique called color coding. Each iteration of the color coding technique employs a dynamic program that is exponential in the number of nodes in the query network. Here, we develop a method named ColT (Colorful Tree) that reduces the cost of this bottleneck. It particularly focuses on query networks with tree topology, which are considered frequently in the literature. ColT exploits the topology of the query tree and uses the color distribution in the target network to filter unpromising alignments without compromising the confidence in optimality. We experiment on a comprehensive set of synthetic and real datasets. ColT increasingly outperforms the state-of-the-art color coding algorithm, QNet, as the query trees grow. For query trees of nine nodes in directed and undirected target networks, ColT outperforms QNet by factors of eight and fifteen, respectively. Our experiments also suggest that ColT identifies functionally similar regions in protein-protein interaction networks.
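The color coding technique referred to in this abstract can be sketched as follows: nodes are randomly colored with k colors, and a dynamic program over color subsets checks for a "colorful" path, repeated over many colorings. This illustrates generic color coding for paths, not ColT's tree-specific filtering; all names are invented:

```python
import random

def colorful_path_exists(adj, k, colors):
    """DP over color subsets: does the graph contain a simple path on k
    vertices whose assigned colors are all distinct?"""
    # Color sets of colorful paths of the current length ending at each node.
    reachable = {v: {frozenset([colors[v]])} for v in adj}
    for _ in range(k - 1):
        nxt = {v: set() for v in adj}
        for u in adj:
            for cs in reachable[u]:
                for v in adj[u]:
                    if colors[v] not in cs:
                        nxt[v].add(cs | {colors[v]})
        reachable = nxt
    return any(reachable[v] for v in reachable)

def find_path_of_length(adj, k, trials=100, seed=0):
    """Color coding: each random k-coloring detects an existing k-vertex
    path with probability >= k!/k^k, so repeated trials succeed w.h.p."""
    rng = random.Random(seed)
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        if colorful_path_exists(adj, k, colors):
            return True
    return False

# A 4-cycle contains a path on 3 vertices; a single edge does not.
print(find_path_of_length({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}, 3))
```

The inner DP is the exponential-in-k bottleneck that ColT prunes using the target network's color distribution.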
TCGA Toolbox: an Open Web App Framework for Distributing Big Data Analysis Pipelines for Cancer Genomics
David E. Robbins,
Alexander Grüneberg,
Helena F. Deus,
Murat M. Tanik,
Jonas Almeida
Pages: 62
DOI: 10.1145/2506583.2506595
The diversity and volume of data generated by The Cancer Genome Atlas (TCGA) have been increasing exponentially, with the number of data files hosted by the NIH, currently about 3/4 million, doubling every 7 months since January 2010. The proponents have recently developed a browser-based self-updating mechanism to catalog this dynamic big-data repository. In this report, that foundation is built upon to devise a web app framework that distributes TCGA analytical pipelines in a manner that is fully reproducible without the usual requirement for a pre-installed specialized computational statistics environment. The solution found relies exclusively on sandboxed code injection (JavaScript) and on access permission configuration by the browser's app store. This framework was devised with an open architecture such that third-party analyses, ideally hosted with web-facing version control in a repository such as GitHub, SourceForge, Bitbucket, or Google Code, can be distributed through the toolbox. The openness of the framework is specifically reflected by enabling the user to invoke a third-party analysis simply by inputting the corresponding URL. Similarly, the toolbox also mediates the ability of the user to then distribute the result of the analysis as a reproducible procedure, itself fully invoked as a Uniform Resource Locator (URL).
A Study of Temporal Action Sequencing During Consumption of a Meal
Raul I. Ramos-Garcia,
Adam W. Hoover
Pages: 68
DOI: 10.1145/2506583.2506596
Advances in body sensing and mobile health technology have created new opportunities for empowering people to take a more active role in managing their health. Measurements of dietary intake are commonly used for the study and treatment of obesity. However, the most widely used tools rely upon self-report and require considerable manual effort, leading to underreporting of consumption, non-compliance, and discontinued use over the long term. We are investigating the use of wrist-worn accelerometers and gyroscopes to automatically recognize eating gestures. In order to improve recognition accuracy, we studied the sequential dependency of actions during eating. Using a set of four actions (rest, utensiling, bite, drink), we developed a hidden Markov model (HMM) and compared its recognition performance against a non-sequential classifier (KNN). Tested on a dataset of 20 meals, the KNN achieved 71.7% accuracy while the HMM achieved 84.3% accuracy, showing that knowledge of the sequential nature of activities during eating improves recognition accuracy.
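The sequential model this abstract describes can be sketched with standard Viterbi decoding over the four actions. The transition and emission probabilities below are invented placeholders, not the trained values from the paper:

```python
import math

STATES = ["rest", "utensiling", "bite", "drink"]

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state sequence for a sequence of per-state observation
    log-likelihoods (standard Viterbi decoding)."""
    T, N = len(obs_loglik), len(STATES)
    score = [log_init[s] + obs_loglik[0][s] for s in range(N)]
    back = []
    for t in range(1, T):
        prev, ptr, score = score, [], []
        for s in range(N):
            r = max(range(N), key=lambda r: prev[r] + log_trans[r][s])
            score.append(prev[r] + log_trans[r][s] + obs_loglik[t][s])
            ptr.append(r)
        back.append(ptr)
    path = [max(range(N), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [STATES[s] for s in reversed(path)]

# Made-up parameters: actions tend to persist (self-transition 0.7).
log_init = [math.log(0.25)] * 4
log_trans = [[math.log(0.7 if r == s else 0.1) for s in range(4)]
             for r in range(4)]
# Sensor evidence: "rest"-like, "rest"-like, then strongly "bite"-like.
obs = [[math.log(p) for p in row] for row in
       [[0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1], [0.05, 0.05, 0.85, 0.05]]]
print(viterbi(obs, log_trans, log_init))  # ['rest', 'rest', 'bite']
```

Unlike a KNN, which classifies each window independently, the decoder lets strong evidence at neighboring time steps smooth ambiguous observations.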
Binary Response Models for Recognition of Antimicrobial Peptides
Elena G. Randou,
Daniel Veltri,
Amarda Shehu
Pages: 76
DOI: 10.1145/2506583.2506597
There is now great urgency in developing new antibiotics to combat bacterial resistance. Recent attention has turned to naturally occurring antimicrobial peptides (AMPs) that can serve as templates for antibacterial drug research. As natural AMPs have a wide range of activity against various bacteria, current research is focusing on modifying existing peptides or designing new ones to increase potency. This paper presents a computational approach to further our understanding of which physicochemical properties or features confer antimicrobial activity on a peptide. One contribution of this paper is the ability to rigorously test the relevance of features proposed by biological or computational researchers in the context of AMP recognition. A second contribution is the construction of a predictive model that employs relevant features and their combinations to associate with a novel peptide sequence a probability of having antimicrobial activity. Taken together, the work in this paper seeks to help researchers elucidate features of importance for antimicrobial activity. This is an important first step towards the modification or design of novel AMPs for treatment. With this goal in mind, we provide access to the proposed methodology through a web server, which allows users to replicate the findings here or evaluate their own feature sets.
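A binary response model of the kind this abstract describes can be sketched as a logistic model mapping peptide features to an activity probability. The features and weights below are invented for illustration, not the paper's fitted model:

```python
import math

def logistic_probability(features, weights, bias):
    """Binary response (logistic) model: maps a peptide's feature vector
    to a probability of antimicrobial activity."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features [net charge, hydrophobicity, length/100] with
# invented weights -- purely illustrative numbers.
p = logistic_probability([3.0, 0.4, 0.25], [0.8, 1.5, -0.2], -1.0)
print(round(p, 3))  # 0.875
```

Testing whether a feature's weight is significantly nonzero in such a model is one way to quantify its relevance to AMP recognition.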
Quantitative Early Detection of Diabetic Foot
Viktor Chekh,
Shuang (Sean) Luan,
Mark Burge,
Cesar Carranza,
Pete Soliz,
Elizabeth McGrew,
Simon Barriga
Pages: 86
DOI: 10.1145/2506583.2506598
Diabetes afflicts an estimated 171 million people worldwide. Diabetic patients are at risk of a wide range of complications, including peripheral neuropathy (or diabetic foot). The condition, if left untreated, leads to ulcers and eventually lower extremity amputation. Existing diagnostic techniques for peripheral neuropathy are mostly qualitative procedures based on patient sensations and exhibit significant inter- and intra-observer differences; an economical quantitative diagnostic technique is still lacking. We have developed a system for quantitative early detection of diabetic peripheral neuropathy based on the thermal response of the feet of diabetic patients following a cold stimulus. This paper describes the details of the new system, which includes the following key components: (1) a new protocol that uses thermal imaging as functional imaging to measure thermal response; (2) segmentation and tracking of regions of interest (ROIs) in thermal videos; (3) a novel bio-heat transfer model based on thermoregulation. We also report our preliminary patient studies based on two classifiers, which give strong evidence that the system can be used for early quantitative detection of peripheral neuropathy in diabetics.
Reconstructing transcriptional regulatory networks by probabilistic network component analysis
Jinghua Gu,
Jianhua Xuan,
Xiao Wang,
Ayesha N. Shajahan,
Leena Hilakivi-Clarke,
Robert Clarke
Pages: 96
DOI: 10.1145/2506583.2506599
Despite encouraging progress made by integrating multi-platform data for regulatory network reconstruction, identification of transcriptional regulatory networks remains challenging due to imperfections in current biotechnology and the complexity of biological systems. It is important to develop new computational approaches for reliable regulatory network reconstruction, especially approaches robust against noise in gene expression data and 'structural error' (i.e., false connections) in binding data. We propose a new method, namely probabilistic network component analysis (pNCA), to estimate the posterior binding matrix given observed gene expression and binding data. The elements of the binding matrix, instead of taking deterministic binary values, are modeled as unknown Bernoulli random variables that represent the probability of regulation. A novel two-stage Gibbs sampling framework is employed to iteratively estimate both the hidden transcription factor activities and the posterior distribution of the binding matrix. Numerical simulation on synthetic data has demonstrated improved performance of the proposed method over several existing methods for regulatory network identification. Notably, the robustness of pNCA against 'structural error' in initial binding data is fortified with a high tolerance of false negative connections in addition to that of false positive connections. The proposed method has been applied to breast cancer cell line data to reconstruct biologically meaningful regulatory networks, revealing condition-specific regulatory rewiring and important cooperative regulation associated with estrogen signaling and action in breast cancer cells.
Protein Structure Refinement by Iterative Fragment Exchange
Debswapna Bhattacharya,
Jianlin Cheng
Pages: 106
DOI: 10.1145/2506583.2506601
Despite significant advancement of computational methods for protein structure prediction during the last decade, these techniques often cannot achieve sufficient prediction accuracy to be applied in solving biological problems. Bringing these low-resolution predicted models to high-resolution structures close to their native state, called the protein structure refinement problem, has proven to be extremely challenging and remains largely unsolved in the field of protein structure prediction. Here, we propose a new approach to protein structure refinement by iterative fragment exchange, called REFINEpro. The protocol first identifies the less conserved local regions in the initial model by a consensus approach using an ensemble of models produced for the same protein target. We call these regions problematic regions (PRs). The quality of the PRs is then iteratively improved by exchanging better-modeled fragments corresponding to these PRs from structures in the ensemble. This method has been tested on benchmark datasets comprising decoys generated through both template-based and ab-initio protein structure prediction methods, and it exhibits promising improvement over the initial models in both global and local model quality measures, indicating a new avenue for solving the protein structure refinement problem. The REFINEpro web server is freely available at http://sysbio.rnet.missouri.edu/REFINEpro/.
MarkovBin: An Algorithm to Cluster Metagenomic Reads Using a Mixture Modeling of Hierarchical Distributions
Tin Chi Nguyen,
Dongxiao Zhu
Pages: 115
DOI: 10.1145/2506583.2506602
Metagenomics is the study of the genomic content of microorganisms from environmental samples, without isolation and cultivation. Recently developed next generation sequencing (NGS) technologies efficiently generate vast amounts of metagenomic DNA sequences. However, the ultra-high throughput and short read lengths make the separation of reads from different species more challenging. Among the existing computational tools for NGS data, there are supervised methods that use reference databases to classify reads and unsupervised methods that use oligonucleotide patterns to cluster reads. The former may leave a large fraction of reads unclassified due to the absence of closely related references. The latter often rely on long oligonucleotide frequencies and are sensitive to species abundance levels. In this work, we present MarkovBin, a new unsupervised method that can accurately cluster metagenomic reads across various species abundance ratios. We first model the nucleotide sequences as a fixed-order Markov chain. We then propose a hierarchical distribution to model the dependency between paired-end reads. Finally, we employ the mixture model framework to separate reads from different genomes in a metagenomic dataset. Using extensive simulation data, we demonstrate high accuracy and precision in comparison to selected unsupervised read clustering tools. The software is freely available at http://orleans.cs.wayne.edu/MarkovBin.
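The fixed-order Markov chain modeling step this abstract describes can be sketched as follows; the smoothing scheme and toy genomes are invented for illustration, not MarkovBin's actual estimator:

```python
import math
from collections import defaultdict

def train_markov(seqs, k):
    """Fit a k-th order Markov chain over DNA with add-one smoothing;
    returns a log-likelihood scorer for new reads."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def logprob(read):
        lp = 0.0
        for i in range(k, len(read)):
            ctx = counts[read[i - k:i]]
            total = sum(ctx.values()) + 4  # add-one over the 4 bases
            lp += math.log((ctx[read[i]] + 1) / total)
        return lp

    return logprob

# Two toy "genomes" with different composition; a read is scored under each.
model_a = train_markov(["ATATATATATATATAT"], 2)
model_g = train_markov(["GCGCGCGCGCGCGCGC"], 2)
print(model_a("ATATATAT") > model_g("ATATATAT"))  # True
```

In a mixture-model framework, such per-genome likelihoods become the component densities, and reads are assigned to the component that explains them best.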
Detecting various types of differential splicing events using RNA-Seq data
Nan Deng,
Dongxiao Zhu
Pages: 124
DOI: 10.1145/2506583.2512361
More than 90% of human genes are alternatively spliced through different types of splicing. The high-throughput RNA-Seq technology provides unprecedented opportunities for detecting differential pre-mRNA alternative splicing between transcriptomes. Beyond differential expression analysis, differential splicing analysis may generate new insight into cell development and differentiation as well as various human diseases. In this paper, we present a novel computational method for detecting various types of differential splicing events between transcriptomes using RNA-Seq data. Our method utilizes the sequential dependency of base-wise read coverage signals and detects significant differential splicing events in the form of five types of splicing events supported by junction reads. For each candidate splicing event, by taking the ratio of the normalized RNA-Seq splicing indexes at each nucleotide location of the two samples, our method reduces the effect of sequencing and alignment biases. We employ a parametric statistical test and a change-point type of analysis on each candidate splicing event for differential splicing detection. We applied our method to a public RNA-Seq dataset of human H1 cells and H1 cells differentiated into neural progenitor cell lines, and detected many significant differential splicing events falling into the five well-known types of alternative splicing. We also compared our method with two other existing methods, and the results demonstrate that our method is a promising approach that can uniquely detect more differential splicing events using RNA-Seq data.
MRFy: Remote Homology Detection for Beta-Structural Proteins Using Markov Random Fields and Stochastic Search
Noah M. Daniels,
Andrew Gallant,
Norman Ramsey,
Lenore J. Cowen
Pages: 133
DOI: 10.1145/2506583.2506607
We introduce MRFy, a tool for protein remote homology detection that captures beta-strand dependencies in a Markov random field. Over a set of 11 SCOP beta-structural superfamilies, MRFy shows a 14% improvement in mean Area Under the Curve for the motif recognition problem as compared to HMMER, a 25% improvement as compared to RAPTOR, a 14% improvement as compared to HHPred, and an 18% improvement as compared to CNFPred and RaptorX. MRFy was implemented in the Haskell functional programming language and parallelizes well on multi-core systems. MRFy is available, as source code as well as an executable, from http://mrfy.cs.tufts.edu/.
Improving discrimination of essential genes by modeling local insertion frequencies in transposon mutagenesis data
Michael A. DeJesus,
Thomas R. Ioerger
Pages: 144
DOI: 10.1145/2506583.2506610
Transposon mutagenesis experiments enable the identification of essential genes in bacteria. Deep sequencing of mutant libraries provides a large amount of high-resolution data on essentiality. Statistical methods developed to analyze these data have traditionally assumed that the probability of observing a transposon insertion is the same across the genome. This assumption, however, is inconsistent with the observed insertion frequencies from transposon mutant libraries of M. tuberculosis. We propose a modified binomial model of essentiality that can characterize the insertion probability of individual genes, in which we allow local variation in the background insertion frequency in different non-essential regions of the genome. Using the Metropolis-Hastings algorithm, samples of the posterior insertion probabilities are obtained for each gene, and the probability of each gene being essential is estimated. We compare our predictions to those of previous methods and show that, by taking local insertion frequencies into consideration, our method is capable of making more conservative predictions that better match what is experimentally known about essential and non-essential genes.
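The Metropolis-Hastings sampling step can be sketched for a single gene with a plain Binomial likelihood and a flat prior; the paper's model additionally allows locally varying background frequencies, and all parameters below are invented:

```python
import math
import random

def sample_insertion_probability(x, n, iters=5000, seed=1):
    """Metropolis-Hastings sampling of a gene's insertion probability p
    under a Binomial(n, p) likelihood with a flat prior on (0, 1).
    Returns the posterior mean estimated from the second half of the chain."""
    rng = random.Random(seed)

    def loglik(p):
        return x * math.log(p) + (n - x) * math.log(1.0 - p)

    p, samples = 0.5, []
    for _ in range(iters):
        q = p + rng.gauss(0.0, 0.05)       # symmetric random-walk proposal
        if 0.0 < q < 1.0 and math.log(rng.random()) < loglik(q) - loglik(p):
            p = q
        samples.append(p)
    tail = samples[iters // 2:]
    return sum(tail) / len(tail)

# 2 insertions observed across a gene's 40 candidate sites: the posterior
# mean should land near (x + 1) / (n + 2), about 0.071.
print(sample_insertion_probability(2, 40))
```

A gene whose sampled insertion probability stays far below its local background frequency is a candidate essential gene.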
GLProbs: Aligning multiple sequences adaptively
Yongtao Ye,
David W. Cheung,
Yadong Wang,
Siu-Ming Yiu,
Qing Zhan,
Tak-Wah Lam,
Hing-Fung Ting
Pages: 152
DOI: 10.1145/2506583.2506611
This paper proposes a simple and effective approach to improving the accuracy of multiple sequence alignment. We use a natural measure to estimate the similarity of the input sequences and, based on this measure, align the input sequences differently. For example, for inputs with high similarity, we consider the whole sequences and align them globally, while for those with moderately low similarity, we may ignore the flanking regions and align locally. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with a dozen leading alignment tools on three benchmark alignment databases. Our results show that GLProbs has the best accuracy in almost all tests.
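The adaptive idea, estimating input similarity and choosing the alignment mode accordingly, can be sketched as follows. The similarity measure and thresholds below are invented placeholders, not GLProbs's actual measure:

```python
def average_pairwise_identity(seqs):
    """Mean fraction of identical positions over ungapped pairwise
    comparisons, truncated to the shorter sequence of each pair."""
    pairs, total = 0, 0.0
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            a, b = seqs[i], seqs[j]
            total += sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))
            pairs += 1
    return total / pairs

def choose_strategy(seqs, high=0.8, low=0.4):
    """Adaptive choice: global alignment for similar inputs, local for
    dissimilar ones (thresholds here are invented)."""
    pid = average_pairwise_identity(seqs)
    if pid >= high:
        return "global"
    if pid <= low:
        return "local"
    return "mixed"

print(choose_strategy(["ACGTACGT", "ACGTACGA", "ACGTACGT"]))  # global
```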
PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM Approach
Xiao Zhu,
Henry C.M. Leung,
Francis Y.L. Chin,
Siu Ming Yiu,
Guangri Quan,
Bo Liu,
Yadong Wang
Pages: 161
DOI: 10.1145/2506583.2506612
Since the read lengths of high-throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in single-end and paired-end reads when resolving branches; e.g., the number and positions of reads supporting each possible extension are not taken into account. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using only single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap size thresholds, ranging from Omax down to Omin, to resolve gaps and branches. Moreover, by constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contigs by all feasible extensions and determine the correct one using a look-ahead technique. We evaluated PERGA on both simulated Illumina datasets and real datasets, and it constructed longer and more correct contigs and scaffolds than the state-of-the-art assemblers IDBA-UD, Velvet, ABySS, SGA, and CABOG. Availability: https://github.com/hitbio/PERGA

Simultaneous determination of subunit and complex structures of symmetric homo-oligomers from ambiguous NMR data
Himanshu Chandola, Bruce R. Donald, Chris Bailey-Kellogg
Pages: 171
doi: 10.1145/2506583.2506613

Determining the structures of symmetric homo-oligomers provides critical insights into their roles in numerous vital cellular processes. Structure determination by nuclear magnetic resonance spectroscopy typically pieces together a structure based primarily on interatomic distance restraints, but for symmetric homo-oligomers each restraint may involve atoms in the same subunit or in different subunits, as the different homo-oligomeric "copies" of each atom are indistinguishable without special experimental approaches. This paper presents a novel method that simultaneously determines the structure of the individual subunits and their arrangement into a complex structure, so as to best satisfy the distance restraints under a consistent (but partial) disambiguation. Recognizing that there are likely to be multiple good solutions to this complex problem, our method provides a guarantee of completeness to within a user-specified resolution, generating representative backbone structures for the secondary structure elements, such that any structure that satisfies sufficiently many experimental restraints is sufficiently close to a representative. Our method employs a branch-and-bound algorithm to search a configuration space representation of the subunit and complex structure, identifying regions containing the structures that are most consistent with the data. We apply our method to three test cases with experimental data and demonstrate that it can handle the difficult configuration space search problem and substantial ambiguity, effectively pruning the configuration spaces and characterizing the actual diversity of structures supported by the data.

Greedy Randomized Search Procedure to Sort Genomes using Symmetric, Almost-Symmetric and Unitary Inversions
Ulisses Dias, Christian Baudet, Zanoni Dias
Pages: 181
doi: 10.1145/2506583.2506614

Genome Rearrangement is a field that addresses the problem of finding the minimum number of global operations that transform one given genome into another. In this work we develop an algorithm for three constrained versions of the event called inversion, which occurs when a chromosome breaks at two locations called breakpoints and the DNA between the breakpoints is reversed. The constrained versions are called symmetric, almost-symmetric and unitary inversions. In this paper, we present a greedy randomized search procedure to find the minimum number of such operations between two genomes. Our approach is, to our knowledge, the first genome rearrangement problem modeled with this metaheuristic. Our model is an iterative process in which each iteration receives a feasible solution whose neighborhood is searched for a better solution. This search uses greediness to shape the candidate list and randomness to select elements from the list. A previous greedy heuristic was used as the initial solution. In almost every case, we were able to improve that initial solution by providing a new sequence of inversions that uses fewer operations. For permutations of size 10, our solutions were, on average, 5 inversions shorter than the initial solution. For permutations of size 15 and 20, our solutions were, on average, 10 and 16 inversions shorter than the initial solution, respectively. For longer permutations ranging from 25 to 50 elements, we generated solutions that were, on average, 20-22 inversions shorter than the initial solution. We believe that the method proposed in this work can be adapted to other genome rearrangement problems.
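The abstract describes the GRASP loop only at a high level. As an illustrative sketch (not the authors' implementation, and ignoring the symmetric/almost-symmetric/unitary constraints), the metaheuristic can be applied to sorting a permutation by unrestricted inversions, using the breakpoint count as the greedy score:

```python
import random

def breakpoints(perm):
    # count adjacent pairs that are not consecutive values, with sentinels 0 and n+1
    ext = [0] + perm + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

def apply_inversion(perm, i, j):
    # reverse the segment perm[i:j]
    return perm[:i] + perm[i:j][::-1] + perm[j:]

def grasp_sort(perm, alpha=0.3, iters=20, seed=0):
    # GRASP: greediness shapes the restricted candidate list (RCL),
    # randomness picks an inversion from it; keep the shortest scenario found
    rng = random.Random(seed)
    target, best = sorted(perm), None
    for _ in range(iters):
        cur, ops = list(perm), []
        while cur != target and len(ops) < 4 * len(perm):
            cands = sorted(
                (breakpoints(apply_inversion(cur, i, j)), i, j)
                for i in range(len(cur)) for j in range(i + 2, len(cur) + 1))
            lo, hi = cands[0][0], cands[-1][0]
            rcl = [c for c in cands if c[0] <= lo + alpha * (hi - lo)]
            _, i, j = rng.choice(rcl)
            cur = apply_inversion(cur, i, j)
            ops.append((i, j))
        if cur == target and (best is None or len(ops) < len(best)):
            best = ops
    return best
```

Each restart builds a solution greedily but with randomized tie-breaking, so repeated restarts explore different inversion scenarios; the paper's version additionally restricts which inversions are admissible.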

A generalized sparse regression model with adjustment of pedigree structure for variant detection from next generation sequencing data
Shaolong Cao, Huaizhen Qin, Hong-Wen Deng, Yu-Ping Wang
Pages: 191
doi: 10.1145/2506583.2506616

Next-generation sequencing technologies have been providing more comprehensive descriptions of rare and common sequence variants. Many powerful association tests have been developed for identifying significant individual common variants and genetic regions likely harboring rare and common variants. Single-marker tests have poor statistical power for identifying rare variant associations. Set-based tests sacrifice single-marker resolution and require the set size to be much smaller than the sample size. Existing sparse regression algorithms can identify susceptible variants from a large set, even if its size far exceeds the sample size. Such algorithms were developed for analyzing sequence data of unrelated individuals and thus can be invalid in the presence of relatedness. Relatedness and population structure are two ubiquitous confounders in sequencing studies, especially those of admixed minorities. We hereby propose a flexible sparse regression model that jointly adjusts for relatedness, population structure and traditional covariates. Under this framework, we develop unweighted and weighted Lp (0<p<1) regularization algorithms for selecting a sparse set of susceptible genetic variants. Both the joint adjustment of confounders and the Lp regularizations proved effective at preventing false positives while maximizing the power to select susceptible genetic variants. Under extensive simulations, our new algorithms outperformed conventional sparse regression algorithms by identifying more causal variants while maintaining lower false discovery rates. In particular, our algorithms appeared to be equally efficient at identifying susceptible rare and common variants. Our algorithms should be useful for a wide range of next-generation sequencing studies.
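The abstract leaves the model implicit; as a hedged sketch of what an Lp-penalized model in this family might look like (our notation throughout, not necessarily the paper's formulation), consider:

```latex
% y: phenotype vector; C: traditional covariates plus population-structure
% components; G: genotype matrix of the candidate variants; b: random
% effect absorbing pedigree relatedness, e.g. b ~ N(0, sigma_g^2 K) for a
% kinship matrix K. Setting all w_j = 1 gives the unweighted algorithm.
\min_{\alpha,\,\beta,\,b}\;
  \bigl\| y - C\alpha - G\beta - b \bigr\|_2^2
  \;+\; \lambda \sum_{j=1}^{m} w_j\, \lvert \beta_j \rvert^{p},
  \qquad 0 < p < 1
```

With 0<p<1 the penalty is nonconvex and enforces sparsity more aggressively than the lasso (p=1), which is consistent with the abstract's emphasis on selecting a sparse set of variants from a set larger than the sample size.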

Text Mining of Protein Phosphorylation Information Using a Generalizable Rule-Based Approach
Manabu Torii, Cecilia N. Arighi, Qinghua Wang, Cathy H. Wu, K. Vijay-Shanker
Pages: 201
doi: 10.1145/2506583.2506619

Literature-based annotation of protein phosphorylation is the focus of many biological databases, as phosphorylation is a global regulator of cellular activity. To speed up manual curation of phosphorylation information, text mining technology has been utilized. In this paper, we report our ongoing effort to enhance RLIMS-P, a rule-based information extraction (IE) system that identifies protein phosphorylation information in the scientific literature. Despite the high accuracy attained by RLIMS-P, its elaborate patterns and rules demanded substantial effort for system development and maintenance. To mitigate this challenge, we redesigned RLIMS-P and integrated new natural language processing (NLP) techniques. It has also been adapted to mine full-text articles and generalized to exploit features common to different post-translational modifications (PTMs). The updated RLIMS-P (version 2.0) was evaluated on abstracts in the publicly available BioNLP GENIA event extraction (GE) corpus, and achieved F-scores of 0.92 and 0.96 for phosphorylation substrate and site, respectively. On a full-text corpus developed in-house, it achieved F-scores of 0.91 and 0.92 for substrate and site, and 0.88 for kinase. The system was applied to the PubMed Central (PMC) Open Access Subset, and promising results have been obtained in mining the full-text articles. RLIMS-P focuses on protein phosphorylation information, but its new design should generalize to other PTM types. RLIMS-P version 2.0 is available at: http://proteininformationresource.org/rlimsp/.

Decomposing Biochemical Networks Into Elementary Flux Modes Using Graph Traversal
Ehsan Ullah, Calvin Hopkins, Shuchin Aeron, Soha Hassoun
Pages: 211
doi: 10.1145/2506583.2506620

Elementary Flux Mode (EFM) analysis is a fundamental network decomposition technique used for cellular pathway analysis in Systems Biology and Metabolic Engineering. EFM analysis has been utilized to examine robustness, regulation and microbial stress responses, to increase product yield, and to assess plant fitness and agricultural productivity. An EFM is a thermodynamically feasible path operating at steady state in a biochemical network, and is independent of other EFMs in the sense that it cannot be generated as a non-negative linear combination of other EFMs. We present in this paper a pathway analysis algorithm, termed graphical EFM or gEFM, based on graph traversal. Graph theoretical approaches were previously assumed to be less competitive than techniques based on the double-description method, a computational technique used for enumerating the extreme rays of a pointed cone. Importantly, we show that a practical graph-based traversal approach for computing EFMs is competitive with existing techniques. Applied to several biochemical networks, we show runtime speedups in the range of 2.5× to 31× when compared to the state-of-the-art tool (EFMTool).
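The two defining EFM conditions named above (steady state and independence via support-minimality) can be checked for a toy network in a few lines; this is an illustrative checker on a given set of modes, not gEFM's traversal algorithm:

```python
def is_steady_state(S, v, tol=1e-9):
    # S: stoichiometry matrix (rows = internal metabolites, cols = reactions);
    # v: candidate flux vector; steady state means S v = 0
    return all(abs(sum(s * f for s, f in zip(row, v))) <= tol for row in S)

def support(v, tol=1e-9):
    # indices of the reactions the mode actually uses
    return frozenset(i for i, f in enumerate(v) if abs(f) > tol)

def is_elementary(v, other_modes):
    # elementary = no other steady-state mode uses a strict subset of v's
    # reactions (support-minimality); equivalent to v not being decomposable
    # into a non-negative combination of simpler modes
    return all(not (support(u) < support(v)) for u in other_modes)
```

For a linear pathway →A→B→ with uptake, conversion, and secretion reactions, the vector (1, 1, 1) is the single EFM: it balances both internal metabolites and no smaller-support steady-state mode exists.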

PathCase-MAW: An Online Metabolic Network Analysis Workbench
A. Ercument Cicek, Xinjian Qi, Ali Cakmak, Stephen R. Johnson, Xu Han, Sami Alshalwi, Gultekin Ozsoyoglu
Pages: 219
doi: 10.1145/2506583.2506621

Metabolic networks have become one of the centers of attention in life sciences research with the advancements in the metabolomics field. A vast array of studies analyzes metabolites and their interrelations to seek explanations for various biological questions, and numerous genome-scale metabolic networks have been assembled to serve this purpose. The increasing focus on this topic comes with the need for software systems that store, query, browse, analyze, and visualize metabolic networks. PathCase Metabolomics Analysis Workbench (PathCase-MAW) is built, released, and running on a manually created generic mammalian metabolic network. The PathCase-MAW system provides a database-enabled framework and web-based computational tools for browsing, querying, analyzing, and visualizing stored metabolic networks. The PathCase-MAW editor, with its user-friendly interface, can be used to create a new metabolic network and/or update an existing one. A network can also be created from an existing genome-scale reconstructed network using the PathCase-MAW SBML parser. The metabolic network can be accessed through a web interface or an iPad application. For metabolomics analysis, the Steady-State Metabolic Network Dynamics Analysis (SMDA) algorithm is implemented and integrated with the system. The SMDA tool is accessible through both the web-based interface and the iPad application for metabolomics analysis based on a metabolic profile. PathCase-MAW is a comprehensive system with various data input and data access sub-systems. It is easy to work with by design, and is a promising tool for metabolomics research and for educational purposes.

Flexible RNA design under structure and sequence constraints using formal languages
Yu Zhou, Yann Ponty, Stéphane Vialette, Jérôme Waldispuhl, Yi Zhang, Alain Denise
Pages: 229
doi: 10.1145/2506583.2506623

The problem of RNA secondary structure design is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, that structure. In several practical applications in biology, additional constraints must be taken into account, such as the presence or absence of regulatory motifs, either at a specific location or anywhere in the sequence. In this study, we investigate the design of RNA sequences from their targeted secondary structure, given these additional sequence constraints. To this end, we develop a general framework based on concepts of language theory, namely context-free grammars and finite-state automata. We efficiently combine a comprehensive set of constraints into a unifying context-free grammar of moderate size. From there, we use generic algorithms to perform a (weighted) random generation, or an exhaustive enumeration, of candidate sequences. The resulting method, whose complexity scales linearly with the length of the RNA, was implemented as a standalone program. The resulting software was embedded into a publicly available dedicated web server. The applicability of the method was demonstrated on a concrete case study dedicated to Exon Splicing Enhancers, in which our approach was successfully used in the design of in vitro experiments.
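As a much simpler stand-in for the grammar-based machinery described above, the two kinds of constraint (structure compatibility and motif avoidance) can be illustrated with rejection sampling over a dot-bracket structure. This helper is purely illustrative; the paper's grammar-based generation samples directly without rejection:

```python
import random

# canonical Watson-Crick pairs plus G-U wobble
PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")]
BASES = "ACGU"

def design_compatible(structure, forbidden=(), rng=None):
    # sample a sequence whose paired positions (matched brackets) carry
    # complementary bases and which avoids every forbidden motif
    rng = rng or random.Random(0)  # seeded for reproducibility
    n = len(structure)
    stack, partner = [], [None] * n
    for i, c in enumerate(structure):       # match ( with )
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    while True:                             # rejection loop over motifs
        seq = [None] * n
        for i in range(n):
            if seq[i] is not None:
                continue
            if partner[i] is None:
                seq[i] = rng.choice(BASES)
            else:
                seq[i], seq[partner[i]] = rng.choice(PAIRS)
        s = "".join(seq)
        if not any(m in s for m in forbidden):
            return s
```

Rejection sampling degrades badly as constraints accumulate, which is exactly why encoding all constraints into one context-free grammar, as the paper does, is the scalable approach.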

The TREC Medical Records Track
Ellen M. Voorhees
Pages: 239
doi: 10.1145/2506583.2506624

The Text REtrieval Conference (TREC) is a series of annual workshops designed to build the infrastructure for large-scale evaluation of search systems and thus improve the state-of-the-art. Each workshop is organized around a set of "tracks", challenge problems that focus effort in particular research areas. The most recent TRECs have contained a Medical Records track whose goal is to enable semantic access to the free-text fields of electronic health records. Such access will enhance clinical care and support the secondary use of health records. The specific search task used in the track was a cohort-finding task. A search request described the criteria for inclusion in a (possible, but not actually planned) clinical study and the systems searched a set of de-identified clinical reports to identify candidates who matched the criteria. As anticipated, the search results demonstrate that language use within electronic health records is sufficiently different from general use to warrant domain-specific processing. Top-performing systems each used some sort of vocabulary normalization device specific to the medical domain to accommodate the array of abbreviations, acronyms, and other informal terminology used to designate medical procedures and findings in the records. The use of negative language is also much more prevalent in health records (e.g., patient denies pain, no fever) and thus requires appropriate handling for good search results.
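The negation handling mentioned above can be illustrated with a crude NegEx-style cue check. This is a sketch only; the cue list and the "anywhere before the finding" scope rule are illustrative simplifications, not what track participants actually used:

```python
# illustrative cue list; real systems use far larger lexicons
NEG_CUES = ("no ", "denies ", "without ", "negative for ")

def is_negated(sentence, finding):
    # a negation cue appearing anywhere before the finding in the same
    # sentence flips its polarity (real systems bound the cue's scope,
    # e.g. to a window of tokens or up to the next conjunction)
    s = sentence.lower()
    i = s.find(finding.lower())
    return i >= 0 and any(cue in s[:i] for cue in NEG_CUES)
```

A retrieval system using such a check can avoid matching a cohort criterion like "fever" against a record that explicitly rules it out.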

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq
Mark F. Rogers, Christina Boucher, Asa Ben-Hur
Pages: 247
doi: 10.1145/2506583.2506625

Predicting the structure of genes from RNA-Seq data remains a significant challenge in bioinformatics. Although the amount of data available for analysis is growing at an accelerating rate, the capability to leverage these data to construct complete gene models remains elusive. In addition, the tools that predict novel transcripts exhibit poor accuracy. We present a novel approach to predicting splice graphs from RNA-Seq data that uses patterns of acceptor and donor sites to recognize when novel exons can be predicted unequivocally. This simple approach achieves much higher precision and higher recall than methods like Cufflinks or IsoLasso when predicting novel exons from real and simulated data. The ambiguities that arise from RNA-Seq data can preclude making decisive predictions, so we use a realignment procedure that can predict additional novel exons while maintaining high precision. We show that these accurate splice graph predictions provide a suitable basis for making accurate transcript predictions using tools such as IsoLasso and PSGInfer. Using both real and simulated data, we show that this integrated method predicts transcripts with higher recall and precision than using these other tools alone, and in comparison to Cufflinks. SpliceGrapherXT is available from the SpliceGrapher web page at http://SpliceGrapher.sf.net.

Classifying Immunophenotypes With Templates From Flow Cytometry
Ariful Azad, Arif Khan, Bartek Rajwa, Saumyadipta Pyne, Alex Pothen
Pages: 256
doi: 10.1145/2506583.2506627

We describe an algorithm to dynamically classify flow cytometry data samples into several classes based on their immunophenotypes. Flow cytometry data consists of fluorescence measurements of several proteins that characterize different cell types in blood or cultured cell lines. Each sample is initially clustered to identify the cell populations present in it. Using a combinatorial dissimilarity measure between cell populations in samples, we compute meta-clusters that correspond to the same cell population across samples. The collection of meta-clusters in a class of samples then describes a template for that class. We organize the samples into a template tree, and use it to classify new samples into existing classes or create a new class if needed. We dynamically update the templates and their statistical parameters as new samples are classified, so that the new information is reflected in the classes. We use our dynamic classification algorithm to classify T cells that on stimulation with an antibody show increased abundance of the proteins SLP-76 and ZAP-70. These proteins are involved in a platform that assembles signaling proteins in the immune response. We also use the algorithm to show that variation in an immune subsystem between individuals is a larger effect than variation in multiple samples from one individual.

An Ensemble Model for Mobile Device based Arrhythmia Detection
Kang Li, Suxin Guo, Jing Gao, Aidong Zhang
Pages: 266
doi: 10.1145/2506583.2506629

Recent advances in smart mobile device technology have resulted in the global availability of portable computing devices capable of performing many complex functions. With the ultimate intent of promoting human well-being, mobile device based arrhythmia detection (MAD) has recently attracted much attention. Without any guidance or supervision from experts, the performance of arrhythmia detection is usually unsatisfactory. Supervised learning can learn from labeled cardiac cycles to detect arrhythmias for each mobile device user if enough training data is provided. However, it is time-consuming, costly and sometimes impossible to have experts annotate enough training data for each user. To tackle this problem, we take advantage of publicly available and well-annotated data to infer knowledge that can serve as expertise for MAD. To reduce the space usage of the framework, we extract from each source of labeled data an expert model, which consists of a task-independent individual characteristic vector and a task-related preference vector. Multiple experts are then integrated into an ensemble model for arrhythmia detection. Both the space and time complexities of the proposed approach are theoretically analyzed and experimentally examined. To evaluate the performance of the method, we implement it on the MIT-BIH Arrhythmia Dataset and compare it with seven state-of-the-art methods in the area. Extensive experimental results show that the proposed algorithm outperforms all the baseline methods, which validates its effectiveness in MAD.
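The expert-combination step can be sketched as a similarity-weighted vote. All names and the weighting scheme here are illustrative assumptions, not the paper's exact model:

```python
def ensemble_detect(experts, x, similarity):
    # experts: (characteristic_vector, classify) pairs distilled from public
    # annotated ECG sources; `similarity` weighs each expert by how well its
    # characteristic vector matches the current user (hypothetical signature)
    weights = [similarity(cv) for cv, _ in experts]
    votes = sum(w * clf(x) for w, (_, clf) in zip(weights, experts))
    # weighted-majority decision over the 0/1 expert outputs
    return 1 if votes >= sum(weights) / 2 else 0
```

Storing only a compact characteristic vector and a preference vector per expert, rather than each source's full training data, is what keeps the on-device footprint small, consistent with the space analysis the abstract mentions.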

Classifying Proteins by Amino Acid Variations of Sequential Patterns
En-Shiun Annie Lee, Andrew K. C. Wong
Pages: 276
doi: 10.1145/2506583.2506630

Similarities and differences in protein sequence patterns can be used to reveal essential and class-specific functionality of protein families. Traditional supervised learning methods require class labels for classifying sequences but cannot reveal embedded patterns related to inherent functionality and taxonomical variations. We develop an algorithm for discovering statistically significant sequence patterns and then aligning and clustering them into Aligned Pattern Clusters (APCs). We measure an APC's classification ability 1) with semi-supervised information measures that require class labels: a) class entropy (H) for patterns and for each amino acid in a column, and b) class information gain (IG) for each column based on its class amino acid distribution; and 2) with unsupervised measures that do not rely on class labels: a) Entropy Redundancy (R1), which reflects amino acid conservation and diversity in a column, and b) Normalized Sum of Mutual Information Redundancy (SR2), which characterizes the dependence of a column on all the other columns in the APC. We applied our Aligned Pattern Synthesis Process to a) the spermidine/spermine-N1-acetyltransferase (SSAT), b) the cytochrome c, and c) the ubiquitin protein families. After validating the classification ability of each proposed measure on a simple synthetic data set and the SSAT data, we present selected results on the other two protein families. In all our experiments, we demonstrated the ability of each proposed measure and confirmed the correlation of SR2 with R1 and IG. Our experiments reveal how the sequence patterns of the rows and the amino acid distribution in each column can be associated with a class, which will be useful for amino acid substitution studies while avoiding dependence on class labels, which are often unavailable, inaccurate, or unbalanced. Properties of the measures, computational efficiency, and the biological impact of the algorithms are discussed in the paper.

PRASE: PageRank-based Active Subnetwork Extraction
Ayat Hatem, Kamer Kaya, Ümit V. Çatalyürek
Pages: 286
doi: 10.1145/2506583.2506631

Integrating protein-protein interaction networks with gene expression data to extract active subnetworks has been shown to be promising in detecting meaningful biomarkers for cancer and other diseases. Lately, the RNA-Seq technology has become the new standard for measuring gene expression. Existing algorithms either cannot handle RNA-Seq data or return large subnetworks which are hard to analyze. Therefore, new approaches are needed to utilize RNA-Seq data and integrate it into the subnetwork extraction process. In this work, using RNA-Seq data, we propose a new workflow, PRASE, to obtain more focused subnetworks which contain important genes even if they are not differentially expressed. Although the hub nodes in the PPI network may be good candidates for such genes, they are not the only ones. A gene which is not differentially expressed and which does not have many interactions with other genes can still be functional in many critical pathways. To prioritize such genes, PRASE employs the well-known PageRank algorithm and applies preprocessing to the gene expression p-values. Then, it applies a scaling function to construct new p-values for the genes, which are then used with existing active subnetwork extraction tools to generate the final subnetwork. We applied our workflow to colorectal cancer, oligodendroglioma tumor, and breast cancer datasets. Our evaluation shows that, using PRASE, we can obtain more specialized subnetworks which contain information that is overlooked by existing approaches.
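The PageRank step can be sketched with a small power iteration. This is pure-Python and illustrative: seeding the personalization from expression p-values (e.g., 1 - p) and the final rescaling are our assumptions, not PRASE's exact preprocessing or scaling function:

```python
def pagerank(adj, personalization, damping=0.85, iters=100):
    # personalized PageRank by power iteration on an undirected gene graph;
    # adj maps each gene to its neighbor list, personalization to a weight
    total = sum(personalization.values())
    pref = {n: personalization.get(n, 0) / total for n in adj}
    pr = dict(pref)
    for _ in range(iters):
        pr = {n: (1 - damping) * pref[n]
                 + damping * sum(pr[m] / len(adj[m]) for m in adj if n in adj[m])
              for n in adj}
    return pr
```

A gene with a modest p-value but a central position in the interaction graph accumulates rank from its neighbors, which is how a PageRank-based workflow can promote genes that differential expression alone would overlook.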

Evaluation of Label Dependency for the Prediction of HLA Genes
Vanja Paunić, Michael Steinbach, Abeer Madbouly, Vipin Kumar
Pages: 296
doi: 10.1145/2506583.2506632

The Human Leukocyte Antigen (HLA) gene system plays a crucial role in hematopoietic stem cell transplantation, where patients and donors are matched with respect to their HLA genes in order to maximize the chances of a successful transplant. It is the most polymorphic region of the human genome with some of the strongest associations with autoimmune, infectious, and inflammatory diseases. The availability of HLA data is, therefore, of high importance to clinicians and researchers. However, due to its high polymorphism, obtaining it is time- and cost-prohibitive. We previously described a method for the prediction of HLA genes from widely available Single Nucleotide Polymorphism (SNP) data. In this paper we show that using HLA gene dependency information improves prediction performance on multiple real-world data sets. More specifically, we propose and evaluate different approaches for integrating HLA gene dependency into the prediction process. The results from experiments on two real data sets show that adding dependency information is a valuable asset for HLA gene prediction, particularly for smaller data sets.

Topological properties of chromosome conformation graphs reflect spatial proximities within chromatin
Hao Wang, Geet Duggal, Rob Patro, Michelle Girvan, Sridhar Hannenhalli, Carl Kingsford
Pages: 306
doi: 10.1145/2506583.2506633

Recent chromosome conformation capture (3C) experiments produce genome-wide networks of chromatin interactions to help study how chromosome structures relate to genomic functions. We investigate whether properties of chromatin interaction graphs based on shortest paths, maximum flows, and dense cores correlate with spatial proximity in a three-dimensional model of the yeast genome. We demonstrate that within automatically detected dense subgraphs, which correspond to spatially compact cores of interacting chromatin, these properties are well correlated with spatial volume. We show that all tested methods are able to identify spatially compact sets when the test sets contain fragments from several chromosomes. We use a framework for systematically evaluating whether a method can accurately assess the spatial enrichment of a set of genomic loci for a hypothesized biological function. In such regions, we observe that the sets of fragments contained in the maximum density subgraph overlap highly with the sets of fragments in the spatially compact cores. Further, we observe that all methods agree on the spatial closeness of the yeast genomic annotations. Together, these results show that, compared to the more computationally complex and expensive three-dimensional embedding approach, the topological features of 3C graphs can be used to detect spatial closeness directly.

Improving phosphopeptide identification in shotgun proteomics by supervised filtering of peptide-spectrum matches
Sujun Li, Randy J. Arnold, Haixu Tang, Predrag Radivojac
Pages: 316
doi: 10.1145/2506583.2506634

One of the important objectives in mass spectrometry-based proteomics is the identification of post-translationally modified sites in cellular and extracellular proteomes. Proteomics techniques have been particularly effective in studying protein phosphorylation, where tens of thousands of new sites have been recently discovered in all domains of life. Such massive discovery of new sites has been facilitated by progress in affinity enrichment techniques, high-throughput analytical platforms that couple liquid chromatography (LC) and tandem mass spectrometry (MS/MS), and also powerful computational tools that assign peptides to tandem mass spectra. In this work we focus on computational protocols for identifying phosphoproteins, phosphopeptides, and phosphosites. Although the current tools already provide solid results, most methods have not been tuned to exploit particular sequence and physicochemical properties of phosphopeptides or the peculiarities of their fragment spectra. Therefore, novel algorithms can be designed to increase the sensitivity of phosphosite identification. Here we describe a machine learning-based method that improves the identification of phosphopeptides in LC-MS/MS experiments. Our algorithm is applied as a post-processing step to a standard database search. It assigns a probability score to each peptide-spectrum match (PSM) corresponding to a phosphopeptide, based on the sequence and spectral features of the peptide and its assigned fragment spectra as well as the biological propensity of particular residues in the peptide to be phosphorylated. The algorithm is based on a simple but robust logistic regression model and is used together with a conventional search engine (here, MASCOT) to filter out the PSMs with the lowest probability of being correctly identified. Our protocol was tested on two large phosphoproteomics data sets, on which it increased the number of identified phosphopeptides by 10-15% compared to conventional scoring algorithms at the same false discovery rate threshold of 1%.
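The post-processing step described above amounts to scoring each PSM with a logistic model and filtering by probability. A minimal sketch of that idea, with hypothetical feature names and weights (the paper's actual feature set and trained coefficients are not reproduced here):

```python
import math

def psm_probability(features, weights, bias=0.0):
    """Logistic-regression score: estimated probability that a
    phosphopeptide PSM is correct, given numeric features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def filter_psms(psms, weights, bias=0.0, threshold=0.5):
    """Keep only PSMs whose estimated probability meets the threshold."""
    return [p for p in psms
            if psm_probability(p["features"], weights, bias) >= threshold]

# Hypothetical PSMs with features [engine_score, delta_score, site_propensity]
psms = [
    {"id": "psm1", "features": [3.2, 1.1, 0.8]},
    {"id": "psm2", "features": [-2.0, 0.1, 0.2]},
]
weights = [1.0, 0.5, 2.0]   # assumed for illustration, not learned here
kept = filter_psms(psms, weights)
```

In the actual protocol the weights would be fit to labeled training data rather than fixed by hand.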

Performance Model Selection for Learning-based Biological Image Analysis on a Cluster
Jie Zhou, Anthony Brunson, John Winans, Kirk Duffin, Nicholas Karonis
Pages: 324
doi: 10.1145/2506583.2506639
Full text: PDF

Microscopic images with increased scale and content call for high performance computing when applying automatic tools for biological image analysis. Speed of analysis can be improved at various stages. In learning-based models, selecting suitable algorithms for a given problem can be a lengthy process given the large pool of algorithms and the variety of biological problems. In this paper, we describe a portable method for efficiently and adaptively selecting an effective model for biological image classification as a step toward the goal of achieving high throughput biological image analysis. We implemented a high performance tool which extends the bioimage classification and annotation platform BIOCAT by deploying the model selection process on a cluster using a distributed design based on remote method invocation. The high performance model selection, when tested and compared using ten benchmarking data sets, is shown not only to dramatically increase the speed of the learning process, but also to improve accuracy on several state-of-the-art data sets for bioimage classification. These achievements are attributed to the combined power of BIOCAT's adaptive model selection and the capability of distributed model evaluation. The tool is deployable to various types of distributed environments.

An Ensemble Topic Model for Sharing Healthcare Data and Predicting Disease Risk
Andrew K. Rider, Nitesh V. Chawla
Pages: 333
doi: 10.1145/2506583.2506640
Full text: PDF

With the recent signing of the Affordable Care Act into law, the use of electronic medical data is set to become ubiquitous in the United States. This presents an unprecedented opportunity to use population health data for the benefit of patient-centered outcomes. However, there are two major hurdles to utilizing this wealth of data. First, medical data is not centrally located but is often divided across hospital systems, health exchanges, and physician practices. Second, sharing specific or identifiable information may not be allowed. Moreover, organizations may have a vested interest in keeping their data sets private, as they may have been gathered and curated at great cost. We develop an approach to allow the sharing of beneficial information while staying within the bounds of data privacy. We show that the use of a probabilistic graphical model can facilitate effective transfer learning between distinct healthcare data sets through parameter sharing, while simultaneously allowing us to construct a network for interpretation by domain experts and the discovery of disease relationships. Our method utilizes aggregate information from distinct populations to improve the estimation of patient disease risk.

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Anas Abu-Doleh, Erik Saule, Kamer Kaya, Ümit V. Çatalyürek
Pages: 341
doi: 10.1145/2506583.2506641
Full text: PDF

Fast and robust algorithms and aligners have been developed to help researchers analyze genomic data, whose size has increased dramatically in the last decade due to technological advancements in DNA sequencing. Not only the size but also the characteristics of the data have changed: one current concern is that the length of the reads is increasing. Although existing algorithms can still be used to process this new data, considering its size and changing structure, new and more efficient approaches are required. In this work, we address the problem of accurate sequence alignment on GPUs and propose a new tool, Masher, which processes long (and short) reads efficiently and accurately. The algorithm employs a novel indexing technique that produces an index for the 3,137 Mbp human genome (hg19) with a memory footprint small enough to be stored in a restricted-memory device such as a GPU. The results show that Masher is faster than state-of-the-art tools and achieves good accuracy and sensitivity on sequencing data with various characteristics.
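The hash-based indexing idea behind such aligners can be illustrated with a toy k-mer index (a deliberately naive sketch; Masher's compact, GPU-resident index layout is far more involved, and the names below are illustrative):

```python
def build_kmer_index(reference, k):
    """Toy hash-based genome index: map every k-mer of the reference
    to the list of positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def seed_positions(index, read, k):
    """Look up all k-mer seeds of a read; each hit, shifted back by
    the seed's offset in the read, is a candidate mapping position."""
    hits = set()
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], []):
            hits.add(pos - off)
    return hits

ref = "ACGTACGTTT"
index = build_kmer_index(ref, 4)
cands = seed_positions(index, "ACGTTT", 4)   # true origin is position 4
```

A real mapper would then verify each candidate position with a full alignment step.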

Suffix-Tree Based Error Correction of NGS Reads Using Multiple Manifestations of an Error
Daniel M. Savel, Thomas LaFramboise, Ananth Grama, Mehmet Koyutürk
Pages: 351
doi: 10.1145/2506583.2506644
Full text: PDF

Next Generation Sequencing (NGS) technologies produce large quantities of short reads with relatively high error rates. Erroneous reads that cannot be aligned are either ignored during de-novo sequencing or must be suitably corrected. Such reads pose problems for mapping as well, since it is difficult to distinguish errors from true variants. Methods for detection and correction of errors typically rely on frequencies of substrings of the reads. Suffix trees are often utilized for this purpose, since they can be used to index and count the frequencies of substrings of all lengths. Existing suffix-tree based methods detect errors by identifying statistically under-represented branches (suffixes) and fixing them. However, they do not refer back to the reads to put the correction in context. Since an error in a single read manifests itself at multiple nodes of a suffix tree, a read-driven approach that relies on these multiple manifestations is expected to perform better. Based on this observation, we develop an algorithm, PLURIBUS, which reconciles corrections suggested by multiple manifestations of an error using a voting scheme. We compare the accuracy of PLURIBUS in detecting and correcting errors against existing error correction techniques using simulated sequencing data. We also assess the impact of error correction on the performance of sequence assembly. Our results show that PLURIBUS corrects errors with improved precision and enables the assembler to generate longer contigs, particularly when the genome is longer or coverage is lower. PLURIBUS is freely available at http://compbio.case.edu/pluribus/.

Genomic Sequence Fragment Identification using Quasi-Alignment
Anurag Nagar, Michael Hahsler
Pages: 359
doi: 10.1145/2506583.2506647
Full text: PDF

Identification of organisms using their genetic sequences is a popular problem in molecular biology and is used in fields such as metagenomics, molecular phylogenetics, and DNA barcoding. These applications depend on searching large sequence databases for individual matching sequences (e.g., with BLAST) and comparing sequences using multiple sequence alignment (e.g., via Clustal), both of which are computationally expensive and require extensive server resources. We propose a novel method for sequence comparison, analysis, and classification which avoids the need to align sequences at the base level or search a database for similarity. Instead, our method uses alignment-free methods to find probabilistic quasi-alignments for longer (typically 100 base pair) segments. Clustering is then used to create compact models that can be used to analyze a set of sequences and to score and classify unknown sequences against these models. In this paper we expand prior work in two ways: we show how quasi-alignments can be expanded into larger quasi-aligned sections, and we develop a method to classify short sequence fragments. The latter is especially useful when working with Next-Generation Sequencing (NGS) techniques that generate output in the form of relatively short reads. We have conducted extensive experiments using fragments from bacterial 16S rRNA sequences obtained from the Greengenes project, and our results show that the new quasi-alignment based approach can provide excellent results and overcome some of the restrictions of the widely used Ribosomal Database Project (RDP) classifier.

Measuring Relatedness Between Scientific Entities in Annotation Datasets
Guillermo Palma, Maria-Esther Vidal, Eric Haag, Louiqa Raschid, Andreas Thor
Pages: 367
doi: 10.1145/2506583.2506651
Full text: PDF

Linked Open Data has made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms (CV terms) from ontologies. These semantic annotations encode scientific knowledge which is captured in annotation datasets. One can mine these datasets to discover relationships and patterns between entities. Determining the relatedness (or similarity) between entities becomes a building block for graph pattern mining, e.g., identifying drug-drug relationships could depend on the similarity of the diseases (conditions) that are associated with each drug. Diverse similarity metrics have been proposed in the literature, e.g., i) string-similarity metrics; ii) path-similarity metrics; iii) topological-similarity metrics; all measure relatedness in a given taxonomy or ontology. In this paper, we consider a novel annotation similarity metric AnnSim that measures the relatedness between two entities in terms of the similarity of their annotations. We model AnnSim as a 1-to-1 maximal weighted bipartite match, and we exploit properties of existing solvers to provide an efficient solution. We empirically study the effectiveness of AnnSim on real-world datasets of genes and their GO annotations, clinical trials, and a human disease benchmark. Our results suggest that AnnSim can provide a deeper understanding of the relatedness of concepts and can provide an explanation of potential novel patterns.
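The 1-to-1 maximal weighted bipartite match at the core of a metric like AnnSim can be sketched by brute force (illustrative only; this assumes the first annotation set is no larger than the second, and real implementations use an efficient assignment solver such as the Hungarian algorithm):

```python
from itertools import permutations

def ann_sim(sim):
    """Best 1-to-1 matching score between two annotation sets, given a
    pairwise similarity matrix `sim` (rows = annotations of entity A,
    columns = annotations of entity B). Assumes len(sim) <= len(sim[0]).
    Brute force over column permutations, so only usable for tiny inputs."""
    rows, cols = len(sim), len(sim[0])
    k = min(rows, cols)
    best = 0.0
    for perm in permutations(range(cols), k):
        total = sum(sim[i][j] for i, j in zip(range(k), perm))
        best = max(best, total)
    return best / k   # normalize by the number of matched pairs

# Hypothetical annotation-to-annotation similarities for two entities
sim = [[0.9, 0.1],
       [0.2, 0.8]]
```

Here the best matching pairs annotation 0 with 0 and 1 with 1, giving (0.9 + 0.8) / 2 = 0.85.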

Identification of gene clusters with phenotype-dependent expression with application to normal and premature ageing
Kun Wang, Avinash Das, Zheng-Mei Xiong, Kan Cao, Sridhar Hannenhalli
Pages: 377
doi: 10.1145/2506583.2506652
Full text: PDF

Background: Hutchinson-Gilford progeria syndrome (HGPS) is a rare genetic disease with symptoms of aging manifested at a very early age. The molecular basis of HGPS is not entirely clear, although there are some known and other presumed overlaps with the normal aging process. Comparative investigation of biological processes associated with HGPS and normal aging may reveal common and distinctive pathways underlying these two conditions. Results: To investigate transcriptome changes through aging, we performed RNA-seq profiling in fibroblast cell cultures at three different cellular ages, as measured by the number of passages through culture growth, both from HGPS patients and matched normal samples. We then developed a novel iterative multiple regression approach that leverages co-expressed gene clusters to identify gene clusters whose expression changes significantly with age and/or disease state, and we establish the robustness of this approach. Finally, we perform a comparative investigation of biological processes underlying normal aging and HGPS. Conclusion: Based on an iterative multiple regression approach applied to novel RNA-seq data in HGPS and aging, our results recapitulate the previously known processes underlying aging while at the same time suggesting numerous processes unique to aging and HGPS.

Meta-analysis of Genomic and Proteomic Features to Predict Synthetic Lethality of Yeast and Human Cancer
Min Wu, Xuejuan Li, Fan Zhang, Xiaoli Li, Chee-Keong Kwoh, Jie Zheng
Pages: 384
doi: 10.1145/2506583.2506653
Full text: PDF

A major goal in cancer medicine is to find selective drugs with reduced side-effects. A pair of genes is called synthetic lethal (SL) if mutations of both genes will kill a cell while mutation of either gene alone will not. Hence, a gene in SL interactions with a cancer-specific mutated gene will be a promising drug target with anti-cancer selectivity. The wet-lab screening approach is still so costly that, even for yeast, only a small fraction of gene pairs has been covered. Computational methods are therefore important for large-scale discovery of SL interactions. Most existing approaches focus on individual features or machine learning methods, which are prone to noise or overfitting. In this paper, we propose a meta-analysis approach that integrates 17 genomic and proteomic features and the outputs of 10 classification methods. It thus combines the strengths of existing methods. It also adjusts the relative contributions of the multiple methods with weights learned from the training data. Running on a dataset of the yeast S. cerevisiae, our method achieves an AUC (area under the ROC curve) of 87.2%, the highest among all competitors. Moreover, through orthologous mapping from yeast to human genes, we predicted a list of SL pairs in human that contains top mutated genes in lung and breast cancers recently reported by The Cancer Genome Atlas (TCGA). Our method and predictions could shed light on mechanisms of SL and lead to the discovery of novel anti-cancer drugs.
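The weighted combination of classifier outputs at the heart of such a meta-analysis can be sketched as follows (hypothetical scores and weights for one gene pair; in the paper the weights are learned from training data):

```python
def ensemble_score(method_scores, weights):
    """Weighted average of per-method SL-prediction scores
    (each assumed to lie in [0, 1])."""
    total_w = sum(weights)
    return sum(w * s for w, s in zip(weights, method_scores)) / total_w

# Hypothetical outputs of three classifiers for one candidate gene pair
scores = [0.9, 0.6, 0.7]
weights = [2.0, 1.0, 1.0]   # assumed weights, for illustration only
combined = ensemble_score(scores, weights)
```

A method that performed better on training data gets a larger weight and thus a larger say in the combined prediction.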

Temporal Relation Identification and Classification in Clinical Notes
Jennifer D'Souza, Vincent Ng
Pages: 392
doi: 10.1145/2506583.2506654
Full text: PDF

We examine the task of temporal relation classification for the clinical domain. Our approach to this task departs from existing ones in that it is (1) knowledge-rich, employing sophisticated knowledge derived from semantic and discourse relations, and (2) hybrid, combining the strengths of rule-based and learning-based approaches. Evaluation results on the i2b2 Clinical Temporal Relations Challenge corpus show that our approach yields a 15--21% and 6--13% relative reduction in error over a state-of-the-art learning-based baseline system when gold-standard and automatically identified temporal relations are used, respectively.

SESSION: Short Papers


GapsMis: flexible sequence alignment with a bounded number of gaps
Carl Barton, Tomáš Flouri, Costas S. Iliopoulos, Solon P. Pissis
Pages: 402
doi: 10.1145/2506583.2506584
Full text: PDF

Motivation: Recent developments in next-generation sequencing technologies have renewed interest in pairwise sequence alignment techniques, particularly so for the application of re-sequencing---the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and the high-quality fragment of a short read, an important problem is to find the best possible alignment between a succeeding factor of the reference sequence and the remaining low-quality part of the read, allowing a number of mismatches and the insertion of gaps in the alignment. Results: We present GapsMis, a tool for pairwise global and semi-global sequence alignment with a variable, but bounded, number of gaps. It is based on a new algorithm, which computes a different version of the traditional dynamic programming matrix. Millions of pairwise sequence alignments, performed under realistic conditions based on the properties of real full-length genomes, show that GapsMis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches. Availability: http://www.exelixis-lab.org/gapmis

An Image-Text Approach for Extracting Experimental Evidence of Protein-Protein Interactions in the Biomedical Literature
Luis D. Lopez, Jingyi Yu, Cecilia N. Arighi, Manabu Torii, K. Vijay-Shanker, Hongzhan Huang, Cathy H. Wu
Pages: 412
doi: 10.1145/2506583.2506585
Full text: PDF

Proteins are complex biological polymers that mediate virtually all cellular functions. Typically these functions are modulated by protein-protein interactions (PPI). Tremendous efforts have been made by life scientists to detect PPIs through different experimental approaches and document the results through publications. On the informatics front, however, an effective means for retrieving PPI information from the published literature is lacking. In this work we present a novel framework for identifying the experimental methods employed for analyzing PPI in biomedical articles. Different from state-of-the-art approaches based only on text, we explore using the combination of attributes from figures, figure captions, and text within figures for identifying PPI experimental methods. Our work is motivated by the observation that biomedical figures often constitute direct evidence of experimental results and therefore provide complementary information to text. We start by automatically extracting unimodal panels (subfigures) and their associated subcaptions and then classifying the subfigures into different types using a proposed hierarchical image taxonomy. Next, we combine the subfigure types with text-based features to form a hybrid feature descriptor and use it for PPI method classification. We further construct a dataset starting from a set of 2,256 documents provided by the molecular interaction database MINT. Here we show that our new approach outperforms the text-only solution for associating figures with PPI methods.

An Island-Based Approach for Differential Expression Analysis
Abdallah M. Eteleeb, Robert M. Flight, Benjamin J. Harrison, Jeffrey C. Petruska, Eric C. Rouchka
Pages: 419
doi: 10.1145/2506583.2506589
Full text: PDF

High-throughput mRNA sequencing (also known as RNA-Seq) promises to be the technique of choice for studying transcriptome profiles. This technique provides the ability to develop precise methodologies for transcript and gene expression quantification, novel transcript and exon discovery, and splice variant detection. One of the limitations of current RNA-Seq methods is the dependency on annotated biological features (e.g. exons, transcripts, genes) to detect expression differences across samples. This restricts the identification of expression levels and the detection of significant changes to known genomic regions, so any significant changes that occur in unannotated regions will not be captured. To overcome this limitation, we developed a novel segmentation approach, Island-Based (IB), for analyzing differential expression in RNA-Seq and targeted sequencing (exome capture) data without specific knowledge of an isoform. The IB segmentation determines individual islands of expression based on windowed read counts that can be compared across experimental conditions to determine differential island expression. In order to detect differentially expressed genes, the significance values (p-values) of individual islands are combined using Fisher's method. We tested and evaluated the performance of our approach by comparing it to the existing differentially expressed gene (DEG) methods CuffDiff, DESeq, and edgeR using two benchmark MAQC RNA-Seq datasets. The IB algorithm outperforms all three methods on both datasets, as illustrated by an increased auROC.
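Fisher's method, used here to combine per-island p-values into a gene-level significance, has a simple closed form: the statistic X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the survival function can be computed directly. A self-contained sketch:

```python
import math

def fisher_combine(pvalues):
    """Fisher's method for combining k independent p-values.
    X = -2 * sum(ln p_i) ~ chi-square with 2k degrees of freedom;
    for even df the survival function is
    exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    half = x / 2.0
    term, total = 1.0, 1.0          # i = 0 term of the series
    for i in range(1, k):
        term *= half / i            # (x/2)^i / i!, built incrementally
        total += term
    return math.exp(-half) * total
```

With a single p-value the combined result is that p-value itself; several moderately small p-values combine into a smaller one.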

Multi-Objective Stochastic Search for Sampling Local Minima in the Protein Energy Surface
Brian Olson, Amarda Shehu
Pages: 430
doi: 10.1145/2506583.2506590
Full text: PDF

We present an evolutionary stochastic search algorithm to obtain a discrete representation of the protein energy surface in terms of an ensemble of conformations representing local minima. This objective is of primary importance in protein structure modeling, whether the goal is to obtain a broad view of potentially different structural states thermodynamically available to a protein system or to predict a single representative structure of a unique functional native state. In this paper, we focus on the latter setting, and show how approaches from evolutionary computation for effective stochastic search and multi-objective analysis can be combined to result in protein conformational search algorithms with high exploration capability. From a broad computational perspective, the contributions of this paper are on how to balance global and local search of some high-dimensional search space and how to guide the search in the presence of a noisy, inaccurate scoring function. From an application point of view, the contributions are demonstrated in the domain of template-free protein structure prediction on the primary subtask of sampling diverse low-energy decoy conformations of an amino-acid sequence. Comparison with the approach used for decoy sampling in the popular Rosetta protocol on 20 diverse protein sequences shows that the evolutionary algorithm proposed in this paper is able to access lower-energy regions with similar or better proximity to the known native structure.

Identifying protein complexes in AP-MS data with negative evidence via soft Markov clustering
Yu-Keng Shih, Srinivasan Parthasarathy
Pages: 440
doi: 10.1145/2506583.2506591
Full text: PDF

Protein complexes are key units for discovering protein mechanisms. Traditional protein complex identification methods apply a soft (overlapping) network clustering algorithm to a protein-protein interaction (PPI) network and predict the clusters as protein complexes. Recently, the AP-MS technique, together with scoring methods, has made it possible to measure the co-complex relationship among proteins. Unlike traditional PPI networks, AP-MS can provide negative evidence, which indicates which proteins are unlikely to be in the same protein complex. However, most existing network clustering algorithms cannot utilize this negative similarity score. In this paper, we propose a soft network clustering algorithm, SR-MCL-N, which can take negative similarity scores into account. SR-MCL-N is a variation of a previous algorithm, SR-MCL, a network clustering algorithm based on transition flow. Additionally, since the scoring approach we use produces a dense similarity matrix, a sparsification technique is adopted on the similarity matrix. Based on the gold standard CYC2008 and GO terms, we first show that the sparsification can not only speed up SR-MCL-N, but also let SR-MCL-N generate more accurate clusters. SR-MCL-N is then compared against SR-MCL and a hierarchical algorithm which also considers negative similarity scores. The results indicate that our algorithm outperforms the others, since SR-MCL-N not only generates overlapping clusters but also takes negative similarity scores into account.

Stable Feature Selection with Minimal Independent Dominating Sets
Le Shu, Tianyang Ma, Longin Jan Latecki
Pages: 450
doi: 10.1145/2506583.2506600
Full text: PDF

In this paper, we focus on stable selection of relevant features. The main contribution is a novel framework for selecting the most informative features which can preserve the linear combination property of the original feature space. We propose a novel formulation of this problem as selection of a minimal independent dominating set (MIDS). A MIDS of a feature graph is a smallest subset such that no two of its nodes are connected and all other nodes are connected to at least one node in it. In this way, the diversity and coverage of the original feature space can be preserved. Furthermore, the proposed MIDS framework complements standard feature selection algorithms like SVM-RFE, stability lasso, and ensemble SVM-RFE. When these algorithms are applied to feature subsets selected by MIDS as opposed to all the input features, they select more stable features and achieve better prediction accuracy, as our experimental results clearly demonstrate.
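The independent dominating set idea can be sketched greedily: any maximal independent set of a graph is also a dominating set, so repeatedly picking a node and discarding its neighbors yields an independent dominating set (minimality of its size is only heuristic here, and the toy graph and greedy rule below are illustrative, not the paper's exact procedure):

```python
def greedy_mids(adj):
    """Greedy maximal independent set on a feature graph, given as a
    dict mapping each node to the set of its neighbors. The result is
    independent and dominating by construction."""
    remaining = set(adj)
    selected = set()
    while remaining:
        # heuristic: pick the node that dominates the most remaining nodes
        v = max(remaining, key=lambda u: len(adj[u] & remaining))
        selected.add(v)
        remaining -= adj[v] | {v}   # v and its neighbors are now covered
    return selected

# Toy feature graph: an edge means two features are highly correlated
adj = {
    "f1": {"f2", "f3"},
    "f2": {"f1"},
    "f3": {"f1"},
    "f4": set(),
}
chosen = greedy_mids(adj)
```

On this toy graph the greedy rule keeps "f1" (covering its correlated neighbors "f2" and "f3") and the isolated feature "f4".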

Designing Autocorrelated Genes
Rukhsana Yeasmin, Jesmin Jahan Tithi, Jeffrey Chen, Steven Skiena
Pages: 458
doi: 10.1145/2506583.2506604
Full text: PDF

The redundancy in the genetic code enables a protein to be encoded by many different sequences. Recent studies show that the degree of tRNA autocorrelation in a coding sequence has important effects on translation speed. The tRNA pairing index (TPI) has been used widely to study the phenomenon of autocorrelation in sequences. However, TPI only counts successive transitions of tRNA usage, without regard to how far apart they occur in the sequence. In this paper, we propose a new type of autocorrelation measure, DICA (Distance Incorporated Codon Autocorrelation), which weighs the positional distance between codons as well as the number of transitions. We demonstrate that DICA correlates better with the expression level of a particular gene than TPI does. Finally, we devise exact and heuristic algorithms to find near-optimally autocorrelated and anti-autocorrelated genes for the purposes of synthetic gene design.
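One plausible shape for a distance-incorporated autocorrelation measure is sketched below (a toy illustration only, not the paper's actual DICA definition): each pair of identical codons contributes more when the two occurrences are close together, so nearby reuse of the same codon, and hence the same tRNA, is rewarded.

```python
def distance_weighted_autocorrelation(codons, decay=0.5):
    """Toy distance-weighted codon autocorrelation: every pair of
    identical codons contributes decay**(gap between them), so
    adjacent repeats count fully and distant repeats count less.
    This is an illustrative stand-in for DICA, not its exact formula."""
    score = 0.0
    for i in range(len(codons)):
        for j in range(i + 1, len(codons)):
            if codons[i] == codons[j]:
                score += decay ** (j - i - 1)
    return score

autocorrelated = ["CTG", "CTG", "GAA", "GAA"]   # repeats adjacent
anticorrelated = ["CTG", "GAA", "CTG", "GAA"]   # repeats spread apart
```

Under this toy measure the autocorrelated arrangement scores higher than the anti-autocorrelated one for the same codon usage.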

Cloud4SNP: Distributed Analysis of SNP Microarray Data on the Cloud
Giuseppe Agapito, Mario Cannataro, Pietro Hiram Guzzi, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
Pages: 468
doi: 10.1145/2506583.2506605
Full text: PDF

Pharmacogenomics studies the impact of patients' genetic variation on drug responses and searches for correlations between gene expression or Single Nucleotide Polymorphisms (SNPs) of a patient's genome and the toxicity or efficacy of a drug. SNP data, produced by microarray platforms, need to be preprocessed and analyzed in order to find correlations between the presence/absence of SNPs and the toxicity or efficacy of a drug. Due to the large number of samples and the high resolution of instruments, the data to be analyzed can be very large, requiring high performance computing. The paper presents the design and experimentation of Cloud4SNP, a novel Cloud-based bioinformatics tool for the parallel preprocessing and statistical analysis of pharmacogenomics SNP microarray data. Experimental evaluation shows good speed-up and scalability. Moreover, availability on the Cloud platform allows the requirements of both small and very large pharmacogenomics studies to be met in an elastic way.

Application of a MAX-CUT Heuristic to the Contig Orientation Problem in Genome Assembly
Paul Bodily, Mark J. Clement, Quinn Snell, Jared C. Price, Stanley Fujimoto, Nozomu Okuda
Pages: 476
doi: 10.1145/2506583.2506606
Full text: PDF

In the context of genome assembly, the contig orientation problem is described as the problem of removing sufficient edges from the scaffold graph so that the remaining subgraph assigns a consistent orientation to all sequence nodes in the graph. This problem can also be phrased as a weighted MAX-CUT problem. The performance of MAX-CUT heuristics in this application is untested. We present a greedy heuristic solution to the contig orientation problem and compare its performance to a weighted MAX-CUT semi-definite programming heuristic solution on several graphs. We note that the contig orientation problem can be used to identify inverted repeats and inverted haplotypes, as these represent sequences whose orientation appears ambiguous in the conventional genome assembly framework.
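A greedy local-search sketch of the MAX-CUT formulation is shown below (the two sides of the cut correspond to the two possible contig orientations; the edge weights and the simple flip rule are illustrative, not the paper's exact heuristic):

```python
def greedy_max_cut(n, edges):
    """Greedy MAX-CUT heuristic: start with all nodes on side 0, then
    repeatedly flip any node whose flip increases the total cut weight,
    until no single flip helps. `edges` is a list of (u, v, w) where
    w > 0 is the weight of evidence that contigs u and v should be
    oriented oppositely (i.e. an edge we want the cut to separate)."""
    side = [0] * n
    nbrs = {}
    for u, v, w in edges:
        nbrs.setdefault(u, []).append((v, w))
        nbrs.setdefault(v, []).append((u, w))

    def gain(u):
        # change in cut weight if node u switches sides
        return sum(w if side[u] == side[v] else -w
                   for v, w in nbrs.get(u, []))

    improved = True
    while improved:
        improved = False
        for u in range(n):
            if gain(u) > 0:
                side[u] = 1 - side[u]
                improved = True
    cut = sum(w for u, v, w in edges if side[u] != side[v])
    return side, cut

# Three contigs: strong evidence that 0-1 and 1-2 are opposite,
# weak (conflicting) evidence that 0-2 are opposite
edges = [(0, 1, 5.0), (1, 2, 5.0), (0, 2, 1.0)]
side, cut = greedy_max_cut(3, edges)
```

On this toy instance the heuristic cuts both heavy edges (weight 10 of the 11 total), correctly sacrificing the weak conflicting edge in the odd cycle.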
|
|
|
Visual Analytics to Optimize Patient-Population Evidence Delivery for Personalized Care |
| |
Ketan K. Mane,
Phillips Owen,
Charles Schmitt,
Kirk Wilhelmsen,
Kenneth Gersing,
Ricardo Pietrobon,
Igor Akushevich
Pages: 484
doi: 10.1145/2506583.2506608
Electronic medical records (EMR) can be used to identify cohorts of patients who are clinically comparable to an individual patient. In this paper, we describe an approach that applies visual analytics to EMR data to describe the clinical course for an individual patient, display outcomes for a comparable cohort stratified by treatment, and generate predictions regarding a patient's clinical course based on treatment options. The visual display of information is designed to help clinicians choose among alternative therapies based on the EMR-derived outcomes of the cohort.
A Confidence Measure for Model Fitting with X-Ray Crystallography Data
Yang Lei,
Ramgopal R. Mettu
Pages: 489
doi: 10.1145/2506583.2506609
Structure determination from X-ray crystallography requires numerous stages of iterative refinement between real and reciprocal space. Current methods that fit a model structure to X-ray data therefore utilize a refined experimental electron density map along with a scoring function that characterizes the fit of the density map to structure. Additional information (e.g., from an energy function or conformational statistics) may supplement this score. In this paper, we derive a novel confidence measure for fitting model fragments into X-ray crystallography data. Given any set of conformations under consideration (e.g., a set of sidechain rotamers, or backbone fragments), and a scoring function for those conformations (e.g., least squares fit of the associated model density maps), we give a general-purpose method for assessing the confidence of the best-fit model. For the commonly used least-squares measure of fit, our method analyzes the statistics of the matching scores and estimates the probability that the best-fit conformation is the correct underlying model. To our knowledge, ours is the first method for computing such a confidence measure. To demonstrate the practical utility of our method, we study the problem of sidechain placement and show that our confidence measure can be used to detect and correct incorrect conformational predictions. Over nine proteins with density maps of varying resolutions, the Pearson correlation between predictive accuracy (of least-squares fit) and our confidence measure is quite high, about 0.89. We show that our approach can guide the use of stereochemical restraints when confidence is low in predictions. We also propose a Bayesian data fusion scheme that integrates our confidence measure to weight the contribution of each source of data, which could potentially be used for combining experimental, modeling, and empirical data in automated structure determination.
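The paper derives its confidence measure from the statistics of least-squares matching scores; the derivation itself is not given in the abstract. As a loose stand-in for intuition only, a softmax over negated scores also maps a score gap to a pseudo-probability that the best-scoring candidate is correct. This is an assumed illustrative form, not the authors' measure:

```python
import math

def fit_confidence(scores, beta=1.0):
    """Map least-squares fit scores (lower = better) for a set of
    candidate conformations to a pseudo-probability that the
    best-scoring candidate is correct, via a softmax over negated,
    min-shifted scores. `beta` sets how sharply score gaps are trusted.
    """
    lo = min(scores)
    weights = [math.exp(-beta * (s - lo)) for s in scores]
    return max(weights) / sum(weights)
```

A large gap between the best and second-best score yields confidence near 1; near-ties yield confidence near 1/k over the k tied candidates, which is the qualitative behavior any such measure should reproduce.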
Heuristics for the Sorting by Length-Weighted Inversion Problem
Thiago da Silva Arruda,
Ulisses Dias,
Zanoni Dias
Pages: 498
doi: 10.1145/2506583.2506615
In this paper we present a polynomial-time algorithm for the length-weighted inversion problem on unsigned permutations. We consider the linear cost function where each inversion costs the number of elements in the reversed segment. We evaluate our method by comparing its results against a previously known approximation algorithm. Results from two batches of tests, using all possible small permutations and a sample of large permutations, show that our algorithm achieves significantly better results.
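The cost model in the abstract is concrete enough to state in code: an inversion of the segment perm[i..j] costs j - i + 1, the number of elements reversed. A small helper for replaying and pricing a sequence of inversions (names are illustrative):

```python
def apply_inversions(perm, inversions):
    """Apply a sequence of inversions (i, j), each reversing
    perm[i..j] inclusive. Returns the final permutation and the total
    length-weighted cost, where an inversion costs the number of
    elements in the reversed segment (j - i + 1)."""
    perm = list(perm)
    cost = 0
    for i, j in inversions:
        perm[i:j + 1] = reversed(perm[i:j + 1])
        cost += j - i + 1
    return perm, cost
```

Under this linear cost function, many short inversions can be cheaper than one long one, which is exactly what distinguishes the length-weighted problem from classical sorting by (unit-cost) reversals.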
glu-RNA: aliGn highLy strUctured ncRNAs using only sequence similarity
Prapaporn Techa-angkoon,
Yanni Sun
Pages: 508
doi: 10.1145/2506583.2506617
Generating reliable alignments for ncRNAs is an important step in ncRNA secondary structure prediction and ncRNA gene finding. Existing sequence alignment programs can generate reliable alignments for ncRNAs with high sequence conservation. For highly structured ncRNAs that may lack strong sequence similarity, structural alignment programs are required. However, conducting reliable structural alignment is much more expensive than sequence alignment and is not ideal for large-scale input such as whole genomes or next-generation sequencing data. In this paper, we propose an accurate ncRNA alignment approach to align highly structured ncRNAs using only sequence similarity. By incorporating posterior probability and a machine learning approach, we can generate accurate alignments of highly structured ncRNAs without using structural information. We tested our approach on over three hundred pairs of highly structured ncRNAs from BRAliBase 2.1. The experimental results show that our approach can achieve more accurate alignments than commonly used sequence alignment programs and a popular structural alignment tool. The source code of glu-RNA can be downloaded at http://sourceforge.net/projects/glu-rna/.
Predictive model of the treatment effect for patients with major depressive disorder
Igor Akushevich,
Julia Kravchenko,
Ken Gersing,
Ketan K. Mane
Pages: 518
doi: 10.1145/2506583.2506618
A model to evaluate and predict the effectiveness of treatment of Major Depressive Disorder (MDD) was developed and estimated using MindLinc data. The clinical global impression (CGI) scale, with seven categories, was used to measure the patient's state. A proportional odds model was selected because of the ordinal nature of the outcome. The set of predictors included i) the CGI score measured at the preceding visit, ii) three groups of medications (antidepressants, atypical medicine, and augmentation medicine), each categorized into an appropriate number of strata (from six to nine) together with their daily doses, iii) psychiatric comorbidities, iv) the type of therapy used (talk vs. medications), v) demographic variables (e.g., age group, sex), and vi) the history of the efficiency of prior treatment. More than half a million records with measured CGI scores and their predictors were identified in the MindLinc database and used for model estimation. The predictive model of future CGI scores was developed and evaluated for single and recurrent episodes of MDD. Significant estimates were obtained for demographic factors, the history of previous CGI scores, and for comorbidity and treatment indices. Methods of causal inference based on the inverse probability weighting approach were applied to evaluate the treatment effects. Model extensions that address the limitations of the proportional odds model are discussed.
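For readers unfamiliar with the proportional odds model named above: with ordered cutpoints c_1 < ... < c_{K-1} and a linear predictor eta built from covariates such as those listed, the cumulative probabilities are P(Y <= k) = logistic(c_k - eta). A minimal sketch with toy parameters (not estimates from MindLinc):

```python
import math

def prop_odds_probs(eta, cutpoints):
    """Category probabilities for an ordinal outcome (e.g., a CGI
    score) under a proportional odds model: P(Y <= k) = logistic(c_k -
    eta), so category probabilities are successive differences of the
    cumulative curve. `cutpoints` must be increasing; `eta` is the
    linear predictor from the covariates."""
    logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
    cum = [logistic(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
```

Raising eta shifts probability mass toward higher categories for every cutpoint at once, which is the "proportional odds" restriction that the abstract's closing sentence proposes relaxing.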
Towards Independent Particle Reconstruction from Cryogenic Transmission Electron Microscopy
W. Lewis Collier,
Jean Yves Hervé,
Lenore Martin
Pages: 525
doi: 10.1145/2506583.2506622
Coronary heart disease is the single largest killer of Americans, so improved means of detecting risk factors before arterial obstructions appear are expected to lead to an improvement in quality of life at a reduced cost. This paper introduces a new approach to 3-D reconstruction of individual particles based on statistical modeling from a sparse set of 2-D projection images. The method is in contrast to the current state of practice, where reconstruction is performed via signal processing or Bayesian methods that use averaged images acquired from an ensemble of particles. As such, this new approach has its impetus in use for novel diagnostic tests such as LDL and HDL particle shape characterization. The approach is also expected to have uses in areas such as quality assurance for drug-delivery nanotechnologies and for general proteomic studies. The individual particle reconstruction algorithm is based on a hidden Markov model. Higher-order Markov chain statistics, which are generated from the a priori model of the target of interest, can be derived from traditional methods such as single particle reconstruction and/or the underlying physical properties of the particle. By placing the reconstruction voxel space at a 45° angle to the projection image, four passes of the HMM processing can be performed from a single image. Reconstruction from a simple model and a single projection image resulted in better than 98% reconstruction accuracy as compared to the original target.
ChainKnot: a comparative H-type pseudoknot prediction tool using multiple ab initio folding tools
Jikai Lei,
Prapaporn Techa-angkoon,
Yanni Sun,
Rujira Achawanantakun
Pages: 535
doi: 10.1145/2506583.2506626
Pseudoknot is an important structural motif in many types of ncRNAs. However, the accuracy of pseudoknot derivation is still not satisfactory even for simple pseudoknotted structures and short sequences. In this work, we design and implement an effective pipeline, ChainKnot, for deriving secondary structures containing recursive H-type pseudoknots from two or multiple ncRNA sequences. ChainKnot solves the consensus structure derivation problem using an extended maximum-weighted chain algorithm. In addition, ChainKnot tests a new strategy that extracts structural elements from the optimal and sub-optimal predictions of multiple ab initio pseudoknot prediction tools. The experimental results on over five hundred pseudoknot-containing ncRNAs demonstrate that extracting stems from the output of ab initio tools significantly increases the performance of the prediction pipeline compared to using base-pairing probability matrices. Our approach achieves better sensitivity, PPV, and F-score than the state-of-the-art pseudoknot prediction tools on recursive H-type pseudoknots. The source code of ChainKnot is available at http://sourceforge.net/projects/chainknot
A Framework for Identifying Affinity Classes of Inorganic Materials Binding Peptide Sequences
Nan Du,
Marc R. Knecht,
Paras N. Prasad,
Mark T. Swihart,
Tiffany Walsh,
Aidong Zhang
Pages: 545
doi: 10.1145/2506583.2506628
With the rapid development of bionanotechnology, there has been growing interest in identifying the affinity classes of inorganic-material-binding peptide sequences. However, some distinct characteristics of inorganic-material-binding sequence data limit the performance of many widely used classification methods. In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic material. We first generate a large set of simulated peptide sequences based on our new amino acid transition matrix, and then the probability of test sequences belonging to a specific affinity class is calculated by solving an objective function. The objective function is solved through iterative propagation of probability estimates among sequences and sequence clusters. Experimental results on a real inorganic-material-binding sequence dataset show that the proposed framework is highly effective in identifying the affinity classes of inorganic-material-binding sequences.
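One concrete piece of such a pipeline, generating simulated peptide sequences from an amino-acid transition matrix, can be sketched directly. The transition matrix here is caller-supplied, since the paper's own matrix is part of its contribution:

```python
import random

def simulate_peptides(trans, start_probs, length, count, seed=1):
    """Draw peptide sequences from a first-order amino-acid model:
    start_probs maps residues to starting probabilities, and trans[a]
    maps residue a to a {next_residue: probability} distribution.
    Mirrors the idea of enlarging a scarce binding-sequence dataset
    with simulated sequences."""
    rng = random.Random(seed)

    def draw(dist):
        r, acc = rng.random(), 0.0
        for sym, p in dist.items():
            acc += p
            if r < acc:
                return sym
        return sym  # guard against floating-point slack

    seqs = []
    for _ in range(count):
        seq = [draw(start_probs)]
        while len(seq) < length:
            seq.append(draw(trans[seq[-1]]))
        seqs.append("".join(seq))
    return seqs
```

With a realistic 20x20 matrix the same sampler produces a corpus whose dipeptide statistics match the training set, which is all a first-order transition model can promise.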
Clustering Coefficients in Protein Interaction Hypernetworks
Suzanne Renick Gallagher,
Debra S. Goldberg
Pages: 552
doi: 10.1145/2506583.2506635
Modeling protein interaction data with graphs (networks) is insufficient for some common types of experimentally generated interaction data. For example, in affinity purification experiments, one protein is pulled out of the cell along with other proteins that are bound to it. This data is not intrinsically binary, so we lose information when we model it with a graph, which can only associate pairs of proteins. Hypergraphs, an extension of graphs which allows relationships among sets of arbitrary size, have been proposed to model this type of data. However, there is no consensus for appropriate measures for these "protein interaction hypernetworks" that are meaningful in both their interpretation and in their correspondence to a biological question (e.g., predicting the function of uncharacterized proteins, identifying new biological modules). The clustering coefficient is a measure commonly used in binary networks for biological insights. While multiple analogs of the clustering coefficient have been proposed for hypernetworks, the usefulness of these for generating biological hypotheses has not been established. We present several new definitions for a hypergraph clustering coefficient that pertain specifically to the biology of interacting proteins. We evaluate the biological meaning of these and previously proposed definitions in protein interaction hypernetworks and test their correlation with protein complexes. We conclude that hypergraph analysis offers important advantages over graph measures for non-binary data, and we discuss the clustering coefficient measures that perform best. Our work suggests a paradigm shift is needed to best gain insights from affinity purification assays and other non-binary data.
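The paper evaluates several hypergraph clustering-coefficient definitions, none of which are given in the abstract. For flavor only, one simple analog (an assumption for illustration, not one of the authors' definitions) scores a vertex by how often the hyperedges containing it overlap beyond that vertex:

```python
from itertools import combinations

def hyper_clustering(vertex, hyperedges):
    """Illustrative hypergraph analog of the clustering coefficient:
    among all pairs of hyperedges (e.g., affinity-purification pull-
    downs) containing `vertex`, the fraction that also share at least
    one other vertex. Returns 0.0 when fewer than two hyperedges
    contain the vertex."""
    containing = [set(e) for e in hyperedges if vertex in e]
    pairs = list(combinations(containing, 2))
    if not pairs:
        return 0.0
    overlapping = sum(1 for a, b in pairs if (a & b) - {vertex})
    return overlapping / len(pairs)
```

Note how this works on the pull-down sets themselves, with no projection to pairwise edges, which is precisely the information a graph model discards.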
A Privacy Preserving Markov Model for Sequence Classification
Suxin Guo,
Sheng Zhong,
Aidong Zhang
Pages: 561
doi: 10.1145/2506583.2506636
Sequence classification has attracted much interest in recent years due to its difference from traditional classification tasks, as well as its wide applications in many fields, such as bioinformatics. As it is not easy to define specific "features" for sequence data as in traditional feature-based classification, many methods have been developed to utilize the particular characteristics of sequences. One common way of classifying sequence data is to use probabilistic generative models, such as the Markov model, to learn the probability distribution of sequences in each class. One thing that should be considered in sequence classification research is the privacy issue. In many cases, especially in bioinformatics, sequence data contains sensitive information, which hinders the mining of the data. For example, the DNA and protein sequences of individuals are highly sensitive and should not be released without protection. But in the real world, data is usually distributed among different parties, and training only with their own data may not give the parties strong enough models. This raises a problem when some parties, each holding a set of sequences, want to learn Markov models on the union of their data, but do not want to reveal their data to others due to privacy concerns. In this paper, we address this problem and propose a method to train Markov models, from first-order models to models of order k with k > 1, on sequence data distributed among parties without revealing each party's private sequences to others. We apply homomorphic encryption to protect the sensitive information.
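The plain, non-private version of the underlying estimation, add-one-smoothed first-order transition probabilities per class with classification by log-likelihood, looks like this. The paper's contribution is performing the same estimation jointly across parties under homomorphic encryption, which this sketch deliberately omits:

```python
import math
from collections import defaultdict

def train_markov(seqs, alphabet):
    """First-order Markov model for one class: add-one-smoothed
    transition log-probabilities estimated from that class's
    sequences. Returns logp[a][b] = log P(next=b | current=a)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    logp = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        logp[a] = {b: math.log((counts[a][b] + 1) / total)
                   for b in alphabet}
    return logp

def log_likelihood(seq, logp):
    """Score a sequence under one class's model; classify by argmax."""
    return sum(logp[a][b] for a, b in zip(seq, seq[1:]))
```

In the distributed setting of the paper, the per-party transition counts are exactly what must be aggregated without being revealed; everything after the counts is public arithmetic.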
Classification of Alzheimer Diagnosis from ADNI Plasma Biomarker Data
Jue Mo,
Sana Siddiqui,
Stuart Maudsley,
Huey Cheung,
Bronwen Martin,
Calvin A. Johnson
Pages: 569
doi: 10.1145/2506583.2506637
Research into modeling the progression of Alzheimer's disease (AD) has made recent progress in identifying plasma proteomic biomarkers to identify the disease at the pre-clinical stage. In contrast with cerebral spinal fluid (CSF) biomarkers and PET imaging, plasma biomarker diagnoses have the advantage of being cost-effective and minimally invasive, thereby improving our understanding of AD and hopefully leading to early interventions as research into this subject advances. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has collected data on 190 plasma analytes from individuals diagnosed with AD as well as subjects with mild cognitive impairment and cognitively normal (CN) controls. We propose an approach to classify subjects as AD or CN via an ensemble of classifiers trained and validated on ADNI data. Classifier performance is enhanced by augmenting a selective biomarker feature space with principal components obtained from the entire set of biomarkers. This procedure yields an accuracy of 89% and an area under the ROC curve of 94%.
The Forward Stem Matrix: An Efficient Data Structure for Finding Hairpins in RNA Secondary Structures
Richard Beal,
Donald Adjeroh,
Ahmed Abbasi
Pages: 575
doi: 10.1145/2506583.2506638
With the rapid growth in available genomic data, robust and efficient methods for identifying RNA secondary structure elements, such as hairpins, have become a significant challenge in computational biology, with potential applications in prediction of RNA secondary and tertiary structures, functional classification of RNA structures, microRNA target prediction, and discovery of RNA structure motifs. In this work, we propose the Forward Stem Matrix (FSM), a data structure to efficiently represent all k-length stem options, for k ∈ K, within an n-length RNA sequence T. We show that the FSM structure is of size O(n|K|) and still permits efficient access to stems. In this paper, we provide a linear O(n|K|) construction for the FSM using suffix arrays and data structures related to the Longest Previous Factor (LPF), namely, the Furthest Previous Non-Overlapping Factor (FPnF) and Furthest Previous Factor (FPF) arrays. We also provide new constructions for the FPnF and FPF via a novel application of parameterized string (p-string) theory and suffix trees. As an application of the FSM, we show how to efficiently find all hairpin structures in an RNA sequence. Experimental results show the practical performance of the proposed data structures.
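As a naive reference point for what the FSM accelerates: a brute-force scan for k-length stems whose two sides are reverse-complementary and flank a loop of at least `min_loop` bases (Watson-Crick pairs only; the FSM reaches the same stems far more efficiently via suffix-array machinery):

```python
def find_hairpins(seq, k, min_loop=3):
    """Enumerate k-length hairpin stems in an RNA sequence: positions
    (i, j) where seq[i:i+k] base-pairs with seq[j:j+k] read in
    reverse, separated by a loop of at least min_loop bases.
    Quadratic brute force, for reference only."""
    pair = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'}
    hits = []
    n = len(seq)
    for i in range(n - k + 1):
        for j in range(i + k + min_loop, n - k + 1):
            if all(pair[seq[i + t]] == seq[j + k - 1 - t]
                   for t in range(k)):
                hits.append((i, j))
    return hits
```

The point of the FSM is that this all-stems information can be held in O(n|K|) space and queried without the nested scan above.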
Fine-Scale Recombination Mapping of High-Throughput Sequence Data
Catherine E. Welsh,
Chen-Ping Fu,
Fernando Pardo-Manuel de Villena,
Leonard McMillan
Pages: 585
doi: 10.1145/2506583.2506642
In this paper, we contrast the resolution and accuracy of determining recombination boundaries using genotyping arrays compared to high-throughput sequencing. In addition, we consider the impacts of sequence coverage and genetic diversity on localizing recombination boundaries. We developed a hidden Markov model for estimating recombination breakpoints based on variant observations seen in the read coverage spanning uniformly sized genomic windows. Our model includes 36 states representing all combinations of 8 genomes, and estimates a founder mosaic that is consistent with the variants observed in the aligned sequences. At HMM transition locations we consider the most likely founder pair and refine the recombination breakpoints down to an interval spanning two informative variants. We compare this solution to alternate solutions that we estimated from microarrays. At 30x coverage the recombination mapping accuracy far exceeds the resolution attainable by any microarray. Even at coverages of 1x and below we are generally able to estimate recombination breakpoints with comparable accuracy.
Transforming Genomes Using MOD Files with Applications
Shunping Huang,
Chia-Yu Kao,
Leonard McMillan,
Wei Wang
Pages: 595
doi: 10.1145/2506583.2506643
Next generation sequencing techniques have enabled new methods of DNA and RNA quantification. Many of these methods require a step of aligning short reads to some reference genome. If the target organism differs significantly from this reference, alignment errors can lead to significant errors in downstream analysis. Various attempts have been made to integrate known genetic variants into the reference genome so as to construct sample-specific genomes to improve read alignments. However, many hurdles in generating and annotating such genomes remain unsolved. In this paper, we propose a general framework for mapping back and forth between genomes. It employs a new format, MOD, to represent known variants between genomes, and a set of tools that facilitate genome manipulation and mapping. We demonstrate the utility of this framework using three inbred mouse strains. We built pseudogenomes from the mm9 mouse reference genome for three highly divergent mouse strains based on MOD files and used them to map the gene annotations to these new genomes. We observe that a large fraction of genes have their positions or ranges altered. Finally, using RNA-seq and DNA-seq short reads from these strains, we demonstrate that mapping to the new genomes yields a better alignment result than mapping to the standard reference. The MOD files for the 17 mouse strains sequenced in the Wellcome Trust Sanger Institute's Mouse Genomes Project can be found at http://www.csbio.unc.edu/CCstatus/index.py?run=Pseudo The auxiliary tools (i.e., MODtools and Lapels), written in Python, are available at http://code.google.com/p/modtools/ and http://code.google.com/p/lapels/.
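The MOD format itself is not specified in the abstract, but the coordinate arithmetic behind any reference-to-pseudogenome mapping is a cumulative-offset index over indels. A hypothetical sketch, with made-up positions and length changes rather than real MOD records:

```python
import bisect

def build_offset_index(indels):
    """indels: (ref_pos, length_change) pairs sorted by position,
    where length_change > 0 is an insertion in the sample genome and
    < 0 a deletion. Returns parallel arrays for coordinate lookup."""
    positions, offsets = [], []
    total = 0
    for pos, change in indels:
        total += change
        positions.append(pos)
        offsets.append(total)
    return positions, offsets

def ref_to_sample(pos, positions, offsets):
    """Shift a reference coordinate by every indel strictly before it;
    this is why gene positions and ranges change on a pseudogenome."""
    i = bisect.bisect_left(positions, pos)
    return pos + (offsets[i - 1] if i else 0)
```

The reverse mapping (pseudogenome back to reference, as Lapels performs for aligned reads) inverts the same index, so annotations and alignments can travel in both directions.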
Read Annotation Pipeline for High-Throughput Sequencing Data
James Holt,
Shunping Huang,
Leonard McMillan,
Wei Wang
Pages: 605
doi: 10.1145/2506583.2506645
Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest that aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, including reduced mapping ratios, shifts in read mappings, and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate its advantages: more aligned reads and a higher percentage of reads with assigned origins.
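The merge step can be pictured as simple bookkeeping over per-reference alignment scores: assign each read to the founder reference with the strictly best score, and mark ties as ambiguous in origin. Purely illustrative; scoring and record formats follow whatever aligner is actually used:

```python
def merge_alignments(per_reference):
    """Merge alignments made against several founder pseudogenomes.
    per_reference maps a founder name to {read_id: alignment_score}
    (higher = better). A read aligning to only one reference is
    'rescued'; a read with a unique best score gets that founder as
    its annotated origin; exact ties are marked ambiguous."""
    merged = {}
    for founder, hits in per_reference.items():
        for read, score in hits.items():
            merged.setdefault(read, []).append((score, founder))
    origin = {}
    for read, scored in merged.items():
        top = max(s for s, _ in scored)
        winners = [f for s, f in scored if s == top]
        origin[read] = winners[0] if len(winners) == 1 else "ambiguous"
    return origin
```

Reads falling in regions identical across founders naturally land in the ambiguous bin, which is more honest than forcing an allele call at those positions.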
Simulating Anti-adhesive and Antibacterial Bifunctional Polymers for Surface Coating using BioScape
Vishakha Sharma,
Adriana Compagnoni,
Matthew Libera,
Agnieszka K. Muszanska,
Henk J. Busscher,
Henny C. van der Mei
Pages: 613
doi: 10.1145/2506583.2506646
Traditionally, biomaterials development consists of designing a surface and testing its properties experimentally. This trial-and-error approach is limited because of the resources and time needed to sample a representative number of configurations in a combinatorially complex scenario. Therefore, computational modeling is of significant importance in identifying the best antibacterial materials to prevent and treat implant-related biofilm infections. In this paper we focus on a bifunctional surface with polymer brushes and Pluronic-Lysozyme conjugates developed by Henk Busscher's group in Groningen, The Netherlands. The bifunctional brushes are anti-adhesive, owing to the unmodified polymer brushes, and antibacterial, owing to the Pluronic-Lysozyme conjugates. The group developed and studied three different surfaces with varying proportions of antibacterial and anti-adhesive properties. In order to aid the development of optimal bifunctional surfaces, we build a three-dimensional computational model using BioScape, an agent-based modeling and simulation language developed by Compagnoni's group at Stevens. We model two different experimental phases: adhesion and growth. We use the results of experiments on two surfaces as training data, and we validate our model by reproducing the experimental results from the third surface. The resulting model is able to simulate varying configurations of surface coatings at both the adhesion and growth phases in a fraction of the time necessary to perform in vitro experiments. The output of the model not only plots populations over time, but also produces 3D-rendered videos of bacteria-surface interactions, enhancing the visualization of the system's behavior.
Systematic Assessment of RNA-Seq Quantification Tools Using Simulated Sequence Data
Raghu Chandramohan,
Po-Yen Wu,
John H. Phan,
May D. Wang
Pages: 623
doi: 10.1145/2506583.2506648
RNA-sequencing (RNA-seq) technology has emerged as the preferred method for quantification of gene and isoform expression. Numerous RNA-seq quantification tools have been proposed and developed, bringing us closer to developing expression-based diagnostic tests based on this technology. However, because of the rapidly evolving technologies and algorithms, it is essential to establish a systematic method for evaluating the quality of RNA-seq quantification. We investigate how different RNA-seq experimental designs (i.e., variations in sequencing depth and read length) affect various quantification algorithms (i.e., HTSeq, Cufflinks, and MISO). Using simulated data, we evaluate the quantification tools based on four metrics, namely: (1) total number of usable fragments for quantification, (2) detection of genes and isoforms, (3) correlation, and (4) accuracy of expression quantification with respect to the ground truth. Results show that Cufflinks is able to use the largest number of fragments for quantification, leading to better detection of genes and isoforms. However, HTSeq produces more accurate expression estimates. Moreover, each quantification algorithm is affected differently by varying sequencing depth and read length, suggesting that the selection of quantification algorithms should be application-dependent.
GPU-Optimized Hybrid Neighbor/Cell List Algorithm for Coarse-Grained MD Simulations of Protein and RNA Folding and Assembly
Andrew J. Proctor,
Cody A. Stevens,
Samuel S. Cho
Pages: 633
doi: 10.1145/2506583.2506649
Molecular dynamics (MD) simulations provide a molecular-resolution view of biomolecular folding and assembly processes, but the computational demands of the underlying algorithms limit the length- and time-scales of the simulations one can perform. Recently, graphics processing units (GPUs), specialized devices that were originally designed for rendering images, have been repurposed for high performance computing, and there have been significant increases in the performance of parallel algorithms such as the ones in MD simulations. Previously, we implemented a GPU-optimized parallel neighbor list algorithm for our coarse-grained MD simulations, and we observed an N-dependent speed-up (or speed-down) compared to a CPU-optimized algorithm, where N is the number of interacting beads representing amino acids or nucleotides for proteins or RNAs, respectively. We had demonstrated that for MD simulations of the 70S ribosome (N=10,219), our GPU-optimized code was about 30x as fast as a CPU-optimized version. In our present study, we implement a hybrid neighbor/cell list algorithm that borrows components from the well-known neighbor list and cell list algorithms. We observe a speedup of about 10% compared to our previous implementation of the GPU-optimized parallel neighbor list algorithm.
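The cell-list half of such a hybrid scheme is easy to sketch on a CPU: bin beads into cubic cells of edge `cutoff`, then test distances only against beads in the 27 same-or-adjacent cells. This is a serial reference sketch, not the GPU implementation described above:

```python
from itertools import product

def cell_list_pairs(coords, cutoff):
    """Find all bead pairs within `cutoff` via a cell list: beads are
    binned into cubic cells of edge `cutoff`, so any pair within the
    cutoff must lie in the same or an adjacent cell. Avoids the
    O(N^2) all-pairs distance test for large N."""
    cells = {}
    for idx, (x, y, z) in enumerate(coords):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        cells.setdefault(key, []).append(idx)
    pairs = set()
    c2 = cutoff * cutoff
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), []):
                for i in members:
                    if i < j:
                        xi, yi, zi = coords[i]
                        xj, yj, zj = coords[j]
                        d2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
                        if d2 <= c2:
                            pairs.add((i, j))
    return pairs
```

A hybrid scheme in the spirit of the paper would rebuild a Verlet-style neighbor list from these cell-local candidates every few steps, rather than rescanning all N^2 pairs.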
Gene Set Cultural Algorithm: A Cultural Algorithm Approach to Reconstruct Networks from Gene Sets
Thair Judeh,
Thaer Jayyousi,
Lipi Acharya,
Robert G. Reynolds,
Dongxiao Zhu
Pages: 641
doi: 10.1145/2506583.2506650
With the increasing availability of gene sets, novel approaches that focus on reconstructing networks from gene sets are of interest. Currently, few computational approaches explore the search space of candidate networks using a parallel search. As such, novel methods that employ search agents are needed to help better escape local optima. In particular, gene sets may model signal transduction events, which refer to linear chains or cascades of reactions starting at the cell membrane and ending at the cell nucleus. These events may be indirectly observed as a set of unordered and overlapping gene sets. Thus, the underlying goal is to reverse engineer the order information within each gene set to reconstruct the underlying source network. To achieve this goal, we developed the Gene Set Cultural Algorithm to discover the true order of the gene sets and to reconstruct the underlying network. In a proof of concept study, we show that the Gene Set Cultural Algorithm can satisfactorily reconstruct three E. coli networks from the DREAM initiative using simulated and unordered gene sets as the input.
POSTER SESSION: Posters
Predicting protein transport mechanism and immune response using spatial protein motifs and epitopes: a case study of Chlamydophila MOMP
F. O. Atanu,
E. Oveido-Orta,
K. A. Watson
Pages: 649
doi: 10.1145/2506583.2506656
Chlamydophila represents a distinct genus of gram-negative bacteria associated with a spectrum of both human and animal disease and, as such, is an important health and economic concern. Central to the pathogenicity of Chlamydophila are antigenic proteins, among which the Major Outer Membrane Proteins (MOMP) have received significant attention. MOMP from Chlamydophila pneumoniae and Chlamydia trachomatis, the human pathogens, remain to date the best characterised. In addition, MOMP-derived peptides have been shown to potentiate anti-inflammatory and anti-atherogenic effects, by a cell-mediated system involving MHC class II proteins [2]. However, despite the importance of this protein as a vaccine target against inflammation-driven pathologies attributed to Chlamydia, tremendous challenges in the isolation of this membrane-bound, cysteine-rich protein have precluded detailed structural and immunological studies. In an effort to reveal a plausible structure and function for Chlamydophila MOMP and aid in the rational design of potential lead candidates for future drug design, computational methods have been employed. Firstly, a knowledge-based approach was used to explore Chlamydia MOMP sequences to identify Asn-Pro-X (NPX) motifs, commonly found in long-chain fatty acid transporters such as aquaporins and some aquaglyceroporins. In Chlamydial MOMPs, a variety of substitutions are found at the third 'X' position (NP-A/S/E/T/K), which may account for its potential to transport a variety of solutes, as a strategy for coping with the absence of essential metabolic enzymes. Earlier findings suggest that MOMP from Chlamydophila pneumoniae is permeable to sugars, amino acids, dicarboxylates and ATP. Subsequent homology modeling of MOMP from C. pneumoniae (using template-based methods) has provided insight into the orientation of these functional motifs in the 3-dimensional structures of MOMPs, from which a plausible novel transport model has been derived [1]. In our model the NPA motif is oriented toward the extracellular side, while the two NPS motifs are juxtaposed inside the barrel, poised to fulfil a transport role via solute binding. Secondly, to understand how such MOMP-derived peptides could potentiate an immune response via binding MHC II alleles, a flexible molecular docking protocol was employed [1]. The reliability of the docking protocol was tested by docking peptides extracted from the PDB coordinates of the MHCs of interest and comparing the results to the original peptide-MHC complex, as seen in the crystal structure. We used the docking protocol to score four peptides of interest for their candidacy as potential new leads in a rational drug design strategy against chronic inflammation, which characterises Chlamydial involvement in atherosclerosis and respiratory diseases. Our computational work offers new insight into the structure and functional mechanism involved in solute transport via MOMP across Chlamydial cell membranes, which could be used in the design of inhibitors that obstruct nutrient uptake and halt Chlamydial viability in their host. This work also supports the role of MOMP-derived peptides as vaccine candidates for immunotherapy in chronic inflammation that can result in cardiovascular events.
In silico analysis of autoimmune diseases and genetic relationships to vaccination against infectious diseases
Peter McGarvey,
Baris E. Suzek,
Shruti Rao,
Subha Madhavan,
James N. Baraniuk,
Samir Lababidi,
Andrea Sutherland,
Richard Forshee
Pages: 650
doi: 10.1145/2506583.2506657
Vaccines are profoundly important to global health in preventing infectious diseases. Reported adverse events following vaccination are diverse, rare and require thorough investigation and evaluation [1]. Autoimmune diseases (AD) have been reported after some vaccinations. Because autoimmune diseases are rare and have variable and prolonged onset times, it is difficult to fully assess the association between autoimmune diseases and vaccination. One of the components of pharmacovigilance and vaccine safety evaluation is consideration of biologic plausibility. Knowledge of biologic plausibility may be enhanced by an understanding of the molecular immune mechanisms responsible for the adverse events, natural infections and the pathogenesis of the associated, reported ADs. The situation is complicated by the complex matrix of innate and adaptive immune responses to vaccine antigens, adjuvants, preservatives and stabilizers. A bioinformatics, systems biology approach was used to collect data from the literature and curated databases to understand post-vaccination Guillain-Barré Syndrome (GBS), Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE), and Idiopathic (or Immune) Thrombocytopenic Purpura (ITP). By mining multiple curated databases and using automated text mining of PubMed literature, followed by manual review to remove errors, 667 genes associated with RA, 448 with SLE, 49 with ITP and 73 with GBS were collected. While all data sources provided valuable and unique gene associations, text mining using natural language processing (NLP) algorithms provided the most by far but required additional curation to remove incorrect associations. Sixty-four direct interactions between six vaccine ingredients and forty-six genes were also collected. Though only six genes were associated with all four ADs, thirty-seven genes were associated with three ADs. Pathway analysis found thirty-three pathways in common between the four ADs. Classification of genes into twelve immune-system-related categories showed that more "Chemokine plus Receptors" genes were associated with RA than with SLE. RA also had more genes associated with the "Th17 T-cell" subtype than the other ADs. Gene networks were created, visualized and analyzed by cluster analysis of interconnected modules. The analysis showed several clusters uniquely associated with RA, including one with ten C-X-C motif chemokines, which are powerful neutrophil chemotactic factors. Other clusters contained genes common to other ADs. Figure 1 shows a subnetwork of ten genes associated with GBS, Influenza A infection and genes activated in response to influenza vaccination [2]. The nodes highlighted in green and shaded in the data panel represent genes associated with GBS only and not the other three ADs. Red triangles are vaccine ingredients that interact with genes in the network. Additional pathway analysis suggests a key role for the MAPK signaling pathway in GBS. Systems and methods to collect, organize and integrate large data sets are essential to enable researchers and public health agencies to utilize published data and develop hypotheses related to vaccine safety and efficacy.
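The cross-disease overlap counts the abstract reports (six genes shared by all four ADs, thirty-seven by three) reduce to a simple tally over disease gene sets. A minimal Python sketch of that bookkeeping, with illustrative gene symbols that are not taken from the study's actual lists:

```python
from collections import Counter

def shared_gene_counts(disease_genes):
    """Count, for each gene, how many disease gene sets it appears in,
    then tally how many genes are shared by exactly k diseases."""
    per_gene = Counter()
    for genes in disease_genes.values():
        per_gene.update(set(genes))      # each disease counts at most once per gene
    by_k = Counter(per_gene.values())    # k diseases -> number of genes
    return per_gene, by_k
```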
Modularity and community detection in Semantic Similarity Networks through Spectral-Based Transformation and Markov Clustering
Pietro Hiram Guzzi,
Simone Truglia,
Marianna Milano,
Pierangelo Veltri,
Mario Cannataro
Pages: 652
doi: 10.1145/2506583.2506658
Semantic Similarity Networks are currently used to model similarities among biological entities. Nodes of such networks are, for instance, proteins, while weighted edges encode semantic similarity scores between them. These networks are usually affected by noise. This paper presents an algorithm for de-noising such networks, and shows that mining algorithms perform better on the processed networks.
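Markov Clustering, one of the two components named in the title, alternates matrix expansion and inflation on a column-stochastic similarity matrix. One generic MCL iteration can be sketched in pure Python (an illustration of the standard MCL step, not the authors' implementation):

```python
def mcl_step(M, inflation=2.0):
    """One expansion + inflation step of Markov Clustering on a small
    column-stochastic matrix M, given as a list of row lists."""
    n = len(M)
    # expansion: matrix square, simulating longer random walks
    E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    # inflation: elementwise power sharpens strong transitions, then renormalize columns
    I = [[E[i][j] ** inflation for j in range(n)] for i in range(n)]
    for j in range(n):
        s = sum(I[i][j] for i in range(n))
        for i in range(n):
            I[i][j] /= s
    return I
```

Iterating this step to convergence leaves attractor structure from which the clusters are read off; in the paper's pipeline this would run on the de-noised similarity matrix.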
Age-Specific Signatures of Glioblastoma at the Genomic, Genetic, and Epigenetic Levels
Serdar Bozdag,
Aiguo Li,
Gregory Riddick,
Yuri Kotliarov,
Mehmet Baysan,
Fabio M. Iwamoto,
Margaret C. Cam,
Svetlana Kotliarova,
Howard A. Fine
Pages: 654
doi: 10.1145/2506583.2506659
Age is a powerful predictor of survival in glioblastoma multiforme (GBM), yet the biological basis for the difference in clinical outcome is mostly unknown. Discovering genes and pathways that explain the age-specific survival difference could generate opportunities for novel therapeutics for GBM. Here we have integrated gene expression, exon expression, microRNA expression, copy number alteration, SNP, whole exome sequence, and DNA methylation data sets of a cohort of GBM patients in The Cancer Genome Atlas (TCGA) project to discover age-specific signatures at the transcriptional, genetic, and epigenetic levels, and validated our findings on the REMBRANDT data set. We found major age-specific signatures at all levels, including age-specific hypermethylation of polycomb group protein target genes and upregulation of angiogenesis-related genes in older GBMs. These age-specific differences in GBM, which are independent of molecular subtypes, may in part explain the preferential effects of anti-angiogenic agents in older GBM patients and pave the way to a better understanding of the unique biology and clinical behavior of older versus younger GBMs.
SNP2Structure: A public database for mapping and modeling nsSNPs on human protein structures
Difei Wang,
Kevin Rosso,
Shruti Rao,
Lei Song,
Varun Singh,
Shailendra Singh,
Michael Harris,
Subha Madhavan
Pages: 655
doi: 10.1145/2506583.2506660
With the development of deep DNA sequencing techniques, the cost of detecting mutations in the human genome has fallen significantly. Numerous non-synonymous single nucleotide polymorphisms (nsSNPs) have been identified, and many of them are associated with human disease. One of the long-standing challenges is to understand how nsSNPs change protein structure and thereby affect function. While it is impractical to solve all mutated protein structures experimentally, it is quite feasible to model the mutated structures in silico. Toward this goal, we are building a publicly available structure database (SNP2Structure) to facilitate such studies. Compared with existing web portals with a similar aim, ours has three major advantages. First, we corrected the sequence mapping discrepancies present in others; although the percentage of erroneously mapped structures is small, it is critical to correct such errors. Second, our portal offers side-by-side comparison of two structures. Third, the mutated structures are available for local download for further investigation. We believe SNP2Structure will be a valuable tool for the research community in understanding the functional impact of disease-causing nsSNPs.
An integrated pharmacogenomic analysis of doxorubicin response using genotype information on DMET genes
Krithika Bhuvaneshwar,
Michael Harris,
Thanemozhi Natarajan,
Laura Sheahan,
Difei Wang,
Subha Madhavan,
Mahlet G. Tadesse,
John Deeken
Pages: 657
doi: 10.1145/2506583.2506661
Genetic variations such as single nucleotide polymorphisms (SNPs) in drug metabolizing enzyme and transporter (DMET) genes can impact their downstream function and behavior, and play a crucial role in the pharmacokinetics of substrate drugs. These polymorphisms can alter drug response in some patients, leading to adverse drug responses such as toxicity, resistance or lack of sensitivity. We have identified variants in a number of genes that are significantly associated with doxorubicin response, in an effort to enhance personalized medicine in the clinic.
Classifying Proteins by Amino Acid Variations of Sequence Patterns
En-Shiun Annie Lee
Pages: 659
doi: 10.1145/2506583.2506663
Using Global Network Alignment In The Context Of Aging
Tijana Milenković,
Han Zhao,
Fazle E. Faisal
Pages: 661
doi: 10.1145/2506583.2506588
Analogous to sequence alignment, network alignment (NA) can be used to transfer biological knowledge across species between conserved network regions. This is important when studying human aging: since human aging is hard to study experimentally due to long lifespan, the knowledge about aging needs to be transferred from model species. NA faces two algorithmic challenges: 1) Which cost function to use to capture "similarities" between nodes in different networks? 2) Which alignment strategy to use to rapidly identify "high-scoring" alignments from all possible alignments? Since existing NA methods typically use both different cost functions and different alignment strategies, we "break down" existing state-of-the-art methods to evaluate each combination of their cost functions and alignment strategies. We find that a combination of the cost function of one method and the alignment strategy of another method beats the existing methods. Hence, we propose this combination as a novel superior NA method. Since susceptibility to diseases increases with age, studying aging is important. Thus, we use the existing and new NA methods to transfer aging-related knowledge from well annotated species to poorly annotated ones between aligned network regions. By doing so, we produce novel aging-related information, which complements currently available information about aging that has been obtained mainly by sequence alignment, especially in human. To our knowledge, we are the first to use NA to learn more about aging. This work was published as a full paper in Proceedings of ACM BCB 2013.
Dynamic networks reveal key players in aging
Fazle E. Faisal,
Tijana Milenković
Pages: 662
doi: 10.1145/2506583.2506665
Motivation: Since susceptibility to diseases increases with age, studying aging gains importance. Analyses of gene expression or sequence data, which have been indispensable for investigating aging, have been limited to studying genes and their protein products in isolation, ignoring their connectivities. However, proteins function by interacting with other proteins, and this is exactly what biological networks (BNs) model. Thus, analyzing the proteins' BN topologies could contribute to our understanding of aging. Current methods for analyzing systems-level BNs deal with their static representations, even though cells are dynamic. For this reason, and because different data types can give complementary biological insights, we integrate current static BNs with aging-related gene expression data to construct dynamic, age-specific BNs. Then, we apply sensitive measures of topology to the dynamic BNs to study cellular changes with age. Results: While global BN topologies do not significantly change with age, local topologies of a number of genes do. We predict such genes as aging-related. We demonstrate the credibility of our aging-related predictions by: 1) observing significant overlap between the predictions and "ground truth" aging-related genes; 2) showing that our predictions group by functions and diseases that differ from those of genes we do not predict as aging-related; 3) observing significant overlap between functions and diseases that are enriched in our predictions and those enriched in "ground truth" aging-related data; 4) providing evidence that diseases enriched in our predictions are linked to human aging; and 5) validating our predictions in the literature. This work was published in arXiv:1307.3388 [cs.CE], 2013.
Computational methods for alternative splicing detection using RNA-seq
Ruolin Liu,
Julie Dickerson
Pages: 663
doi: 10.1145/2506583.2506666
RNA-seq technology promises a comprehensive picture of the transcriptome. The traditional way of studying differentially expressed genes is questionable because it fails to consider alternative transcription and post-transcriptional modification. Although some studies have shown that transcript variants of a gene are predominantly generated by alternative transcription, including alternative promoters and transcriptional terminations, rather than by splicing mechanisms, most computational methods focus on alternative splicing detection and quantification. Here we are interested only in methods that can detect condition-specific differences using RNA-seq, and we categorize them into two major classes: Region Quantification (RQ) and Isoform Quantification (IQ). RQ breaks the gene structure down into "horizontally parallel pieces", exon units for example, quantifies the expression of these "small pieces", and compares them across conditions. IQ seeks to separate gene expression into "vertically parallel isoforms", which is itself a challenging task but is more biologically meaningful, and compares a gene's isoform composition across conditions. In addition, based on their ability to localize significantly different regions, we can further classify methods as "gene-centric" or "exon-centric". The combination of the two classification strategies yields four categories, and we choose one representative for each: the Cufflinks-Cuffdiff package, DEXSeq, DiffSplice and SplicingCompass. We evaluate their performance on alternative splicing analysis using three experiments. The first uses published RNA-seq data of Arabidopsis under cold condition (NCBI SRA009031). The second is a simulation study using a custom simulator in which we adopt a negative binomial model to account for variability across biological replicates. The last experiment makes use of RT-PCR to evaluate the results from the different methods.
Computer Assisted Surgery-Planning for Microwave Ablation
Xi Wen,
Hong Wang,
Weiming Zhai
Pages: 664
doi: 10.1145/2506583.2506667
A novel preoperative surgery planning method is proposed for microwave ablation. An iterative framework for necrosis field simulation and 3D necrosis zone reconstruction is introduced. The necrosis field of the ablation is computed with an adaptable method based on surgery trajectories, and the 3D model of the necrosis zone is then reconstructed and superimposed on the patient's anatomical structures using advanced visualization techniques. The full surgery planning with multiple antennae is performed by the operator interactively, until the optimal surgery plan is achieved. Experiments have been performed, and the actual necrosis field was measured and compared to postoperative CT images. Results show that this method is relatively accurate for preoperative trajectory planning and could be used as an aid in clinical practice.
Improvement of Protein-Protein Interaction Prediction by Integrating Template-Based and Template-Free Protein Docking
Masahito Ohue,
Yuri Matsuzaki,
Takehiro Shimoda,
Takashi Ishida,
Yutaka Akiyama
Pages: 666
doi: 10.1145/2506583.2506669
The MEGADOCK project: Ultra-high-speed protein-protein interaction prediction tools on supercomputing environments
Takehiro Shimoda,
Masahito Ohue,
Yuri Matsuzaki,
Takayuki Fujiwara,
Nobuyuki Uchikoga,
Takashi Ishida,
Yutaka Akiyama
Pages: 667
doi: 10.1145/2506583.2506670
Semi-automated Constraint-based Metabolic Model Generation
Jesse R. Walsh,
Julie A. Dickerson
Pages: 668
doi: 10.1145/2506583.2506671
Genome-scale models of metabolism are becoming increasingly important for understanding the relationship between genotype and phenotype in an organism at a systems level. Many tools are being developed that rely on a genome-scale metabolic reconstruction as input in order to suggest engineering interventions to improve or modify a strain's metabolism. As more organisms are sequenced, the information available to reconstruct these models also increases. There are relatively few genome-scale models compared to available genome-scale reconstructions due to the difficulty and time required to create one. In a metabolic engineering context, a genome-scale model can be used to predict engineering interventions that will produce a desired change in an organism's metabolism. Such efforts often consist of iterative small engineering changes to an organism, which must be individually analyzed and interpreted, and often updated with the results of analysis on a previous strain. Existing tools for semi-automated genome-scale model generation do not address the issue of updating existing genome-scale models to accommodate new data. A common practice in databases representing metabolic reactions is to represent all possible substrates compatible with an enzyme as a single generic metabolite representing the class of substrates that bind to the enzyme. These reactions are referred to as generic reactions, and are not suitable for use in genome-scale modeling, which requires only exact metabolite species to be represented. We have developed new software for generating genome-scale metabolic models. It facilitates modification of a base version of a Pathway Genome Database (PGDB) to align it with knowledge of a developed strain, and allows a group to maintain customized data content in an existing BioCyc database while still being able to integrate newly released updates in a semi-automated fashion. Changes made to the engineered strain also need to be added to any metabolic models of that strain for use in constraint-based analysis. This software can generate metabolic models from a strain-specific database in a semi-automated process.
Incorporating Gene Annotations as Node Metadata to Improve Network Centrality Measures for Better Node Ranking
Divya Mistry,
Julie Dickerson
Pages: 669
doi: 10.1145/2506583.2506672
Network centrality measures allow ranking of nodes and edges based on their importance to the network topology. Closeness centrality [1] and shortest-path betweenness centrality [2] are two of the most popular and well-utilized centrality measures and have provided good results [3,4,5,6]. Both of these centralities rely exclusively on topological features of the network [7] to calculate node importance. We propose an improvement to these path-length-based centrality measures that incorporates node-specific metadata to provide biologically relevant node ranking. We choose gene annotations and gene ontology (GO) evidence as our metadata to highlight the new approach. Applying the newly proposed centrality measures to synthetic networks and to gene co-expression networks of pathogen-infected barley resulted in significantly better prioritization of the nodes. We compared our results against unmodified centrality measures applied to the same networks. Our proposed improvements provide a new avenue for tailoring centrality measures to biological networks, and hold great potential for further improvement of random-walk-based [8] and motif-based centrality [9] measures.
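One simple way to fold node metadata into a path-length-based measure, in the spirit of what the abstract proposes, is to rescale plain closeness centrality by a per-node annotation score. A minimal pure-Python sketch, where the scoring scheme and the names are illustrative rather than the authors' exact formulation:

```python
from collections import deque

def closeness(adj):
    """Classic closeness centrality, (n-1) / sum of shortest-path distances,
    for a connected unweighted graph given as an adjacency dict."""
    n = len(adj)
    scores = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                      # BFS from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        scores[s] = (n - 1) / sum(d for node, d in dist.items() if node != s)
    return scores

def annotation_weighted_closeness(adj, go_score):
    """Rescale topological closeness by a per-gene annotation confidence;
    go_score is a hypothetical map from gene id to a weight in [0, 1]."""
    base = closeness(adj)
    return {g: base[g] * go_score.get(g, 1.0) for g in adj}
```

A node that is topologically central but poorly annotated is thereby demoted relative to a well-annotated neighbor, which is one plausible reading of "biologically relevant node ranking".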
Protein-protein Docking Using Information from Native Interaction Interfaces
Irina Hashmi,
Amarda Shehu
Pages: 670
doi: 10.1145/2506583.2506675
We present a probabilistic search algorithm for rigid-body protein-protein docking. The algorithm is a realization of the basin hopping framework for sampling low-energy local minima of a given energy function. To save computational resources, the algorithm employs a machine learning model to score bound configurations prior to subjecting promising configurations to local optimization with a sophisticated force field. The machine learning model is a decision tree trained on known native dimers to learn features that constitute true interaction interfaces. The FoldX force field is employed only on sampled dimeric configurations determined by the decision tree model to contain true interaction interfaces. The preliminary results are promising and motivate us to further investigate such an informatics-driven approach to protein-protein docking.
Determining miRNA-disease associations using bipartite graph modelling
Joseph Nalluri,
Bhanu Kamapantula,
Preetam Ghosh,
Debmalya Barh,
Neha Jain,
Lucky Juneja,
Neha Barve
Pages: 672
doi: 10.1145/2506583.2506676
Exploring miRNA-disease interactions is critical to identifying the impact of a disease on other diseases. Mapping this problem to a graph-theoretical concept offers a unique perspective for studying unseen relationships among diseases. In our work, we model the miRNA-disease associations as a bipartite graph and apply maximum weighted matching. We also address a limitation of this approach using a disease ranking scheme, and the results are presented.
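On a small association table, the maximum-weight bipartite matching step can be illustrated by exhaustive search; real instances call for a proper matching algorithm (e.g., Hungarian/Kuhn-Munkres), and the miRNA and disease names below are placeholders, not data from the paper:

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exhaustive maximum-weight matching for a small bipartite graph;
    weights maps (mirna, disease) pairs to association scores.
    Assumes no more miRNAs than diseases."""
    mirnas = sorted({m for m, _ in weights})
    diseases = sorted({d for _, d in weights})
    best_pairs, best_score = [], float("-inf")
    for assign in permutations(diseases, len(mirnas)):
        # keep only pairs that are actual edges of the bipartite graph
        pairs = [(m, d) for m, d in zip(mirnas, assign) if (m, d) in weights]
        score = sum(weights[p] for p in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, sorted(best_pairs)
```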
Statistical Methods for Ambiguous Sequence Mappings
Tamer Aldwairi,
Bindu Nanduri,
Mahalingam Ramkumar,
Dilip Gautam,
Michael Johnson,
Andy Perkins
Pages: 674
doi: 10.1145/2506583.2506678
Mapping RNA sequences to a reference genome often results in a high percentage of short reads being assigned to multiple locations within the genome. These mappings are known as "ambiguous mappings" and are often discarded by sequence mapping tools and pipelines. The number of ambiguous mappings in these data sets can be large, in some cases accounting for as much as one third of the mapped sequences. We are developing task-specific computer programs that use statistical methods as an alternative solution to this problem. The statistical approach is based on identifying significantly expressed genomic locations. We handle ambiguous data through a multi-step process, starting with a standard short-read alignment tool to identify all possible mappings within the genome for each sequence read. Custom programs then identify expressed genomic locations by statistical methods; that is, we compare gene expression in the regions of interest with that of a number of randomly selected genomic locations. These comparisons help establish a threshold at which a gene is considered significantly expressed, and determine the locations that are most likely the best mapping for each ambiguous sequence.
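The comparison against randomly selected genomic locations amounts to an empirical significance test. A minimal sketch of that idea in Python, where the sampling scheme and the names are assumptions rather than the authors' exact procedure:

```python
import random

def empirical_pvalue(region_count, background_counts, n_samples=1000, seed=0):
    """Fraction of randomly sampled background locations whose read count
    is at least as large as the observed count in the region of interest."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.choice(background_counts) >= region_count)
    return hits / n_samples
```

A region whose empirical p-value falls below a chosen cutoff would be treated as significantly expressed and thus as the preferred home for an ambiguous read.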

ngPhylo: N-Gram Modeled Proteins with Substitution Matrices for Phylogenetic Analysis
Brigitte Hofmeister, Brian R. King
Pages: 676
doi: 10.1145/2506583.2506679

Phylogenetic tree constructions are important for understanding evolution and species relatedness. Most methods require a multiple sequence alignment (MSA) to be performed prior to inducing the phylogenetic tree. MSAs, however, are computationally expensive and increasingly error prone as the number of sequences increases, as the average sequence length increases, and as the sequences in the set become more divergent. We introduce a new method called ngPhylo, an n-gram based method that addresses many of the limitations of MSA-based phylogenetic methods and computes alignment-free phylogenetic analyses on large sets of proteins with long sequences. Unlike other methods, we incorporate standard substitution matrices to improve similarity measures between sequences. Our results show that phylogenies highly similar to those of existing MSA-based methods are produced while requiring fewer computational resources.
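One way to picture combining n-grams with a substitution matrix is to score each n-gram of one sequence against its best-matching n-gram of the other. The residue alphabet and substitution scores below are hypothetical toys, and this best-match scheme is only an illustrative stand-in for ngPhylo's actual similarity measure:

```python
# Toy substitution scores over a 3-residue alphabet (hypothetical values,
# not BLOSUM62; a real implementation would load a standard matrix).
SUB = {("A", "A"): 4, ("A", "S"): 1, ("S", "A"): 1, ("S", "S"): 4,
       ("A", "L"): -1, ("L", "A"): -1, ("S", "L"): -2, ("L", "S"): -2,
       ("L", "L"): 4}

def ngrams(seq, n=2):
    """All overlapping n-grams of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def gram_score(g1, g2):
    """Position-wise substitution score between two equal-length n-grams."""
    return sum(SUB[(a, b)] for a, b in zip(g1, g2))

def seq_similarity(s1, s2, n=2):
    """Alignment-free similarity: each n-gram of s1 is matched to its
    best-scoring n-gram of s2 under the substitution matrix, then averaged."""
    grams1, grams2 = ngrams(s1, n), ngrams(s2, n)
    best = [max(gram_score(g1, g2) for g2 in grams2) for g1 in grams1]
    return sum(best) / len(best)

sim_self = seq_similarity("ALSA", "ALSA")
sim_near = seq_similarity("ALSA", "ALSS")
```

Because no alignment is computed, the cost grows with the product of the n-gram counts rather than with alignment dynamic programming over many sequences, which is the efficiency argument the abstract makes.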

Automated protein structure refinement using i3Drefine software and its assessment in CASP10
Debswapna Bhattacharya, Jianlin Cheng
Pages: 678
doi: 10.1145/2506583.2506680

We present fully automated and computationally efficient protein 3D structure refinement software called i3Drefine, based on an iterative and highly convergent energy minimization algorithm with a powerful all-atom composite physics- and knowledge-based force field and a hydrogen bonding (HB) network optimization technique. In the recent community-wide blind experiment, CASP10, i3Drefine (as 'MULTICOM-CONSTRUCT') was ranked the best method in the server section according to the official assessment of the CASP10 experiment. Our analysis demonstrates that i3Drefine is the only fully automated server participating in CASP10 that exhibits consistent improvement over the initial structures in both global and local structural quality metrics. An executable version of the i3Drefine software is freely available at http://protein.rnet.missouri.edu/i3drefine/.

A PCA-guided Search Algorithm to Probe the Conformational Space of the Ras Protein
Rudy Clausen, Amarda Shehu
Pages: 679
doi: 10.1145/2506583.2506681

We present an algorithm to probe the conformational space of Ras, a critical enzyme that employs conformational switching for its biological activity. The algorithm is guided by experimental data on crystallographic structures of wildtype and mutant Ras. A principal component analysis (PCA) over these structures provides search directions, which are used in combination with energetic refinement to sample low-energy conformations of Ras. Our results show that experimental structures are reproduced, and the space is further populated with novel structures, warranting further investigation into structural characterization of Ras.

A Novel Algorithm for Feature Detection and Hiding from Ultrasound Images
Haris Godil, Sonya Davey, Raj Shekhar
Pages: 681
doi: 10.1145/2506583.2506683

Female feticide, a common problem in many countries such as India and China, is skewing the population and lowering the status of women in these societies. Ultrasound machines are the tool of choice for sex determination of a fetus. We have developed an algorithm that takes an ultrasound image, detects and localizes the genitalia, and outputs the image with the genital area blurred. Our algorithm uses a sliding-window-based approach and generates a number of features for each window. A trained classifier then recognizes whether the window contains genitalia and provides a confidence level.

Bacterial pan-genomes: data representation and analysis
Leonid Zaslavsky, Boris Fedorov, Tatiana Tatusova
Pages: 683
doi: 10.1145/2506583.2506684

Bacterial genomes at NCBI represent a large collection of strains with different levels of sequence and assembly quality as well as sampling density. Among these, there are densely sampled sets of related genomes, usually human pathogens, whose organization and protein content can be directly analyzed within the concept of the pan-genome. Even in groups of close genomes, protein families appear with very different frequencies, with "core proteins" at one end, "dispensable proteins" at the other, and "accessory proteins" in between. In order to organize the genomes available in the NCBI repositories into related groups (species-level clades), we use a method based on a robust distance between sets of ribosomal proteins. The threshold is selected so that most clades contain one species, with some clades containing genomes from a few species. Within each clade, we then build trees based on similarity of protein content using hierarchical clustering with tight parameters. In order to identify protein families for genomes within a clade accurately and reliably, we use a combined approach taking into account both sequence similarity and genome context: first, proteins are clustered into tentative clusters using inclusive parameters; then, within each tentative cluster, local genome context and the protein phylogenetic tree are used to separate paralogs. The combined approach allows core and conservative clusters for the pan-genome to be defined more accurately than by sequence-based clustering alone. For computational efficiency, protein redundancy and near-redundancy are eliminated, with one representative sequence used from each near-redundant group.

Using Machine Learning to Predict the Health of HIV-Infected Patients
Charles L. Cole, Brian R. King
Pages: 684
doi: 10.1145/2506583.2506685

Human immunodeficiency virus-1 is a complex retrovirus that gradually destroys the body's immune system, making it harder for the individual to fight infections. The worst prognosis for an infected individual is AIDS; however, this outcome does not occur in everyone. Moreover, not every infected person develops AIDS at the same rate [5]. We developed a method that can predict the disease prognosis of human HIV infections based on non-redundant HIV genomic and proteomic sequence data. Using the random forest classification method on the genomic data, we obtained over 91% accuracy over four different disease levels. We also analyzed the proteins expressed from five of the nine genes in HIV. We found that the rev gene had the highest predictive performance for disease level. Using a decision tree, we were able to output rules containing specific variants in the protein that can suggest disease outcomes. This information may help researchers understand underlying variants of the gene that lead to different patient outcomes. Moreover, this knowledge can improve the selection of appropriate treatment methods depending on the predicted infection level, and also improve drug targeting.

RNA-Seq analyses to reveal the human transcriptome landscape
Nan Deng, Dongxiao Zhu
Pages: 686
doi: 10.1145/2506583.2506603

Alternative splicing plays important roles in many biological processes, including diseases. It markedly increases the diversity of the transcriptome and proteome, since over 90% of human genes are alternatively spliced. Recently, high-throughput RNA-Seq technology has made it possible to better characterize and understand transcriptomes. Differential expression and differential splicing are two fundamental yet crucial analyses for studying differences between transcriptomes. The results from these analyses may reveal the landscape of human transcriptomes and yield new insight into cell differentiation that may lead to human disease. We present the analysis results from two RNA-Seq data sets to study the transcriptomes of a human disease and a type of human cell differentiation. For the first study, we applied our analysis pipeline to an RNA-Seq data set of human Idiopathic Pulmonary Fibrosis (IPF) disease. We present a joint analysis of differential expression and differential splicing to view genes from both aspects simultaneously. We also provide several non-differentially spliced genes with splicing variants validated by qRT-PCR experiments. For the second study, we developed a novel computational method and applied it to a public RNA-Seq data set of human H1 cells and their differentiation into neural progenitor cell lines. We systematically detected many significant differential splicing events falling into five well-known types of alternative splicing. We present the proportion of the five types of detected differential splicing events in this study. For each type of splicing event, we show a case study to demonstrate the detection procedure of the differential splicing event.

Initial Results In Using de Novo Motif Inference to Detect Cis-Regulatory Modules
Jeffrey A. Thompson, Clare Bates Congdon
Pages: 687
doi: 10.1145/2506583.2506689

In this work, we extend GAMI, a de novo motif inference system, to find sets of motifs that may function as part of a cis-regulatory module (CRM). Evidence suggests that most transcription factors in humans are part of a CRM, so this approach is expected to yield stronger candidates for de novo inference of candidate regulatory elements.

Comparative network analysis of gene co-expression networks reveals the conserved and species-specific functions of cell-wall related genes between Arabidopsis and Poplar
Daifeng Wang, Eric Pan, Gang Fang, Sunita Kumari, Fei He, Doreen Ware, Sergei Maslov, Mark Gerstein
Pages: 689
doi: 10.1145/2506583.2506690

In this study, we established a computational framework of comparative network analysis to identify the conserved and species-specific functions of cell-wall (CW) related genes [1, 2], an important gene family related to plant bio-fuel production, across multiple tissue types between Arabidopsis and Poplar. Co-expressed genes are believed to coordinate in transcription and thus may have similar functions [3, 4]. In addition, a comparative analysis of gene co-expression networks (GCNs) across species provides a systematic way to understand conserved or species-specific genomic functions [5]. Therefore, to understand the functions of CW genes in different tissue types, we integrated and compared the network characteristics of CW genes across GCNs from different tissue types, including leaf, flower, and shoot, for Arabidopsis and Poplar [6]. First, by aligning the gene co-expression sub-networks associated with CW genes between the two plants for each tissue type, we grouped the tissue types based on the alignment of the CW genes along with their neighboring orthologous genes. For tissues with good alignments, this suggests that CW genes coordinate in a similar way in both plants and may be involved in conserved functions. For tissues with poor alignments, however, CW genes may take part in species-specific functions. Gene ontology enrichment and the signaling pathways of their co-expressed neighboring genes were identified to provide new insight into cell wall biology. Second, since genes with high network centralities in a GCN, so-called "hub" genes, are believed to have key functions [7], we investigated the network centralities of the CW genes in the two plants to understand their functions from a global network point of view. The GCN centralities we used are the clustering coefficient (CC), measuring a gene's local cliqueness, and eigenvector centrality (EC), measuring a gene's global influence over the entire network.

Besides finding hub genes for each tissue type within and across the two plants, we also identified conserved hub genes and tissue-specific hub genes in either a local or a global fashion. The CW genes that turn out to be hubs were of particular interest. If many CW genes are global hubs in certain tissues, cell-wall-related activities may interact with the whole plant in those tissues; if they are local hubs, they may coordinate only with certain local activities. Finally, we used genomic variation data to identify species-specific SNPs, especially in the promoter regions of the CW co-expressed neighboring genes across tissues, and associated them with corresponding species-specific functions. In summary, our comparative network analysis framework studied gene co-expression networks for cell-wall-related genes across different tissue types in Arabidopsis and Poplar and identified their conserved and species-specific functions and variations. This framework can also be used to study other gene families and their functions across multiple species.

Reachability analysis in large probabilistic biological networks
Andrei Todor, Haitham Gabr, Alin Dobra, Tamer Kahveci
Pages: 691
doi: 10.1145/2506583.2506691
Scheduling of virtual screening application on multi-user pilot-agent platform on grid/cloud to optimize the stretch
Bui The Quang, Nguyen Hong Quang, Emmanuel Medernach, Vincent Breton
Pages: 692
doi: 10.1145/2506583.2512369

In this paper, we present our research on the scheduling of a virtual screening platform on grid/cloud resources shared by many users. We seek a scheduling policy that ensures fairness between users. Using our simulator, we evaluate two policies from the existing platform (FIFO and Round Robin) and two candidate policies from the literature (SPT and LPT). Simulation results show that SPT improves on the performance of the scheduling policies in the existing platform.

Exploring Local Features and the Bag-of-Visual-Words Approach for Bioimage Classification
Afzal Godil, Zhouhui Lian, Asim Wagan
Pages: 694
doi: 10.1145/2506583.2512370

With recent advances in imaging technologies, large numbers of bioimages are currently being acquired. Automated classification of these bioimages is a very important and challenging problem. Here we investigate the capabilities of local features and the Bag-of-Visual-Words (BOV) approach in the area of bioimage classification. We have tested both sparse and dense placement of local features. The local feature that we have tested is the Scale-Invariant Feature Transform (SIFT), but we are in the process of testing other local features. The standard BOV approach is based on counting the number of local descriptors assigned to each quantization bin. In our case we also use other statistics (mean and covariance of local descriptors). The classifier used for this study is the Support Vector Machine (SVM). We performed classification experiments on the well-tested single-cell 2D HeLa dataset from CMU and achieved performance similar to the state of the art.
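The counting step of the standard BOV pipeline is simple to sketch. The descriptors and codebook below are hypothetical 2-D toys (real SIFT descriptors are 128-dimensional, and codebooks typically come from k-means over training descriptors):

```python
def assign(descriptor, codebook):
    """Index of the nearest codeword (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((d - c) ** 2
                                 for d, c in zip(descriptor, codebook[k])))

def bov_histogram(descriptors, codebook):
    """Standard BOV: count how many local descriptors quantize to each
    visual word. The histogram is the image's fixed-length representation."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[assign(d, codebook)] += 1
    return hist

# Hypothetical 2-D descriptors and a 3-word codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descs = [(0.1, 0.1), (0.9, 1.1), (0.2, 0.9), (0.0, 0.2)]
h = bov_histogram(descs, codebook)
```

The resulting histogram, one count per visual word, is what would be fed to the SVM; the richer statistics the abstract mentions (per-word mean and covariance of assigned descriptors) extend this same assignment step.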

Quality of Care and Electronic Health Record Systems
Arshia Khan, John Grillo
Pages: 696
doi: 10.1145/2506583.2512371

The state of healthcare in the United States of America is in jeopardy. Researchers have suggested the integration of technology to improve the staggering quality of care. Most urban hospitals have support in terms of finances, research, and professional services, whereas rural hospitals must accomplish the integration of technology with little means to support and monitor the process. This study examined the quality of care in rural critical access hospitals with respect to the maintenance of an up-to-date problem list.

Prediction of Biological Protein-protein Interaction Types Using Short-Linear Motifs
Manish Pandit, Luis Rueda, Alioune Ngom
Pages: 698
doi: 10.1145/2506583.2512372

Protein-protein interactions (PPIs) play a key role in many biological processes and functions in living cells. Thus, identification, prediction, and analysis of PPIs are important aspects of molecular biology. We propose a computational model to predict biological PPI types using short-linear motifs (SLiMs). The information contained in protein sequences is used to distinguish between interaction types, namely obligate and non-obligate. Classifiers such as k-nearest neighbor (k-NN), support vector machine (SVM), and linear dimensionality reduction (LDR), applied to two well-known datasets, confirm the power of the proposed model with accuracy above 99%. The results show that the information contained in the training sequences is crucial for prediction and analysis of biological PPIs.

Conditional Random Field for Candidate Gene Prioritization
Bingqing Xie, Gady Agam, Natalia Maltsev, Conrad Gilliam
Pages: 700
doi: 10.1145/2506583.2512374

Prioritization of novel disease genes is a major challenge in bioinformatics. The large amount of data collected from modern biological experiments makes it difficult for biologists to determine how information on a particular gene relates to a disease or phenotype, whereas performing exhaustive experiments on all possible combinations is impossible. Computational approaches are thus crucial in automating the process of extracting critical annotation and patterns and predicting relevant novel genes with high confidence. In this paper we propose a new method for prioritizing disease genes using both annotations on the genes and the underlying gene interaction network. Our approach is unique in that it uses a conditional random field to simultaneously exploit both network and annotation information directly, without attempting to convert the network information into features or vice versa. Performance evaluation on standard data sets achieves a median ranking of 29% and an area under curve value above 0.6 in cross-validation experiments on 42 diseases.

Quantum Sequence Analysis: A New Alignment-free Technique For Analyzing Sequences in Feature Space
Mosaab Daoud
Pages: 702
doi: 10.1145/2506583.2512375

In this paper, we propose a new alignment-free sequence analysis technique (quantum sequence analysis) that can be used to analyze sequences in feature space. The proposed technique can be used to estimate the membership value of a given query sequence with respect to different classes of sequences using stochastic approximation, without making any prior stochastic assumptions. We evaluated the proposed technique on real datasets, and it shows effectiveness in analyzing sequences in feature space.

Predicting Breast Cancer Patient Survival Using Machine Learning
David Solti, Haijun Zhai
Pages: 704
doi: 10.1145/2506583.2512376

Our null hypothesis was that a computer algorithm will not predict breast cancer patients' 10-year survival with greater accuracy than the 64.3% baseline of the Surveillance Epidemiology and End Results (SEER) database [3]. The aims of this study were to (1) build an infrastructure to convert SEER data into a machine-readable format; (2) train machine learning (ML) algorithms to predict breast cancer patients' 10-year survival; and (3) measure the predictive accuracy of the ML algorithms. We downloaded 657,711 breast cancer patients' clinical and demographic characteristics from the SEER database and converted them into machine-readable feature vectors. An oncologist generated a list of potential variables for the ML algorithms. We trained the WEKA machine learning package's Logistic Regression (LR), Naive Bayes, and C4.5 Decision Tree algorithms on the data using ten-fold cross-validation. LR, Naive Bayes, and C4.5 Decision Tree achieved accuracies of 76.29%, 59.71%, and 77.43% respectively. We compared the results of the LR algorithm with those of a well-known website, Adjuvant! Online. The results rejected the null hypothesis for LR and the C4.5 Decision Tree, but failed to reject it for Naive Bayes. Of the algorithms tested, C4.5 proved to be the most accurate predictor of ten-year patient survival. In addition, LR provided more accurate predictions than Adjuvant! without Adjuvant!'s limitations.
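The evaluation protocol, ten-fold cross-validation over labeled patient vectors, can be sketched as below. The majority-class stand-in model and the 64/36 label split are illustrative only, not the SEER data or WEKA's classifiers:

```python
def cross_validate(examples, labels, train_fn, predict_fn, k=10):
    """k-fold cross-validation accuracy.
    train_fn(xs, ys) returns a model; predict_fn(model, x) returns a label."""
    n = len(examples)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th example held out
        train_x = [examples[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = train_fn(train_x, train_y)
        correct += sum(1 for i in test_idx
                       if predict_fn(model, examples[i]) == labels[i])
    return correct / n

# Trivial majority-class "model", a stand-in for LR / C4.5 / Naive Bayes.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

xs = list(range(100))                     # placeholder feature vectors
ys = ["survived"] * 64 + ["died"] * 36    # mimics the ~64% survival baseline
acc = cross_validate(xs, ys, train_majority, predict_majority)
```

A majority-class predictor recovers roughly the baseline accuracy, which is exactly why the study's null hypothesis is framed against the 64.3% SEER baseline: a useful classifier must beat what always predicting the majority class already achieves.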

ngsShoRT: A Software for Pre-processing Illumina Short Read Sequences for De Novo Genome Assembly
Chuming Chen, Sari S. Khaleel, Hongzhan Huang, Cathy H. Wu
Pages: 706
doi: 10.1145/2506583.2512377

The shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific artifacts of next-generation sequencing (NGS) technologies hinder de novo genome assembly. We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open source software package written in Perl that implements novel algorithms as well as other commonly used pre-processing algorithms from the literature to process Illumina short-read sequences for downstream data analyses. We compared the effects of different pre-processing algorithms/methods on the de novo assembly of the C. elegans genome by measuring assembly contiguity and correctness. Our experiments show that removing reads with ambiguous "N" bases and adapter sequences, and trimming the 3′ ends of reads using our novel quality-score-based trimming algorithms, improved assembly quality. The source code of ngsShoRT is freely available and can be easily incorporated as a pre-processing step in a genome/transcriptome assembly pipeline.
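A minimal sketch of two of the pre-processing steps mentioned, removal of reads with ambiguous "N" bases and quality-score-based 3′-end trimming, assuming Phred+33-encoded quality strings; ngsShoRT's actual trimming algorithms are more elaborate:

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    qual is a Phred+33 ASCII string; bases are removed from the 3' end
    until one with quality >= min_q is reached.
    """
    keep = len(seq)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]

def drop_ambiguous(reads):
    """Discard reads containing ambiguous 'N' bases."""
    return [(s, q) for s, q in reads if "N" not in s]

# In Phred+33, 'I' encodes Q40 and '#' encodes Q2.
seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
kept = drop_ambiguous([("ACGN", "IIII"), ("ACGT", "IIII")])
```

Trimming stops at the first base (scanning from the 3′ end) that meets the quality threshold, so internal low-quality bases are left alone; that is the usual behavior of simple 3′ trimmers.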

Role of Quality in Electronic Health Record Systems
Arshia Khan, John Grillo
Pages: 708
doi: 10.1145/2506583.2512379

The state of healthcare in the United States of America is in jeopardy. Researchers have suggested the integration of technology to improve the staggering quality of care. Most urban hospitals have support in terms of finances, research, and professional services, whereas rural hospitals must accomplish the integration of technology with little means to support and monitor the process. This study examined the quality of care in rural critical access hospitals with respect to the maintenance of an up-to-date problem list.

Sparse and Stable Reconstruction of Genetic Regulatory Networks Using Time Series Gene Expression Data
Roozbeh Manshaei, Matthew Kyan
Pages: 710
doi: 10.1145/2506583.2512380

Gene regulatory networks represent the regulatory and physical interactions between genes of an organism. In this application, we are presented with a set of time series gene expression data, from which an unknown topology describing the regulatory interactions between genes must be inferred. To this end, we formulate an algorithm for reconstructing a genetic regulatory network to explain time series data obtained from genetic experiments. Our algorithm minimizes the trade-off between the sparsity of gene interactions in the inferred network and model accuracy, with stability and prior knowledge considered as constraints. Our algorithm is applied to time series gene expression data from yeast cell-cycle regulation, and the results show improved reconstruction. The convex nature of the proposed model makes it suitable for application to large-scale networks.

Listing Sorting Sequences of Reversals and Translocations
Amritanjali, G. Sahoo
Pages: 712
doi: 10.1145/2506583.2512381

Algorithms for sorting by reversals and translocations (SBRT) are often used to propose evolutionary scenarios of multichromosomal genomes. The existing algorithms for the SBRT problem provide a single sorting sequence of reversals and translocations. In this paper we propose multiple solutions for the problem and present a methodology to list all possible sorting sequences of reversals and translocations for a given pair of genomes without knots. Listing all the sorting sequences is useful in assessing the biological merits of various parsimonious rearrangement scenarios.
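The elementary operation underlying any such sorting sequence, a signed reversal, can be sketched as follows (the permutation shown is a hypothetical example; genes are signed integers whose sign records strand orientation):

```python
def reversal(perm, i, j):
    """Apply a reversal to a signed permutation: reverse the segment
    perm[i..j] (inclusive) and flip the sign of every element in it."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]

# One step toward sorting [1, -3, -2, 4]: reversing the segment (-3, -2)
# both reorders it and restores the positive orientation.
p = reversal([1, -3, -2, 4], 1, 2)
```

A sorting sequence is then a chain of such operations (plus translocations, for multichromosomal genomes) transforming one genome's permutation into the identity; the paper's contribution is enumerating all minimum-length chains rather than reporting a single one.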

An Overview on Semantic Analysis of Proteomics Data
Pietro Hiram Guzzi, Marco Mina, Concettina Guerra, Mario Cannataro
Pages: 714
doi: 10.1145/2506583.2512382

The availability of biological knowledge, recently encoded in ontologies such as the Gene Ontology, is driving the development of novel methods for the analysis of experimental data that integrate prior information. A recent trend is the use of Semantic Similarity Measures (SSMs) to quantify the functional similarity of biological molecules starting from qualitative data (i.e., their functions or localization within cells). A plethora of SSMs, and analysis frameworks based on them, have recently been proposed. There are, however, several issues in the use of SSMs still to be fully addressed, as well as in their assessment with respect to biological features (e.g., is there any correlation between SSMs and biological properties such as sequence similarity?). In this work, after a brief introduction to the main SSMs, we dissect the ongoing assessment efforts.

decisivatoR: an R infrastructure package that addresses the problem of phylogenetic decisiveness
Ilya Y. Zhbannikov, Joseph W. Brown, James A. Foster
Pages: 716
doi: 10.1145/2506583.2512383

One of the major challenges in evolutionary biology is reconstructing the Tree of Life. Many phylogenetic trees have been estimated for many sequence and character datasets. Supertree methods have been developed to take advantage of this infrastructure by combining separate phylogenies into a single "supertree". However, the trees often do not contain enough data to construct a unique supertree. In this case multiple supertrees are consistent with the constituent trees, and there is no a priori way to choose the best. A set of taxa and characters used to build component trees is decisive when they are sufficient to determine a unique supertree. Clearly, it would be useful to know when a set of characters and taxa are decisive, and therefore when they lead to a unique supertree. We have developed and validated an R package, the DECISIVATOR, to provide this service.
|
|
|
Estimating the Number of Manually Segmented Cellular Objects Required to Evaluate the Performance of a Segmentation Algorithm |
| |
Adele Peskin,
Joe Chalfoun,
Karen Kafadar,
John Elliott
|
|
Pages: 718 |
|
doi>10.1145/2506583.2512384 |
|
Full text: PDF
|
|
We propose a new strategy for estimating the number of cellular objects that should be manually segmented for evaluating the segmentation performance of an algorithm. The strategy uses geometric and edge quality measurements that are directly related to segmentation performance, but do not require highly accurate segmentation. Sample sizes are determined from standard deviations of cell features calculated from the entire image set. We examine the relationship between approximate confidence level and sample size. The use of our strategy may reduce the effort and time required for generating a reference dataset for evaluating segmentation algorithm performance with images of biological cells. We demonstrate the usefulness of this methodology on a large and diverse data set for which reference data are available.
|
|
|
Abstraction of Kinetic Models For Biochemical Networks |
| |
Calvin Hopkins,
Soha Hassoun
|
|
Pages: 720 |
|
doi>10.1145/2506583.2512386 |
|
Full text: PDF
|
|
Constructing kinetic models that describe the time-dependent behavior of every enzyme-catalyzed reaction in a genome-scale model is a daunting task. Mechanistic knowledge of enzyme kinetics is often unavailable, and estimating a consistent set of rate parameters from time-series data requires a large experimental effort for even a moderately-sized network. Model construction techniques that derive computationally efficient, biochemically meaningful, accurate dynamic models are needed. A new method, Abstraction of Kinetic Models (AKM), explores constructing predictive kinetic models of modules, where the primary goal is preserving module-level behavior instead of developing accurate kinetic expressions for each reaction within the module. When eliminating a module variable (e.g. the concentration of a particular metabolite), our method compensates by shifting the roles of other metabolites within the module as activators and inhibitors, if needed, and by calculating a new set of parameter values. AKM provides a systematic method for exploring accuracy vs. simplicity tradeoffs during abstract model construction. Validation efforts on two test cases demonstrate such tradeoffs, and show that modest loss of accuracy is attainable when some internal metabolite concentrations are eliminated and when the newly constructed network model compensates for missing variables.
|
|
|
Co-occurrence Clusters of Aligned Pattern Clusters |
| |
Sanderz Fung
|
|
Pages: 721 |
|
doi>10.1145/2506583.2506664 |
|
Full text: PDF
|
|
Advances in bioinformatics have provided researchers with a large influx of novel sequences, thus making the analysis of the sequences for inherent biological knowledge crucial. Important protein segments can be represented by variable patterns, obtained as a set of Aligned Pattern Clusters (APCs) by using pattern discovery and pattern synthesis on protein family sequences. We develop a method for clustering APCs based on their co-occurrences on the same protein sequence. Their co-occurrence indicates how protein segments in a protein family interact with one another. The purpose of this paper is to provide a method that, given a list of discovered APCs from a family of protein sequences, finds a set of interdependent APC clusters with high co-occurrence in sequences of a protein family. The significance of these co-occurrence clusters is verified by the corresponding three-dimensional structure and function of the protein. We applied our method to eight protein families obtained from Pfam, including triosephosphate isomerase and ubiquitin. We found that the closely co-occurring clusters of APCs in each protein family are close in the three-dimensional protein structures, suggesting interactions of the APC segments. In conclusion, we discover that there is a connection between high co-occurrence of APCs and three-dimensional closeness.
|
|
|
Three-Dimensional Spot Detection in Ratiometric Fluorescence Imaging For Measurement of Subcellular Organelles |
| |
William W. Lau,
Calvin A. Johnson,
Sara Lioi,
Joseph A. Mindell
|
|
Pages: 722 |
|
doi>10.1145/2506583.2512387 |
|
Full text: PDF
|
|
Lysosomes are subcellular organelles playing a vital role in the endocytosis process of the cell. Lysosomal acidity is an important factor in assuring proper functioning of the enzymes within the organelle, and can be assessed by labeling the lysosomes with pH-sensitive fluorescence probes. To enhance our understanding of the acidification mechanisms, the goal of this work is to develop a method that can accurately detect and characterize the acidity of each lysosome captured in ratiometric fluorescence images. We present an algorithm that utilizes the h-dome transformation and reconciles spots detected independently from two wavelength channels. We evaluated our algorithm using simulated images for which the exact locations were known. The h-dome algorithm achieved an f-score as high as 0.890. We also computed the fluorescence ratios from lysosomes in live HeLa cell images with known lysosomal pHs. Using leave-one-out cross-validation, we demonstrated that the new algorithm was able to achieve much better pH prediction accuracy than the conventional method.
|
|
|
Evaluating theoretical models of protein interaction network evolution without seed graphs |
| |
Todd A. Gibson,
Debra S. Goldberg
|
|
Pages: 724 |
|
doi>10.1145/2506583.2512388 |
|
Full text: PDF
|
|
Here we develop an alternate method to evaluate the evolutionary mechanics of theoretical network models which is free of the bias introduced by seed graph selection. We run a model in reverse directly on empirical data, and then run the model forward to generate a network topology to compare against the empirical data. We implement this method on a well-regarded gene duplication and divergence model, and find that it is unable to generate the high clustering found in the empirical data.
|
|
|
The Atomizer: Extracting Implicit Molecular Structure from Reaction Network Models |
| |
Jose-Juan Tapia,
James R. Faeder
|
|
Pages: 726 |
|
doi>10.1145/2506583.2512389 |
|
Full text: PDF
|
|
In this paper we introduce the Atomizer, an expert system for extracting implicit information from reaction network models, like those encoded by the Systems Biology Markup Language (SBML), to create a structured translation using the rule-based modeling paradigm. Atomized models can be visualized in a compact form through contact maps, which show the underlying molecules, components, and interactions used to construct a model. Analysis of the atomized reactions reveals simplifying assumptions made in the construction of a model that limit the combinatorial complexity. These benefits are elucidated through a case study. We anticipate that the library of translated rule-based models we can generate using the Atomizer will be useful to the biological modeling community by providing a more accessible view of the available models and by facilitating their extension and merging.
|
|
|
Studies of biological networks with statistical model checking: application to immune system cells |
| |
Natasa Miskov-Zivanov,
Paolo Zuliani,
Edmund M. Clarke,
James R. Faeder
|
|
Pages: 728 |
|
doi>10.1145/2506583.2512390 |
|
Full text: PDF
|
|
We use computational modeling and formal analysis techniques to study the temporal behavior of a discrete logical model of naïve T cell differentiation. The model is analyzed formally and automatically by performing temporal logic queries via statistical model checking. The results obtained using model checking provide details about the relative timing of events in the system, which would otherwise be very cumbersome and time-consuming to obtain through simulations only.
|
|
|
An Algorithm for Constructing Hypothetical Evolutionary Trees Using Common Mutation Similarity Matrices |
| |
Peter Z. Revesz
|
|
Pages: 730 |
|
doi>10.1145/2506583.2512391 |
|
Full text: PDF
|
|
In this paper, we introduce a new evolutionary tree algorithm that is based on common mutation similarity matrices instead of distance matrices.
|
|
|
Predicting Protein Families using Protein Shape Context |
| |
Jun Tan,
Donald Adjeroh
|
|
Pages: 733 |
|
doi>10.1145/2506583.2512392 |
|
Full text: PDF
|
|
Given the rapidly increasing quantity of available genomic and proteomic data, efficient and reliable analysis of protein 3D structures has become a major challenge in the post-genomic era. In this work, we introduce the sorted protein shape context, and its encoding into a protein shape string as an effective descriptor for protein 3D structures. Based on the new encoding, we present a method for predicting the functional family for a given protein 3D structure. Applying the proposed method on a dataset of known protein families from Pfam resulted in an average Type I error rate of 10% and Type II error rate of 0.1%.
|
|
|
WORKSHOP SESSION: Workshop CSBW |
|
|
|
|
Fast and Accurate Structure-Based Prediction of Resistance to the HIV-1 Integrase Inhibitor Raltegravir |
| |
Majid Masso
|
|
Pages: 735 |
|
doi>10.1145/2506583.2506703 |
|
Full text: PDF
|
|
Integration of reverse transcribed viral DNA into the human genome represents an essential step in the replication cycle of HIV-1, a process mediated by the viral enzyme integrase (IN). Raltegravir (RAL), an HIV-1 strand transfer inhibitor that binds integrase, is the first drug in its class to be approved for clinical use. As with HIV-1 protease and reverse transcriptase inhibitors, the degree of susceptibility to RAL can vary in patients due to mutations in the viral genome region that encodes IN. Employing a dataset of over two hundred translated IN sequences, each with a quantified susceptibility value and harboring a unique set of amino acid replacements relative to the native IN, here we develop and evaluate statistical learning models for predicting phenotype (i.e., RAL susceptibility) from genotype (i.e., translated IN sequences). Each IN mutant is represented as a feature vector of structure-based attributes obtained via an in silico mutagenesis approach that quantifies IN residue-specific environmental perturbations upon mutation. Cross-validated performance is consistent among four classification models (random forest, support vector machine, decision tree, and neural network), with balanced accuracy reaching 93%, and two regression models (reduced-error pruned tree, and support vector regression), with a correlation coefficient as high as r = 0.90.
|
|
|
Improving the Prediction of Kinase Binding Affinity Using Homology Models |
| |
Jeffrey Chyan,
Mark Moll,
Lydia E. Kavraki
|
|
Pages: 741 |
|
doi>10.1145/2506583.2506704 |
|
Full text: PDF
|
|
Kinases are a class of proteins very important to drug design; they play a pivotal role in many of the cell signaling pathways in the human body. Thus, many drug design studies involve finding inhibitors for kinases in the human kinome. However, identifying inhibitors of high selectivity is a difficult task. As a result, computational prediction methods have been developed to aid in this drug design problem. The recently published CCORPS method [3] is a semi-supervised learning method that identifies structural features in protein kinases that correlate with kinase binding affinity to inhibitors. However, CCORPS is dependent on the amount of available structural data. The amount of known structural data for proteins is extremely small compared to the amount of known protein sequences. To paint a clearer picture of how kinase structure relates to binding affinity, we propose extending the CCORPS method by integrating homology models for predicting kinase binding affinity. Our results show that using homology models significantly improves the prediction performance for some drugs while maintaining comparable performance for other drugs.
|
|
|
A Constrained K-shortest Path Algorithm to Rank the Topologies of the Protein Secondary Structure Elements Detected in CryoEM Volume Maps |
| |
Kamal Al Nasr,
Lin Chen,
Desh Ranjan,
M. Zubair,
Dong Si,
Jing He
|
|
Pages: 749 |
|
doi>10.1145/2506583.2506705 |
|
Full text: PDF
|
|
Although many electron density maps have been produced at medium resolutions, it is still challenging to derive the atomic structure from such volumetric data. Current methods primarily rely on the availability of an existing atomic structure for fitting or a homologous template structure for modeling. In the process of developing a template-free, de novo, method, the topology of the secondary structure elements needs to be resolved first. In this paper, we extend our previous algorithm for finding the optimal solution to the constraint graph problem. We illustrate an approach to obtain the top-K topologies by combining a dynamic programming algorithm with the K-shortest path algorithm. The effectiveness of the algorithms is demonstrated through tests using three datasets of different nature. The algorithm improves the accuracy, space, and time requirements of an existing method.
|
|
|
Exploring the Structure Space of Wildtype Ras Guided by Experimental Data |
| |
Rudy Clausen,
Amarda Shehu
|
|
Pages: 756 |
|
doi>10.1145/2506583.2506706 |
|
Full text: PDF
|
|
The Ras enzyme mediates critical signaling pathways in cell proliferation and development by transitioning between GTP- (active) and GDP-bound (inactive) states. Many cancers are linked to specific Ras mutations affecting its conformational switching between active and inactive states. A detailed understanding of the sequence-structure-function space in Ras is missing. In this paper, we provide the first steps towards such an understanding. We conduct a detailed analysis of X-ray structures of wildtype and mutant variants of Ras. We embed the structures onto a low-dimensional structure space by means of Principal Component Analysis (PCA) and show that these structures are energetically feasible for wildtype Ras. We then propose a probabilistic conformational search algorithm to further populate the structure space of wildtype Ras. The algorithm explores a low-dimensional map as guided by the principal components obtained through PCA. Generated conformations are rebuilt in all-atom detail and energetically refined through Rosetta in order to further populate the structure space of wildtype Ras with energetically-feasible structures. Results show that a variety of novel structures are revealed, some of which reproduce experimental structures not subjected to the PCA but withheld for the purpose of validation. This work is a first step towards a comprehensive characterization of the sequence-structure space in Ras, which promises to reveal novel structures not probed in the wet laboratory, suggest new mutations, propose new binding sites, and even elucidate unknown interacting partners of Ras.
|
|
|
Beta-sheet Detection and Representation from Medium Resolution Cryo-EM Density Maps |
| |
Dong Si,
Jing He
|
|
Pages: 764 |
|
doi>10.1145/2506583.2506707 |
|
Full text: PDF
|
|
Secondary structure element (SSE) identification from volumetric protein density maps is critical for de-novo backbone structure derivation in electron cryo-microscopy (cryoEM). Although multiple methods have been developed to detect SSEs from density maps, accurate detection either requires user intervention or careful adjustment of various parameters. It is still challenging to detect SSEs automatically and accurately from cryoEM density maps at medium resolutions (~5-10Å). A detected β-sheet can be represented either by the voxels of the β-sheet density or by many piecewise polygons composing a rough surface. However, neither of these is effective in capturing the global surface feature of the β-sheet. We present an effective single-parameter approach, SSEtracer, to automatically identify helices and β-sheets from cryoEM three-dimensional (3D) maps at medium resolutions. More importantly, we present a simple mathematical model to represent the β-sheet density. It was tested using eleven cryoEM β-sheets detected by SSEtracer. The RMSE between the density and the model is 1.88Å. The mathematical model can be used for β-strand detection from medium resolution density maps.
|
|
|
Informatics-driven Protein-protein Docking |
| |
Irina Hashmi,
Amarda Shehu
|
|
Pages: 771 |
|
doi>10.1145/2506583.2506709 |
|
Full text: PDF
|
|
Predicting the structure of protein assemblies is fundamental to our ability to understand the molecular basis of biological function. The basic protein-protein docking problem involving two protein units docking onto each other remains challenging. One direction of research is exploring probabilistic search algorithms with high exploration capability, but these algorithms are limited by errors in current energy functions. A complementary direction seeks to understand what constitutes true interaction interfaces. In this paper we present a method that combines the two directions and advances research into computationally-efficient yet high-accuracy docking. We present an informatics-driven probabilistic search algorithm for rigid protein-protein docking. The algorithm builds upon the powerful basin hopping framework, which we have shown in many settings in molecular modeling to have high exploration capability. Rather than operate de novo, the algorithm employs information on what constitutes a native interaction interface. A predictive machine learning model is built and trained a priori on known dimeric structures to learn features correlated with a true interface. The model is fast, accurate, and replaces expensive physics-based energy functions in scoring sampled configurations. A sophisticated energy function is used to refine only high-scoring configurations. The result is an ensemble of high-quality decoy configurations that we show here to approach the known native dimeric structure better than other state-of-the-art docking methods. We believe the proposed method advances computationally-efficient high-accuracy docking.
|
|
|
An Evolutionary Conservation & Rigidity Analysis Machine Learning Approach for Detecting Critical Protein Residues |
| |
Filip Jagodzinski,
Bahar Akbal-Delibas,
Nurit Haspel
|
|
Pages: 779 |
|
doi>10.1145/2506583.2506708 |
|
Full text: PDF
|
|
In proteins, certain amino acids may play a critical role in determining their structure and function. Examples include flexible regions which allow domain motions, and highly conserved residues on functional interfaces which play a role in binding and interaction with other proteins. Detecting these regions facilitates the analysis and simulation of protein rigidity and conformational changes, and aids in characterizing protein-protein binding. We present a machine-learning based method for the analysis and prediction of critical residues in proteins. We combine amino-acid specific information and data obtained by two complementary methods. One method, KINARI-Mutagen, performs graph-based analysis to find rigid clusters of amino acids in a protein, and the other method uses evolutionary conservation scores to find functional interfaces in proteins. We devised a machine learning model that combines both methods, along with amino acid type and solvent-accessible surface area, applied it to a dataset of proteins with experimentally known critical residues, and were able to achieve a prediction rate of over 77%, higher than either of the methods separately.
|
|
|
Multi-Resolution Rigidity-Based Sampling of Protein Conformational Paths |
| |
Dong Luo,
Nurit Haspel
|
|
Pages: 786 |
|
doi>10.1145/2506583.2506710 |
|
Full text: PDF
|
|
We present a geometry-based, sampling-based method to explore conformational pathways in medium and large proteins which undergo large-scale conformational transitions. In past work we developed a coarse-grained geometry-based method that was able to trace large-scale conformational motions in proteins using residues between secondary structure elements as hinges, and a simple yet effective energy function. In this work we apply a rigidity-analysis tool to determine the rigid and flexible regions in protein structures, since hinges may not always lie on loops between secondary structure elements. This method allows for better accuracy in determining the rotational degrees of freedom of the proteins. We conducted a multi-resolution search scheme, as both C-α and backbone representations are used for sampling the protein conformational paths. Characteristic conformations detected by clustering the paths are converted to full-atom protein structures and minimized to detect interesting intermediate conformations that may correspond to transition states or other events. Our algorithm was able to run efficiently on proteins of various sizes, and the results agree with experimentally determined intermediate protein structures.
|
|
|
A Combined Molecular Dynamics, Rigidity Analysis Approach for Studying Protein Complexes |
| |
Brian Orndorff,
Filip Jagodzinski
|
|
Pages: 793 |
|
doi>10.1145/2506583.2506711 |
|
Full text: PDF
|
|
Proteins form complexes when they bind to other molecules, which is often accompanied by a conformation change in one or both interacting partners. Details of how a compound associates with a target protein can be used to better design medicines that therapeutically regulate disease-causing proteins. Experimental and computational techniques for studying the binding process are available; however, many of them are time- and money-intensive, or are computationally expensive, and hence cannot be applied to a large dataset. In this work, we present a hybrid, computationally efficient approach for studying the stability of a protein complex. We use short Molecular Dynamics (MD) simulations to generate a small ensemble of protein-complex conformations, whose flexibility we then analyze using an efficient graph-theoretic method implemented in the KINARI software. For our dataset of proteins, we show that our combined MD-rigidity analysis approach provides information about the stability of the protein complex that would not be attained by either of the two methods alone.
|
|
|
WORKSHOP SESSION: Workshop ECF |
|
|
|
|
Relating mammalian replication program to large-scale chromatin folding |
| |
B. Audit,
A. Baker,
R. E. Boulos,
H. Julienne,
A. Arneodo,
C. L. Chen,
Y. d'Aubenton-Carafa,
C. Thermes,
A. Goldar,
G. Guilbaud,
A. Rappailles,
O. Hyrien
|
|
Pages: 799 |
|
doi>10.1145/2506583.2506699 |
|
Full text: PDF
|
|
We review the existence of a new type of megabase-sized replication domains along the human genome. These domains are revealed in 7 somatic cell types by U-shaped patterns in the replication timing profiles. In the germline, these domains appear as N-shaped patterns in the DNA compositional asymmetry profiles resulting from replication-associated mutational asymmetries. We demonstrated that the average replication fork polarity is directly proportional to both the DNA compositional asymmetry and the derivative of the replication timing profile. Hence, the average fork polarity changes in a linear manner across U/N-replication domains, highlighting a robust mode of replication across cell types and during evolution. Using genome-wide chromatin conformation data, we found that the replication domains remarkably coincide with self-interacting folding units of the chromatin fiber and that their borders are long-range interconnected hubs in the chromatin interaction graph. Altogether our results suggest that the spatio-temporal replication program is intimately coupled to a high-order 3D organization of the human genome.
|
|
|
Chromatin structure fully determines replication timing program in human cells |
| |
Sven Bilke,
Yevgeniy Gindin,
Paul S. Meltzer
|
|
Pages: 811 |
|
doi>10.1145/2506583.2506700 |
|
Full text: PDF
|
|
DNA replication is a tightly regulated process that follows a strict, yet poorly understood, temporal program [12, 9]. This timing program is intricately linked to many aspects of cell biology [1], it is cell type specific [11, 6] and altered in cancer cells [4, 5]. Although on the genome scale DNA replication appears as a highly orchestrated process, at the level of individual initiation events it is found to be stochastic [3, 8]. The mechanisms controlling global DNA replication timing remain largely unknown. Recently, stochastic DNA replication models, where global timing emerges from the collective action of unregulated initiation events, have been proposed [7, 2] yet there is still no quantitative model that can explain complex timing patterns observed in metazoan genomes. Contributing to the dearth of such models is the incomplete characterization of replication initiation sites in these genomes. We show that this issue does not prevent building a successful DNA replication timing model because we find that (a) the replication timing program is so robust that knowledge of exact firing probabilities is unnecessary and (b) high efficiency replicators are sufficiently localized by a specific chromatin mark. We arrive at these conclusions based on simulations generated by a simple mechanistic model and comparisons to experimental timing data [6]. The input to our model is an "Initiation Probability Landscape" (IPLS), a mathematical construct representing the location of proposed initiation sites. We find that even a simplistic IPLS, based on positions of transcription start sites, produces a remarkably accurate replication timing prediction (r=0.75 prediction vs. experiment). In principle, any genomic dataset can be used to define an IPLS and we systematically tried all ENCODE datasets [10], performed simulations and ranked the resulting predictions according to their agreement with empirical data. A number of chromatin marks dominate the top of this ranking, but only a single mark remains fully predictive after reducing the mutual interdependence between the top contenders.
|
|
|
Unsupervised pattern discovery in human chromatin structure through genomic segmentation |
| |
Michael M. Hoffman,
Orion J. Buske,
Jie Wang,
Zhiping Weng,
Jeff A. Bilmes,
William Stafford Noble
|
|
Pages: 813 |
|
doi>10.1145/2506583.2506701 |
|
Full text: PDF
|
|
Sequence census methods like ChIP-seq now produce an unprecedented amount of genome-anchored data. We have developed an integrative method to identify patterns from multiple experiments simultaneously while taking full advantage of high-resolution data, discovering joint patterns across different assay types. We apply this method to ENCODE chromatin data for the human chronic myeloid leukemia cell line K562, including ChIP-seq data on covalent histone modifications and transcription factor binding, and DNase-seq and FAIRE-seq readouts of open chromatin. In an unsupervised fashion, we identify patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. The method yields a model which elucidates the relationship between assay observations and functional elements in the genome. This model identifies sequences likely to affect transcription, and we verify these predictions in laboratory experiments. We have made software and an integrative genome browser track freely available (noble.gs.washington.edu/proj/segway/).

WORKSHOP SESSION: Workshop ICIW

iAtheroSim: atherosclerosis process simulator on smart devices
Francesco Pappalardo,
Marzio Pennisi,
Ferdinando Chiacchio,
Salvatore Musumeci,
Santo Motta
Pages: 815
doi>10.1145/2506583.2512356
Atherosclerosis is the major cause of heart attacks, some strokes, aneurysms, and peripheral artery disease. Arteries are a well-organized system that provides the organs and tissues of the body with the blood needed to extract sufficient amounts of nutrients and oxygen. Today, however, we are exposed to high levels of potentially toxic lipids and other factors. This may lead to the development of inflammation, scarring, and disruption of the tunica intima, whose sub-endothelial layer is the first affected in the process of atherogenesis. For these reasons, any increased awareness of potential atherosclerosis risk is a great improvement over current healthcare. We present iAtheroSim, an easy-to-use iOS application that gives the general public access to the SimAthero simulator, which predicts atherogenesis risk taking into account personal bio-physiological indicators. This may, at the least, convince people to schedule an appointment with a medical doctor in order to better evaluate their atherogenesis risk and to take countermeasures before the disease process causes severe complications. This is one of the first examples showing that modern biological approaches based on computational modeling can reach the general public through widely diffused platforms such as smartphones. The iAtheroSim application is freely available on the Apple Store.

Characterizing Amino Acid Variations of Scavenger Receptors by Class Information Gain
En-Shiun Annie Lee,
Fiona J. Whelan,
Dawn M. E. Bowdish,
Andrew K. C. Wong
Pages: 818
doi>10.1145/2506583.2512357
Conserved amino acids in sequences, which may be discovered as patterns across or along sequences, reveal functional domains within proteins. Conversely, less conserved amino acid sequences reveal areas of evolutionary divergence. Traditional protein classification trains patterns using pre-defined class labels (i.e. information about the input sequences such as gene name or family) in order to predict the class of novel sequences. However, these supervised algorithms may be inherently biased by such class dependent techniques. Therefore, we have created an unsupervised algorithm that is not affected by the inherent errors or class balance biases in the class labels. Our algorithm first discovers statistically significant sequence patterns, then aligns and clusters them into Aligned Pattern Clusters (APCs), which represent conserved amino acid sequences. APCs reveal sequence patterns (horizontal regions of amino acid homology), regions of conservation (vertical regions of amino acid homology), and regions of divergence (areas of vertical amino acid variation) within families of proteins. Finally, the algorithm verifies the results using two measures -- class entropy and class information gain -- both of which incorporate the class labels. The advantage of our method is that it does not require any a priori knowledge of a protein's structure or function. We applied our unsupervised algorithm to the class A Scavenger Receptor (cA-SR) protein family consisting of two distinct but related proteins, MARCO and SRAI. Using MARCO and SRAI as the class labels, we applied our class measures, class entropy and information gain. We found that class entropy revealed conservation of patterns and amino acids between sequences from all classes. The class information gain indicated which of these amino acids were found distinct to the MARCO class or the SRAI class, which allowed us to make important predictions as to the differing biological functions of these proteins.

Biomarkers in Immunology: from Concepts to Applications
Ping Zhang,
Lou Chitkushev,
Vladimir Brusic,
Guang Lan Zhang
Pages: 826
doi>10.1145/2506583.2512358
In this paper, we summarized the challenges and promises of the study of immune biomarkers. We reviewed key concepts in biomarker discovery and discussed the framework for applying these concepts in the study of the immune system and its effects on disease: cancer, infection, allergy, immunodeficiencies, and autoimmunity. The immune system plays a special role in biomarker discovery since it interacts with all other systems in the human body, and immune biomarkers are relevant for a large number of diseases.

Landscape of neutralizing assessment of monoclonal antibodies against dengue virus
Jing Sun,
Guang Lan Zhang,
Lars Rønn Olsen,
Ellis L. Reinherz,
Vladimir Brusic
Pages: 836
doi>10.1145/2506583.2512359
The majority of antibody binding sites (B-cell epitopes) on antigens are discontinuous. The binding between antigen and antibody is specific, but in some cases, the antibody elicited by one antigen will show cross-reactivity against other antigens. We have developed a bioinformatics-based approach for the analysis of sequence variability of neutralizing antibody binding sites and the assessment of coverage by individual neutralizing antibodies. The antigenic analysis of functional sites on the envelope (E) protein from dengue virus has been used as a case study. The description of B-cell epitopes, measurement of epitope similarity among different strains, and estimation of antibody neutralizing coverage provide insights into antibody cross-reactivity. We have defined a generalized method for the analysis of cross-reactivity of neutralizing antibodies that is also applicable to the analysis of other pathogens. This method adds to the toolset available for the characterization and the design of broadly neutralizing vaccines.

HPVdb: a Data Mining System for Knowledge Discovery in Human Papillomavirus with Applications in T cell Immunology and Vaccinology
Guang Lan Zhang,
Angelika Riemer,
Derin B. Keskin,
Lou Chitkushev,
Ellis L. Reinherz,
Vladimir Brusic
Pages: 843
doi>10.1145/2506583.2512360
High-risk human papillomaviruses (HPV) are the causes of many cancers, including cervical, anal, vulvar, vaginal, penile and oropharyngeal. To facilitate diagnosis, prognosis, and characterization of these cancers, we constructed the Human Papillomavirus T cell Antigen Database (HPVdb). It contains 2865 curated antigen entries of antigenic proteins derived from 18 genotypes of high-risk HPV and 18 genotypes of low-risk HPV. HPVdb also catalogs 96 verified T cell epitopes and 45 verified HLA ligands. Primary amino acid sequences of HPV antigens were collected and annotated from UniProtKB. T cell epitopes and HLA ligands were collected from data mining of scientific literature. The data were subjected to extensive quality control (redundancy elimination, error detection, and vocabulary consolidation). A set of computational tools for in-depth analysis, such as sequence comparison using BLAST search, multiple alignments of antigens, classification of HPV types based on cancer risk, and T cell epitope/HLA ligand visualization, have been integrated in HPVdb. Predicted Class I and Class II HLA binding peptides for 15 common HLA alleles are included in this database as putative targets. HPVdb is a specialized database that integrates curated data and information with tailored analysis tools to facilitate data mining to aid rational vaccine design by discovery of vaccine targets. To the best of our knowledge, HPVdb is a unique data source providing a comprehensive list of antigen peptides in HPV. It is available at http://cvc.dfci.harvard.edu/hpv/ and http://met-hilab.bu.edu/hpvdb/.

DNA Vaccine Design for Chikungunya Virus Based On the Conserved Epitopes Derived from Structural Protein
Parvez Singh Slathia
Pages: 849
doi>10.1145/2506583.2516950
Chikungunya virus has emerged as an epidemic with widespread distribution. With no vaccine available for chikungunya, efforts are required for vaccine development. The structural polyprotein of this virus was analysed for the presence of conserved T and B cell epitopes. Using bioinformatics tools, epitope prediction was performed and the predicted epitopes were used for designing a DNA vaccine. The predicted epitopes were subjected to BLAST analysis, reverse translation, and CpG optimization to increase the efficacy of the vaccine.

Defining Functional Redundancy of Epitope Data as Potential Antigenic Cross-Reactivity
Salvador Eugenio C. Caoili
Pages: 851
doi>10.1145/2506583.2517168
In assembling epitope datasets (e.g., to develop and benchmark epitope-prediction tools), sequence redundancy among epitopes must be considered to avoid misleading results that reflect overrepresentation of functionally similar epitopes. However, potentially useful data may be needlessly discarded by excluding epitopes that share an apparently high degree of sequence similarity, as even only a single-residue difference may manifest as extreme functional dissimilarity between epitopes. The present work thus introduces a reduced epitope count in order to account for functional redundancy of epitope data (FRED) within an epitope dataset (such that FRED is quantified as the difference between the total and reduced epitope counts); this can be used, for example, to characterize the dataset instead of excluding epitopes with apparently similar sequences from it, thereby maximizing the utilization of available epitope data. For a set of epitopes (construed as epitope structures, e.g., linear peptidic sequences), functional redundancy can be expressed as a reduced epitope count r such that 1 ≤ r ≤ n where n is the total epitope count, with r = 1 and r = n respectively corresponding to extreme cases wherein every epitope is either maximally or minimally similar to every other epitope from a functional standpoint (noting that although structurally identical epitopes would be maximally similar to one another, structurally nonidentical epitopes might also be regarded as maximally similar if their structural difference were negligible or otherwise irrelevant from a functional standpoint). 
For simplicity, the approach presented herein focuses on cases wherein all epitopes are structurally unique (which is typical of real-world epitope databases insofar as they regard each epitope as a structurally unique entity) and are peptidic sequences of equal length (which is clearly applicable to typical class-I MHC-restricted T-cell epitopes and relatively short linear B-cell epitopes, although generalization to longer B-cell epitopes and class-II MHC-restricted T-cell epitopes would entail additional considerations).

WORKSHOP SESSION: Workshop IWBNA

Identifying Pathway Proteins in Networks using Convergence
Kathryn Dempsey,
Hesham Ali
Pages: 853
doi>10.1145/2506583.2506695
One of the key goals of systems biology concerns the analysis of experimental biological data available to the scientific public. New technologies are rapidly developed to observe and report whole-scale biological phenomena; however, few methods exist with the ability to produce specific, testable hypotheses from this noisy 'big' data. In this work, we propose an approach that combines the power of data-driven network theory with knowledge-based ontology to tackle this problem. Network models are especially powerful due to their ability to display elements of interest and their relationships as inter-network structures. Additionally, ontological data supplements the confidence of relationships within the model without clouding critical structure identification. As such, we postulate that given a (gene/protein) marker set of interest, we can systematically identify the core of their interactions (if they are indeed working together toward a biological function), via elimination of original markers and addition of new necessary markers. This concept, which we refer to as "convergence," harnesses the idea of "guilt-by-association" and recursion to identify whether a core of relationships exists between markers. In this study, we test graph-theoretic concepts such as shortest-path, k-Nearest-Neighbor, and clustering to identify cores iteratively in data- and knowledge-based networks in the canonical yeast Pheromone Mating Response pathway. Additionally, we provide results for convergence application in virus infection, hearing loss, and Parkinson's disease. Our results indicate that if a marker set has a common discrete function, this approach is able to identify that function, its interacting markers, and any new elements necessary to complete the structural core of that function. The results below show that the shortest-path function is the best of the approaches used, finding small target sets that contain a majority or all of the markers in the gold-standard pathway. The power of this approach lies in its ability to be used in investigative studies to inform decisions concerning target selection.

A Neural-network Algorithm for All k Shortest Paths Problem
Kun Zhao,
Abdoul Sylla
Pages: 861
doi>10.1145/2506583.2506696
One of the fundamental computations for network analysis is to calculate the shortest path (SP) and all k shortest paths (KSP) between two nodes. Finding the SP and KSP in a large graph is not trivial, since the computation time increases as the number of nodes and edges increases. A recent neural network algorithm for calculating the SP showed the advantage of depending not on the number of nodes and edges but on the topology of the graph. Reasonable performance of the algorithm was reported. However, this algorithm is limited to the SP problem. This paper reports the progress of extending the neural network algorithm to solve the KSP problem. How to apply the new KSP algorithm to a whole-genome sequencing problem is also discussed.

Revealing Protein Structures by Co-Occurrence Clustering of Aligned Pattern Clusters
Sanderz Fung,
En-Shiun Annie Lee,
Andrew K.C. Wong
Pages: 869
doi>10.1145/2506583.2506697
Proteins can be represented in several ways, including the primary protein sequence, where the protein is represented as a string of amino acids, and the three-dimensional structure, into which the sequence folds. By analyzing proteins from the same protein family, we can find conserved protein regions that are common within that family, gaining biological knowledge. Compared to the number of available protein three-dimensional structures, protein sequences are abundant, which makes analyzing protein sequences to find characteristics of their three-dimensional structures crucial. Through sequence pattern discovery and alignment, statistically significant sequence patterns in protein families are found and represented as Aligned Pattern Clusters (APCs). When two or more APCs frequently occur together on the same protein, this implies that together they have an important relationship in the protein. A co-occurrence score is used to quantify such relationships between APCs, which are then used to cluster APCs into APC clusters. The purpose of this paper is to examine the validity of the proposed method by applying it to two protein families, triosephosphate isomerase and G-alpha. The results are then verified using three-dimensional structures, to examine both whether they comply with the structure and how often they agree with different known structures. The results for both protein families largely comply with the known structures, and their APCs were close in three-dimensional distance. We found three characteristics that are common in the resulting APC clusters from both sets of protein data: the APC cluster forming a complete graph, the APC cluster having a high co-occurrence score, and the APC cluster containing APCs with more than one pattern. Furthermore, our method and results are currently being verified on important proteins crystallized by an immunology lab.

Comparative analysis of network algorithms to address modularity with gene expression temporal data
Suhaib Mohammed
Pages: 876
doi>10.1145/2506583.2506698
In recent years, hierarchical networks have received comparatively little attention in exploring microarray gene expression data, although the hierarchical modularity of biological networks has been demonstrated. We compare three network algorithms for the study of complex biological network modularity: RedeR, weighted correlation network analysis (WGCNA), and statistical inference of modular networks (SIMoNe). Our main contributions in this work include a filtering process, which filters out non-differentially expressed genes, and a novel score for performance measurement. We show in this paper how the performance of the algorithms can be improved using this filtering process.

WORKSHOP SESSION: Workshop ParBio

MEGADOCK-GPU: Acceleration of Protein-Protein Docking Calculation on GPUs
Takehiro Shimoda,
Takashi Ishida,
Shuji Suzuki,
Masahito Ohue,
Yutaka Akiyama
Pages: 883
doi>10.1145/2506583.2506693
Protein-protein docking is a method for predicting the protein complex structure from monomeric protein structures. Because available protein structural information has increased and the application field has expanded to more difficult problems such as interactome prediction, faster protein-protein docking methods are in high demand. MEGADOCK is fast protein-protein docking software, but further acceleration is needed for interactome prediction, which involves millions of protein pairs. In this paper, we developed ultra-fast protein-protein docking software named MEGADOCK-GPU using general-purpose GPU computing techniques. We implemented a system that utilizes all CPU cores and GPUs in a computation node. As a result, MEGADOCK-GPU on 12 CPU cores and 3 GPUs achieved a calculation speed 37.0 times faster than MEGADOCK on 1 CPU core. The new docking software will facilitate the application of docking techniques to large-scale protein interaction network analyses. MEGADOCK-GPU is freely available at http://www.bi.cs.titech.ac.jp/megadock/gpu/.

pXAlign: A parallel implementation of XAlign
Aditi A. Magikar,
John A. Springer
Pages: 890
doi>10.1145/2506583.2506694
Proteomics involves the assessment of a large number of protein molecules, and mass spectrometry is a proteomic tool used to assess these molecules. As an example, the Proteome Discovery Pipeline at the Purdue University Bindley Bioscience Center carries out data processing and discovery of proteins using mass spectrometry-based proteomics. Each stage of the Proteome Discovery Pipeline performs a different computational task, and currently each stage of the pipeline is executed serially. The XAlign stage of the pipeline performs data processing and alignment of protein peaks across different samples. The XAlign stage deals with vast amounts of data, which can make it a data-processing bottleneck in the pipeline; moreover, the serial nature of the XAlign code can cause additional bottlenecks. Using the common parallelization techniques MPI and OpenMP, our work introduces parallelism into the XAlign code to investigate possible performance improvements, and as a result we found a notable speedup of the XAlign software.

SESSION: Health Informatics Symposium

GaitTrack: Health Monitoring of Body Motion from Spatio-Temporal Parameters of Simple Smart Phones
Qian Cheng,
Joshua Juen,
Yanen Li,
Valentin Prieto-Centurion,
Jerry A. Krishnan,
Bruce R. Schatz
Pages: 897
doi>10.1145/2506583.2512362
Detecting abnormal health is an important issue for mobile health, especially for chronic diseases. We present a free-living health monitoring system based on simple standalone smart phones, which can accurately compute walking speed. This phone app can be used to validate the status of a major chronic condition, Chronic Obstructive Pulmonary Disease (COPD), by estimating the gait speed of actual patients. We first show that smart phone sensors are as accurate for monitoring gait as expensive medical accelerometers. We then propose a new method of computing human body motion to estimate gait speed from the spatio-temporal gait parameters generated by regular phone sensors. The raw sensor data is processed in both the time and frequency domains and pruned by a smoothing algorithm to eliminate noise. After that, eight gait parameters are selected as the input vector of a support vector regression model to estimate gait speed. For trained subjects, the overall root mean square error of absolute gait speed is <0.088 m/s, and the error rate is <6.11%. We designed GaitTrack, a free-living health monitor which runs on Android smart phones and integrates known activity recognition and position adjustment technology. The GaitTrack system enables the phone to be carried normally for health monitoring by transforming carried spatio-temporal motion into stable human body motion, with energy-saving sensor control for continuous tracking. We present validation by monitoring COPD patients during timed walk tests and healthy subjects during free-living walking. We show that COPD patients can be detected by spatio-temporal motion, and abnormal health status of healthy subjects can be detected by personalized trained models with accuracy >84%.

Aggregating Personal Health Messages for Scalable Comparative Effectiveness Research
Jason H.D. Cho,
Vera Q.Z. Liao,
Yunliang Jiang,
Bruce R. Schatz
Pages: 907
doi>10.1145/2506583.2512363
Comparative Effectiveness Research (CER) is defined as the generation and synthesis of evidence that compares the benefits and harms of different prevention and treatment methods. This is becoming an important field in informing health care providers about the best treatment for individual patients. Currently, the two major approaches to conducting CER are observational studies and randomized clinical trials. These approaches, however, often suffer from either scalability or cost issues. In this paper, we propose a third approach to conducting CER by utilizing online personal health messages, e.g., comments on online medical forums. The approach is effective in resolving the scalability and cost issues, enabling rapid deployment of systems to identify treatments of interest and to develop hypotheses for formal CER studies. Moreover, by utilizing the demographic information of the patients, this approach may provide valuable results on the preferences of different demographic groups. Demographic information is extracted using our high-precision automated demographic extraction algorithm. This approach is capable of extracting more than 30% of users' age and gender information. We conducted CER by utilizing personal health messages on breast cancer and heart disease. We were able to generate statistically valid results, many of which have already been validated by clinical trials. Others could become hypotheses to be tested in future CER research.

Locating Discharge Medications in Natural Language Summaries
Simon Diemert,
Morgan Price,
Jens H. Weber
Pages: 917
doi>10.1145/2506583.2512364
The extraction of information from clinical narrative is one of the major applications of natural language processing in health care. Among the different kinds of use cases for such methods, the extraction of medications from textual summaries has been studied extensively. In this paper, we focus on a slightly modified problem of specifically locating and extracting discharge medications, as opposed to other medications mentioned in discharge summaries. We present a new algorithm for addressing the problem of locating discharge medications, along with experimental results and an overall description of how the NLP method can be integrated in an EMR-based clinical workflow.

Evidence of a Pathway of Reduction in Bacteria: Reduced Quantities of Restriction Sites Impact tRNA Activity in a Trial Set
Oliver Bonham-Carter,
Lotfollah Najjar,
Dhundy Bastola
Pages: 926
doi>10.1145/2506583.2512365
Occurring naturally along the genomes of many viruses and other pathogens, short palindromic restriction sites (<14 bp) are often exploited by bacterial restriction enzymes as autoimmune defenses to end pathogen threats. These motifs may also appear in the host's genome, where they are methylated so as not to attract restriction enzymes to the host's genetic material. Since these motifs in the host's genome may pose a significant danger, it is likely that their numbers have been reduced due to possible failures of methylation during evolutionary time. These palindromes are composed of bases likely containing information relating to codons used for protein translation. If palindromes are reduced in the genome, then the sequence composition making up the codons may also be found in reduced quantities. Furthermore, during translation, codons are associated with tRNAs for protein fabrication, which may also occur in reduced numbers. We suggest a pathway of reduction that can be followed from the onset of these missing palindromes to the reduction (or absence) of specific tRNAs correlated to the codons from the palindromes. To create evidence for this pathway, we studied the bacterial genomes of Bacillus subtilis, Escherichia coli, Haemophilus influenzae, Methanococcus jannaschii, Mycoplasma genitalium, Synechocystis sp. and Marchantia polymorpha. Across these organisms, we applied statistical data from reduced palindromic populations (biological and non-relevant words) to regression models and performed an analysis of genomic tRNA presence from their compositions. We illustrate a pathway of reduction that extends from palindromes to tRNAs, which may follow from evolutionary pressures concerning restriction-site handling.

Mobility Patterns of Doctors Using Electronic Health Records on iPads
Allan C. Lin,
Meng-Hsiu Chang,
Mike Y. Chen,
Travis Yu,
Lih-ching Chou,
Dian-Je Tsai,
Jackey Wang
Pages: 933
doi>10.1145/2506583.2512366
Before Electronic Health Records (EHRs) were available on touch-panel tablets, doctors were confined to accessing the records on their hospital's computer stations, in their offices or at nurse stations. We deployed Dr. Pad, a mobile EHR application on the iPad, to resident doctors at the Taipei Veterans General Hospital in Taipei, Taiwan. We are able to extract direct usage and motion data from a large-scale in-the-wild use of a mobile EHR by 179 resident doctors over 4 weeks. Using machine-learning techniques, we can predict the doctors' mobile behaviors while using Dr. Pad, which were previously unobserved and mainly self-reported. Our data revealed trends in the doctors' use of the mobile EHR, which supported claims by doctors on their usage habits, our observations of their work routines, and even showed that the doctors used Dr. Pad more frequently than we had expected.
|
|
|
Modeling Incidental Findings in Radiology Records |
| |
Eamon Johnson,
W. Christopher Baughman,
Gultekin Ozsoyoglu
|
|
Pages: 940 |
|
doi>10.1145/2506583.2512367 |
|
Full text: PDF
|
|
Information loss can occur between radiologists and patients with regard to incidental findings (unexpected or uncertain results) in the interpretation of an image. When a healthcare provider fails to inform a patient of a potential medical issue, quality of care is decreased and medical-legal issues arise. We discuss issues in modeling incidental findings in clinical records, examine available machine learning inputs, and propose a clinical text analysis system using weighted syntactic matching and user feedback learning. To demonstrate that our proposal would support better quality of care at lower cost than prior process-based solutions, we evaluate a prototype system on a gold-standard set of 580 records, yielding 82% sensitivity and 92% specificity, as compared with 43% sensitivity and 100% specificity for an existing manual review process.
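The sensitivity and specificity figures reported above come from standard confusion-matrix arithmetic. A minimal sketch, with illustrative counts chosen only to reproduce 82%/92% (the paper's actual per-record tallies on the 580-record gold standard are not given here):

```python
def sensitivity(tp, fn):
    """True-positive rate: fraction of records with findings that were flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: fraction of records without findings correctly passed over."""
    return tn / (tn + fp)

# Illustrative counts: 41 true positives, 9 false negatives,
# 46 true negatives, 4 false positives.
print(f"sensitivity = {sensitivity(41, 9):.2f}")   # -> sensitivity = 0.82
print(f"specificity = {specificity(46, 4):.2f}")   # -> specificity = 0.92
```

The comparison in the abstract (43% sensitivity, 100% specificity for manual review) shows the usual trade-off: the manual process never flags a clean record, but misses most records that do contain incidental findings.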
|
|
|
Enforcing Minimum Necessary Access in Healthcare Through Integrated Audit and Access Control |
| |
Paul Martin,
Aviel D. Rubin,
Rafae Bhatti
|
|
Pages: 946 |
|
doi>10.1145/2506583.2512368 |
|
Full text: PDF
|
|
One of the most important requirements of HIPAA is the "minimum-necessary" access requirement, which states that healthcare personnel must be granted no more access to electronic healthcare data than is necessary in order to work effectively. Due to the complexity of constructing such a policy, many hospitals do not comply with the regulation and instead manually audit the logs when they suspect that abuse has occurred. This audit-only approach is error-prone and difficult due to the volume of data contained in the logs. To address this problem, we have built a policy engine capable of automatically auditing logs and separating normal accesses from abnormal accesses. Our policy engine implicitly constructs role-based policies from the audit data in order to produce a workable policy that can be used to enforce minimum-necessary access. The policy engine can also audit an existing role-based access policy by comparing it to observed accesses in order to determine whether the existing policy is overpermissive compared to actual usage patterns.
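The core idea the abstract describes, inferring per-role access baselines from logs and flagging accesses outside them, can be sketched as follows. The field names, log shape, and exact-set baseline are assumptions for illustration; they are not the paper's actual policy engine.

```python
from collections import defaultdict

def build_role_baseline(training_log):
    """Map each role to the set of resource types its members accessed in training."""
    baseline = defaultdict(set)
    for entry in training_log:
        baseline[entry["role"]].add(entry["resource_type"])
    return baseline

def flag_abnormal(log, baseline):
    """Return log entries whose resource type was never observed for that role."""
    return [e for e in log
            if e["resource_type"] not in baseline.get(e["role"], set())]

training = [
    {"role": "nurse", "resource_type": "vitals"},
    {"role": "nurse", "resource_type": "med_orders"},
    {"role": "billing", "resource_type": "invoices"},
]
audit = [
    {"role": "nurse", "resource_type": "vitals"},
    {"role": "billing", "resource_type": "med_orders"},  # outside billing's baseline
]
print(flag_abnormal(audit, build_role_baseline(training)))
# -> [{'role': 'billing', 'resource_type': 'med_orders'}]
```

The same baseline could be compared against an existing role-based policy: any permission in the policy that never appears in the observed baseline is a candidate over-permission, mirroring the overpermissiveness audit the abstract mentions.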
|