Publications

bioRxiv

Reproducible processing of TCGA regulatory networks

Viola Fanfani; Katherine H. Shutta; Panagiotis Mandros; Jonas Fischer; Enakshi Saha; Soel Micheletti; Chen Chen; Marouen Ben Guebila; Camila M. Lopes-Ramos; John Quackenbush

bioRxiv


Background: Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a useful framework for interrogating omics data and modeling regulatory gene and protein interactions. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods resulting in a plethora of open-source software for downloading, pre-processing, and analyzing those data. However, for an end-to-end analysis of regulatory networks a coherent and reusable workflow is essential to integrate all relevant packages into a robust pipeline.

Findings: We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omics data, such as RNA-seq and methylation, are downloaded, preprocessed, and lastly used to infer regulatory network models with the netZoo software tools. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here we show how the pipeline can be used to study the differences between colon cancer subtypes that could be explained by epigenetic mechanisms. Lastly, we provide pre-generated networks for the 10 most common cancer types that can be readily accessed.

Conclusions: tcga-data-nf is a complete yet flexible and extensible framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools.

OUP

Higher-order correction of persistent batch effects in correlation networks

Soel Micheletti; Daniel Schlauch; John Quackenbush; Marouen Ben Guebila

Bioinformatics


Systems biology methods often rely on correlations in gene expression profiles to infer co-expression networks, commonly used as input for gene regulatory network inference or to identify functional modules of co-expressed or co-regulated genes. While systematic biases, including batch effects, are known to induce spurious associations and confound differential gene expression analyses (DE), the impact of batch effects on gene co-expression has not been fully explored. Methods have been developed to adjust expression values, ensuring conditional independence of mean and variance from batch or other covariates for each gene. These adjustments have been shown to improve the fidelity of DE analysis. However, these methods do not address the potential for spurious differential co-expression (DC) between groups. Consequently, uncorrected, artifactual DC can skew the correlation structure, leading network inference methods that use gene co-expression to identify false, nonbiological associations, even when the input data is corrected using standard batch correction. In this work, we demonstrate the persistence of confounders in covariance after standard batch correction using synthetic and real-world gene expression data examples. Subsequently, we introduce Co-expression Batch Reduction Adjustment (COBRA), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix. COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates. COBRA is computationally efficient, leveraging the inherently modular structure of genomic data to estimate accurate gene regulatory associations and facilitate functional analysis for high-dimensional genomic data.
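The idea of expressing a co-expression matrix as a function of sample covariates can be illustrated with a short NumPy sketch. This is a simplified illustration based only on the abstract, not the authors' implementation: the function names, the least-squares fit, and the toy data are all assumptions. It decomposes the correlation matrix into eigenvectors, regresses each sample's contribution to every eigenvalue on the covariates, and rebuilds a matrix for a chosen covariate profile.

```python
import numpy as np

def fit_covariate_coexpression(expression, design):
    """expression: genes x samples; design: samples x covariates (hypothetical API)."""
    n_samples = expression.shape[1]
    # Standardize each gene so that G @ G.T is the Pearson correlation matrix.
    centered = expression - expression.mean(axis=1, keepdims=True)
    G = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    corr = G @ G.T
    # Eigenvectors of the correlation matrix give a basis Q shared by all samples.
    _, Q = np.linalg.eigh(corr)
    # Each eigenvalue is a sum of per-sample contributions: sum_s proj[k, s] ** 2.
    proj = Q.T @ G
    # Least-squares regression of the scaled contributions on the covariates.
    psi = np.linalg.pinv(design) @ (n_samples * proj.T ** 2)
    return psi, Q

def coexpression_for(psi, Q, covariates):
    """Rebuild Q diag(w) Q^T with weights w predicted for one covariate profile."""
    weights = np.asarray(covariates, dtype=float) @ psi
    return (Q * weights) @ Q.T

# Toy data: 6 genes, 40 samples, an intercept plus a binary batch indicator.
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 40))
batch = np.repeat([0.0, 1.0], 20)
design = np.column_stack([np.ones(40), batch])

psi, Q = fit_covariate_coexpression(expression, design)
coexpr_batch0 = coexpression_for(psi, Q, [1.0, 0.0])
coexpr_batch1 = coexpression_for(psi, Q, [1.0, 1.0])
```

With an intercept-only design the fitted weights reduce to the eigenvalues, so the sketch reproduces the ordinary correlation matrix. Additional columns in the design let the reconstruction vary with batch or with a continuous covariate.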

RECOMB

Gene-level Inference of Regulatory effects As Factorizations of Functions of Expressions (GIRAFFE)

Soel Micheletti; Alexander Marx; Julia Vogt; John Quackenbush; Jonas Fischer; Panagiotis Mandros

Research in Computational Molecular Biology


Accurately estimating gene regulatory mechanisms is crucial to inform our understanding of important cellular processes, yielding new insights into disease progression and intervention design. Existing methods suffer from one or more issues when it comes to interpretability, scalability to the human genome, and flexibility to incorporate covariates of interest such as clinical information about the samples. For instance, state-of-the-art algorithms either do not distinguish between enhancing and inhibitory regulation, or do not scale beyond a few hundred genes. Moreover, not all existing methods are interpretable. They may incorporate complex non-linear relationships to optimize predictions, yet these are incompatible with human reasoning, a critical component for ensuring the safety, ethics, and accountability of models supporting oncology decisions. For effective gene regulatory network inference, we propose GIRAFFE, a scalable matrix factorization-based algorithm to jointly infer regulatory effects and transcription factor activities from gene expression data. GIRAFFE integrates prior knowledge about regulation to guide the optimization, yielding an interpretable linear model. Moreover, it can be customized to the requirements of the downstream application by adjusting for variables of interest, such as confounders, or by adding sparsity constraints. We demonstrate the effectiveness of this approach with extensive experiments on synthetic as well as real-world data. Our algorithm outperforms state-of-the-art gene regulatory network inference methods in predicting interactions between transcription factors and target genes. In contrast to existing work, it is able to distinguish between activating and inhibitory effects, yielding plausible results in downstream applications including gene set enrichment analysis.
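The joint inference of regulatory effects and transcription factor activities can be illustrated with a minimal factorization sketch. It assumes a simplified objective, ||G - R A||² + lam · ||R - R_prior||², solved by alternating least squares. The objective, the function name, and the parameters `lam` and `n_iter` are illustrative assumptions. The published GIRAFFE objective may differ, and this is not the authors' implementation.

```python
import numpy as np

def factorize_expression(G, R_prior, lam=1.0, n_iter=50):
    """G: genes x samples; R_prior: genes x TFs prior on regulation (hypothetical API).

    Returns signed regulatory effects R, TF activities A, and the loss per iteration.
    """
    R = R_prior.astype(float).copy()
    n_tfs = R.shape[1]
    losses = []
    for _ in range(n_iter):
        # Activities: least-squares fit of G given the current regulatory effects.
        A, *_ = np.linalg.lstsq(R, G, rcond=None)
        # Effects: ridge-like update pulled towards the prior,
        # solving R (A A^T + lam I) = G A^T + lam R_prior.
        lhs = A @ A.T + lam * np.eye(n_tfs)
        rhs = G @ A.T + lam * R_prior
        R = np.linalg.solve(lhs, rhs.T).T
        loss = np.sum((G - R @ A) ** 2) + lam * np.sum((R - R_prior) ** 2)
        losses.append(float(loss))
    return R, A, losses

# Synthetic example: 30 genes, 4 transcription factors, 25 samples.
rng = np.random.default_rng(0)
R_true = rng.choice([-1.0, 0.0, 1.0], size=(30, 4))   # signed effects
A_true = rng.normal(size=(4, 25))                      # TF activities
G = R_true @ A_true + 0.1 * rng.normal(size=(30, 25))
R_prior = R_true + 0.3 * rng.normal(size=R_true.shape)  # noisy prior knowledge

R_hat, A_hat, losses = factorize_expression(G, R_prior, lam=1.0, n_iter=50)
```

Because the model is linear, each entry of `R_hat` reads directly as a positive (activating) or negative (inhibitory) effect of a transcription factor on a gene. Each half-step minimizes the objective with the other factor held fixed, so the recorded loss does not increase across iterations.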

RECOMB

Higher-order correction of persistent batch effects in correlation networks (poster)

Soel Micheletti; Daniel Schlauch; John Quackenbush; Marouen Ben Guebila

Research in Computational Molecular Biology



JAHA

Profiling Daily Life Performance Recovery in the Early Subacute Phase After Stroke Using a Graphical Modeling Approach

Janne M. Veerbeek; Clemens Hutter; Soel Micheletti; Simone Riedi; Enrico Bianchi; Beatrice Ottiger; Noortje Maaijwee; Tim Vanbellingen; Thomas Nyffeler

Journal of the American Heart Association


Laboratory-based assessments have shown that stroke recovery is heterogeneous between patients and affected domains such as motor and language function. However, laboratory-based assessments are not ecologically valid and do not necessarily reflect patients’ daily life performance. Therefore, we aimed to give an innovative view on stroke recovery by profiling daily life performance recovery across domains in patients with early subacute stroke and determine their interrelatedness, taking stroke localization into account. Daily life performance was observed at neurorehabilitation admission and weekly thereafter until discharge, using a scale containing 7 daily life domains. Graphical modeling was applied to investigate the conditional independence between recovery of these domains depending on stroke localization. There were 592 patients analyzed. Four clusters of interrelated domains were identified within the first 6 weeks poststroke. The first cluster included recovery in learning and applying knowledge, general tasks and demands, and domestic life. The second cluster comprised recovery in self-care and general tasks and demands. The third cluster included recovery in mobility and self-care; it incorporated interpersonal interactions and relationships in left supratentorial stroke, and learning and applying knowledge in right supratentorial stroke. The final cluster included only communication recovery. Daily life recovery dynamics early poststroke show that although impairments in body functions are anatomically determined, their impact on performance is comparable. Second, some, but by no means all, domains show an interrelated recovery. Domains requiring cognitive abilities are especially interrelated and seem to be essential for concomitant recovery in mobility and domestic life.

Outreach

Calcolo di superfici con il metodo Monte Carlo

Soel Micheletti

During high school, I worked on a stochastic simulation project. In recognition of its results (Odd Fellow Award at the National Competition by Schweizer Jugend Forscht and a First Award in the CS category at the Taiwan International Science Fair 2018), a summary of our work was featured in a special edition of Il Volterriano, the journal of the Swiss-Italian Mathematics Commission.
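As a generic illustration of computing an area with the Monte Carlo method, the sketch below uses hit-or-miss sampling: draw uniform points in a bounding box and scale the box area by the fraction that falls inside the shape. The function names and the unit-disk example are assumptions chosen for illustration and are not taken from the project itself.

```python
import random

def estimate_area(inside, x_range, y_range, n_points=100_000, seed=0):
    """Estimate the area of the region where inside(x, y) is True."""
    rng = random.Random(seed)
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    # Count how many uniformly drawn points land inside the shape.
    hits = sum(
        inside(rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        for _ in range(n_points)
    )
    box_area = (x_max - x_min) * (y_max - y_min)
    return box_area * hits / n_points

def in_unit_disk(x, y):
    return x * x + y * y <= 1.0

# The unit disk has area pi, so the estimate should approach pi as n_points grows.
area = estimate_area(in_unit_disk, (-1.0, 1.0), (-1.0, 1.0))
```

The standard error of the estimate shrinks proportionally to one over the square root of the number of points, so each additional digit of accuracy costs roughly a hundred times more samples.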

