Preprints

Pancreatic ductal adenocarcinoma (PDAC) is a rapidly progressing cancer that responds poorly to immunotherapies. Intratumoral tertiary lymphoid structures (TLS) have been associated with rare long-term PDAC survivors, but the role of TLS in PDAC and their spatial relationships within the context of the broader tumor microenvironment remain unknown. We generated a spatial multi-omics atlas encompassing 26 PDAC tumors from patients treated with combination immunotherapies. Using machine learning-enabled H&E image classification models and unsupervised gene expression matrix factorization methods for spatial transcriptomics, we characterized cellular states within TLS niches spanning distinct morphologies and immunotherapies. Unsupervised learning generated a TLS-specific spatial gene expression signature that significantly associates with improved survival in PDAC patients. These analyses demonstrate TLS-associated intratumoral B cell maturation in pathological responders, confirmed with spatial proteomics and BCR profiling. Our study also identifies spatial features of pathologic immune responses, revealing TLS maturation colocalizing with IgG/IgA distribution and extracellular matrix remodeling.
Highlights:
- Integrated multi-modal spatial profiling of human PDAC tumors from neoadjuvant immunotherapy clinical trials reveals diverse spatial niches enriched in TLS.
- TLS maturity is influenced by tumor location and the cellular neighborhoods in which TLS immune cells are recruited.
- Unsupervised machine learning of genome-wide signatures on spatial transcriptomics data characterizes the TLS-enriched TME and associates TLS transcriptomes with survival outcomes in PDAC.
- Interactions of spatially variable gene expression patterns showed TLS maturation is coupled with immunoglobulin distribution and ECM remodeling in pathologic responders.
- Intratumoral plasma cell and immunoglobulin gene expression spatial dynamics demonstrate trafficking of TLS-driven humoral immunity in the PDAC TME.
Significance: We report a spatial multi-omics atlas of PDAC tumors from a series of immunotherapy neoadjuvant clinical trials. Intratumorally, pathologic responders exhibit mature TLS that propagate plasma cells into malignant niches. Our findings offer insights on the role of TLS-associated humoral immunity and stromal remodeling during immunotherapy treatment.

10.1101/2024.09.22.613714

Analyzing multi-sample spatial transcriptomics data requires accounting for biological variation. We present multi-sample non-negative spatial factorization (mNSF), an alignment-free framework extending single-sample spatial factorization (NSF) to multi-sample datasets. mNSF incorporates sample-specific spatial correlation modeling and extracts low-dimensional data representations. Through simulations and real data analysis, we demonstrate mNSF's efficacy in identifying true factors, shared anatomical regions, and region-specific biological functions. mNSF's performance is comparable to alignment-based methods when alignment is feasible, while enabling analysis in scenarios where spatial alignment is infeasible. mNSF shows promise as a robust method for analyzing spatially resolved transcriptomics data across multiple samples.

10.1101/2024.07.01.599554

Vast quantities of multi-omic data have been produced to characterize the development and diversity of cell types in the cerebral cortex of humans and other mammals. To more fully harness the collective discovery potential of these data, we have assembled gene-level transcriptomic data from 188 published studies of neocortical development, including the transcriptomes of ~30 million single cells, extensive spatial transcriptomic experiments and RNA sequencing of sorted cells and bulk tissues: nemoanalytics.org/landing/neocortex. Applying joint matrix decomposition (SJD) to mouse, macaque and human data in this collection, we defined transcriptome dynamics that are conserved across mammalian neurogenesis and which elucidate the evolution of outer, or basal, radial glial cells. Decomposition of adult human neocortical data identified layer-specific signatures in mature neurons and, in combination with transfer learning methods in NeMO Analytics, enabled the charting of their early developmental emergence and protracted maturation across years of postnatal life. Interrogation of data from cerebral organoids demonstrated that while broad molecular elements of in vivo development are recapitulated in vitro, many layer-specific transcriptomic programs in neuronal maturation are absent. We invite computational biologists and cell biologists without coding expertise to use NeMO Analytics in their research and to fuel it with emerging data (carlocolantuoni.org).

10.1101/2024.02.26.581612

Cells are fundamental units of life, constantly interacting and evolving as dynamical systems. While recent spatial multi-omics can quantitate individual cells' characteristics and regulatory programs, forecasting their evolution ultimately requires mathematical modeling. We develop a conceptual framework--a cell behavior hypothesis grammar--that uses natural language statements (cell rules) to create mathematical models. This allows us to systematically integrate biological knowledge and multi-omics data to make them computable. We can then perform virtual "thought experiments" that challenge and extend our understanding of multicellular systems, and ultimately generate new testable hypotheses. In this paper, we motivate and describe the grammar, provide a reference implementation, and demonstrate its potential through a series of examples in tumor biology and immunotherapy. Altogether, this approach provides a bridge between biological, clinical, and systems biology researchers for mathematical modeling of biological systems at scale, allowing the community to extrapolate from single-cell characterization to emergent multicellular behavior.

10.1101/2023.09.17.557982

Single cell transcriptomics technologies can uncover changes in the molecular states that underlie cellular phenotypes. However, understanding dynamic cellular processes requires moving beyond inferring trajectories from snapshots of cellular states to estimating temporal changes in cellular gene expression. To address this challenge, we have developed a neural ordinary differential equation-based method, RNAForecaster, for predicting gene expression states in single cells for multiple future time steps in an embedding-independent manner. We demonstrate that RNAForecaster can accurately predict future expression states in simulated single cell transcriptomic data with cellular tracking over time. We then show that using metabolic labeling scRNA-seq data from constitutively dividing cells, RNAForecaster accurately recapitulates many of the expected changes in gene expression during progression through the cell cycle over a three-day period. Thus, RNAForecaster enables short-term estimation of future expression states in biological systems from high-throughput datasets with temporal information.

10.1101/2022.08.04.502825
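
The forecasting idea in the abstract above can be illustrated with a minimal sketch: given a function that returns the rate of change of each gene's expression, future states are obtained by numerically integrating it forward in time. The sketch below uses forward Euler with a hand-written toy derivative; in a neural ODE the derivative would instead be a trained network, and an adaptive solver would typically replace Euler. The names `toy_derivative` and `forecast`, the two-gene system, and all constants are assumptions for illustration and are not taken from RNAForecaster.

```python
def toy_derivative(state):
    """Hypothetical stand-in for a learned derivative function.

    Gene 0 decays; gene 1 is produced in proportion to gene 0 and decays slowly.
    """
    g0, g1 = state
    return [-0.5 * g0, 0.5 * g0 - 0.1 * g1]


def forecast(state, derivative, n_steps, dt=0.1):
    """Integrate d(state)/dt = derivative(state) with forward Euler.

    Returns the list of visited states, starting with the initial one.
    """
    trajectory = [list(state)]
    for _ in range(n_steps):
        current = trajectory[-1]
        rates = derivative(current)
        trajectory.append([x + dt * r for x, r in zip(current, rates)])
    return trajectory


# Forecast 20 steps ahead from an initial two-gene expression state.
trajectory = forecast([10.0, 0.0], toy_derivative, n_steps=20)
```

Each entry of `trajectory` is the predicted expression state one time step after the previous entry, so multi-step forecasts come from repeatedly feeding the prediction back into the derivative function.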

Pancreatic ductal adenocarcinoma (PDAC) is an aggressive malignancy characterized by a heterogeneous tumor microenvironment (TME) that is enriched with cancer-associated fibroblasts (CAFs) [1]. Cell-cell interactions involving these CAFs promote an immunosuppressive phenotype with altered inflammatory gene expression. While single-cell transcriptomics provides a tool to dissect the complex intercellular pathways that regulate cancer-associated inflammation in human tumors, complementary experimental systems for mechanistic validation remain limited. This study integrated single-cell data from human tumors and novel organoid co-cultures to study the PDAC TME. We derived a comprehensive atlas of PDAC gene expression from six published human single-cell RNA sequencing (scRNA-seq) datasets [2-7] to characterize intercellular signaling pathways between epithelial tumor cells and CAFs that regulate the inflammatory TME. Analysis of the epithelial cell compartment identified global gene expression pathways that modulate inflammatory signaling and are correlated with CAF composition. We then generated patient-derived organoid-CAF co-cultures to serve as a biological model of the cellular interactions learned from human tissue in the atlas. Transfer learning analysis to additional scRNA-seq data of this co-culture system and mechanistic experiments confirmed the epithelial response to fibroblast signaling. This bidirectional approach of complementary computational and in vitro applications provides a framework for future studies identifying important mechanisms of intercellular interactions in PDAC.

10.1101/2022.07.14.500096

Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. Still, inferring biological processes requires additional post hoc statistics and annotation for interpretation of features learned from software packages developed for NMF implementation. Here, we aim to introduce a suite of computational tools that implement NMF and provide methods for accurate, clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations, and open questions in the field is followed by three vignettes for the Bayesian NMF algorithm CoGAPS (Coordinated Gene Activity across Pattern Subsets). Each vignette will demonstrate NMF analysis to quantify cell state transitions in public domain single-cell RNA-sequencing (scRNA-seq) data of malignant epithelial cells in 25 pancreatic ductal adenocarcinoma (PDAC) tumors and 11 control samples. The first uses PyCoGAPS, our new Python interface for CoGAPS that we developed to enhance runtime of Bayesian NMF for large datasets. The second vignette steps through the same analysis using our R CoGAPS interface, and the third introduces two new cloud-based, plug-and-play options for running CoGAPS using GenePattern Notebook and Docker. By providing Python support, cloud-based computing options, and relevant example workflows, we facilitate user-friendly interpretation and implementation of NMF for single-cell analyses.

10.1101/2022.07.09.499398
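
To make the factorization discussed in the abstract above concrete, here is a minimal sketch of plain NMF using the standard Lee-Seung multiplicative updates for squared error. It is deliberately not CoGAPS, which is a Bayesian algorithm, and it does not use the PyCoGAPS or R CoGAPS interfaces; the function names, the toy matrix, and the parameter defaults are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (genes x cells) into w (genes x k) and h (k x cells).

    Plain NMF with multiplicative updates; not the Bayesian sampler used by CoGAPS.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def squared_error(v, w, h):
    """Sum of squared differences between v and its reconstruction w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))


# Toy matrix: 4 genes x 3 cells, built from two non-negative patterns.
toy = [[1.0, 0.0, 2.0],
       [2.0, 0.0, 4.0],
       [0.0, 3.0, 3.0],
       [0.0, 1.0, 1.0]]
w, h = nmf(toy, k=2)
```

The columns of `w` play the role of gene weights for each learned pattern and the rows of `h` give each pattern's activity across cells; interpreting those patterns biologically is the post hoc step the abstract refers to.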

RNA velocity analysis of single cells promises to predict temporal dynamics from gene expression. Indeed, in many systems, it has been observed that RNA velocity produces a vector field that qualitatively reflects known features of the system. Despite this observation, the limitations of RNA velocity estimates are poorly understood. Using real data and simulations, we dissect the impact of different steps in the RNA velocity workflow on the estimated vector field. We find that the process of mapping RNA velocity estimates into a low-dimensional representation, such as those produced by UMAP, has a large impact on the result. The RNA velocity vector field strongly depends on the k-NN graph of the data. This dependence leads to significant estimator errors when the k-NN graph is not a faithful representation of the true data structure, a feature that cannot be known for most real datasets. Finally, we establish that RNA velocity estimates expression speed neither at the gene nor cellular level. We propose that RNA velocity is best considered a smoothed interpolation of the observed k-NN structure, as opposed to an extrapolation of future cellular states, and that the use of RNA velocity as a validation of latent space embedding structures is circular.

10.1101/2022.06.19.494717

Trans-differentiation of human induced pluripotent stem cells into neurons via Ngn2-induction (hiPSC-N) has become an efficient system to quickly generate neurons for disease modeling and in vitro assay development, a significant advance from previously used neoplastic and other cell lines. Recent single-cell interrogation of Ngn2-induced neurons, however, has revealed some similarities to unexpected neuronal lineages. Similarly, a straightforward method to generate hiPSC-derived astrocytes (hiPSC-A) for the study of neuropsychiatric disorders has also been described. Here we examine the homogeneity and similarity of hiPSC-N and hiPSC-A to their in vivo counterparts, the impact of different lengths of time post Ngn2 induction on hiPSC-N (15 or 21 days) and of hiPSC-N / hiPSC-A co-culture. Leveraging the wealth of existing public single-cell RNA-seq (scRNA-seq) data in Ngn2-induced neurons and in vivo data from the developing brain, we provide perspectives on the lineage origins and maturation of hiPSC-N and hiPSC-A. While induction protocols in different labs produce consistent cell type profiles, both hiPSC-N and hiPSC-A show significant heterogeneity and similarity to multiple in vivo cell fates, and both more precisely approximate their in vivo counterparts when co-cultured. Gene expression data from the hiPSC-N show enrichment of genes linked to schizophrenia (SZ) and autism spectrum disorders (ASD) as has been previously shown for neural stem cells and neurons. These overrepresentations of disease genes are strongest in our system at early times (day 15) in Ngn2-induction/maturation of neurons, when we also observe the greatest similarity to early in vivo excitatory neurons. We have assembled these new scRNA-seq data along with the public data explored here as an integrated biologist-friendly web-resource for researchers seeking to understand this system more deeply: nemoanalytics.org/p?l=DasEtAlNGN2&g=PRPH.

10.1101/2022.06.15.495952

Recent advances in spatial transcriptomics (ST) enable gene expression measurements from a tissue sample while retaining its spatial context. This technology enables unprecedented in situ resolution of the regulatory pathways that underlie the heterogeneity in the tumor and its microenvironment (TME). The direct characterization of cellular co-localization with spatial technologies facilitates quantification of the molecular changes resulting from direct cell-cell interaction, as occurs in tumor-immune interactions. We present SpaceMarkers, a novel bioinformatics algorithm to infer molecular changes from cell-cell interaction from latent space analysis of ST data. We apply this approach to infer molecular changes from tumor-immune interactions in Visium spatial transcriptomics data of metastasis, invasive and precursor lesions, and immunotherapy treatment. Transfer learning in matched scRNA-seq data enabled further quantification of the specific cell types in which SpaceMarkers are enriched. Altogether, SpaceMarkers can identify the location and context-specific molecular interactions within the TME from ST data.

10.1101/2022.06.02.490672

Summary: Integrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high throughput biological data being generated. In particular, such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.

10.1101/2021.03.22.435728

Background: The cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle with high resolution from single-cell RNA-seq data. Results: Here, we present tricycle, an R/Bioconductor package, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the use of transfer learning. We estimate a cell cycle embedding using a fixed reference dataset and project new data into this reference embedding, an approach that overcomes key limitations of learning a dataset-dependent embedding. Tricycle then predicts a cell-specific position in the cell cycle based on the data projection. The accuracy of tricycle compares favorably to gold-standard experimental assays, which generally require specialized measurements in specifically constructed in vitro systems. Using internal controls which are available for any dataset, we show that tricycle predictions generalize to datasets with multiple cell types, across tissues, species and even sequencing assays. Conclusions: Tricycle generalizes across datasets, is highly scalable and applicable to atlas-level single-cell RNA-seq data.

10.1101/2021.04.06.438463
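
The abstract above describes projecting new data into a fixed reference embedding and reading a cell-cycle position off that projection. A minimal sketch of that idea follows: a cell's (already centered) expression values are projected onto two reference loading vectors, and the polar angle of the resulting point is taken as the position on the cycle. The loadings, gene names, and function names are made-up assumptions for illustration; this is not the tricycle package's API or its reference data.

```python
import math

# Hypothetical reference loadings (gene -> weights on two reference axes).
# The values are invented for illustration only.
REFERENCE_LOADINGS = {
    "geneA": (0.8, 0.1),
    "geneB": (-0.2, 0.9),
    "geneC": (-0.6, -0.4),
}


def project(expression, loadings=REFERENCE_LOADINGS):
    """Project one cell onto the two fixed reference axes.

    `expression` maps gene names to centered expression values; genes missing
    from either the cell or the reference are ignored.
    """
    shared = [g for g in loadings if g in expression]
    axis1 = sum(expression[g] * loadings[g][0] for g in shared)
    axis2 = sum(expression[g] * loadings[g][1] for g in shared)
    return axis1, axis2


def cycle_position(expression):
    """Return the cell's position on the cycle as an angle in [0, 2*pi)."""
    axis1, axis2 = project(expression)
    return math.atan2(axis2, axis1) % (2 * math.pi)


# A cell dominated by geneB lands in the second quadrant of the embedding.
theta = cycle_position({"geneA": 0.0, "geneB": 1.0, "geneC": 0.0})
```

Because the loadings are fixed in advance, the angle assigned to a cell does not depend on which other cells are in the new dataset, which is the property the abstract contrasts with dataset-dependent embeddings.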

Latent space techniques have emerged as powerful tools to identify genes and gene sets responsible for cell-type and species-specific differences in single-cell data. Transfer learning methods can compare learned latent spaces across biological systems. However, the robustness that comes from leveraging information across multiple genes in transfer learning is often attained at the sacrifice of gene-wise precision. Thus, methods are needed to identify genes, defined as important within a particular latent space, that significantly differ between contexts. To address this challenge, we have developed a new framework, scProject, and a new metric, projectionDrivers, to quantitatively examine latent space usage across single-cell experimental systems while concurrently extracting the genes driving the differential usage of the latent space between defined contrasts. Here, we demonstrate the efficacy, utility, and scalability of scProject with projectionDrivers and provide experimental validation for predicted species-specific differences between the developing mouse and human retina.

10.1101/2021.08.25.457650

Variability between human pluripotent stem cell (hPSC) lines remains a challenge and opportunity in biomedicine. We identified differences in the early lineage emergence across hPSC lines that mapped on the antero-posterior axis of embryonic development. RNA-seq analysis revealed dynamic transcriptomic patterns that defined the emergence of mesendodermal versus neuroectodermal lineages conserved across hPSC lines and cell line-specific transcriptional signatures that were invariant across differentiation. The stable cell line-specific transcriptomic patterns predicted the retinoic acid (RA) response of the cell lines, resulting in distinct bias towards fore- versus hind-brain fates. Replicate hPSC lines and paired adult donor tissue demonstrated that cells from individual humans expressed unique and long-lasting transcriptomic signatures associated with evolutionarily recent genes. In addition to this genetic contribution, we found that replicate lines from a single donor showed divergent brain regional fates linked to distinct chromatin states, indicating that epigenetic mechanisms also contribute to neural fate differences. This variation in lineage bias and its correlation with RA responsive gene expression was also observed in a large collection of hPSC lines. These results define transcriptomic differences in hPSCs that initiate a critical early step specifying anterior or posterior neural fates.

10.1101/2021.03.17.435870

Parallel processing circuits are thought to dramatically expand the network capabilities of the nervous system. Magnocellular and parvocellular oxytocin neurons have been proposed to subserve two parallel streams of social information processing, which allow a single molecule to encode a diverse array of ethologically distinct behaviors, although to date direct evidence to support this hypothesis is lacking. Here we provide the first comprehensive characterization of magnocellular and parvocellular oxytocin neurons, validated across anatomical, projection target, electrophysiological, and transcriptional criteria. We next used novel multiple feature selection tools in Fmr1 KO mice to provide direct evidence that normal functioning of the parvocellular but not magnocellular oxytocin pathway is required for autism-relevant social reward behavior. Finally, we demonstrate that autism risk genes are uniquely enriched in parvocellular oxytocin neurons. Taken together these results provide the first evidence that oxytocin pathway specific pathogenic mechanisms account for social impairments across a broad range of autism etiologies. One Sentence Summary: Pathoclisis of parvocellular oxytocin neurons plays an important role in the pathogenesis of social impairments in autism.

10.1101/2020.03.13.990549

Better understanding the progression of neural stem cells (NSCs) in the developing cerebral cortex is important for modeling neurogenesis and defining the pathogenesis of neuropsychiatric disorders. Here we used RNA-sequencing, cell imaging and lineage tracing of mouse and human in vitro NSCs to model the generation of cortical neuronal fates. We show that conserved signaling mechanisms regulate the acute transition from proliferative NSCs to committed glutamatergic excitatory neurons. As human telencephalic NSCs developed from pluripotency in vitro, they first transitioned through organizer states that spatially pattern the cortex before generating glutamatergic precursor fates. NSCs derived from multiple human pluripotent lines varied in these early patterning states leading differentially to dorsal or ventral telencephalic fates. This work furthers systematic analysis of the earliest patterning events that generate the major neuronal trajectories of the human telencephalon.

10.1101/577544

The development of single-cell RNA-Sequencing (scRNA-Seq) has allowed high resolution analysis of cell type diversity and transcriptional networks controlling cell fate specification. To identify the transcriptional networks governing human retinal development, we performed scRNA-Seq over retinal organoid and in vivo retinal development, across 20 timepoints. Using both pseudotemporal and cross-species analyses, we examined the conservation of gene expression across retinal progenitor maturation and specification of all seven major retinal cell types. Furthermore, we examined gene expression differences between developing macula and periphery and between two distinct populations of horizontal cells. We also identify both shared and species-specific patterns of gene expression during human and mouse retinal development. Finally, we identify an unexpected role for ATOH7 expression in regulation of photoreceptor specification during late retinogenesis. These results provide a roadmap to future studies of human retinal development, and may help guide the design of cell-based therapies for treating retinal dystrophies.

10.1101/779694

Motivation: Dimension reduction techniques are widely used to interpret high-dimensional biological data. Features learned from these methods are used to discover both technical artifacts and novel biological phenomena. Such feature discovery is critically important to large single-cell datasets, where lack of a ground truth limits validation and interpretation. Transfer learning (TL) can be used to relate the features learned from one source dataset to a new target dataset to perform biologically-driven validation by evaluating their use in or association with additional sample annotations in that independent target dataset. Results: We developed an R/Bioconductor package, projectR, to perform TL for analyses of genomics data via TL of clustering, correlation, and factorization methods. We then demonstrate the utility of TL for integrated data analysis with an example for spatial single-cell analysis. Availability: projectR is available on Bioconductor and at https://github.com/genesofeve/projectR. Contact: gsteinobrien@jhmi.edu; ejfertig@jhmi.edu

10.1101/726547
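
As an illustration of relating features learned on a source dataset to a new target dataset, the sketch below estimates how strongly each learned feature is used in one target sample by ordinary least squares against the source feature loadings. This is one simple way to realize the idea and is offered as an assumption, not as the projectR implementation or its API; the function names and the toy numbers are hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def project_sample(loadings, sample):
    """Least-squares weight of each learned feature in one target sample.

    loadings: genes x features matrix learned on the source dataset.
    sample:   expression of the same genes, in the same order, in a target sample.
    """
    k = len(loadings[0])
    # Normal equations: (L^T L) weights = L^T sample
    gram = [[sum(row[i] * row[j] for row in loadings) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * s for row, s in zip(loadings, sample)) for i in range(k)]
    return solve(gram, rhs)


# Toy source loadings (4 genes x 2 features) and one target sample.
source_loadings = [[1.0, 0.0],
                   [2.0, 0.0],
                   [0.0, 3.0],
                   [0.0, 1.0]]
target_sample = [2.0, 4.0, 1.5, 0.5]
weights = project_sample(source_loadings, target_sample)
```

The resulting weights can then be compared against sample annotations in the target dataset, which is the kind of validation the abstract describes.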

Bioinformatics techniques to analyze time course bulk and single cell omics data are advancing. The absence of a known ground truth of the dynamics of molecular changes challenges benchmarking their performance on real data. Realistic simulated time-course datasets are essential to assess the performance of time course bioinformatics algorithms. We develop an R/Bioconductor package, CancerInSilico, to simulate bulk and single cell transcriptional data from a known ground truth obtained from mathematical models of cellular systems. This package contains a general R infrastructure for running cell-based models and simulating gene expression data based on the model states. We show how to use this package to simulate a gene expression data set and consequently benchmark analysis methods on this data set with a known ground truth. The package is freely available via Bioconductor: http://bioconductor.org/packages/CancerInSilico/

10.1101/328807

Tumor heterogeneity provides a complex challenge to cancer treatment and is a critical component of therapeutic response, disease recurrence, and patient survival. Single-cell RNA-sequencing (scRNA-seq) technologies reveal the prevalence of intra- and inter-tumor heterogeneity. Computational techniques are essential to quantify the differences in variation of these profiles between distinct cell types, tumor subtypes, and patients to fully characterize intra- and inter-tumor molecular heterogeneity. We devised a new algorithm, Expression Variation Analysis in Single Cells (EVAsc), to perform multivariate statistical analyses of differential variation of expression in gene sets for scRNA-seq. EVAsc has high sensitivity and specificity to detect pathways with true differential heterogeneity in simulated data. We then apply EVAsc to several public domain scRNA-seq tumor datasets to quantify the landscape of tumor heterogeneity in several key applications in cancer genomics, i.e. immunogenicity, cancer subtypes, and metastasis. Immune pathway heterogeneity in hematopoietic cell populations in breast tumors corresponded to the amount of diversity present in the T-cell repertoire of each individual. In head and neck squamous cell carcinoma (HNSCC) patients, we found dramatic differences in pathway dysregulation across basal primary tumors. Within the basal primary tumors we also identified increased immune dysregulation in individuals with a high proportion of fibroblasts present in the tumor microenvironment. Moreover, cells in HNSCC primary tumors had significantly more heterogeneity across pathways than cells in metastases, consistent with a model of clonal outgrowth. These results demonstrate the broad utility of EVAsc to quantify inter- and intra-tumor heterogeneity from scRNA-seq data without reliance on low dimensional visualization.

10.1101/479287

New approaches are urgently needed to glean biological insights from the vast amounts of single cell RNA sequencing (scRNA-Seq) data now being generated. To this end, we propose that cell identity should map to a reduced set of factors which will describe both exclusive and shared biology of individual cells, and that the dimensions which contain these factors reflect biologically meaningful relationships across different platforms, tissues and species. To find a robust set of dependent factors in large-scale scRNA-Seq data, we developed a Bayesian non-negative matrix factorization (NMF) algorithm, scCoGAPS. Application of scCoGAPS to scRNA-Seq data obtained over the course of mouse retinal development identified gene expression signatures for factors associated with specific cell types and continuous biological processes. To test whether these signatures are shared across diverse cellular contexts, we developed projectR to map biologically disparate datasets into the factors learned by scCoGAPS. Because projecting these dimensions preserves relative distances between samples, biologically meaningful relationships/factors will stratify new data consistent with their underlying processes, allowing labels or information from one dataset to be used for annotation of the other--a machine learning concept called transfer learning. Using projectR, data from multiple datasets were used to annotate latent spaces and reveal novel parallels between developmental programs in other tissues, species and cellular assays. Using this approach we are able to transfer cell type and state designations across datasets to rapidly annotate cellular features in a new dataset without a priori knowledge of their type, identify a species-specific signature of microglial cells, and identify a previously undescribed subpopulation of neurosecretory cells within the lung. Together, these algorithms define biologically meaningful dimensions of cellular identity, state, and trajectories that persist across technologies, molecular features, and species.

10.1101/395004

Precise temporal control of gene expression in neuronal progenitors is necessary for correct regulation of neurogenesis and cell fate specification. However, the extensive cellular heterogeneity of the developing CNS has posed a major obstacle to identifying the gene regulatory networks that control these processes. To address this, we used single cell RNA-sequencing to profile ten developmental stages encompassing the full course of retinal neurogenesis. This allowed us to comprehensively characterize changes in gene expression that occur during initiation of neurogenesis, changes in developmental competence, and specification and differentiation of each of the major retinal cell types. These data identify transitions in gene expression between early and late-stage retinal progenitors, as well as a classification of neurogenic progenitors. We identify here the NFI family of transcription factors (Nfia, Nfib, and Nfix) as genes with enriched expression within late RPCs, and show they are regulators of bipolar interneuron and Muller glia specification and the control of proliferative quiescence.

10.1101/378950

Omics data contains signal from the molecular, physical, and kinetic inter- and intra-cellular interactions that control biological systems. Matrix factorization techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in topics ranging from pathway discovery to time course analysis. We review exemplary applications of matrix factorization for systems-level analyses. We discuss appropriate application of these methods, their limitations, and focus on analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with matrix factorization enables discovery from high-throughput data beyond the limits of current biological knowledge--answering questions from high-dimensional data that we have not yet thought to ask.

10.1101/196915
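
The matrix factorization idea summarized in the review abstract above can be illustrated with a minimal, dependency-free sketch. It factors a small non-negative genes-by-samples matrix into two low-rank non-negative factors using the standard Lee-Seung multiplicative updates for squared error. This is an assumption-laden illustration, not the method of any listed preprint: the names `nmf`, `amplitude`, `pattern`, and the toy matrix are hypothetical, and it is not the Bayesian MCMC approach used by CoGAPS.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(x):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]


def nmf(data, rank, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix as amplitude @ pattern.

    Illustrative Lee-Seung multiplicative updates (squared-error objective).
    amplitude is genes x rank, pattern is rank x samples (hypothetical names).
    """
    rng = random.Random(seed)
    n_genes, n_samples = len(data), len(data[0])
    # Strictly positive random start so the updates are well defined.
    amplitude = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_genes)]
    pattern = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(rank)]
    for _ in range(n_iter):
        # pattern <- pattern * (A^T D) / (A^T A P), element-wise
        at = transpose(amplitude)
        num = matmul(at, data)
        den = matmul(matmul(at, amplitude), pattern)
        pattern = [[p * n / (d + eps) for p, n, d in zip(pr, nr, dr)]
                   for pr, nr, dr in zip(pattern, num, den)]
        # amplitude <- amplitude * (D P^T) / (A P P^T), element-wise
        pt = transpose(pattern)
        num = matmul(data, pt)
        den = matmul(amplitude, matmul(pattern, pt))
        amplitude = [[a * n / (d + eps) for a, n, d in zip(ar, nr, dr)]
                     for ar, nr, dr in zip(amplitude, num, den)]
    return amplitude, pattern


def squared_error(data, amplitude, pattern):
    """Sum of squared differences between data and its reconstruction."""
    approx = matmul(amplitude, pattern)
    return sum((d - a) ** 2 for dr, ar in zip(data, approx) for d, a in zip(dr, ar))


# Toy matrix (made up): two gene groups active in two disjoint sample groups.
toy = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 4.0, 3.0],
]
amplitude, pattern = nmf(toy, rank=2)
error = squared_error(toy, amplitude, pattern)
print(f"reconstruction error: {error:.3f}")
```

In this toy setting, each row of `pattern` would be read as a sample-level activity profile and each column of `amplitude` as the gene weights for that profile. Real omics analyses would normally rely on an established package and on careful choice of rank and initialization.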

BACKGROUND: Targeted therapies specifically act by blocking the activity of proteins that are encoded by genes critical for tumorigenesis. However, most cancers acquire resistance and long-term disease remission is rarely observed. Understanding the time course of molecular changes responsible for the development of acquired resistance could enable optimization of patients' treatment options. Clinically, acquired therapeutic resistance can only be studied at a single time point in resistant tumors. To determine the dynamics of these molecular changes, we obtained high-throughput omics data weekly during the development of cetuximab resistance in a head and neck cancer in vitro model.

RESULTS: An unsupervised algorithm, CoGAPS, was used to quantify the evolving transcriptional and epigenetic changes. Applying a PatternMarker statistic to the results from CoGAPS enabled novel heatmap-based visualization of the dynamics in these time course omics data. We demonstrate that transcriptional changes result from immediate therapeutic response or resistance, whereas epigenetic alterations only occur with resistance. Integrated analysis demonstrates delayed onset of changes in DNA methylation relative to transcription, suggesting that resistance is stabilized epigenetically.

CONCLUSIONS: Genes with epigenetic alterations associated with resistance that have concordant expression changes are hypothesized to stabilize resistance. These genes include FGFR1, which was associated with EGFR inhibitor resistance previously. Thus, integrated omics analysis distinguishes the timing of molecular drivers of resistance. Our findings provide a relevant step towards better understanding of the time course progression of changes resulting in acquired resistance to targeted therapies. This is an important contribution to the development of alternative treatment strategies that would introduce new drugs before the resistant phenotype develops.

10.1101/136564

Cancer is a complex disease, driven by aberrant activity in numerous signaling pathways in even individual malignant cells. Epigenetic changes are critical mediators of these functional changes that drive and maintain the malignant phenotype. Changes in DNA methylation, histone acetylation and methylation, non-coding RNAs, and post-translational modifications are all epigenetic drivers in cancer, independent of changes in the DNA sequence. These epigenetic alterations, once thought to be crucial only for maintenance of the malignant phenotype, are now recognized as also critical for disrupting essential pathways that protect the cells from uncontrolled growth, longer survival and establishment in distant sites from the original tissue. In this review, we focus on DNA methylation and chromatin structure in cancer. While these alterations are associated with cancer, their precise functional role is an area of active research using emerging high-throughput approaches and bioinformatics analysis tools. Therefore, this review describes these high-throughput measurement technologies, public domain databases for high-throughput epigenetic data in tumors and model systems, and bioinformatics algorithms for their analysis. Advances in bioinformatics data integration techniques that combine these epigenetic data with genomics data are essential to infer the function of specific epigenetic alterations in cancer, and are therefore also a focus of this review. Future studies using these emerging technologies will elucidate how alterations in the cancer epigenome cooperate with genetic aberrations to cause tumor initiation and progression. This deeper understanding is essential to future studies that will precisely infer patients' prognosis and select patients who will be responsive to emerging epigenetic therapies.

10.1101/114025

Summary: Non-negative Matrix Factorization (NMF) algorithms associate gene expression with biological processes (e.g., time-course dynamics or disease subtypes). Compared with univariate associations, the relative weights of NMF solutions can obscure biomarkers. Therefore, we developed a novel PatternMarkers statistic to extract genes for biological validation and enhanced visualization of NMF results. Finding novel and unbiased gene markers with PatternMarkers requires whole-genome data. However, NMF algorithms typically do not converge for the tens of thousands of genes in genome-wide profiling. Therefore, we also developed Genome-Wide CoGAPS Analysis in Parallel Sets (GWCoGAPS), the first robust whole-genome Bayesian NMF using the sparse MCMC algorithm CoGAPS. This software contains analytic and visualization tools, including a Shiny web application, patternMatcher, which are generalized for any NMF. Using these tools, we find granular brain-region and cell-type specific signatures with corresponding biomarkers in GTEx data, illustrating GWCoGAPS and patternMarkers ascertainment of data-driven biomarkers from whole-genome data.

Availability: PatternMarkers & GWCoGAPS are in the CoGAPS Bioconductor package (3.5) under the GPL license.

Contact: gsteinobrien@jhmi.edu; ccolantu@jhmi.edu; ejfertig@jhmi.edu

10.1101/083717