Automated deconvolution of structured mixtures from heterogeneous tumor genomic data

Sun Jan 01 2017

Comparative Study

Theodore Roman et al. PLoS Comput Biol. 2017.

Abstract

With increasing appreciation for the extent and importance of intratumor heterogeneity, much attention in cancer research has focused on profiling heterogeneity on a single patient level. Although true single-cell genomic technologies are rapidly improving, they remain too noisy and costly at present for population-level studies. Bulk sequencing remains the standard for population-scale tumor genomics, creating a need for computational tools to separate contributions of multiple tumor clones and assorted stromal and infiltrating cell populations to pooled genomic data. All such methods are limited to coarse approximations of only a few cell subpopulations, however. In prior work, we demonstrated the feasibility of improving cell type deconvolution by taking advantage of substructure in genomic mixtures via a strategy called simplicial complex unmixing. We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data, as well as the ability to process quantitative RNA expression data, and heterogeneous combinations of RNA and CNV data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We further demonstrate their effectiveness in identifying mixture substructure in true breast cancer CNV data from the Cancer Genome Atlas (TCGA). Source code is available at https://github.com/tedroman/WSCUnmix.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Overview of the full analysis pipeline: Input samples are represented by collections of copy number (CN) call files and/or RNA expression measurements, which are converted to a matrix format.**
These matrix inputs are passed to our simplicial complex inference code, which infers a mixed membership model of the data and the associated model likelihood. The inference is computed by (1) principal components analysis (PCA) to perform dimensionality reduction and denoising of geometric structure; (2) medoidshift pre-clustering to identify low-dimensional sub-manifolds corresponding to distinct submixtures of the data; (3) dimensionality inference via sliver estimation to estimate the likely number of mixture components needed to model each submixture; (4) unmixing on each substructure to identify preliminary mixture decompositions of the submixtures; and (5) a K-nearest-neighbor (KNN)-based reconciliation model to identify likely shared vertices between submanifolds. Each of these steps is explained in more detail in the main text. The inferred low-dimensional subspaces may be partially intersecting or non-intersecting. We require, however, that the subspaces form a continuous structure, and merge disconnected subspaces using a maximum likelihood model. The inferred mixture components are then used in downstream functional annotation to identify dysregulated pathways or term associations.
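
As a rough illustration of step (1) only, the following sketch projects a samples-by-features matrix onto its leading principal components using NumPy's SVD. It is not the authors' implementation (their code is at the repository named in the abstract); the function name `pca_reduce`, the synthetic component profiles, and the noise level are all assumptions made for this example.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X (samples x features) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                        # sample coordinates in PC space
    explained = (S[:k] ** 2) / np.sum(S ** 2)     # variance fraction per kept PC
    return scores, Vt[:k], explained

# Synthetic stand-in for real data: 200 samples that are noisy mixtures
# of 3 hypothetical component profiles over 50 features.
rng = np.random.default_rng(0)
components = rng.normal(size=(3, 50))
fractions = rng.dirichlet(np.ones(3), size=200)   # each row sums to 1
X = fractions @ components + 0.01 * rng.normal(size=(200, 50))

scores, axes, explained = pca_reduce(X, 2)
```

Mixtures of three components lie on a two-dimensional triangle, so in this toy setting two principal components capture nearly all of the variance; the later pipeline steps operate on such reduced coordinates.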

**Fig 2. Pseudocode for the merging protocol, which selects the most likely model from a set of candidate models when none of them is a simplicial complex.**

**Fig 3. Visualization of TCGA RNA-Seq data with inferred maximum likelihood simplicial complex structure.**
Note that the inferred tetrahedron was evaluated alongside other simplices and simplicial complexes and was found to be the most likely. The data are enclosed in the tetrahedron and can therefore be approximated as mixtures of its vertices. Data points, corresponding to distinct tumor samples plotted in principal component space, are color coded by immunohistological subtype (red circle: Her2+, purple plus: ER/PR+, blue asterisk: triple-negative).
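
To make the idea of approximating a sample as a mixture of vertices concrete, here is a minimal sketch that recovers mixture fractions as barycentric coordinates of a point with respect to a simplex. The function `mixture_fractions`, the unit tetrahedron, and the chosen fractions are hypothetical and serve only as an illustration; the paper's unmixing procedure also has to infer the vertices themselves, which this sketch takes as given.

```python
import numpy as np

def mixture_fractions(vertices, point):
    """Barycentric coordinates of `point` with respect to the simplex `vertices`.

    vertices: (k+1, d) array, one simplex vertex per row.
    Solves V^T f = point together with the constraint sum(f) = 1.
    """
    vertices = np.asarray(vertices, dtype=float)
    point = np.asarray(point, dtype=float)
    A = np.vstack([vertices.T, np.ones(len(vertices))])  # append sum-to-one row
    b = np.append(point, 1.0)
    f, *_ = np.linalg.lstsq(A, b, rcond=None)
    return f

# Hypothetical tetrahedron in 3-D principal component space.
tetra = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
true_f = np.array([0.1, 0.2, 0.3, 0.4])
sample = true_f @ tetra                  # a point inside the tetrahedron

est = mixture_fractions(tetra, sample)
inside = bool(np.all(est >= -1e-9))      # non-negative fractions: point is enclosed
```

A point lies inside the simplex exactly when all of its barycentric coordinates are non-negative, which is why enclosure by the tetrahedron permits a mixture interpretation.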

**Fig 4. Visualization of TCGA CNV data with inferred maximum likelihood simplicial complex structure.**
The inferred structure of three arms sharing a point corresponds to a phylogeny with one most recent common ancestor and three branches. Data points, corresponding to distinct tumor samples plotted in principal component space, are color coded by immunohistological subtype (red circle: Her2+, purple plus: ER/PR+, blue asterisk: triple-negative).

**Fig 5. Visualization of TCGA combined RNA-Seq and CNV data with inferred maximum likelihood simplicial complex structure.**
The inferred structure of a tetrahedron and triangle sharing a point corresponds to two phylogenetic branches, one with four components and one with three components. Data points, corresponding to distinct tumor samples plotted in principal component space, are color coded by immunohistological subtype (red circle: Her2+, purple plus: ER/PR+, blue asterisk: triple-negative).
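
One simple way to represent such a structure in code is as a list of simplices, each a set of vertex indices, with shared vertices found by set intersection. The sketch below is an assumed representation chosen for illustration, not the data structure used by the authors; the vertex numbering and the helper names `shared_vertices` and `is_connected` are hypothetical.

```python
from itertools import combinations

# A tetrahedron and a triangle that share vertex 3 (6 distinct vertices in total).
tetrahedron = frozenset({0, 1, 2, 3})
triangle = frozenset({3, 4, 5})
fig5_like = [tetrahedron, triangle]

def shared_vertices(simplices):
    """Map each pair of simplex indices to the vertices they have in common."""
    return {
        (i, j): simplices[i] & simplices[j]
        for i, j in combinations(range(len(simplices)), 2)
        if simplices[i] & simplices[j]
    }

def is_connected(simplices):
    """True if every simplex can be reached from the first via shared vertices."""
    if not simplices:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(simplices)):
            if j not in seen and simplices[i] & simplices[j]:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(simplices)

shared = shared_vertices(fig5_like)
n_vertices = len(frozenset().union(*fig5_like))
connected = is_connected(fig5_like)
```

The connectivity check mirrors the requirement stated in the Fig 1 caption that the inferred subspaces form a continuous structure; a candidate made of two disjoint simplices would fail it and would need merging.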

Cited by

DeCAF: a novel method to identify cell-type specific regulatory variants and their role in cancer risk.
Kalita CA, Gusev A. Genome Biol. 2022 Jul 8;23(1):152. doi: 10.1186/s13059-022-02708-9. PMID: 35804456. Free PMC article.
A graph-based algorithm for estimating clonal haplotypes of tumor sample from sequencing data.
Wang Y, Zhang X, Ding S, Geng Y, Liu J, Zhao Z, Zhang R, Xiao X, Wang J. BMC Med Genomics. 2019 Jan 31;12(Suppl 1):27. doi: 10.1186/s12920-018-0457-4. PMID: 30704456. Free PMC article.
