Network regularized sparse non-negative tri-matrix factorization for pathway identification. The binary mutation matrix X is factorized into the product of three matrices: patient cluster U, pathway information V, and the cancer-type and pathway association S. Prior knowledge is introduced from gene–gene interaction networks A.

About

NTriPath is a method that integrates somatic mutations with prior biological knowledge (e.g., protein-protein interaction networks, pathway databases) to detect cancer-type-specific pathways altered by somatic mutations across cancers.

NOTE: Algorithm Table 1 in the paper omits a normalization step that follows the update of S and V in each iteration. Our multiplicative update rules have many desirable properties (e.g., convergence and correctness, proved in the Supplementary Information), but they yield a highly sparse solution (i.e., almost all elements of both S and V become nearly zero after the iterations) when the input matrix X comes from mutation data, which is itself considerably sparse. To prevent this, we added a normalization step to the algorithm: each column of V is normalized to sum to 1, and each column of S is scaled accordingly. Because of this modification, we can no longer guarantee that the objective function value decreases monotonically over iterations. However, the method (including both the update rules and the normalization step) still worked well in our additional experiments: we tested it on the same problem as in the paper but with different pathway databases (KEGG, BioCarta, or Reactome), and on simulated datasets as in the Supplementary Information. For the experiment on TCGA data in the paper, the objective function value increased in the first few iterations but eventually decreased over subsequent iterations.
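The normalization step described in the note can be sketched in NumPy as follows. This is a minimal illustration, not the released NTriPath code: the function name normalize_columns, the eps guard, and the matrix shapes (U as patients x cancer types, S as cancer types x pathways, V as genes x pathways, so that X is approximated by U S V^T) are assumptions made for the example.

```python
import numpy as np

def normalize_columns(S, V, eps=1e-12):
    """Rescale V so each column sums to 1 and move the scale into S.

    Assumed shapes (not stated on this page): V is genes x pathways and
    S is cancer types x pathways, so X is approximated by U @ S @ V.T.
    """
    col_sums = V.sum(axis=0)                             # one scale factor per pathway
    col_sums = np.where(col_sums > eps, col_sums, 1.0)   # leave all-zero columns untouched
    V_norm = V / col_sums                                # columns of V now sum to 1
    S_scaled = S * col_sums                              # matching columns of S absorb the scale
    return S_scaled, V_norm

# Toy matrices with random non-negative entries.
rng = np.random.default_rng(0)
U = rng.random((6, 2))   # patients x cancer types
S = rng.random((2, 3))   # cancer types x pathways
V = rng.random((5, 3))   # genes x pathways

S2, V2 = normalize_columns(S, V)
```

Under these assumptions the rescaling leaves the product U S V^T unchanged, so it alters the factors without altering the reconstruction of X. It says nothing about the regularization terms of the objective, which is consistent with the note's caveat that monotonic decrease is no longer guaranteed.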

Download 🡇

Workflow

Four types of data were used as input for our algorithm:

First, we generated a binary matrix X of patients x genes, with ‘1’ indicating a mutation and ‘0’ no mutation.
Second, we constructed gene-gene interaction networks A.
Third, we incorporated a pathway database V_0 (e.g., 4,620 subnetworks conserved across species [12]).
Fourth, we included clinical data on each patient's tumor type U.
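As a rough illustration of the first and fourth inputs, the sketch below builds a binary patients x genes matrix X and a tumor-type indicator matrix U from toy records. The patient IDs, gene lists, and the one-hot encoding of U are assumptions for the example; the page does not specify the file formats or the exact encoding NTriPath expects.

```python
import numpy as np

# Hypothetical toy records; real inputs would come from mutation call files.
mutations = [("P1", "TP53"), ("P1", "KRAS"), ("P2", "TP53"), ("P3", "PTEN")]
tumor_type = {"P1": "LUAD", "P2": "BRCA", "P3": "BRCA"}

patients = sorted(tumor_type)
genes = sorted({g for _, g in mutations})
types = sorted(set(tumor_type.values()))

p_idx = {p: i for i, p in enumerate(patients)}
g_idx = {g: j for j, g in enumerate(genes)}
t_idx = {t: k for k, t in enumerate(types)}

# X: patients x genes, 1 where the patient carries a mutation in the gene.
X = np.zeros((len(patients), len(genes)), dtype=int)
for p, g in mutations:
    X[p_idx[p], g_idx[g]] = 1

# U: patients x cancer types, assumed here to be a one-hot tumor-type indicator.
U = np.zeros((len(patients), len(types)), dtype=int)
for p, t in tumor_type.items():
    U[p_idx[p], t_idx[t]] = 1
```

Keeping explicit index dictionaries makes it easy to map rows and columns of the factor matrices back to patient, gene, and cancer-type names later.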

NTriPath produces two matrices as output:

Altered pathways by mutated genes V
Altered pathways by cancer type matrix S

The use of both large-scale somatic mutation profiles and gene-gene interaction networks enabled NTriPath to identify cancer-related pathways that contain known cancer genes mutated at different frequencies across cancers, together with member genes newly added on the basis of high network connectivity. Finally, we use the altered pathways by cancer type matrix S to identify altered pathways that are specific to each cancer type.
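One simple way to read cancer-type-specific pathways off S is to rank the pathways within each cancer type by their weight. The sketch below assumes S has one row per cancer type and one column per pathway, and uses plain descending weight as the ranking criterion; the helper top_pathways and the toy names are hypothetical, and the paper's actual ranking procedure may differ.

```python
import numpy as np

def top_pathways(S, cancer_types, pathway_names, k=3):
    """Return the k highest-weighted pathways for each cancer type (row of S)."""
    ranked = {}
    for i, ctype in enumerate(cancer_types):
        order = np.argsort(S[i])[::-1][:k]   # pathway indices by descending weight
        ranked[ctype] = [pathway_names[j] for j in order]
    return ranked

# Toy association matrix: 2 cancer types x 4 pathways.
S = np.array([[0.1, 0.7, 0.0, 0.2],
              [0.5, 0.0, 0.4, 0.1]])
result = top_pathways(S, ["BRCA", "LUAD"], ["pw1", "pw2", "pw3", "pw4"], k=2)
print(result)
```

The default k=3 mirrors the top 3 ranked pathways mentioned in the supplementary analyses on this page.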

Supplementary Data

Additional experimental results using different gene-gene interaction networks from the Human Protein Reference Database (HPRD) and from Rossin, E.J. et al., PLoS Genetics 2011.
Consensus clustering results using different numbers of top-ranked cancer-type-specific disrupted pathways.
We generated consensus clustering results using 1) only the top-ranked pathway, 2) only the 1st- and 2nd-ranked pathways, and 3) the 1st-, 2nd-, and 3rd-ranked pathways (included in the manuscript). Overall, each of these three settings consistently produced patient groups with significant differences in clinical outcome.
Consensus clustering results using updated TCGA patient datasets (downloaded May 9th 2015 through cbioportal.org).
We generated consensus clustering results by applying the top 3 ranked pathways (used in our original study) to the updated TCGA patient datasets, which contain data for 3,607 TCGA patients with survival information. Please read "summary.txt" first.
Individual gene analysis based on mutation (e.g., patients with a mutation in gene A vs. wild type) and gene expression for each cancer type.
We generated Kaplan-Meier (KM) survival plots for each member gene of the top-ranked cancer-type-specific altered pathways, based on its mutation status and its gene expression. The "TCGA_Individual_gene_mutation_KMplots" folder contains the single-gene analysis based on mutation status: in each KM plot, the red subgroup represents patients with a mutation in the corresponding gene and the black subgroup represents patients without one. The "TCGA_Individual_gene_GE_KMplots" folder contains the single-gene analysis based on expression: in each KM plot, the red subgroup represents patients who overexpress the corresponding gene. We divided patients into two subgroups based on the 25th and 75th percentiles of the gene's expression, and on its median expression.
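The expression-based subgrouping can be sketched as below. The page does not say exactly how the thresholds were applied, so this example assumes two alternative splits: a quartile split (patients at or below the 25th percentile vs. at or above the 75th percentile) and a median split (above vs. at or below the median). The function name split_by_expression and the toy values are hypothetical.

```python
import numpy as np

def split_by_expression(expr, low_q=25, high_q=75):
    """Return boolean masks for quartile-based and median-based subgroups."""
    expr = np.asarray(expr, dtype=float)
    lo, hi = np.percentile(expr, [low_q, high_q])
    quartile_low = expr <= lo               # low-expression subgroup (quartile split)
    quartile_high = expr >= hi              # high-expression subgroup (quartile split)
    median_high = expr > np.median(expr)    # high-expression subgroup (median split)
    return quartile_low, quartile_high, median_high

# Toy expression values for one gene across eight patients.
expr = [1, 2, 3, 4, 5, 6, 7, 8]
q_low, q_high, m_high = split_by_expression(expr)
```

The resulting masks could then be passed to any survival analysis package to draw one KM curve per subgroup.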

Citation

Sunho Park, Seung-Jun Kim, Donghyeon Yu, Samuel Pena-Llopis, Jianjiong Gao, Jin Suk Park, Beibei Chen, Jessie Norris, Xinlei Wang, Min Chen, Minsoo Kim, Jeongsik Yong, Zabi Wardak, Kevin Choe, Michael Story, Timothy Starr, Jae-Ho Cheong and Tae Hyun Hwang. An integrative somatic mutation analysis to identify pathways linked with survival outcomes across 19 cancer types. Bioinformatics, 2016 Jun 1; 32(11): 1643-1651. doi: 10.1093/bioinformatics/btv692

Contact

Please contact parks@[no-spam]ccf.org with any questions, comments, or concerns.