天天看點

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

寫在前面:

  • 方法:下載下傳處理TCGA資料的R包很多,資料來源也不一樣,這一部分開始對幾個包分别進行使用,寫出心得。
  • 結果最終想得到的是用其中兩個包
  • 這部分,RTCGA包
  • 參考:作者文檔和這個以及生信技能樹
---------------------------------------------------------------------------

1 安裝并加載包

# Load the bioconductor installer. 
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")
# Install the clinical and mRNA gene expression data packages
biocLite("RTCGA.clinical") ## 14Mb
biocLite('RTCGA.rnaseq') ##  (612.6 MB)
biocLite("RTCGA.mRNA") ##  (85.0 MB)
biocLite('RTCGA.mutations')  ## (103.8 MB)
library(RTCGA.clinical) 
library(RTCGA.mRNA)
library(RTCGA.rnaseq)
library(RTCGA.mutations)           

複制

> library(RTCGA)
Welcome to the RTCGA (version: 1.8.0).           

複制

#檢視BRCA的内容
> checkTCGA('DataSets', 'LIHC')
   Size                                                                                                                                     Name
1   61K                                                                                   LIHC.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
2  932K                                                                                        LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz
3  1.6G LIHC.Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data.Level_3.2016012800.0.0.tar.gz
4  1.5M                  LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2016012800.0.0.tar.gz
5   22M               LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data.Level_3.2016012800.0.0.tar.gz
6  255K                LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz
7   78K   LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz.bak.20160128
8   83M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__exon_expression__data.Level_3.2016012800.0.0.tar.gz
9  8.7M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz
10 8.0M                LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__splice_junction_expression__data.Level_3.2016012800.0.0.tar.gz
11 102M                            LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz
12  32M                 LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz
13 281M                         LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms__data.Level_3.2016012800.0.0.tar.gz
14  80M              LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data.Level_3.2016012800.0.0.tar.gz
15 905M                   LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data.Level_3.2016012800.0.0.tar.gz
16  70M               LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data.Level_3.2016012800.0.0.tar.gz
17 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg.Level_3.2016012800.0.0.tar.gz
18 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
19 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg.Level_3.2016012800.0.0.tar.gz
20 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
21 210M                                                                                LIHC.Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz
22 2.0M                                                                               LIHC.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
23 416M                                                                            LIHC.Mutation_Packager_Coverage.Level_3.2016012800.0.0.tar.gz
24  25M                                                                     LIHC.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0.tar.gz
25  47M                                                                 LIHC.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz
26 3.8M                                                                           LIHC.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
27 765M                                                                        LIHC.Mutation_Packager_Raw_Coverage.Level_3.2016012800.0.0.tar.gz
28 585K                                                                                 LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz
29 275K                                                                    LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz.bak.20160128
30 312M                                                                                    LIHC.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
31 3.4M                                                                              LIHC.miRseq_Mature_Preprocess.Level_3.2016012800.0.0.tar.gz
32 2.8M                                                                                     LIHC.miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz           

複制

關于這些資料:
  • mRNA是晶片資料
  • ranseq是測序資料

    具體參考這裡和這裡

  • miRSeq is micro-RNA seq. Micro RNAs are a class of non protein coding RNAs that have regulatory functions (they typically bind to the 3`UTR of coding mRNAs and regulate that way)
  • mRNA, in this situation, refers to a cDNA microarray, where pre-designed probes on the microarray surface will bind to known target mRNAs, which may be coding or non-coding
  • mRNASeq is what most will loosely refer to as 'RNA-seq'. Most protocols will capture all classes of RNA species that have poly-adenylated (poly(A)) tails. RNAs that don't have these tails include ribsomal RNAs and many enhancer RNAs, but there are always exceptions to these rules.

推薦(建議參考)

miRSeq
  • illuminahiseq_mirnaseq-miR_gene_expression - normalised micro-RNAseq counts over each micro-RNA
  • illuminahiseq_mirnaseq-miR_isoform_expression - nomalised micro-RNAseq counts over each splice isoform of each micro-RNA

    mRNA

  • agilent4502a_07_3-unc_lowess_normalization_gene_level - LOWESS-normalised cDNA expression values over each gene (data from University of North Carolina)

    mRNASeq

  • illuminahiseq_rnaseq-gene_expression - normalised RNAseq counts over each gene
  • illuminahiseq_rnaseq-exon_expression - normalised RNAseq counts over each exon of each gene
> checkTCGA('Dates')
 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21"
[11] "2012-04-12" "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25"
[21] "2012-09-13" "2012-10-04" "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16"
[31] "2013-02-03" "2013-02-22" "2013-03-09" "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23"
[41] "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10" "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16"
[51] "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17" "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01"
[61] "2015-08-21" "2015-11-01" "2016-01-28"           

複制

2 下載下傳某惡性良性腫瘤中Datasets中的某類資料

可以用前述的checkTCGA檢視類型,然後針對性下載下傳,用法如下:

downloadTCGA(cancerTypes, dataSet = "Merge_Clinical.Level_1", destDir, date = NULL, untarFile = TRUE, removeTar = TRUE, allDataSets = FALSE)

  • dataSet在

    checkTCGA

    中檢視,可以看出資料類型名字很長,是以這裡隻要部分(連續)比對即可
  • destDir在目前工作目錄下建立一個新檔案
setwd("E:/00TCGA/data/LIHC")
#downloading the miRNA DATA
dir.create('miRNA')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'miR_gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])

#downloading the ranseq data
dir.create('rnaseq')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'Level_3__gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])           

複制

3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)

3.0 擷取mRNA表達資料

擷取EZH2,PTEN,HGFAC,FYN,TIGD1等5個gene在BRCA,OV中的mRNA表達資料(LIHC沒有mRNA)

expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))           

複制

> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode          dataset     EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>                        <chr>      <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA -1.79   1.36  -2.31 -0.798 -0.86 
 2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA -1.57   0.428 -2.49 -0.532 -1.06 
 3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA -2.49   1.31  -2.65  0.007 -1.06 
 4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA -2.14   0.810 -2.58 -0.466 -0.714
 5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA  0.529  0.251 -2.95  0.960 -0.664
 6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA -1.44   1.31  -3.26 -0.622 -0.490
 7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -0.426 -0.237 -2.71  0.208  0.759
 8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -0.579 -1.24  -2.45  0.563  0.022
 9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA -1.16   1.21  -2.04 -0.616 -0.287
10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.894  0.288 -2.31 -0.314  1.33 
# ... with 1,295 more rows           

複制

3.1 ggplot2繪制指定基因在不同惡性良性腫瘤中的表達boxplot

expr<-expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
expr1<-expr[,-1]
expr2<-gather(expr1, key ="mRNA", value="value", -dataset)

ggplot(expr2, aes(y=value,
             x=reorder(dataset, value, mean),
             fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(mRNA~.)+
  theme(legend.position = "top")           

複制

boxplot如下

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

5genes in three kinds of cancers.jpeg

3.2 ggpubr繪制指定基因在不同惡性良性腫瘤中的表達boxplot并進行統計學分析作圖

這部分參考jimmy

檢視樣本量

table(expr$dataset)           

複制

> nb_samples
BRCA.mRNA LUSC.mRNA   OV.mRNA 
      590       154       561           

複制

把bcr_patient_barcode這列改名,以便下一步可視化作圖

expr$dataset <- gsub(pattern = ".mRNA", replacement = "",  expr$dataset)
expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
expr           

複制

> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode dataset   EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>               <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 BRCA1               BRCA    -1.79   1.36  -2.31 -0.798 -0.86 
 2 BRCA2               BRCA    -1.57   0.428 -2.49 -0.532 -1.06 
 3 BRCA3               BRCA    -2.49   1.31  -2.65  0.007 -1.06 
 4 BRCA4               BRCA    -2.14   0.810 -2.58 -0.466 -0.714
 5 BRCA5               BRCA     0.529  0.251 -2.95  0.960 -0.664
 6 BRCA6               BRCA    -1.44   1.31  -3.26 -0.622 -0.490
 7 BRCA7               BRCA    -0.426 -0.237 -2.71  0.208  0.759
 8 BRCA8               BRCA    -0.579 -1.24  -2.45  0.563  0.022
 9 BRCA9               BRCA    -1.16   1.21  -2.04 -0.616 -0.287
10 BRCA10              BRCA    -0.894  0.288 -2.31 -0.314  1.33           

複制

繪制EZH2表達值的boxplot圖

library(ggpubr)
ggboxplot(expr, x = "dataset", y = "EZH2",
          title = "EZH2", ylab = "Expression",
          color = "dataset", palette = "jco")           

複制

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

ezh2.jpeg

4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)

  • ggplot2和ggpubr的用法與第3部分幾乎一緻,而boxplotTCGA無法直接擷取mRNA表達資料,隻有rnaseq,另外還有mutation等資料。
  • 不同的是,不僅需要gene symbol還要entrez的ID,如MET|4233

4.0 擷取rnaseq表達資料

  • 擷取EZH2,PTEN,HGFAC,FYN等4個gene在BRCA,OV和LUSC中的rnaseq表達資料
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))           

複制

> expr
# A tibble: 2,071 x 6
   bcr_patient_barcode          dataset     `EZH2|2146` `PTEN|5728` `HGFAC|3083` `FYN|2534`
   <chr>                        <chr>             <dbl>       <dbl>        <dbl>      <dbl>
 1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq        487.       1724.        4.83       247. 
 2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq        941.       1107.        9.79       523. 
 3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq        492.       1479.        7.25       814. 
 4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq        334.       1877.        9.10       562. 
 5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq        297.       1740.        1.70       549. 
 6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq        238.       1597.        7.63       689. 
 7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq        258.       1374.        0.815      853. 
 8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq        253.       2181.        3.14        84.7
 9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq        392.       2529.        0.901      740. 
10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq        191.       1876.        0          522. 
# ... with 2,061 more rows           

複制

4.1 ggplot2繪制指定基因在不同惡性良性腫瘤中的表達boxplot(rnaseq)

#rnaseq ggplot2
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))

expr1<-expr[,-1]
expr2<-gather(expr1, key ="rnaseq", value="value", -dataset)

ggplot(expr2, aes(y=value,
                  x=reorder(dataset, value, mean),
                  fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(rnaseq~.)+
  theme(legend.position = "top")           

複制

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

rnaseq ggplot2.jpeg

4.2 ggpubr

檢視樣本量

nb_samples <- table(expr$dataset)
nb_samples           

複制

> nb_samples
BRCA.rnaseq LUSC.rnaseq   OV.rnaseq 
       1212         552         307            

複制

ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",
          title = "ESR1|2099", ylab = "Expression",
          color = "dataset", palette = "jco")           

複制

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

PTEN_5728.jpeg

4.3 boxplotTCGA

具體參考RTCGA的文檔

expressionsTCGA(LIHC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,extra
                extract.cols = "MET|4233") %>%
  rename(cohort = dataset,
  MET = `MET|4233`) %>%
  #cancer samples
  filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
  ACC_BLCA_BRCA_OV.rnaseq           

複制

boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")           

複制

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

MET RNASEQ.jpeg

boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)")           

複制

TCGA資料下載下傳分析(1-1):RTCGA包gene擷取表達值并可視化1 安裝并加載包2 下載下傳某惡性良性腫瘤中Datasets中的某類資料3 擷取某基因在任意癌症中的mRNA表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)4 擷取某基因在任意癌症中的rnaseq表達資料并可視化(ggplot2,ggpubr和boxplotTCGA)後記-----------------------------------------

Rplot.jpeg

後記-----------------------------------------

這部分隻是用RTCGA示範了如何下載下傳資料;mRNA和rnaseq表達值的plot。