HTA2.0晶片資料分析比較麻煩，我也嘗試給你一些思路

表達晶片資料處理教程，早在2016年我就系統性整理了釋出在生信菜鳥團部落格：http://www.bio-info-trainee.com/2087.html 配套教學視訊在B站：https://www.bilibili.com/video/av26731585/ 代碼都在：https://github.com/jmzeng1314/GEO

但并不是萬能的，即使是表達晶片，也有一些是我的知識盲區，主要是因為沒有時間去繼續探索整理寫教程，畢竟大家都不支援我寫教程，沒有人打賞或者留言鼓勵我！

不過最近學徒問到了[HTA-2_0] Affymetrix Human Transcriptome Array 2.0晶片的事情，其實挺麻煩的，首先需要搞清楚下面3個平台的差異：

GPL17586 [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]
GPL19251 [HuGene-2_0-st] Affymetrix Human Gene 2.0 ST Array [probe set (exon) version]
GPL16686 [HuGene-2_0-st] Affymetrix Human Gene 2.0 ST Array [transcript (gene) version]

如果是晶片cel原始資料處理

那麼看我的教程你要挖的公共資料集作者上傳了錯誤的表達矩陣腫麼辦（如何讓高手心甘情願的幫你呢？）裡面有提到：

# BiocManager::install(c( 'oligo' ),ask = F,update = F)
library(oligo) 
# BiocManager::install(c( 'pd.hg.u133.plus.2' ),ask = F,update = F)
library(pd.hg.u133.plus.2)

dir='~/Downloads/GSE84571_RAW/'
  od=getwd()
  setwd(dir)
  celFiles <- list.celfiles(listGzipped = T)
  celFiles
  affyRaw <- read.celfiles( celFiles )
  setwd(od)
  eset <- rma(affyRaw)
  eset
  # http://math.usu.edu/jrstevens/stat5570/1.4.Preprocess_4up.pdf
  save(eset,celFiles,file = f)
  # write.exprs(eset,file="data.txt")

複制

當然了，你沒有R基礎是看不懂的哈。自行去B站補充下面的視訊知識點：

如果是晶片注釋到基因

也是很簡單

library(GEOquery)
  #Download GPL file, put it in the current directory, and load it:
  gpl <- getGEO('GPL19251', destdir=".")
  colnames(Table(gpl))  
  head(Table(gpl)[,c(1,10)]) ## you need to check this , which column do you need
  probe2gene=Table(gpl)[,c(1,10)]
  head(probe2gene)
  library(stringr)  
  probe2gene$symbol=trimws(str_split(probe2gene$gene_assignment,'//',simplify = T)[,2])
  plot(table(table(probe2gene$symbol)),xlim=c(1,50))
  head(probe2gene)
  save(probe2gene,file='probe2gene.Rdata')

複制

不同平台，就替換GPL，然後自行看輸入輸出判斷一下即可，也需要R基礎啦。

如果是差異分析

當然了也可以直接修改我的代碼，不過這個HTA2.0晶片比較麻煩的在于，基于exon和基于transcript的有一點點差別，因為基于exon理論上可以擷取不同剪切體的表達量差異的，雖然實際上很少有人願意去探索。

發表在 BMC Genomics. 2017; 文章 RNA sequencing and transcriptome arrays analyses show opposing results for alternative splicing in patient derived samples 就提到過。

更麻煩的是，因為這個晶片記錄35萬個外顯子，這樣矩陣就非常大，都可以根據illumina的450K甲基化晶片媲美啦！

可以看到，大量的基因有着1-30個探針，如下：

是以有文章采取非常特殊的差異分析政策：

很大機率上，你是看不懂的

最後留一個作業

大家可以拿 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118222 資料集，測試看看，走我給大家的代碼，得到PCA,熱圖，火山圖，看看是否合理。

GSM3321243    LNCaP EtOH rep 1
GSM3321244    LNCaP EtOH rep 2
GSM3321245    LNCaP DHT rep 1
GSM3321246    LNCaP DHT rep 2
GSM3321247    LNCaP ABT-DHT rep 1
GSM3321248    LNCaP ABT-DHT rep 2
GSM3321249    C4-2 DMSO rep 1
GSM3321250    C4-2 DMSO rep 2
GSM3321251    C4-2 ABT rep 1
GSM3321252    C4-2 ABT rep 2

複制