<Transcriptome> Tidying up the expression data produced by StringTie

The expression data produced by StringTie looks like this:

$ head SRR3823868
Gene ID	Gene Name	Reference	Strand	Start	End	Coverage	FPKM	TPM
gene-Aa1Ag00004	Aa1Ag00004	Chr1A	-	47479	48231	0.721659	0.111406	0.140519
gene-Aa1Ag00001	Aa1Ag00001	Chr1A	-	14477	27718	17.418158	2.688935	3.391617
gene-Aa1Ag00005	Aa1Ag00005	Chr1A	+	61262	67021	0.574441	0.088680	0.111854
gene-Aa1Ag00006	Aa1Ag00006	Chr1A	-	67992	68593	9.194076	1.419339	1.790246
           

Goal: merge the expression results of several samples into a single file.

1. Use awk to extract the Gene Name and FPKM columns from the results

$ awk -F '\t' '{print $2"\t"$8}' SRR3823868 >SRR3823868_FPKM.txt
$ head SRR3823868_FPKM.txt
Gene Name	FPKM
Aa1Ag00004	0.111406
Aa1Ag00001	2.688935
Aa1Ag00005	0.088680
Aa1Ag00006	1.419339
           

2. Use sed to rename the FPKM header: to avoid mixing up the expression values of different samples, prefix the header with the SRR ID

$ sed -i 's#FPKM#SRR3823868FPKM#' SRR3823868_FPKM.txt
$ head SRR3823868_FPKM.txt
Gene Name	SRR3823868FPKM
Aa1Ag00004	0.111406
Aa1Ag00001	2.688935
Aa1Ag00005	0.088680
Aa1Ag00006	1.419339
           

3. Process the other samples in the same way

$ head SRR6274689_FPKM.txt
Gene Name	SRR6274689FPKM
Aa1Ag00006	1.176829
Aa1Ag00001	4.954225
Aa1Ag00002	0.556997
Aa1Ag00007	2.232162
$ head SRR3823655_FPKM.txt
Gene Name	SRR3823655FPKM
Aa1Ag00004	0.000000
Aa1Ag00005	0.080253
Aa1Ag00002	0.837885
Aa1Ag00003	0.024968
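
Steps 1 to 3 can be folded into a single pass per sample. The following is a minimal sketch, assuming each StringTie gene abundance table is named after its SRR ID and sits in the current directory. The function name extract_fpkm is made up for illustration. The header is renamed inside awk, so the separate sed step is not needed.

```shell
# Extract the Gene Name (column 2) and FPKM (column 8) columns from one
# StringTie gene abundance table and tag the FPKM header with the sample ID.
# extract_fpkm is a hypothetical helper name, not part of StringTie.
extract_fpkm() {
    id="$1"
    awk -F '\t' -v s="$id" 'BEGIN { OFS = "\t" }
        NR == 1 { print $2, s "FPKM"; next }   # header row: Gene Name, <sample>FPKM
        { print $2, $8 }                       # data rows: gene name, FPKM value
    ' "$id" > "${id}_FPKM.txt"
}

# The sample IDs are the ones used in this post; adjust them to your own files.
for sample in SRR3823868 SRR6274689 SRR3823655; do
    if [ -f "$sample" ]; then
        extract_fpkm "$sample"
    fi
done
```

Each run writes <sample>_FPKM.txt next to the input, in the same two-column layout shown above.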

           

4. Use join to merge the three files

$ join -e 'NA' -a 1 -a 2 SRR3823655_FPKM.txt SRR3823868_FPKM.txt >SRR3823655_SRR3823868_FPKM.txt
join: SRR3823655_FPKM.txt:4: is not sorted: Aa1Ag00002	0.837885
join: SRR3823868_FPKM.txt:6: is not sorted: Aa1Ag00002	0.698827
$ head SRR3823655_SRR3823868_FPKM.txt
Gene Name SRR3823655FPKM Name SRR3823868FPKM
Aa1Ag00004 0.000000 0.111406
Aa1Ag00001 2.688935
Aa1Ag00005 0.080253 0.088680
Aa1Ag00002 0.837885
Aa1Ag00003 0.024968
Aa1Ag00006 1.419339
Aa1Ag00002 0.698827
Aa1Ag00003 0.030977
Aa1Ag00007 0.063051 0.707737
           

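It threw an error, which was disappointing...

join requires its input to be sorted on the join field, so the expression files have to be sorted with sort first.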
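
Besides the sorting problem, join splits fields on blanks by default, which is why the "Gene Name" header was broken into two columns in the output above, and -e only fills fields that are requested through -o, which is why no NA appeared. A minimal sketch of one way around all three issues, assuming GNU coreutils (--header and -o auto are GNU extensions); the helper names sort_keep_header and join_fpkm are hypothetical:

```shell
TAB="$(printf '\t')"

# Sort a two-column FPKM table by gene name, keeping the header row on top.
sort_keep_header() {
    head -n 1 "$1"
    tail -n +2 "$1" | LC_ALL=C sort -t "$TAB" -k1,1
}

# Full outer join of two FPKM tables on the gene-name column.
# Genes missing from one table get NA in that table's column.
# Assumption: GNU join (needs --header and -o auto).
join_fpkm() {
    sort_keep_header "$1" > "$1.sorted"
    sort_keep_header "$2" > "$2.sorted"
    LC_ALL=C join --header -t "$TAB" -a 1 -a 2 -e NA -o auto \
        "$1.sorted" "$2.sorted"
}

# Example call, using the file names from this post:
#   join_fpkm SRR3823655_FPKM.txt SRR3823868_FPKM.txt > SRR3823655_SRR3823868_FPKM.txt
```

To add a third sample, the merged file can be passed back in as the first argument together with the next FPKM table. Sorting and joining both run under LC_ALL=C so that the two tools agree on the collation order.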

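I had assumed that the expression results only list gene IDs that are actually expressed, but it turns out that every ID is written out. In that case, as long as the rows are in the same gene order, all I need to do is transfer the files back to Windows and paste the columns together in Excel. Oh well, it cost me some time, but I picked up several new commands along the way, so I suppose it was worth it.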
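
The same column pasting can also be done on the server instead of in Excel. This is only a sketch, assuming every file lists exactly the same set of genes; each file is sorted by gene name first so that the rows line up, and the function refuses to paste when the gene columns differ. The name paste_fpkm is hypothetical.

```shell
TAB="$(printf '\t')"

# Paste the FPKM columns of several tables side by side.
# Usage: paste_fpkm OUTPUT FILE1 FILE2 [FILE3 ...]
# Assumption: all files contain the same genes (checked before pasting).
paste_fpkm() {
    out="$1"; shift
    tmpdir="$(mktemp -d)"
    first=1
    for f in "$@"; do
        # Sort by gene name, keeping the header row on top.
        { head -n 1 "$f"; tail -n +2 "$f" | LC_ALL=C sort -t "$TAB" -k1,1; } > "$tmpdir/sorted"
        if [ "$first" -eq 1 ]; then
            cp "$tmpdir/sorted" "$tmpdir/merged"
            first=0
        else
            # Stop if the gene column differs from the table built so far.
            cut -f 1 "$tmpdir/merged" > "$tmpdir/genes_a"
            cut -f 1 "$tmpdir/sorted" > "$tmpdir/genes_b"
            if ! cmp -s "$tmpdir/genes_a" "$tmpdir/genes_b"; then
                echo "gene lists differ in $f" >&2
                rm -rf "$tmpdir"
                return 1
            fi
            # Append only the FPKM column of the new file.
            cut -f 2 "$tmpdir/sorted" | paste "$tmpdir/merged" - > "$tmpdir/next"
            mv "$tmpdir/next" "$tmpdir/merged"
        fi
    done
    mv "$tmpdir/merged" "$out"
    rm -rf "$tmpdir"
}

# Example call, using the file names from this post:
#   paste_fpkm all_FPKM.txt SRR3823655_FPKM.txt SRR3823868_FPKM.txt SRR6274689_FPKM.txt
```

If the gene lists do differ between samples, an outer join on the gene name is the safer route than pasting.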