如何批量轉換gi至ko

2023-06-05 13:57:26

最近分析轉錄組資料，自己倒騰着做KEGG pathway的富集分析（許多轉錄組文章都是用KOBAS資料庫 http://kobas.cbi.pku.edu.cn/home.do 來做KEGG pathway的富集分析，但是該網站隻能一次送出500條序列，如果差異表達基因超過500條就不能用該網站進行分析了）。得到了差異表達基因的gi号，如何快速将gi号映射為ko号是進行富集分析必須解決的問題。目前發現有幾種方法可以解決（可能還有其他方法）。方法一：根據KEGG API（ http://www.kegg.jp/kegg/docs/keggapi.html ），利用如下腳本完成：

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use autodie;

die "\nUsage: perl $0 <querygi> <ko_out>" if @ARGV!=2;

my($querygi,$ko_out)[email protected];

open my $gi,'<',$querygi;
open my $ko,'>',$ko_out;

while(<$gi>){
    chomp;
    my $gi=$_;
    my $url="http://rest.kegg.jp/conv/genes/ncbi-geneid:$_";
    my $response=get($url);
    if ($response !~ /^\s+$/){
        chomp($response);
        $response=~s/(\S+)$/$1/;
        my $url2="http://rest.kegg.jp/list/$response";
        my $response2=get($url2);
        if ($response2 !~ /^\s+$/){
            print $ko "$gi\t$response2";
        }
    }
}

方法二：利用ipr資料庫上的idmapping（ http://pir.georgetown.edu/pirwww/search/idmapping.shtml ）工具完成，優點是速度快，缺點是不能直接由gi轉換為ko，必須先由gi轉換為UniProtKB AC/ID，然後再轉換為ko号。

方法三：在ipr資料庫背景下載下傳如下檔案 ftp://ftp.pir.georgetown.edu/databases/idmapping/uniprot_idmapping/idmapping.dat.gz，壓縮包739 M，但解壓後有27 G，利用如下腳本進行格式化（即隻保留gi号、ko号以及它們的對應關系）：

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $in,"<",$ARGV[0];	#idmappint.dat
open my $tempfile,">","tempfile";	#temp file

my $temp='';
while(<$in>){
	chomp;
	my ($uniprotKB,$content,$value)=split(/\t/,$_);
	($uniprotKB)=$uniprotKB=~/^(\w+)/;	#避免如uniprotKB為”P46077-1“這種形式引起産生多行空格
	if($temp eq $uniprotKB){
		if($content eq "GI"){
			print $tempfile $value,"\t";
		}
		elsif($content eq "KO"){
			print $tempfile $value;
		}
	}else{
		if($content=~/GI\t(\S+)/){
			print $tempfile $1,"\t";
		}else{
			print $tempfile "\n";
		}
		
	}
	$temp=$uniprotKB;
}
close $in;
close $tempfile;

open my $in2,"<","tempfile";
open my $out,">","F:/format_idmapping_gi2ko";

while(<$in2>){
	if($_=~/\tK/){
		print $out $_;
	}
}

close $in2;
close $out;
`del tempfile`;	#delete temp file

得到format_idmapping_gi2ko檔案，該檔案為160 M，連結： http://pan.baidu.com/s/1ARRu2 密碼：cqg0；得到該檔案後就可以利用該檔案進行gi2ko轉換了：

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $gi,"<",$ARGV[0];	#query_gi,每行一個gi号
open my $data_gi2ko,"<","F:/format_idmapping_gi2ko";
open my $out,">",$ARGV[1];

my %hash;

while(<$data_gi2ko>){
	my ($gi,$ko)=(split /\tK/,$_,2);
	foreach (split /\t/,$gi){
		$hash{$_}="K".$ko;
	}
}

while(<$gi>){
	chomp;
	if(exists $hash{$_}){
		#如果一個gi号對應多個ko，則以tab分割輸出
		$hash{$_}=~s/(?<=\d)K/\tK/g;
		print $out $_,"\t",$hash{$_};
	}
}

結果就得到了gi号對應的ko号，然後就可以進行後續的KEGG pathway的富集分析了。

如何批量轉換gi至ko

繼續閱讀

RNA-seq中的生物學重複

Bam檔案去重複

RNA-Seq HISAT+ HTSeq + DESeq2流程及測序深度和質控問題讨論

Cufllinks的安裝與使用轉自：http://www.chenlianfu.com/?p=623 一. 簡介二. 安裝三. Cufflinks的使用四. Cuffmerge的使用五. Cuffcompare的使用六. Cuffdiff的使用七. cufflinks使用中遇到的問題

2021.06.29【R語言】丨png轉pdf批量生成摘要環境與方法使用代碼結果展示總結

2021.05.17【R語言】丨clusterProfiler注釋表——KEGG/GO enrich富集圖專用

RNA分析流程

RNA-seq 比對軟體STAR——（1）安裝RNA-seq 比對軟體STAR——（1）安裝

哈佛大學——差異表達分析（十三）似然比檢驗結果學習目标似然比檢驗(LRT)結果探索results()表鑒定顯著基因識别具有共享表達譜的基因簇

哈佛大學——差異表達分析（十四）Time course analysis學習目标用LRT進行時程分析

轉錄組的技術應用（生物學、醫學、農學中的應用）轉錄組的技術應用

轉錄組測序和RNA-seq

2021.3.20丨Cutadapt資料統計腳本

2021.01.07丨使用fastp統計樣品品質結果

2021.01.05丨根據基因名稱拼接表達量與相關注釋

2021.04.08丨RNA-seq消除批量效應