如何批量转换gi至ko

2023-06-05 13:57:26

最近分析转录组数据，自己倒腾着做KEGG pathway的富集分析（许多转录组文章都是用KOBAS数据库 http://kobas.cbi.pku.edu.cn/home.do 来做KEGG pathway的富集分析，但是该网站只能一次提交500条序列，如果差异表达基因超过500条就不能用该网站进行分析了）。得到了差异表达基因的gi号，如何快速将gi号映射为ko号是进行富集分析必须解决的问题。目前发现有几种方法可以解决（可能还有其他方法）。方法一：根据KEGG API（ http://www.kegg.jp/kegg/docs/keggapi.html ），利用如下脚本完成：

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use autodie;

die "\nUsage: perl $0 <querygi> <ko_out>" if @ARGV!=2;

my($querygi,$ko_out)[email protected];

open my $gi,'<',$querygi;
open my $ko,'>',$ko_out;

while(<$gi>){
    chomp;
    my $gi=$_;
    my $url="http://rest.kegg.jp/conv/genes/ncbi-geneid:$_";
    my $response=get($url);
    if ($response !~ /^\s+$/){
        chomp($response);
        $response=~s/(\S+)$/$1/;
        my $url2="http://rest.kegg.jp/list/$response";
        my $response2=get($url2);
        if ($response2 !~ /^\s+$/){
            print $ko "$gi\t$response2";
        }
    }
}

方法二：利用ipr数据库上的idmapping（ http://pir.georgetown.edu/pirwww/search/idmapping.shtml ）工具完成，优点是速度快，缺点是不能直接由gi转换为ko，必须先由gi转换为UniProtKB AC/ID，然后再转换为ko号。

方法三：在ipr数据库后台下载如下文件 ftp://ftp.pir.georgetown.edu/databases/idmapping/uniprot_idmapping/idmapping.dat.gz，压缩包739 M，但解压后有27 G，利用如下脚本进行格式化（即只保留gi号、ko号以及它们的对应关系）：

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $in,"<",$ARGV[0];	#idmappint.dat
open my $tempfile,">","tempfile";	#temp file

my $temp='';
while(<$in>){
	chomp;
	my ($uniprotKB,$content,$value)=split(/\t/,$_);
	($uniprotKB)=$uniprotKB=~/^(\w+)/;	#避免如uniprotKB为”P46077-1“这种形式引起产生多行空格
	if($temp eq $uniprotKB){
		if($content eq "GI"){
			print $tempfile $value,"\t";
		}
		elsif($content eq "KO"){
			print $tempfile $value;
		}
	}else{
		if($content=~/GI\t(\S+)/){
			print $tempfile $1,"\t";
		}else{
			print $tempfile "\n";
		}
		
	}
	$temp=$uniprotKB;
}
close $in;
close $tempfile;

open my $in2,"<","tempfile";
open my $out,">","F:/format_idmapping_gi2ko";

while(<$in2>){
	if($_=~/\tK/){
		print $out $_;
	}
}

close $in2;
close $out;
`del tempfile`;	#delete temp file

得到format_idmapping_gi2ko文件，该文件为160 M，链接： http://pan.baidu.com/s/1ARRu2 密码：cqg0；得到该文件后就可以利用该文件进行gi2ko转换了：

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $gi,"<",$ARGV[0];	#query_gi,每行一个gi号
open my $data_gi2ko,"<","F:/format_idmapping_gi2ko";
open my $out,">",$ARGV[1];

my %hash;

while(<$data_gi2ko>){
	my ($gi,$ko)=(split /\tK/,$_,2);
	foreach (split /\t/,$gi){
		$hash{$_}="K".$ko;
	}
}

while(<$gi>){
	chomp;
	if(exists $hash{$_}){
		#如果一个gi号对应多个ko，则以tab分割输出
		$hash{$_}=~s/(?<=\d)K/\tK/g;
		print $out $_,"\t",$hash{$_};
	}
}

结果就得到了gi号对应的ko号，然后就可以进行后续的KEGG pathway的富集分析了。

如何批量转换gi至ko

继续阅读

RNA-seq中的生物学重复

Bam文件去重复

RNA-Seq HISAT+ HTSeq + DESeq2流程及测序深度和质控问题讨论

Cufllinks的安装与使用转自：http://www.chenlianfu.com/?p=623 一. 简介二. 安装三. Cufflinks的使用四. Cuffmerge的使用五. Cuffcompare的使用六. Cuffdiff的使用七. cufflinks使用中遇到的问题

2021.06.29【R语言】丨png转pdf批量生成摘要环境与方法使用代码结果展示总结

2021.05.17【R语言】丨clusterProfiler注释表——KEGG/GO enrich富集图专用

RNA分析流程

RNA-seq 比对软件STAR——（1）安装RNA-seq 比对软件STAR——（1）安装

哈佛大学——差异表达分析（十三）似然比检验结果学习目标似然比检验(LRT)结果探索results()表鉴定显著基因识别具有共享表达谱的基因簇

哈佛大学——差异表达分析（十四）Time course analysis学习目标用LRT进行时程分析

转录组的技术应用（生物学、医学、农学中的应用）转录组的技术应用

转录组测序和RNA-seq

2021.3.20丨Cutadapt数据统计脚本

2021.01.07丨使用fastp统计样品质量结果

2021.01.05丨根据基因名称拼接表达量与相关注释

2021.04.08丨RNA-seq消除批量效应