用cmscan挖掘ncRNA資訊

0.準備工作：

擷取Rfam種子
擷取Rfam的claninfo
軟體安裝
待處理物種的基因組檔案

建立一個專門用于處理RNA的檔案夾

mkdir Cmscan

擷取種子和chanin檔案

下載下傳Rfam種子：

axel -q ftp://ftp.ebi.ac.uk/pub/databases/Rfam/13.0/Rfam.cm.gz

下載下傳clanin檔案：

wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/13.0/Rfam.clanin

用conda安裝infernal軟體

conda install infernal

源碼安裝

http://eddylab.org/infernal/

官網提供可下載下傳的源碼和User's Guide，相當人性化了。并且mac系統也可以使用brew安裝infernal。

準備基因組檔案

将待處理的基因組檔案軟連結到

Cmscan

檔案夾下：

ln -s /path/to/file.fa

1.建庫

cmpress Rfam.cm

2.确定基因組大小

esl-seqstat my-genome.fa

其輸出結果中有一行，類似于

Total # of residues: 3000000

是我們需要的。考慮到基因組為雙鍊和下一步用到的參數的機關為Million，我們使用公式

3000000 * 2 / 1000000

計算得出結果為

，作為下一步參數

-Z

的值.

tips:esl-seqstat指令是hmmer的一個插件，如果沒法全局調用則建議直接

locate esl-seqstat

查找絕對路徑。在我的伺服器上它在的位置是

/media/newdisk/interproscan-5.28-67.0/bin/hmmer/hmmer3/3.1b1/easel/miniapps/esl-seqstat

當然，有可能是因為我的interproscan沒裝好導緻沒法直接使用。。

3.運作程式（兩個示例）

nohup cmscan -Z 208 --cut_ga --rfam --nohmmonly --tblout kfl.tblout --fmt 2 --clanin /media/newdisk/Cmscan/Rfam.clanin Rfam.cm /media/newdisk/lunzao/KFL/120824_klebsormidium_Scaffolds_v1.0.fna > kfl.cmscan &

nohup cmscan -Z 3503 --cut_ga --rfam --nohmmonly --tblout chara.tblout --fmt 2 --clanin /media/newdisk/Cmscan/Rfam.clanin Rfam.cm /media/newdisk/lunzao/Chara/dailydata/chara_genome.fasta > chara.cmscan &

因為比較耗時是以建議使用nohup指令來跑
-Z ：查詢序列的大小，以M為機關。由 esl-seqstat 算出或自己寫程式計算，記得乘以2，除以10^6
--cut_ga : 輸出不小于Rfam GA門檻值的結果。這是Rfam認證RNA家族的門檻值，不低于這個門檻值的序列得分被認為是真同源序列。The bit score gathering threshold (GA cutoff), set by Rfam curators when building the family. All sequences that score at or above this threshold will be included in the full alignment and are believed to be true homologs to the model
--rfam : run in “fast” mode, the same mode used for Rfam annotation and determination of GA thresholds.
--nohmmonly : all models, even those with zero basepairs, are run in CM mode (not HMM mode). This ensures all GA cutoffs, which were determined in CM mode for each model, are valid.
--tblout : table輸出。 --fmt 2 : table輸出的一種格式。 --clanin : 下載下傳的clan資訊。This file lists which models belong to the same clan. Rfam clans are groups of models that are homologous and therefore it is expected that some hits to these models will overlap. For example, the LSU rRNA archaea and LSU rRNA bacteria models are both in the same clan.

4.結果處理

在filename.tblout檔案中，有一欄是

olp

，表示查詢序列的重疊資訊:

表示同一條鍊上，不存在與此查詢序列重疊的序列也在Rfam資料庫有比對，這是需要保留的查詢序列。

表示同一條鍊上，不存在比此查詢序列與Rfam資料庫比對更好的序列，也需要保留。

表示同一條鍊上，存在比此查詢序列與Rfam資料庫比對更好的序列，應忽略。

是以應将搜尋到

的行給去除掉

grep -v '=' my-genome.tblout >my-genome.deoverlapped.tblout

将檔案處理成excel的形式，隻保留我們需要的資訊

awk 'BEGIN{OFS="\t";}{if(FNR==1) print "target_name\taccession\tquery_name\tquery_start\tquery_end\tstrand\tscore\tEvalue"; if(FNR>2 && $20!="=" && $0!~/^#/) print $2,$3,$4,$10,$11,$12,$17,$18; }' my-genome.tblout >my-genome.tblout.final.xls

tip：若不設定--cpu的話會預設使用全部線程。

awk 'BEGIN{OFS=FS="\t"} ARGIND==1{a[$2]=$4;} ARGIND==2{type=a[$1]; if(type=="") type="Others"; count[type]+=1;} END{for(type in count) print type, count[type];}' Rfam_anno_class.txt my-genome.tblout.final.xls

參考：

https://www.jianshu.com/p/89d8b72d9bd5

http://rfam.xfam.org/search/type

https://blog.csdn.net/qazplm12_3/article/details/73380016

https://mp.weixin.qq.com/s/5OIRHA22ZLr5Z8bEhDiBqg

http://blog.genesino.com/2017/06/Rfam/

http://rfam.readthedocs.io/en/latest/genome-annotation.html

用cmscan挖掘ncRNA資訊

0.準備工作：

擷取種子和chanin檔案

準備基因組檔案

1.建庫

2.确定基因組大小

3.運作程式（兩個示例）

4.結果處理

繼續閱讀

set define off關閉替代變量功能

報錯：'mysql' 不是内部或外部指令，也不是可運作的程式或批處理檔案。

Linxu常用指令技巧彙總

ERROR 1 (HY000): Can't create/write to file '/tmp/#sql_4188_1.MYI' (Errcode: 28)

艱難安裝LDAP,SSL認證

《Linux指令行與Shell腳本程式設計大全第2版.布盧姆》pdf

MySQL的4種隔離級别？出現問題

XX系統實施過程問題總結

無元件上傳圖檔到資料庫中，最完整解決方案

【MySQL資料庫】資料庫索引事務1.索引2.事務

neo4j之cypher使用文檔

詳解STM32單片機的堆棧

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

登入plsql 報錯 the account is locked --使用者被鎖

SequoiaDB巨杉資料庫C++驅動概述