
Batch-renaming SRA/fastq files by sample information after downloading them with Aspera/FTP

Downloading from NCBI:

Paths in the SRA database have the following format:

/sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
           
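# Read one accession per line from accession.txt
while read -r i; do
    x=$(echo "$i" | cut -b 1-6)   # first 6 characters of the accession
    y=$(echo "$i" | cut -b 1-3)   # SRR, ERR or DRR

    # NOTE: the host after "anonftp@" was obscured on the source page; put the NCBI Aspera host in its place
    ascp -T -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -l 200m "anonftp@<NCBI Aspera host>:/sra/sra-instant/reads/ByRun/sra/$y/$x/$i/$i.sra" .
done < accession.txt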
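
Before starting a long transfer it can help to print the remote paths and check them by eye. The following is a minimal sketch that only builds the path strings from the layout described above; `ncbi_sra_path` and `list_ncbi_paths` are hypothetical helper names, not part of any tool, and nothing is downloaded.

```shell
#!/bin/sh
# Hypothetical helper: print the remote .sra path for one run accession,
# following the ByRun layout described above.
ncbi_sra_path() {
    local acc="$1"
    local prefix first6
    prefix=$(printf '%s\n' "$acc" | cut -c1-3)   # SRR, ERR or DRR
    first6=$(printf '%s\n' "$acc" | cut -c1-6)   # first 6 characters
    printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$first6" "$acc" "$acc"
}

# Hypothetical helper: read accessions from stdin, one per line,
# skip blank lines, and print one remote path per accession.
list_ncbi_paths() {
    local acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue
        ncbi_sra_path "$acc"
    done
}
```

Running `list_ncbi_paths < accession.txt` prints one path per accession, which can be compared against the format line before any `ascp` call is made.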
           

Downloading from EBI:

Paths at EBI have the following format:

era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/<first 6 characters of accession>/00<the last character of accession>/<accession>/<accession>_{1|2}.fastq.gz
           
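# Single-end data
# The remote path depends on the length of the accession number:
# 9 characters -> no extra subdirectory; the else branch (00 + last character)
# assumes a 10-character accession. Longer accessions are padded differently.
while read -r id; do
    if [ "${#id}" -eq 9 ]; then
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/$id/${id}.fastq.gz" ./
    else
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id: -1}/$id/${id}.fastq.gz" ./
    fi
done < down.txt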

# Paired-end data (same path rule as above; the else branch assumes a 10-character accession)
while read -r id; do
    if [ "${#id}" -eq 9 ]; then
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/$id/${id}_1.fastq.gz" ./
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/$id/${id}_2.fastq.gz" ./
    else
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id: -1}/$id/${id}_1.fastq.gz" ./
        ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id: -1}/$id/${id}_2.fastq.gz" ./
    fi
done < down.txt
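
The loops above separate 9-character accessions from everything else. As far as I know, ENA's layout takes the characters after the 9th and left-pads them with zeros to three digits, so an 11-character accession such as SRR11652531 would sit under `031`, not `001`. The sketch below builds the remote directory under that assumption; `ebi_fastq_dir` is a hypothetical helper name, and the padding rule for 11- and 12-character accessions goes beyond what the format line above states, so check one printed path against the `fastq_aspera` column of a real report before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the remote directory that holds the FASTQ files
# of one run accession. ASSUMPTION: characters after the 9th are left-padded
# with zeros to three digits (10 chars -> 00X, 11 -> 0XY, 12 -> XYZ).
ebi_fastq_dir() {
    local acc="$1"
    local first6 extra sub
    first6=$(printf '%s\n' "$acc" | cut -c1-6)
    extra=${acc#?????????}          # drop the first 9 characters
    case ${#acc} in
        9)  sub="" ;;
        10) sub="00$extra" ;;
        11) sub="0$extra" ;;
        12) sub="$extra" ;;
        *)  echo "unexpected accession length: $acc" >&2; return 1 ;;
    esac
    if [ -n "$sub" ]; then
        printf '/vol1/fastq/%s/%s/%s\n' "$first6" "$sub" "$acc"
    else
        printf '/vol1/fastq/%s/%s\n' "$first6" "$acc"
    fi
}
```

The printed directory can then be combined with `${id}_1.fastq.gz` or `${id}_2.fastq.gz` in the `ascp` commands, replacing the two hard-coded branches.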
           

Taking the most common case, PE (paired-end) data, as an example, the downloaded files will be named

<accession>_{1|2}.fastq.gz
           

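The file names say nothing about which sample group a file belongs to, so we need to batch-rename the files according to the sample (clinical) information.

Take GSE149638 as an example and open its record at EBI.

1. Choose "show column selection" and tick sample_title and fastq_aspera.

2. Click "download tsv report" in the upper-left corner.

3. The TSV looks like this: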

study_accession	sample_accession	experiment_accession	run_accession	tax_id	scientific_name	fastq_ftp	fastq_aspera	submitted_ftp	sra_ftp	sample_title
PRJNA629498	SAMN14777247	SRX8213570	SRR11652531	9606	Homo sapiens	ftp.sra.ebi.ac.uk:/.../SRR11652531_1.fastq.gz;ftp.sra.ebi.ac.uk/.../SRR11652531_2.fastq.gz	fasp.sra.ebi.ac.uk:/.../SRR11652531_1.fastq.gz;fasp.sra.ebi.ac.uk:/.../SRR11652531_2.fastq.gz	HMDM M0 rep1
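
Since the report already carries a `fastq_aspera` column, the download sources can also be read from it directly, without reconstructing the directory layout from the accession. A minimal sketch, assuming the report is tab-separated with a header row and that `era-fasp` is the Aspera user; `aspera_sources` is a hypothetical helper name, and it only prints the sources.

```shell
#!/bin/sh
# Hypothetical helper: read an ENA TSV report on stdin and print one Aspera
# source (user@host:path) per FASTQ file. The fastq_aspera column is found
# by its header name, so the column order of the report does not matter.
aspera_sources() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "fastq_aspera") col = i
            if (!col) { print "no fastq_aspera column" > "/dev/stderr"; exit 1 }
            next
        }
        $col != "" {
            # paired-end runs list both files, separated by ";"
            n = split($col, files, ";")
            for (j = 1; j <= n; j++) print "era-fasp@" files[j]
        }
    '
}

# Possible use (not run here):
# aspera_sources < filereport_read_tsv.txt | while read -r src; do
#     ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh "$src" ./
# done
```

Looking the column up by name keeps the sketch working when a different set of columns is selected on the EBI page.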
           

We want to rename each file to the form HMDM_M0_rep1_SRR11652531_1.fastq.gz.

Extract the information:

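## Which columns to cut depends on the report you downloaded
grep 'HMDM' filereport_read_tsv.txt | cut -f 4,11 | tr ' ' '_' | sort -k1,1 | awk '{print $2"_"$1}' > new.txt
## New name prefixes such as HMDM_M0_rep1_SRR11652531 are saved to new.txt,
## sorted by run accession so that they line up with the ls output below
ls *_1.fastq.gz > old1.txt
ls *_2.fastq.gz > old2.txt
## Old file names are saved to old1.txt and old2.txt
## (this assumes the directory holds exactly the runs matched by grep)
awk '{print $0"_1.fastq.gz"}' new.txt > new1.txt
awk '{print $0"_2.fastq.gz"}' new.txt > new2.txt
## Append the suffix to the new file names
paste old1.txt new1.txt > sample1.txt
paste old2.txt new2.txt > sample2.txt
## Put old and new names side by side in one file; check that every row pairs the same accession before renaming
awk -F'\t' '{ system("mv " $1 " " $2) }' sample1.txt
awk -F'\t' '{ system("mv " $1 " " $2) }' sample2.txt
## awk's system() function runs a shell command, and mv does the renaming. $1 is the first column (the old file name) and $2 the second (the new file name). Inside the parentheses, the quoted strings and the fields are simply written next to each other, which is how awk concatenates.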
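
Pairing old and new names with `paste` depends on both lists being in the same order. A variant that builds each old name from the accession itself avoids that dependency. The following is a minimal sketch, assuming a two-column, tab-separated map file (run accession, then sample title) and the `<accession>_{1|2}.fastq.gz` naming shown earlier; `rename_fastq` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rename <acc>_1/_2.fastq.gz inside directory $2
# (default: current directory) using map file $1, whose lines are
#   run_accession <TAB> sample_title
# Spaces in the title become underscores; existing targets are not overwritten.
rename_fastq() {
    local map="$1" dir="${2:-.}"
    local acc title prefix mate old new
    while IFS="$(printf '\t')" read -r acc title || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue
        prefix=$(printf '%s\n' "$title" | tr ' ' '_')
        for mate in 1 2; do
            old="$dir/${acc}_${mate}.fastq.gz"
            new="$dir/${prefix}_${acc}_${mate}.fastq.gz"
            if [ -e "$old" ]; then
                mv -n -- "$old" "$new"
            else
                echo "missing: $old" >&2   # report, but keep going
            fi
        done
    done < "$map"
}
```

With the column positions used above, such a map could be produced by `grep 'HMDM' filereport_read_tsv.txt | cut -f 4,11 > map.tsv` and then applied with `rename_fastq map.tsv .`.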
           

