Downloading from NCBI:
The SRA database path format is
/sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
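# accession.txt holds one run accession per line
while read -r i; do
x=$(echo "$i" | cut -b 1-6)   # first 6 characters of the accession, e.g. SRR116
y=$(echo "$i" | cut -b 1-3)   # accession prefix: SRR, ERR or DRR
ascp -T -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -l200m anonft[email protected]:/sra/sra-instant/reads/ByRun/sra/$y/$x/$i/$i.sra .
done < accession.txt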
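To make the path rule above easier to check, the remote path can be built by a small helper before it is handed to ascp. This is only a sketch: the function name ncbi_sra_path is hypothetical, and it simply mirrors the directory layout quoted at the top of this section.

```shell
# Hypothetical helper: print the NCBI path for one run accession,
# following /sra/sra-instant/reads/ByRun/sra/<prefix>/<first 6>/<acc>/<acc>.sra
ncbi_sra_path() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-3)   # SRR, ERR or DRR
    head6=$(printf '%s' "$acc" | cut -c1-6)    # first 6 characters
    printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$head6" "$acc" "$acc"
}
```

Calling ncbi_sra_path SRR11652531 prints /sra/sra-instant/reads/ByRun/sra/SRR/SRR116/SRR11652531/SRR11652531.sra, which can be inspected before starting a long batch download.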
Downloading from EBI:
The EBI path format is
[email protected]:/vol1/fastq/<first 6 characters of accession>/00<the last character of accession>/<accession>/<accession>_{1|2}.fastq.gz
# For single-end data
cat down.txt |while read id;do if [ ${#id} == 9 ] ; # the download path pattern changes with the length of the accession number
then
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/$id/${id}.fastq.gz ./;
else
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/00${id:0-1}/$id/${id}.fastq.gz ./;
fi;
done
# For paired-end data
cat down.txt |while read id;do if [ ${#id} == 9 ] ;
then
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/$id/${id}_1.fastq.gz ./;
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/$id/${id}_2.fastq.gz ./;
else
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/00${id:0-1}/$id/${id}_1.fastq.gz ./;
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/${id:0:6}/00${id:0-1}/$id/${id}_2.fastq.gz ./;
fi;
done
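The length-dependent directory rule used in the loops above can be isolated in one helper, which makes it easy to print and verify a path before downloading. The sketch below is hedged in two ways: the name ebi_fastq_dir is hypothetical, and for accessions longer than 10 characters it assumes that the characters after the 9th are left-padded with zeros to three digits (for example 031 for an accession ending in 31). That assumption goes beyond what the loops above encode, so the result should be compared with the fastq_aspera column of the EBI report before relying on it.

```shell
# Hypothetical helper: print the EBI directory holding the fastq files of one run.
# Assumption: characters after the 9th are zero-padded on the left to width 3,
# e.g. "6" -> 006 and "31" -> 031; 9-character accessions get no extra level.
ebi_fastq_dir() {
    acc=$1
    head6=$(printf '%s' "$acc" | cut -c1-6)    # first 6 characters
    extra=$(printf '%s' "$acc" | cut -c10-)    # empty for 9-character accessions
    case ${#extra} in
        0) sub="" ;;
        1) sub="00$extra/" ;;
        2) sub="0$extra/" ;;
        *) sub="$extra/" ;;
    esac
    printf '/vol1/fastq/%s/%s%s\n' "$head6" "$sub" "$acc"
}
```

The file name is then appended to the directory, for example "$(ebi_fastq_dir "$id")/${id}_1.fastq.gz" for the first mate of a paired-end run.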
Taking the most common case, paired-end (PE) data, as an example, the downloaded files will be named
<accession>_{1|2}.fastq.gz
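The file names alone do not show which sample group each file belongs to, so we need to batch-rename the files according to the clinical information.
Taking GSE149638 as an example, open EBI:
1. Select "show column selection" and add sample_title and fastq_aspera.
2. Click "download TSV report" in the upper left corner.
3. The TSV format is as follows: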
study_accession sample_accession experiment_accession run_accession tax_id scientific_name fastq_ftp fastq_aspera submitted_ftp sra_ftp sample_title
PRJNA629498 SAMN14777247 SRX8213570 SRR11652531 9606 Homo sapiens ftp.sra.ebi.ac.uk:/.../SRR11652531_1.fastq.gz;ftp.sra.ebi.ac.uk/.../SRR11652531_2.fastq.gz fasp.sra.ebi.ac.uk:/.../SRR11652531_1.fastq.gz;fasp.sra.ebi.ac.uk:/.../SRR11652531_2.fastq.gz HMDM M0 rep1
We want to rename the files to the form HMDM_M0_rep1_SRR11652531_1.fastq.gz.
Extracting the information:
## Which columns to cut depends on the file you downloaded
grep 'HMDM' filereport_read_tsv.txt | cut -f 4,11 | tr ' ' '_' | awk '{print $2"_"$1}' > new.txt
## The new names have the form HMDM_M0_rep1_SRR11652531 and are saved to new.txt; the _1.fastq.gz / _2.fastq.gz suffix is appended in a later step
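The extraction step can also be written as a single awk call that splits on tabs only, so spaces inside sample_title never affect the column count. This is a sketch under stated assumptions: the function name make_new_names is hypothetical, the first line of the report is a header, and columns 4 and 11 are run_accession and sample_title as in the example header shown earlier.

```shell
# Hypothetical helper: read the TSV report on stdin and print
# <sample_title with spaces replaced by underscores>_<run_accession>.
make_new_names() {
    awk -F'\t' 'NR > 1 {
        title = $11
        gsub(/ /, "_", title)    # "HMDM M0 rep1" -> "HMDM_M0_rep1"
        print title "_" $4
    }'
}
```

It can be used as grep 'HMDM' is used above, for example make_new_names < filereport_read_tsv.txt > new.txt; unlike the grep version it keeps every data row, so filter the input first if only some samples are wanted.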
ls *_1.fastq.gz > old1.txt
ls *_2.fastq.gz > old2.txt
## The old file names are saved to old1.txt and old2.txt
(awk '{print$0"_1.fastq.gz"}' new.txt ) > new1.txt
(awk '{print$0"_2.fastq.gz"}' new.txt ) > new2.txt
## Append the suffix to the new file names
paste old1.txt new1.txt > sample1.txt
paste old2.txt new2.txt > sample2.txt
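## Merge the old and new file names into one file (paste pairs them line by line, so both lists must be in the same order)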
awk -F'\t' 'system("mv " $1 " " $2)' sample1.txt
awk -F'\t' 'system("mv " $1 " " $2)' sample2.txt
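## awk's system() function runs a shell command; mv does the renaming, $1 is the first column of the text (the old file name) and $2 is the second column (the new file name); inside the parentheses, the quoted strings and fields placed next to each other are concatenated.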
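Because paste matches old and new names purely by line position, a variant that looks each file up by its accession may be safer when the report rows and the ls output are sorted differently. The following is a sketch, not the method used above: the function name rename_by_accession and the two-column map file (run accession, tab, new name without the mate suffix) are assumptions made for illustration.

```shell
# Hypothetical helper: rename <acc>_1.fastq.gz and <acc>_2.fastq.gz in the
# current directory using a tab-separated map: run_accession <TAB> new_prefix.
rename_by_accession() {
    tab=$(printf '\t')
    while IFS="$tab" read -r acc prefix; do
        for mate in 1 2; do
            old="${acc}_${mate}.fastq.gz"
            new="${prefix}_${mate}.fastq.gz"
            # Skip mates that are not present (e.g. single-end runs).
            if [ -e "$old" ]; then
                mv -- "$old" "$new"
            fi
        done
    done < "$1"
}
```

A suitable map can be produced from the report with the same columns as before, for example by printing the accession, a tab, and then the new name on each line, and the function is then called as rename_by_accession map.tsv.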