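Bioinformatics One-Liners

Personal homepage: http://ijz.me/?p=959

Repository hosted at: http://git.oschina.net/ijz/onelineforbi

All of this content is my own translation of https://github.com/stephenturner/oneliners

Quoting is welcome, but please cite the source URL.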

Basic awk & sed

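Extract columns 2, 4, and 5 from a file:

awk '{print $2,$4,$5}' file.txt

Print lines where column 5 equals abc123:

awk '$5 == "abc123"' file.txt

Print lines where column 5 is not abc123:

awk '$5 != "abc123"' file.txt

Print lines where column 7 begins with a letter a-f:

awk '$7  ~ /^[a-f]/' file.txt

Print lines where column 7 does not begin with a letter a-f:

awk '$7 !~ /^[a-f]/' file.txt

Keep only the first line for each distinct value in column 2 (values already seen are recorded in the hash arr, so each value is kept once):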

awk '!arr[$2]++' file.txt
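
The `!arr[$2]++` idiom is terse, so here is a rough long-form sketch of the same logic. The sample rows and the names `demo`, `seen`, and `first_only` are invented for illustration and are not part of the original one-liner.

```shell
# Sample three-column input; column 2 repeats the value "x".
demo=$(printf 'r1 x 10\nr2 y 20\nr3 x 30\n')

# Long form of awk '!arr[$2]++':
# print a line only the first time its column-2 value appears.
first_only=$(printf '%s\n' "$demo" | awk '{
    if (!($2 in seen)) {   # value not encountered yet
        print              # keep this line
    }
    seen[$2] = 1           # remember the value
}')

printf '%s\n' "$first_only"
```

Only the first line carrying each column-2 value survives, so the third sample row is dropped.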

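Print lines where the value in column 3 is greater than the value in column 5:

awk '$3>$5' file.txt

Sum column 1 of the file and print the total:

awk '{sum+=$1} END {print sum}' file.txt

Compute the mean of column 2: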

awk '{x+=$2}END{print x/NR}' file.txt
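
Because `x/NR` divides by the total number of lines, a header row or blank lines would skew the mean. The following sketch counts only data rows; it assumes a single header line, which the original command does not, and the sample data and variable names are made up.

```shell
# Sample input with a one-line header (an assumption for this sketch).
data=$(printf 'id value\na 2\nb 4\nc 9\n')

# Skip the header (NR > 1) and rows with fewer than two fields,
# then divide by the number of rows actually summed.
mean=$(printf '%s\n' "$data" |
    awk 'NR > 1 && NF >= 2 { sum += $2; n++ } END { if (n) print sum / n }')

echo "$mean"
```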

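Replace every occurrence of foo with bar:

sed 's/foo/bar/g' file.txt

Remove leading spaces and tabs from each line:

sed 's/^[ \t]*//' file.txt

Remove trailing spaces and tabs from each line:

sed 's/[ \t]*$//' file.txt

Remove both leading and trailing spaces and tabs:

sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

Delete blank lines:

sed '/^$/d' file.txt

Delete the line containing 'EndOfUsefulData' and every line after it: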

sed -n '/EndOfUsefulData/,$!p' file.txt
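
The same truncation can be sketched in awk, which stops reading at the marker instead of scanning the rest of the file. The sample text and the variable names `text` and `kept` are illustrative only.

```shell
# Sample input: two useful lines, the marker, then a line to discard.
text=$(printf 'keep1\nkeep2\nEndOfUsefulData\ndrop\n')

# Exit as soon as the marker line is seen; print every line before it.
kept=$(printf '%s\n' "$text" | awk '/EndOfUsefulData/ { exit } { print }')

printf '%s\n' "$kept"
```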

awk & sed for bioinformatics

Return all lines on chromosome 1 between 1 Mb and 2 Mb in file.txt, assuming the chromosome is in column 1 and the position in column 3. The same idea can be used to return only variants above a specific allele frequency:

cat file.txt | awk '$1=="1"' | awk '$3>=1000000' | awk '$3<=2000000'
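
The three chained awk calls can probably be folded into one process. This is a minimal sketch that assumes tab-delimited input; the sample records and the variable names are invented.

```shell
# Sample tab-delimited records: chromosome, ID, position.
sample=$(printf '1\trs1\t500000\n1\trs2\t1500000\n2\trs3\t1500000\n1\trs4\t2500000\n')

# One awk process with all three conditions combined by &&.
in_region=$(printf '%s\n' "$sample" |
    awk -F'\t' '$1 == "1" && $3 >= 1000000 && $3 <= 2000000')

printf '%s\n' "$in_region"
```

Only the record on chromosome 1 at position 1500000 passes all three tests.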

Basic sequence statistics. Print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, its frequency, and its percentage of the total in a FASTQ file (myfile.fq below):

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

Convert .bam to .fastq:

samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq

Keep only top bit scores in blast hits (best bit score only):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt

Keep only top bit scores in blast hits (5 less than the top):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt

Split a multi-FASTA file into individual FASTA files:

awk '/^>/{s=++d".fa"} {print > s}' multi.fa

Print the name and length of each sequence in a FASTA file:

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
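
The command above prints an empty first line (the count is flushed before any sequence has been read) and keeps at most 100 characters of each name. A sketch that avoids both, using invented sample sequences and variable names:

```shell
# Sample FASTA with one multi-line record and one single-line record.
fa=$(printf '>seq1 desc\nACGT\nAC\n>seq2\nGGG\n')

lengths=$(printf '%s\n' "$fa" | awk '
    /^>/ {
        if (name != "") print name "\t" len   # flush the previous record
        name = substr($0, 2); len = 0         # start a new record
        next
    }
    { len += length($0) }                     # accumulate sequence length
    END { if (name != "") print name "\t" len }
')

printf '%s\n' "$lengths"
```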

Convert a FASTQ file to FASTA:

sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
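
Addresses of the form `first~step`, such as `1~4`, are a GNU sed extension, so the command above may not work with other sed implementations. A minimal awk sketch of the same conversion, assuming strict 4-line FASTQ records (the sample reads and variable names are invented):

```shell
# Two sample 4-line FASTQ records.
fq=$(printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n')

# Line 1 of each record: replace the leading "@" with ">".
# Line 2 of each record: print the sequence unchanged.
fasta=$(printf '%s\n' "$fq" |
    awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }')

printf '%s\n' "$fasta"
```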

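Print every fourth line starting from line 2 (extracts the sequences from a FASTQ file):

sed -n '2~4p' file.fq

Print everything except the first line:

awk 'NR>1' input.txt

Print lines 20-80:

awk 'NR>=20&&NR<=80' input.txt

Compute the sum of columns 2 and 3 and append it to the end of each line:

awk '{print $0,$2+$3}' input.txt

Compute the mean read length of a FASTQ file: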

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq
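
Dividing by `NR/4` assumes the file holds exactly four lines per read with nothing trailing. A slightly more defensive sketch counts the sequence lines it actually sums; the sample reads and variable names are made up.

```shell
# Two sample reads of length 4 and 2.
fq=$(printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n')

# Sum the length of every sequence line and divide by their count.
mean_len=$(printf '%s\n' "$fq" |
    awk 'NR % 4 == 2 { sum += length($0); n++ } END { if (n) print sum / n }')

echo "$mean_len"
```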

Convert a VCF file to a BED file:

sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'
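
The conversion turns the 1-based VCF position into a 0-based BED start. The sketch below does everything in one awk call and strips `chr` only from the chromosome column, whereas the `sed` step above removes the first `chr` anywhere on a line. The sample record and variable names are invented.

```shell
# Minimal sample VCF: two header lines and one variant.
vcf=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\trs1\tA\tG\n')

# Skip header lines, drop a leading "chr" from column 1,
# and emit: chrom, start (POS - 1), end (POS), REF/ALT, strand.
bed=$(printf '%s\n' "$vcf" | awk 'BEGIN { FS = OFS = "\t" }
    !/^#/ { sub(/^chr/, "", $1); print $1, $2 - 1, $2, $4 "/" $5, "+" }')

printf '%s\n' "$bed"
```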

sort, uniq, cut, etc.

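Print a file with line numbers:

cat -n file.txt

Count the number of unique lines:

cat file.txt | sort -u | wc -l

Find lines shared by two files (assumes neither file contains duplicate lines; pipe to 'wc -l' to count the shared lines):

sort file1 file2 | uniq -d

# Safer approach
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d

# Using comm (both files must already be sorted)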
comm -12 file1 file2
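
`comm` expects both inputs to be sorted, so for unsorted files a sorting step is needed first. A self-contained sketch using a scratch directory; the file contents and names under `workdir` are hypothetical.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'b\na\nc\n' > "$workdir/file1"
printf 'c\nd\nb\n' > "$workdir/file2"

# Sort (and de-duplicate) each input before handing it to comm.
sort -u "$workdir/file1" > "$workdir/a.sorted"
sort -u "$workdir/file2" > "$workdir/b.sorted"

# -12 suppresses lines unique to either file, leaving the shared ones.
common=$(comm -12 "$workdir/a.sorted" "$workdir/b.sorted")
printf '%s\n' "$common"

rm -r "$workdir"
```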

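Sort a file numerically by column 9 (g: general numeric sort; k: key column):

sort -gk9 file.txt

Find the most common strings in column 2:

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head

Pick 10 random lines from a file:

shuf file.txt | head -n 10

Print all possible 3-mer DNA sequences (uses bash brace expansion):

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to separate them into /1 and /2 files, assuming the /1 reads precede the /2 reads: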

cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq
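
The `>( ... )` process substitution above is a bash feature. As an alternative sketch, a single awk call can route each block of eight lines to the two output files; this assumes strict 4-line records with mates adjacent, and the sample reads and `workdir` paths are invented.

```shell
workdir=$(mktemp -d)
printf '@r1/1\nAC\n+\nII\n@r1/2\nGT\n+\nII\n' > "$workdir/interleaved.fq"

# Lines 1-4 of every 8-line block go to mate 1, lines 5-8 to mate 2.
awk -v out1="$workdir/deinterleaved_1.fq" -v out2="$workdir/deinterleaved_2.fq" '
    (NR - 1) % 8 < 4 { print > out1; next }
    { print > out2 }
' "$workdir/interleaved.fq"

mate1=$(cat "$workdir/deinterleaved_1.fq")
mate2=$(cat "$workdir/deinterleaved_2.fq")
rm -r "$workdir"
```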

Take a FASTA file with a bunch of short scaffolds, e.g., labeled >Scaffold12345, remove them, and write a new FASTA file without them:

samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa

Display hidden control characters:

python -c "f = open('file.txt', 'r'); f.seek(0); file = f.readlines(); print(file)"

find, xargs, and GNU parallel

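Download GNU parallel from https://www.gnu.org/software/parallel/.

Search the current folder and its subdirectories for names ending in .bam (directories are matched too):

find . -name "*.bam"

Delete every file found by the search above (irreversible and dangerous; use with caution and double-check before deleting):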

find . -name "*.bam" | xargs rm
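
Plain `xargs` splits its input on whitespace, so a file name containing a space would be broken into several arguments. A more defensive sketch uses NUL-separated names (`-print0` and `-0`, available in GNU and BSD tools) and `-type f` to skip directories. It runs in a scratch directory with made-up file names.

```shell
workdir=$(mktemp -d)
touch "$workdir/a.bam" "$workdir/with space.bam" "$workdir/keep.txt"

# NUL-separated names survive spaces and newlines in file names.
find "$workdir" -name '*.bam' -type f -print0 | xargs -0 rm -f

remaining=$(ls "$workdir")
echo "$remaining"
rm -r "$workdir"
```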

Rename all .txt files to .bak (e.g., to back up files before doing something else to *.txt):

find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh
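
The pipeline above builds `mv` commands as text and feeds them to `sh`, which can misbehave on names containing spaces or shell metacharacters. One possible alternative passes the names as arguments instead; the scratch directory and file names here are hypothetical.

```shell
workdir=$(mktemp -d)
touch "$workdir/notes.txt" "$workdir/my data.txt"

# Each matching path arrives as a separate argument ("$@"),
# and ${f%.txt} strips the suffix before .bak is appended.
find "$workdir" -name '*.txt' -type f -exec sh -c '
    for f in "$@"; do
        mv -- "$f" "${f%.txt}.bak"
    done
' sh {} +

renamed=$(ls "$workdir")
printf '%s\n' "$renamed"
rm -r "$workdir"
```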

Chastity filter raw Illumina data: grep reads containing :N:, append (-A) the three lines after the match containing the sequence and quality information, and write a new filtered FASTQ file:

find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"

Run FastQC in parallel, 12 jobs at a time:

find *.fq | parallel -j 12 "fastqc {} --outdir ."

Index your bam files in parallel, using --dry-run to print the commands for checking without actually running them:

find *.bam | parallel --dry-run 'samtools index {}'

seqtk

Seqtk is hosted at https://github.com/lh3/seqtk. It is a fast, lightweight tool for processing sequences in FASTA and FASTQ format. It parses and converts both formats seamlessly and also supports gzip-compressed files.

Convert FASTQ to FASTA:

seqtk seq -a in.fq.gz > out.fa

Convert ILLUMINA 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):

seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa

Fold long FASTA/Q lines and remove FASTA/Q comments:

seqtk seq -Cl60 in.fa > out.fa

Convert multi-line FASTQ to 4-line FASTQ:

seqtk seq -l0 in.fq > out.fq

Reverse complement FASTA/Q:

seqtk seq -r in.fq > out.fq

Extract sequences with names in file name.lst, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

Extract sequences in regions contained in file reg.bed:

seqtk subseq in.fa reg.bed > out.fa

Mask regions in reg.bed to lowercase:

seqtk seq -M reg.bed in.fa > out.fa

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

Trim low-quality bases from both ends using the Phred algorithm:

seqtk trimfq in.fq > out.fq

Trim 5bp from the left end of each read and 10bp from the right end:

seqtk trimfq -b 5 -e 10 in.fa > out.fa

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to separate them into /1 and /2 files, assuming the /1 reads precede the /2 reads:

seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq

GFF3 Annotations

Print all sequences annotated in a GFF3 file:

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq

Determine all feature types annotated in a GFF3 file:

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq

Determine the number of genes annotated in a GFF3 file:

grep -c $'\tgene\t' yourannots.gff3

Extract all gene IDs from a GFF3 file:

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'

Print the length of each gene in a GFF3 file:

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
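
The same calculation can be sketched with awk alone, matching the feature type in column 3 exactly instead of searching the whole line for the text. The sample annotation and variable names are invented for illustration.

```shell
# Sample GFF3: one gene (100-200) and one exon that must be ignored.
gff=$(printf '##gff-version 3\nchr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1\nchr1\tsrc\texon\t100\t150\t.\t+\t.\tID=e1\n')

# GFF3 coordinates are 1-based and inclusive, hence end - start + 1.
gene_len=$(printf '%s\n' "$gff" | awk -F'\t' '$3 == "gene" { print $5 - $4 + 1 }')

echo "$gene_len"
```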

FASTA header lines to GFF format (assuming the length is in the header as an appended "_length", as in Velvet assembled transcripts):

grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff

Other generally useful aliases for your .bashrc

Change the prompt to the form user@hostname:/full/path/cwd/:$

export PS1="\u@\h:\w\\$ "

Avoid repeatedly typing commands like cd ../../.. (you can also use [autojump](https://github.com/joelthelion/autojump), which lets you change directories at lightning speed):

alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'

Browse up and down:

alias u='clear; cd ../; pwd; ls -lhGgo'
alias d='clear; cd -; ls -lhGgo'

Ask for confirmation before overwriting or removing files:

alias mv="mv -i"
alias cp="cp -i"  
alias rm="rm -i"

My favorite "ls" aliases:

alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'

Use cut on space-delimited and comma-delimited files:

alias cuts="cut -d \" \""
alias cutc="cut -d \",\""

Pack and unpack tar.gz files:

alias tarup="tar -zcf"
alias tardown="tar -zxf"

Or use a more general-purpose 'extract' function:

# As suggested by Mendel Cooper in the Advanced Bash Scripting Guide (ABSG)

extract () {
   # Quote "$1" so that file names containing spaces are handled correctly
   if [ -f "$1" ] ; then
       case "$1" in
        *.tar.bz2)      tar xvjf "$1" ;;
        *.tar.gz)       tar xvzf "$1" ;;
        *.tar.xz)       tar Jxvf "$1" ;;
        *.bz2)          bunzip2 "$1" ;;
        *.rar)          unrar x "$1" ;;
        *.gz)           gunzip "$1" ;;
        *.tar)          tar xvf "$1" ;;
        *.tbz2)         tar xvjf "$1" ;;
        *.tgz)          tar xvzf "$1" ;;
        *.zip)          unzip "$1" ;;
        *.Z)            uncompress "$1" ;;
        *.7z)           7z x "$1" ;;
        *)              echo "don't know how to extract '$1'..." ;;
       esac
   else
       echo "'$1' is not a valid file!"
   fi
}

Use the mcd function to create a directory and cd into it:

function mcd { mkdir -p "$1" && cd "$1";}

Go up to the parent directory and list its contents:

alias u="cd ..;ls"

A nicer-looking grep:

alias grep="grep --color=auto"

Reload your .bashrc:

alias refresh="source ~/.bashrc"

Edit your .bashrc:

alias eb="vi ~/.bashrc"

Aliases for common typos:

alias mf="mv -i"
alias mroe="more"
alias c='clear'

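Use pandoc to convert a markdown document to PDF:

# USAGE: mdpdf document.md -o document.md.pdf
alias mdpdf="pandoc -s -V geometry:margin=1in -V documentclass:article -V fontsize=12pt"

Find text in any file under the current directory, e.g. ft "mytext" "*.txt" (quote the pattern so that find, not the shell, expands it):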

function ft { find . -name "$2" -exec grep -il "$1" {} \;; }

Etc

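Rerun the previous command as root:

sudo !!

Insert the last argument of the most recent command (usually a file) at the cursor:

'ALT+.' or '<ESC> .'

Type a partial command, cut it, look up the command you forgot, paste the partial command back, and keep typing (<CTRL+u> cuts everything before the cursor; <CTRL+y> pastes back what the last <CTRL+u> cut):

<CTRL+u> [...] <CTRL+y>

Jump to a directory, run a command there, and end up back in the current directory (the parentheses run the commands in a subshell):

(cd /tmp && ls)

Stopwatch (press Enter or ctrl-d to stop):

time read

Put the last command you ran into a script:

echo "!!" > foo.sh

Reuse all arguments of the previous command:

!*

List or delete all files in a directory that do not match a given suffix (e.g., list all files that are not gzip-compressed; delete every file that does not end in .foo or .bar). In bash, this pattern syntax requires 'shopt -s extglob':

ls !(*.gz)
rm !(*.foo|*.bar)

Reuse the previous command without its last argument (and supply a new one):

!:- <new_last_argument>

Open a quick editor to type and edit a long, complex, or tricky command: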

fc

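Print one specific line (e.g., line 42):

sed -n 42p <file>

Terminate a frozen SSH session (press Enter, then ~, then .):

[ENTER]~.

Use grep to strip blank lines from a file and save the result to a new file:

grep . filename > newfilename

Find large files (e.g., larger than 500M):

find . -type f -size +500M

Cut all columns except one (e.g., drop the fifth field of a tab-delimited file; --complement is a GNU cut option):

cut -f5 --complement

Find files containing specific text (-l prints only file names, -i ignores case, -r recurses into subdirectories):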

grep -lir "some text" *

Original article: https://blog.csdn.net/weixin_33725722/article/details/91758562
