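Why Not Ack? Comparing the Search Performance of Grep, Ack, and Ag

I (@董偉明9) often see programmers and ops engineers using ack, or even ag (the_silver_searcher), for code search, while in my own work I use grep 95% of the time and ag for the rest. I think this is a topic worth talking about.

I used to be an ops engineer myself, and back then I also wanted to find the best and fastest tool for every part of my work. But I was curious why ag and ack were not built into Linux distributions, while grep always was. My understanding at the time was that it came down to open-source license restrictions or the personal preferences of the distribution maintainers. Later I ran an experiment to find out which of them was actually faster. All I did then was run them against a few real production logs and look at the elapsed time. That left me with a rule of thumb: grep is fine most of the time, and ag is worth using when the logs are very large.

ack's original domain was betterthangrep.com; it is now beyondgrep.com. Fair enough: I understand the people who use ack, and I understand why ack came about. There is a story behind this.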


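When I started out in ops I used shell and often analyzed logs, which meant writing fairly complex shell code for specific requirements. Then a colleague who knew perl joined. A task I had written in 20-odd lines of shell, taking about 5 minutes per run, he rewrote in 4 lines of perl that finished in one minute. We were dazzled, and from then on I felt I needed to learn perl, and later python.

perl is a language born for text parsing, and ack (which is written in perl) is indeed efficient. I suspect that is why people believe ack is faster and more suitable. In practice it depends on the scenario. So why do I still use plain old grep?

Have a look at this article; I hope it gives you some ideas. If you don't have the patience for the detailed test procedure, you can skip straight to the conclusions:

When the total amount of data searched is small, grep, ack, and even ag feel about the same

When the total amount of data searched is large, grep's performance drops off sharply and it is not worth choosing at all

In some scenarios ack is slower than grep (for example, when searching Chinese text with -v)

As long as you don't need an option that ag has not implemented, ag can fully replace ack and grep

ps: A serious disclaimer: this experiment reflects my own practice, and I tried to make it as sound as I could. If you disagree after reading, try approaching it from another angle and discuss it with me.

I used one of my company's development machines (gentoo).

<code># If you are on ubuntu: sudo apt-get install miscfiles</code>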

<code>wget https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big</code>

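I split the test data into English and Chinese files, with sizes of 1 MB, 10 MB, 100 MB, 500 MB, 1 GB, and 5 GB. I did not go larger because I don't think a single log file gets much bigger than that in real workloads, so there was no need to test it (and if one does, you can extrapolate from the trend in the results below). The test files are generated with the following program: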

<code>cat make_words.py</code>

<code># coding=utf-8</code>

<code>import os</code>

<code>import random</code>

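<code>from cStringIO import StringIO  # Python 2 only</code>

<code>en_word_file = '/usr/share/dict/words'</code>

<code>cn_word_file = 'dict.txt.big'  # the jieba dictionary downloaded above</code>

<code>with open(en_word_file) as f:</code>

<code>    en_data = f.readlines()</code>

<code>with open(cn_word_file) as f:</code>

<code>    cn_data = f.readlines()</code>

<code>mb = 1024 ** 2</code>

<code>size_list = [1, 10, 100, 500, 1024, 1024 * 5]  # sizes in MB, used in the file names</code>

<code>en_result_format = 'text_{0}_en_mb.txt'</code>

<code>cn_result_format = 'text_{0}_cn_mb.txt'</code>

<code>def write_data(f, size, data, cn=False):</code>

<code>    # Write lines of 10000 random words until more than `size` bytes are written</code>

<code>    total_size = 0</code>

<code>    while 1:</code>

<code>        s = StringIO()</code>

<code>        for x in range(10000):</code>

<code>            cho = random.choice(data)</code>

<code>            # dict.txt.big lines carry extra columns; keep only the word</code>

<code>            cho = cho.split()[0] if cn else cho.strip()</code>

<code>            s.write(cho)</code>

<code>        total_size += s.tell()</code>

<code>        contents = s.getvalue()</code>

<code>        f.write(contents + '\n')</code>

<code>        if total_size > size:</code>

<code>            break</code>

<code>    f.close()</code>

<code>for index, size in enumerate([mb, mb * 10, mb * 100, mb * 500, mb * 1024, mb * 1024 * 5]):</code>

<code>    size_name = size_list[index]</code>

<code>    en_f = open(en_result_format.format(size_name), 'a+')</code>

<code>    cn_f = open(cn_result_format.format(size_name), 'a+')</code>

<code>    write_data(en_f, size, en_data)</code>

<code>    write_data(cn_f, size, cn_data, True)</code>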

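OK, that's rather slow, right? I don't have a VPS of my own, and on a company server I can't just max out every CPU core for no reason (I haven't been in ops for several years). If you don't mind every core going red in htop, you can do it this way, where the total time is bounded by the slowest file to generate. This is the multiprocess version of the test-file generator: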

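<code>import multiprocessing</code>

<code># Reuses write_data, en_data, cn_data and the *_result_format strings from make_words.py</code>

<code>inputs = []</code>

<code>def map_func(args):</code>

<code>    # Pool.map passes a single argument, so unpack the tuple here</code>

<code>    filename, size, data, cn = args</code>

<code>    _f = open(filename, 'a+')</code>

<code>    write_data(_f, size, data, cn)</code>

<code>for size_name in (1, 10, 100, 500, 1024, 5120):</code>

<code>    size = size_name * 1024 * 1024  # target size in bytes</code>

<code>    inputs.append((en_result_format.format(size_name), size, en_data, False))</code>

<code>    inputs.append((cn_result_format.format(size_name), size, cn_data, True))</code>

<code>pool = multiprocessing.Pool()</code>

<code>pool.map(map_func, inputs, chunksize=1)</code>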
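The generator scripts above are written for Python 2. For readers on Python 3, a minimal sketch of the same idea might look like the following. The names `write_words`, `words_per_line`, and `seed` are introduced here for illustration and are not part of the original script; sizes are counted in UTF-8 bytes, which is an assumption about how the output should be measured.

```python
import random


def write_words(path, target_bytes, words, words_per_line=10000, seed=None):
    """Append lines of randomly chosen words to `path` until more than
    `target_bytes` bytes of word data have been written.

    Returns the number of word bytes written (newlines not counted).
    """
    rng = random.Random(seed)  # hypothetical: a seed makes runs repeatable
    written = 0
    # newline="\n" keeps line endings identical on every platform
    with open(path, "a", encoding="utf-8", newline="\n") as out:
        while written <= target_bytes:
            # Words are concatenated without separators, one long line at a time
            line = "".join(rng.choice(words) for _ in range(words_per_line))
            out.write(line + "\n")
            written += len(line.encode("utf-8"))
    return written
```

Calling `write_words('text_1_en_mb.txt', 1024 ** 2, words)` with a list of stripped words would produce a file of roughly 1 MB, in the same spirit as the scripts above.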

After waiting a while, the test files were generated. The directory looks like this:

<code>total 14g</code>

<code>-rw-rw-r-- 1 vagrant vagrant 2.2k mar 14 05:25 benchmarks.ipynb</code>

<code>-rw-rw-r-- 1 vagrant vagrant 8.2m mar 12 15:43 dict.txt.big</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.2k mar 12 15:46 make_words.py</code>

<code>-rw-rw-r-- 1 vagrant vagrant 101m mar 12 15:47 text_100_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 101m mar 12 15:47 text_100_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1g mar 12 15:54 text_1024_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1g mar 12 15:51 text_1024_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 11m mar 12 15:47 text_10_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 11m mar 12 15:47 text_10_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1m mar 12 15:47 text_1_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1m mar 12 15:47 text_1_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 501m mar 12 15:49 text_500_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 501m mar 12 15:48 text_500_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 5.1g mar 12 16:16 text_5120_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 5.1g mar 12 16:04 text_5120_en_mb.txt</code>

<code>$ ack --version # on ubuntu ack is called `ack-grep`</code>

<code>running under perl 5.16.3 at /usr/bin/perl</code>

<code>this program is free software. you may modify or distribute it</code>

<code>under the terms of the artistic license v2.0.</code>

<code>$ ag --version</code>

<code>ag version 0.21.0</code>

<code>$ grep --version</code>

<code>license gplv3+: gnu gpl version 3 or later <http://gnu.org/licenses/gpl.html>.</code>

<code>this is free software: you are free to change and redistribute it.</code>

<code>there is no warranty, to the extent permitted by law.</code>

<code>written by mike haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/authors>.</code>

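To keep parallel runs from interfering with each other, I chose the much slower sequential execution, timed with %timeit from ipython. The code of the test program is as follows: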

<code># Run inside IPython: %timeit is an IPython magic, not plain Python</code>

<code>import re</code>

<code>import glob</code>

<code>import subprocess</code>

<code>import cPickle as pickle  # Python 2 only</code>

<code>from collections import defaultdict</code>

<code>imap = {</code>

<code>    # the 'cn' entry (two Chinese search words) is not preserved in this listing</code>

<code>    'en': ('four', 'python')</code>

<code>}</code>

<code>options = ('', '-i', '-v')</code>

<code>files = glob.glob('text_*_mb.txt')</code>

<code>en_res = defaultdict(dict)</code>

<code>cn_res = defaultdict(dict)</code>

<code>res = {'en': en_res, 'cn': cn_res}</code>

<code>regex = re.compile(r'text_(\d+)_(\w+)_mb\.txt')</code>

<code>call_str = '{command} {option} {word} {filename} > /dev/null 2>&1'</code>

<code>for filename in files:</code>

<code>    size, xn = regex.search(filename).groups()</code>

<code>    for word in imap[xn]:</code>

<code>        _r = defaultdict(dict)</code>

<code>        for command in ['grep', 'ack', 'ag']:</code>

<code>            for option in options:</code>

<code>                rs = %timeit -o -n10 subprocess.call(call_str.format(command=command, option=option, word=word, filename=filename), shell=True)</code>

<code>                best = rs.best</code>

<code>                _r[command][option] = best</code>

<code>        res[xn][word][size] = _r</code>

<code># Save the results</code>

<code>data = pickle.dumps(res)</code>

<code>with open('result.db', 'w') as f:</code>

<code>    f.write(data)</code>
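The harness above depends on %timeit from ipython. Outside IPython, the same best-of-N idea can be sketched with the standard library alone. `best_of` and `build_argv` are hypothetical helper names, and taking the minimum over single runs is a simplification, not a reproduction of how %timeit accounts for loops and repeats. Arguments are passed as a list, so no shell is involved and output is discarded through `subprocess.DEVNULL`.

```python
import subprocess
import time


def build_argv(command, option, word, filename):
    """Build the argument list; an empty option is simply left out."""
    return [command] + ([option] if option else []) + [word, filename]


def best_of(argv, runs=10):
    """Run `argv` `runs` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal I/O does not dominate the measurement
        subprocess.call(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `best_of(build_argv('grep', '-i', 'python', 'text_10_en_mb.txt'))` would time ten sequential grep runs and keep the fastest.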

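A friendly reminder: this is an extremely time-consuming test. Once it starts, you'll be drinking tea for a long while…

I've finished my errands in Qinhuangdao (the run took more than a day), so let's continue with the experiment.

At work I generally use just three forms: no option, -i (ignore case), and -v (select non-matching lines). So the test covers:

English search and Chinese search

Two search words (the run is too slow, otherwise I might have picked more)

Three option settings, tested separately: ''/'-i'/'-v'

With %timeit, each condition is run 10 times and the best result is kept

Each chart represents one search word, the 3 search commands, and one option, comparing efficiency across the different file sizes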
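The grouping described in the last item can be sketched as a small reshaping step. This assumes the timings for one language are nested as word, then file size, then command, then option; only the innermost `command`/`option` levels are visible in the test code, so the outer layout is an assumption. `chart_series` is a hypothetical helper, and the numbers in `example` are placeholders, not measurements.

```python
def chart_series(results, word, option):
    """Return {command: [(size_mb, seconds), ...]} for one search word and
    one option, with the points sorted by file size."""
    series = {}
    for size, by_command in results[word].items():
        for command, by_option in by_command.items():
            # Sizes are stored as strings taken from the file names
            series.setdefault(command, []).append((int(size), by_option[option]))
    for points in series.values():
        points.sort()
    return series


# Placeholder timings for illustration only, not measurements.
example = {
    "four": {
        "10": {"grep": {"-i": 0.2}, "ag": {"-i": 0.1}},
        "1": {"grep": {"-i": 0.02}, "ag": {"-i": 0.01}},
    }
}
print(chart_series(example, "four", "-i"))
```

Each value in the returned dictionary is one line on a chart: file size on the x-axis, best time on the y-axis.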

[Charts: search time by file size for each search word and option, comparing grep, ack, and ag]
