Why Not Ack? Comparing the Search Speed of Grep, Ack, and Ag

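I (@董伟明9) often see programmers and ops engineers using ack, or even ag (the_silver_searcher), to search code, whereas in my own work I use grep 95% of the time and ag for the rest. I think this topic is well worth discussing.

I used to be an ops engineer myself, and back then I also wanted to find the best and fastest tool for every part of my work. But I was curious why ag and ack were not shipped as built-in parts of Linux distributions, while grep always was. My initial guess was that it came down to open-source license restrictions, or to the personal preferences of whoever ran the distributions. Later I ran an experiment to find out which of them was actually fastest. My method at the time was nothing more than running them against a few real production logs and checking the elapsed time. From that I formed my own view: most of the time grep is perfectly fine, and when the logs are very large, use ag.

ack's original domain was betterthangrep.com; it is now beyondgrep.com. To be fair, I understand the people who use ack, and I understand why ack came into being. There is a story behind this.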


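When I first worked in ops I used shell and often did log analysis. Back then I frequently wrote fairly complex shell code to meet specific requirements. Then a colleague who knew Perl joined. For one task I had written 20-odd lines of shell that took about five minutes per run; this colleague rewrote it in 4 lines of Perl that finished in one minute. We were dazzled. From then on I felt I needed to learn Perl, which later led me to Python.

Perl is a language born for text parsing, and ack really is efficient. I suppose that is why people think ack is faster and more suitable. But it actually depends on the scenario. So why do I still use the rather 'old-fashioned' grep?

Have a look at this article; I hope it gives you some ideas. If you don't have the patience for the detailed test procedure, you can go straight to the conclusions: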

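When the total amount of data searched is small, there is little perceptible difference between grep, ack, and even ag

When the total amount of data searched is large, grep's efficiency drops off sharply; do not choose it at all

In some scenarios ack is less efficient than grep (for example, when searching Chinese text with -v)

As long as you do not need an option that ag has not implemented, ag can fully replace ack and grep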

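P.S. A serious disclaimer: this experiment reflects my own hands-on testing, and I have tried to keep it as sound as I can. If you disagree after reading, try approaching it from other angles and discuss it with me.

I used one of my company's development machines (Gentoo).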

<code># If you are on Ubuntu: sudo apt-get install miscfiles</code>

<code>wget https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big</code>

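I split the test data into English and Chinese files, with sizes of 1 MB, 10 MB, 100 MB, 500 MB, 1 GB, and 5 GB. I did not go larger because I don't think a single log file gets much bigger than that in real workloads, so there was no need to test it (and if one does, you can extrapolate from the trend in the results below). The following program generates the test files: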

<code>cat make_words.py</code>

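<code># coding=utf-8</code>

<code># Python 2 script: generates the English and Chinese test files.</code>

<code>import os</code>

<code>import random</code>

<code>from cStringIO import StringIO</code>

<code>en_word_file = '/usr/share/dict/words'</code>

<code># Assumption: the Chinese word list is the dict.txt.big downloaded above.</code>

<code>cn_word_file = 'dict.txt.big'</code>

<code>with open(en_word_file) as f:</code>

<code>    en_data = f.readlines()</code>

<code>with open(cn_word_file) as f:</code>

<code>    cn_data = f.readlines()</code>

<code>mb = 1024 * 1024</code>

<code># Sizes in MB, inferred from the generated file names (1 ... 5120).</code>

<code>size_list = [1, 10, 100, 500, 1024, 5120]</code>

<code>en_result_format = 'text_{0}_en_mb.txt'</code>

<code>cn_result_format = 'text_{0}_cn_mb.txt'</code>

<code>def write_data(f, size, data, cn=False):</code>

<code>    total_size = 0</code>

<code>    while 1:</code>

<code>        s = StringIO()</code>

<code>        # Build one output line from 10000 randomly chosen words.</code>

<code>        for x in range(10000):</code>

<code>            cho = random.choice(data)</code>

<code>            # dict.txt.big lines hold several fields; keep only the word.</code>

<code>            cho = cho.split()[0] if cn else cho.strip()</code>

<code>            s.write(cho)</code>

<code>        total_size += s.tell()</code>

<code>        contents = s.getvalue()</code>

<code>        f.write(contents + '\n')</code>

<code>        if total_size > size:</code>

<code>            break</code>

<code>    f.close()</code>

<code>for index, size in enumerate([n * mb for n in size_list]):</code>

<code>    size_name = size_list[index]</code>

<code>    en_f = open(en_result_format.format(size_name), 'a+')</code>

<code>    cn_f = open(cn_result_format.format(size_name), 'a+')</code>

<code>    write_data(en_f, size, en_data)</code>

<code>    write_data(cn_f, size, cn_data, True)</code>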

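OK, that's fairly slow, right? I don't have a VPS of my own, and I can't casually max out every CPU core on a company server (I haven't been in ops for several years now). If you don't mind all the cores going red in htop, you can do it this way, in which case the total time is bounded by the slowest file to generate. Here is the multi-process version of the test-file generator: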

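<code>import multiprocessing</code>

<code>def map_func(args):</code>

<code>    # Reconstructed wrapper: each item is a (filename, size, data, cn) tuple.</code>

<code>    filename, size, data, cn = args</code>

<code>    _f = open(filename, 'a+')</code>

<code>    write_data(_f, size, data, cn)</code>

<code>inputs = []</code>

<code>for size_name in size_list:</code>

<code>    size = size_name * 1024 * 1024  # size_list is in MB</code>

<code>    inputs.append((en_result_format.format(size_name), size, en_data, False))</code>

<code>    inputs.append((cn_result_format.format(size_name), size, cn_data, True))</code>

<code>pool = multiprocessing.Pool()</code>

<code>pool.map(map_func, inputs, chunksize=1)</code>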
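The remark that the total time is bounded by the slowest file can be checked with a small model. The sketch below is not part of the original script: it simulates a pool that hands out tasks one at a time, in order (as `chunksize=1` does), and returns the resulting wall-clock time. The function name `estimate_wall_time` and the per-file durations are hypothetical, chosen only for illustration.

```python
import heapq


def estimate_wall_time(durations, workers):
    """Simulate a worker pool that hands out tasks one at a time, in order.

    Each task goes to whichever worker becomes free first. Returns the
    time at which the last worker finishes.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


# Hypothetical generation times in seconds for six files of growing size.
durations = [1, 2, 10, 50, 100, 500]
print(estimate_wall_time(durations, workers=1))  # sequential: sum of all tasks
print(estimate_wall_time(durations, workers=6))  # one worker per file: slowest task
```

With one worker per file the estimate equals the longest single task, which is why the 5 GB files dominate the multi-process run.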

After waiting a while, the test files were generated. The directory looked like this:

<code>total 14G</code>

<code>-rw-rw-r-- 1 vagrant vagrant 2.2K Mar 14 05:25 benchmarks.ipynb</code>

<code>-rw-rw-r-- 1 vagrant vagrant 8.2M Mar 12 15:43 dict.txt.big</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.2K Mar 12 15:46 make_words.py</code>

<code>-rw-rw-r-- 1 vagrant vagrant 101M Mar 12 15:47 text_100_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 101M Mar 12 15:47 text_100_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1G Mar 12 15:54 text_1024_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1G Mar 12 15:51 text_1024_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 11M Mar 12 15:47 text_10_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 11M Mar 12 15:47 text_10_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1M Mar 12 15:47 text_1_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 1.1M Mar 12 15:47 text_1_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 501M Mar 12 15:49 text_500_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 501M Mar 12 15:48 text_500_en_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 5.1G Mar 12 16:16 text_5120_cn_mb.txt</code>

<code>-rw-rw-r-- 1 vagrant vagrant 5.1G Mar 12 16:04 text_5120_en_mb.txt</code>

<code>$ ack --version # on Ubuntu, ack is called `ack-grep`</code>

<code>Running under Perl 5.16.3 at /usr/bin/perl</code>

<code>This program is free software. You may modify or distribute it</code>

<code>under the terms of the Artistic License v2.0.</code>

<code>$ ag --version</code>

<code>ag version 0.21.0</code>

<code>$ grep --version</code>

<code>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.</code>

<code>This is free software: you are free to change and redistribute it.</code>

<code>There is NO WARRANTY, to the extent permitted by law.</code>

<code>Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/authors>.</code>

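To keep parallel runs from interfering with one another, I chose to run everything synchronously, inefficient as that is, and used the %timeit magic provided by IPython. The test program is as follows: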

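<code>import re</code>

<code>import glob</code>

<code>import subprocess</code>

<code>import cPickle as pickle</code>

<code>from collections import defaultdict</code>

<code># Search terms per language. The name `words` is assumed; the two Chinese</code>

<code># terms were not preserved in this listing, so add your own to 'cn'.</code>

<code>words = {</code>

<code>    'cn': (),</code>

<code>    'en': ('four', 'python')</code>

<code>}</code>

<code>options = ('', '-i', '-v')</code>

<code>files = glob.glob('text_*_mb.txt')</code>

<code>en_res = defaultdict(dict)</code>

<code>cn_res = defaultdict(dict)</code>

<code># Assumption: `res` groups the per-language results for pickling.</code>

<code>res = {'en': en_res, 'cn': cn_res}</code>

<code>regex = re.compile(r'text_(\d+)_(\w+)_mb.txt')</code>

<code>call_str = '{command} {option} {word} {filename} > /dev/null 2>&1'</code>

<code>for filename in files:</code>

<code>    size, xn = regex.search(filename).groups()</code>

<code>    for word in words[xn]:</code>

<code>        _r = defaultdict(dict)</code>

<code>        for command in ['grep', 'ack', 'ag']:</code>

<code>            for option in options:</code>

<code>                # IPython magic: -o returns a result object, -n10 runs 10 loops.</code>

<code>                rs = %timeit -o -n10 subprocess.call(call_str.format(command=command, option=option, word=word, filename=filename), shell=True)</code>

<code>                best = rs.best</code>

<code>                _r[command][option] = best</code>

<code>        res[xn][size][word] = _r</code>

<code>data = pickle.dumps(res)</code>

<code>with open('result.db', 'w') as f:</code>

<code>    f.write(data)</code>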
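The `%timeit` magic only works inside IPython. For readers who want to try a similar measurement from a plain script, here is a minimal Python 3 sketch that runs a command several times, one run at a time, and keeps the fastest. It approximates the "run 10 times, keep the best" idea rather than reproducing exactly how `%timeit` aggregates its loops, and the helper name `best_time` and the placeholder command are assumptions made for illustration.

```python
import subprocess
import sys
import timeit


def best_time(argv, repeat=10):
    """Run argv `repeat` times, one after another, and return the fastest run in seconds."""
    def run_once():
        # Discard output, like the '> /dev/null 2>&1' redirect in the test script.
        subprocess.call(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # number=1 so that every sample is exactly one synchronous execution.
    samples = timeit.repeat(run_once, repeat=repeat, number=1)
    return min(samples)


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; swap in something like
    # ["grep", "-i", "four", "text_1_en_mb.txt"] for a real measurement.
    cmd = [sys.executable, "-c", "pass"]
    print("best of 10: %.4f s" % best_time(cmd))
```

Passing the command as a list avoids going through a shell, so the measured time does not include shell start-up.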

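A friendly reminder: this is a very time-consuming test. Once it starts, you will be drinking tea for a long while…

I have finished my errands in Qinhuangdao (the run took more than a day), so let's continue with the experiment.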

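In day-to-day work I generally use just three forms: no option, -i (ignore case), and -v (select non-matching lines). So here is what I tested:

English search and Chinese search

Two search terms for each (the test is too slow; otherwise I might have chosen more)

Each run with one of three options: '', '-i', and '-v'

Using %timeit, each condition is executed 10 times and the best result is kept

Each chart represents one search term, three search commands, and one option, comparing efficiency across different file sizes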
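To make the three option settings concrete, here is a small pure-Python sketch of which lines each one selects. It only imitates the line-selection behaviour of no option, `-i`, and `-v` for a literal search term; it is not how grep, ack, or ag are implemented, and the helper name `select_lines` and the sample lines are made up for illustration.

```python
import re


def select_lines(lines, word, option=""):
    """Imitate which lines a search tool prints for '', '-i' or '-v'."""
    if option not in ("", "-i", "-v"):
        raise ValueError("unsupported option: %r" % option)
    flags = re.IGNORECASE if option == "-i" else 0
    pattern = re.compile(re.escape(word), flags)
    invert = option == "-v"
    # Keep a line when "it matches" differs from "we want non-matches".
    return [line for line in lines if bool(pattern.search(line)) != invert]


sample = ["Python is fun", "four python scripts", "nothing here"]
print(select_lines(sample, "python"))        # case-sensitive match
print(select_lines(sample, "python", "-i"))  # ignore case
print(select_lines(sample, "python", "-v"))  # lines that do NOT match
```

Note that `-v` has to emit every non-matching line, so on files where the term is rare it produces far more output than the other two modes.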

[Charts: search time of grep, ack, and ag by file size, one chart per search term and option]
