下面是gnu grep的原作者mikehaertel 在freebsd郵件清單中對 “gnu grep為什麼比bsd grep要快” 這個問題所做的回答,解釋了grep是如何進行快速搜尋的,下面是郵件正文内容:
why gnu grep is fast
mike haertel mike at ducky.net
sat aug 21 03:00:30 utc 2010
• previous message: latest intr problems
• next message: why gnu grep is fast
• messages sorted by: [ date ] [ thread ] [subject ] [ author ]
________________________________________
hi gabor,
i am the original author of gnu grep. i am also a freebsd user,
although i live on -stable (and older) andrarely pay attention
to -current.
however, while searching the -current mailinglist for an unrelated
reason, i stumbled across some flamageregarding bsd grep vs gnu grep
performance. you may have noticed that discussion too...
anyway, just fyi, here's a quick summary ofwhere gnu grep gets
its speed. hopefully you can carry these ideas over to bsd grep.
#1 trick: gnu grep is fast because it avoidslooking at
every input byte.
#2 trick: gnu grep is fast because itexecutes very few
instructions for each byte that it *does*look at.
gnu grep uses the well-known boyer-moorealgorithm, which looks
first for the final letter of the targetstring, and uses a lookup
table to tell it how far ahead it can skip inthe input whenever
it finds a non-matching character.
gnu grep also unrolls the inner loop ofboyer-moore, and sets up
the boyer-moore delta table entries in such away that it doesn't
need to do the loop exit test at every unrolledstep. the result
of this is that, in the limit, gnu grepaverages fewer than 3 x86
instructions executed for each input byte itactually looks at
(and it skips many bytes entirely).
see "fast string searching", byandrew hume and daniel sunday,
in the november 1991 issue of softwarepractice & experience, for
a good discussion of boyer-mooreimplementation tricks. it's
available as a free pdf online.
once you have fast search, you'll find youalso need fast input.
gnu grep uses raw unix input system calls andavoids copying data
after reading it.
moreover, gnu grep avoids breaking the inputinto lines. looking
for newlines would slow grep down by a factorof several times,
because to find the newlines it would have tolook at every byte!
so instead of using line-oriented input, gnugrep reads raw data into
a large buffer, searches the buffer usingboyer-moore, and only when
it finds a match does it go and look for thebounding newlines.
(certain command line options like -n disablethis optimization.)
finally, when i was last the maintainer ofgnu grep (15+ years ago...),
gnu grep also tried very hard to set thingsup so that the *kernel*
could also avoid handling every byte of theinput, by using mmap()
instead of read() for file input. at the time, using read() caused
most unix versions to do extra copying. since gnu grep passed out
of my hands, it appears that use of mmapbecame non-default, but you
can still get it via --mmap. and at least in cases where the data
is already file system buffer caches, mmap isstill faster:
$time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'
real 0m1.530s
user 0m0.230s
sys 0m1.357s
$time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'
real 0m1.201s
user 0m0.330s
sys 0m0.929s
[workload was a 648 megabyte mh mail foldercontaining 41000 messages]
so even nowadays, using --mmap can be worth a>20% speedup.
summary:
- use boyer-moore (and unroll its inner loopa few times).
- roll your own unbuffered input using rawsystem calls. avoid copying
theinput bytes before searching them. (do,however, use buffered
*output*. the normal grepscenario is that the amount of output is
smallcompared to the amount of input, so the overhead of output
buffer copying is small, while savings due to avoiding many small
unbuffered writes can be large.)
- don't look for newlines in the input untilafter you've found a match.
- try to set things up (page-aligned buffers,page-sized read chunks,
optionally use mmap) so the kernel can also avoid copying the bytes.
the key to making programs fast is to makethem do practically nothing. ;-)
regards,
mike
下面是翻譯:
gabor 您好,
我是gnu grep的原作者,同時我也是一名freebsd使用者,不過我一直使用的是-stable版本(一個更老的版本),沒怎麼關注-current版本。
但是,當我無意間翻閱-current版的郵件清單時,發現了一些關于bsd grep與gnu grep性能的讨論,你可能也注意到了。
不管怎麼說,僅供參考,下面是一些關于為什麼gnu grep如此之快的簡單總結。或許你能借鑒其中的一些思想運用到bsd grep中去。
#技巧1:gnu grep之是以快是因為它并不會去檢查輸入中的每一個位元組。
#技巧2:gnu grep之是以快是因為它對那些的确需要檢查的每個位元組都執行非常少的指令(操作)。
gnu grep使用了非常著名的boyer-moore算法,該算法首先從目标字元串的最後一個字元開始查找,并且使用一個查找表,它可以在發現一個不比對字元之後,計算出可以跳過多少個輸入字元并繼續查找。
gnu grep還展開了boyer-moore算法的内部循環,并建立了一個boyer-moore的delta表,這樣它就不需要在每一個展開的步驟進行循環退出判斷了。這樣的結果就是,在極限情況下,gnu grep在需要檢查的每一個輸入位元組上所執行的x86指令不會超過3條(并且還跳過了許多位元組)。
一旦有了快速搜尋,這時你會發現也需要同樣快速的輸入。
gnu grep使用了原生unix輸入系統調用并避免了在讀取後對資料進行拷貝。
而且,gnu grep還避免了對輸入進行分行,查找換行符會讓grep減慢好幾倍,因為要找換行符你就必須檢視每個位元組!
是以gnu grep沒有使用基于行的輸入,而是将原資料讀入到一個大的緩沖區buffer,用boyer-moore算法對這個緩沖區進行搜尋,隻有在發現一個比對之後才會去查找最近的換行符(某些指令參數,比如-n會禁止這種優化)。
最後,當我還在維護gnu grep的時候(15+年前……),gnu grep也嘗試做一些非常困難的事情使核心也能避免處理輸入的每個位元組,比如使用mmap()而不是read()來進行檔案輸入。當時,用read()會使大部分unix版本造成一些額外的拷貝。因為我已經不再gnu grep了,是以似乎mmap已經不再預設使用了,但是你仍然可以通過參數–mmap來啟用它,至少在檔案系統的buffer已經緩存了你的資料的情況下,mmap仍然要快一些:
1
2
3
4
5
6
7
8
$ time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'
real 0m1.530s
user 0m0.230s
sys 0m1.357s
$ time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'
real 0m1.201s
user 0m0.330s
sys 0m0.929s
[這裡使用的輸入是一個648m的mh郵件檔案夾,包含大約41000條資訊]
是以即使在今天,使用–mmap仍然可以提速20%以上。
總結:
- 使用boyer-moore算法(并且展開它的内層循環)。
- 使用原生系統調用來建立你的緩沖輸入,避免在搜尋之前拷貝輸入位元組。(無論如何,最好使用緩沖輸出,因為在grep的常用場景中,輸出的要比輸入的少,是以輸出緩沖拷貝的開銷要小,并且可以節省許多這樣小的無緩沖寫操作。)
- 在找到一個比對之前,不要查找換行符。
- 嘗試做一些設定(比如頁面對齊緩沖區,按頁大小來讀取塊,選擇性的使用mmap),這樣可以使核心避免拷貝位元組。
讓程式變得更快的關鍵就是讓它們做更少的事情。;-)
緻禮
mike