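NumericRangeQuery probably sounds trivial to you: it's just a numeric range query, right? The user supplies a lower bound and an upper bound, and the underlying API simply checks > min and < max. Is it really that simple? In Lucene, index terms are essentially strings, so a number has to be turned into a string somehow. You might take it for granted that calling toString() is enough. OK, suppose it were: how do you deal with the string "3" sorting after "26"? You could pad with leading zeros, and "03" < "26" is indeed correct, but there is no good way to decide how many zeros to add. Add too many and you waste disk space; add too few and you limit how many digits you can index. And even if you solve the width problem, a term-based range query in Lucene is essentially rewritten into a BooleanQuery that joins the matching terms, so with too many terms you can still hit the "too many boolean clauses" (TooManyClauses) exception.

What Lucene actually does internally is convert numbers (int, long, float, double) into a sortable binary form instead of decimal text. For exactly how the conversion works, see the source of the NumericUtils utility class.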
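The string-ordering problem described above is easy to reproduce with plain Java strings. This is only a small sketch: the class name `LexicographicPitfall`, the helper `pad`, and the width of 2 are made up for illustration and have nothing to do with Lucene's own code.

```java
class LexicographicPitfall {
    // Left-pad a non-negative number with zeros to a minimum width.
    // The width is an arbitrary choice, which is exactly the problem.
    static String pad(long n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Plain toString(): "3" sorts after "26" because '3' > '2'.
        System.out.println("3".compareTo("26") > 0);                  // true

        // Padding to width 2 fixes this pair: "03" sorts before "26".
        System.out.println(pad(3, 2).compareTo(pad(26, 2)) < 0);      // true

        // A value wider than the chosen width breaks the order again:
        // "100" sorts before "26" because '1' < '2'.
        System.out.println(pad(100, 2).compareTo(pad(26, 2)) < 0);    // true
    }
}
```

All three lines print true, and the last one is the failure case: once a number needs more digits than the padding width, string order and numeric order disagree again.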
    /**
     * Converts a <code>float</code> value to a sortable signed <code>int</code>.
     * The value is converted by getting their IEEE 754 floating-point "float format"
     * bit layout and then some bits are swapped, to be able to compare the result as int.
     * By this the precision is not reduced, but the value can easily used as an int.
     * @see #sortableIntToFloat
     */
    public static int floatToSortableInt(float val) {
        int f = Float.floatToRawIntBits(val);
        if (f < 0) f ^= 0x7fffffff;
        return f;
    }
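The code above is what converts a float into a sortable int. It is all bit manipulation, which can make your head spin, and fully understanding it is no easy task.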
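One way to get a feel for the trick is to run it on a few values. The sketch below copies the conversion into a standalone class and checks that the resulting ints sort in the same order as the floats. The class name `SortableFloatDemo` and the sample values are made up for illustration, and the inverse method is simply written by mirroring the same XOR.

```java
class SortableFloatDemo {
    // Same logic as the snippet above, repeated so the example is self-contained.
    static int floatToSortableInt(float val) {
        int f = Float.floatToRawIntBits(val);
        // For negative floats the sign bit is set; flipping the other 31 bits
        // reverses their order so that "more negative" becomes "smaller int".
        if (f < 0) f ^= 0x7fffffff;
        return f;
    }

    // Inverse: undo the bit flip, then reinterpret the bits as a float.
    static float sortableIntToFloat(int val) {
        if (val < 0) val ^= 0x7fffffff;
        return Float.intBitsToFloat(val);
    }

    public static void main(String[] args) {
        float[] values = {-10.5f, -0.03f, 0.0f, 0.03f, 0.10f, 26f};
        for (int i = 1; i < values.length; i++) {
            int prev = floatToSortableInt(values[i - 1]);
            int cur = floatToSortableInt(values[i]);
            // The floats are listed in ascending order, so prev < cur should hold.
            System.out.println(values[i - 1] + " < " + values[i] + " : " + (prev < cur));
        }
        // Round trip: no precision is lost.
        System.out.println(sortableIntToFloat(floatToSortableInt(0.03f)) == 0.03f);
    }
}
```

Without the XOR step, negative floats would come out in reversed order when compared as ints, because a larger magnitude gives a larger raw bit pattern.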
    /** This helper does the splitting for both 32 and 64 bit. */
    private static void splitRange(
        final Object builder, final int valSize,
        final int precisionStep, long minBound, long maxBound
    ) {
        if (precisionStep < 1)
            throw new IllegalArgumentException("precisionStep must be >=1");
        if (minBound > maxBound) return;
        for (int shift = 0; ; shift += precisionStep) {
            // calculate new bounds for inner precision
            final long diff = 1L << (shift + precisionStep),
                mask = ((1L << precisionStep) - 1L) << shift;
            final boolean
                hasLower = (minBound & mask) != 0L,
                hasUpper = (maxBound & mask) != mask;
            final long
                nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
                nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
            final boolean
                lowerWrapped = nextMinBound < minBound,
                upperWrapped = nextMaxBound > maxBound;
            if (shift + precisionStep >= valSize || nextMinBound > nextMaxBound || lowerWrapped || upperWrapped) {
                // We are in the lowest precision or the next precision is not available.
                addRange(builder, valSize, minBound, maxBound, shift);
                // exit the split recursion loop
                break;
            }
            if (hasLower)
                addRange(builder, valSize, minBound, minBound | mask, shift);
            if (hasUpper)
                addRange(builder, valSize, maxBound & ~mask, maxBound, shift);
            // recurse to next precision
            minBound = nextMinBound;
            maxBound = nextMaxBound;
        }
    }
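To be honest, I haven't fully worked through this source yet. I'll leave it as a bone to chew on later, when I have time to study the algorithm.
I've rambled a lot above, all of it touching on the design of numeric range queries at the low level, and only in outline. I haven't fully figured out the algorithms and principles behind the actual implementation either, sorry about that. If you know this algorithm well, please let me know, thanks!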
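One way to start digging into splitRange is to run the same loop outside Lucene and look at what it emits. The sketch below is a simplified copy, not the library code: the class name `SplitRangeDemo`, the list of `{min, max, shift}` triples, and the stand-in `addRange` (which only records the sub-range and fills the low bits of the upper bound so it reads as a plain interval) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeDemo {
    /** Runs the splitRange loop and returns the sub-ranges as {min, max, shift}. */
    static List<long[]> split(int valSize, int precisionStep, long minBound, long maxBound) {
        if (precisionStep < 1)
            throw new IllegalArgumentException("precisionStep must be >=1");
        List<long[]> out = new ArrayList<>();
        if (minBound > maxBound) return out;
        for (int shift = 0; ; shift += precisionStep) {
            // diff = size of one block at the next (coarser) precision;
            // mask = the bits handled at the current precision.
            final long diff = 1L << (shift + precisionStep);
            final long mask = ((1L << precisionStep) - 1L) << shift;
            // Does the bound sit in the middle of a coarser block?
            final boolean hasLower = (minBound & mask) != 0L;
            final boolean hasUpper = (maxBound & mask) != mask;
            final long nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
            final long nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
            final boolean lowerWrapped = nextMinBound < minBound;
            final boolean upperWrapped = nextMaxBound > maxBound;
            if (shift + precisionStep >= valSize || nextMinBound > nextMaxBound
                    || lowerWrapped || upperWrapped) {
                // No coarser level is usable: emit what is left and stop.
                addRange(out, minBound, maxBound, shift);
                break;
            }
            // Emit the ragged edges at the current precision...
            if (hasLower) addRange(out, minBound, minBound | mask, shift);
            if (hasUpper) addRange(out, maxBound & ~mask, maxBound, shift);
            // ...and continue with the aligned middle part at a coarser precision.
            minBound = nextMinBound;
            maxBound = nextMaxBound;
        }
        return out;
    }

    // Stand-in for Lucene's addRange: record the interval with all shifted-away bits set.
    static void addRange(List<long[]> out, long min, long max, int shift) {
        max |= (1L << shift) - 1L;
        out.add(new long[] {min, max, shift});
    }

    public static void main(String[] args) {
        for (long[] r : split(32, 4, 0x123L, 0x4567L)) {
            System.out.printf("shift=%2d  [0x%x, 0x%x]%n", r[2], r[0], r[1]);
        }
    }
}
```

For the range [0x123, 0x4567] with a precision step of 4, this yields 7 sub-ranges: small ones at shift 0, 4 and 8 for the ragged edges, and a single one at shift 12 covering [0x1000, 0x3fff]. The edges are matched at fine precision and the large middle at coarse precision, so a handful of terms is enough instead of one term per value.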
The theory behind NumericRangeQuery is hard to follow, but using it is very simple:
    Query q = NumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);
The last two boolean arguments control whether the lower and upper bounds themselves are included in the range.
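Note, however, that NumericRangeQuery only works on numeric fields such as IntField, LongField, FloatField and DoubleField.

Another important NumericRangeQuery setting is the precision step. What is a precision step? The literal meaning doesn't make it much clearer, does it? Put simply, it is how many bits are cut off at a time when slicing a value into terms. Once your number has been encoded into bits it can be quite long, so it is cut at a fixed step into several terms for indexing. Take "1111101111111011": with a precision step of 16 (the default step differs by data type, and the defaults are defined in the NumericUtils utility class) you end up with just 1 term, while with a precision step of 8 you end up with 2 terms in the index. This is why the official API docs say that a smaller precisionStep takes more disk space but generally makes searching faster: more terms certainly means more disk space.

That's all for NumericRangeQuery, thanks all.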
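To make the precision step idea concrete, the sketch below counts the terms produced for the 16-bit example above at different steps. It is a simplification under stated assumptions: the class name `PrecisionStepDemo` and the method `prefixTerms` are made up, the value is treated as 16 bits wide (real Lucene numeric fields are 32 or 64 bits), and each term is printed as a readable shift plus remaining prefix bits rather than in Lucene's actual encoded form.

```java
import java.util.ArrayList;
import java.util.List;

class PrecisionStepDemo {
    // One term per precision level: at each level the lowest `shift` bits are dropped.
    static List<String> prefixTerms(long value, int valSize, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < valSize; shift += precisionStep) {
            terms.add("shift=" + shift + " prefix=" + Long.toBinaryString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        long value = 0b1111101111111011L; // the example bit string from the text
        for (int step : new int[] {16, 8, 4}) {
            List<String> terms = prefixTerms(value, 16, step);
            System.out.println("precisionStep=" + step + " -> " + terms.size() + " term(s): " + terms);
        }
    }
}
```

A step of 16 gives 1 term, a step of 8 gives 2, and a step of 4 gives 4. Every extra term is a coarser prefix of the same value, which is what lets a range query match large blocks with a single term at the cost of a bigger index.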
If you have any other questions, add me on QQ: 7-3-6-0-3-1-3-0-5, or join the group, and let's learn together!
Reposted from: http://iamyida.iteye.com/blog/2194799