laitimes

An in-depth look at Lucene's dictionary FST

author:Flash Gene

Why are search engines so fast?

The core is how to quickly find all relevant documents according to the query terms, which is also the core idea of Inverted Index. So how to design a fast (constant, or 1) locating dictionary data structure is particularly important. To put it simply, we can use various data structures such as HashMap, TRIE, Binary Search Tree, Tenary Search Tree, etc.

So how is Lucene, an open-source search engine package, designed? Lucene uses a structure called FST (Finite State Transducer) to build dictionaries, which ensures the balance of time and space complexity, and is one of the core functions of Lucene.

关于FST(Finite State Transducer)

FST resembles a kind of TRIE tree.

使用FSM(Finite State Machines)作为数据结构

Finite State Machines (FSM): A mathematical model that represents a finite set of states and the transitions and actions between these states. One of the states is marked as start, and 0 or more states are marked as final.

An FSM is in only one state at a time. FSM is versatile and can be used to represent a variety of processes, and the FSM below describes A Kitten's Day.

An in-depth look at Lucene's dictionary FST

"Sleeping" or "eating" represents the state, while "providing food" or "moving things" represents transfer. The FSM in the figure is an abstraction of the kitten's activity (the start state or the final state is not deliberately written here), the kitten cannot be in the "playing" and "sleeping" state at the same time, and there is only one input for the transition from one state to the next. I don't know what state the "sleeping" state is transitioned from, it may be "playing" or "litter".

If the FSM of "A Kitten's Day" receives the following inputs:

  • Serve food
  • There are loud sounds
  • Be quiet
  • Digest food

Then we will clearly know that the kitten will change its state in this order: sleep - > eat - > hide - > eat - > cat litter nest.

The above is just a real-world example, let's look at how to implement an Ordered Sets and Map structure.

Ordered Sets

An Ordered Set is an ordered set. Usually an ordered set can be implemented with a binary tree or a B-tree. Unordered collections are implemented using hash tables. Here, we use a Deterministric acyclic finite state acceptor (FSA) to do this.

FSA is a type of Finite State Machine (FSM) with the following characteristics:

  • OK: Means specifying any one status, and only at most one transfer can be accessed.
  • No loop: It is not possible to traverse the same state repeatedly
  • Receiver: A finite state machine only "accepts" a specific sequence of inputs and terminates in the final state.

Let's take a look at how we represent a collection with only one key: "jul". Here's what the FSA looks like:

An in-depth look at Lucene's dictionary FST

When querying whether the FSA contains "jul", enter it in alphabetical order.

  • Enter j and FSA from 0->1
  • 输入u, FSA从1->2
  • Enter l, FSA from 2->3

At this time, the FSA is in final state 3, so "jul" is in this set.

Imagine typing "jun" and you can't move in state 2, you know that you are not in this set.

Imagine how to type "ju", in state 2, there is no input. State 2 is not final, so it is not in this set.

It is worth pointing out that the time complexity of finding whether this key is in a set depends on the length of the key, not the size of the set.

Now add another key to the FSA. The FSA now contains keys: "jul" and "mar".

An in-depth look at Lucene's dictionary FST

Start state 0 now has 2 transitions: j and m. Therefore, if you enter key: "mar", it will be transferred first following m. The final state is shared by "jul" and "mar". This allows us to represent more information in less space.

When we add "jun" to this FSA, then it has the same prefix "ju" as "jul":

An in-depth look at Lucene's dictionary FST

There are very few changes here, no new statuses added, just one more transfer.

下面来看一下由“october”,“november”,”december”构成的FSA.

An in-depth look at Lucene's dictionary FST

They have the common suffix "ber", so they only appear once in the FSA. Two of them have the common suffix "ember", which also appears only once.

So how do we iterate over all the keys represented by an FSA, let's take the previous "jul", "jun", and "mar" as examples:

An in-depth look at Lucene's dictionary FST

The traversal algorithm looks like this:

  • Initial state0, key=""
  • ->1, key="j"
  • ->2, key="ju"
  • ->3, key="jul", 找到jul
  • 2<-, key="ju"
  • ->3, key="June", 找到Jun
  • 2<-, key="ju"
  • 1<-, key="j"
  • 0<-, key=""
  • ->4, key="m"
  • ->5, key="ma",
  • ->3, k="kill",找࠸mar

The time complexity of this algorithm O(n), n is the size of all keys in the set, and the space complexity O(k), k is the longest key field length in the combination.

Ordered maps

Ordered maps就像一个普通的map,只不过它的key是有序的。 我们来看一下如何使用确定无环状态转换器(Deterministic acyclic finite state transducer, FST)来实现它。

FST is also a finite state machine (FSM) with the following properties:

  • OK: Means specifying any one state, and only at most one transition can be traversed to.
  • No loop: It is not possible to traverse the same state repeatedly
  • Transducer: Receives a specific sequence, terminates in the final state, and outputs a value.

FST is similar to FSA, given a key can not only answer whether it exists, but also output an associated value.

Let's take a look at an input like this: "jul:7", where 7 is the value associated with jul, like an entry to a map.

An in-depth look at Lucene's dictionary FST

This is basically the same as the corresponding sorted set, except that the first 0->1 transformation j is associated with a value of 7. For other conversions of you and l, the default value of the association is 0, which is not shown here.

So when we look for the key: "jul", the process is as follows:

  • Initial state 0
  • 输入j, FST从0->1, value=7
  • 输入u, FST从1->2, value=7+0
  • 输入l,FST从2->3, value=7+0+0

At this point, the FST is in final state 3, so there is jul, and the output is 7.

Let's take another look at what has become of FST with the addition of mar:3:

An in-depth look at Lucene's dictionary FST

Again, it's very simple, note that the value of 3 that comes with MAR is placed on the first shift. This is just to make the algorithm simpler, in fact, can be placed on other transfers.

What happens to FST if the prefix is shared? Here we move on to jun:6.

An in-depth look at Lucene's dictionary FST

Like sets, Jun and Jul share state 3, but with some changes.

  • 0->1 shift, output changed from 7 to 6
  • 2->3 shifts, input l, output value becomes 1.

This output change is important because he changes the process of finding the jul output value.

  • Initial state 0
  • 输入j, FST从0->1, value=6
  • 输入u, FST从1->2, value=6+0
  • 输入l,FST从2->3, value=6+0+1

The final value is still 7, but the path taken is different.

Is it also correct to look for jun?

  • Initial state 0
  • 输入j, FST从0 -> 1, value=6
  • 输入u,FST从1 -> 2, value=6+0
  • 输入n,FST从2 -> 3, value=6+0+0

From the above, it can be seen that Jun's query is also correct. FST ensures that different transitions have unique values, but also reuses most of the data structures.

The key to achieving shared state is that each key corresponds to a unique path in FST. Therefore, for any given key, there will always be some combination of value transfers that make the path unique. What we need to do is how to distribute these combinations in the transfer.

The sharing mechanism of key output also applies to common prefixes and common suffixes. For example, we have FSTs like tuesday:3 and thursday:5:

An in-depth look at Lucene's dictionary FST

The two keys have a common prefix t and a common suffix sday. The two associated values also have a common prefix. 3 can be written as 3+0, while 5 can be written: 3+2. This is a good way to share the associated values.

The above example is actually a bit simplistic and limited. What if the values of these associations are not int? In fact, FST requires the following methods to be associated with the type of value(outputs).

  • 加(Addition)
  • 减 (Subtraction)
  • Take the prefix (min, for integers)

Construction of FST

Earlier, there was no mention of how to build an FST. Building is still a bit more complex than traversal.

For the sake of simplicity, let's assume that the data in the set or map is added in dictionary order. This assumption is a heavy constraint, but we'll talk about how to mitigate it.

In order to build the FSM, let's first look at how the TRIE tree is constructed.

Construction of TRIE trees

TRIE can be seen as an FSA, the only difference is that TRIE only shares prefixes, while FSAs share not only prefixes but also suffixes.

Let's say we have a set like this: mon, tues, thurs. Here's what the FSA looks like:

An in-depth look at Lucene's dictionary FST

The corresponding TRIE is like this, only the prefix is shared.

An in-depth look at Lucene's dictionary FST

TRIE has 3 final states that repeat 3,8,11. And 8 and 11 are both s transfers and can be combined.

Building a TRIE tree is fairly straightforward. Insert 1 key, and you only need to do a simple search. If the input ends first, the current state is set to final; If you can't transfer anymore, just create a new transfer and status. Don't forget that the last created state is set to final.

Construction of FST

Building an FST is largely the same as building an FSA, with the main difference being how outputs are placed and shared on the transfer.

Still use the previously mentioned examples, mon, tues, and thurs, and associate them with the corresponding week values of 2, 3, and 5.

Starting with the 1st key, mon:2:

An in-depth look at Lucene's dictionary FST

The dotted line here indicates that the FST may change during the subsequent insert process.

It's important to note that this is just a 2 on the 1st shift. Technically, the following distribution is also correct.

An in-depth look at Lucene's dictionary FST

It's just that it's easier to write an algorithm that puts output close to the start state.

Let's continue to insert thurs:5:

An in-depth look at Lucene's dictionary FST

Just like the insert of the FSA, after inserting the thurs, we can know that the mon part of the FST (blue) will not change anymore.

Since mon and thurs don't have a common prefix, it's just a simple key in 2 maps. So their output value can be placed directly on the first transition in the start state.

Next, go ahead and insert tues:3,

An in-depth look at Lucene's dictionary FST

This has given rise to new changes. Some of them were frozen, and knew they wouldn't change it again. The output value has also been reassigned. Since the output of tues is 3, and tues and thurs have the same prefix t, the prefix operation of 5 and 3 results in 3. The value of states 0-> 4 is assigned 3, and the values of states 4-> 5 are set to 2.

Let's insert more keys, this time tye:99 to see what happens:

An in-depth look at Lucene's dictionary FST

Tye is inserted, causing the "ES" part to freeze, and the output of states 4-> 9 is 99-3=96 due to the shared prefix t.

The last step, over, is to perform another freeze operation.

The final FST looks like this:

An in-depth look at Lucene's dictionary FST

Lucene FST

In the previous section, we gave a detailed introduction to the concept and construction of FST. This section provides a detailed analysis of the implementation of Lucene FST and its specifics.

Lucene关于FST相关的代码在package:org.apache.lucene.util.fst。

From org.apache.lucene.util.fst.Builder, this is the Builder that builds FST:

An in-depth look at Lucene's dictionary FST

The Builder uses a generic T, which allows the build to contain different types of FSTs. We focus on attributes.

From where the add() method inserts the data:

/** Add the next input/output pair.  The provided input
   *  must be sorted after the previous one according to
   *  {@link IntsRef#compareTo}.  It's also OK to add the same
   *  input twice in a row with different outputs, as long
   *  as {@link Outputs} implements the {@link Outputs#merge}
   *  method. Note that input is fully consumed after this
   *  method is returned (so caller is free to reuse), but
   *  output is not.  So if your outputs are changeable (eg
   *  {@link ByteSequenceOutputs} or {@link
   *  IntSequenceOutputs}) then you cannot reuse across
   *  calls. */
  public void add(IntsRef input, T output) throws IOException {

    ...
    // prefixLenPlus1是计算出input和lastInput具有公共前缀的位置
    final int prefixLenPlus1 = pos1+1;

     // 1.新插入的节点放到frontier数组,UnCompileNode表明是新插入的,以后还可能会变化,还未放入FST内。
    if (frontier.length < input.length+1) {
      final UnCompiledNode<T>[] next = ArrayUtil.grow(frontier, input.length+1);
      for(int idx=frontier.length;idx<next.length;idx++) {
        next[idx] = new UnCompiledNode<>(this, idx);
      }
      frontier = next;
    }

    // minimize/compile states from previous input's
    // orphan'd suffix

    // 2.从prefixLenPlus1, 进行freeze冰冻操作, 添加并构建最小FST
    freezeTail(prefixLenPlus1);

    // init tail states for current input
    // 3.将当前input剩下的部分插入,构建arc转移(前缀是共用的,不用添加新的状态)。
    for(int idx=prefixLenPlus1;idx<=input.length;idx++) {
      frontier[idx-1].addArc(input.ints[input.offset + idx - 1],
                             frontier[idx]);
      frontier[idx].inputCount++;
    }

    final UnCompiledNode<T> lastNode = frontier[input.length];
    if (lastInput.length() != input.length || prefixLenPlus1 != input.length + 1) {
      lastNode.isFinal = true;
      lastNode.output = NO_OUTPUT;
    }

    // push conflicting outputs forward, only as far as
    // needed
    // 4.如果有冲突的话,重新分配output值
    for(int idx=1;idx<prefixLenPlus1;idx++) {
      final UnCompiledNode<T> node = frontier[idx];
      final UnCompiledNode<T> parentNode = frontier[idx-1];

      final T lastOutput = parentNode.getLastOutput(input.ints[input.offset + idx - 1]);
      assert validOutput(lastOutput);

      final T commonOutputPrefix;
      final T wordSuffix;

      if (lastOutput != NO_OUTPUT) {
        // 使用common方法,计算output的共同前缀
        commonOutputPrefix = fst.outputs.common(output, lastOutput);
        assert validOutput(commonOutputPrefix);
        // 使用subtract方法,计算重新分配的output
        wordSuffix = fst.outputs.subtract(lastOutput, commonOutputPrefix);
        assert validOutput(wordSuffix);
        parentNode.setLastOutput(input.ints[input.offset + idx - 1], commonOutputPrefix);
        node.prependOutput(wordSuffix);
      } else {
        commonOutputPrefix = wordSuffix = NO_OUTPUT;
      }
      output = fst.outputs.subtract(output, commonOutputPrefix);
      assert validOutput(output);
    }

    ...
  }           

Through the annotation, we see that the input is sorted, i.e., ordered. Otherwise, it is not the smallest FST that is generated. In addition, if the NO_OUTPUT degenerates into FSA, there is no need to reallocate the output in step 4.

Among them, the freezeTail method is to freeze the part that no longer changes, also known as compile, and build the UnCompileNode into the FST. To enter the FST, you will first compileNode, and then addNode.

To summarize the following, the process of joining the node is as follows:

  • 1) The new input is put into the frontier, FST has not been added here
  • 2) According to the current input, freezeTail the last inserted data and put it into the FST
  • 3) Construct the Arc relationship of the input
  • 4) Resolve output conflicts, redistribute output, and ensure that the path is unified (NO_OUTPUT, not executed)

Finally, in the finish method, execute freezeTail(0) to build all inputs into the FST.

Also, it's worth noting the types of Outputs defined in Lucene:

An in-depth look at Lucene's dictionary FST

Three of the methods are defined by the Outputs interface, and there are 11 different types of implementations:

  • T add(T prefix, T output); 加
  • T subtract(T output, T inc); 减
  • T common(T output1, T output2) 前缀

Completely satisfying the limitations of our previous section, it can be seen that it is a complete implementation based on the previous algorithm.

In addition to the use of Term dictionaries, FST is also widely used throughout lucene, which basically replaces the hashmap.

The scenario is roughly as follows:

  • 自动联想:suggester
  • charFilter: mappingcharFilter
  • Synonym filter
  • hunspell拼写检查词典

summary

FST, which can share not only prefixes but also suffixes. It can not only determine whether the key of the search exists, but also give the input output of the response. It is optimized for both temporal complexity and spatial complexity, allowing Lucene to fully load the Term Dictionary into memory, quickly locating the Term to find the output of the response (posting the inverted list).

References:

Burst Tries

Direct Construction of Minimal Acyclic Subsequential Transducers

Index 1,600,000,000 Keys with Automata and Rust

DFA minimization WikiPedia

Smaller Representation of Finite State Automata

Using Finite State Transducers in Lucene

Your support is the motivation for my originality!

Donate

  • Post author: 申艳超
  • Post link: https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/