Finite State Automata in Dawid WEISS
.
.
Dawid Weiss .
20+ years of coding 10 years assembly only
Academia & Research
.
PhD in Information Retrieval, PUT
Open source Carrot2 , HPPC, Lucene, ‌
Industry & Business Carrot Search s.c.
.
Talk outline State machines (automata) FSAs, DFAs, FSTs and other XXXs.
Use cases in Lucene and Solr Suggester. FuzzySearch. Index.
No API details Still @experimental.
(Non)? Deterministic Finite State (Automata|Machines)
HashSet hash 0x29384d34 0xde3e3354 0x00000666
→ slot
→ value → lucene → lucid → lucifer
HashSet → slot
hash
→ value → lucene → lucid → lucifer
0x29384d34 0xde3e3354 0x00000666 FSA (deterministic)
l
u
c
e
n
e
i d f
r e
HashSet → slot
hash
→ value → lucene → lucid → lucifer
0x29384d34 0xde3e3354 0x00000666 FSA (deterministic)
l
u
c
e
n
e
i d exists(sequence) oor(pre x) ceil(pre x)
f
r e
k b
i
l
i
l
l l
deterministic, non-minimal
k b
i
l
i
l
i
l
l l
deterministic, non-minimal
k
b
l
deterministic, minimal
k b
i
l
i
l
i
l
l l
deterministic, non-minimal
k l
deterministic, minimal
b k
b
i
l
i
l
l
non-deterministic, non-minimal
(Sorted)Map lucene lucid lucifer
→ 1 → 2 → 666
(Sorted)Map → 1 → 2 → 666
lucene lucid lucifer
FST (transducer)
l
u
c
e i
n
e|1 d|2 r|666
f
e
(Sorted)Map lucene lucid lucifer
→ 1 → 2 → 666
FST (transducer)
l|1
u
c
e i|1
n
e d r
f|664
e
NFSAs and
Regular expressions a
a e1e2
Determinization
e+
states explosion, not always possible
Backtracking recursion explosion
e*
e?
e1
e1
e
e
e
a?nan
a?nan
n=3 → a?a?a?aaa
a?nan
n=3 → a?a?a?aaa
Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
35000 30000
Time [ms]
25000 20000 15000 10000 5000 0 0
5
10
15
20
25
Time of matching an for pattern a?n an , depending on n. Java 1.6, modern hardware.
30
Linear-time, minimal, deterministic
FSA construction
Linear algorithm from sorted input by Daciuk, Mihov, et al.
Active path states that still can change
States dictionary nodes that will never change
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
lucene
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
l
lucid
u
c
e
n
e
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
l
u
c
e
n
i d
e
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
l
u
c
e
n
i d
lucifer
e
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
l
u
c
e
n
i
e d
f
e
r
1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP
l
u
c
e
n
e
i d f
r e
FS(A|T)s in (Lucene|Solr)
Automata in
Lucene|Solr org.apache.lucene.util.automaton.* partial port of brics, FuzzyQuery, AutomatonTermsEnum
org.apache.lucene.util.automaton.fst.FST FSA and FSTs from sorted data, suggester, indexes
org.apache.lucene.util.automaton.fst.*
FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive
Input: abc, bd, bde. a b
b
c d
d
a e
b
b
c
d
e
org.apache.lucene.util.automaton.fst.*
FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive
Next-state chaining requires unusual tricks during construction
s3
a
s2
b s4
b d
s1 s5
s1 s2 s5 s4 cFL bL eFL dL
s1 s2 s5 s4
c e
s3 a
bL
s3
cFL bL eFL dL bLN a
org.apache.lucene.util.automaton.fst.*
FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive
Next-state chaining requires unusual tricks during construction
Everything in a byte[]
s3
a
s2
b s4
b d
s1 s5
s1 s2 s5 s4 cFL bL eFL dL
c e
s3 a
bL
traversals-ready, memory-eďŹƒcient
s1 s2 s5 s4
s3
cFL bL eFL dL bLN a
org.apache.lucene.util.automaton.fst.*
FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive
Next-state chaining requires unusual tricks during construction
Everything in a byte[]
s3
a
s2
b s4
b d
s1 s5
s1 s2 s5 s4 cFL bL eFL dL
c e
s3 a
bL
traversals-ready, memory-eďŹƒcient
Dual transition storage format lookup: bsearch or linear scan
s1 s2 s5 s4
s3
cFL bL eFL dL bLN a
Input size
Compressed size (MB)
Input
MB
Terms
Lucene
morf.
gzip
Wikipedia t.index Polish in .
481 162
38 092 045 3 672 200
258 3.1
164 1.7
149 15.4
.
Use Cases: Solr's Autocomplete
Solr's
Suggesters Design choices sort order (alpha, score), pre x vs. spelling, boost exact matches?
Weights term→weight, lookup(term, onlyMorePopular)
org.apache.solr.spelling.suggest.Lookup JaspellLookup, TSTLookup, FSTLookup
.
flour|3 four|4 fourier|3 furious|2
. T . ake 1 .
.
flour|3 four|4 fourier|3 furious|2
→fou*
u
o l f
o
u
r
i
e
r
u r
i
o
| 3
|
4 u
s
|
2
Find pre x. Depth-in traversal for completions. PQ on score|alpha . T . ake 1 .
.
2furious 3flour 3fourier 4four
. T . ake 2 .
.
2furious 3flour 3fourier 4four
→fou*
u
f
r
i
2 3
l o
f
4 f
u o
r
o
u
i
e
s r
u
From score roots, until N collected. Find pre x. Depth-in traversal for completions, stop if N collected. Find/boost exact match.
. T . ake 2 .
.
2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour ‌
. T . ake 3 (in xes) .
.
2
i o
o
r 5
u
i
r
s s
|
f
u s
u
4 6
u
o u
r
r
| f
7
r
.
i
o
u
i
e
l f
3 e
| r
o
u
o
i |
r u u
i r
e |
o
u o
f f r
i l
l
|
r
s r
u
o
Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.
Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.
Exact matches only. Static snapshot (not incremental). Discretized weights.
Top50KWiki.utf8, 676 KB, 50 000 terms
RAM. [B]
. PREFIX [100-200]
Jaspell
TST
FST
7 869. 415
7 914. 524
300 .175
queries per second,. . . tpq . . . 458 966 742
PREFIX. [6-9]
. 330
. 228
. 659
PREFIX. [2-4]
. 126
.. 29
. 501
Summary
Summary and Conclusions Automata compact, powerful, eďŹƒcient data structure
Lucene/Solr bene ts behind the scenes, but spreading: index, queries, suggesters
API in Lucene ‌is shaped right now, still @experimental
Acknowledgement Michael McCandless Robert Muir committer: .+
dawid.weiss@carrotsearch.com