Finite State Automata in Lucene: Internals and Applications

Page 1

Finite State Automata in Dawid WEISS


.

.

Dawid Weiss .

20+ years of coding 10 years assembly only

Academia & Research

.

PhD in Information Retrieval, PUT

Open source Carrot2 , HPPC, Lucene, ‌

Industry & Business Carrot Search s.c.

.


Talk outline State machines (automata) FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and Solr Suggester. FuzzySearch. Index.

No API details Still @experimental.


(Non)? Deterministic Finite State (Automata|Machines)


HashSet hash 0x29384d34 0xde3e3354 0x00000666

→ slot

→ value → lucene → lucid → lucifer


HashSet → slot

hash

→ value → lucene → lucid → lucifer

0x29384d34 0xde3e3354 0x00000666 FSA (deterministic)

l

u

c

e

n

e

i d f

r e


HashSet → slot

hash

→ value → lucene → lucid → lucifer

0x29384d34 0xde3e3354 0x00000666 FSA (deterministic)

l

u

c

e

n

e

i d exists(sequence) oor(pre x) ceil(pre x)

f

r e


k b

i

l

i

l

l l

deterministic, non-minimal


k b

i

l

i

l

i

l

l l

deterministic, non-minimal

k

b

l

deterministic, minimal


k b

i

l

i

l

i

l

l l

deterministic, non-minimal

k l

deterministic, minimal

b k

b

i

l

i

l

l

non-deterministic, non-minimal


(Sorted)Map lucene lucid lucifer

→ 1 → 2 → 666


(Sorted)Map → 1 → 2 → 666

lucene lucid lucifer

FST (transducer)

l

u

c

e i

n

e|1 d|2 r|666

f

e


(Sorted)Map lucene lucid lucifer

→ 1 → 2 → 666

FST (transducer)

l|1

u

c

e i|1

n

e d r

f|664

e


NFSAs and

Regular expressions a

a e1e2

Determinization

e+

states explosion, not always possible

Backtracking recursion explosion

e*

e?

e1

e1

e

e

e


a?nan


a?nan

n=3 → a?a?a?aaa


a?nan

n=3 → a?a?a?aaa

Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).


35000 30000

Time [ms]

25000 20000 15000 10000 5000 0 0

5

10

15

20

25

Time of matching an for pattern a?n an , depending on n. Java 1.6, modern hardware.

30


Linear-time, minimal, deterministic

FSA construction

Linear algorithm from sorted input by Daciuk, Mihov, et al.

Active path states that still can change

States dictionary nodes that will never change


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

lucene


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

l

lucid

u

c

e

n

e


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

l

u

c

e

n

i d

e


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

l

u

c

e

n

i d

lucifer

e


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

l

u

c

e

n

i

e d

f

e

r


1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP

l

u

c

e

n

e

i d f

r e


FS(A|T)s in (Lucene|Solr)


Automata in

Lucene|Solr org.apache.lucene.util.automaton.* partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FST FSA and FSTs from sorted data, suggester, indexes


org.apache.lucene.util.automaton.fst.*

FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive

Input: abc, bd, bde. a b

b

c d

d

a e

b

b

c

d

e


org.apache.lucene.util.automaton.fst.*

FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive

Next-state chaining requires unusual tricks during construction

s3

a

s2

b s4

b d

s1 s5

s1 s2 s5 s4 cFL bL eFL dL

s1 s2 s5 s4

c e

s3 a

bL

s3

cFL bL eFL dL bLN a


org.apache.lucene.util.automaton.fst.*

FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive

Next-state chaining requires unusual tricks during construction

Everything in a byte[]

s3

a

s2

b s4

b d

s1 s5

s1 s2 s5 s4 cFL bL eFL dL

c e

s3 a

bL

traversals-ready, memory-eďŹƒcient

s1 s2 s5 s4

s3

cFL bL eFL dL bLN a


org.apache.lucene.util.automaton.fst.*

FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive

Next-state chaining requires unusual tricks during construction

Everything in a byte[]

s3

a

s2

b s4

b d

s1 s5

s1 s2 s5 s4 cFL bL eFL dL

c e

s3 a

bL

traversals-ready, memory-eďŹƒcient

Dual transition storage format lookup: bsearch or linear scan

s1 s2 s5 s4

s3

cFL bL eFL dL bLN a


Input size

Compressed size (MB)

Input

MB

Terms

Lucene

morf.

gzip

Wikipedia t.index Polish in .

481 162

38 092 045 3 672 200

258 3.1

164 1.7

149 15.4

.


Use Cases: Solr's Autocomplete


Solr's

Suggesters Design choices sort order (alpha, score), pre x vs. spelling, boost exact matches?

Weights term→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.Lookup JaspellLookup, TSTLookup, FSTLookup


.

flour|3 four|4 fourier|3 furious|2

. T . ake 1 .


.

flour|3 four|4 fourier|3 furious|2

→fou*

u

o l f

o

u

r

i

e

r

u r

i

o

| 3

|

4 u

s

|

2

Find pre x. Depth-in traversal for completions. PQ on score|alpha . T . ake 1 .


.

2furious 3flour 3fourier 4four

. T . ake 2 .


.

2furious 3flour 3fourier 4four

→fou*

u

f

r

i

2 3

l o

f

4 f

u o

r

o

u

i

e

s r

u

From score roots, until N collected. Find pre x. Depth-in traversal for completions, stop if N collected. Find/boost exact match.

. T . ake 2 .


.

2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour ‌

. T . ake 3 (in xes) .


.

2

i o

o

r 5

u

i

r

s s

|

f

u s

u

4 6

u

o u

r

r

| f

7

r

.

i

o

u

i

e

l f

3 e

| r

o

u

o

i |

r u u

i r

e |

o

u o

f f r

i l

l

|

r

s r

u

o


Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.


Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.

Exact matches only. Static snapshot (not incremental). Discretized weights.


Top50KWiki.utf8, 676 KB, 50 000 terms

RAM. [B]

. PREFIX [100-200]

Jaspell

TST

FST

7 869. 415

7 914. 524

300 .175

queries per second,. . . tpq . . . 458 966 742

PREFIX. [6-9]

. 330

. 228

. 659

PREFIX. [2-4]

. 126

.. 29

. 501


Summary


Summary and Conclusions Automata compact, powerful, eďŹƒcient data structure

Lucene/Solr bene ts behind the scenes, but spreading: index, queries, suggesters

API in Lucene ‌is shaped right now, still @experimental


Acknowledgement Michael McCandless Robert Muir committer: .+


dawid.weiss@carrotsearch.com


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.