Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011
Searching the United States Code
§ Who are we:
  • Paul Nelson, Chief Architect
  • Ronald Matamoros, Lead Engineer
§ Our Mission: Replace Personal Librarian Search
  • A 20-Year-Old Search Engine!
§ Key Challenges
  • How to index this massive, complex, 85-year-old document?
  • How to replicate 20-year-old search features?
§ Government Documents are Fun!
Search Technologies
§ The largest independent provider of enterprise search expertise and services
§ 80 full-time dedicated search engine experts
§ 200+ customers
§ Technology Neutral
  • (yeah, we know Sphinx too)
§ Offices All Over
  • DC, NY, CA, MD, OH, UK, CR…
A Quick Civics Lesson…
§ The United States Code
  • The general & permanent laws of the U.S. Government
    – All in one place
  • 51 titles
    § Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
  • First Version: 1926
§ The Office of the Law Revision Counsel (OLRC)
  • 20 lawyers who author the U.S. Code
  • They report to the Speaker of the House of Representatives
§ Bonus Question: Which Title is the largest?
Major Challenges
1. Document Parsing
  • A 50 Volume Table Of Contents!
2. Query Parsing
  • Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3. Searching & Highlighting Fields
  • Some fields are embedded in the document
  • These fields must be highlighted in context
Part The First: Document Processing
Document Processing / Indexing
USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index (Solr)
Field Type 1: Extracted to Index
Callouts on the slide: Page Numbers, Heading, Title, Source Credit
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
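The extracted fields live in the leading comments of each granule as key:value pairs. A minimal sketch of pulling them out, assuming (from the sample only) that one comment may hold several space-separated pairs and that a value may itself contain spaces. The function name extract_index_fields is hypothetical, not the production parser.

```python
import re

# A comment body may hold one "key:value" pair or several space-separated pairs.
# A value runs until the next " key:" or the end of the comment, so values
# containing spaces (itempath, expcite) stay intact.
COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
PAIR = re.compile(r"(\w+):(.*?)(?=\s+\w+:|$)", re.DOTALL)


def extract_index_fields(xhtml: str) -> dict[str, str]:
    """Collect key:value metadata from comments (hypothetical helper)."""
    fields: dict[str, str] = {}
    for comment in COMMENT.finditer(xhtml):
        body = comment.group(1)
        if body.startswith(("field-start:", "field-end:")):
            continue  # these delimit in-document fields; they carry no value
        for key, value in PAIR.findall(body):
            fields[key] = value.strip()
    return fields


sample = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 --> "
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 --> "
    "<!-- field-start:head --><h3>...</h3><!-- field-end:head -->"
)
print(extract_index_fields(sample))
```

Each key could then map to a Solr field of the same name; that mapping is an assumption here, not something the slides state.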
Document Processing / Indexing: Parse & Granularize
Title 14 → parts (pt. A, pt. B, pt. C, …) → chapters (ch. 1, ch. 2, ch. 3, …) → sections (sec. 1, sec. 2, sec. 3, …)
Field Type 2: Embedded Refs
Reference types called out on the slide: Statute at Large, Public Law, USC Refs, Other
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
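The slides do not show how references are detected or stored. As an illustration only, the sketch below finds two of the called-out reference types (Public Law, Statute at Large) with regular expressions. The patterns, the kind labels and the name find_refs are all assumptions made for this example.

```python
import html
import re

# Hypothetical patterns for two citation styles visible in the sample text.
PUBLIC_LAW = re.compile(r"Pub\. L\. (\d+)[\u2013-](\d+)")
STATUTE_AT_LARGE = re.compile(r"\b(\d+) Stat\. (\d+)")


def find_refs(xhtml_text: str) -> list[tuple[str, str]]:
    """Return (kind, normalized citation) pairs in document order."""
    # Decode entities first, so &ndash; becomes an en dash the pattern can match.
    text = html.unescape(xhtml_text)
    hits = []
    for m in PUBLIC_LAW.finditer(text):
        hits.append((m.start(), "public-law", f"Pub. L. {m.group(1)}-{m.group(2)}"))
    for m in STATUTE_AT_LARGE.finditer(text):
        hits.append((m.start(), "statute-at-large", f"{m.group(1)} Stat. {m.group(2)}"))
    return [(kind, cite) for _, kind, cite in sorted(hits)]


print(find_refs("(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),"))
```

A real implementation would also need USC cross-references and the "Other" category, which this sketch does not attempt.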
Document Processing / Indexing: Repository
§ /US-Code
  § /2010
    § /title2
      § /USC-title2-section1532.htm
      § /USC-title2-node3-rule5.htm
Part The Second: Token Processing
Token Processing 1: XHTML Tag Tokenizer
Input:
<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of … <!-- field-end:amendment-note -->
Tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
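A minimal sketch of a tokenizer with this behavior: comments and tags survive as single tokens, words are split out, entities and punctuation are dropped. It is a plain-Python illustration, not the Lucene tokenizer used in the project, and the name tokenize_xhtml is hypothetical.

```python
import re

# Alternatives are tried left to right: comments, tags, character entities, words.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|&#?\w+;|[A-Za-z0-9]+", re.DOTALL)


def tokenize_xhtml(xhtml: str) -> list[str]:
    """Split XHTML into comment, tag and word tokens; entities and punctuation are dropped."""
    return [tok for tok in TOKEN.findall(xhtml) if not tok.startswith("&")]


sample = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    "&ldquo;Department of <!-- field-end:amendment-note -->"
)
for token in tokenize_xhtml(sample):
    print(token)
```

Unlike the slide, this sketch leaves the case of each word exactly as it appears in the input.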
Field Type 3: Marked Within Doc
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
Token Processing 2: Mark Start and End Tags
(input token → output token)
<!-- field-start:amendment-note --> → S/amendment
<h4 class="note-head"> → <h4 class="note-head">
Amendments → Amendments
</h4> → </h4>
<p class="note-body"> → <p class="note-body">
2002 → 2002
Pub → Pub
L → L
107 → 107
296 → 296
Substituted → Substituted
Department → Department
of → of
<!-- field-end:amendment-note --> → E/amendment
Token Processing 3: Remove XHTML Tags
(input token → output token)
S/amendment → S/amendment
<h4 class="note-head"> → (removed)
Amendments → Amendments
</h4> → (removed)
<p class="note-body"> → (removed)
2002 → 2002
Pub → Pub
L → L
107 → 107
296 → 296
Substituted → Substituted
Department → Department
of → of
E/amendment → E/amendment
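Marking start/end tags and removing the remaining XHTML can be sketched as one pass over the token list. Dropping the "-note" suffix from the marker name is inferred from the slide (field-start:amendment-note becomes S/amendment); the actual naming rule is not stated. The function name mark_and_strip is hypothetical.

```python
import re

# Assumption: the "-note" suffix is dropped from the marker name, as in
# field-start:amendment-note -> S/amendment on the slide.
FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+?)(?:-note)?\s*-->")


def mark_and_strip(tokens: list[str]) -> list[str]:
    """Turn field markers into S/ and E/ tokens; drop every other tag or comment."""
    out = []
    for tok in tokens:
        marker = FIELD_MARK.fullmatch(tok)
        if marker:
            prefix = "S/" if marker.group(1) == "start" else "E/"
            out.append(prefix + marker.group(2))
        elif tok.startswith("<"):
            continue  # any other comment or XHTML tag is not searchable text
        else:
            out.append(tok)
    return out


tokens = ["<!-- field-start:amendment-note -->", '<h4 class="note-head">',
          "Amendments", "</h4>", "<!-- field-end:amendment-note -->"]
print(mark_and_strip(tokens))
```

Because S/ and E/ are ordinary tokens in the stream, a later operator can test whether a word falls between them by position.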
Token Processing 4: Tag Original Case & Lower Case
(input token → output tokens)
S/amendment → S/amendment
Amendments → O/Amendments, L/amendments
2002 → O/2002, L/2002
Pub → O/Pub, L/pub
L → O/L, L/l
107 → O/107, L/107
296 → O/296, L/296
Substituted → O/Substituted, L/substituted
Department → O/Department, L/department
of → O/of, L/of
E/amendment → E/amendment
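A small sketch of the case-tagging step. It emits both variants at the same position, the usual Lucene approach (position increment 0) for alternative forms of one word; whether the production filter stacks them this way is an assumption. Terms are modeled as (text, position) pairs and tag_case is a hypothetical name.

```python
def tag_case(tokens: list[str]) -> list[tuple[str, int]]:
    """Emit O/<original> and L/<lowercase> for each word, both at the word's position."""
    out = []
    for position, tok in enumerate(tokens):
        if tok.startswith(("S/", "E/")):
            out.append((tok, position))  # field markers pass through unchanged
        else:
            out.append(("O/" + tok, position))
            out.append(("L/" + tok.lower(), position))
    return out


print(tag_case(["S/amendment", "Pub", "L", "E/amendment"]))
```

An exact-case query term such as exact:FOIA can then be matched against the O/ variants only, while ordinary terms use the L/ variants.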
Token Processing 5: Lemmatize
Uses a dictionary-based lemmatizer based on GCIDE and WordNet
(input tokens → output tokens)
S/amendment → S/amendment
O/Amendments, L/amendments → O/Amendments, L/amendments, amendment
O/2002, L/2002 → O/2002, L/2002, 2002
O/Pub, L/pub → O/Pub, L/pub, pub
O/L, L/l → O/L, L/l, l
O/107, L/107 → O/107, L/107, 107
O/296, L/296 → O/296, L/296, 296
O/Substituted, L/substituted → O/Substituted, L/substituted, substitute
O/Department, L/department → O/Department, L/department, department
O/of, L/of → O/of, L/of, of
E/amendment → E/amendment
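A sketch of the lemmatization step over (term, position) pairs. The tiny LEMMAS dictionary is a stand-in for the GCIDE/WordNet-based dictionary, which is not reproduced here, and add_lemmas is a hypothetical name. Words missing from the dictionary are used as their own lemma, matching the slide (2002 stays 2002, of stays of).

```python
# Stand-in for the real dictionary; only a few illustrative entries.
LEMMAS = {"amendments": "amendment", "substituted": "substitute", "records": "record"}


def add_lemmas(terms: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """After each L/ term, add its unprefixed lemma at the same position."""
    out = []
    for term, position in terms:
        out.append((term, position))
        if term.startswith("L/"):
            word = term[2:]
            out.append((LEMMAS.get(word, word), position))
    return out


print(add_lemmas([("O/Amendments", 1), ("L/amendments", 1)]))
```

The index then holds three searchable variants per word: exact case (O/), case-insensitive (L/) and lemma (no prefix).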
Part The Third: Query Processing
Query Processing (not all stages shown)
Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search
§ Communicates via generic QNode Class
  • Simpler to manipulate than Lucene operators
§ Can produce FAST FQL as well
  • (cue the derisive catcalls)
§ But most importantly:
  • It is a Query Processing Pipeline
    § Mix and match query processing modules
Query Processing
Query String: exact:FOIA top secret amendment:RECORDS
Stages: parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

After parse:
and
  exact: |FOIA|
  phrase
    |top|
    |secret|
  amendment: |RECORDS|

After mark original:
and
  O/FOIA
  phrase
    |top|
    |secret|
  amendment: |RECORDS|

After mark lowercase:
and
  O/FOIA
  phrase
    |L/top|
    |L/secret|
  amendment: |records|

After lemmatize:
and
  O/FOIA
  phrase
    |L/top|
    |L/secret|
  amendment: |record|

After query template:
and
  O/FOIA
  phrase
    |L/top|
    |L/secret|
  between
    S/amendment
    |record|
    E/amendment
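A sketch of how a mix-and-match pipeline over a generic query node might look. The QNode fields, the stage functions and the rewriting rules are inferred from the trees on these slides and are assumptions, not the project's actual classes. Parsing is skipped; the example starts from the parsed tree for exact:FOIA top secret amendment:RECORDS.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    op: str                      # "and", "phrase", "exact", "field", "between" or "term"
    value: str = ""              # term text, or the field name for "field"/"between"
    children: list["QNode"] = field(default_factory=list)


LEMMAS = {"records": "record"}   # stand-in dictionary


def walk(node, stage):
    """Rebuild the tree bottom-up, applying one stage to every node."""
    return stage(QNode(node.op, node.value, [walk(c, stage) for c in node.children]))


def mark_original(n):
    # exact:FOIA -> O/FOIA (match the original-case token)
    return QNode("term", "O/" + n.children[0].value) if n.op == "exact" else n


def mark_lowercase(n):
    if n.op == "term" and "/" not in n.value:
        return QNode("term", n.value.lower())
    if n.op == "phrase":         # children were lowercased already (bottom-up walk)
        return QNode("phrase", "", [QNode("term", "L/" + c.value) for c in n.children])
    return n


def lemmatize(n):
    if n.op == "term" and "/" not in n.value:
        return QNode("term", LEMMAS.get(n.value, n.value))
    return n


def query_template(n):
    # field:term -> between(S/field, term, E/field)
    if n.op == "field":
        start, end = QNode("term", "S/" + n.value), QNode("term", "E/" + n.value)
        return QNode("between", n.value, [start, *n.children, end])
    return n


def run_pipeline(tree, stages):
    for stage in stages:
        tree = walk(tree, stage)
    return tree


def show(n):
    return n.value if n.op == "term" else f"{n.op}({', '.join(show(c) for c in n.children)})"


parsed = QNode("and", "", [
    QNode("exact", "", [QNode("term", "FOIA")]),
    QNode("phrase", "", [QNode("term", "top"), QNode("term", "secret")]),
    QNode("field", "amendment", [QNode("term", "RECORDS")]),
])
print(show(run_pipeline(parsed, [mark_original, mark_lowercase, lemmatize, query_template])))
```

Because every stage is a plain function from node to node, stages can be reordered, dropped or swapped without touching the others, which is the point of the pipeline design.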
The between() Operator
§ between(start-tag, end-tag, pos-clause, neg-clause)
§ start-tag → Starting tag, e.g. S/amendment
§ end-tag → Ending tag, e.g. E/amendment
§ pos-clause → words which must occur between start and end
  • Note: Requires a nested ScanAnd() operator
§ neg-clause → words which must not occur between start and end
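The semantics of between() can be illustrated on a flat token list. This is a plain-Python sketch of the matching rule only, assuming one term per position and non-nested occurrences of the same tag. It is not the custom Lucene operator, and it treats pos-clause as a simple "all of these words" list.

```python
def between(tokens, start_tag, end_tag, pos_terms, neg_terms=()):
    """True if some start_tag..end_tag region holds all pos_terms and no neg_terms."""
    required, forbidden = set(pos_terms), set(neg_terms)
    inside = None                      # terms seen since the last start tag, or None
    for tok in tokens:
        if tok == start_tag:
            inside = set()
        elif tok == end_tag and inside is not None:
            if required <= inside and not (forbidden & inside):
                return True
            inside = None
        elif inside is not None:
            inside.add(tok)
    return False


doc = ["S/amendment", "2002", "pub", "substitute", "department", "E/amendment",
       "S/statute", "record", "E/statute"]
print(between(doc, "S/amendment", "E/amendment", ["substitute", "department"]))
print(between(doc, "S/amendment", "E/amendment", ["record"]))
```

The second call fails because "record" occurs only inside the statute region, which is the behavior a fielded search such as amendment:RECORDS needs.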
Part The Fourth: Hierarchical Navigation
Hierarchies: Requirements
§ Any number of levels
  § Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
§ Levels vary across titles
  § Title 1: 3 levels
  § Title 26: 8 levels
§ Multiple views:
  § Children
  § Ancestors
  § Ancestor's Siblings
§ Multiple search scopes:
  § Only children, all descendants, everything
Hierarchies: Ancestor-Siblings
§ US-Code
  • Title 1
  • Title 2
    § Chapter 1
    § Chapter 2
      – Part 1
      – Part 2
        • Section 2.1
        • Section 2.2
      – Part 3
      – Part 4
    § Chapter 3
    § Chapter 4
  • Title 3
Hierarchies: Fields
§ ancestors – for searching
  § USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§ encodedAncestors – for display only
  • Where the node exists within the hierarchy
  § id;heading;subjectTitle//id;heading;subjectTitle//...
  § USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§ parentId – ID of the parent node
  § USC-title2-chapter25-subchapter2
§ treesort – Hierarchical sort field, e.g. 13/000/0/00882
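A sketch of deriving the three ancestor-related fields from a node's ancestor chain (root first, parent last). The function name hierarchy_fields and the (id, heading, subjectTitle) tuple layout are assumptions. The slide example suggests encodedAncestors may omit the top levels, which this sketch does not model: it uses the same chain for every field.

```python
def hierarchy_fields(chain: list[tuple[str, str, str]]) -> dict[str, str]:
    """Build ancestors, encodedAncestors and parentId from (id, heading, subjectTitle) tuples."""
    return {
        "ancestors": " ".join(node_id for node_id, _, _ in chain),
        "encodedAncestors": "//".join(";".join(entry) for entry in chain),
        "parentId": chain[-1][0],
    }


chain = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II", "Regulatory Accountability and Reform"),
]
for name, value in hierarchy_fields(chain).items():
    print(f"{name}: {value}")
```

Storing the display string separately keeps breadcrumb rendering to a single stored-field read, with no extra lookups per ancestor.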
Hierarchies: Tree Sort
§ Sorting In Print Order
  • Front Matter → Titles → Tables → etc.
  • Everything padded to fixed-length
§ Example: 01/011/1/02032
  • 01 = USC Title
  • 011 = Title 11
  • 1 = An Appendix
  • 02032 = Sequence # in file
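A sketch of building a fixed-width treesort key. The widths (2/3/1/5 digits) are read off the example 01/011/1/02032; the parameter names and the function name treesort_key are hypothetical.

```python
def treesort_key(doc_type: int, title: int, appendix: bool, sequence: int) -> str:
    """Zero-pad every component so plain string order equals print order."""
    return f"{doc_type:02d}/{title:03d}/{int(appendix)}/{sequence:05d}"


keys = [
    treesort_key(1, 11, True, 2032),
    treesort_key(1, 2, False, 15),
    treesort_key(1, 11, False, 7),
]
print(sorted(keys))
```

Without the padding, Title 11 would sort before Title 2 as a string; with it, a plain string sort on the field is enough.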
Hierarchies: Sample Searches
§ Assuming Node = USC-title2-chapter25
§ Search Children
  • parentId:USC-title2-chapter25
§ Search All Descendants
  • ancestors:USC-title2-chapter25
§ Ancestor Siblings
  • (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
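The three sample searches can be generated from a node ID and its ancestor chain. A minimal sketch that only builds the query strings shown above; the function names are hypothetical, and IDs are assumed to need no escaping (letters, digits and interior hyphens only).

```python
def children_query(node_id: str) -> str:
    return f"parentId:{node_id}"


def descendants_query(node_id: str) -> str:
    return f"ancestors:{node_id}"


def ancestor_siblings_query(chain: list[str]) -> str:
    """chain lists the IDs from the root down to the current node, inclusive."""
    return "(" + " OR ".join(f"parentId:{node_id}" for node_id in chain) + ")"


chain = ["USC", "USC-title2", "USC-title2-chapter25"]
print(children_query(chain[-1]))
print(descendants_query(chain[-1]))
print(ancestor_siblings_query(chain))
```

The ancestor-siblings query returns the children of every node on the path, which is the set of nodes an expanded navigation tree has to display.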
Contact
§ Paul Nelson
  • pnelson@searchtechnologies.com
§ Ronald Matamoros
  • rmatamoros@searchtechnologies.com
§ Search Technologies
  • http://searchtechnologies.com