Searching The United States Code with Solr/Lucene

Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011


Searching the United States Code
§  Who are we:
   •  Paul Nelson, Chief Architect
   •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
   •  A 20-Year-Old Search Engine!
§  Key Challenges
   •  How to index this massive, complex, 85-year-old document?
   •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!


Search Technologies
§  The largest independent provider of enterprise search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD, OH, UK, CR…


A Quick Civics Lesson…
§  The United States Code
   •  The general & permanent laws of the U.S. Government – All in one place
   •  51 titles
      §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
   •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
   •  20 lawyers who author the U.S. Code
   •  They report to the Speaker of the House of Representatives
§  Bonus Question: Which Title is the largest?


Major Challenges
1.  Document Parsing
   •  A 50 Volume Table Of Contents!
2.  Query Parsing
   •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3.  Searching & Highlighting Fields
   •  Some fields are embedded in the document
   •  These fields must be highlighted in context


[Screenshots]


Part The First: Document Processing


Document Processing / Indexing
USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store → Xform & Index
Outputs: Solr, Repository


Field Type 1: Extracted to Index
Callouts: Page Numbers, Heading, Title, Source Credit

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …
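As a rough illustration of how the metadata comments at the top of a section could be pulled out for indexing, here is a minimal Python sketch. The comment syntax comes from the slide; the function name `extract_metadata` and the rule that a key is a run of letters followed by a colon are assumptions, not the presenters' implementation.

```python
import re

# Matches any HTML comment and captures its body.
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
# Assumed key shape: letters followed by ':' at the start of the comment or
# after whitespace (values such as item paths may themselves contain spaces).
KEY_RE = re.compile(r"(?:^|\s)([A-Za-z]+):")

def extract_metadata(xhtml):
    """Collect key:value pairs from metadata comments, skipping field markers."""
    meta = {}
    for body in COMMENT_RE.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue  # field markers delimit text spans; they are not metadata
        keys = list(KEY_RE.finditer(body))
        for i, match in enumerate(keys):
            end = keys[i + 1].start() if i + 1 < len(keys) else len(body)
            meta[match.group(1)] = body[match.end():end].strip()
    return meta

sample = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 -->\n"
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->\n"
    "<!-- field-start:head --><h3>§1. Establishment of Coast Guard</h3>"
    "<!-- field-end:head -->\n"
)
fields = extract_metadata(sample)
print(fields["itempath"])
```

Each key found this way could then become a separate index field, while the `field-start`/`field-end` comments stay in the document text.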


Document Processing / Indexing
USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store → Xform & Index
Outputs: Solr, Repository

Title 14
   pt. A, pt. B, pt. C, …
      ch. 1, ch. 2, ch. 3, …
         sec. 1, sec. 2, sec. 3, …


Field Type 2: Embedded Refs
Callouts: Statute at Large, Public Law, USC Refs, Other

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …
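To make the idea of reference detection concrete, the sketch below finds the two citation shapes that appear in the sample section ("Pub. L. 94–546" and "63 Stat. 496"). The regular expressions and the `find_refs` name are illustrative assumptions; the slides do not show how the real pipeline recognizes or embeds references, and real citations vary far more than these two patterns.

```python
import re

# Citation shapes taken from the sample text only; both patterns are assumptions.
PUBLIC_LAW_RE = re.compile(r"Pub\.\s*L\.\s*(\d+)[\u2013-](\d+)")
STATUTE_RE = re.compile(r"(\d+)\s+Stat\.\s+(\d+)")

def find_refs(text):
    """Return (start, kind, citation) tuples in document order."""
    refs = []
    for m in PUBLIC_LAW_RE.finditer(text):
        refs.append((m.start(), "public-law", f"{m.group(1)}-{m.group(2)}"))
    for m in STATUTE_RE.finditer(text):
        refs.append((m.start(), "statute-at-large",
                     f"{m.group(1)} Stat. {m.group(2)}"))
    return sorted(refs)

credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94\u2013546, \u00a71(1),"
refs = find_refs(credit)
for start, kind, citation in refs:
    print(start, kind, citation)
```

The character offsets returned here are what a later step would need in order to wrap each citation in a link or marker.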



Document Processing / Indexing
USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store → Xform & Index
Outputs: Solr, Repository

Repository:
/US-Code
   /2010
      /title2
         /USC-title2-section1532.htm
         /USC-title2-node3-rule5.htm


Part The Second: Token Processing


Token Processing 1: XHTML tag tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of
<!-- field-end:amendment-note -->

Tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
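The tokenizer step above can be approximated in a few lines of Python. This is a sketch under the assumption that comments and tags are each kept as a single token and that everything else is split into word tokens; the actual tokenizer in the talk is a Lucene component that is not shown on the slides.

```python
import re

# Order matters: try comments first, then tags, then plain words.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|\w+", re.DOTALL)

def tokenize(xhtml):
    """Split XHTML into comment tokens, tag tokens, and word tokens."""
    return TOKEN_RE.findall(xhtml)

snippet = (
    '<!-- field-start:amendment-note --> '
    '<h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002\u2014Pub. L. 107\u2013296 substituted '
    '\u201cDepartment of <!-- field-end:amendment-note -->'
)
tokens = tokenize(snippet)
for token in tokens:
    print(token)
```

Punctuation such as the dashes and the opening quotation mark is dropped because it matches none of the three alternatives.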


Field Type 3: Marked Within Doc

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …


Token Processing 2: Mark Start and End Tags

Input token                             Output token
<!-- field-start:amendment-note -->     S/amendment
<h4 class="note-head">                  <h4 class="note-head">
Amendments                              Amendments
</h4>                                   </h4>
<p class="note-body">                   <p class="note-body">
2002                                    2002
Pub                                     Pub
L                                       L
107                                     107
296                                     296
Substituted                             Substituted
Department                              Department
of                                      of
<!-- field-end:amendment-note -->       E/amendment


Token Processing 3: Remove XHTML Tags

Input token               Output token
S/amendment               S/amendment
<h4 class="note-head">    (removed)
Amendments                Amendments
</h4>                     (removed)
<p class="note-body">     (removed)
2002                      2002
Pub                       Pub
L                         L
107                       107
296                       296
Substituted               Substituted
Department                Department
of                        of
E/amendment               E/amendment
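The marking step and the tag-removal step can be sketched together as two small token filters. The names `mark_fields` and `strip_tags` are hypothetical, and dropping the "-note" suffix is an assumption made only because the slides show `S/amendment` for the `amendment-note` field.

```python
import re

FIELD_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")

def mark_fields(tokens):
    """Replace field-start/field-end comments with S/name and E/name markers."""
    out = []
    for tok in tokens:
        m = FIELD_RE.fullmatch(tok)
        if m is None:
            out.append(tok)
            continue
        name = m.group(2)
        if name.endswith("-note"):  # assumption: slides show S/amendment
            name = name[: -len("-note")]
        out.append(("S/" if m.group(1) == "start" else "E/") + name)
    return out

def strip_tags(tokens):
    """Drop remaining XHTML tags and comments; markers and words survive."""
    return [tok for tok in tokens if not tok.startswith("<")]

raw = [
    '<!-- field-start:amendment-note -->',
    '<h4 class="note-head">', 'Amendments', '</h4>',
    '<!-- field-end:amendment-note -->',
]
marked = mark_fields(raw)
cleaned = strip_tags(marked)
print(cleaned)
```

Running the marker filter before the tag filter matters: once the comments are rewritten to `S/` and `E/` tokens they no longer start with `<` and so are kept.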


Token Processing 4: Tag Original Case & Lower Case

Input token     Original case     Lower case
S/amendment     S/amendment
Amendments      O/Amendments      L/amendments
2002            O/2002            L/2002
Pub             O/Pub             L/pub
L               O/L               L/l
107             O/107             L/107
296             O/296             L/296
Substituted     O/Substituted     L/substituted
Department      O/Department      L/department
of              O/of              L/of
E/amendment     E/amendment
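A compact way to picture the case-tagging step is a function that emits two terms for every word. In this sketch the two terms share one position number, which is an assumption (the slides do not state how positions are assigned), and `tag_case` is a hypothetical name.

```python
def tag_case(tokens):
    """Emit (position, term) pairs: O/ keeps case, L/ is lowercased."""
    out = []
    for pos, tok in enumerate(tokens):
        if tok.startswith(("S/", "E/")):
            out.append((pos, tok))  # field markers pass through unchanged
        else:
            out.append((pos, "O/" + tok))
            out.append((pos, "L/" + tok.lower()))
    return out

tagged = tag_case(["S/amendment", "Pub", "L", "E/amendment"])
for pos, term in tagged:
    print(pos, term)
```

Keeping both variants lets an exact-case query match the `O/` term while an ordinary query matches the `L/` term.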


Token Processing 5: Lemmatize
Uses dictionary-based lemmatizer based on GCIDE and WordNet

Input tokens                      Output tokens
S/amendment                       S/amendment
O/Amendments   L/amendments       O/Amendments   L/amendments   amendment
O/2002         L/2002             O/2002         L/2002         2002
O/Pub          L/pub              O/Pub          L/pub          pub
O/L            L/l                O/L            L/l            l
O/107          L/107              O/107          L/107          107
O/296          L/296              O/296          L/296          296
O/Substituted  L/substituted      O/Substituted  L/substituted  substitute
O/Department   L/department       O/Department   L/department   department
O/of           L/of               O/of           L/of           of
E/amendment                       E/amendment
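The lemmatization step adds a third, untagged term next to each `O/` and `L/` pair. The sketch below uses a tiny hand-written dictionary as a stand-in for the GCIDE/WordNet-based lemmatizer named on the slide; the `LEMMAS` table and the `add_lemmas` function are hypothetical.

```python
# Stand-in dictionary; the real lemmatizer is built from GCIDE and WordNet.
LEMMAS = {
    "amendments": "amendment",
    "substituted": "substitute",
}

def add_lemmas(tagged):
    """After each L/ term, emit its lemma (or the word itself) at the same position."""
    out = []
    for pos, term in tagged:
        out.append((pos, term))
        if term.startswith("L/"):
            word = term[2:]
            out.append((pos, LEMMAS.get(word, word)))
    return out

stream = [(0, "S/amendment"), (1, "O/Amendments"), (1, "L/amendments"),
          (2, "O/of"), (2, "L/of")]
lemmatized = add_lemmas(stream)
for pos, term in lemmatized:
    print(pos, term)
```

Words missing from the dictionary fall back to their lowercase form, which matches the slide's output for tokens such as `2002` and `of`.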


Part The Third: Query Processing


Query Processing (not all stages shown)
Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search
§  Communicates via generic QNode Class
   •  Simpler to manipulate than Lucene operators
§  Can produce FAST FQL as well
   •  (cue the derisive catcalls)
§  But most importantly:
   •  It is a Query Processing Pipeline
      §  Mix and match query processing modules


Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
   exact: |FOIA|
   phrase
      |top|
      |secret|
   amendment: |RECORDS|


Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
   O/FOIA
   phrase
      |top|
      |secret|
   amendment: |RECORDS|


Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   amendment: |records|


Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   amendment: |record|


Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   between
      S/amendment
      |record|
      E/amendment
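The progression of query trees on these slides can be reproduced with a small pipeline of tree-rewriting functions. Everything below is a hypothetical reconstruction: the slides name a generic QNode class but do not show it, so the `QNode` fields, the stage functions, the one-entry `LEMMAS` table, and the rule that adjacent bare words form a phrase are all assumptions chosen to match the trees shown.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QNode:
    op: str                  # "and", "phrase", "term", or "between"
    value: str = ""          # term text when op == "term"
    scope: str = ""          # field prefix such as "exact" or "amendment"
    children: List["QNode"] = field(default_factory=list)

def parse(query):
    """Fielded words become scoped terms; runs of bare words become a phrase."""
    root, words = QNode("and"), []
    def flush():
        if len(words) > 1:
            root.children.append(QNode("phrase", children=list(words)))
        else:
            root.children.extend(words)
        words.clear()
    for word in query.split():
        if ":" in word:
            flush()
            scope, value = word.split(":", 1)
            root.children.append(QNode("term", value, scope))
        else:
            words.append(QNode("term", word))
    flush()
    return root

def map_terms(node, fn):
    """Apply fn to every term node, rebuilding child lists in place."""
    if node.op == "term":
        return fn(node)
    node.children = [map_terms(child, fn) for child in node.children]
    return node

def mark_original(root):
    return map_terms(root, lambda n: QNode("term", "O/" + n.value)
                     if n.scope == "exact" else n)

def mark_lowercase(root):
    def fn(n):
        if n.value.startswith("O/"):
            return n
        if n.scope:
            return QNode("term", n.value.lower(), n.scope)
        return QNode("term", "L/" + n.value.lower())
    return map_terms(root, fn)

LEMMAS = {"records": "record"}  # stand-in for the dictionary lemmatizer

def lemmatize(root):
    return map_terms(root, lambda n: QNode("term", LEMMAS.get(n.value, n.value), n.scope)
                     if n.scope else n)

def query_template(root):
    def fn(n):
        if not n.scope:
            return n
        return QNode("between", children=[QNode("term", "S/" + n.scope),
                                          QNode("term", n.value),
                                          QNode("term", "E/" + n.scope)])
    return map_terms(root, fn)

def render(n):
    if n.op == "term":
        return f"{n.scope}:{n.value}" if n.scope else n.value
    return f"{n.op}({', '.join(render(c) for c in n.children)})"

PIPELINE = [mark_original, mark_lowercase, lemmatize, query_template]

def process(query):
    node = parse(query)
    for stage in PIPELINE:
        node = stage(node)
    return node

print(render(process("exact:FOIA top secret amendment:RECORDS")))
```

Because each stage takes a tree and returns a tree, stages can be reordered or swapped by editing the `PIPELINE` list, which is the "mix and match" property the earlier slide emphasizes.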


The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)
§  start-tag → Starting tag, e.g. S/amendment
§  end-tag → Ending tag, e.g. E/amendment
§  pos-clause → words which must occur between start and end
   •  Note: Requires a nested ScanAnd() operator
§  neg-clause → words which must not occur between start and end
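The semantics of the four arguments can be illustrated with an in-memory check over a flat token list. This is only a sketch of the matching rule as described on the slide; the real operator runs inside Lucene against the index, and the assumption that all pos-clause words must fall inside the same start/end span stands in for the nested ScanAnd() mentioned above.

```python
def between(tokens, start_tag, end_tag, pos_clause, neg_clause=()):
    """True if one start..end span holds every pos word and no neg word."""
    required, forbidden = set(pos_clause), set(neg_clause)
    span = None  # terms seen since the most recent start tag, if any
    for tok in tokens:
        if tok == start_tag:
            span = set()
        elif tok == end_tag and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None
        elif span is not None:
            span.add(tok)
    return False

doc = ["S/statute", "coast", "guard", "E/statute",
       "S/amendment", "record", "department", "E/amendment"]

hit = between(doc, "S/amendment", "E/amendment", ["record"])
outside = between(doc, "S/amendment", "E/amendment", ["coast"])
vetoed = between(doc, "S/amendment", "E/amendment", ["record"], ["department"])
print(hit, outside, vetoed)
```

The second call fails because "coast" occurs only inside the statute span, and the third fails because a forbidden word shares the span with the required one.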


Part The Fourth: Hierarchical Navigation

[Screenshot]


Hierarchies: Requirements
§  Any number of levels
   §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
§  Levels vary across titles
   §  Title 1: 3 levels
   §  Title 26: 8 levels
§  Multiple views:
   §  Children
   §  Ancestors
   §  Ancestor's Siblings
§  Multiple search scopes:
   §  Only children, all descendants, everything


Hierarchies: Ancestor-Siblings
§  US-Code
   •  Title 1
   •  Title 2
      §  Chapter 1
      §  Chapter 2
         –  Part 1
         –  Part 2
            •  Section 2.1
            •  Section 2.2
         –  Part 3
         –  Part 4
      §  Chapter 3
      §  Chapter 4
   •  Title 3


Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§  parentId – ID of the parent node
   §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. 13/000/0/00882
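As a sketch of how these three fields relate, the function below derives them from a node's chain of ancestors. The input shape (a list of `(id, heading, subjectTitle)` tuples, root first, excluding the node itself) and the name `hierarchy_fields` are assumptions; the separators `;` and `//` follow the format shown on the slide.

```python
def hierarchy_fields(ancestor_chain):
    """Build ancestors, encodedAncestors, and parentId for one node."""
    return {
        # space-separated ids, so a term query on any ancestor id matches
        "ancestors": " ".join(node_id for node_id, _, _ in ancestor_chain),
        # display-only breadcrumb: id;heading;subjectTitle joined by //
        "encodedAncestors": "//".join(";".join(entry) for entry in ancestor_chain),
        # the last entry in the chain is the immediate parent
        "parentId": ancestor_chain[-1][0] if ancestor_chain else "",
    }

chain = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
doc_fields = hierarchy_fields(chain)
print(doc_fields["parentId"])
```

Storing the full ancestor list on every document is what lets a single field query select a whole subtree without walking the hierarchy at query time.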


Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

Example: 01/011/1/02032
   01    = USC Title
   011   = Title 11
   1     = An Appendix
   02032 = Sequence # in file
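The fixed-length padding is what makes a plain string sort come out in print order. A minimal sketch, assuming the field widths implied by the two example keys in the deck (2, 3, 1, and 5 digits); the function and parameter names are hypothetical.

```python
def treesort_key(doc_type, title, is_appendix, sequence):
    """Zero-pad each component so string order equals print order."""
    return f"{doc_type:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"

keys = [
    treesort_key(13, 0, False, 882),
    treesort_key(1, 11, True, 2032),
    treesort_key(1, 2, False, 15),
]
ordered = sorted(keys)
for key in ordered:
    print(key)
```

Without the padding, "11" would sort before "2" as a string; with it, "002" correctly precedes "011".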


Hierarchies: Sample Searches
§  Assuming Node = USC-title2-chapter25
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
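The three sample searches can be generated from a node's id and its `ancestors` field value. The helper names below are hypothetical, and the output is just the query string in the form shown on the slide, not a call into any Solr client library.

```python
def children_query(node_id):
    return f"parentId:{node_id}"

def descendants_query(node_id):
    return f"ancestors:{node_id}"

def ancestor_siblings_query(node_id, ancestors):
    """ancestors is the node's space-separated ancestors field value."""
    ids = ancestors.split() + [node_id]
    return "(" + " OR ".join(f"parentId:{i}" for i in ids) + ")"

node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(node, "USC USC-title2"))
```

The ancestor-siblings query returns the children of every node on the path from the root to the current node, which is the set of entries an expanded navigation tree needs to display.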


Contact
§  Paul Nelson
   •  pnelson@searchtechnologies.com
§  Ronald Matamoros
   •  rmatamoros@searchtechnologies.com
§  Search Technologies
   •  http://searchtechnologies.com

