Floyd Morgan Floyd_Morgan@intuit.com @fmorgan Lucene Revolution, 2011
Agenda • About Me • About Live Community • Live Community Search • NLP • Next Steps • Questions? Answers?
About Me • Principal Software Engineer at Intuit
Intuit QuickBase
Intuit Inc. is a leading provider of business and financial management solutions for small and mid-sized businesses; financial institutions, including banks and credit unions; consumers and accounting professionals. More than 200 applications and 7700 employees worldwide.
About Me • Principal Software Engineer at Intuit • TurboTax Engineering
TurboTax is the nation’s No. 1 rated, best-selling, do-it-yourself tax preparation software. TurboTax helps more than 20 million people a year. $1 billion in revenue
About Me • Principal Software Engineer at Intuit • TurboTax Engineering – Core tax engine – TurboTax Online – TurboTax Live Community
• Central Technology Organization – Live Community Platform
About Live Community • It’s a user contribution system – Q&A • It can be integrated into an application, contextually – Page-to-page relevance • We use social, technology and data – To create our value proposition…assisting users • We launched our Beta in 2007 – TurboTax Online Home & Business • We use open source…primarily open source – Apache HTTP, Ruby on Rails, MySQL, memcached ... • It’s a platform – APIs, skinning, dynamic provisioning (AWS in progress)
Intuit Money Manager, India
QuickBooks Online, UK
devZone, Intuit dev
QuickBooks Online, US
TurboTax Desktop & Online, US
Terminology
Consumers (in the millions)
Contributors (in the thousands)
Top Contributors (in the hundreds)
Employees (contribute too)
Tax Season
Officially begins on December 1 and ends on April 15.
About TurboTax Live Community • Largest community – 150+ servers, 200 thousand concurrent users • Over 23 million users have used the service – Over 8 million last tax season alone • Over 32 million page views last tax season – In-product views in the billions • Over 750 thousand answered questions – 10 thousand questions asked on peak day • Our contributors answer thousands of questions – Top contributor – 70 thousand answers
Demo
Live Community Search • Why Solr? • Auto suggest • In-product search • Web-site search • Instant answer • Instant question • Answer bot • Advertising • Search everywhere • Architecture
Why Solr? • Lots of features/functionality • Ease of integration • We can scale it independently • You’ll need some search expertise…that’s ok – Community and Lucid Imagination! • Search is really important – Search everywhere…
Auto suggest • Provides a glimpse of our vast content • facet query (Solr 1.2) • We use NLP… • It’s used on every search touch point • Second most frequent request
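The slide only says that auto suggest is built on a facet query. A minimal sketch of one common way to do that, assuming Solr's `facet.prefix` parameter over a dedicated suggestion field. The field name `subject_suggest`, the core URL, and the helper `suggest_url` are hypothetical and not taken from the talk.

```ruby
require 'uri'

# Hypothetical core URL; the slides use http://localhost:8090/solr/posts for searches.
SOLR_SELECT = 'http://localhost:8090/solr/posts/select'

# Build a request that returns no documents (rows=0), only the indexed
# terms of `field` that start with what the user has typed so far.
# `subject_suggest` is an assumed field name, not one named in the talk.
def suggest_url(prefix, field: 'subject_suggest', limit: 10)
  params = {
    'q'              => '*:*',
    'rows'           => 0,
    'facet'          => 'true',
    'facet.field'    => field,
    'facet.prefix'   => prefix.downcase,
    'facet.limit'    => limit,
    'facet.mincount' => 1,
    'wt'             => 'json'
  }
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts suggest_url('How do')
```

Each facet value comes back with a count, so suggestions can be ordered by how much content stands behind them.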
In-product “mini” search • Primary search interface for consumers • It appears integrated • Now the most utilized search interface • It makes all content available • Over 3 million users last tax season
# using Solr is easy!
require 'solr'  # solr-ruby gem

c = Solr::Connection.new("http://localhost:8090/solr/posts")
# Post::ANSWERED is a status constant on the application's Post model
c.search("how do i input 1099",
         :filter_queries => "post_status: #{Post::ANSWERED}")
Web-site “full” search • Primary search interface for contributors and employees • More real estate, more facets, more suggestions ... • Faceted search empowers development teams to narrow in on issues • 200+ TurboTax issues discovered last tax season
# using Solr is easy!
require 'solr'  # solr-ruby gem

c = Solr::Connection.new("http://localhost:8090/solr/posts")
# Restrict the search to open (unanswered) posts
c.search("bug",
         :filter_queries => "post_status: #{Post::OPEN}")
Instant answer • Present similar answered questions • Search with the terms of the new question • Narrow the focus to the subject • Show snippet of a recommended answer • Accidental A/B test
Demo
# using Solr is easy!
require 'solr'  # solr-ruby gem

c = Solr::Connection.new("http://localhost:8090/solr/posts")
# Match on the subject field only, and only among answered posts
c.search("how do i input 1099",
         { :query_fields => "subject",
           :filter_queries => "post_status: #{Post::ANSWERED}" })
Instant question • Present similar unanswered questions • Answer reuse • Search with the terms of the answered question • Narrow the focus to the subject • We also use a date filter
“Aren’t we addicted enough!”
Demo
# using Solr is easy!
require 'solr'  # solr-ruby gem

c = Solr::Connection.new("http://localhost:8090/solr/posts")
# Date filter: only open posts created in the last 7 days
# (at_beginning_of_day and 7.days.ago come from ActiveSupport)
today = DateTime.now.at_beginning_of_day.utc.to_time
date_from = 7.days.ago(today).getutc.iso8601
c.search("how do i input 1099",
         { :query_fields => "subject",
           :filter_queries => "post_status: #{Post::OPEN} AND created_at_d:[#{date_from} TO *]" })
Answer bot • We continue to search for you – The day after you ask • Send an email • Runs for 7 days • We only send another email if the results have changed • From our explicit feedback – 39% answered question
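The follow-up rule on this slide (re-run the search daily, stop after 7 days, email only when the results differ from the last email) can be sketched as a small state holder. This is an illustrative sketch, not the Live Community implementation: the class name `AnswerBotFollowUp`, the fingerprinting approach, and the choice to skip empty result sets are all assumptions.

```ruby
require 'digest'

# Hypothetical sketch of the answer bot's "email only on change" rule.
class AnswerBotFollowUp
  RUN_DAYS = 7 # the slide says the bot runs for 7 days

  def initialize
    @last_fingerprint = nil
    @days_run = 0
  end

  # result_ids: ids of the posts returned by today's re-run of the search.
  # Returns true when an email should be sent today.
  def email_today?(result_ids)
    return false if @days_run >= RUN_DAYS
    @days_run += 1
    return false if result_ids.empty? # assumption: nothing to report, no email

    # Order-sensitive fingerprint: a re-ranked result list counts as a change.
    fingerprint = Digest::SHA1.hexdigest(result_ids.join(','))
    changed = fingerprint != @last_fingerprint
    @last_fingerprint = fingerprint if changed
    changed
  end
end

bot = AnswerBotFollowUp.new
puts bot.email_today?(%w[post1 post2]) # first results: send
puts bot.email_today?(%w[post1 post2]) # unchanged: stay quiet
```

Sorting the ids before hashing would make the check ignore pure ranking changes; which behavior is preferable depends on how much email users tolerate.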
Advertising • We use our user-generated content in advertising • Has a 300% higher click-through rate than static banner ads • Ads displayed throughout the tax season on many ad networks • Content selection is automated and continuous
[Diagram: content selection components – Logs, MapReduce, Carrot2, Solr, Heuristics]
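The diagram names MapReduce over logs as one input to content selection. As a rough illustration of the counting idea only, here is an in-memory map and reduce over search log lines in plain Ruby. The log format (date, tab, query text) and the function names are hypothetical, and the Carrot2 clustering, Solr lookup, and heuristics steps are not shown.

```ruby
# Hypothetical log format: one search per line, "<ISO date>\t<query text>".

# Map step: emit ([date, normalized query], 1) for each usable log line.
def map_line(line)
  date, query = line.chomp.split("\t", 2)
  return nil if query.nil? || query.strip.empty?
  [[date, query.strip.downcase], 1]
end

# Reduce step: sum the counts per (date, query) key.
def reduce_counts(pairs)
  pairs.each_with_object(Hash.new(0)) { |(key, n), totals| totals[key] += n }
end

# Top queries for one day, most frequent first (ties broken alphabetically).
def trending(lines, date, top_n = 3)
  counts = reduce_counts(lines.map { |l| map_line(l) }.compact)
  counts.select { |(d, _q), _n| d == date }
        .sort_by { |(_d, q), n| [-n, q] }
        .first(top_n)
        .map { |(_d, q), n| [q, n] }
end

log = ["2011-05-21\tPTP\n", "2011-05-21\tptp\n", "2011-05-21\t1099\n"]
p trending(log, "2011-05-21")
```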
<?xml version="1.0" encoding="UTF-8"?>
<lc_trending end_date="2011-05-21" include_popular="true" type="queries" duration="day">
  <topic>
    <rank>1</rank>
    <text>Ptp</text>
    <post>
      <post_id>aBHMBWxzar4lKMacfArRo0</post_id>
      <subject>Final K-1 Disposition of PTP Units</subject>
      <detail>I bought units in a PTP in five separate transactions in 2008; I sold all my units in five separate transactions in 2010. TT does not allow me to report all 5 transactions while stepping through the K-1 form -- these transactions are reported on Schedule D, but also need to be on Form 4797, Part II, Box 10. I can't seem to make the linkage work. I would appreciate some guidance on how to make this happen.</detail>
      <response>OK, several steps needed for your situation: 1) on the K-1 on the screen entitled Describe the Partnership Disposal, choose "Disposition was not via a sale" 2) Then search for the topic "sale of business property" you will be taked to a topic entitled "Any Other Property Sales?" - select the first option. Ove rthe next few screens here you will have the opportunityut to enter the sale amounts associated witht he Form 4797.
3) then choose the topic on the income landing table for "Stocke, Mutual Funds, Bonds, other - here you will enter the rest of the sale, that portion attributable to capital gains. Hope this helps you, </response>
      <viewsCount>60</viewsCount>
      <answersCount>2</answersCount>
      <asker>Xuxan</asker>
      <display_post_url>https://ttlc.intuit.com/post/show_full/aBHMBWxzar4lKMacfArRo0?rmode=ad</display_post_url>
    </post>
Search everywhere • Search first, ask second – Used to be ask first, search later or never! • Auto complete everywhere too – 64 bit Linux, 10 (8 core) slaves, 300 req/s • Search requests – 900% increase • Questions asked – 50% decrease…is that good? • Increased consumption – 38% users, 43% content…very good!
Architecture
Search cluster
App server
Indexing server
Database cluster
NLP • Search is not enough…unfortunately • Our domain is noisy…ugly at times
Uh, what?
Too much what!
?
I wish NLP could help!
NLP • Search is not enough…unfortunately • Our domain is noisy…ugly at times • How it works…
HwO do iput 10 99 i don,t know what to do need help help me.
Where do I enter a 1099?
schema.xml
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
dictionary
<?xml version="1.0" encoding="US-ASCII"?>
<dictionary>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="suitcas">suitcase</entry>
  <entry score="10" root="form" synonym="none" domain="ttlc" id="2210"></entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="xrai">x-ray</entry>
  <entry score="10" root="none" synonym="townhom" domain="ttlc" id="townhous">townhouse</entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="grosssal">gross sale</entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="trinidad">Trinidad</entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="home"></entry>
  <entry score="10" root="none" synonym="know" domain="ttlc" id="knew"></entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="massachusett">Massachusetts</entry>
  <entry score="10" root="none" synonym="none" domain="ttlc" id="denver">Denver</entry>
  <entry score="5" root="none" synonym="none" domain="ttlc" id="instead"></entry>
  <entry score="10" root="none" synonym="unallow" domain="ttlc" id="disallow">not allowed</entry>
  <entry score="5" root="none" synonym="see" domain="ttlc" id="saw"></entry>
regular expressions (many)
if text =~ / any/
  text.gsub!(/ any where /, ' anywhere ')
  text.gsub!(/ any(body| body| one) /, ' anyone ')
  text.gsub!(/ any( thing| things|things) /, ' anything ')
  text.gsub!(/ any(one|thing|where) else /, ' any\1 ')
end
if text =~ / don /
  text.gsub!(/ don i /, ' do not i ')
  text.gsub!(/ don (have|know|see|want) /, ' do not \1 ')
  text.gsub!(/ (are|be|have|is|was|were) don /, ' \1 done ')
  text.gsub!(/ don (not|nt|t) /, ' do not ')
end
text.gsub!(/ (do|can) (ai|ii) /, ' \1 i ')
text.gsub!(/ d (oyou|you) /, ' do you ')
text.gsub!(/ (1|ai|ii|my) (did|do|had|have|was) /, ' i \2 ')
text.gsub!(/ crap{1,10} /, ' crap ')
text.gsub!(/ gr{1,} /, ' ')
Spell Checker Stemmer (Porter) Word Collocation Stop Phrase Correction Stop Word Removal Synonyms Substitution Tax Domain Correction Phrase Encoding
# NLP is not easy!
# this class wraps our NLP
sf = SemanticFilter.new
# does it work?
sf.act_on_post( "HwO do iput 10 99 i don,t know what to do need help help me." )
# => [" wheretoent 1099 "]
sf.act_on_post( "Where do I enter a 1099?" )
# => [" wheretoent 1099 "]
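The idea behind the example above, that a noisy question and a clean one should reduce to the same key, can be pictured as a chain of small text transforms. The sketch below is a toy stand-in and not the SemanticFilter itself: the lookup tables, stage order, and the `normalize` helper are all assumptions, and real spell checking, stemming, and phrase encoding are far more involved.

```ruby
# Illustrative only: tiny hand-written tables stand in for the real
# spell checker, dictionaries and stop word lists.
SPELLING   = { 'iput' => 'input', 'hwo' => 'how' }.freeze
SYNONYMS   = { 'input' => 'enter' }.freeze
STOP_WORDS = %w[how where do i a the to].freeze

# Each stage takes a list of words and returns a new list.
STAGES = [
  ->(words) { words.map { |w| SPELLING.fetch(w, w) } }, # spelling correction
  ->(words) { words.map { |w| SYNONYMS.fetch(w, w) } }, # synonym substitution
  ->(words) { words - STOP_WORDS },                     # stop word removal
  ->(words) { words.uniq }                              # drop repeated words
].freeze

def normalize(text)
  words = text.downcase.gsub(/[^a-z0-9\s]/, ' ').split
  STAGES.reduce(words) { |acc, stage| stage.call(acc) }.join(' ')
end

puts normalize("HwO do I iput a 1099?")
puts normalize("Where do I enter a 1099?")
```

Keeping every stage as a function from word list to word list makes it easy to reorder stages or test one in isolation, which matters when rules interact as heavily as they do in noisy user text.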
NLP • Search is not enough…unfortunately • Our domain is noisy…ugly at times • How it works… • It works well, but it’s not perfect
“Stop guessing what I’m looking for!”
NLP • Search is not enough…unfortunately • Our domain is noisy…ugly at times • How it works… • It works well, but it’s not perfect • Not just for search…
Recommendations • Deliver unanswered questions to contributors • Too much content to scan manually • Based on past answering behavior • Recommend a question to multiple contributors • Uses Mahout machine learning library
[Diagram labels: Answered, Unanswered, NLP, User vectors, Post vectors, Mahout, Heuristics]
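The talk uses Mahout for this step. The sketch below does not use Mahout; it is a plain-Ruby illustration of matching a post vector against user vectors, assuming term-frequency vectors and cosine similarity. That comparison method, the function names, and the sample contributors are assumptions made for the example.

```ruby
# Term-frequency vector for a piece of (already normalized) text.
def term_vector(text)
  text.split.each_with_object(Hash.new(0)) { |term, v| v[term] += 1 }
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

# user_vectors: { contributor => vector built from questions they answered }
# Returns up to top_n contributors, so one question can go to several people.
def recommend_to(post_vector, user_vectors, top_n = 2)
  user_vectors
    .map { |user, vec| [user, cosine(post_vector, vec)] }
    .select { |_user, score| score > 0 }
    .sort_by { |user, score| [-score, user] }
    .first(top_n)
    .map(&:first)
end

users = {
  'alice' => term_vector('enter 1099 misc enter 1099'),
  'bob'   => term_vector('mortgage interest deduction'),
  'carol' => term_vector('1099 k1 partnership')
}
p recommend_to(term_vector('enter 1099'), users)
```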
Next Steps • We’re going to rewrite it! … most of it ;) • Real-time indexing • Question vs. Query • Social feedback – Page ranking • Social dictionaries – Content classification • Beer?!
Thank you. Floyd_Morgan@intuit.com @fmorgan
Appendix • User search • SEO