Understanding Data Skew Bias in Application Code
Dominic Delmolino (ddelmolino@networksolutions.com) Director of Database Development, Network Solutions Hotsos Symposium 2005 / Dallas, Texas USA March 2005
Slide 1
Agenda • • • • • • • • • •
Network Solutions business profile COTS billing system evaluation Interlude (commentary on evaluation) Implementation harbingers The problem – bill runs to produce wholesale invoices Vendor dialogs – process examination vs. tuning Analysis challenges The aha! moment Solutions? Epilogue
Slide 2
Network Solutions Business Profile • Domain name rental agent (registrar) – Rent (register) your domain name with us on an annual or multiannual basis – Have the opportunity to buy additional services (email, web site) with us for your registered domain • Most customers interact with web site (retail) via credit cards (7M domains) • Resellers (wholesalers) also exist and are sent invoices (1.5M domains) • No seasonality to business • Major transactions include – Establishment of new customer – Registration or purchase – Renewal – Service modification • “System” processes 500 transactions per second
Slide 3
Cast • Network Solutions – Finance department – Software Engineering group – System Operations group – VP level stakeholders • Software Vendor • Implementation Consultants
Slide 4
The Project • Replace in-house billing system with COTS product – Engineering wants to get out of the business of writing a billing package – Finance wants more functionality – Operations wants professional support • Requirements driven by Finance needs / wants – Perform refunds – Accept and process payments – Accept paper checks – Apply adjustments to account balances – Manage invoices
Slide 5
Interlude (commentary on evaluation) What we did
What we learned later
Checklist based approach. • Can your package do this? • Invoked vendor “yes” reflex
Use techniques from HR interviews 2. Tell me how your package performs the following function
No review of requirements prior to evaluation process
Need to quantity benefits of each requirement to balance against costs. E.g., processing paper checks
No questions about typical reference Should have compared business customer usage profiles – compared model to that of reference customer only on business size to determine match / fit.
Slide 6
Product Choice
A billing system package commonly used by telecommunication companies.
Slide 7
Implementation Harbingers
Perspectives on Vendor and Implementation Team
Slide 8
Implementation Decisions: Business Focus vs. Application Focus Business Focus
Application Focus
Majority of transactions are on-line via credit cards
Application is batch invoice generation oriented, so translate each credit card purchase into a mini batch, even if invoice is never sent.
Products and services use a simple price schedule
Application supports complex charge categories and requires them at the detailed product level
Slide 9
Vendor database recommendations • Normal = settings we use in our largest online DBs • Implementation version is 8.1.7 – 1GB block buffer cache (3x normal) – 20MB sort area size, with retained also 20MB (10x normal) – 300MB shared pool size (2x normal) – Hash_join_enabled to FALSE (normal is TRUE) – Optimizer_index_caching = 0 (normal is 90) – Optimizer_index_cost_adj = 100 (normal is 40) • Regular, periodic table analyze using provided SQL script – Script analyzes every table using sample of 10,000 rows
We begin to wonder about competency and data skew bias, but of course at this point it is too late to re-open product evaluation
Slide 10
Monthly Wholesale Bill Run is Finally Tested • But only after all configuration is done • And only after data importation / migration is done
Waterfall approach meant that one of our most important business process was not tested until the very end of implementation • Initial runs are poor, but after changes to application configuration, final test runs take about 12 hours to process 3,000 wholesalers representing roughly 2M end customers with 2.5M products
My bad – I didn’t question why it needed to take this long Slide 11
Wholesale Bill Runs become a problem
Operations group has set an upper limit of 24 hours to complete, but bill runs start to routinely exceed that. Even though we are reducing the number of wholesalers we bill
Slide 12
Wholesale Bill Runs processing history
Wholesalers vs Time 3500
40.00
3000
35.00
2500 2000
30.00 25.00 20.00
1500 1000
15.00
Wholesalers Hours Poly. (Hours) Poly. (Wholesalers)
10.00 5.00
0
0.00
1Se 1- p-0 O 3 1- ct-0 N 3 1- ovDe 03 1- c -0 Ja 3 1- n -0 Fe 4 1- b-0 M 4 a 1- r-0 Ap 4 1- r-0 M 4 a 1- y-0 Ju 4 n 1- -04 Ju 1- l- 0 Au 4 1- g-0 Se 4 1- p-0 O 4 1- ct-0 N 4 1- ov De 04 1- c-0 Ja 4 1- n -0 Fe 5 b05
500
Slide 13
Wholesale Bill Runs processing history
Services vs Time 2500000
40.00 35.00
2000000
1500000
30.00 25.00 20.00
1000000
500000
15.00
Active Services Hours Poly. (Hours) Poly. (Active Services)
10.00 5.00 0.00
1Se 1- p-0 O 3 1- ct-0 N 3 1- ov De 03 1- c -0 Ja 3 1- n -0 F 4 1- ebM 04 1- a r-0 A 4 1- p rM 04 a 1- y-0 Ju 4 1- n -0 J 4 1- ulAu 04 1- gSe 04 1- p-0 O 4 1- ct-0 N 4 1- ov De 04 1- c -0 Ja 4 1- n -0 Fe 5 b05
0
Slide 14
Wholesale Bill Runs processing history
Charges vs Time 600000 500000
40.00 35.00 30.00
400000 300000 200000
25.00 20.00 15.00
Charges Hours Poly. (Hours) Poly. (Charges)
10.00 100000
0.00
1Se 1- p-0 O 3 c 1- t-0 N 3 1- ovDe 03 1- c-0 Ja 3 1- n -0 Fe 4 1- b-0 M 4 a 1- r-0 A 4 1- p r-0 M 4 a 1- y-0 Ju 4 n 1- -04 Ju 1- l- 0 Au 4 1- g-0 Se 4 1- p-0 O 4 c 1- t-0 No 4 1- vDe 04 1- c-0 Ja 4 1- n -0 Fe 5 b05
0
5.00
Slide 15
Wholesale Bill Runs processing history
Customers vs Time 1750000
40.00
1700000
35.00
1650000 1600000
30.00 25.00
1550000 1500000 1450000 1400000
20.00 15.00
Active Customers Hours Poly. (Hours) Poly. (Active Customers)
10.00 5.00
1300000
0.00
1Se 1- p-0 O 3 1- ct-0 N 3 1- ov De -03 1- c -0 J 3 1- an Fe 04 1- bM 04 1- a r-0 A 1- p r- 4 M 04 1- a y-0 Ju 4 1- n -0 J 4 1- ulAu 04 1- gSe 04 1- pO 04 1- ct-0 N 4 1- ov De -04 1- c-0 J 4 1- an Fe 05 b05
1350000
Slide 16
Initial Vendor Response Approach • Investigative approach – Slowness is due to a h/w or db tuning problem – Send us your statspack report – Send us sar reports – Send us tuxedo statistics • Initial response – Low (92%) hit ratio means you need to 4x your SGA (from 512M to 2G) – You need to change hash_join_enabled to false – You need to set optimizer_index_caching to 0 – You need to set optimizer_index_cost_adj to 0
Slide 17
Oh, by the way… • Are your statistics up to date? – We stopped running analyze in April 2004 due to complex vendor views suddenly changing execution plans (for the worse). – We have begun to implement plan stability and are only running targeted analyzes. • You need to run our dbanalyze script
• Not a single question about our business model
Slide 18
What we learn on our own • 2-3 resource intensive queries – One to get charge categories or rate plans on a per product basis • Common in the telco industry • But we use the same charge category for every product
– One to query every child customer of a wholesaler to see if their services have any charges in the month • Assumption appears to be that every customer has active services • Every active service has MANY charges each month, again a telco-ism (lots of calls per phone line or service)
• Analyze has no effect, queries are written with hints and direct assumptions (i.e., filter conditions applied at far ends of join chains).
Slide 19
Our investigation reveals data skew bias
Wholesalers / Customers
Services
Charges
Application (Telco) Assumption
NSI Business Data Model
Any service could be on any rate plan
All services are on the same rate plan
Customers are deleted when they terminate their services
Customers are always active – can buy new services later
Every active service has charges every bill run
Each service basically has a 1 in 30 chance of having a charge each bill run
Slide 20
Application SQL reflects data skew bias Application (Telco) Assumption
What the code does
NSI Business Data What the code Model should do
Any service could be on any rate plan
Query all services to determine the different rate plans
All services are on the same rate plan
If we use a global rate plan, don’t query the services
Customers are deleted when they terminate their services
Assume all active customers will have charges
Customers are always active – can buy new services later
Don’t assume a customer will have a charge (generate $0 at the end)
Every active service has charges every bill run
Get all services and determine their charges
Each service Get all charges basically has a 1 in and 30 chance of having a charge Slide 21
Sample queries from application
select /*+ ordered use_nl(a,c,sh) */ distinct c.service_id from account a, charge c, service s where a.customer_node_id = :b1 and a.account_id = c.account_id and c.charge_date between :b2 and :b3 and c.service_id = s.service_id order by 1
Commentary – Looping – run once for every customer (:b1) – Forces driving off of every customer (ordered and use_nl hints) – Analyze won’t have any effect on this query
Slide 22
Sample queries from application
select distinct charge_category_id from customer c, service s, service_charge_category scc where (c.root_customer_node_id = :b1 or c.customer_node_id = :b1) and c.customer_node_id = s.customer_node_id and s.service_id = scc.service_id
Commentary – Looping – run once for every customer (:b1)
Slide 23
We use LIOs/join/rows processed to measure statement efficiency • Statement #1 (getting all services associated with charges for each customer) call count cpu elapsed disk query rows Parse 25058 2.98 3.15 0 0 0 Execute 25058 1.36 1.11 0 0 0 Fetch 38710 12.28 74.72 12519 1636259 14282 total 88826 16.62 78.98 12519 1636259 14282 • Also look at rows vs. executions – don’t want to execute if you’re not going to get any rows • LIOs/join/row processed = roughly 40 (we target <= 10 in our own code) • Processing customer record even if no charges exist results in low row per execution count
Slide 24
We use LIOs/join/rows processed to measure statement efficiency • Statement #2 (getting all charge categories) call count cpu elapsed disk query current rows Parse 1 0.02 0.09 14 265 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 2 7.75 143.37 29909 427780 0 2 total 4 7.77 143.46 29923 428045 0 2
• LIOs/join/row processed = over 210,000 • Aggregation is no excuse • A lot of work to determine that we only use one rate plan
Slide 25
The disconnected conversations with vendor • We send them statspack reports and execution plans for the problematic statements • The conversation: – Run dbanalyze – “Which plans are not as you expect? Which plans should change?” – We don’t know, please run dbanalyze – Set hash_join_enabled to false and set optimizer parameters back to defaults – “But your statements don’t use hash joins, and can you tell us how the plans will change when we reset the optimizer parameter?” – These parameters affect more than execution plans Slide 26
What every H/W vendor wants to hear… • More conversation with s/w vendor – Disks look slow – Network may be slow – You should consider a larger server with faster CPU and memory • The response from our Operations group was priceless: – “Are you serious?”
Slide 27
The aha! moment
After 6-8 weeks of back and forth, where we provide more and more evidence that our database and h/w are fine, the vendor account manager finally brings in a functional consultant. The functional consultant promptly validates our conclusions about application data skew bias
Slide 28
Solutions? • Our “hack” – we adapt our data to their process – Prior to a bill run, temporarily delete end customers (those without any charges) by moving them to a “holding” wholesaler – Vendor claims this will take too long and not give any benefit – Our results: • • • •
1 hour to move Processing largest wholesaler drops from 5 hours to 5 minutes 1 hour to move back Total time should drop from 30 hours to 3 hours
– We have not implemented this yet due to concerns about directly modifying data • Vendor analysis turns up configuration items – Option to not maintain running subtotal (we don’t use it any) – Option to not calculate special tariffs (we don’t use any) – These changes result in 40% savings in test environment
Slide 29
Other avoidance strategies?
Email me at ddelmolino@networksolutions.com
Slide 30
Epilogue â&#x20AC;˘ Homegrown applications are not immune to this problem â&#x20AC;˘ Consider web page screen design
Slide 31
Epilogue • During our requirement construction process, our UI team prepares screen specifications • These screen specs have samples of how to display data • The problem? – All screen specs are created and distributed on 8 ½” x 11” paper – Sample data chosen to fit paper size – Design assumptions perpetuate bias toward result set sizes which match sample data sizes – no concept of pagination until it’s too late.
Slide 32
Questions & Answers
Thank you
Slide 33