Delmolino - Understanding Data Skew Bias by rocker12

Understanding Data Skew Bias in Application Code

Dominic Delmolino (ddelmolino@networksolutions.com) Director of Database Development, Network Solutions Hotsos Symposium 2005 / Dallas, Texas USA March 2005

Slide 1

Agenda • • • • • • • • • •

Network Solutions business profile COTS billing system evaluation Interlude (commentary on evaluation) Implementation harbingers The problem – bill runs to produce wholesale invoices Vendor dialogs – process examination vs. tuning Analysis challenges The aha! moment Solutions? Epilogue

Slide 2

Network Solutions Business Profile • Domain name rental agent (registrar) – Rent (register) your domain name with us on an annual or multiannual basis – Have the opportunity to buy additional services (email, web site) with us for your registered domain • Most customers interact with web site (retail) via credit cards (7M domains) • Resellers (wholesalers) also exist and are sent invoices (1.5M domains) • No seasonality to business • Major transactions include – Establishment of new customer – Registration or purchase – Renewal – Service modification • “System” processes 500 transactions per second

Slide 3

Cast • Network Solutions – Finance department – Software Engineering group – System Operations group – VP level stakeholders • Software Vendor • Implementation Consultants

Slide 4

The Project • Replace in-house billing system with COTS product – Engineering wants to get out of the business of writing a billing package – Finance wants more functionality – Operations wants professional support • Requirements driven by Finance needs / wants – Perform refunds – Accept and process payments – Accept paper checks – Apply adjustments to account balances – Manage invoices

Slide 5

Interlude (commentary on evaluation) What we did

What we learned later

Checklist based approach. • Can your package do this? • Invoked vendor “yes” reflex

Use techniques from HR interviews 2. Tell me how your package performs the following function

No review of requirements prior to evaluation process

Need to quantity benefits of each requirement to balance against costs. E.g., processing paper checks

No questions about typical reference Should have compared business customer usage profiles – compared model to that of reference customer only on business size to determine match / fit.

Slide 6

Product Choice

A billing system package commonly used by telecommunication companies.

Slide 7

Implementation Harbingers

Perspectives on Vendor and Implementation Team

Slide 8

Implementation Decisions: Business Focus vs. Application Focus Business Focus

Application Focus

Majority of transactions are on-line via credit cards

Application is batch invoice generation oriented, so translate each credit card purchase into a mini batch, even if invoice is never sent.

Products and services use a simple price schedule

Application supports complex charge categories and requires them at the detailed product level

Slide 9

Vendor database recommendations • Normal = settings we use in our largest online DBs • Implementation version is 8.1.7 – 1GB block buffer cache (3x normal) – 20MB sort area size, with retained also 20MB (10x normal) – 300MB shared pool size (2x normal) – Hash_join_enabled to FALSE (normal is TRUE) – Optimizer_index_caching = 0 (normal is 90) – Optimizer_index_cost_adj = 100 (normal is 40) • Regular, periodic table analyze using provided SQL script – Script analyzes every table using sample of 10,000 rows

We begin to wonder about competency and data skew bias, but of course at this point it is too late to re-open product evaluation

Slide 10

Monthly Wholesale Bill Run is Finally Tested • But only after all configuration is done • And only after data importation / migration is done

Waterfall approach meant that one of our most important business process was not tested until the very end of implementation • Initial runs are poor, but after changes to application configuration, final test runs take about 12 hours to process 3,000 wholesalers representing roughly 2M end customers with 2.5M products

My bad – I didn’t question why it needed to take this long Slide 11

Wholesale Bill Runs become a problem

Operations group has set an upper limit of 24 hours to complete, but bill runs start to routinely exceed that. Even though we are reducing the number of wholesalers we bill

Slide 12

Wholesale Bill Runs processing history

Wholesalers vs Time 3500

40.00

3000

35.00

2500 2000

30.00 25.00 20.00

1500 1000

15.00

Wholesalers Hours Poly. (Hours) Poly. (Wholesalers)

10.00 5.00

0.00

1Se 1- p-0 O 3 1- ct-0 N 3 1- ovDe 03 1- c -0 Ja 3 1- n -0 Fe 4 1- b-0 M 4 a 1- r-0 Ap 4 1- r-0 M 4 a 1- y-0 Ju 4 n 1- -04 Ju 1- l- 0 Au 4 1- g-0 Se 4 1- p-0 O 4 1- ct-0 N 4 1- ov De 04 1- c-0 Ja 4 1- n -0 Fe 5 b05

500

Slide 13

Wholesale Bill Runs processing history

Services vs Time 2500000

40.00 35.00

2000000

1500000

30.00 25.00 20.00

1000000

500000

15.00

Active Services Hours Poly. (Hours) Poly. (Active Services)

10.00 5.00 0.00

1Se 1- p-0 O 3 1- ct-0 N 3 1- ov De 03 1- c -0 Ja 3 1- n -0 F 4 1- ebM 04 1- a r-0 A 4 1- p rM 04 a 1- y-0 Ju 4 1- n -0 J 4 1- ulAu 04 1- gSe 04 1- p-0 O 4 1- ct-0 N 4 1- ov De 04 1- c -0 Ja 4 1- n -0 Fe 5 b05

Slide 14

Wholesale Bill Runs processing history

Charges vs Time 600000 500000

40.00 35.00 30.00

400000 300000 200000

25.00 20.00 15.00

Charges Hours Poly. (Hours) Poly. (Charges)

10.00 100000

0.00

1Se 1- p-0 O 3 c 1- t-0 N 3 1- ovDe 03 1- c-0 Ja 3 1- n -0 Fe 4 1- b-0 M 4 a 1- r-0 A 4 1- p r-0 M 4 a 1- y-0 Ju 4 n 1- -04 Ju 1- l- 0 Au 4 1- g-0 Se 4 1- p-0 O 4 c 1- t-0 No 4 1- vDe 04 1- c-0 Ja 4 1- n -0 Fe 5 b05

5.00

Slide 15

Wholesale Bill Runs processing history

Customers vs Time 1750000

40.00

1700000

35.00

1650000 1600000

30.00 25.00

1550000 1500000 1450000 1400000

20.00 15.00

Active Customers Hours Poly. (Hours) Poly. (Active Customers)

10.00 5.00

1300000

0.00

1Se 1- p-0 O 3 1- ct-0 N 3 1- ov De -03 1- c -0 J 3 1- an Fe 04 1- bM 04 1- a r-0 A 1- p r- 4 M 04 1- a y-0 Ju 4 1- n -0 J 4 1- ulAu 04 1- gSe 04 1- pO 04 1- ct-0 N 4 1- ov De -04 1- c-0 J 4 1- an Fe 05 b05

1350000

Slide 16

Initial Vendor Response Approach • Investigative approach – Slowness is due to a h/w or db tuning problem – Send us your statspack report – Send us sar reports – Send us tuxedo statistics • Initial response – Low (92%) hit ratio means you need to 4x your SGA (from 512M to 2G) – You need to change hash_join_enabled to false – You need to set optimizer_index_caching to 0 – You need to set optimizer_index_cost_adj to 0

Slide 17

Oh, by the way… • Are your statistics up to date? – We stopped running analyze in April 2004 due to complex vendor views suddenly changing execution plans (for the worse). – We have begun to implement plan stability and are only running targeted analyzes. • You need to run our dbanalyze script

• Not a single question about our business model

Slide 18

What we learn on our own • 2-3 resource intensive queries – One to get charge categories or rate plans on a per product basis • Common in the telco industry • But we use the same charge category for every product

– One to query every child customer of a wholesaler to see if their services have any charges in the month • Assumption appears to be that every customer has active services • Every active service has MANY charges each month, again a telco-ism (lots of calls per phone line or service)

• Analyze has no effect, queries are written with hints and direct assumptions (i.e., filter conditions applied at far ends of join chains).

Slide 19

Our investigation reveals data skew bias

WholesalersÂ /Â Customers

Services

Charges

Application (Telco) Assumption

NSI Business Data Model

Any service could be on any rate plan

All services are on the same rate plan

Customers are deleted when they terminate their services

Customers are always active â&#x20AC;&#x201C; can buy new services later

Every active service has charges every bill run

Each service basically has a 1 in 30 chance of having a charge each bill run

Slide 20

Application SQL reflects data skew bias Application (Telco) Assumption

What the code does

NSI Business Data What the code Model should do

Any service could be on any rate plan

Query all services to determine the different rate plans

All services are on the same rate plan

If we use a global rate plan, donâ&#x20AC;&#x2122;t query the services

Customers are deleted when they terminate their services

Assume all active customers will have charges

Customers are always active â&#x20AC;&#x201C; can buy new services later

Donâ&#x20AC;&#x2122;t assume a customer will have a charge (generate $0 at the end)

Every active service has charges every bill run

Get all services and determine their charges

Each service Get all charges basically has a 1 in and 30 chance of having a charge Slide 21

Sample queries from application

select /*+ ordered use_nl(a,c,sh) */ distinct c.service_id from account a, charge c, service s where a.customer_node_id = :b1 and a.account_id = c.account_id and c.charge_date between :b2 and :b3 and c.service_id = s.service_id order by 1

Commentary – Looping – run once for every customer (:b1) – Forces driving off of every customer (ordered and use_nl hints) – Analyze won’t have any effect on this query

Slide 22

Sample queries from application

select distinct charge_category_id from customer c, service s, service_charge_category scc where (c.root_customer_node_id = :b1 or c.customer_node_id = :b1) and c.customer_node_id = s.customer_node_id and s.service_id = scc.service_id

Commentary – Looping – run once for every customer (:b1)

Slide 23

We use LIOs/join/rows processed to measure statement efficiency • Statement #1 (getting all services associated with charges for each customer) call count cpu elapsed disk query rows Parse 25058 2.98 3.15 0 0 0 Execute 25058 1.36 1.11 0 0 0 Fetch 38710 12.28 74.72 12519 1636259 14282 total 88826 16.62 78.98 12519 1636259 14282 • Also look at rows vs. executions – don’t want to execute if you’re not going to get any rows • LIOs/join/row processed = roughly 40 (we target <= 10 in our own code) • Processing customer record even if no charges exist results in low row per execution count

Slide 24

We use LIOs/join/rows processed to measure statement efficiency • Statement #2 (getting all charge categories) call count cpu elapsed disk query current rows Parse 1 0.02 0.09 14 265 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 2 7.75 143.37 29909 427780 0 2 total 4 7.77 143.46 29923 428045 0 2

• LIOs/join/row processed = over 210,000 • Aggregation is no excuse • A lot of work to determine that we only use one rate plan

Slide 25

The disconnected conversations with vendor • We send them statspack reports and execution plans for the problematic statements • The conversation: – Run dbanalyze – “Which plans are not as you expect? Which plans should change?” – We don’t know, please run dbanalyze – Set hash_join_enabled to false and set optimizer parameters back to defaults – “But your statements don’t use hash joins, and can you tell us how the plans will change when we reset the optimizer parameter?” – These parameters affect more than execution plans Slide 26

What every H/W vendor wants to hear… • More conversation with s/w vendor – Disks look slow – Network may be slow – You should consider a larger server with faster CPU and memory • The response from our Operations group was priceless: – “Are you serious?”

Slide 27

The aha! moment

After 6-8 weeks of back and forth, where we provide more and more evidence that our database and h/w are fine, the vendor account manager finally brings in a functional consultant. The functional consultant promptly validates our conclusions about application data skew bias

Slide 28

Solutions? • Our “hack” – we adapt our data to their process – Prior to a bill run, temporarily delete end customers (those without any charges) by moving them to a “holding” wholesaler – Vendor claims this will take too long and not give any benefit – Our results: • • • •

1 hour to move Processing largest wholesaler drops from 5 hours to 5 minutes 1 hour to move back Total time should drop from 30 hours to 3 hours

– We have not implemented this yet due to concerns about directly modifying data • Vendor analysis turns up configuration items – Option to not maintain running subtotal (we don’t use it any) – Option to not calculate special tariffs (we don’t use any) – These changes result in 40% savings in test environment

Slide 29

Other avoidance strategies?

Email me at ddelmolino@networksolutions.com

Slide 30

Epilogue â&#x20AC;˘ Homegrown applications are not immune to this problem â&#x20AC;˘ Consider web page screen design

Slide 31

Epilogue • During our requirement construction process, our UI team prepares screen specifications • These screen specs have samples of how to display data • The problem? – All screen specs are created and distributed on 8 ½” x 11” paper – Sample data chosen to fit paper size – Design assumptions perpetuate bias toward result set sizes which match sample data sizes – no concept of pagination until it’s too late.

Slide 32

Questions & Answers

Thank you

Slide 33