Avoiding the Big Data – Small Info Syndrome Roberto Rigobon
MIT Sloan, NBER, CSAC
Big Distance!
Data
The world is not lacking of Data
Information
Lacking of Careful Empirics
Knowledge
Lacking of Managerial Data Analysis
Data Types Type
Advantages
Disadvantages
Standard
• Representative • High Quality
• • • • •
Limited Collection Cost Understand process Identification
Data Types Type
Advantages
Disadvantages
Standard
• Representative • High Quality
• • • • •
Limited Collection Cost Understand process Identification
Big Data
• Size, details, frequency • Might solve some identification problems • Might improve measurement
• • • • •
Statistics is hard Not representative Need question Understand process Identification
Principles of Good Big-Data Users Warnings Purpose of collection
• Have a question before you have the data! • Data collection is part of the design
Understand process
• Correlation is not causality • Forecast-in-sample is useless
Empirical procedure
• Identification is usually crucial • Control for spurious correlation
Outcome
• What is the big data providing? • Visualization, • Measurement, • or Prediction?
Inflation and Price Indexes Alberto Cavallo
We want to produce alternative measures of inflation We need prices We need methodology to weight those prices We need to understand product introductions and discontinuations We need to understand store behavior How to collect prices of products not sold on the internet? Services?
Online Information and Indexes Our Approach to Daily Inflation Statistics
1
2 Use scraping technology
Connect to thousands of online retailers every day
3 Find individual items
4 Store and process key item information in a database
• Date • Item • Price • Description
5 Develop daily inflation statistics for ~20 countries
How do we collect data?
Our prices are collected from public online sources, using a technique called “web scraping”
A software downloads a webpage, analyses the html code, “scrapes” price data, and stores it in a database
Countries covered
Argentina
USA
Annual Inflation Rates Argentina
Colombia
Russia
Australia
Germany
UK
China – Supermarket Index
Ireland
Venezuela
US Recession
Sep 16 2008
US Sector Inflation Series Food and Beverage Inflation Index
Energy and Transportation Inflation Index
Furnishing and Household Equipment Inflation Index
Recreation and Electronics Inflation Index
Properties of Inflation Indexes Congruence Stores keep markups between online and offline prices relatively constant in the medium run
Anticipation Online Price Indexes trends anticipate official inflation shifts.
Online prices are easier to change, retailers are more competitive, and consumers have less memory.
Demand trends Changes in inflation trends by retailers reflect changes in the demand they are facing
Measuring Anticipation Due to the behavior of online stores in some countries we find anticipation
We compute anticipation in a simple VAR framework
Compute impulse responses
USA 1.8
1.6
1.4
1.2 95 1
90 10
0.8
5 mean
0.6
0.4
0.2
16 0 -1
0
1
2
3
4
5
6
7
8
9
10
11
12
Anticipation of US CPI Sectors 1% Shock to PriceStats Aggregate Inflation 2
1% Shock to PriceStats Food Inflation
1% Shock to PriceStats Household Inflation
1.5
1.5
1
1
.5
.5
1.5
1
.5
0
0 0
1
2
3
4 5 Month
CI
6
7
8
9
0 0
1
2
Cumulative IRF
3
4 5 Month
CI
1% Shock to PriceStats Health Inflation
6
7
8
9
0
2
Cumulative IRF
3
4 5 Month
CI
1% Shock to PriceStats Transportation Inflation
1.5
1
6
7
8
9
Cumulative IRF
1% Shock to PriceStats Recreation Inflation
2
2
1.5
1.5
1
1
.5
.5
1
.5
0
0 0
1
2
3 CI
Source: PriceStats
4 5 Month
6
7
8
Cumulative IRF
9
0 0
1
2
3 CI
4 5 Month
6
7
8
Cumulative IRF
9
0
1
2
3 CI
4 5 Month
6
7
8
Cumulative IRF
9
Global Inflation Series ď‚— World, Emerging Markets, Developed Markets, and Eurozone ď‚— By Sector: All, Food, and Fuel
Apr-13
Feb-13 Mar-13
Jan-13
Dec-12
Nov-12
Oct-12
Sep-12
3
Aug-12
Jul-12
Jun-12
May-12
Apr-12
Mar-12
Feb-12
Jan-12
Dec-11
Nov-11
Oct-11
Sep-11
Aug-11
Jul-11
Jun-11
May-11
Apr-11
Feb-11 Mar-11
Jan-11
Dec-10
Nov-10
Oct-10
Sep-10
Aug-10
Jul-10
Jun-10
May-10
Apr-10
Mar-10
Inflation Rate (%)
PriceStats World Inflation Series World Inflation (Annual Inflation Rate)
4
3.5 PriceStats World Inflation
CPI - World Inflation
2.5
2
1.5
Big Data Advantages
Challenges
Cost
Non-Representative
Volume, Variety, and Velocity
Econometrics is much harder • Understanding of the phenomena is even more important • No technique can substitute for deeper thinking
If collection is properly designed it can solve problems of identification
If collection is NOT properly design, it mostly computes irrelevant correlations
Good Outcomes: • Visualization • Measurement • Predictability
Spurious Relationships