R at Microsoft
Guillermo Julca (Yemo)
The Marketing Advantage, Inc @2017
• 1993: Research project in Auckland, NZ (Ross Ihaka and Robert Gentleman)
• 1995: Released as open-source software (generally compatible with the “S” language)
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
Memory bound: the product can only process datasets that fit into the available memory.
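As a rough illustration of the alternative to in-memory processing, the sketch below computes a mean and variance one chunk at a time, so only a single block of data is held in memory. It is not taken from the deck and is not how any Microsoft R function is implemented; the names `chunks` and `chunked_mean_variance` are hypothetical, and the in-memory iterable merely stands in for blocks read from disk.

```python
from typing import Iterable, Iterator, List, Tuple


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` values (stand-in for disk blocks)."""
    block: List[float] = []
    for v in values:
        block.append(v)
        if len(block) == size:
            yield block
            block = []
    if block:
        yield block


def chunked_mean_variance(values: Iterable[float], size: int = 1000) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance), holding one chunk in memory at a time."""
    n, mean, m2 = 0, 0.0, 0.0
    for block in chunks(values, size):
        # Summarize this chunk on its own.
        bn = len(block)
        bmean = sum(block) / bn
        bm2 = sum((x - bmean) ** 2 for x in block)
        # Merge the chunk summary into the running summary
        # (pairwise update of count, mean, and sum of squared deviations).
        delta = bmean - mean
        total = n + bn
        mean += delta * bn / total
        m2 += bm2 + delta * delta * n * bn / total
        n = total
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance
```

Because each chunk is reduced to a small summary before being merged, the same pattern would also allow chunks to be summarized in parallel and combined afterwards.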
Because the Intel Math Kernel Library (MKL) is included in Microsoft R Open, the performance of a generic R solution is generally better. MKL replaces the standard R implementations of Basic Linear Algebra Subroutines (BLAS) and the LAPACK library with multithreaded versions. As a result, calls to those low-level routines tend to execute faster on Microsoft R than on a conventional installation of R.
More at deployr.revolutionanalytics.com
• Multithreaded library replaces standard BLAS/LAPACK algorithms
• Intel MKL on Windows/Linux; Accelerate on Mac
• High-performance algorithms
• Sequential → Parallel
• Uses as many threads as there are available cores
• No need to change any R code
• Included with RRO binary distributions
R Usage Growth
• Rexer Data Miner Survey, 2007-2013
Language Popularity
• IEEE Spectrum Top Programming Languages, July 2014 (#9: R)
More at Revolutions blog: blog.revolutionanalytics.com/popularity
Data Step
▪ Data import – Delimited, Fixed, SAS, SPSS, ODBC
▪ Variable creation & transformation
▪ Recode variables
▪ Factor variables
▪ Missing value handling
▪ Sort, Merge, Split
▪ Aggregate by category (means, sums)

Descriptive Statistics
▪ Min / Max, Mean, Median (approx.)
▪ Quantiles (approx.)
▪ Standard Deviation
▪ Variance
▪ Correlation
▪ Covariance
▪ Sum of Squares (cross product matrix for set variables)
▪ Pairwise Cross tabs
▪ Risk Ratio & Odds Ratio
▪ Cross-Tabulation of Data (standard tables & long form)
▪ Marginal Summaries of Cross Tabulations

Statistical Tests
▪ Chi Square Test
▪ Kendall Rank Correlation
▪ Fisher’s Exact Test
▪ Student’s t-Test

Sampling
▪ Subsample (observations & variables)
▪ Random Sampling

Predictive Models
▪ Sum of Squares (cross product matrix for set variables)
▪ Multiple Linear Regression
▪ Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
▪ Covariance & Correlation Matrices
▪ Logistic Regression
▪ Classification & Regression Trees
▪ Predictions/scoring for models
▪ Residuals for all models

Variable Selection
▪ Stepwise Regression

Simulation
▪ Simulation (e.g. Monte Carlo)
▪ Parallel Random Number Generation

Cluster Analysis
▪ K-Means

Classification
▪ Decision Trees
▪ Decision Forests
▪ Gradient Boosted Decision Trees
▪ Naïve Bayes

Combination
▪ PEMA-R API
▪ rxDataStep
▪ rxExec

(Slide legend: New in v7.3; Coming in v7.4)
R IN THE CLOUD
• Exposing the expertise of data scientists as APIs • Bringing the utility of data science to applications
• Addressing the Data Science talent gap
Azure: Huge infrastructure scale
19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing
▪ Central US: Iowa
▪ East US: Virginia
▪ East US 2: Virginia
▪ US Gov: Iowa
▪ US Gov: Virginia
▪ North Central US: Illinois
▪ South Central US: Texas
▪ West US: California
▪ Brazil South: Sao Paulo
▪ North Europe: Ireland
▪ West Europe: Netherlands
▪ East Asia: Hong Kong
▪ SE Asia: Singapore
▪ Japan East: Saitama
▪ Japan West: Osaka
▪ China North *: Beijing
▪ China South *: Shanghai
▪ India West: TBD
▪ India East: TBD
▪ Australia East: Sydney
▪ Australia West: Melbourne

(Map legend: Announced; Operational; * Operated by 21Vianet)

▪ 100+ datacenters
▪ One of the top 3 networks in the world (coverage, speed, connections)
▪ 2x AWS and 6x Google in number of offered regions
▪ G Series – Largest VM available in the market – 32 cores, 448GB RAM, SSD…