Instance-based Learning
Corso di Apprendimento Automatico, Laurea Magistrale in Informatica
Nicola Fanizzi, Dipartimento di Informatica, Università degli Studi di Bari
November 12, 2009
Outline
k-NEAREST NEIGHBOR
Locally weighted regression
kD-trees, ball trees
Rectangular generalizations
Radial basis functions
Case-based reasoning
Lazy and eager learning
Instance-Based Learning I
Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$
NEAREST NEIGHBOR: given query instance $x_q$, first locate the nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$
k-NEAREST NEIGHBOR: given $x_q$, take a vote among its $k$ nearest neighbors (if discrete-valued target function); take the mean of the $f$ values of the $k$ nearest neighbors (if real-valued):

$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$
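Not part of the original slides: a minimal brute-force sketch of both estimators, assuming numeric attributes and NumPy; the function name knn_predict is illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3, regression=False):
    """Brute-force k-NN: majority vote (discrete target) or mean (real-valued target)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)        # Euclidean distances to x_q
    nn_idx = np.argsort(dists)[:k]                        # indices of the k nearest neighbors
    if regression:
        return np.mean(y_train[nn_idx])                   # mean of the neighbors' f values
    return Counter(y_train[nn_idx]).most_common(1)[0][0]  # majority vote among the neighbors

# toy usage
X = np.array([[0.0, 0.0], [1.0, 0.1], [0.9, 0.9], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.8]), k=3))       # -> 0 (two of the three nearest have label 0)
```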
Instance-Based Learning II
kNN and Voronoi Diagram
When To Consider N EAREST N EIGHBOR
Instances map to points in $\mathbb{R}^n$
Less than 20 attributes per instance
Lots of training data
Advantages: training is very fast; can learn complex target functions; doesn't lose information
Disadvantages: slow at query time; easily fooled by irrelevant attributes
Behavior in the Limit I
Consider $p(x)$, the probability that instance $x$ is labeled 1 (positive) rather than 0 (negative).
NEAREST NEIGHBOR: as the number of training examples → ∞, it approaches the Gibbs algorithm
Gibbs: with probability $p(x)$ predict 1, else 0
k-NEAREST NEIGHBOR: as the number of training examples → ∞ and $k$ grows large, it approaches the Bayes optimal classifier
Bayes optimal: if $p(x) > 0.5$ then predict 1, else 0
Note: Gibbs has at most twice the expected error of the Bayes optimal classifier
Behavior in the Limit II
Distance-Weighted kNN
Might want to weight nearer neighbors more heavily:

$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$  where  $w_i \equiv \frac{1}{d(x_q, x_i)^2}$

and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$.
Note that now it makes sense to use all training examples instead of just $k$ → Shepard's method
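A hedged sketch of the distance-weighted estimate above for a real-valued target; passing k=None uses all training examples, i.e. Shepard's method. Function and parameter names are illustrative.

```python
import numpy as np

def dw_knn_predict(X_train, y_train, x_q, k=None):
    """Distance-weighted k-NN with w_i = 1 / d(x_q, x_i)^2.
    With k=None all training examples are used (Shepard's method)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    idx = np.argsort(dists) if k is None else np.argsort(dists)[:k]
    d = dists[idx]
    if np.any(d == 0):                    # query coincides with a training point: return its value
        return y_train[idx[d == 0]][0]
    w = 1.0 / d ** 2                      # nearer neighbors get larger weights
    return np.sum(w * y_train[idx]) / np.sum(w)
```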
Distances
Most instance-based schemes use Euclidean distance:

$\sqrt{(x_1^{(1)} - x_1^{(2)})^2 + (x_2^{(1)} - x_2^{(2)})^2 + \cdots + (x_k^{(1)} - x_k^{(2)})^2}$

where $x^{(1)}$ and $x^{(2)}$ are two instances with $k$ attributes.
Taking the square root is not required when comparing distances.
Other popular metric: the city-block (Manhattan) metric, which adds the differences without squaring them.
Normalization
Different attributes are measured on different scales → they need to be normalized:

$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$

where $v_i$ is the actual value of attribute $i$.
Nominal attributes: distance is either 0 or 1
Common policy for missing values: assumed to be maximally distant (given normalized attributes)
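An illustrative sketch of the min-max normalization defined above; the min and max are computed on the training set and must be reused to rescale query instances.

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each numeric attribute to [0, 1]: a_i = (v_i - min v_i) / (max v_i - min v_i)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero for constant attributes
    return (X - mins) / span, (mins, span)

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_norm, (mins, span) = minmax_normalize(X)
x_q = (np.array([2.5, 250.0]) - mins) / span         # a query instance uses the same mins/span
```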
Curse of Dimensionality
Imagine instances described by 20 attributes, but only 2 are relevant to the target function.
Curse of dimensionality: NEAREST NEIGHBOR is easily misled when $X$ is high-dimensional.
One approach: stretch the $j$-th axis by weight $z_j$, where $z_1, \ldots, z_n$ are chosen to minimize prediction error
Use cross-validation to automatically choose the weights $z_1, \ldots, z_n$
Note: setting $z_j$ to zero eliminates that dimension altogether; see [Moore and Lee, 1994]
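A small sketch of the axis-stretching idea: each axis $j$ is scaled by a weight $z_j$ inside the distance, and $z_j = 0$ removes the attribute; in practice the weights would be chosen by cross-validation, which is omitted here.

```python
import numpy as np

def stretched_distance(x, y, z):
    """Euclidean distance after stretching axis j by weight z_j (z_j = 0 drops the attribute)."""
    return np.linalg.norm(z * (x - y))

x = np.array([1.0, 5.0, 0.2])
y = np.array([2.0, 5.5, 0.9])
z = np.array([1.0, 1.0, 0.0])   # third attribute judged irrelevant, so it is ignored
print(stretched_distance(x, y, z))
```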
Finding Nearest Neighbors Efficiently
Simplest way of finding nearest neighbor: linear scan of the data Classification takes time proportional to the product of the number of instances in training and test sets
Nearest-neighbor search can be done more efficiently using appropriate data structures We will discuss two methods that represent training data in a tree structure: kD-trees and ball trees
kD-tree Example I
kD-tree Example II
kD-tree Example III
More on kD-trees
Complexity depends on depth of tree, given by logarithm of number of nodes Amount of backtracking required depends on quality of tree (”square” vs. ”skinny” nodes) How to build a good tree? Need to find good split point and split direction Split direction: direction with greatest variance Split point: median value along that direction
Using value closest to mean (rather than median) can be better if data is skewed Can apply this recursively
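A rough sketch of top-down kD-tree construction following the recipe above (split direction: greatest variance; split point: median). Class and function names are illustrative, and the nearest-neighbor search with backtracking is not shown.

```python
import numpy as np

class KDNode:
    def __init__(self, point=None, axis=None, split=None, left=None, right=None):
        self.point, self.axis, self.split = point, axis, split
        self.left, self.right = left, right

def build_kdtree(X):
    """Recursively split on the axis of greatest variance, at the median value."""
    if len(X) == 0:
        return None
    if len(X) == 1:
        return KDNode(point=X[0])
    axis = int(np.argmax(X.var(axis=0)))          # split direction: greatest variance
    X = X[np.argsort(X[:, axis])]                 # sort along that axis
    mid = len(X) // 2                             # split point: median along that axis
    return KDNode(point=X[mid], axis=axis, split=X[mid, axis],
                  left=build_kdtree(X[:mid]), right=build_kdtree(X[mid + 1:]))

tree = build_kdtree(np.random.rand(20, 2))
```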
Building Trees Incrementally
Big advantage of instance-based learning: classifier can be updated incrementally Just add new training instance!
Can we do the same with kD-trees? Heuristic strategy: Find leaf node containing new instance Place instance into leaf if leaf is empty Otherwise, split leaf according to the longest dimension (to preserve squareness)
Tree should be re-built occasionally (i.e. if depth grows to twice the optimum depth)
Ball Trees
Problem in kD-trees: corners
Observation: no need to make sure that regions don't overlap
Can use balls (hyperspheres) instead of hyperrectangles
A ball tree organizes the data into a tree of k-dimensional hyperspheres
Normally allows for a better fit to the data and thus more efficient search
Ball Trees Examples I
Ball Trees Examples II
Ball Trees Examples III
Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
A ball can be ruled out from consideration if the distance from the target to the ball's center exceeds the ball's radius plus the current upper bound
Building Ball Trees
Ball trees are built top down (like kD-trees)
Don't have to continue until leaf balls contain just two points: can enforce a minimum occupancy (same as in kD-trees)
Basic problem: splitting a ball into two
Simple (linear-time) split selection strategy:
Choose the point farthest from the ball's center
Choose the second point farthest from the first one
Assign each point to the closer of these two points
Compute cluster centers and radii based on the two subsets to get two balls
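An illustrative sketch of the linear-time ball-splitting heuristic described above, assuming the ball holds at least two distinct points; names are illustrative.

```python
import numpy as np

def split_ball(X):
    """Split the points of one ball into two child balls (farthest-point heuristic).
    Assumes X contains at least two distinct points."""
    center = X.mean(axis=0)
    p1 = X[np.argmax(np.linalg.norm(X - center, axis=1))]   # farthest from the ball's center
    p2 = X[np.argmax(np.linalg.norm(X - p1, axis=1))]       # farthest from the first point
    to_first = np.linalg.norm(X - p1, axis=1) <= np.linalg.norm(X - p2, axis=1)
    balls = []
    for subset in (X[to_first], X[~to_first]):
        c = subset.mean(axis=0)
        r = np.linalg.norm(subset - c, axis=1).max()         # radius covering all member points
        balls.append((c, r, subset))
    return balls

left, right = split_ball(np.random.rand(30, 3))
```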
Discussion of Nearest-Neighbor Learning I Often very accurate Assumes all attributes are equally important Remedy: attribute selection or weights
Possible remedies against noisy instances: Take a majority vote over the k nearest neighbors Removing noisy instances from dataset (difficult!)
Statisticians have used k-NN since the early 1950s
If $n \to \infty$, $k \to \infty$ and $k/n \to 0$, the error approaches the minimum (Bayes) error
kD-trees become inefficient when number of attributes is too large (approximately > 10) Ball trees (which are instances of metric trees) work well in higher-dimensional spaces
Discussion of Nearest-Neighbor Learning II
Instead of storing all training instances, compress them into regions
Example: hyperpipes (from the discussion of 1R)
Another simple technique (Voting Feature Intervals):
Construct intervals for each attribute
Discretize numeric attributes
Treat each value of a nominal attribute as an "interval"
Count number of times class occurs in interval Prediction is generated by letting intervals vote (those that contain the test instance)
Instance-based Learning
Practical problems of 1-NN scheme: Slow (but: fast tree-based approaches exist) Remedy: remove irrelevant data
Noise (but: k-NN copes quite well with noise) Remedy: remove noisy instances
All attributes deemed equally important Remedy: weight attributes (or simply select)
Doesn't perform explicit generalization Remedy: rule-based NN approach
Applications: agricultural land classification I
STATLOG project [Michie et al., 1994]
dataset: 82 x 100 LANDSAT images (heat maps) for areas in Australia: 2 in the visible band and 2 in the infrared band
each pixel has a class (manually determined); different colors indicate the classes G = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}
for each of the 4 spectral bands and for each pixel p, an 8-neighbor feature map is extracted (the pixel itself plus its 8 surrounding neighbors), i.e. (1 + 8) × 4 = 36 input features per pixel
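An illustrative sketch of the (1 + 8) × 4 = 36 feature extraction for an interior pixel, assuming the four band images are given as equally sized 2-D arrays; names are hypothetical.

```python
import numpy as np

def pixel_features(bands, r, c):
    """bands: list of 4 arrays (one per spectral band), all with the same 2-D shape.
    Returns the (1 + 8) * 4 = 36 values: the pixel plus its 8 neighbors, in each band.
    Assumes (r, c) is an interior pixel."""
    feats = []
    for img in bands:
        patch = img[r - 1:r + 2, c - 1:c + 2]    # 3x3 window centered on the pixel
        feats.extend(patch.ravel())
    return np.array(feats)

bands = [np.random.rand(82, 100) for _ in range(4)]
print(pixel_features(bands, 10, 20).shape)        # -> (36,)
```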
Applications: agricultural land classification II objective: classify land usage at a pixel, based on the information in the 4 spectral bands
Applications: agricultural land classification III
5-NN produced the predicted land-usage map; test error rate: ≈ 9.5%
In the STATLOG project many methods were attempted, including LVQ, CART, ANN, LDA, ...
k-NN performed best on this task; hence it is likely that the decision boundaries in $\mathbb{R}^{36}$ are quite irregular
Learning Prototypes (Editing)
Only those instances involved in a decision need to be stored Noisy instances should be filtered out Idea: only use prototypical examples
Speed up, Combat noise
IB2: save memory, speed up classification
Work incrementally
Only incorporate misclassified instances
Problem: noisy data gets incorporated
IB3: deal with noise
Discard instances that don't perform well
Compute confidence intervals for
1. Each instance's success rate
2. Default accuracy of its class
Accept/reject instances Accept if lower limit of 1 exceeds upper limit of 2 Reject if upper limit of 1 is below lower limit of 2
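IB3's exact confidence-interval formula and acceptance/rejection thresholds are not given on the slide; the sketch below only illustrates the accept/reject logic, using a simple normal-approximation interval, so the details are indicative rather than faithful to IB3.

```python
import math

def normal_ci(successes, trials, z=1.28):
    """Approximate confidence interval for a success proportion (normal approximation)."""
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

def ib3_decision(inst_succ, inst_trials, class_succ, class_trials):
    """Accept if the instance's lower bound beats the class default's upper bound;
    reject (discard) if its upper bound falls below the class default's lower bound."""
    lo_i, hi_i = normal_ci(inst_succ, inst_trials)
    lo_c, hi_c = normal_ci(class_succ, class_trials)
    if lo_i > hi_c:
        return "accept"
    if hi_i < lo_c:
        return "reject"
    return "undecided"
```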
Weight Attributes
IB4: weight each attribute (weights can be class-specific)
Weighted Euclidean distance:

$\sqrt{w_1^2 (x_1 - y_1)^2 + \cdots + w_n^2 (x_n - y_n)^2}$

Update weights based on the nearest neighbor:
Class correct: increase weight
Class incorrect: decrease weight
Amount of change for the i-th attribute depends on $|x_i - y_i|$
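A hedged sketch of IB4-style attribute weighting: the weighted distance from the formula above plus one possible weight-update rule driven by $|x_i - y_i|$; the actual IB4 update schedule differs, so this is illustrative only.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_i w_i^2 (x_i - y_i)^2)."""
    return np.sqrt(np.sum((w * (x - y)) ** 2))

def update_weights(w, x, neighbor, correct, eta=0.05):
    """Nudge each attribute weight up if the neighbor's class was correct, down otherwise;
    attributes whose values were similar (small |x_i - y_i|) get most of the credit or blame.
    Assumes normalized attributes in [0, 1]."""
    delta = 1.0 - np.abs(x - neighbor)
    return np.clip(w + (eta * delta if correct else -eta * delta), 0.0, 1.0)
```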
Rectangular Generalizations
Nearest-neighbor rule is used outside rectangles
Rectangles are rules! (But they can be more conservative than "normal" rules.)
Nested rectangles are rules with exceptions
Generalized Exemplars
Generalize instances into hyperrectangles Online: incrementally modify rectangles Offline version: seek small set of rectangles that cover the instances
Important design decisions: Allow overlapping rectangles? Requires conflict resolution
Allow nested rectangles? Dealing with uncovered instances?
Separating Generalized Exemplars
Generalized Distance Functions
Given: some transformation operations on attributes
K*: similarity = probability of transforming instance A into B by chance
Average over all transformation paths
Weight paths according to their probability (need a way of measuring this)
Uniform way of dealing with different attribute types Easily generalized to give distance between sets of instances
Locally Weighted Regression I
Note that kNN forms a local approximation to $f$ for each query point $x_q$
Why not form an explicit approximation $\hat{f}(x)$ for the region surrounding $x_q$?
Fit a linear function to the $k$ nearest neighbors, or fit a quadratic, ...
Produces a "piecewise approximation" to $f$
Several choices of error to minimize; squared error over the $k$ nearest neighbors:

$E_1(x_q) \equiv \frac{1}{2} \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2$
Locally Weighted Regression II
Distance-weighted squared error over all neighbors:

$E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$

...
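A sketch of locally weighted linear regression: the kernel-weighted squared error $E_2$ is minimized in closed form by weighted least squares, here with a Gaussian kernel; names and the bandwidth parameter tau are illustrative.

```python
import numpy as np

def lwr_predict(X, y, x_q, tau=1.0):
    """Locally weighted linear regression: minimize sum_x K(d(x_q, x)) (f(x) - f_hat(x))^2
    with a Gaussian kernel K, solved by weighted least squares."""
    Xb = np.hstack([np.ones((len(X), 1)), X])               # add intercept term
    xqb = np.concatenate([[1.0], x_q])
    d2 = np.sum((X - x_q) ** 2, axis=1)
    k = np.exp(-d2 / (2 * tau ** 2))                        # kernel weights K(d(x_q, x))
    W = np.diag(k)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y     # weighted least-squares fit
    return xqb @ beta

# toy usage: noisy sine
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
print(lwr_predict(X, y, np.array([3.0]), tau=0.5))
```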
Radial Basis Function Networks I Global approximation to target function, in terms of linear combination of local approximations Used, e.g., for image classification A different kind of neural network Closely related to distance-weighted regression, but “eager” instead of “lazy”
Radial Basis Function Networks II
where $a_i(x)$ are the attributes describing instance $x$, and

$f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))$

One common choice for $K_u(d(x_u, x))$ is the Gaussian kernel

$K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
Training Radial Basis Function Networks
Q1: What $x_u$ to use for each kernel function $K_u(d(x_u, x))$?
Scatter uniformly throughout the instance space, or use the training instances (reflects the instance distribution)
Q2: How to train the weights (assume here Gaussian $K_u$)?
First choose the variance (and perhaps the mean) for each $K_u$, e.g., using EM
Then hold $K_u$ fixed and train the linear output layer (efficient methods exist to fit linear functions)
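A minimal sketch of one common training recipe: kernel centers taken from the training instances, a fixed Gaussian width sigma, and the linear output layer fit by least squares; other choices (EM for the variances, uniformly scattered centers) are possible as the slide notes. Names are illustrative.

```python
import numpy as np

def train_rbf(X, y, sigma=1.0):
    """RBF network: centers x_u = training instances, Gaussian kernels with fixed sigma,
    then the linear output layer (w_0, w_u) is fit by least squares."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)        # pairwise squared distances
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2 * sigma ** 2))])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def rbf_predict(X_train, w, x_q, sigma=1.0):
    d2 = np.sum((X_train - x_q) ** 2, axis=1)
    phi = np.concatenate([[1.0], np.exp(-d2 / (2 * sigma ** 2))])
    return phi @ w                                                     # f(x) = w_0 + sum_u w_u K_u(d(x_u, x))

X = np.random.rand(40, 2)
y = np.sin(X[:, 0]) + X[:, 1]
w = train_rbf(X, y, sigma=0.5)
print(rbf_predict(X, w, np.array([0.3, 0.7]), sigma=0.5))
```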
Case-Based Reasoning
Can apply instance-based learning even when $X \neq \mathbb{R}^n$ → need a different "distance" metric
Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions
((user-complaint error53-on-shutdown) (cpu-model PowerPC) (operating-system Windows) (network-connection PCIA) (memory 48meg) (installed-applications Excel Netscape VirusScan) (disk 1gig) (likely-cause ???))
Case-Based Reasoning in CADET
CADET: 75 stored examples of mechanical devices
each training example: ⟨qualitative function, mechanical structure⟩
new query: desired function; target value: mechanical structure for this function
Distance metric: match qualitative function descriptions
Case-Based Reasoning in CADET
Case-Based Reasoning in CADET
Instances represented by rich structural descriptions Multiple cases retrieved (and combined) to form solution to new problem Tight coupling between case retrieval and problem solving Bottom line: Simple matching of cases useful for tasks such as answering help-desk queries Area of ongoing research
Lazy and Eager Learning
Lazy: wait for the query before generalizing
k-NEAREST NEIGHBOR, case-based reasoning
Eager: generalize before seeing the query
Radial basis function networks, ID3, Backpropagation, Naive Bayes, ...
Does it matter?
An eager learner must create a single global approximation
A lazy learner can create many local approximations
If they use the same H, the lazy learner can represent more complex functions (e.g., consider H = linear functions)
Credits
T. M. Mitchell: Machine Learning, McGraw-Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning, Springer
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley