
Outlier Analysis

Second Edition


Charu C. Aggarwal


Yorktown Heights, New York, USA

ISBN 978-3-319-47577-6 ISBN 978-3-319-47578-3 (eBook)

DOI 10.1007/978-3-319-47578-3

Library of Congress Control Number: 2016961247

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my wife, my daughter Sayani, and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.

1 An Introduction to Outlier Analysis
1.2 The Data Model is Everything
1.2.1 Connections with Supervised Models
1.3 The Basic Outlier Detection Models
1.3.1 Feature Selection in Outlier Detection
1.3.2 Extreme-Value Analysis
1.3.3 Probabilistic and Statistical Models
1.3.4 Linear Models
1.3.4.1 Spectral Models
1.3.5 Proximity-Based Models
1.3.6 Information-Theoretic Models
1.3.7 High-Dimensional Outlier Detection
1.4 Outlier Ensembles
1.4.1 Sequential Ensembles
1.4.2 Independent Ensembles
1.5 The Basic Data Types for Analysis
1.5.1 Categorical, Text, and Mixed Attributes
1.5.2 When the Data Values have Dependencies
1.5.2.1 Time-Series Data and Data Streams
1.5.2.2 Discrete Sequences
1.5.2.3 Spatial Data
1.5.2.4 Network and Graph Data
1.6 Supervised Outlier Detection
1.7 Outlier Evaluation Techniques
1.7.1 Interpreting the ROC AUC
1.7.2 Common Mistakes in Benchmarking
1.8 Conclusions and Summary
1.9 Bibliographic Survey
1.10 Exercises
2 Probabilistic Models for Outlier Detection
2.1 Introduction
2.2 Statistical Methods for Extreme-Value Analysis
2.2.1 Probabilistic Tail Inequalities
2.2.1.1 Sum of Bounded Random Variables
2.2.2 Statistical-Tail Confidence Tests
2.2.2.1 t-Value Test
2.2.2.2 Sum of Squares of Deviations
2.2.2.3 Visualizing Extreme Values with Box Plots
2.3 Extreme-Value Analysis in Multivariate Data
2.3.1 Depth-Based Methods
2.3.2 Deviation-Based Methods
2.3.3 Angle-Based Outlier Detection
2.3.4 Distance Distribution-based Techniques: The Mahalanobis Method
2.3.4.1 Strengths of the Mahalanobis Method
2.4 Probabilistic Mixture Modeling for Outlier Analysis
2.4.1 Relationship with Clustering Methods
2.4.2 The Special Case of a Single Mixture Component
2.4.3 Other Ways of Leveraging the EM Model
2.4.4 An Application of EM for Converting Scores to Probabilities
2.5 Limitations of Probabilistic Modeling
2.7 Bibliographic Survey
3.2 Linear Regression Models
3.2.1 Modeling with Dependent Variables
3.2.1.1 Applications of Dependent Variable Modeling
3.2.2 Linear Modeling with Mean-Squared Projection Error
3.3 Principal Component Analysis
3.3.1 Connections with the Mahalanobis Method
3.3.2 Hard PCA versus Soft PCA
3.3.3 Sensitivity to Noise
3.3.4 Normalization Issues
3.3.5 Regularization Issues
3.3.6 Applications to Noise Correction
3.3.7 How Many Eigenvectors?
3.3.8 Extension to Nonlinear Data Distributions
3.3.8.1 Choice of Similarity Matrix
3.3.8.2 Practical Issues
3.3.8.3 Application to Arbitrary Data Types
3.4 One-Class Support Vector Machines
3.4.1 Solving the Dual Optimization Problem
3.4.2 Practical Issues
3.4.3 Connections to Support Vector Data Description and Other Kernel Models
3.5 A Matrix Factorization View of Linear Models
3.5.1 Outlier Detection in Incomplete Data
3.5.1.1 Computing the Outlier Scores
3.6 Neural Networks: From Linear Models to Deep Learning
3.6.1 Generalization to Nonlinear Models
3.6.2 Replicator Neural Networks and Deep Autoencoders
3.6.3 Practical Issues
3.6.4 The Broad Potential of Neural Networks
3.7 Limitations of Linear Modeling
3.8 Conclusions and Summary
3.9 Bibliographic Survey
3.10 Exercises

4 Proximity-Based Outlier Detection
4.1 Introduction
4.2 Clusters and Outliers: The Complementary Relationship
4.2.1 Extensions to Arbitrarily Shaped Clusters
4.2.1.1 Application to Arbitrary Data Types
4.2.2 Advantages and Disadvantages of Clustering Methods
4.3 Distance-Based Outlier Analysis
4.3.1 Scoring Outputs for Distance-Based Methods
4.3.2 Binary Outputs for Distance-Based Methods
4.3.2.1 Cell-Based Pruning
4.3.2.2 Sampling-Based Pruning
4.3.2.3 Index-Based Pruning
4.3.3 Data-Dependent Similarity Measures
4.3.4 ODIN: A Reverse Nearest Neighbor Approach
4.3.5 Intensional Knowledge of Distance-Based Outliers
4.3.6 Discussion of Distance-Based Methods
4.4 Density-Based Outliers
4.4.1 LOF: Local Outlier Factor
4.4.1.1 Handling Duplicate Points and Stability Issues
4.4.2 LOCI: Local Correlation Integral
4.4.2.1 LOCI Plot
4.4.3 Histogram-Based Techniques
4.4.4 Kernel Density Estimation
4.4.4.1 Connection with Harmonic k-Nearest Neighbor Detector
4.4.4.2 Local Variations of Kernel Methods
4.4.5 Ensemble-Based Implementations of Histograms and Kernel Methods
4.5 Limitations of Proximity-Based Detection
4.6 Conclusions and Summary
4.7 Bibliographic Survey
4.8 Exercises
5 High-Dimensional Outlier Detection
5.1 Introduction
5.2 Axis-Parallel Subspaces
5.2.1 Genetic Algorithms for Outlier Detection
5.2.1.1 Defining Abnormal Lower-Dimensional Projections
5.2.1.2 Defining Genetic Operators for Subspace Search
5.2.2 Finding Distance-Based Outlying Subspaces
5.2.3 Feature Bagging: A Subspace Sampling Perspective
5.2.4 Projected Clustering Ensembles
5.2.5 Subspace Histograms in Linear Time
5.2.6 Isolation Forests
5.2.6.1 Further Enhancements for Subspace Selection
5.2.6.2 Early Termination
5.2.6.3 Relationship to Clustering Ensembles and Histograms
5.2.7 Selecting High-Contrast Subspaces
5.2.8 Local Selection of Subspace Projections
5.2.9 Distance-Based Reference Sets
5.3 Generalized Subspaces
5.3.1 Generalized Projected Clustering Approach
5.3.2 Leveraging Instance-Specific Reference Sets
5.3.3 Rotated Subspace Sampling
5.3.4 Nonlinear Subspaces
5.3.5 Regression Modeling Techniques
5.4 Discussion of Subspace Analysis
5.5 Conclusions and Summary
5.6 Bibliographic Survey
5.7 Exercises
6 Outlier Ensembles
6.1 Introduction
6.2 Categorization and Design of Ensemble Methods
6.2.1 Basic Score Normalization and Combination Methods
6.3 Theoretical Foundations of Outlier Ensembles
6.3.1 What is the Expectation Computed Over?
6.3.2 Relationship of Ensemble Analysis to Bias-Variance Trade-Off
6.4 Variance Reduction Methods
6.4.1 Parametric Ensembles
6.4.2 Randomized Detector Averaging
6.4.3 Feature Bagging: An Ensemble-Centric Perspective
6.4.3.1 Connections to Representational Bias
6.4.3.2 Weaknesses of Feature Bagging
6.4.4 Rotated Bagging
6.4.5 Isolation Forests: An Ensemble-Centric View
6.4.6 Data-Centric Variance Reduction with Sampling
6.4.6.1 Bagging
6.4.6.2 Subsampling
6.4.6.3 Variable Subsampling
6.4.6.4 Variable Subsampling with Rotated Bagging (VR)
6.4.7 Other Variance Reduction Methods
6.5 Flying Blind with Bias Reduction
6.5.1 Bias Reduction by Data-Centric Pruning
6.5.2 Bias Reduction by Model-Centric Pruning
6.5.3 Combining Bias and Variance Reduction
6.6 Model Combination for Outlier Ensembles
6.6.1 Combining Scoring Methods with Ranks
6.6.2 Combining Bias and Variance Reduction
6.7 Conclusions and Summary
6.8 Bibliographic Survey
6.9 Exercises

7 Supervised Outlier Detection
7.1 Introduction
7.2 Full Supervision: Rare Class Detection
7.2.1 Cost-Sensitive Learning
7.2.1.1 MetaCost: A Relabeling Approach
7.2.1.2 Weighting Methods
7.2.2 Adaptive Re-sampling
7.2.2.1 Relationship between Weighting and Sampling
7.2.2.2 Synthetic Over-sampling: SMOTE
7.2.3 Boosting Methods
7.3 Semi-Supervision: Positive and Unlabeled Data
7.4 Semi-Supervision: Partially Observed Classes
7.4.1 One-Class Learning with Anomalous Examples
7.4.2 One-Class Learning with Normal Examples
7.4.3 Learning with a Subset of Labeled Classes
7.5 Unsupervised Feature Engineering in Supervised Methods
7.6 Active Learning
7.7 Supervised Models for Unsupervised Outlier Detection
7.7.1 Connections with PCA-Based Methods
7.7.2 Group-wise Predictions for High-Dimensional Data
7.7.3 Applicability to Mixed-Attribute Data Sets
7.7.4 Incorporating Column-wise Knowledge
7.7.5 Other Classification Methods with Synthetic Outliers
7.8 Conclusions and Summary
7.9 Bibliographic Survey
7.10 Exercises
8 Categorical, Text, and Mixed Attribute Data
8.1 Introduction
8.2 Extending Probabilistic Models to Categorical Data
8.2.1 Modeling Mixed Data
8.3 Extending Linear Models to Categorical and Mixed Data
8.3.1 Leveraging Supervised Regression Models
8.4 Extending Proximity Models to Categorical Data
8.4.1 Aggregate Statistical Similarity
8.4.2 Contextual Similarity
8.4.2.1 Connections to Linear Models
8.4.3 Issues with Mixed Data
8.4.4 Density-Based Methods
8.4.5 Clustering Methods
8.5 Outlier Detection in Binary and Transaction Data
8.5.1 Subspace Methods
8.5.2 Novelties in Temporal Transactions
8.6 Outlier Detection in Text Data
8.6.1 Probabilistic Models
8.6.2 Linear Models: Latent Semantic Analysis
8.6.2.1 Probabilistic Latent Semantic Analysis (PLSA)
8.6.3 Proximity-Based Models
8.6.3.1 First Story Detection
8.7 Conclusions and Summary
8.8 Bibliographic Survey
8.9 Exercises
9 Time Series and Streaming Outlier Detection
9.1 Introduction
9.2 Predictive Outlier Detection in Streaming Time-Series
9.2.1 Autoregressive Models
9.2.2 Multiple Time Series Regression Models
9.2.2.1 Direct Generalization of Autoregressive Models
9.2.2.2 Time-Series Selection Methods
9.2.2.3 Principal Component Analysis and Hidden Variable-Based Models
9.2.3 Relationship between Unsupervised Outlier Detection and Prediction
9.2.4 Supervised Point Outlier Detection in Time Series
9.3 Time-Series of Unusual Shapes
9.3.1 Transformation to Other Representations
9.3.1.1 Numeric Multidimensional Transformations
9.3.1.2 Discrete Sequence Transformations
9.3.1.3 Leveraging Trajectory Representations of Time Series
9.3.2 Distance-Based Methods
9.3.2.1 Single Series versus Multiple Series
9.3.3 Probabilistic Models
9.3.4 Linear Models
9.3.4.1 Univariate Series
9.3.4.2 Multivariate Series
9.3.4.3 Incorporating Arbitrary Similarity Functions
9.3.4.4 Leveraging Kernel Methods with Linear Models
9.3.5 Supervised Methods for Finding Unusual Time-Series Shapes
9.4 Multidimensional Streaming Outlier Detection
9.4.1 Individual Data Points as Outliers
9.4.1.1 Proximity-Based Algorithms
9.4.1.2 Probabilistic Algorithms
9.4.1.3 High-Dimensional Scenario
9.4.2 Aggregate Change Points as Outliers
9.4.2.1 Velocity Density Estimation Method
9.4.2.2 Statistically Significant Changes in Aggregate Distributions
9.4.3 Rare and Novel Class Detection in Multidimensional Data Streams
9.4.3.1 Detecting Rare Classes
9.4.3.2 Detecting Novel Classes
9.4.3.3 Detecting Infrequently Recurring Classes
9.5 Conclusions and Summary
9.6 Bibliographic Survey
9.7 Exercises

10 Outlier Detection in Discrete Sequences
10.1 Introduction
10.2 Position Outliers
10.2.1 Rule-Based Models
10.2.2 Markovian Models
10.2.3 Efficiency Issues: Probabilistic Suffix Trees
10.3 Combination Outliers
10.3.1 A Primitive Model for Combination Outlier Detection
10.3.1.1 Model-Specific Combination Issues
10.3.1.2 Easier Special Cases
10.3.1.3 Relationship between Position and Combination Outliers
10.3.2 Distance-Based Models
10.3.2.1 Combining Anomaly Scores from Comparison Units
10.3.2.2 Some Observations on Distance-Based Methods
10.3.2.3 Easier Special Case: Short Sequences
10.3.3 Frequency-Based Models
10.3.3.1 Frequency-Based Model with User-Specified Comparison Unit
10.3.3.2 Frequency-Based Model with Extracted Comparison Units
10.3.3.3 Combining Anomaly Scores from Comparison Units
10.3.4 Hidden Markov Models
10.3.4.1 Design Choices in a Hidden Markov Model
10.3.4.2 Training and Prediction with HMMs
10.3.4.3 Evaluation: Computing the Fit Probability for Observed Sequences
10.3.4.4 Explanation: Determining the Most Likely State Sequence for Observed Sequence
10.3.4.5 Training: Baum-Welch Algorithm
10.3.4.6 Computing Anomaly Scores
10.3.4.7 Special Case: Short Sequence Anomaly Detection
10.3.5 Kernel-Based Methods
10.4 Complex Sequences and Scenarios
10.4.1 Multivariate Sequences
10.4.2 Set-Based Sequences
10.4.3 Online Applications: Early Anomaly Detection
10.5 Supervised Outliers in Sequences
10.6 Conclusions and Summary
10.7 Bibliographic Survey
10.8 Exercises
11.2.1.2 Graph-Based Methods
11.2.1.3 The Case of Multiple Behavioral Attributes
11.2.2 Autoregressive Models
11.2.3 Visualization with Variogram Clouds
11.2.4 Finding Abnormal Shapes in Spatial Data
11.2.4.1 Contour Extraction Methods
11.2.4.2 Extracting Multidimensional Representations
11.2.4.3 Multidimensional Wavelet Transformation
11.2.4.4 Supervised Shape Discovery
11.2.4.5 Anomalous Shape Change Detection
11.3 Spatiotemporal Outliers with Spatial and Temporal Context
11.4 Spatial Behavior with Temporal Context: Trajectories
11.4.1 Real-Time Anomaly Detection
11.4.2 Unusual Trajectory Shapes
11.4.2.1 Segment-wise Partitioning Methods
11.4.2.2 Tile-Based Transformations
11.4.2.3 Similarity-Based Transformations
11.4.3 Supervised Outliers in Trajectories
11.5 Conclusions and Summary
11.6 Bibliographic Survey
11.7 Exercises
12.2 Outlier Detection in Many Small Graphs
12.2.1 Leveraging Graph Kernels
12.3 Outlier Detection in a Single Large Graph
12.3.1 Node Outliers
12.3.1.1 Leveraging the Mahalanobis Method
12.3.2 Linkage Outliers
12.3.2.1 Matrix Factorization Methods
12.3.2.2 Spectral Methods and Embeddings
12.3.2.3 Clustering Methods
12.3.2.4 Community Linkage Outliers
12.3.3 Subgraph Outliers
12.4 Node Content in Outlier Analysis
12.4.1 Shared Matrix Factorization
12.4.2 Relating Feature Similarity to Tie Strength
12.4.3 Heterogeneous Markov Random Fields
12.5 Change-Based Outliers in Temporal Graphs
12.5.1 Discovering Node Hotspots in Graph Streams
12.5.2 Streaming Detection of Linkage Anomalies
12.5.3 Outliers Based on Community Evolution
12.5.3.1 Integrating Clustering Maintenance with Evolution Analysis
12.5.3.2 Online Analysis of Community Evolution in Graph Streams
12.5.3.3 GraphScope
12.5.4 Outliers Based on Shortest Path Distance Changes
12.5.5 Matrix Factorization and Latent Embedding Methods
12.6 Conclusions and Summary
12.7 Bibliographic Survey
12.8 Exercises
13.5 Intrusion and Security Applications
13.7 Text and Social Media Applications
13.8 Earth Science Applications
13.9 Miscellaneous Applications
13.10 Guidelines for the Practitioner
13.10.1 Which Unsupervised Algorithms Work Best?
13.11 Resources for the Practitioner
13.12 Conclusions and Summary

Preface

"All things excellent are as difficult as they are rare." – Baruch Spinoza

First Edition

Most of the earliest work on outlier detection was performed by the statistics community. While statistical methods are mathematically more precise, they have several shortcomings, such as simplified assumptions about data representations, poor algorithmic scalability, and a low focus on interpretability. With the increasing advances in hardware technology for data collection, and advances in software technology (databases) for data organization, computer scientists have increasingly been participating in the latest advancements of this field. Computer scientists approach this field based on their practical experiences in managing large amounts of data, and with far fewer assumptions: the data can be of any type, structured or unstructured, and may be extremely large. Furthermore, issues such as computational efficiency and intuitive analysis of the data are generally considered more important by computer scientists than mathematical precision, though the latter is important as well. This is the approach of professionals from the field of data mining, an area of computer science that was founded about 20 years ago. This has led to the formation of multiple academic communities on the subject, which have remained separated, partially because of differences in technical style and opinions about the importance of different problems and approaches to the subject. At this point, data mining professionals (with a computer science background) are much more actively involved in this area as compared to statisticians. This seems to be a major change in the research landscape. This book presents outlier detection from an integrated perspective, though the focus is towards computer science professionals. Special emphasis was placed on relating the methods from different communities with one another.

The key advantage of writing the book at this point in time is that the vast amount of work done by computer science professionals in the last two decades has remained largely untouched by a formal book on the subject. The classical books relevant to outlier analysis are as follows:

• P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection, Wiley, 2003.

• V. Barnett and T. Lewis. Outliers in Statistical Data, Wiley, 1994.

• D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.

We note that these books are quite outdated, and the most recent among them is a decade old. Furthermore, this (most recent) book is really focused on the relationship between regression and outlier analysis, rather than the latter. Outlier analysis is a much broader area, in which regression analysis is only a small part. The other books are even older, and are between 15 and 25 years old. They are exclusively targeted to the statistics community. This is not surprising, given that the first mainstream computer science conference in data mining (KDD) was organized in 1995. Most of the work in the data-mining community was performed after the writing of these books. Therefore, many key topics of interest to the broader data mining community are not covered in these books. Given that outlier analysis has been explored by a much broader community, including databases, data mining, statistics, and machine learning, we feel that our book incorporates perspectives from a much broader audience and brings together different points of view.

The chapters of this book have been organized carefully, with a view of covering the area extensively in a natural order. Emphasis was placed on simplifying the content, so that students and practitioners can also benefit from the book. While we did not originally intend to create a textbook on the subject, it evolved during the writing process into a work that can also be used as a teaching aid. Furthermore, it can also be used as a reference book, since each chapter contains extensive bibliographic notes. Therefore, this book serves a dual purpose, providing a comprehensive exposition of the topic of outlier detection from multiple points of view.

Additional Notes for the Second Edition

The second edition of this book is a significant enhancement over the first edition. In particular, most of the chapters have been upgraded with new material and recent techniques. More explanations have been added at several places and newer techniques have also been added. An entire chapter on outlier ensembles has been added. Many new topics have been added to the book such as feature selection, one-class support vector machines, one-class neural networks, matrix factorization, spectral methods, wavelet transforms, and supervised learning. Every chapter has been updated with the latest algorithms on the topic.

Last but not least, the first edition was classified by the publisher as a monograph, whereas the second edition is formally classified as a textbook. The writing style has been enhanced to be easily understandable to students. Many algorithms have been described in greater detail, as one might expect from a textbook. It is also accompanied with a solution manual for classroom teaching.

Acknowledgments

First Edition

I would like to thank my wife and daughter for their love and support during the writing of this book. The writing of a book requires significant time that is taken away from family members. This book is the result of their patience with me during this time. I also owe my late parents a debt of gratitude for instilling in me a love of education, which has played an important inspirational role in my book-writing efforts.

I would also like to thank my manager Nagui Halim for providing the tremendous support necessary for the writing of this book. His professional support has been instrumental for my many book efforts in the past and present.

Over the years, I have benefited from the insights of numerous collaborators. An incomplete list of these long-term collaborators in alphabetical order is Tarek F. Abdelzaher, Jiawei Han, Thomas S. Huang, Latifur Khan, Mohammad M. Masud, Spiros Papadimitriou, Guojun Qi, and Philip S. Yu. I would like to thank them for their collaborations and insights over the course of many years.

I would also like to specially thank my advisor James B. Orlin for his guidance during my early years as a researcher. While I no longer work in the same area, the legacy of what I learned from him is a crucial part of my approach to research. In particular, he taught me the importance of intuition and simplicity of thought in the research process. These are more important aspects of research than is generally recognized. This book is written in a simple and intuitive style, and is meant to improve accessibility of this area to both researchers and practitioners.

Finally, I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book.

Acknowledgments for Second Edition

I received significant feedback from various colleagues during the writing of the second edition. In particular, I would like to acknowledge Leman Akoglu, Chih-Jen Lin, Saket Sathe, Jiliang Tang, and Suhang Wang. Leman and Saket provided detailed feedback on several sections and chapters of this book.

Author Biography

Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his undergraduate degree in Computer Science from the Indian Institute of Technology at Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining. He has published more than 300 papers in refereed conferences and journals and authored over 80 patents. He is the author or editor of 15 books, including a textbook on data mining and a comprehensive book on outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of two IBM Outstanding Technical Achievement Awards (2009, 2015) for his work on data streams and high-dimensional data, respectively. He received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Contributions Award (2015), which is one of the two highest awards for influential research contributions in the field of data mining.

He has served as the general co-chair of the IEEE Big Data Conference (2014) and as the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference (2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an associate editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for "contributions to knowledge discovery and data mining algorithms."

Chapter 1

An Introduction to Outlier Analysis

"Never take the comment that you are different as a condemnation, it might be a compliment. It might mean that you possess unique qualities that, like the rarest of diamonds, is one of a kind." – Eugene Nathaniel Butler

1.1 Introduction

An outlier is a data point that is significantly different from the remaining data. Hawkins defined [249] an outlier as follows:

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism."

Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data is created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process behaves unusually, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process. The recognition of such unusual characteristics provides useful application-specific insights. Some examples are as follows:

• Intrusion detection systems: In many computer systems, different types of data are collected about the operating system calls, network traffic, or other user actions. This data may show unusual behavior because of malicious activity. The recognition of such activity is referred to as intrusion detection.

• Credit-card fraud: Credit-card fraud has become increasingly prevalent because of the greater ease with which sensitive information such as a credit-card number can be compromised. In many cases, unauthorized use of a credit card may show different patterns, such as buying sprees from particular locations or very large transactions. Such patterns can be used to detect outliers in credit-card transaction data.

• Interesting sensor events: Sensors are often used to track various environmental and location parameters in many real-world applications. Sudden changes in the underlying patterns may represent events of interest. Event detection is one of the primary motivating applications in the field of sensor networks. As discussed later in this book, event detection is an important temporal version of outlier detection.

• Medical diagnosis: In many medical applications, the data is collected from a variety of devices such as magnetic resonance imaging (MRI) scans, positron emission tomography (PET) scans, or electrocardiogram (ECG) time-series. Unusual patterns in such data typically reflect disease conditions.

• Law enforcement: Outlier detection finds numerous applications in law enforcement, especially in cases where unusual patterns can only be discovered over time through multiple actions of an entity. Determining fraud in financial transactions, trading activity, or insurance claims typically requires the identification of unusual patterns in the data generated by the actions of the criminal entity.

• Earth science: A significant amount of spatiotemporal data about weather patterns, climate changes, or land-cover patterns is collected through a variety of mechanisms such as satellites or remote sensing. Anomalies in such data provide significant insights about human activities or environmental trends that may be the underlying causes.

In all these applications, the data has a "normal" model, and anomalies are recognized as deviations from this normal model. Normal data points are sometimes also referred to as inliers. In some applications such as intrusion or fraud detection, outliers correspond to sequences of multiple data points rather than individual data points. For example, a fraud event may often reflect the actions of an individual in a particular sequence. The specificity of the sequence is relevant to identifying the anomalous event. Such anomalies are also referred to as collective anomalies, because they can only be inferred collectively from a set or sequence of data points. Such collective anomalies are often a result of unusual events that generate anomalous patterns of activity. This book will address these different types of anomalies.

The output of an outlier detection algorithm can be one of two types:

• Outlier scores: Most outlier detection algorithms output a score quantifying the level of "outlierness" of each data point. This score can also be used to rank the data points in order of their outlier tendency. This is a very general form of output, which retains all the information provided by a particular algorithm, but it does not provide a concise summary of the small number of data points that should be considered outliers.

• Binary labels: A second type of output is a binary label indicating whether a data point is an outlier or not. Although some algorithms might directly return binary labels, outlier scores can also be converted into binary labels. This is typically achieved by imposing thresholds on outlier scores, and the threshold is chosen based on the statistical distribution of the scores. A binary labeling contains less information than a scoring mechanism, but it is the final result that is often needed for decision making in practical applications. A small illustrative sketch of this thresholding step follows this list.
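As a minimal sketch of the second type of output (not taken from this book), one might convert scores into labels by flagging points whose scores exceed the mean score by a chosen number of standard deviations; the specific rule, the num_std parameter, and the function name are illustrative assumptions rather than a prescribed method:

import numpy as np

def scores_to_labels(scores, num_std=3.0):
    """Convert outlier scores to binary labels by thresholding.

    A point is labeled an outlier (1) when its score exceeds the mean
    score by more than num_std standard deviations; otherwise it is
    labeled normal (0). The choice of num_std is application-specific.
    """
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + num_std * scores.std()
    return (scores > threshold).astype(int)

# Example: most scores are small, one is extreme.
labels = scores_to_labels([0.1, 0.2, 0.15, 0.3, 5.0], num_std=1.5)
print(labels)  # [0 0 0 0 1]

In practice, the threshold rule itself is a modeling choice; a looser or tighter rule trades false alarms against missed anomalies.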

Figure 1.1: The difference between noise and anomalies

Figure 1.2: The spectrum from normal data to outliers (outlierness score increasing from left to right)

It is often a subjective judgement, as to what constitutes a "sufficient" deviation for a point to be considered an outlier. In real applications, the data may be embedded in a significant amount of noise, and such noise may not be of any interest to the analyst. It is usually the significantly interesting deviations that are of interest. In order to illustrate this point, consider the examples illustrated in Figures 1.1(a) and (b). It is evident that the main patterns (or clusters) in the data are identical in both cases, although there are significant differences outside these main clusters. In the case of Figure 1.1(a), a single data point (marked by 'A') seems to be very different from the remaining data, and is therefore very obviously an anomaly. The situation in Figure 1.1(b) is much more subjective. While the corresponding data point 'A' in Figure 1.1(b) is also in a sparse region of the data, it is much harder to state confidently that it represents a true deviation from the remaining data set. It is quite likely that this data point represents randomly distributed noise in the data. This is because the point 'A' seems to fit a pattern represented by other randomly distributed points. Therefore, throughout this book, the term "outlier" refers to a data point that could either be considered an abnormality or noise, whereas an "anomaly" refers to a special kind of outlier that is of interest to an analyst.

In the unsupervised scenario, where previous examples of interesting anomalies are not available, the noise represents the semantic boundary between normal data and true anomalies: noise is often modeled as a weak form of outliers that does not always meet the strong criteria necessary for a data point to be considered interesting or anomalous enough. For example, data points at the boundaries of clusters may often be considered noise. Typically, most outlier detection algorithms use some quantified measure of the outlierness of a data point, such as the sparsity of the underlying region, nearest neighbor based distance, or the fit to the underlying data distribution. Every data point lies on a continuous spectrum from normal data to noise, and finally to anomalies, as illustrated in Figure 1.2. The separation of the different regions of this spectrum is often not precisely defined, and is chosen on an ad hoc basis according to application-specific criteria. Furthermore, the separation between noise and anomalies is not pure, and many data points created by a noisy generative process may be deviant enough to be interpreted as anomalies on the basis of the outlier score. Thus, anomalies will typically have a much higher outlier score than noise, but this is not a distinguishing factor between the two as a matter of definition. Rather, it is the interest of the analyst that regulates the distinction between noise and an anomaly.

Some authors use the terms weak outliers and strong outliers in order to distinguish between noise and anomalies [4, 318]. The detection of noise in the data has numerous applications of its own. For example, the removal of noise creates a much cleaner data set, which can be utilized for other data mining algorithms. Although noise might not be interesting in its own right, its removal and identification continues to be an important problem for mining purposes. Therefore, both noise and anomaly detection problems are important enough to be addressed in this book. Throughout this book, methods specifically relevant to either anomaly detection or noise removal will be identified. However, the bulk of the outlier detection algorithms could be used for either problem, since the difference between them is really one of semantics.

Since the semantic distinction between noise and anomalies is based on analyst interest, the best way to find such anomalies and distinguish them from noise is to use the feedback from previously known outlier examples of interest. This is quite often the case in many applications, such as credit-card fraud detection, where previous examples of interesting anomalies may be available. These may be used in order to learn a model that distinguishes the normal patterns from the abnormal data. Supervised outlier detection techniques are typically much more effective in many application-specific scenarios, because the characteristics of the previous examples can be used to sharpen the search process towards more relevant outliers. This is important, because outliers can be defined in numerous ways in a given data set, most of which may not be interesting. For example, in Figures 1.1(a) and (b), previous examples may suggest that only records with unusually high values of both attributes should be considered anomalies. In such a case, the point 'A' in both figures should be regarded as noise, and the point 'B' in Figure 1.1(b) should be considered an anomaly instead! The crucial point to understand here is that anomalies need to be unusual in an interesting way, and the supervision process re-defines what one might find interesting. Generally, unsupervised methods can be used either for noise removal or anomaly detection, and supervised methods are designed for application-specific anomaly detection. Unsupervised methods are often used in an exploratory setting, where the discovered outliers are provided to the analyst for further examination of their application-specific importance.

Several levels of supervision are possible in practical scenarios. In the fully supervised scenario, examples of both normal and abnormal data are available that can be clearly distinguished. In some cases, examples of outliers are available, but the examples of "normal" data may also contain outliers in some (unknown) proportion. This is referred to as classification with positive and unlabeled data. In other semi-supervised scenarios, only examples of normal data or only examples of anomalous data may be available. Thus, the number of variations of the problem is rather large, each of which requires a related but dedicated set of techniques.

Finally, the data representation may vary widely across applications. For example, the data may be purely multidimensional with no relationships among points, or the data may be sequential with temporal ordering, or may be defined in the form of a network with arbitrary relationships among data points. Furthermore, the attributes in the data may be numerical, categorical, or mixed. Clearly, the outlier detection process needs to be sensitive to the nature of the attributes and relationships in the underlying data. In fact, the relationships themselves may often provide an outlier-detection criterion in the form of connections between entities that do not usually occur together. Such outliers are referred to as contextual outliers. A classical example of this is the concept of linkage outliers in social network analysis [17]. In this case, entities (nodes) in the graph that are normally not connected together may show anomalous connections with each other. Thus, the impact of data types on the anomaly detection process is significant and will be carefully addressed in this book.

This chapter is organized as follows. In section 1.2, the importance of data modeling in outlier analysis is discussed. In section 1.3, the basic outlier models for outlier detection are introduced. Outlier ensembles are introduced in section 1.4. Section 1.5 discusses the basic data types used for analysis. Section 1.6 introduces the concept of supervised modeling of outliers for data analysis. Methods for evaluating outlier detection algorithms are discussed in section 1.7. The conclusions are presented in section 1.8.

1.2 The Data Model is Everything

Virtually all outlier detection algorithms create a model of the normal patterns in the data, and then compute an outlier score of a given data point on the basis of the deviations from these patterns. For example, this data model may be a generative model such as a Gaussian-mixture model, a regression-based model, or a proximity-based model. All these models make different assumptions about the "normal" behavior of the data. The outlier score of a data point is then computed by evaluating the quality of the fit between the data point and the model. In many cases, the model may be algorithmically defined. For example, nearest neighbor-based outlier detection algorithms model the outlier tendency of a data point in terms of the distribution of its k-nearest neighbor distance. Thus, in this case, the assumption is that outliers are located at large distances from most of the data.

Clearly, the choice of the data model is crucial. An incorrect choice of data model may lead to poor results. For example, a fully generative model such as the Gaussian mixture model may not work well, if the data does not fit the generative assumptions of the model, or if a sufficient number of data points are not available to learn the parameters of the model. Similarly, a linear regression-based model may work poorly, if the underlying data is clustered arbitrarily. In such cases, data points may be incorrectly reported as outliers because of poor fit to the erroneous assumptions of the model. Unfortunately, outlier detection is largely an unsupervised problem in which examples of outliers are not available to learn¹ the best model (in an automated way) for a particular data set. This aspect of outlier detection tends to make it more challenging than many other supervised data mining problems like classification in which labeled examples are available. Therefore, in practice, the choice of the model is often dictated by the analyst's understanding of the kinds of deviations relevant to an application. For example, in a spatial application measuring a behavioral attribute such as the location-specific temperature, it would be reasonable to assume that an unusual deviation of the temperature attribute in a spatial locality is an indicator of abnormality. On the other hand, for the case of high-dimensional data, even the definition of data locality may be ill-defined because of data sparsity. Thus, an effective model for a particular data domain may only be constructed after carefully evaluating the relevant modeling properties of that domain.

¹ In supervised problems like classification, this process is referred to as model selection.

Figure 1.3: Applying the Z-value test on the Normal and Zipf distributions: (a) Normal distribution, (b) Zipf distribution

In order to understand the impact of the model, it is instructive to examine the use of a simple model known as the Z-value test for outlier analysis. Consider a set of 1-dimensional quantitative data observations, denoted by X_1 ... X_N, with mean μ and standard deviation σ. The Z-value for the data point X_i is denoted by Z_i and is defined as follows:

Z_i = (X_i − μ) / σ        (1.1)

The Z-value test computes the number of standard deviations by which a data point is distant from the mean. This provides a good proxy for the outlier score of that point. An implicit assumption is that the data is modeled from a normal distribution, and therefore the Z-value is a random variable drawn from a standard normal distribution with zero mean and unit variance. In cases where the mean and standard deviation of the distribution can be accurately estimated, a good "rule-of-thumb" is to use Z_i ≥ 3 as a proxy for the anomaly. However, in scenarios in which very few samples are available, the mean and standard deviation of the underlying distribution cannot be estimated robustly. In such cases, the results from the Z-value test need to be interpreted more carefully with the use of the (related) Student's t-distribution rather than a normal distribution. This issue will be discussed in Chapter 2.

It is often forgotten by practitioners during modeling that the Z-value test implicitly assumes a normal distribution for the underlying data. When such an approximation is poor, the results are harder to interpret. For example, consider the two data frequency histograms drawn on values between 1 and 20 in Figure 1.3. In the first case, the histogram is sampled from a normal distribution with (μ, σ) = (10, 2), and in the second case, it is sampled from a Zipf distribution 1/i. It is evident that most of the data lies in the range [10 − 2·3, 10 + 2·3] for the normal distribution, and all data points lying outside this range can be truly considered anomalies. Thus, the Z-value test works very well in this case. In the second case with the Zipf distribution, the anomalies are not quite as clear, although the data points with very high values (such as 20) can probably be considered anomalies. In this case, the mean and standard deviation of the data are 5.24 and 5.56, respectively. As a result, the Z-value test does not declare any of the data points as anomalous (for a threshold of 3), although it does come close. In any case, the significance of the Z-value from the Zipf-distribution is not very meaningful at least from the perspective of probabilistic interpretability. This suggests that if mistakes are made at the modeling stage, it can result in an incorrect understanding of the data. Such tests are often used as a heuristic to provide a rough idea of the outlier scores even for data sets that are far from normally distributed, and it is important to interpret such scores carefully.

Figure 1.4: Linearly correlated data: (a) 2-d data, (b) 3-d data

An example in which the Z-value test would not work even as a heuristic, would be one in which it was applied to a data point that was an outlier only because of its relative position, rather than its extreme position. For example, if the Z-value test is applied to an individual dimension in Figure 1.1(a), the test would fail miserably, because point 'A' would be considered the most centrally located and normal data point. On the other hand, the test can still be reasonably applied to a set of extracted 1-dimensional values corresponding to the k-nearest neighbor distances of each point. Therefore, the effectiveness of a model depends both on the choice of the test used, and how it is applied.

The best choice of a model is often data-specific. This requires a good understanding of the data itself before choosing the model. For example, a regression-based model would be most suitable for finding the outliers in the data distributions of Figure 1.4, where most of the data is distributed along linear correlation planes. On the other hand, a clustering model would be more suitable for the cases illustrated in Figure 1.1. A poor choice of model for a given data set is likely to provide poor results. Therefore, the core principle of discovering outliers is based on assumptions about the structure of the normal patterns in a given data set. Clearly, the choice of the "normal" model depends highly on the analyst's understanding of the natural data patterns in that particular domain. This implies that it is often useful for the analyst to have a semantic understanding of the data representation, although this is often not possible in real settings.

There are many trade-offs associated with model choice; a highly complex model with too many parameters will most likely overfit the data, and will also find a way to fit the outliers. A simple model, which is constructed with a good intuitive understanding of the data (and possibly also an understanding of what the analyst is looking for), is likely to lead to much better results. On the other hand, an oversimplified model, which fits the data poorly, is likely to declare normal patterns as outliers. The initial stage of selecting the data model is perhaps the most crucial one in outlier analysis. The theme about the impact of data models will be repeated throughout the book, with specific examples.


1.2.1 Connections with Supervised Models

One can view the outlier detection problem as a variant of the classification problem in which the class label ("normal" or "anomaly") is unobserved. Therefore, by virtue of the fact that normal examples far outnumber the anomalous examples, one can "pretend" that the entire data set contains the normal class and create a (possibly noisy) model of the normal data. Deviations from the normal model are treated as outlier scores. This connection between classification and outlier detection is important because much of the theory and methods from classification generalize to outlier detection [32]. The unobserved nature of the labels (or outlier scores) is the reason that outlier detection methods are referred to as unsupervised whereas classification methods are referred to as supervised. In cases where the anomaly labels are observed, the problem simplifies to the imbalanced version of data classification, and it is discussed in detail in Chapter 7.

The model of normal data for unsupervised outlier detection may be considered a one-class analog of the multi-class setting in classification. However, the one-class setting is sometimes far more subtle from a modeling perspective, because it is much easier to distinguish between examples of two classes than to predict whether a particular instance matches examples of a single (normal) class. When at least two classes are available, the distinguishing characteristics between the two classes can be learned more easily in order to sharpen the accuracy of the model.

In many forms of predictive learning, such as classification and recommendation, there is a natural dichotomy between instance-based learning methods and explicit generalization methods. Since outlier detection methods require the design of a model of the normal data in order to make predictions, this dichotomy applies to the unsupervised domain as well. In instance-based methods, a training model is not constructed up front. Rather, for a given test instance, one computes the most relevant (i.e., closest) instances of the training data, and makes predictions on the test instance using these related instances. Instance-based methods are also referred to as lazy learners in the field of classification [33] and memory-based methods in the field of recommender systems [34].

A simple example of an instance-based learning method in outlier analysis is the use of the 1-nearest-neighbor distance of a data point as its outlier score. Note that this approach does not require the construction of a training model up front because all the work of determining the nearest neighbor is done after specifying the identity of the instance to be predicted (scored). The 1-nearest neighbor outlier detector can be considered the unsupervised analog of the 1-nearest neighbor classifier in the supervised domain. Instance-based models are extremely popular in the outlier analysis domain because of their simplicity, effectiveness, and intuitive nature. In fact, many of the most popular and successful methods for outlier detection, such as k-nearest neighbor detectors [58, 456] and Local Outlier Factor (LOF) [96] (cf. Chapter 4), are instance-based methods.
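To make the instance-based idea concrete, the following is a brute-force sketch (an illustration assuming Euclidean distances and a small in-memory data matrix, not a reference implementation from this book) that scores each point by its distance to its k-th nearest neighbor; with k = 1 it is exactly the 1-nearest-neighbor detector described above:

import numpy as np

def knn_distance_scores(X, k=1):
    """Score each point by the distance to its k-th nearest neighbor.

    X is an (n, d) array. Larger scores indicate points that are far
    from their neighbors and are therefore more outlier-like. This is a
    brute-force O(n^2 d) computation intended only for small data sets.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distance matrix.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)        # exclude each point itself
    sorted_dists = np.sort(dists, axis=1)  # ascending per row
    return sorted_dists[:, k - 1]          # distance to the k-th neighbor

# Example: a tight cluster near the origin plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(knn_distance_scores(X, k=1))  # the last point receives the largest score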

The popularity of instance-based methods is so great in the outlier analysis community that the vast array of one-class analogs of other supervised methods are often overlooked. In principle, almost any classification method can be re-designed to create a one-class analog. Most of these methods are explicit generalization methods, in which a summarized model needs to be created up front. Explicit generalization methods use a two-step process on the data set D:

1. Create a one-class model of the normal data using the original data set D. For example, one might learn a linear hyperplane describing the normal data in Figure 1.4(b). This hyperplane represents a summarized model of the entire data set and therefore represents an explicit generalization of the data set.

Table 1.1: Classification methods and their unsupervised analogs in outlier analysis

Supervised Model          | Unsupervised Analog(s)                                   | Type
k-nearest neighbor        | k-NN distance, LOF, LOCI (Chapter 4)                     | Instance-based
Linear Regression         | Principal Component Analysis (Chapter 3)                 | Explicit Generalization
Naive Bayes               | Expectation-maximization (Chapter 2)                     | Explicit Generalization
Rocchio                   | Mahalanobis method (Chapter 3), Clustering (Chapter 4)   | Explicit Generalization
Decision Trees /          | Isolation Trees / Isolation Forests                      | Explicit Generalization
Random Forests            | (Chapters 5 and 6)                                       |
Rule-based                | FP-Outlier (Chapter 8)                                   | Explicit Generalization
Support-vector machines   | One-class support-vector machines (Chapter 3)            | Explicit Generalization
Neural Networks           | Replicator neural networks (Chapter 3)                   | Explicit Generalization
Matrix factorization      | Principal component analysis,                            | Explicit Generalization
(incomplete data          | Matrix factorization (Chapter 3)                         |
prediction)               |                                                          |

2. Score each point in D based on its deviation from this model of normal data. For example, if we learn a linear hyperplane using the data set of Figure 1.4(b) in the first step, then we might report the Euclidean distance from this hyperplane as the outlier score. (A minimal illustrative sketch of this two-step template follows this list.)
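The following is a minimal sketch of the two-step template (not taken from this book); it assumes, purely for illustration, that the "summarized model" is the best-fit hyperplane through the data mean, obtained from the smallest-variance principal direction, and that the outlier score is each point's absolute distance from that hyperplane:

import numpy as np

def hyperplane_outlier_scores(X):
    """Two-step explicit generalization: fit a hyperplane, then score.

    Step 1 (model): fit the best-fit hyperplane through the data mean,
    whose normal vector is the principal direction of smallest variance.
    Step 2 (score): report each point's absolute distance to the plane.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Eigenvectors of the covariance matrix; the one with the smallest
    # eigenvalue is normal to the best-fit hyperplane.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]
    return np.abs(centered @ normal)

# Example: points near the plane x + y + z = 0, plus one point far off it.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
A[:, 2] = -(A[:, 0] + A[:, 1]) + 0.05 * rng.normal(size=200)
A = np.vstack([A, [[0.0, 0.0, 4.0]]])
print(np.argmax(hyperplane_outlier_scores(A)))  # index of the off-plane point (200)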

One problem with explicit generalization methods is that the same data set D is used for both training and scoring. This is because it is hard to exclude a specific test point during the scoring process (like instance-based methods). Furthermore, unlike classification in which the presence or absence of ground-truth (labeling) naturally partitions the data into training and test portions, there is no labeling available in unsupervised problems. Therefore, one typically wants to use the entire data set D for both training and testing in unsupervised problems, which causes overfitting. Nevertheless, the influence of individual points on overfitting is often small in real-world settings because explicit generalization methods tend to create a concise summary (i.e., generalized representation) of a much larger data set. Since the same data D is used for training and testing, one can view outlier scores as the training data errors made in the assumption of "pretending" that all training data points belong to the normal class. Often an effective approach to reduce overfitting is to repeatedly partition the data into training and test sets in a randomized way and average the outlier scores of test points from the various models. Such methods will be discussed in later chapters.

Virtually all classification models can be generalized to outlier detection by using an appropriate one-class analog. Examples of such models include linear regression models, principal component analysis, probabilistic expectation-maximization models, clustering methods, one-class support vector machines, matrix factorization models, and one-class neural networks. For the reader who has a familiarity with the classification problem, we have listed various classification models and their corresponding one-class analogs for outlier detection in Table 1.1. The table is not comprehensive and is intended to provide intuition about the connections between the supervised and unsupervised settings with representative examples. The connections between supervised and unsupervised learning are very deep; in section 7.7 of Chapter 7, we point out yet another useful connection between outlier detection and regression modeling. This particular connection has the merit that it enables the use of hundreds of off-the-shelf regression models for unsupervised outlier detection with an almost trivial implementation.

1.3 The Basic Outlier Detection Models

This section will present an overview of the most important models in the literature, and also provide some idea of the settings in which they might work well. A detailed discussion of these methods is provided in later chapters. Several factors influence the choice of an outlier model, including the data type, data size, availability of relevant outlier examples, and the need for interpretability in a model. The last of these criteria merits further explanation.

The interpretability of an outlier detection model is extremely important from the perspective of the analyst. It is often desirable to determine why a particular data point should be considered an outlier because it provides the analyst further hints about the diagnosis required in an application-specific scenario. This process is also referred to as that of discovering the intensional knowledge about the outliers [318] or that of outlier detection and description [44]. Different models have different levels of interpretability. Typically, models that work with the original attributes and use fewer transforms on the data (e.g., principal component analysis) have higher interpretability. The trade-off is that data transformations often enhance the contrast between the outliers and normal data points at the expense of interpretability. Therefore, it is critical to keep these factors in mind while choosing a specific model for outlier analysis.

1.3.1 Feature Selection in Outlier Detection

It is notoriously difficult to perform feature selection in outlier detection because of the unsupervised nature of the outlier detection problem. Unlike classification, in which labels can be used as guiding posts, it is difficult to learn how features relate to the (unobserved) ground truth in unsupervised outlier detection. Nevertheless, a common way of measuring the non-uniformity of a set of univariate points x_1 ... x_N is the Kurtosis measure. The first step is to compute the mean μ and standard deviation σ of this set of values and standardize the data to zero mean and unit variance as follows:

z_i = (x_i − μ) / σ          (1.2)

Note that the mean value of the squares of z_i is always 1 because of how z_i is defined. The Kurtosis measure computes the mean value of the fourth power of z_i:

K(z_1 ... z_N) = (1/N) Σ_{i=1}^{N} z_i^4

Feature distributions that are very non-uniform show a high level of Kurtosis. For example, when the data contains a few extreme values, the Kurtosis measure will increase because of the use of the fourth power. Kurtosis measures are often used [367] in the context of subspace outlier detection methods (see Chapter 5), in which outliers are explored in lower-dimensional projections of the data.
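To make the computation concrete, a minimal sketch (not from the book; the function name and example values are illustrative) standardizes the values and averages their fourth powers, exactly as described above:

    import numpy as np

    def kurtosis_measure(x):
        """Univariate Kurtosis measure: mean fourth power of the standardized values."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()   # zero mean, unit variance, so mean(z**2) == 1
        return np.mean(z ** 4)

    print(kurtosis_measure([1, 2, 2, 3, 3, 2, 1, 2]))    # fairly uniform feature: value around 2
    print(kurtosis_measure([1, 2, 2, 3, 3, 2, 1, 100]))  # one extreme value: much larger Kurtosis

The second call returns a much larger value than the first because a single extreme value dominates the fourth-power average.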

One problem with the Kurtosis measure is that it does not account for the interactions between the various attributes, because it analyzes the features individually. It is also possible to use the Kurtosis measure on lower-dimensional distance distributions. For example, one can compute the Kurtosis measure on the set of N Mahalanobis distances of all data points to the centroid of the data after the data has been projected into a lower-dimensional subspace S. Such a computation provides the multidimensional Kurtosis of that subspace S, while taking into account the interactions between the various dimensions of S. The Mahalanobis distance is introduced in Chapter 2. One can combine this computation with a greedy method of iteratively adding features to a candidate subset S of features in order to construct a discriminative subset of dimensions with the highest multidimensional Kurtosis.
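A hedged sketch of this subspace variant is shown below; the function names are illustrative, and the exact normalization of the distance-based Kurtosis in the cited work may differ. The idea is simply to apply the univariate Kurtosis measure to the Mahalanobis distances computed within a candidate subspace, and to grow the subspace greedily:

    import numpy as np

    def mahalanobis_kurtosis(X, subspace):
        """Kurtosis of the Mahalanobis distances to the centroid within a subspace of features."""
        S = X[:, list(subspace)]
        centered = S - S.mean(axis=0)
        cov = np.atleast_2d(np.cov(centered, rowvar=False))
        inv_cov = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance matrix
        d = np.sqrt(np.einsum('ij,jk,ik->i', centered, inv_cov, centered))  # Mahalanobis distances
        z = (d - d.mean()) / d.std()
        return np.mean(z ** 4)

    def greedy_kurtosis_subspace(X, k):
        """Greedily add the feature that most increases the multidimensional Kurtosis."""
        remaining, chosen = set(range(X.shape[1])), []
        for _ in range(k):
            best = max(remaining, key=lambda f: mahalanobis_kurtosis(X, chosen + [f]))
            chosen.append(best)
            remaining.remove(best)
        return chosen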

A second methodology for feature selection [429] is to use the connections of the outlier detection problem to supervised learning. The basic idea is that features that are uncorrelated with all other features should be considered irrelevant, because outliers often correspond to violations of the model of normal data dependencies. Uncorrelated features cannot be used to model data dependencies. Therefore, if one uses a regression model to predict one of the features from the other features, and the average squared error is too large, then such a feature should be pruned. All features are standardized to unit variance, and the root-mean-squared error RMSE_k of predicting the kth feature from the other features is computed. Note that if RMSE_k is larger than 1, then the error of prediction is greater than the feature variance, and therefore the kth feature should be pruned. One can also use this approach to weight the features. Specifically, the weight of the kth feature is given by max{0, 1 − RMSE_k}. Details of this model are discussed in section 7.7 of Chapter 7.
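A minimal sketch of this weighting scheme is given below (not from the book). It assumes ordinary least squares as the regression model and scikit-learn's cross_val_predict for out-of-sample predictions; both are implementation choices rather than requirements of the method.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    def feature_weights(X, n_folds=5):
        """Weight each feature by how well the remaining features predict it."""
        X = np.asarray(X, dtype=float)
        X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize every feature to unit variance
        weights = np.zeros(X.shape[1])
        for k in range(X.shape[1]):
            others = np.delete(X, k, axis=1)
            # out-of-sample predictions avoid rewarding a regression that merely overfits
            pred = cross_val_predict(LinearRegression(), others, X[:, k], cv=n_folds)
            rmse_k = np.sqrt(np.mean((X[:, k] - pred) ** 2))
            weights[k] = max(0.0, 1.0 - rmse_k)    # RMSE_k >= 1 means the feature is pruned
        return weights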

1.3.2 Extreme-Value Analysis

The most basic form of outlier detection is extreme-value analysis of 1-dimensional data. These are very specific types of outliers in which it is assumed that the values that are either too large or too small are outliers. Such special kinds of outliers are also important in many application-specific scenarios.

The key is to determine the statistical tails of the underlying distribution. As illustrated earlier in Figure 1.3, the nature of the tails may vary considerably depending upon the underlying data distribution. The normal distribution is the easiest to analyze, because most statistical tests (such as the Z-value test) can be interpreted directly in terms of probabilities of significance. Nevertheless, even for arbitrary distributions, such tests provide a good heuristic idea of the outlier scores of data points, even when they cannot be interpreted statistically. The problem of determining the tails of distributions has been widely studied in the statistics literature. Details of such methods will be discussed in Chapter 2.
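For the normal case, the Z-value test mentioned above amounts to the following minimal sketch (the threshold of 3 is a common rule of thumb, not a value prescribed by the text):

    import numpy as np

    def z_value_extremes(x, threshold=3.0):
        """Flag 1-dimensional extreme values using the Z-value test."""
        x = np.asarray(x, dtype=float)
        z = np.abs(x - x.mean()) / x.std()
        return np.where(z >= threshold)[0]   # indices of points in the statistical tails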

Extreme-value statistics [437] is distinct from the traditional definition of outliers. The traditional definition of outliers, as provided by Hawkins, defines such objects by their generative probabilities rather than the extremity in their values. For example, in the data set {1, 2, 2, 50, 98, 98, 99} of 1-dimensional values, the values 1 and 99 could, very mildly, be considered extreme values. On the other hand, the value 50 is the average of the data set, and is most definitely not an extreme value. However, the value 50 is isolated from most of the other data values, which are grouped into small ranges such as {1, 2, 2} and {98, 98, 99}. Therefore, most probabilistic and density-based models would classify the value 50 as the strongest outlier in the data, and this result would also be consistent with Hawkins's generative definition of outliers.
