Data Mining: Concepts and Techniques (3rd ed.)
Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
01/26/25

Chapter 3: Data Preprocessing
    Data Preprocessing: An Overview
        Data Quality
        Major Tasks in Data Preprocessing
    Data Cleaning
    Data Integration
    Data Reduction
    Data Transformation and Data Discretization
    Summary

Data Quality: Why Preprocess the Data?
    Measures for data quality: a multidimensional view
        Accuracy: correct or wrong, accurate or not
        Completeness: not recorded, unavailable, …
        Consistency: some modified but some not, dangling, …
        Timeliness: timely update?
        Believability: how much are the data trusted to be correct?
        Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
    Data cleaning
        Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    Data integration
        Integration of multiple databases, data cubes, or files
    Data reduction
        Dimensionality reduction
        Numerosity reduction
        Data compression
    Data transformation and data discretization
        Normalization
        Concept hierarchy generation

Data Cleaning
    Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
        Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
            e.g., Occupation = “” (missing data)
        Noisy: containing noise, errors, or outliers
            e.g., Salary = “−10” (an error)
        Inconsistent: containing discrepancies in codes or names, e.g.,
            Age = “42”, Birthday = “03/07/2010”
            Was rating “1, 2, 3”, now rating “A, B, C”
            Discrepancy between duplicate records
        Intentional (e.g., disguised missing data)
            Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
    Data is not always available
        E.g., many tuples have no recorded value for several attributes, such as customer income in sales dat...
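
The kinds of dirty data listed under Data Cleaning (incomplete, noisy, inconsistent, and disguised missing values) can be illustrated with a small check over a few records. This is only a sketch: the records, the field names, the function names `age_on` and `find_issues`, the one-year tolerance, and the fixed reference date are all assumptions made for illustration, not part of the slides.

```python
from datetime import date

# Hypothetical records; the values echo the slide's examples
# (Occupation = "", Salary = -10, Age = 42 with Birthday = 03/07/2010).
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42,
     "birthday": date(2010, 3, 7)},
    {"id": 2, "occupation": "engineer", "salary": -10, "age": 35,
     "birthday": date(1990, 1, 1)},
    {"id": 3, "occupation": "teacher", "salary": 48000, "age": 30,
     "birthday": date(1995, 6, 15)},
]


def age_on(birthday, today):
    """Return the number of whole years between birthday and today."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1  # birthday has not occurred yet this year
    return years


def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record["occupation"].strip():
        issues.append("incomplete: occupation missing")
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    # Assumed tolerance: allow a one-year gap for stale age fields.
    if abs(age_on(record["birthday"], today) - record["age"]) > 1:
        issues.append("inconsistent: age does not match birthday")
    if (record["birthday"].month, record["birthday"].day) == (1, 1):
        issues.append("suspicious: Jan. 1 birthday may be a disguised missing value")
    return issues


# Arbitrary fixed reference date so the result is reproducible.
TODAY = date(2024, 6, 30)

report = {r["id"]: find_issues(r, TODAY) for r in records}
for record_id, issues in report.items():
    print(record_id, issues)
```

A Jan. 1 birthday is flagged as suspicious rather than wrong, since some people really are born on that date; a check like this only points to records worth a closer look.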