A Thorough Comparison of Delta Lake, Iceberg and Hudi
Junjie Chen

About Me
▪ Software engineer at Tencent Data Lake Team
▪ Focus on big data area for years

Agenda
Introduction to Delta Lake, Apache Iceberg and Apache Hudi
Key Features Comparison
  ▪ Transaction
  ▪ Data mutation
  ▪ Streaming Support
  ▪ Schema evolution
Maturity
  ▪ Tooling
  ▪ Integration
  ▪ Performance
Conclusion

What features are expected for the data lake?
[Diagram: "Data Lake" surrounded by the expected features]
▪ Data Quality
▪ Transaction (ACID)
▪ Independence of Engines
▪ Unified Batch & Streaming
▪ Storage Pluggable
▪ Scalable Metadata
▪ Data Mutation

Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

Apache Iceberg
An open table format for huge analytic datasets which delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.

[Diagram labels: DFS/Cloud Storage, Spark Batch & Streaming, AI & Reporting, Interactive Queries, Streaming, Streaming Analytics]

Apache Hudi
Apache Hudi ingests & manages storage of large analytical datasets over DFS.

A Quick Comparison (as of 2020-05)

| Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
| Transaction (ACID) | Y | Y | Y |
| MVCC | Y | Y | Y |
| Time travel | Y | Y | Y |
| Schema Evolution | Y | Y | Y |
| Data Mutation | Y (update/delete/merge into) | N | Y (upsert) |
| Streaming | Sink and source for Spark Structured Streaming | Sink and source (wip) for Spark Structured Streaming, Flink (wip) | DeltaStreamer, HiveIncrementalPuller |
| File Format | Parquet | Parquet, ORC, Avro | Parquet |
| Compaction/Cleanup | Manual | API available (Spark Action) | Manual and Auto |
| Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat |
| Multiple language support | Scala/Java/Python | Java/Python | Java/Python |
| Storage Abstraction | Y | Y | N |
| API dependency | Spark-bundled | Native/Engine bundled | Spark-bundled |
| Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer |

Transaction

Delta Lake
▪ Model
  ▪ Transaction Log (DeltaLog)
  ▪ Optimistic concurrency control
  ▪ Checkpoint changes into Parquet
▪ Atomicity Guarantee
  ▪ HDFS rename
  ▪ S3 file write
  ▪ Azure rename without overwrite
▪ Time Travel
  ▪ timestamp
  ▪ version number

Apache Iceberg
▪ Model
  ▪ Snapshot
  ▪ Optimistic concurrency control
▪ Atomicity Guarantee
  ▪ HDFS rename
  ▪ Hive metastore lock
▪ Time Travel
  ▪ snapshot id
  ▪ timestamp
[Diagram labels: R, W, S1, S2, S3, S4]

Apache Hudi
...
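The Transaction slide pairs optimistic concurrency control with an atomic "publish" step (HDFS rename, rename without overwrite). The following is a minimal sketch of that general idea on a local filesystem, not the actual Delta Lake or Iceberg implementation. The function names (`latest_version`, `try_commit`, `commit_with_retry`), the zero-padded `.json` file naming, and the use of exclusive file creation as a stand-in for rename-without-overwrite are all assumptions made for illustration.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version number, or -1 if the log is empty."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def try_commit(log_dir, version, actions):
    """Try to publish `actions` as commit number `version`.

    O_CREAT | O_EXCL makes creation fail if the file already exists; here it
    stands in (hypothetically) for "rename without overwrite".
    Returns True on success, False if another writer already took this version.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistic loop: read the latest version, claim the next one, retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent commits")


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    print(commit_with_retry(log_dir, [{"add": "part-0.parquet"}]))
    print(commit_with_retry(log_dir, [{"add": "part-1.parquet"}]))
```

The sketch only retries on a version-number collision. A real table format would also need to check whether the winning commit logically conflicts with the pending changes before retrying, and would need to make the file contents visible atomically, both of which are omitted here.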