Batch effects in single-cell RNA sequencing data are corrected by matching mutual nearest neighbours

Laleh Haghverdi1,2, Aaron T. L. Lun3, Michael D. Morgan4, and John C. Marioni1,3,4

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
2 Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
3 Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, United Kingdom
4 Wellcome Trust Sanger Institute, Cambridge, United Kingdom

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) datasets that are produced in different laboratories and at different times contain batch effects that could compromise integration and interpretation of these data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known, or the same, across batches. We present a strategy for batch correction that is based on the detection of mutual nearest neighbours (MNN) in the high-dimensional expression space. Our approach does not rely on pre-defined or equal population compositions across batches, and only requires that a subset of the population be shared between batches. We demonstrate the superiority of our approach over existing methods using both simulated and real scRNA-seq datasets. Using multiple droplet-based scRNA-seq datasets, we demonstrate that our MNN batch-effect correction method scales to large numbers of cells.

Correspondence should be addressed to J.C.M.: John.Marioni@cruk.cam.ac.uk.

Data availability

The published datasets used in this manuscript are available through GEO accession numbers SMART-seq2 platform haematopoietic data by Nestorowa et al. [12]: GSE81682, MARS-seq platform haematopoietic data by Paul et al. [18]: GSE72857, CEL-seq platform pancreas data by Grün et al. [20]: GSE81076, CEL-seq2 platform pancreas data by Muraro et al. [21]: GSE85241, SMART-seq2 platform pancreas data by Lawlor et al. [22]: GSE86473, or ArrayExpress accession number: SMART-seq2 platform pancreas data by Segerstolpe et al. [23]: E-MTAB-5061.

Software availability

An open-source software implementation of our MNN method is available as the mnnCorrect function in version 1.6.2 of the scran package on Bioconductor (https://bioconductor.or...