贪心科技 | Personalized education services for everyone

Review: Paper Reading
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
范老师, 2020/06/07

Content
• Introduction
• Related Work
• BERT
• Experiments
• Ablation Studies
• Conclusion

Introduction
BERT: Bidirectional Encoder Representations from Transformers
• Bidirectional pre-training for language representations
• Pre-trained representations reduce the need for many heavily engineered task-specific architectures
• Fine-tuning requires little additional task-specific work
• BERT advances the state of the art on 11 NLP tasks

Related Work
2.1 Unsupervised Feature-based Approaches (not bidirectional)
• Left-to-right language-modeling objectives have been used, as well as objectives that discriminate correct from incorrect words given left and right context.
• Context-sensitive features can be extracted from a left-to-right and a right-to-left language model.
2.2 Unsupervised Fine-tuning Approaches
• Sentence or document encoders that produce contextual token representations are pre-trained on unlabeled text and fine-tuned for a supervised downstream task.
2.3 Transfer Learning from Supervised Data

BERT - Pre-training
Task #1: Masked LM
• Masked language model: predict only the masked words (15% of tokens), with an output softmax over the vocabulary.
• Data generation: the training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, it is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Then Ti is used to predict the original token with a cross-entropy loss. Variations of this procedure are compared in Appendix C.2.
Task #2: Next Sentence Prediction (NSP)
• Data generation: when choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).
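The two data-generation procedures above (the 15% / 80-10-10 masking rule and the 50/50 NSP sampling) can be sketched as follows. This is a minimal illustration, not the original BERT preprocessing code; the function names (`mask_tokens`, `make_nsp_example`) and the simplified random-sentence sampling (drawn from the same list rather than from a different document) are assumptions for the sketch.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Sketch of masked-LM data generation: pick 15% of positions;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns the corrupted sequence and {position: original token} targets."""
    rng = random.Random(seed)
    out = list(tokens)
    n = max(1, int(round(len(tokens) * mask_prob)))
    labels = {}
    for i in rng.sample(range(len(tokens)), n):
        labels[i] = tokens[i]            # target is always the original token
        r = rng.random()
        if r < 0.8:
            out[i] = mask_token          # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab)   # 10%: replace with a random token
        # else: 10% of the time, keep the token unchanged
    return out, labels

def make_nsp_example(sentences, idx, rng):
    """Sketch of NSP data generation: 50% actual next sentence (IsNext),
    50% a random sentence (NotNext; BERT samples from another document)."""
    a = sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(sentences):
        return a, sentences[idx + 1], "IsNext"
    return a, rng.choice(sentences), "NotNext"
```

Note that the prediction target in `labels` is the original token at every chosen position, including the 10% that are left unchanged; this keeps the model from assuming every observed token is correct.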