All Kinds of Attention
Hung-yi Lee (李宏毅)

Prerequisite:
• [Machine Learning 2021] Self-attention (Part 1): https://youtu.be/hYdO9CscNes
• [Machine Learning 2021] Self-attention (Part 2): https://youtu.be/gmsMY5kc-zw

To Learn More…
• Efficient Transformers: A Survey: https://arxiv.org/abs/2009.06732
• Long Range Arena: A Benchmark for Efficient Transformers: https://arxiv.org/abs/2011.04006

How to make self-attention efficient?
The bottleneck is the attention matrix: every query is matched against every key, which costs N×N for a length-N sequence.
Notice:
• Self-attention is only a module in a larger network.
• These speed-up techniques were usually developed for image processing, where the sequences are very long.

Skip Some Calculations with Human Knowledge
Can we fill in some values of the attention matrix with human knowledge instead of computing them?

Local Attention / Truncated Attention
• Only calculate the attention weights between each query and its neighboring keys; set the rest to 0.
• Similar to CNN.

Stride Attention
• Each query attends to keys a fixed stride away, so information from more distant positions can still be gathered.

Global Attention
• Add special tokens into the original sequence. A special token is the "village chief" among the tokens: it attends to every token, and every token attends to it.
• There is no attention between non-special tokens.

Many Different Choices…
• Different heads can use different patterns. Only children make choices; adults take them all.
• Longformer: https://arxiv.org/abs/2004.05150
• Big Bird: https://arxiv.org/abs/2007.14062

Can we only focus on the critical parts?
• Only the large values in the attention matrix matter; entries with small values can be directly set to 0 with little influence on the results.
• How can we quickly estimate which portion of the matrix will have small attention weights?

Clustering
• Reformer: https://openreview.net/forum?id=rkgNKkHtvB
• Routing Transformer: https://arxiv.org/abs/2003.05997
• Step 1: cluster the queries and keys based on similarity (approximate & fast).
• Step 2: if a query and a key belong to the same cluster, calculate their attention weight; if not, set it to 0.

Learnable Patterns
• Sinkhorn Sorting Network: https://arxiv.org/abs/2002.11296 (simplified version here)
• Whether a grid of the attention matrix should be skipped or not is decided by another module, an NN that reads the input sequence and is jointly learned with the rest of the network.

Do we need the full attention matrix?
• Many columns of the attention matrix are redundant.
• Low rank: pick K representative keys (and the corresponding values) out of the N keys, shrinking the attention matrix from N×N to N×K.
• Can we also reduce the number of queries? Doing so would change the output sequence length, so it is usually avoided.

Reduce Number of Keys
• Compressed Attention (https://arxiv.org/abs/1801.10198): compress the keys with convolution (Conv).
• Linformer (https://arxiv.org/abs/2006.04768): representative keys are linear combinations of the N keys.

Review
The attention mechanism is three matrix multiplications with a softmax…
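As a baseline for the review above ("three matrix multiplications with a softmax"), here is a minimal NumPy sketch of vanilla self-attention. It assumes a single head and unbatched inputs; the function name and toy data are illustrative, not from the lecture.

```python
import numpy as np

def full_attention(Q, K, V):
    """Vanilla self-attention: three matrix multiplications plus a softmax.

    Q, K: (N, d); V: (N, d_v). Cost is O(N^2) in sequence length N,
    dominated by the N x N attention matrix softmax(Q K^T / sqrt(d)).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) attention scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V                                     # (N, d_v) output

# Toy usage with random data
rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d))
out = full_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

All the efficient variants in these notes modify this computation, mostly by avoiding the full (N, N) score matrix.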
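The local/truncated attention pattern described above (each query only attends to neighboring keys, everything else set to 0) can be sketched with a band mask. This is a simplified single-head NumPy illustration, not the implementation from any particular paper; `window` is an assumed parameter name.

```python
import numpy as np

def local_attention(Q, K, V, window=1):
    """Local / truncated attention: each query attends only to keys within
    `window` positions. All other entries are forced to exactly 0 by
    masking the scores with -inf before the softmax."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(N)
    band = np.abs(idx[:, None] - idx[None, :]) <= window   # width 2*window+1
    scores = np.where(band, scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A, A @ V

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d))
A, out = local_attention(Q, K, V, window=1)
print(A[0])  # only the first two entries of row 0 are non-zero
```

A stride pattern is the same idea with a different mask, e.g. `(idx[:, None] - idx[None, :]) % stride == 0`.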
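The two clustering steps above (cluster queries/keys, then attend only within the same cluster) can be sketched as follows. This is a heavily simplified single-head illustration of the Reformer / Routing Transformer idea: the cluster assignments are taken as given, whereas in the actual papers they come from a fast approximate similarity-based scheme (LSH hashing or online k-means).

```python
import numpy as np

def clustered_attention(Q, K, V, q_cluster, k_cluster):
    """Clustering-based sparse attention (simplified sketch).

    An attention weight is computed only when a query and a key fall in
    the same cluster; all other entries of the matrix are exactly 0."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    same = q_cluster[:, None] == k_cluster[None, :]   # (N, N) boolean mask
    scores = np.where(same, scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A, A @ V

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
clusters = np.array([0, 0, 1, 1])   # toy assignment, shared by queries and keys
A, out = clustered_attention(Q, K, V, clusters, clusters)
print(A)  # block-diagonal: cross-cluster entries are exactly 0
```

With balanced clusters of size roughly N/c, only about N²/c score entries actually matter, which is the source of the speed-up.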
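The "reduce the number of keys" idea above can be sketched in the Linformer style: a learned projection compresses the N keys and values into k representative ones, each a linear combination of the originals. This is a sketch assuming an unbatched single head and a shared projection `E` for keys and values (Linformer learns separate per-layer projections).

```python
import numpy as np

def linformer_attention(Q, K, V, E):
    """Linformer-style low-rank attention (sketch).

    E: (k, N) learned projection producing k representative keys/values.
    The attention matrix shrinks from (N, N) to (N, k); the queries are
    not reduced, so the output sequence length stays N."""
    d = Q.shape[-1]
    K_r, V_r = E @ K, E @ V                  # (k, d): representative keys/values
    scores = Q @ K_r.T / np.sqrt(d)          # (N, k) instead of (N, N)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V_r                           # (N, d): length unchanged

rng = np.random.default_rng(2)
N, k, d = 8, 3, 4
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d))
E = rng.normal(size=(k, N))                  # random here; learned in Linformer
out = linformer_attention(Q, K, V, E)
print(out.shape)  # (8, 4) -- same length as the input
```

Compressed Attention follows the same shape logic but produces the representative keys with a strided convolution over the sequence instead of a dense projection.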