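贪心科技 | Personalized education services for everyone

Text Similarity: VSM, embedding, deep match
brucehan
2020-04-05

Philosophical reflections on learning
1. Start from top-level designs such as formulas, axioms, laws, norms, and institutions, then operate, execute, reason, and deduce.
2. Start from what already exists in practice, experience, and lessons learned; summarize the regularities, apply mathematical knowledge to analyze and generalize, and arrive at best practices.
Methods shown: Bayes, decision tree, SVM, neural network, regular expression, automaton, dictionary, reinforcement learning, knowledge base, architecture

NLP technology stack
Layers: domain tasks, basic technologies, underlying principles
Items: text classification, information retrieval, information extraction, reading comprehension, question answering, multi-turn interaction, recommender systems, automatic summarization, semantic embedding, machine learning, machine translation, knowledge graph, semantic similarity computation, word segmentation, relation extraction, part-of-speech tagging, syntactic parsing, named entities, language model, neural network, logical reasoning, empirical rules, speech recognition, sentiment computing

Contents
01 Vector space model
02 Semantic representation
03 Deep matching
04 Ranking

A few basic questions
Sentence representation:
  aligned character sequences
  unaligned word-vector averaging
  sentence-level embedding
Similarity computation (distance measures):
  edit distance
  Dice coefficient, Jaccard coefficient
  Euclidean distance
  cosine distance
Example: a.b.c vs. bcd
Edit-distance recurrence:
  f(i, j) = f(i-1, j-1), if a[i] == b[j]
  f(i, j) = 1 + min(f(i-1, j), f(i, j-1), f(i-1, j-1)), otherwise

One-hot
Characteristics: sparse representation; the dimension equals the vocabulary size (about 30,000 common English words)
Word vectors:
  book:    (1,0,0,0,0,0,0,….)
  me:      (0,1,0,0,0,0,0,….)
  a:       (0,0,1,0,0,0,0,….)
  ticket:  (0,0,0,1,0,0,0,….)
  to:      (0,0,0,0,1,0,0,….)
  beijing: (0,0,0,0,0,1,0,….)
Sentence vector:
  book me a ticket to Beijing: (1,1,1,1,1,1,0,….)

Bag of words
e.g. Welcome to Beijing, I love Beijing
Frequency: (1,1,2,1,1,0,0,0,….)

Topic model

PART 02
Unsupervised deep semantics: autoregressive, autoencoding

Autoregressive (AR)
Input:  S Book me a ticket to Beijing today
Output: Book me a ticket to Beijing today e

Autoencoding (AE)
Input:  Book me a ticket to beijing today
Output: Book me a ticket to beijing today
context

Example: word2vec
CBOW, skip-gram

Example: sentence2vector
Skip-thought

ELMo
Embedding from Language Models
train ++ infer

seq2seq
Encode-decode, NMT

seq2seq
Encode-decode, NMT with attention

Attention
A similarity-weighted combination from one query over a series of key-value pairs

transformer

transformer
Multi-head self-attention
Step 1: domain transformation
Step 2: compute attention with matrix operations
Step 3: repeat the computation over multiple groups

transformer
Add & Norm

GPT
Transformer decoder

BERT
Transformer encoder

BERT
Input embedding

BERT
Masked langua...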
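
The basic measures from the vector space model part of the deck (edit distance, Jaccard coefficient, bag-of-words frequency vectors, cosine) can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's implementation. The edit distance uses the standard Levenshtein form of the recurrence, which adds 1 whenever the two characters differ. The function names, the regex tokenizer, and the fixed vocabulary list are all assumptions made for this example. The cosine function returns a similarity; a cosine distance would be 1 minus that value.

```python
import math
import re
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; f[i][j] is the distance between a[:i] and b[:j]."""
    f = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        f[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        f[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                f[i][j] = f[i - 1][j - 1]
            else:
                # 1 + cheapest of deletion, insertion, substitution
                f[i][j] = 1 + min(f[i - 1][j], f[i][j - 1], f[i - 1][j - 1])
    return f[len(a)][len(b)]


def tokenize(sentence: str) -> list:
    """Hypothetical tokenizer: lowercase, keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", sentence.lower())


def bag_of_words(tokens: list, vocab: list) -> list:
    """Frequency vector of tokens over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With the assumed vocabulary `["welcome", "to", "beijing", "i", "love"]`, `bag_of_words(tokenize("Welcome to Beijing, I love Beijing"), vocab)` gives the frequency vector `[1, 1, 2, 1, 1]` from the bag-of-words slide, and `edit_distance("abc", "bcd")` is 2 (delete `a`, insert `d`).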