Modern Information Technology, Vol. 7, No. 4, February 25, 2023

Research on Abstract Automatic Generation of Long Document Based on the Word2Vec+TextRank Algorithm

ZHU Yuting, LIU Le, XIN Xiaole, CHEN Longhui, KANG Lianghe
(Gansu Agricultural University, Lanzhou 730070, China)

Received: 2022-10-13
Funding: Sheng Tongsheng Science and Technology Innovation Fund of Gansu Agricultural University (GSAU-STS-2021-15); National Natural Science Foundation of China (32060437); Gansu Agricultural University Provincial College Students' Innovation and Entrepreneurship Training Program (202216018)

Abstract: In recent years, extracting key information from large volumes of information has become an urgent problem. For long Chinese patent documents, this paper proposes a patent abstract generation algorithm that combines Word2Vec and TextRank. First, the Python Jieba library is used to segment the Chinese patent documents into words, and a stop-word dictionary is used to remove meaningless words. Second, the Word2Vec algorithm is used for feature extraction, and the extracted keywords are visualized with WordCloud. Finally, the TextRank algorithm computes the similarity between sentences to produce candidate abstract sentences, and the abstract of the patent document is generated according to the weights of the candidate sentences. Experiments show that the patent abstracts generated with Word2Vec and TextRank are of high quality and summarize the documents well.

CLC number: TP391.1  Document code: A  Article ID: 2096-4706(2023)04-0036-04

Keywords: Jieba word segmentation; keyword extraction; Word2Vec algorithm; TextRank algorit...
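The final stage described in the abstract (scoring sentences by their mutual similarity and keeping the highest-weighted ones) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sentences have already been segmented into token lists (for Chinese text, for example with Jieba), and it uses plain bag-of-words cosine similarity as a stand-in for the similarity measure, which this excerpt does not specify. The function names `cosine`, `textrank_scores` and `summarize`, the damping factor 0.85, and the toy sentences are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_scores(sentences, stopwords=frozenset(), d=0.85,
                    max_iter=100, tol=1e-6):
    """Weighted TextRank over a sentence-similarity graph.

    sentences: list of token lists (already segmented).
    d: damping factor (0.85 is an assumed, commonly used value).
    """
    # Stop-word removal, then one bag of words per sentence.
    bags = [Counter(w for w in s if w not in stopwords) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Symmetric similarity matrix with a zero diagonal (no self-loops).
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to w_ji.
            incoming = sum(sim[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - d) + d * incoming)
        delta = max(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


def summarize(sentences, k=2, stopwords=frozenset()):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank_scores(sentences, stopwords)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy, pre-segmented input (hypothetical data, not from the paper).
docs = [
    ["patent", "abstract", "generation", "method"],
    ["the", "patent", "claims"],
    ["abstract", "of", "the", "text"],
    ["generation", "of", "candidates"],
]
stop = {"the", "of"}
print(summarize(docs, k=1, stopwords=stop))
```

In this toy input the first sentence shares a word with each of the others, while the others share nothing among themselves once stop words are removed, so it receives the largest weight. To bring the sketch closer to the pipeline in the abstract, `cosine` could be replaced by a similarity computed on sentence vectors built from Word2Vec embeddings.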