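Research on a sound event localization and detection algorithm based on feature fusion and the Transformer model

PU Zi-jun, ZHANG Shou-ming
(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China)

Abstract: To address multi-channel environmental sound detection, a feature fusion network model that incorporates the Transformer structure, TBCF-MTNN, is proposed. The network takes the log-Mel spectrum and the generalized cross-correlation spectrum as input. First, a CNN and a GRU extract the local features of each spectrum and its temporal context features. The two feature maps are then fused through a Cross-stitch module, which effectively solves the problem that multi-feature information cannot be shared in traditional networks. Next, the fused feature map is fed into a Transformer for further feature extraction. Finally, the classification and localization results are output through a fully connected layer. Experimental results on the TAU-NIGENS2020 dataset show that the proposed TBCF-MTNN network reduces the classification error rate to 0.26 in the sound detection task and, compared with the Baseline, reduces the localization error to 4.7° in the sound source localization task. Comparison with models such as Baseline, FPN, and EIN shows that the proposed network achieves better recognition and detection performance.

Keywords: sound event localization and detection; deep learning; Transformer model; Cross-stitch; feature fusion

CLC number: TP510.4010; TP520.2050
Document code: A
doi: 10.3969/j.issn.1007-130X.2023.06.017

A sound event localization and detection algorithm based on feature fusion and Transformer model

PU Zi-jun, ZHANG Shou-ming
(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)

Abstract: Aiming at the problem of multi-channel environmental sound detection, a feature fusion network model TBCF-MTNN is proposed, which introduces the Transformer structure. The network takes the logarithmic Mel spectrum and the generalized cross-correlation spectrum as input. Firstly, the local features of the spectrum and the temporal context features are obtained through CNN and GRU, and then the two feature maps are merged through the Cross-stitch module, which effectively solves the problem in traditional networks that multi-feature information cannot be shared. Secondly, the fused feature map is sent to the Transformer for further feature extraction. Finally, the classification and localization results are output through the fully connected layer. Verification on the TAU-NIGENS2020 dataset shows that, compared with the Baseline model, the TBCF-MTNN network reduces the classification error rate to 0.26 in the sound detection task and reduces the localization error to 4.7° in the sound source localization task. Compar...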
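The Cross-stitch fusion step named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used form of a cross-stitch unit, in which two same-shaped feature vectors are mixed by a 2×2 weight matrix. The function name `cross_stitch`, the fixed weights, and the toy feature values are hypothetical and chosen only for illustration. In a trained network the weights would be learned, and the abstract does not specify how TBCF-MTNN configures or places the module.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def cross_stitch(
    feat_a: Sequence[float],
    feat_b: Sequence[float],
    alpha: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector]:
    """Mix two same-length feature vectors with a 2x2 weight matrix.

    alpha[0] = (a_AA, a_AB) weights the output of branch A,
    alpha[1] = (a_BA, a_BB) weights the output of branch B.
    The weights are passed in as constants here (an assumption made
    for illustration); in practice they would be trainable parameters.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both branches must have the same feature length")
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    # Each output element is a linear combination of the two branches
    # at the same position, so information is shared across branches.
    out_a = [a_aa * xa + a_ab * xb for xa, xb in zip(feat_a, feat_b)]
    out_b = [a_ba * xa + a_bb * xb for xa, xb in zip(feat_a, feat_b)]
    return out_a, out_b


if __name__ == "__main__":
    # Hypothetical toy features standing in for the two branches
    # (log-Mel branch and generalized cross-correlation branch).
    logmel_feat = [1.0, 2.0, 3.0]
    gcc_feat = [0.0, 4.0, -2.0]
    weights = [[0.75, 0.25], [0.25, 0.75]]
    fused_a, fused_b = cross_stitch(logmel_feat, gcc_feat, weights)
    print(fused_a)  # [0.75, 2.5, 1.75]
    print(fused_b)  # [0.25, 3.5, -0.75]
```

With the identity matrix as weights, each branch passes through unchanged. Off-diagonal weights control how much each branch borrows from the other, which is the sense in which the module lets the two feature streams share information.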