| Downloads | Citations | Reads |
| --- | --- | --- |
| 287 | 2 | 5 |
Abstract: In text-to-image generation, the generated images often fail to match their text descriptions and suffer from poor quality. To improve both text-image alignment and generation quality, this paper proposes a novel generative adversarial network model (WSA-GAN). Word-level embedding vectors produced by the text encoder are fused with the hidden image features through a cross-attention mechanism and a confidence-based feature fusion method, effectively combining word-level semantic features with image features. The semantic spatial-aware convolution module (SSACN) is also introduced and improved: depthwise separable convolutions replace ordinary convolutions to reduce the number of parameters and thus the model complexity, and a mixed self-attention and convolution layer (ACMix) captures the relationships among pixels in the image features, modeling long-range dependencies within a bounded complexity budget so that the model captures broader contextual information. This improves image quality while also strengthening the alignment between the text and the generated image. Experiments on the CUB-200-2011 dataset show that, compared with mainstream models, both generation quality and text alignment improve to some extent.
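The cross-attention fusion described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the shapes, function names, and the scaled dot-product formulation are assumptions. Each image location attends over the word embeddings and receives a word-context vector that can then be fused with the image feature:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(word_emb, img_feat):
    """Hypothetical word-to-image cross-attention.
    word_emb: (T, d) word embeddings from the text encoder.
    img_feat: (N, d) flattened image hidden features (N spatial locations).
    Returns a (N, d) word-context vector for each image location."""
    d = word_emb.shape[1]
    scores = img_feat @ word_emb.T / np.sqrt(d)   # (N, T) similarity scores
    attn = softmax(scores, axis=-1)               # each row sums to 1 over words
    return attn @ word_emb                        # (N, d) weighted word context

rng = np.random.default_rng(0)
ctx = cross_attention(rng.normal(size=(18, 256)),   # 18 words, 256-dim embeddings
                      rng.normal(size=(64, 256)))   # 8x8 feature map, flattened
```

The context tensor has the same shape as the image features, so it can be combined with them by any fusion rule (the paper's confidence-based fusion weights this combination per location).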
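The parameter saving from replacing an ordinary convolution with a depthwise separable one is easy to quantify. A quick sketch (bias terms ignored; the channel counts below are illustrative, not taken from the paper):

```python
def conv_params(c_in, c_out, k):
    """Parameters of an ordinary k x k convolution: one k x k filter
    per (input channel, output channel) pair."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution mixing channels."""
    return c_in * k * k + c_in * c_out

std = conv_params(256, 256, 3)                 # 256*256*9  = 589824
dws = depthwise_separable_params(256, 256, 3)  # 256*9 + 256*256 = 67840
print(std, dws, round(std / dws, 1))           # roughly an 8.7x reduction
```

For a 3x3 kernel with equal channel counts the reduction approaches k^2 = 9x as the channel count grows, which is the source of the complexity improvement the abstract claims.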
Basic information:
DOI:10.20165/j.cnki.ISSN1673-629X.2024.0355
CLC number: TP391.41; TP18
Citation:
[1] OUYANG Anjie, SUN Dameng, HE Liming. Text-to-image generation method based on semantic spatial awareness and attention[J]. Computer Technology and Development, 2025, 35(03): 109-116. DOI: 10.20165/j.cnki.ISSN1673-629X.2024.0355.
Funding:
Key Research and Development Program of Shaanxi Province (2022GY-030, 2022GY-039)
2024-12-10