Annotating aspects in text and image: A new task and dataset for multimodal aspect‐based sentiment analysis
Journal of the Association for Information Science and Technology (EarlyView)
Published online on September 08, 2025
Abstract
Aspect‐Based Sentiment Analysis (ABSA) has evolved from textual analysis to a multimodal paradigm, integrating visual information to capture nuanced sentiments. Despite advancements, existing Multimodal ABSA (MABSA) research remains limited in granularity: it focuses on either coarse‐level categories or named entities, neglecting fine‐grained sentiment analysis at the aspect term level and visual objects depicted in images. To address these gaps, we propose a new task, Multimodal Aspect–Category–Sentiment–Appearance Quad Extraction (MASQE), which aims to extract textual aspect terms and visual aspect objects, their associated categories, sentiments, and modality appearances. To facilitate research on this task, we introduce MM‐Rest, a novel dataset comprising 19,962 manually annotated aspect–category–sentiment–appearance quadruples from restaurant reviews, annotated across both text and images. Additionally, we propose a Visual Aspect‐aware Multimodal Large Language Model (VAM‐LLM), which leverages predicted visual aspect objects to enhance multimodal quadruple extraction in an end‐to‐end framework. Experimental results demonstrate the effectiveness of VAM‐LLM over baseline systems, establishing strong benchmarks for MASQE and its subtasks. We believe our work opens new avenues for fine‐grained multimodal sentiment analysis, providing rich resources and methodologies for future research.