Multimodal NER
- Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. Zhang et al., WSDM, 2023
De-bias Contrastive Learning
- Flat Multi-modal Interaction Transformer for Named Entity Recognition. Lu et al., COLING, 2022
One auxiliary task (entity span detection)
- Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER. Zhao et al., ACM MM, 2022 [RGCN-code]
- CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention. Wang et al., ICME, 2022
- PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition. Wang et al., DASFAA, 2022
- MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition. Xu et al. WSDM, 2022 [MAF-code]
two auxiliary tasks (self-supervised cross-modal matching + alignment)
- Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention. Mai et al. WSDM, 2022
- ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. Wang et al. NAACL, 2022 [ITA-code]
aligns the image to object tags (local) and captions (global), plus OCR text, as visual contexts; minimizes KL(cross-modal view, text view)
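As a rough illustration of the KL objective noted for ITA (an assumed simplification, not the authors' implementation): minimize the token-level KL divergence between the label distribution predicted from the cross-modal view (text plus object tags / captions / OCR) and the one predicted from the text-only view.

```python
# Sketch of a view-consistency loss in the spirit of ITA (assumed form).
import math

def kl_divergence(p, q):
    # KL(p || q) over one token's label distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(cross_modal_dists, text_dists):
    # average token-level KL(cross-modal view || text-only view)
    kls = [kl_divergence(p, q) for p, q in zip(cross_modal_dists, text_dists)]
    return sum(kls) / len(kls)
```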
- Good Visual Guidance Make A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. Chen et al. Findings: NAACL, 2022 [HVPNeT-code]
- Multimodal Named Entity Recognition with Image Attributes and Image Knowledge. Chen et al. DASFAA, 2021
- Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. Zhang et al. AAAI, 2021 [UMGF-code]
multimodal GNNs for NER; evenly segmented vs. targeted visual cues; cross-domain generalization comparison
- RpBERT: A text-image relation propagation-based BERT model for multimodal NER. Sun et al. AAAI, 2021 [RpBERT-code]
multimodal BERT for NER; one auxiliary task (text-image relation [CLS]) on the external source (the TRC data)
- RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER. Sun et al. COLING, 2020
Relationship Inference and Visual Attention (RIVA); auxiliary task (text-image relation [CLS]) on a large unlabeled tweet corpus
- Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts. Wu et al. ACM MM, 2020
OCSGA: Object + Character + SA (Self-Attention) + GA (Guide Attention)
- Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer. Yu et al. ACL, 2020 [UMT-code]
multimodal Transformer for NER; one auxiliary task (entity span detection)
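The auxiliary-task setup used by UMT can be sketched as a weighted multi-task objective (a minimal toy version; the weight `alpha` and the flat list inputs are assumptions, not the paper's exact formulation): the main NER tagging loss is combined with a binary entity-span-detection loss.

```python
# Toy multi-task loss: NER tagging + auxiliary entity span detection.
import math

def cross_entropy(probs, gold):
    # probs: per-token probability distributions; gold: gold label ids
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def joint_loss(ner_probs, ner_gold, span_probs, span_gold, alpha=0.5):
    # total = L_ner + alpha * L_span  (alpha is a hypothetical task weight)
    return cross_entropy(ner_probs, ner_gold) + alpha * cross_entropy(span_probs, span_gold)
```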
- Visual Attention Model for Name Tagging in Multimodal Social Media. Lu et al. ACL, 2018
Twitter-17 & Snapchat datasets; multimodal Hierarchy BiLSTM-CRF for NER
- Multimodal Named Entity Recognition for Short Social Media Posts. Moon et al. NAACL, 2018
SnapCaptions dataset
- Adaptive Co-attention Network for Named Entity Recognition in Tweets. Zhang et al. AAAI, 2018
Twitter-15 dataset; multimodal CNN-BiLSTM-CRF for NER
Journal
- Hierarchical self-adaptation network for multimodal named entity recognition in social media. Tian et al. Neurocomputing, 2021
- Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning. Zhang et al. TMM, 2021
Transfer/ Cross-domain/ Low-resource NER
- CycleNER: An Unsupervised Training Approach for Named Entity Recognition. Iovine et al. TheWebConf, 2022
cycle-consistency unsupervised training for emerging and/or low-resource domains; Track: Web Mining and Content Analysis
- Exploring Modular Task Decomposition in Cross-domain Named Entity Recognition. Zhang et al. SIGIR, 2022 [MTD-code]
adversarial regularization
Neural/ Prompt-based NER & Survey
- Prompt-Based Metric Learning for Few-Shot NER. ICLR under review, 2023
input with label-aware prompt: append the entity type to each entity occurrence in the input so that the model is aware of such information
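The label-aware prompt idea above can be sketched in a few lines (hypothetical helper, not the paper's code): append each entity's type marker right after its occurrence, so the encoder sees e.g. "Obama [person] visited Paris [location]".

```python
# Toy label-aware prompt construction for few-shot NER support examples.
def label_aware_prompt(tokens, entities):
    """tokens: list of words; entities: (start, end, type) spans,
    end-exclusive, assumed non-overlapping and sorted by start."""
    ends = {end: etype for _, end, etype in entities}
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 in ends:                # entity span closes after this token
            out.append(f"[{ends[i + 1]}]")
    return " ".join(out)
```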
- Effective Named Entity Recognition with Boundary-aware Bidirectional Neural Networks. Li et al. TheWebConf, 2021
pointer networks in boundary detection and entity classification
- Modularized Interaction Network for Named Entity Recognition. Li et al. ACL, 2021
sharing between boundary detection & type prediction to enhance NER
- A Survey on Deep Learning for Named Entity Recognition. Li et al. TKDE, 2020
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Ma & Hovy. ACL, 2016
Workshop
- Can images help recognize entities: A study of the role of images for Multimodal NER. Chen et al. EMNLP workshop, 2021 [LSTM-CNNs-CRF-code]
Information Extraction & Multimodal IE
- Multimodal Relation Extraction with Efficient Graph Alignment. Zheng et al. ACM MM, 2021
- Modeling Dense Cross-Modal Interactions for Joint Entity-Relation Extraction. Zhao et al. IJCAI, 2020
Tutorial
- Multi-Modal Information Extraction from Text, Semi-Structured, and Tabular Data on the Web. Dong et al. SIGKDD Tutorial, 2020
Multimodal MT
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation. Yin et al. ACL, 2020
- A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. Specia et al. WMT, 2016
Multimodal KG
- Multi-modal Siamese Network for Entity Alignment. Chen et al. SIGKDD, 2022
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion. Chen et al. SIGIR, 2022
- Multi-Modal Knowledge Graph Construction and Application: A Survey. Zhu et al. arXiv, 2022
Multimodal IR, RecSys, and Ads
- Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval. ICLR under review, 2023
- Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation. Han et al. TheWebConf, 2022
- CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval. Yu et al. SIGKDD, 2022
- Automatic Generation of Product-Image Sequence in E-commerce. Fan et al. SIGKDD, 2022
leverages textual review feedback as an additional training target and uses the product's textual description to provide extra semantic information
- ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. Baltescu et al. SIGKDD, 2022
Transformer to aggregate image & text modalities to learn product representations
- PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest. Pal et al. SIGKDD, 2020
- Combo-Attention Network for Baidu Video Advertising. Yu et al. SIGKDD, 2020
combo-attention module (CAM): exploits cross-modal attention in addition to self-attention to capture the relevance between words (search query) and bounding boxes (short video)
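The cross-modal half of such a combo-attention module reduces to standard scaled dot-product attention from query words over bounding-box features (a generic sketch, not Baidu's implementation):

```python
# Query words attend over bounding-box features (scaled dot-product attention).
import numpy as np

def cross_modal_attention(words, boxes):
    """words: (n_words, d) query-side features; boxes: (n_boxes, d) visual features."""
    d = words.shape[-1]
    scores = words @ boxes.T / np.sqrt(d)           # (n_words, n_boxes)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over boxes
    return weights @ boxes                          # word features enriched with visual context
```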
Cross-Modal Hashing Retrieval
- Vulnerability vs. Reliability Disentangled Adversarial Examples for Cross-Modal Learning. Li et al. SIGKDD, 2020
learns cross-modal correlations by disentangling features into modality-unrelated and modality-related components and exploiting the modality-related part
Multimodal Sentiment/ Emotion
- “I Have No Text in My Post”: Using Visual Hints to Model User Emotions in Social Media. Song et al. TheWebConf, 2022
- Multimodal Transformer for Unaligned Multimodal Language Sequences. Tsai et al. ACL, 2019
crossmodal attention, crossmodal Transformer, multimodal Transformer
- Adapting BERT for Target-Oriented Multimodal Sentiment Classification. Yu & Jiang. IJCAI, 2019
Vision+Language (VQA, NLVR, VCSR, I-T Retri., V Entail., REC)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. Li et al. EMNLP, 2022 [mPLUG-code]
cross-modal skip-connections create inter-layer shortcuts that skip a number of layers of the time-consuming full self-attention on the vision side
- An Empirical Study of Training End-to-End Vision-and-Language Transformers. Dou et al. CVPR, 2022
co-attention; merged-attention
- UNITER: UNiversal Image-TExt Representation Learning. Chen et al. ECCV, 2020
Optimal Transport for finegrained alignment between words and image regions
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Li et al. ECCV, 2020
object tags as anchor points to learn alignment
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations. Su et al. ICLR, 2020
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. Li et al. AAAI, 2020
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Lu et al. NeurIPS, 2019
- Deep Modular Co-Attention Networks for Visual Question Answering. Yu et al. CVPR, 2019
Self Attn (word-to-word, region-to-region); Guided Attn (word-to-region)
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers. Tan & Bansal. EMNLP, 2019
three encoders: object relationship, language, and cross-modality
- VisualBERT: A Simple and Performant Baseline for Vision and Language. Harold et al. arXiv, 2019
Multimodal Learning
- Attentional Context Alignment for Multimodal Sequential Learning. ICLR under review, 2023
- On Uni-modal Feature Learning in Multi-modal Learning. ICLR under review, 2023
decomposes multi-modal features into uni-modal features and paired features; although multi-modal joint training enables cross-modal interaction to learn paired features, the model easily saturates and ignores the uni-modal features that are hard to learn but important for generalization
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. Liang et al. arXiv, 2022
- Multimodal Learning with Transformers: A Survey. Xu et al. arXiv, 2022
NER related
- Show and Write: Entity-aware Article Generation with Image Information. ICLR under review, 2023
produces an article conditioned on article metadata and on image information, such as extracted named entities, which provide important context
- An End-to-End Progressive Multi-Task Learning Framework for Medical Named Entity Recognition and Normalization. Zhou et al. ACL, 2021
- Linguistically-Enriched and Context-Aware Zero-shot Slot Filling. Siddique et al. TheWebConf, 2021
NER cues for zero-shot slot filling in task-oriented dialog systems
- A Unified MRC Framework for Named Entity Recognition. Li et al. ACL, 2020
NER as machine reading comprehension (MRC)
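The MRC formulation can be sketched as follows (the example queries are hypothetical, not the paper's templates): each entity type is turned into a natural-language question, and NER reduces to extracting answer spans for each (question, context) pair.

```python
# Toy reduction of NER to machine reading comprehension inputs.
QUERIES = {  # hypothetical example queries, one per entity type
    "person": "Which words refer to a person?",
    "location": "Which words refer to a location?",
}

def build_mrc_inputs(sentence):
    # one reading-comprehension instance per entity type;
    # a span-extraction QA model then answers each pair
    return [(q, sentence) for q in QUERIES.values()]
```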
Multi-Modal Sarcasm, Fake News, Hate Speech
- Multimodal Hate Speech Detection via Cross-Domain Knowledge Transfer. Yang et al. ACM MM, 2022
- Cross-modal Ambiguity Learning for Multimodal Fake News Detection. Chen et al. TheWebConf, 2022
Track: Web Mining and Content Analysis
- Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network. Liang et al. ACL, 2022
cross-modal graph; cross-modal graph convolutional network
- Multi-Modal Sarcasm Detection with Interactive In-Modal and Cross-Modal Graphs. Liang et al. ACM MM, 2021
Tutorial
- Multi-modal Network Representation Learning. Zhang et al. SIGKDD Tutorial, 2020
Please do not hesitate to inform us of any missing works.
[BackToHome]