Multimodal Learning Tasks [BackToHome]

Multimodal NER

  1. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. Zhang et al. WSDM, 2023
    De-bias Contrastive Learning
  2. Flat Multi-modal Interaction Transformer for Named Entity Recognition. Lu et al. COLING, 2022
    One auxiliary task (entity span detection)
  3. Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER. Zhao et al. ACM MM, 2022 [RGCN-code]
  4. CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention. Wang et al. ICME, 2022
  5. PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition. Wang et al. DASFAA, 2022
  6. MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition. Xu et al. WSDM, 2022 [MAF-code]
    two auxiliary tasks (self-supervised matching + alignment)
  7. Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention. Mai et al. WSDM, 2022
  8. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. Wang et al. NAACL, 2022 [ITA-code]
    aligns the image into the textual space via object tags (local), captions (global), and OCR as visual contexts; KL divergence between the cross-modal view and the text view
  9. Good Visual Guidance Make A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. Chen et al. Findings: NAACL, 2022 [HVPNeT-code]
  10. Multimodal Named Entity Recognition with Image Attributes and Image Knowledge. Chen et al. DASFAA, 2021
  11. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. Zhang et al. AAAI, 2021 [UMGF-code]
    multimodal GNNs for NER; evenly segmented vs. targeted visual cues; cross-domain generalization comparison
  12. RpBERT: A text-image relation propagation-based BERT model for multimodal NER. Sun et al. AAAI, 2021 [RpBERT-code]
    multimodal BERT for NER; one auxiliary task (text-image relation [CLS]) on the external source (the TRC data)
  13. RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER. Sun et al. COLING, 2020
    Relationship Inference and Visual Attention (RIVA); auxiliary task (text-image relation [CLS]) on unlabeled large tweet corpus
  14. Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts. Wu et al. ACM MM, 2020
    OCSGA: Object + Character + SA (Self-Attention) + GA (Guide Attention)
  15. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer. Yu et al. ACL, 2020 [UMT-code]
    multimodal Transformer for NER; one auxiliary task (entity span detection)
  16. Visual Attention Model for Name Tagging in Multimodal Social Media. Lu et al. ACL, 2018
    Twitter-17 & Snapchat datasets; multimodal Hierarchy BiLSTM-CRF for NER
  17. Multimodal Named Entity Recognition for Short Social Media Posts. Moon et al. NAACL, 2018
    SnapCaptions dataset
  18. Adaptive Co-attention Network for Named Entity Recognition in Tweets. Zhang et al. AAAI, 2018
    Twitter-15 dataset; multimodal CNN-BiLSTM-CRF for NER
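Several entries above (e.g., ITA) regularize multimodal NER by making a cross-modal view agree with a text-only view through a KL term. A minimal NumPy sketch of that consistency loss; the per-token label logits and their shapes are hypothetical, not taken from any of the papers' released code:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the label dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per token, for label distributions p and q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

# Hypothetical per-token label logits from two views of one sentence:
# a text-only view, and a cross-modal view whose input also contains
# verbalized image context (object tags / captions / OCR).
rng = np.random.default_rng(0)
text_logits = rng.normal(size=(6, 9))         # 6 tokens, 9 BIO labels
cross_modal_logits = rng.normal(size=(6, 9))

# Consistency loss: push the cross-modal view's predictions toward
# the text view's, averaged over tokens.
loss = kl_divergence(softmax(cross_modal_logits),
                     softmax(text_logits)).mean()
```

The same pattern also covers the auxiliary-task flavor used by UMT and RpBERT: an extra objective computed alongside the main CRF tagging loss.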
Journal
  1. Hierarchical self-adaptation network for multimodal named entity recognition in social media. Tian et al. Neurocomputing, 2021
  2. Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning. Zhang et al. TMM, 2021
Transfer/ Cross-domain/ Low-resource NER
  1. CycleNER: An Unsupervised Training Approach for Named Entity Recognition. Iovine et al. TheWebConf, 2022
    cycle-consistency unsupervised training for emerging and/or low-resource domains; Track: Web Mining and Content Analysis
  2. Exploring Modular Task Decomposition in Cross-domain Named Entity Recognition. Zhang et al. SIGIR, 2022 [MTD-code]
    adversarial regularization
Neural/ Prompt-based NER & Survey
  1. Prompt-Based Metric Learning for Few-Shot NER. ICLR under review, 2023
    input with label-aware prompt: append the entity type to each entity occurrence in the input so that the model is aware of such information
  2. Effective Named Entity Recognition with Boundary-aware Bidirectional Neural Networks. Li et al. TheWebConf, 2021
    pointer networks in boundary detection and entity classification
  3. Modularized Interaction Network for Named Entity Recognition. Li et al. ACL, 2021
    sharing between boundary detection & type prediction to enhance NER
  4. A Survey on Deep Learning for Named Entity Recognition. Li et al. TKDE, 2020
  5. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Ma & Hovy. ACL, 2016
Workshop
  1. Can images help recognize entities: A study of the role of images for Multimodal NER. Chen et al. EMNLP workshop, 2021 [LSTM-CNNs-CRF-code]

Information Extraction & Multimodal IE

  1. Multimodal Relation Extraction with Efficient Graph Alignment. Zheng et al. ACM MM, 2021
  2. Modeling Dense Cross-Modal Interactions for Joint Entity-Relation Extraction. Zhao et al. IJCAI, 2020
Tutorial
  1. Multi-Modal Information Extraction from Text, Semi-Structured, and Tabular Data on the Web. Dong et al. SIGKDD Tutorial, 2020

Multimodal MT

  1. A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation. Yin et al. ACL, 2020
  2. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. Specia et al. WMT, 2016

Multimodal KG

  1. Multi-modal Siamese Network for Entity Alignment. Chen et al. SIGKDD, 2022
  2. Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion. Chen et al. SIGIR, 2022
  3. Multi-Modal Knowledge Graph Construction and Application: A Survey. Zhu et al. arXiv, 2022

Multimodal IR, RecSys, and Ads

  1. Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval. ICLR under review, 2023
  2. Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation. Han et al. TheWebConf, 2022
  3. CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval. Yu et al. SIGKDD, 2022
  4. Automatic Generation of Product-Image Sequence in E-commerce. Fan et al. SIGKDD, 2022
    leverage textual review feedback as additional training target, and utilize product textual description to provide extra semantic information
  5. ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. Baltescu et al. SIGKDD, 2022
    Transformer to aggregate image & text modalities to learn product representations
  6. PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest. Pal et al. SIGKDD, 2020
  7. Combo-Attention Network for Baidu Video Advertising. Yu et al. SIGKDD, 2020
    combo-attention module (CAM): exploits cross-modal attentions besides self-attentions to effectively capture the relevance between words (search query) and bounding boxes (short video)
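The cross-modal attention pattern noted for CAM above (query words attending over video bounding boxes) reduces to scaled dot-product attention across modalities. A minimal sketch under assumed shapes; the feature dimensions, the random features, and the cosine-based relevance score are illustrative, not the paper's actual scoring function:

```python
import numpy as np

def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention: one modality attends to the other.

    queries:      (n_q, d)  e.g. search-query word embeddings
    keys_values:  (n_k, d)  e.g. bounding-box features from a short video
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)         # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over boxes
    return weights @ keys_values                          # (n_q, d)

rng = np.random.default_rng(1)
words = rng.normal(size=(4, 32))    # 4 query words
boxes = rng.normal(size=(10, 32))   # 10 detected bounding boxes

attended = cross_modal_attention(words, boxes)
# One simple way to turn this into a relevance signal: cosine similarity
# between each word and its visually attended representation, averaged.
cos = np.sum(words * attended, axis=-1) / (
    np.linalg.norm(words, axis=-1) * np.linalg.norm(attended, axis=-1))
relevance = cos.mean()
```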
Cross-Modal Hashing Retrieval
  1. Vulnerability vs. Reliability Disentangled Adversarial Examples for Cross-Modal Learning. Li et al. SIGKDD, 2020
    disentangles features into modality-unrelated and modality-related components; learns cross-modal correlations by exploiting the modality-related component

Multimodal Sentiment/ Emotion

  1. “I Have No Text in My Post”: Using Visual Hints to Model User Emotions in Social Media. Song et al. TheWebConf, 2022
  2. Multimodal Transformer for Unaligned Multimodal Language Sequences. Tsai et al. ACL, 2019
    crossmodal attention, crossmodal Transformer, multimodal Transformer
  3. Adapting BERT for Target-Oriented Multimodal Sentiment Classification. Yu & Jiang. IJCAI, 2019

Vision+Language (VQA, NLVR, VCSR, I-T Retri., V Entail., REC)

  1. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. Li et al. EMNLP, 2022 [mPLUG-code]
    cross-modal skip-connections: inter-layer shortcuts that skip a number of time-consuming full self-attention layers on the vision side
  2. An Empirical Study of Training End-to-End Vision-and-Language Transformers. Dou et al. CVPR, 2022
    co-attention; merged-attention
  3. UNITER: UNiversal Image-TExt Representation Learning. Chen et al. ECCV, 2020
    Optimal Transport for fine-grained alignment between words and image regions
  4. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Li et al. ECCV, 2020
    object tags as anchor points to learn alignment
  5. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. Su et al. ICLR, 2020
  6. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. Li et al. AAAI, 2020
  7. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Lu et al. NeurIPS, 2019
  8. Deep Modular Co-Attention Networks for Visual Question Answering. Yu et al. CVPR, 2019
    Self Attn (word-to-word, region-to-region); Guided Attn (word-to-region)
  9. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. Tan & Bansal. EMNLP, 2019
    three encoders: object relationship, language, and cross-modality
  10. VisualBERT: A Simple and Performant Baseline for Vision and Language. Li et al. arXiv, 2019
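Dou et al. above contrast co-attention (separate streams that cross-attend) with merged-attention (one self-attention over the concatenated modalities, as in single-stream models like VisualBERT). A minimal sketch of merged-attention, assuming illustrative shapes and omitting the learned Q/K/V projections of a real Transformer layer:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention (no projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(2)
text_tokens = rng.normal(size=(5, 16))     # 5 word embeddings
image_regions = rng.normal(size=(3, 16))   # 3 region features

# Merged-attention: concatenate both modalities and run one
# self-attention, so every word can attend to every region (and
# vice versa) inside a single module.
merged = self_attention(np.concatenate([text_tokens, image_regions], axis=0))
text_out, image_out = merged[:5], merged[5:]
```

Co-attention would instead keep the two sequences separate and cross-attend between them, as in the ViLBERT/LXMERT two-stream designs listed above.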

Multimodal Learning

  1. Attentional Context Alignment for Multimodal Sequential Learning. ICLR under review, 2023
  2. On Uni-modal Feature Learning in Multi-modal Learning. ICLR under review, 2023
    decomposes features of multi-modal data into uni-modal features and paired features; although multi-modal joint training enables cross-modal interaction to learn paired features, the model easily saturates and ignores uni-modal features that are harder to learn but also important for generalization
  3. Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. Liang et al. arXiv, 2022
  4. Multimodal Learning with Transformers: A Survey. Xu et al. arXiv, 2022
NER related
  1. Show and Write: Entity-aware Article Generation with Image Information. ICLR under review, 2023
    produces an article conditioned on article metadata and on image information, such as extracted named entities, which provides important context
  2. An End-to-End Progressive Multi-Task Learning Framework for Medical Named Entity Recognition and Normalization. Zhou et al. ACL, 2021
  3. Linguistically-Enriched and Context-Aware Zero-shot Slot Filling. Siddique et al. TheWebConf, 2021
    NER cues for zero-shot slot filling in task-oriented dialog systems
  4. A Unified MRC Framework for Named Entity Recognition. Li et al. ACL, 2020
    NER as machine reading comprehension (MRC)
Multi-Modal Sarcasm, Fake News, Hate Speech
  1. Multimodal Hate Speech Detection via Cross-Domain Knowledge Transfer. Yang et al. ACM MM, 2022
  2. Cross-modal Ambiguity Learning for Multimodal Fake News Detection. Chen et al. TheWebConf, 2022
    Track: Web Mining and Content Analysis
  3. Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network. Liang et al. ACL, 2022
    cross-modal graph; cross-modal graph convolutional network
  4. Multi-Modal Sarcasm Detection with Interactive In-Modal and Cross-Modal Graphs. Liang et al. ACM MM, 2021
Tutorial
  1. Multi-modal Network Representation Learning. Zhang et al. SIGKDD Tutorial, 2020

Please do not hesitate to inform us of any missing works.

[BackToHome]