
Multimodal disaster-tweet classification: what I found

April 2, 2026 · nlp · llm · multimodal · research

Summary

Fine-tuned open-source multimodal LLMs can beat both the strongest proprietary zero-shot model tested (GPT-4o) and the prior best fine-tuned CLIP baselines on disaster-related tweet classification. On CrisisMMD, LLaMA 3.2 11B with LoRA reaches state-of-the-art F1 while training less than 1% of the model's ~11B parameters. A fine-tuned 3B text-only model also outperforms the previous best on both tasks, which makes it a genuinely deployable option for edge devices.

What we did

We evaluated three multimodal LLMs (GPT-4o, GPT-4o mini, LLaMA 3.2 11B) on the CrisisMMD dataset under zero-shot, one-shot, and (for LLaMA) LoRA fine-tuned settings. The two tasks are:

  • Informativeness: is this tweet usefully crisis-related, or not?
  • Humanitarian category: what kind of information is in it (affected individuals, infrastructure damage, rescue efforts, etc.)?

CrisisMMD spans seven real 2017 events including California Wildfires, Hurricanes Harvey/Irma/Maria, and the Mexico earthquake.

Main findings

  1. Zero-shot multimodal LLMs are already solid, and GPT-4o mini wins the cost/quality tradeoff. Across most zero-shot settings, GPT-4o mini beat GPT-4o while costing significantly less. With prompt engineering, GPT-4o's informativeness F1 on text+image rose from 76.90 (as reported in prior work) to 87.71.
  2. One-shot and five-shot prompting did not consistently help. For LLaMA 3.2 11B specifically, one-shot with multiple images actually hurt performance. That aligns with a known limitation the Meta LLaMA team has flagged: the 11B vision model is not reliable with multiple images at inference time.
  3. LoRA fine-tuning is the hero result. LLaMA 3.2 11B + LoRA reached F1 94.77 on informativeness (text+image) and 91.62 on humanitarian (text+image), surpassing both the CLIP baseline (93.13 and 90.04) and every zero-shot model tested. LoRA touched less than 1% of the ~11B parameters.
  4. Small text-only models can punch above their weight. LLaMA 3.2 3B (text-only) fine-tuned with LoRA beat the CLIP baseline on both tasks (91.73 vs 85.99 on informativeness, 83.66 vs 80.70 on humanitarian). Notably, full fine-tuning of the 1B model gave worse results than its LoRA-tuned counterpart.

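To make the "less than 1% of parameters" point concrete, here's a toy numpy sketch of what LoRA actually trains: a low-rank update B·A added on top of a frozen weight matrix. The hidden size, rank, and scaling factor below are illustrative, not the paper's exact configuration:

```python
import numpy as np

d, r, alpha = 4096, 8, 16          # hidden size, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection; zero-init so
                                         # the adapted layer starts identical to
                                         # the frozen base layer

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Adapted layer: h = W x + (alpha / r) * B (A x).
    # Only A and B receive gradients during fine-tuning; W never changes.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W @ x)   # zero-init => no change at step 0

trainable = A.size + B.size                  # 2 * d * r parameters
frozen = W.size                              # d * d parameters
print(f"trainable fraction: {trainable / frozen:.4%}")  # well under 1% per layer
```

At d = 4096 and r = 8, the adapter adds 2·d·r = 65,536 trainable parameters against d² ≈ 16.8M frozen ones per layer, which is the arithmetic behind the sub-1% figure.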
Why it matters

Disaster-response teams need to triage social-media posts fast and cheaply during an event. The combination of (a) open-source models, (b) parameter-efficient fine-tuning that runs on modest hardware, and (c) accuracy that matches or exceeds the previous best is a deployable package, not just a benchmark delta. For agencies that can't route every tweet through a third-party API, a LoRA-tuned LLaMA running on their own infrastructure is a practical path.

My take

The result that stuck with me isn't the 11B headline number. It's that a 3B text-only model, fine-tuned with LoRA, beats the fine-tuned CLIP baseline. Edge-device inference for crisis triage stops being aspirational when the model weights fit in a few GB and the accuracy holds. The obvious next question is how robust these fine-tuned models are to disaster types not in CrisisMMD's seven events, because the real test of a deployable system is the one it hasn't seen before.

Read the paper

Multimodal Disaster-Related Tweet Classification with Parameter-Efficient Fine-Tuning of Large Language Models. Guo, Tran, Xiao, Li, Caragea. ASONAM 2025 Proceedings, Springer, pp. 413-428.

pdf · code
