Skip to main content

SoccerChat

A Multimodal Video-Text Dataset for Natural Language Soccer Game Understanding

PublicationGitHub repoLast updated Jun 2026

SoccerChat is a multimodal dataset for video–language understanding in the context of soccer match analysis.
It enables training and evaluation of large vision–language models (VLMs) for event detection, temporal reasoning, and natural language generation over real-world broadcast video clips.

Introduced in the paper:
📄 "SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding"
📰 arXiv preprint, May 2025
Paper (arXiv:2505.16630)
GitHub Project Page
Trained Model (Qwen2-VL-7B)
Web Demo (Colab)


⚽ Dataset Summary

Component Details
Total Examples 89,000 (train: 85,220 / validation: 4,625)
Modality Video + Text
Tasks Event Detection • Video Question Answering • Text Generation
Languages English
Video Format Short broadcast snippets (~5–15 seconds)
Total Size ~48 GB (videos)
Annotation Fields Video clip, natural language query, model response, and event tags
License Research use (CC BY-NC 4.0)

Each example includes:

  • 🎞️ video — soccer match video snippet
  • 💬 query — natural language question or prompt
  • 🧠 response — generated or annotated answer
  • events — list of SoccerNet event tags (e.g., Goal, Card, Foul)
  • 📂 path — relative file path within /videos/

📁 Dataset Structure

Split Examples Size
train 85,220 36.7 MB (metadata only)
validation 4,625 1.47 MB (metadata only)

Videos must be downloaded separately (see below).


💾 Download Instructions

Clone from Hugging Face using Git LFS:

git lfs install
git clone https://huggingface.co/datasets/SimulaMet/SoccerChat

📦 Videos are stored under SoccerChat/videos/ (~48 GB total)


🧮 Data Fields

Field Type Description
video Video Video snippet of soccer event
query string Natural language question
response string Natural language answer
events list[string] Associated SoccerNet event types
path string Relative path to video file

🔄 Convert to JSONL (for MS-Swift or other VLMs)

import os, json
from datasets import load_dataset
import pandas as pd

base = "/content/SoccerChat/videos"
ds = load_dataset("SimulaMet/SoccerChat")

for split, out_file in [("train", "SoccerChat_train.jsonl"), ("validation", "SoccerChat_valid.jsonl")]:
    df = ds[split].to_pandas()
    df["query"] = "<video>" + df["query"]
    df["videos"] = df["path"].apply(lambda p: [os.path.join(base, os.path.basename(p))])
    df[["query", "response", "videos"]].to_json(out_file, orient="records", lines=True)

🧠 Training & Evaluation (Example with MS-Swift)

🏋️ Training Example (Qwen2-VL-7B)

NFRAMES=24 MAX_PIXELS=100352 NPROC_PER_NODE=4 swift sft   --model_type qwen2-vl-7b-instruct   --model_id_or_path qwen/Qwen2-VL-7B-Instruct   --sft_type lora   --dataset SoccerChat_train.jsonl   --num_train_epochs 5   --batch_size 14   --deepspeed default-zero2   --eval_steps 100   --dataset_test_ratio 0.05

📊 Evaluation

NFRAMES=24 MAX_PIXELS=100352 swift infer   --ckpt_dir checkpoint-dir   --load_dataset_config true   --merge_lora true   --val_dataset SoccerChat_valid.jsonl

📜 Terms of Use

Released under CC BY-NC 4.0 — for non-commercial research and educational purposes only.


🧾 Citation

If you use this dataset, please cite:

@article{Gautam2025May,
  author = {Gautam, Sushant and Midoglu, Cise and Thambawita, Vajira and Riegler, Michael A. and Halvorsen, P{a}l and Shah, Mubarak},
  title = {{SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding}},
  journal = {arXiv},
  year = {2025},
  month = may,
  eprint = {2505.16630},
  doi = {10.48550/arXiv.2505.16630}
}

📬 Contact

For any queries or collaborations, please contact:
📧 sushant@simula.no
🌐 sushant.info.np