ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription


This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription". The paper was accepted with the highest reviewer scores (4/4/4).

Demo video: demo.mp4

Table of Contents

  • Introduction
  • Key Features
  • Installation
    • Install from PyPI (Recommended)
    • Install from source
  • Pretrained Models
  • Usage
    • Feature Extraction
    • Python API
    • Command Line
  • Training
  • Citation
  • Acknowledgments

Introduction

ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a chunk-wise processing mechanism with relative right context and employs the Masked Batch technique to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
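As a rough sketch of the idea (not the library's internal implementation), chunk-wise processing splits the frame sequence into fixed-size chunks, each of which attends to a bounded left and right context. The helper below and its sizes are illustrative only:

```python
# Illustrative sketch of chunk-wise processing with bounded left/right
# context. The helper name and sizes are hypothetical, not ChunkFormer's API.

def chunk_indices(num_frames, chunk_size, left_context, right_context):
    """Yield (ctx_start, chunk_start, chunk_end, ctx_end) windows:
    each fixed-size chunk plus the context frames it may attend to."""
    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        ctx_start = max(0, chunk_start - left_context)
        ctx_end = min(num_frames, chunk_end + right_context)
        yield (ctx_start, chunk_start, chunk_end, ctx_end)

# A 300-frame input with chunk_size=128, left_context=64, right_context=32:
for window in chunk_indices(300, 128, 64, 32):
    print(window)
```

Because each chunk only ever sees a bounded window, peak memory stays constant regardless of total audio length, which is what makes decoding very long recordings feasible.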

Key Features

  • Transcribing Extremely Long Audio: ChunkFormer can transcribe audio recordings up to 16 hours long with accuracy comparable to existing models. To our knowledge, it is the first model capable of handling audio of this duration.
  • Efficient Decoding on Low-Memory GPUs: ChunkFormer handles long-form transcription on GPUs with limited memory without losing context or mismatching the training configuration.
  • Masked Batching Technique: ChunkFormer removes the need for padding in batches with highly variable lengths. For instance, decoding a batch containing a 1-hour clip and a 1-second clip costs only 1 hour + 1 second of compute and memory, instead of 2 hours with padding.
GPU Memory    Total Batch Duration (minutes)
80GB          980
24GB          240
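To make the padding savings concrete, here is a back-of-the-envelope comparison under the simplifying assumption that compute and memory scale linearly with audio length:

```python
# Hypothetical batch: one 1-hour clip and one 1-second clip (in seconds).
durations = [3600, 1]

# Conventional padded batching processes every clip at the longest length.
padded_cost = len(durations) * max(durations)  # 2 * 3600 = 7200 seconds

# Masked batching processes only the real (unpadded) audio.
masked_cost = sum(durations)                   # 3600 + 1 = 3601 seconds

print(f"padded: {padded_cost}s, masked: {masked_cost}s, "
      f"saved: {padded_cost - masked_cost}s")
```

The more skewed the length distribution within a batch, the larger the savings.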

Installation

Option 1: Install from PyPI (Recommended)

pip install chunkformer

Option 2: Install from source

# Clone the repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .

Pretrained Models

Pretrained models are available for Vietnamese and English (e.g., khanhld/chunkformer-ctc-large-vie on Hugging Face).

Usage

Feature Extraction

from chunkformer import ChunkFormerModel
import torch

device = "cuda:0"

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)
x, x_len = model._load_audio_and_extract_features("path/to/audio") # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)

# Extract feature
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
)

print("feature: ", feature.shape)
print("feature_len: ", feature_len)

Python API

Classification

ChunkFormer also supports speech classification tasks (e.g., gender, dialect, emotion, age recognition).

from chunkformer import ChunkFormerModel

# Load a pre-trained classification model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("path/to/classification/model")

# Single audio classification
result = model.classify_audio(
    audio_path="path/to/audio.wav",
    chunk_size=-1,  # -1 for full attention
    left_context_size=-1,
    right_context_size=-1,
)

print(result)

Transcription

from chunkformer import ChunkFormerModel

# Load a pre-trained encoder from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True,
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800,  # total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")

Command Line

Long-Form Audio Transcription

To transcribe a single long-form audio file (accepted extensions: ".mp3", ".wav", ".flac", ".m4a", ".aac"):

chunkformer-decode \
--model_checkpoint path/to/hf/checkpoint/repo \
--audio_file path/to/audio.wav \
--total_batch_duration 14400 \
--chunk_size 64 \
--left_context_size 128 \
--right_context_size 128

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
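If you need the segments programmatically, the timestamped output above can be parsed with a small regex. The format is taken from the example output; the helper itself is illustrative:

```python
import re

# Matches lines like "[00:00:01.200] - [00:00:02.400]: some text".
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.+)"
)

def parse_segments(text):
    """Return (start, end, transcript) tuples from timestamped output."""
    return [m.groups() for line in text.splitlines()
            if (m := LINE_RE.fullmatch(line.strip()))]

output = """[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio"""
print(parse_segments(output))
```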

Batch Audio Transcription

The data.tsv file must have at least one column named wav. Optionally, a column named txt can be included to compute the Word Error Rate (WER). Output will be saved to the same file.
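As an illustration, a minimal data.tsv with the required wav column (and the optional txt column for WER) can be written with the standard library; the paths and transcripts below are placeholders:

```python
import csv

# Placeholder rows: "wav" is required, "txt" is optional (enables WER).
rows = [
    {"wav": "audio1.wav", "txt": "first reference transcript"},
    {"wav": "audio2.wav", "txt": "second reference transcript"},
]

with open("data.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wav", "txt"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```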

chunkformer-decode \
--model_checkpoint path/to/hf/checkpoint/repo \
--audio_list path/to/data.tsv \
--total_batch_duration 14400 \
--chunk_size 64 \
--left_context_size 128 \
--right_context_size 128

Example Output:

WER: 0.1234
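For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation of that definition (illustrative, not the CLI's internal code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("a") and one substitution ("example" -> "sample")
# against a 5-word reference gives 2/5 = 0.4.
print(wer("this is a transcription example", "this is transcription sample"))
```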

Classification

To classify a single audio file:

chunkformer-decode \
--model_checkpoint path/to/classification/model \
--audio_file path/to/audio.wav

Training

See Training Guide for complete documentation.

Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}

Acknowledgments

This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.

