Benchmarks
2024 MacBook Pro, 48 GB RAM, M4 Pro, macOS Tahoe 26.0
Transcription
https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml
Language | WER% | CER% | RTFx | Duration | Processed | Skipped
-----------------------------------------------------------------------------------------
Bulgarian (Bulgaria) | 12.8 | 4.1 | 195.2 | 3468.0s | 350 | -
Croatian (Croatia) | 14.0 | 4.3 | 204.9 | 3647.0s | 350 | -
Czech (Czechia) | 12.0 | 3.8 | 214.2 | 4247.4s | 350 | -
Danish (Denmark) | 20.2 | 7.4 | 214.4 | 10579.1s | 930 | -
Dutch (Netherlands) | 7.8 | 2.6 | 191.7 | 3337.7s | 350 | -
English (US) | 5.4 | 2.5 | 207.4 | 3442.9s | 350 | -
Estonian (Estonia) | 20.1 | 4.2 | 225.3 | 10825.4s | 893 | -
Finnish (Finland) | 14.8 | 3.1 | 222.0 | 11894.4s | 918 | -
French (France) | 5.9 | 2.2 | 199.9 | 3667.3s | 350 | -
German (Germany) | 5.9 | 1.9 | 220.9 | 4684.6s | 350 | -
Greek (Greece) | 36.9 | 13.7 | 183.0 | 6862.0s | 650 | -
Hungarian (Hungary) | 17.6 | 5.2 | 213.6 | 11050.9s | 905 | -
Italian (Italy) | 4.0 | 1.3 | 236.7 | 5098.7s | 350 | -
Latvian (Latvia) | 27.1 | 7.5 | 217.8 | 10218.6s | 851 | -
Lithuanian (Lithuania) | 25.0 | 6.8 | 202.8 | 10686.5s | 986 | -
Maltese (Malta) | 25.2 | 9.3 | 217.4 | 12770.6s | 926 | -
Polish (Poland) | 8.6 | 2.8 | 190.2 | 3409.6s | 350 | -
Romanian (Romania) | 14.4 | 4.7 | 200.4 | 9099.4s | 883 | -
Russian (Russia) | 7.2 | 2.2 | 209.7 | 3974.6s | 350 | -
Slovak (Slovakia) | 12.6 | 4.4 | 227.6 | 4169.6s | 350 | -
Slovenian (Slovenia) | 27.4 | 9.2 | 197.1 | 8173.1s | 834 | -
Spanish (Spain) | 4.5 | 2.2 | 221.7 | 4258.9s | 350 | -
Swedish (Sweden) | 16.8 | 5.0 | 219.5 | 8399.2s | 759 | -
Ukrainian (Ukraine) | 7.2 | 2.5 | 201.9 | 3853.7s | 350 | -
-----------------------------------------------------------------------------------------
AVERAGE | 14.7 | 4.7 | 209.8 | 161819.2 | 14085 | -
Dataset: librispeech test-clean
Files processed: 2620
Average WER: 2.5%
Median WER: 0.0%
Average CER: 1.0%
Median RTFx: 139.6x
Overall RTFx: 155.6x (19452.5s / 125.0s)
swift run fluidaudiocli asr-benchmark --max-files all --model-version v2
Use v2 if you only need English; it is slightly more accurate.
--- Benchmark Results ---
Dataset: librispeech test-clean
Files processed: 2620
Average WER: 2.1%
Median WER: 0.0%
Average CER: 0.7%
Median RTFx: 128.6x
Overall RTFx: 145.8x (19452.5s / 133.4s)
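As a quick sanity check on how the overall RTFx figures above relate to the totals in parentheses, here is a minimal sketch. It assumes RTFx is simply total audio duration divided by total processing time, which is what the logged numbers suggest; the function name `rtfx` is ours and is not part of the CLI.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Totals copied from the two LibriSpeech test-clean runs above.
first_run = rtfx(19452.5, 125.0)
v2_run = rtfx(19452.5, 133.4)

print(f"{first_run:.1f}x")  # 155.6x
print(f"{v2_run:.1f}x")     # 145.8x
```

An RTFx above 1 means the pipeline runs faster than real time; at ~150x, an hour of audio takes roughly 24 seconds to transcribe.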
ASR Model Compilation
Core ML first-load compile times captured on iPhone 16 Pro Max and iPhone 13 running the parakeet-tdt-0.6b-v3-coreml bundle. Cold-start compilation happens the first time each Core ML model is loaded; subsequent loads hit the cached binaries. Warm compile metrics were collected only on the iPhone 16 Pro Max run, and only for models that were reloaded during the session.
| Model | iPhone 16 Pro Max cold (ms) | iPhone 16 Pro Max warm (ms) | iPhone 13 cold (ms) | Compute units |
|---|---|---|---|---|
| Preprocessor | 9.15 | - | 632.63 | MLComputeUnits(rawValue: 2) |
| Encoder | 3361.23 | 162.05 | 4396.00 | MLComputeUnits(rawValue: 1) |
| Decoder | 88.49 | 8.11 | 146.01 | MLComputeUnits(rawValue: 1) |
| JointDecision | 48.46 | 7.97 | 71.85 | MLComputeUnits(rawValue: 1) |
Transcription with Keyword Boosting
A CTC-based custom vocabulary boosting system enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model.
swift run fluidaudiocli ctc-earnings-benchmark --auto-download
# Run the benchmark
swift run fluidaudiocli ctc-earnings-benchmark
Earnings Benchmark (TDT transcription + CTC keyword spotting)
Data directory: /Users/<user>/Library/Application Support/FluidAudio/earnings22-kws/test-dataset
Output file: ctc_earnings_benchmark.json
TDT version: v2
CTC model: /Users/<user>/Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loading TDT models (v2) for transcription...
TDT models loaded successfully
Loading CTC models from: /Users/<user>/Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loaded CTC vocabulary with 1024 tokens, variant: Parakeet CTC 110M (hybrid)
Created CTC spotter with blankId=1024
Processing 773 test files...
[ 1/772] 4329526_chunk0 WER: 10.3% Dict: 1/1
[ 2/772] 4329526_chunk109 WER: 12.5% Dict: 2/2
[ 3/772] 4329526_chunk118 WER: 3.1% Dict: 3/3
[ 4/772] 4329526_chunk132 WER: 8.1% Dict: 1/1
[ 5/772] 4329526_chunk135 WER: 25.7% Dict: 1/1
[ 6/772] 4329526_chunk16 WER: 8.6% Dict: 1/1
...
[767/772] 4485206_chunk_86 WER: 5.0% Dict: 2/2
[768/772] 4485206_chunk_88 WER: 8.3% Dict: 2/2
[769/772] 4485206_chunk_92 WER: 14.7% Dict: 4/4
[770/772] 4485206_chunk_97 WER: 30.5% Dict: 1/1
[771/772] 4485206_chunk_98 WER: 18.6% Dict: 4/4
[772/772] 4485206_chunk_99 WER: 22.0% Dict: 1/1
============================================================
EARNINGS22 BENCHMARK (TDT + CTC)
============================================================
Model: /Users/<user>/Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Total tests: 771
Average WER: 15.00%
Dict Pass (Recall): 1299/1308 (99.3%)
Vocab Precision: 99.3% (TP=1068, FP=8)
Vocab Recall: 85.2% (TP=1068, FN=185)
Vocab F-score: 91.7%
Total audio: 11564.5s
Total processing: 182.5s
RTFx: 63.36x
============================================================
Results written to: ctc_earnings_benchmark.json
In the context of vocabulary/keyword detection:
| Metric | Definition |
|---|---|
| TP (True Positive) | Word is in reference AND in hypothesis (correctly detected) |
| FP (False Positive) | Word is in hypothesis but NOT in reference (hallucinated/wrong) |
| FN (False Negative) | Word is in reference but NOT in hypothesis (missed) |
Derived metrics:
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | "Of words we output, how many were correct?" |
| Recall | TP / (TP + FN) | "Of words that should appear, how many did we find?" |
| F-Score | 2 x P x R / (P + R) | Harmonic mean of precision and recall |
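To make the formulas concrete, the sketch below plugs in the TP/FP/FN counts from the Earnings22 summary above. The helper name `precision_recall_f1` is illustrative only and does not come from FluidAudio.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F-score from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the benchmark summary: TP=1068, FP=8, FN=185.
p, r, f = precision_recall_f1(tp=1068, fp=8, fn=185)
print(f"Precision {p:.1%}, Recall {r:.1%}, F-score {f:.1%}")
# Precision 99.3%, Recall 85.2%, F-score 91.7%
```

The gap between precision and recall shows that the spotter rarely outputs a wrong vocabulary term, but still misses some terms that should appear.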
Text-to-Speech
We synthesized the same set of strings, producing audio from 1s to ~300s, to test speed across a range of input lengths on the PyTorch CPU, MPS, and MLX pipelines, and compared them against the native Swift version with Core ML models.
Each pipeline warmed up the models by running them once with pseudo inputs; we then compared raw inference time with the model already loaded. For the Core ML model, we traded a longer initial warm-up for lower memory use and very slightly faster inference.
Note that the PyTorch Kokoro model has a memory leak issue: hexgrad/kokoro#152
The following tests were run on an M4 Pro MacBook Pro with 48 GB RAM. If you have another device, please try replicating them as well!
Kokoro-82M PyTorch (CPU)
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.187 14.737x 1.44
2 129 8.625 0.530 16.264x 1.85
3 254 15.525 0.923 16.814x 2.65
4 93 6.125 0.349 17.566x 2.66
5 104 7.200 0.410 17.567x 2.70
6 130 9.300 0.504 18.443x 2.72
7 197 12.850 0.726 17.711x 2.83
8 6 1.350 0.098 13.823x 2.83
9 1228 76.200 4.342 17.551x 3.19
10 567 35.200 2.069 17.014x 4.85
11 4615 286.525 17.041 16.814x 4.78
Total - 461.650 27.177 16.987x 4.85
Kokoro-82M PyTorch (MPS)
I wasn't able to run the MPS model for longer durations; even with PYTORCH_ENABLE_MPS_FALLBACK=1 set, it kept crashing on the longer strings.
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.414 6.649x 1.41
2 129 8.625 0.729 11.839x 1.54
Total - 11.375 1.142 9.960x 1.54
Kokoro-82M MLX Pipeline
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.347 7.932x 1.12
2 129 8.650 0.597 14.497x 2.47
3 254 15.525 0.825 18.829x 2.65
4 93 6.125 0.306 20.039x 2.65
5 104 7.200 0.343 21.001x 2.65
6 130 9.300 0.560 16.611x 2.65
7 197 12.850 0.596 21.573x 2.65
8 6 1.350 0.364 3.706x 2.65
9 1228 76.200 2.979 25.583x 3.29
10 567 35.200 1.374 25.615x 3.37
11 4615 286.500 11.112 25.783x 3.37
Total - 461.650 19.401 23.796x 3.37
Swift + Fluid Audio Core ML models
Note that compiling the model takes ~15s on the first run; subsequent runs are shorter, and we expect ~2s to load.
...
FluidAudio TTS benchmark for voice af_heart (warm-up took an extra 2.348s)
Test Chars Ouput (s) Inf(s) RTFx
1 42 2.825 0.440 6.424x
2 129 7.725 0.594 13.014x
3 254 13.400 0.776 17.278x
4 93 5.875 0.587 10.005x
5 104 6.675 0.613 10.889x
6 130 8.075 0.621 13.008x
7 197 10.650 0.627 16.983x
8 6 0.825 0.360 2.290x
9 1228 67.625 2.362 28.625x
10 567 33.025 1.341 24.619x
11 4269 247.600 9.087 27.248x
Total - 404.300 17.408 23.225
Peak memory usage (process-wide): 1.503 GB
Voice Activity Detection
Quality is nearly identical to the base model. Performance-wise, the 256ms batch model (8 chunks of 32ms) is up to ~3.5x faster than the Silero PyTorch VAD model.
Dataset: https://github.com/Lab41/VOiCES-subset
swift run fluidaudiocli vad-benchmark --dataset voices-subset --all-files --threshold 0.85
...
Timing Statistics:
[18:56:31.208] [INFO] [VAD] Total processing time: 0.29s
[18:56:31.208] [INFO] [VAD] Total audio duration: 351.05s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Audio loading time: 0.00s (0.6%)
[18:56:31.208] [INFO] [VAD] VAD inference time: 0.28s (98.7%)
[18:56:31.208] [INFO] [VAD] Average per file: 0.011s
[18:56:31.208] [INFO] [VAD] Min per file: 0.001s
[18:56:31.208] [INFO] [VAD] Max per file: 0.020s
[18:56:31.208] [INFO] [VAD]
VAD Benchmark Results:
[18:56:31.208] [INFO] [VAD] Accuracy: 96.0%
[18:56:31.208] [INFO] [VAD] Precision: 100.0%
[18:56:31.208] [INFO] [VAD] Recall: 95.8%
[18:56:31.208] [INFO] [VAD] F1-Score: 97.9%
[18:56:31.208] [INFO] [VAD] Total Time: 0.29s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Files Processed: 25
[18:56:31.208] [INFO] [VAD] Avg Time per File: 0.011s
swift run fluidaudiocli vad-benchmark --dataset musan-full --num-files all --threshold 0.8
...
[23:02:35.539] [INFO] [VAD] Total processing time: 322.31s
[23:02:35.539] [INFO] [VAD] Timing Statistics:
[23:02:35.539] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.539] [INFO] [VAD] Audio loading time: 1.20s (0.4%)
[23:02:35.539] [INFO] [VAD] VAD inference time: 319.57s (99.1%)
[23:02:35.539] [INFO] [VAD] Average per file: 0.160s
[23:02:35.539] [INFO] [VAD] Total audio duration: 393442.58s
[23:02:35.539] [INFO] [VAD] Min per file: 0.000s
[23:02:35.539] [INFO] [VAD] Max per file: 0.873s
[23:02:35.711] [INFO] [VAD] VAD Benchmark Results:
[23:02:35.711] [INFO] [VAD] Accuracy: 94.2%
[23:02:35.711] [INFO] [VAD] Precision: 92.6%
[23:02:35.711] [INFO] [VAD] Recall: 78.9%
[23:02:35.711] [INFO] [VAD] F1-Score: 85.2%
[23:02:35.711] [INFO] [VAD] Total Time: 322.31s
[23:02:35.711] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.711] [INFO] [VAD] Files Processed: 2016
[23:02:35.711] [INFO] [VAD] Avg Time per File: 0.160s
[23:02:35.744] [INFO] [VAD] Results saved to: vad_benchmark_results.json
Qwen3-ASR (Beta / In Progress)
Encoder-decoder ASR using Qwen3-ASR-0.6B converted to CoreML. Autoregressive generation with KV-cache.
Note: WER/CER may be higher than the original PyTorch model due to CoreML conversion limitations. See FLEURS results below for full multilingual benchmarks.
Model: FluidInference/qwen3-asr-0.6b-coreml (f32 variant)
Hardware: Apple M2, 2022, macOS 26
LibriSpeech test-clean (2620 files)
| Metric | Value |
|---|---|
| WER (Avg) | 4.4% |
| WER (Median) | 0.0% |
| RTFx | 2.8x |
| Per-token | ~75ms |
AISHELL-1 Chinese (6920 files, 9.7h audio)
| Metric | Value |
|---|---|
| CER (Avg) | 6.6% |
| WER (Avg) | 10.3% |
| Median RTFx | 4.6x |
| Overall RTFx | 3.8x |
| Processing Time | 2.6h |
Methodology notes:
- CER (Character Error Rate) is the primary metric for Chinese ASR, as per the Qwen3-ASR Technical Report: "We use CER for character-based languages (e.g., Mandarin Chinese, Cantonese, and Korean) and WER for word-delimited languages"
- WER calculation uses Apple's NLTokenizer for Chinese word segmentation; we were unable to verify how the official Qwen3-ASR evaluation performs tokenization
- Official Qwen3-ASR reports 3.15% on AISHELL-2 (different dataset) per the HuggingFace model card; our 6.6% CER on AISHELL-1 suggests some accuracy loss in CoreML conversion
- Why AISHELL-1? AISHELL-2 (1000h) requires an application with institutional affiliation and is restricted to non-commercial use. AISHELL-1 (178h) is openly available under Apache 2.0.
- Dataset: AudioLLMs/aishell_1_zh_test
swift run -c release fluidaudiocli qwen3-benchmark --dataset aishell
FLEURS Multilingual (30 languages, ~70h audio)
Full benchmark across all 30 languages supported by Qwen3-ASR, matching the official FLEURS tiers.
Which metric to use:
- CER for character-based languages (Chinese, Japanese, Korean, Thai, Vietnamese, Cantonese) - WER is meaningless due to word segmentation differences
- WER for word-delimited languages (European, Arabic, etc.)
Results by FLEURS Tier
| Tier | Languages | Our CER | Official 0.6B WER |
|---|---|---|---|
| FLEURS (12 core) | en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru | 10.3% | 10.0% |
| FLEURS+ (8 add) | hi, id, ms, nl, pl, th, tr, vi | 20.9% | 31.9% |
| FLEURS++ (10 hardest) | cs, da, el, fa, fi, fil, hu, mk, ro, sv | 41.0% | 47.8% |
Note: Official Qwen3-ASR reports WER, but for CJK languages this includes word segmentation artifacts. Our CER comparison shows CoreML conversion has minimal accuracy loss on core languages.
Full Results (sorted by CER)
| Language | RTFx | Avg CER | Med CER | Avg WER | Med WER | Use |
|---|---|---|---|---|---|---|
| en_us | 1.16x | 4.0% | 2.3% | 7.3% | 5.3% | WER |
| es_419 | 2.04x | 4.9% | 3.0% | 10.5% | 8.1% | WER |
| it_it | 3.46x | 5.1% | 2.8% | 12.4% | 10.0% | WER |
| ru_ru | 1.84x | 6.9% | 4.6% | 18.0% | 15.6% | WER |
| de_de | 1.22x | 8.1% | 5.1% | 16.6% | 13.3% | WER |
| pt_br | 3.27x | 8.6% | 5.4% | 17.5% | 13.0% | WER |
| fr_fr | 1.72x | 8.9% | 6.2% | 17.3% | 13.3% | WER |
| cmn_hans_cn | 1.74x | 9.4% | 5.1% | 99.7%* | 100%* | CER |
| ko_kr | 1.10x | 10.6% | 7.9% | 23.5% | 21.7% | CER |
| tr_tr | 2.84x | 11.6% | 9.6% | 33.0% | 31.2% | WER |
| id_id | 2.86x | 16.0% | 9.1% | 30.9% | 22.2% | WER |
| nl_nl | 2.29x | 17.2% | 13.6% | 36.5% | 30.3% | WER |
| ms_my | 2.24x | 17.4% | 13.2% | 37.6% | 33.3% | WER |
| th_th | 1.42x | 18.3% | 15.4% | 96.8%* | 100%* | CER |
| ar_eg | 1.53x | 18.5% | 13.8% | 40.3% | 36.4% | WER |
| ja_jp | 0.83x | 19.3% | 17.1% | 94.4%* | 100%* | CER |
| yue_hant_hk | 0.87x | 19.5% | 13.8% | 99.8%* | 100%* | CER |
| vi_vn | 2.69x | 25.4% | 21.0% | 35.9% | 31.0% | CER |
| fi_fi | 1.56x | 25.9% | 22.7% | 70.3% | 70.0% | WER |
| hi_in | 0.74x | 30.8% | 21.4% | 36.0% | 30.6% | WER |
| pl_pl | 1.69x | 30.8% | 27.4% | 61.9% | 60.0% | WER |
| sv_se | 2.38x | 31.3% | 30.1% | 67.8% | 66.7% | WER |
| fil_ph | 1.56x | 32.2% | 22.4% | 64.8% | 61.1% | WER |
| mk_mk | 0.79x | 43.2% | 27.9% | 73.0% | 75.9% | WER |
| da_dk | 2.33x | 45.5% | 46.5% | 81.1% | 84.6% | WER |
| fa_ir | 1.88x | 48.9% | 34.4% | 75.1% | 75.0% | WER |
| el_gr | 0.95x | 51.9% | 39.2% | 78.2% | 76.5% | WER |
| hu_hu | 1.05x | 59.0% | 55.7% | 91.8% | 95.8% | WER |
| ro_ro | 1.03x | 60.9% | 56.2% | 97.2% | 100% | WER |
| cs_cz | 2.26x | 62.2% | 56.5% | 88.2% | 96.2% | WER |
*WER >90% is expected for CJK/Thai due to word segmentation - FLEURS references have artificial character-by-character spacing while our output is natural continuous text. CER shows actual transcription quality.
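The segmentation effect described in the footnote can be reproduced with a few lines. This is a minimal sketch, not the benchmark's actual scoring code: it assumes WER splits on whitespace, CER ignores spaces, and the reference is non-empty. The sample strings are made up for illustration.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate, tokenizing on whitespace (assumption)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces (assumption)."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


reference = "你 好 世 界"  # illustrative reference, spaced character by character
hypothesis = "你好世界"    # the same characters as continuous text

print(wer(reference, hypothesis))  # 1.0
print(cer(reference, hypothesis))  # 0.0
```

The hypothesis is character-for-character correct, yet WER is 100% because the four reference "words" never match the single unsegmented hypothesis token. CER is unaffected, which is why the table marks CER as the metric to use for these languages.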
Averages
| Metric | Average | Median |
|---|---|---|
| CER (all 30) | 25.1% | 19.4% |
| RTFx | 1.78x | 1.72x |
Speed by Language Type
| Type | Avg RTFx | Notes |
|---|---|---|
| Romance (es, it, pt, fr) | 2.6x | Fastest |
| Turkic/Indonesian | 2.5x | Fast |
| Germanic (en, de, nl) | 1.6x | Medium |
| Slavic (ru, pl, cs) | 1.9x | Medium |
| CJK (zh, ja, ko, yue) | 1.1x | Slow - more tokens |
| Indic (hi) | 0.74x | Slowest |
swift run -c release fluidaudiocli qwen3-benchmark --dataset fleurs --languages all
Streaming ASR (Parakeet EOU)
Real-time streaming ASR with End-of-Utterance detection using the Parakeet EOU 120M CoreML model.
Model: FluidInference/parakeet-realtime-eou-120m-coreml
Hardware: Apple M2, 2022, macOS 26
LibriSpeech test-clean (2620 files, 5.40h audio)
| Chunk Size | WER (Avg) | RTFx | Total Time |
|---|---|---|---|
| 320ms | 4.87% | 12.48x | 1558s (26m) |
| 160ms | 8.29% | 4.78x | 4070s (68m) |
# Run 320ms benchmark
swift run -c release fluidaudiocli parakeet-eou --benchmark --chunk-size 320 --use-cache
# Run 160ms benchmark
swift run -c release fluidaudiocli parakeet-eou --benchmark --chunk-size 160 --use-cache
Speaker Diarization
The offline version uses the community-1 model; the online version uses the legacy speaker-diarization-3.1 model.
Offline diarization pipeline
In exchange for ~1.2% worse DER, we default to a higher step ratio and minimum segment duration than the baseline community-1 pipeline. This gives nearly 2x the speed (as expected, because we process half as many embeddings). For accuracy-critical use cases, use step ratio = 0.1 and minSegmentDurationSeconds = 0.0.
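To see why a larger step ratio roughly halves the work, here is a rough sketch that counts sliding windows over a clip. Everything in it is an assumption for illustration: the 10s window length, the 0.2 step ratio used for comparison, the 600s clip, and the `window_count` helper are hypothetical and are not taken from the FluidAudio pipeline.

```python
import math


def window_count(clip_seconds: float, window_seconds: float, step_ratio: float) -> int:
    """Number of sliding windows needed to cover a clip (the last one may be partial)."""
    step = window_seconds * step_ratio
    if clip_seconds <= window_seconds:
        return 1
    return 1 + math.ceil((clip_seconds - window_seconds) / step)


clip = 600.0  # hypothetical 10-minute recording
fine = window_count(clip, window_seconds=10.0, step_ratio=0.1)    # 1s hop
coarse = window_count(clip, window_seconds=10.0, step_ratio=0.2)  # 2s hop

print(fine, coarse)              # 591 296
print(f"{fine / coarse:.2f}x")   # 2.00x
```

Doubling the step ratio halves the number of windows, and therefore the number of embeddings to extract, which is consistent with the roughly 2x RTFx difference between the two runs reported below.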
Running on the full voxconverse benchmark:
Average DER: 15.07% | Median DER: 10.70% | Average JER: 39.40% | Median JER: 40.95% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 122.06 (from 232 clips)
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
Step Ratio 2, min duration 1.0
StepRatio = 0.1, minSegmentDurationSeconds= 0
Average DER: 13.89% | Median DER: 10.49% | Average JER: 42.84% | Median JER: 43.30% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 64.75 (from 232 clips)
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
Step Ratio 1, min duration 0
Note that the baseline PyTorch version reaches ~11% DER; we lost some accuracy by dropping to fp16 so that most of the embedding model runs on the Neural Engine. In return, we significantly outperform the baseline MPS backend: pyannote-community-1 runs at ~1.5-2 RTFx on CPU and ~20-25 RTFx on MPS.
Streaming/online Diarization
Streaming is trickier and considerably more sensitive to clustering. Expect 10-15% worse DER than the offline pipeline. Use it only when you genuinely need real-time streaming speaker diarization; offline is more than enough for most applications.
Running a near real-time diarization benchmark for 3s chunks, 1s overlap, and 0.85 clustering threshold:
--dataset ami-sdm \
--threshold 0.85 \
--auto-download \
--chunk-seconds 3.0 \
--overlap-seconds 1.0
...
------------------------------------------------------------------------------------------
Meeting DER % JER % Miss % FA % SE % Speakers RTFx
------------------------------------------------------------------------------------------
ES2004a 31.6 41.6 6.7 2.1 22.7 7/4 49.8
ES2005a 39.7 65.0 6.9 7.3 25.5 5/4 59.1
IS1002b 40.4 51.3 1.1 5.2 34.1 9/4 45.3
ES2002a 41.5 56.0 5.3 10.1 26.1 6/4 48.6
ES2003a 53.1 78.7 5.3 2.3 45.5 5/4 57.1
IS1000a 66.7 74.0 6.1 7.6 53.0 7/4 50.7
IS1001a 75.0 88.6 7.1 4.7 63.2 10/4 48.8
------------------------------------------------------------------------------------------
AVERAGE 49.7 65.0 5.5 5.6 38.6 - 51.4
==========================================================================================
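For readers unfamiliar with the columns, DER is conventionally the sum of missed speech (Miss), false alarm (FA), and speaker error (SE), each expressed as a percentage of reference speech time. The sketch below checks that relationship against three rows copied from the table above; the helper name is ours, and small differences are expected because the table values are rounded to one decimal.

```python
def der_from_components(miss: float, false_alarm: float, speaker_error: float) -> float:
    """Diarization error rate as the sum of its three components (all in percent)."""
    return miss + false_alarm + speaker_error


# (reported DER, Miss, FA, SE) copied from the 3s-chunk / 1s-overlap table.
rows = {
    "ES2004a": (31.6, 6.7, 2.1, 22.7),
    "ES2005a": (39.7, 6.9, 7.3, 25.5),
    "IS1002b": (40.4, 1.1, 5.2, 34.1),
}

for meeting, (der, miss, fa, se) in rows.items():
    total = der_from_components(miss, fa, se)
    print(f"{meeting}: {miss} + {fa} + {se} = {total:.1f} (reported DER {der})")
```

In these rows speaker error dominates, which points at clustering rather than speech detection as the main source of streaming errors.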
Diarization benchmark with 10s chunks, 0s overlap, and 0.7 clustering threshold:
--dataset ami-sdm \
--threshold 0.7 \
--auto-download \
--chunk-seconds 10.0 \
--overlap-seconds 0.0
...
------------------------------------------------------------------------------------------
Meeting DER % JER % Miss % FA % SE % Speakers RTFx
------------------------------------------------------------------------------------------
ES2003a 12.0 19.5 6.9 1.2 3.9 4/4 477.0
ES2004a 15.1 24.8 9.2 1.2 4.7 4/4 367.4
ES2002a 17.8 26.8 8.6 5.8 3.4 6/4 356.8
IS1002b 38.0 41.8 3.1 3.1 31.8 5/4 361.9
ES2005a 22.5 36.8 7.7 6.8 8.0 4/4 460.8
IS1000a 57.7 80.6 11.9 3.9 41.9 8/4 352.1
IS1001a 70.1 85.4 11.2 2.4 56.5 7/4 370.9
------------------------------------------------------------------------------------------
AVERAGE 33.3 45.1 8.4 3.5 21.5 - 392.4
==========================================================================================
Diarization benchmark with 5s chunks, 0s overlap, and 0.8 clustering threshold (best configuration found):
--dataset ami-sdm \
--threshold 0.8 \
--auto-download \
--chunk-seconds 5.0 \
--overlap-seconds 0.0
...
------------------------------------------------------------------------------------------
Meeting DER % JER % Miss % FA % SE % Speakers RTFx
------------------------------------------------------------------------------------------
IS1002b 9.8 11.7 3.5 3.8 2.6 5/4 205.2
ES2003a 14.4 23.3 7.4 1.6 5.3 4/4 260.9
ES2004a 17.0 26.0 9.0 1.3 6.7 7/4 218.1
ES2005a 18.4 31.0 9.2 5.8 3.4 4/4 259.8
ES2002a 20.8 30.5 9.5 7.4 3.9 5/4 198.0
IS1000a 24.7 35.7 12.1 4.3 8.3 6/4 204.2
IS1001a 78.0 94.5 13.3 3.0 61.6 6/4 215.7
------------------------------------------------------------------------------------------
AVERAGE 26.2 36.1 9.2 3.9 13.1 - 223.1
==========================================================================================
Diarization benchmark with 5s chunks, 2s overlap, and 0.8 clustering threshold:
--dataset ami-sdm \
--threshold 0.8 \
--auto-download \
--chunk-seconds 5.0 \
--overlap-seconds 2.0
...
------------------------------------------------------------------------------------------
Meeting DER % JER % Miss % FA % SE % Speakers RTFx
------------------------------------------------------------------------------------------
ES2003a 24.5 42.1 4.7 1.9 18.0 6/4 81.4
ES2005a 27.5 50.6 5.5 7.6 14.4 5/4 76.8
ES2004a 31.6 54.8 6.4 2.3 23.0 5/4 66.9
IS1002b 39.6 57.0 0.8 5.1 33.7 6/4 63.7
ES2002a 41.1 57.2 4.7 9.8 26.7 5/4 65.5
IS1000a 57.4 54.2 6.1 7.7 43.6 9/4 67.2
IS1001a 79.0 86.8 7.0 5.0 66.9 10/4 64.5
------------------------------------------------------------------------------------------
AVERAGE 43.0 57.5 5.0 5.6 32.3 - 69.4
==========================================================================================
Sortformer Streaming Diarization
NVIDIA's Sortformer model for streaming speaker diarization, converted to CoreML.
Model: FluidInference/diar-streaming-sortformer-coreml (V2 models for macOS 26+ compatibility)
Hardware: Apple M2, 2022, macOS 26.1
AMI SDM Dataset (NVIDIA High-Latency Config - 30.4s chunks)
================================================================================
SORTFORMER BENCHMARK SUMMARY
================================================================================
Results Sorted by DER:
----------------------------------------------------------------------
Meeting DER % Miss % FA % SE % Speakers RTFx
----------------------------------------------------------------------
IS1009b 16.4 10.6 0.6 5.3 4/4 127.0
ES2004c 23.8 17.8 0.3 5.7 4/4 126.5
ES2004b 23.9 18.7 0.2 5.0 4/4 123.9
IS1009a 26.5 16.0 1.4 9.1 4/4 134.4
ES2004d 28.3 19.7 0.3 8.3 4/4 123.5
IS1009d 29.1 16.5 1.0 11.6 4/4 127.9
TS3003b 31.1 27.1 0.6 3.4 4/4 125.5
EN2002c 31.8 20.1 0.2 11.5 4/3 126.0
ES2004a 33.7 24.6 0.1 9.0 4/4 127.2
EN2002b 34.0 20.2 0.6 13.3 4/4 127.7
TS3003c 34.4 31.1 0.3 3.1 4/4 126.6
EN2002a 35.6 20.0 0.4 15.2 4/4 125.4
EN2002d 37.1 20.1 0.5 16.5 4/4 125.5
IS1009c 38.1 12.8 0.9 24.4 4/4 129.2
TS3003d 41.0 32.0 0.1 8.8 4/4 125.6
TS3003a 41.8 36.8 0.7 4.3 4/4 125.7
----------------------------------------------------------------------
AVERAGE 31.7 21.5 0.5 9.7 - 126.7
======================================================================