perf: batch duplicate marking in batch deduplication #14458

Open
valentijnscholten wants to merge 3 commits into DefectDojo:dev from valentijnscholten:batch-duplicate-marking

Conversation

valentijnscholten (Member) commented Mar 6, 2026 (edited)

Summary

Deduplication matching happens in batches, but the "marking" of findings as duplicates still happened one by one. For duplicate-heavy instances this is a major slowdown and resource hog. This PR batches the "marking" of duplicates, reducing the write complexity from O(2N) individual saves to O(1) bulk updates per batch.

  • set_duplicate now accepts a save keyword argument (default True); when False, field changes are applied to the model instance in memory without hitting the database
  • Added _flush_duplicate_changes helper that bulk-updates all modified duplicate findings in a single Finding.objects.bulk_update call -- one round-trip regardless of batch size, no saves on original findings
  • All four _dedupe_batch_* algorithm functions (hash_code, unique_id, uid_or_hash, legacy) now collect duplicate pairs and flush via _flush_duplicate_changes instead of saving per-pair
  • Transitive duplicates (findings that previously pointed to new_finding as their original, now re-pointed to the true original) are captured through set_duplicate's return value and included in the bulk update
  • Added Finding.DEDUPLICATION_FIELDS constant (allowlist) used with .only() in get_finding_models_for_deduplication to skip loading unused columns for the findings being processed
  • Added Finding.DEDUPLICATION_DEFERRED_FIELDS constant (denylist of large text columns) used with .defer() in build_candidate_scope_queryset to skip loading unused columns for the candidate pool
  • Performance test query count updated: the second import with duplicates drops from 236 to 183 queries (-53)
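The collect-then-flush flow described above can be sketched in plain Python. This is an illustrative toy, not the actual DefectDojo code: the names set_duplicate and _flush_duplicate_changes come from the PR, but the classes and the write_calls counter are hypothetical stand-ins for the Django ORM.

```python
# Toy sketch of batched duplicate marking: mark fields in memory,
# then flush all modified findings in a single bulk write.

class FakeFinding:
    """Hypothetical stand-in for the Finding model."""
    def __init__(self, pk):
        self.pk = pk
        self.duplicate = False
        self.duplicate_finding = None

write_calls = []  # records one entry per simulated database round-trip

def set_duplicate(finding, original, save=True):
    """Mark `finding` as a duplicate of `original`.

    With save=False the fields are only changed in memory; the caller
    is responsible for flushing via a bulk update later. Returns the
    changed instances so transitive duplicates can be collected too.
    """
    finding.duplicate = True
    finding.duplicate_finding = original
    if save:
        write_calls.append([finding.pk])  # old path: one write per finding
    return [finding]

def _flush_duplicate_changes(changed):
    """Flush all modified findings in one bulk round-trip."""
    if changed:
        write_calls.append(sorted(f.pk for f in changed))

# Batched path: five duplicates, one write.
original = FakeFinding(1)
batch = [FakeFinding(pk) for pk in range(2, 7)]
changed = []
for f in batch:
    changed.extend(set_duplicate(f, original, save=False))
_flush_duplicate_changes(changed)

print(len(write_calls))  # 1 round-trip for the whole batch
```

In the real PR the flush is a Finding.objects.bulk_update call, so the round-trip count stays constant regardless of batch size.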

Instead of saving each duplicate finding individually, collect all
modified findings during a batch deduplication run and flush them in
a single bulk_update call. Original (existing) findings are still
saved individually to preserve auto_now timestamp updates and
post_save signal behavior, but are deduplicated by id so each is
saved at most once per batch.

Reduces DB writes from O(2N) individual saves to 1 bulk_update +
O(unique originals) saves for a batch of N duplicates.

Performance test shows -23 queries on a second import with duplicates.
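The "deduplicated by id so each is saved at most once per batch" part of the commit message can be sketched with a dict keyed by primary key. The Original class and saved_pks list here are hypothetical illustrations, not DefectDojo code.

```python
# Sketch: originals still go through save() (to keep auto_now timestamps
# and post_save signal behavior), but are collected in a dict keyed by
# primary key so each original is saved at most once per batch.

saved_pks = []

class Original:
    """Hypothetical stand-in for an existing (original) finding."""
    def __init__(self, pk):
        self.pk = pk
    def save(self):
        saved_pks.append(self.pk)

originals_to_save = {}  # pk -> instance, deduplicated by id
o1, o2 = Original(10), Original(11)
for original in [o1, o2, o1, o1, o2]:  # same originals matched repeatedly
    originals_to_save[original.pk] = original

for original in originals_to_save.values():
    original.save()

print(saved_pks)  # each original saved exactly once
```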
github-actions bot added the unittests label Mar 6, 2026
valentijnscholten added this to the 2.57.0 milestone Mar 6, 2026
valentijnscholten added 2 commits March 6, 2026 19:20
Add Finding.DEDUPLICATION_FIELDS -- the union of all Finding fields
needed across every deduplication algorithm -- and apply it as an
only() clause in get_finding_models_for_deduplication.

This avoids loading large text columns (description, mitigation,
impact, references, steps_to_reproduce, severity_justification, etc.)
when loading findings for the batch deduplication task, reducing
data transferred from the database without affecting query count.

build_candidate_scope_queryset is intentionally excluded: it is also
used for reimport matching (which accesses severity, numerical_severity
and other fields outside this set) and applying only() there would
cause deferred-field extra queries.
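The "deferred-field extra queries" hazard mentioned above is why only() is kept off the reimport path: in Django, accessing a field that was excluded from the fetch triggers a separate query per instance. A toy model of that behavior (not Django internals; LazyFinding and the counter are hypothetical):

```python
# Toy model of deferred-field access: fields loaded up front are free,
# each access to an unloaded field costs one extra round-trip.

extra_queries = 0

class LazyFinding:
    """Hypothetical stand-in for a model instance fetched with only()."""
    def __init__(self, **loaded):
        self.__dict__.update(loaded)  # fields included in the allowlist

    def __getattr__(self, name):
        # Called only when the attribute is missing, i.e. was deferred.
        global extra_queries
        extra_queries += 1  # simulate one extra DB query
        value = f"<fetched {name}>"
        setattr(self, name, value)  # cache like Django does
        return value

f = LazyFinding(id=1, hash_code="abc")  # only("id", "hash_code")
_ = f.hash_code            # loaded: no extra query
_ = f.severity             # deferred: one extra query
_ = f.numerical_severity   # deferred: another extra query

print(extra_queries)  # 2 extra queries
```

This is exactly the failure mode that would hit reimport matching, which reads severity and numerical_severity outside the deduplication field set.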
- Add Finding.DEDUPLICATION_DEFERRED_FIELDS constant listing large text
columns (description, mitigation, impact, references, etc.) that are
never read during deduplication or candidate matching.
- Apply .defer(*Finding.DEDUPLICATION_DEFERRED_FIELDS) in
build_candidate_scope_queryset to avoid loading those columns for the
potentially large candidate pool fetched per dedup batch.

Reduces deduplication second-import query count from 213 to 183 (-30).
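The two constants play opposite roles: DEDUPLICATION_FIELDS is an allowlist applied with only(), DEDUPLICATION_DEFERRED_FIELDS a denylist applied with defer(). A conceptual sketch of that column selection (the field names are examples drawn from the commit messages; this is not Django's implementation):

```python
# Conceptual sketch of only() vs defer() column selection:
# only() keeps just the allowlisted columns, defer() drops the denylisted ones.

ROW = {
    "id": 1,
    "hash_code": "abc",
    "unique_id_from_tool": "X-1",
    "description": "...large text...",
    "mitigation": "...large text...",
}

DEDUPLICATION_FIELDS = ["id", "hash_code", "unique_id_from_tool"]
DEDUPLICATION_DEFERRED_FIELDS = ["description", "mitigation"]

def load_only(row, fields):
    """only(*fields): fetch just the allowlisted columns."""
    return {k: v for k, v in row.items() if k in fields}

def load_defer(row, deferred):
    """defer(*deferred): fetch everything except the denylisted columns."""
    return {k: v for k, v in row.items() if k not in deferred}

narrow = load_only(ROW, DEDUPLICATION_FIELDS)
wide = load_defer(ROW, DEDUPLICATION_DEFERRED_FIELDS)
print(sorted(narrow), sorted(wide))
```

defer() is the safer choice for the candidate pool because any column not explicitly denylisted is still loaded, so code paths that read other fields keep working without extra queries.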
valentijnscholten marked this pull request as ready for review March 8, 2026 17:22
valentijnscholten requested review from Maffooch and mtesauro as code owners March 8, 2026 17:22

Reviewers

Maffooch: awaiting requested review (code owner)

mtesauro: awaiting requested review (code owner)

At least 4 approving reviews are required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

2.57.0

Development

1 participant