perf: batch duplicate marking in batch deduplication #14458

Open
valentijnscholten wants to merge 3 commits into DefectDojo:dev from valentijnscholten:batch-duplicate-marking

Conversation

valentijnscholten (Member) commented Mar 6, 2026 (edited)

Summary

Deduplication matching happens in batches, but the "marking" of findings as duplicates still happened one by one. For duplicate-heavy instances this is a major slowdown and resource hog. This PR batches the "marking" of duplicates, reducing the write complexity from O(2N) individual saves to O(1) bulk updates per batch.

  • set_duplicate now accepts a save keyword argument (default True); when False, field changes are applied to the model instance in memory without hitting the database
  • Added _flush_duplicate_changes helper that bulk-updates all modified duplicate findings in a single Finding.objects.bulk_update call -- one round-trip regardless of batch size, no saves on original findings
  • All four _dedupe_batch_* algorithm functions (hash_code, unique_id, uid_or_hash, legacy) now collect duplicate pairs and flush via _flush_duplicate_changes instead of saving per-pair
  • Transitive duplicates (findings that previously pointed to new_finding as their original, now re-pointed to the true original) are captured through set_duplicate's return value and included in the bulk update
  • Added Finding.DEDUPLICATION_FIELDS constant (allowlist) used with .only() in get_finding_models_for_deduplication to skip loading unused columns for the findings being processed
  • Added Finding.DEDUPLICATION_DEFERRED_FIELDS constant (denylist of large text columns) used with .defer() in build_candidate_scope_queryset to skip loading unused columns for the candidate pool
  • Performance test query count updated: the second import with duplicates drops from 236 to 183 queries (-53)
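The collect-then-flush flow described above can be sketched in plain Python. This is an illustrative toy, not the actual DefectDojo code: the names set_duplicate and _flush_duplicate_changes come from the PR, but the classes and the write_calls counter are hypothetical stand-ins for the Django ORM.

```python
# Toy sketch of batched duplicate marking: mark fields in memory,
# then flush all modified findings in a single bulk write.

class FakeFinding:
    """Hypothetical stand-in for the Finding model."""
    def __init__(self, pk):
        self.pk = pk
        self.duplicate = False
        self.duplicate_finding = None

write_calls = []  # records one entry per simulated database round-trip

def set_duplicate(finding, original, save=True):
    """Mark `finding` as a duplicate of `original`.

    With save=False the fields are only changed in memory; the caller
    is responsible for flushing via a bulk update later. Returns the
    changed instances so transitive duplicates can be collected too.
    """
    finding.duplicate = True
    finding.duplicate_finding = original
    if save:
        write_calls.append([finding.pk])  # old path: one write per finding
    return [finding]

def _flush_duplicate_changes(changed):
    """Flush all modified findings in one bulk round-trip."""
    if changed:
        write_calls.append(sorted(f.pk for f in changed))

# Batched path: five duplicates, one write.
original = FakeFinding(1)
batch = [FakeFinding(pk) for pk in range(2, 7)]
changed = []
for f in batch:
    changed.extend(set_duplicate(f, original, save=False))
_flush_duplicate_changes(changed)

print(len(write_calls))  # 1 round-trip for the whole batch
```

In the real PR the flush is a Finding.objects.bulk_update call, so the round-trip count stays constant regardless of batch size.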

Instead of saving each duplicate finding individually, collect all
modified findings during a batch deduplication run and flush them in
a single bulk_update call. Original (existing) findings are still
saved individually to preserve auto_now timestamp updates and
post_save signal behavior, but are deduplicated by id so each is
saved at most once per batch.

Reduces DB writes from O(2N) individual saves to 1 bulk_update +
O(unique originals) saves for a batch of N duplicates.

Performance test shows -23 queries on a second import with duplicates.
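The "deduplicated by id so each is saved at most once per batch" part of the commit message can be sketched with a dict keyed by primary key. The Original class and saved_pks list here are hypothetical illustrations, not DefectDojo code.

```python
# Sketch: originals still go through save() (to keep auto_now timestamps
# and post_save signal behavior), but are collected in a dict keyed by
# primary key so each original is saved at most once per batch.

saved_pks = []

class Original:
    """Hypothetical stand-in for an existing (original) finding."""
    def __init__(self, pk):
        self.pk = pk
    def save(self):
        saved_pks.append(self.pk)

originals_to_save = {}  # pk -> instance, deduplicated by id
o1, o2 = Original(10), Original(11)
for original in [o1, o2, o1, o1, o2]:  # same originals matched repeatedly
    originals_to_save[original.pk] = original

for original in originals_to_save.values():
    original.save()

print(saved_pks)  # each original saved exactly once
```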
github-actions bot added the unittests label Mar 6, 2026
valentijnscholten added this to the 2.57.0 milestone Mar 6, 2026
valentijnscholten added 2 commits March 6, 2026 19:20
Add Finding.DEDUPLICATION_FIELDS -- the union of all Finding fields
needed across every deduplication algorithm -- and apply it as an
only() clause in get_finding_models_for_deduplication.

This avoids loading large text columns (description, mitigation,
impact, references, steps_to_reproduce, severity_justification, etc.)
when loading findings for the batch deduplication task, reducing
data transferred from the database without affecting query count.

build_candidate_scope_queryset is intentionally excluded: it is also
used for reimport matching (which accesses severity, numerical_severity
and other fields outside this set) and applying only() there would
cause deferred-field extra queries.
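The "deferred-field extra queries" hazard mentioned above is why only() is kept off the reimport path: in Django, accessing a field that was excluded from the fetch triggers a separate query per instance. A toy model of that behavior (not Django internals; LazyFinding and the counter are hypothetical):

```python
# Toy model of deferred-field access: fields loaded up front are free,
# each access to an unloaded field costs one extra round-trip.

extra_queries = 0

class LazyFinding:
    """Hypothetical stand-in for a model instance fetched with only()."""
    def __init__(self, **loaded):
        self.__dict__.update(loaded)  # fields included in the allowlist

    def __getattr__(self, name):
        # Called only when the attribute is missing, i.e. was deferred.
        global extra_queries
        extra_queries += 1  # simulate one extra DB query
        value = f"<fetched {name}>"
        setattr(self, name, value)  # cache like Django does
        return value

f = LazyFinding(id=1, hash_code="abc")  # only("id", "hash_code")
_ = f.hash_code            # loaded: no extra query
_ = f.severity             # deferred: one extra query
_ = f.numerical_severity   # deferred: another extra query

print(extra_queries)  # 2 extra queries
```

This is exactly the failure mode that would hit reimport matching, which reads severity and numerical_severity outside the deduplication field set.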
- Add Finding.DEDUPLICATION_DEFERRED_FIELDS constant listing large text
columns (description, mitigation, impact, references, etc.) that are
never read during deduplication or candidate matching.
- Apply .defer(*Finding.DEDUPLICATION_DEFERRED_FIELDS) in
build_candidate_scope_queryset to avoid loading those columns for the
potentially large candidate pool fetched per dedup batch.

Reduces deduplication second-import query count from 213 to 183 (-30).
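The two constants play opposite roles: DEDUPLICATION_FIELDS is an allowlist applied with only(), DEDUPLICATION_DEFERRED_FIELDS a denylist applied with defer(). A conceptual sketch of that column selection (the field names are examples drawn from the commit messages; this is not Django's implementation):

```python
# Conceptual sketch of only() vs defer() column selection:
# only() keeps just the allowlisted columns, defer() drops the denylisted ones.

ROW = {
    "id": 1,
    "hash_code": "abc",
    "unique_id_from_tool": "X-1",
    "description": "...large text...",
    "mitigation": "...large text...",
}

DEDUPLICATION_FIELDS = ["id", "hash_code", "unique_id_from_tool"]
DEDUPLICATION_DEFERRED_FIELDS = ["description", "mitigation"]

def load_only(row, fields):
    """only(*fields): fetch just the allowlisted columns."""
    return {k: v for k, v in row.items() if k in fields}

def load_defer(row, deferred):
    """defer(*deferred): fetch everything except the denylisted columns."""
    return {k: v for k, v in row.items() if k not in deferred}

narrow = load_only(ROW, DEDUPLICATION_FIELDS)
wide = load_defer(ROW, DEDUPLICATION_DEFERRED_FIELDS)
print(sorted(narrow), sorted(wide))
```

defer() is the safer choice for the candidate pool because any column not explicitly denylisted is still loaded, so code paths that read other fields keep working without extra queries.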
valentijnscholten marked this pull request as ready for review March 8, 2026 17:22
valentijnscholten requested review from Maffooch and mtesauro as code owners March 8, 2026 17:22

Reviewers

Maffooch: awaiting requested review (code owner)

mtesauro: awaiting requested review (code owner)

At least 4 approving reviews are required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

2.57.0

Development

1 participant