
(fetch_url) add tool to fetch urls and store in rag #195

Open
camilleAND wants to merge 14 commits into main from camand/feat_fetch_url

Conversation

Contributor

camilleAND commented Dec 5, 2025 (edited by coderabbitai bot)

Purpose

Add a fetch_url tool that lets the agent automatically fetch and process content from URLs detected in conversations. The tool is added dynamically when URLs are present, without prior configuration.

Proposal

  • Add detect_url_in_conversation() to detect URLs in messages (ui_messages and messages)
  • Implement fetch_url() with support for multiple content types (HTML, PDF, markdown)
  • Special handling for docs.numerique.gouv.fr via their API
  • HTML text extraction using trafilatura
  • PDF support via Albert API for parsing
  • Smart content handling: inline return for small content (< 8000 characters), RAG storage for large or binary content
  • Retry mechanism with randomized headers (User-Agent) to avoid blocking (403, 429)
  • Security: validate that the URL was detected in the conversation before fetching
  • Dynamic integration: automatically add the tool to the agent when URLs are detected
  • Storage in RAG backend and creation of markdown attachments for large documents
  • Error handling (HTTP, timeout, exceptions)
  • Test suite covering different use cases (detected/undetected URLs, different content types, errors)
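The detection step described above can be sketched as follows. This is an illustrative sketch, not the PR's code: `URL_PATTERN` and `detect_urls` are hypothetical names, and the real `detect_url_in_conversation()` in `src/backend/chat/tools/fetch_url.py` also walks `ui_messages` and `messages` structures.

```python
import re

# Hypothetical sketch of URL detection across conversation message texts.
# Stops a match at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r'https?://[^\s<>"\')\]]+')


def detect_urls(texts: list[str]) -> set[str]:
    """Collect every http(s) URL found across a list of message texts."""
    found: set[str] = set()
    for text in texts:
        found.update(URL_PATTERN.findall(text))
    return found
```

Returning a set mirrors the deduplication the PR performs before deciding whether to register the tool.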

Summary by CodeRabbit

  • New Features

    • On-demand URL detection and fetching in conversations with fetched pages/PDFs stored as attachments, a new "Fetching URL..." UI state, and a configurable fetch timeout; translations added.
  • Bug Fixes

    • Prevent duplicate URL sources in responses.
    • Skip document search when no document collection exists.
  • Tests

    • Added comprehensive tests for URL detection, fetching, parsing, storage, and error scenarios.


coderabbitai bot commented Dec 5, 2025 (edited)

Warning

Rate limit exceeded

@camilleAND has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 35 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Reviewing files that changed from the base of the PR and between 4c25584 and 3e1934c.

Files selected for processing (1)
  • src/backend/chat/document_storage.py (1 hunks)

Walkthrough

Adds a new fetch_url tool and tests, exposes it in tools, updates the AI client to detect/register the tool dynamically, adds document storage utilities and converter input support, guards document_search_rag for missing collections, introduces FETCH_URL_TIMEOUT setting, updates frontend UI/translations, and extends .gitignore.

Changes

Cohort / File(s) — Change Summary

  • Configuration — .gitignore
    Adds ignore entry for src/backend/conversations/configuration/llm/default_dev.json.
  • New URL fetch tool — src/backend/chat/tools/fetch_url.py
    New public tool fetch_url(ctx, url) that detects/extracts URLs, fetches HTTP content with randomized UA and retry on 403/429, handles docs.numerique.gouv.fr specially, classifies content (text/PDF/binary), uses Trafilatura for HTML extraction, chooses inline vs RAG storage for large/binary content, creates markdown attachments, and returns structured ToolReturn results with metadata and error handling.
  • Tool exports — src/backend/chat/tools/__init__.py
    Imports and exposes fetch_url from .fetch_url.
  • AI client / runtime wiring — src/backend/chat/clients/pydantic_ai.py
    Detects URLs in conversation messages, deduplicates URL sources, adjusts RAG-enabling logic to consider URLs/docs, consolidates document handling via the new storage utility, and dynamically registers/wraps fetch_url at runtime when URLs are present.
  • Document storage utilities — src/backend/chat/document_storage.py
    New utilities: ensure per-conversation RAG collection, store documents (normalize bytes/BytesIO), parse via backend, create markdown attachments, and orchestrate store+attachment flows with error handling.
  • Document converter — src/backend/chat/agent_rag/document_converter/markitdown.py
    convert_raw signature now accepts `bytes
  • Document-search guard — src/backend/chat/tools/document_search_rag.py
    Adds early return when conversation collection_id is missing: returns "No documents to search in yet." instead of executing a search.
  • Document summarize — src/backend/chat/tools/document_summarize.py
    read_document_content prefers doc.conversion_from when it's an http(s) URL for source naming; the final summary is appended with the fixed instruction line "Copy paste this summary to the user."
  • Tests — src/backend/chat/tests/tools/test_fetch_url.py
    Adds comprehensive async tests for URL detection and fetch_url behaviors (HTML extraction, PDFs, large-content RAG, HTTP errors/timeouts, special-site handling) using respx, mocks, Django fixtures, and AsyncMock.
  • Settings — src/backend/conversations/settings.py
    Adds FETCH_URL_TIMEOUT (PositiveIntegerValue, default 5) to Base settings.
  • Frontend UI — src/frontend/apps/conversations/src/features/chat/components/Chat.tsx
    Changes tool-invocation detection to pick the last active tool-invocation (non-result) and adds handling for fetch_url, rendering "Fetching URL...".
  • Translations — src/frontend/apps/conversations/src/i18n/translations.json
    Adds localized strings for "Fetching URL..." (FR/NL/RU/UK).

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Agent as Pydantic AI Agent
    participant ToolReg as Tool Registry
    participant FetchURL as fetch_url Tool
    participant HTTP as Remote HTTP
    participant Parser as Trafilatura/Parser
    participant RAG as Document Store (RAG)
    participant UI as Chat UI

    User->>Agent: send message (may contain URL)
    Agent->>Agent: detect URLs in conversation
    alt URLs found
        Agent->>ToolReg: register/wrap fetch_url dynamically
        Agent->>FetchURL: invoke fetch_url(url)
        FetchURL->>HTTP: GET (random UA, retry on 403/429)
        HTTP-->>FetchURL: response (HTML / PDF / binary / error)
        alt HTML & small
            FetchURL->>Parser: extract text
            Parser-->>FetchURL: extracted content
            FetchURL-->>Agent: inline content + metadata
        else large or binary or PDF
            FetchURL->>RAG: store parsed document, create markdown attachment
            RAG-->>FetchURL: document reference / attachment key
            FetchURL-->>Agent: attachment reference + metadata
        else error/timeout
            FetchURL-->>Agent: error result
        end
    end
    Agent-->>UI: final response with inline text / attachment / error
```

Estimated code review effort

4 (Complex) | ~45 minutes

  • Focus areas:
    • src/backend/chat/tools/fetch_url.py -- HTTP/retry logic, content classification, Trafilatura use, RAG storage and attachment creation, error paths.
    • src/backend/chat/clients/pydantic_ai.py -- dynamic tool registration/wrapping, URL detection/deduplication, and revised document handling.
    • src/backend/chat/document_storage.py -- collection lifecycle, bytes/BytesIO handling, parsed content management, attachment metadata.
    • Tests: src/backend/chat/tests/tools/test_fetch_url.py -- verify mocks/respx accurately simulate scenarios.

Possibly related PRs

Suggested labels

enhancement, backend, frontend

Pre-merge checks and finishing touches

Passed checks (3 passed)
Check name — Status — Explanation

  • Description Check — Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check — Passed — The title accurately summarizes the main change: adding a fetch_url tool to fetch URLs and store content in RAG. It is concise, specific, and clearly represents the primary objective of the PR.
  • Docstring Coverage — Passed — Docstring coverage is 92.68%, which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-advanced-security bot found potential problems Dec 5, 2025
coderabbitai bot reviewed Dec 5, 2025
coderabbitai bot left a comment

Actionable comments posted: 1

Nitpick comments (6)
src/backend/chat/tools/document_search_rag.py (1)

27-31: Move the guard before instantiating the document store.

The collection_id check on line 29 happens after the document_store is already instantiated on line 27. This is inefficient since the backend is created unnecessarily when there's no collection to search.

Apply this diff to optimize the control flow:

```diff
-    document_store_backend = import_string(settings.RAG_DOCUMENT_SEARCH_BACKEND)
-
-    document_store = document_store_backend(ctx.deps.conversation.collection_id)
-
     if not ctx.deps.conversation.collection_id:
         return ToolReturn(return_value=[], content="No documents to search in yet.")

+    document_store_backend = import_string(settings.RAG_DOCUMENT_SEARCH_BACKEND)
+    document_store = document_store_backend(ctx.deps.conversation.collection_id)
+
     rag_results = document_store.search(query)
```
src/backend/chat/clients/pydantic_ai.py (1)

406-406: Accessing protected member _function_toolset is fragile.

This relies on an internal implementation detail of pydantic_ai.Agent. If the library updates its internals, this check will break silently or raise an AttributeError.

Consider wrapping this in a try-except or using a more robust approach:

```diff
-    fetch_url_exists = "fetch_url" in self.conversation_agent._function_toolset.tools  # pylint: disable=protected-access
+    # Check if the fetch_url tool already exists
+    try:
+        fetch_url_exists = "fetch_url" in self.conversation_agent._function_toolset.tools  # pylint: disable=protected-access
+    except AttributeError:
+        fetch_url_exists = False
```

Alternatively, consider maintaining a local set of dynamically added tool names to avoid relying on internals.
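That alternative could look roughly like the sketch below. The class and method names (`DynamicToolTracker`, `needs_registration`, `mark_registered`) are illustrative, not part of the PR or of pydantic_ai; the point is simply to keep the bookkeeping on the client side instead of peeking at `_function_toolset`.

```python
# Hypothetical sketch: track dynamically registered tool names locally so the
# "already registered?" check never touches Agent internals.
class DynamicToolTracker:
    def __init__(self) -> None:
        self._dynamic_tools: set[str] = set()

    def needs_registration(self, name: str) -> bool:
        """True if the tool has not been dynamically registered yet."""
        return name not in self._dynamic_tools

    def mark_registered(self, name: str) -> None:
        """Record that the tool was added to the agent."""
        self._dynamic_tools.add(name)
```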

src/backend/chat/tools/__init__.py (1)

21-27: Clarify the purpose of this registry entry.

The comment states fetch_url is added dynamically in pydantic_ai.py and "won't be used via prepare". If so, this entry may be redundant or could cause confusion.

Consider either:

  1. Removing this entry if it's truly unused (dynamic registration handles everything)
  2. Using a prepare function to conditionally enable it (similar to web search tools)
  3. Updating the comment to explain why it's kept here (e.g., for discoverability, testing, or future configuration-based enablement)
src/backend/chat/tools/fetch_url.py (2)

26-28: Simplify and fix the URL regex pattern.

Static analysis flagged several issues with this regex:

  • [$-_@.&+] is an overly permissive range matching ASCII 36-95 (includes %&'()*+,-./:;<=>?@ etc.)
  • Duplicate characters and unnecessary grouping
  • [0-9] should use \d

Consider using a simpler, more robust URL pattern:

```diff
-URL_PATTERN = re.compile(
-    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
-)
+URL_PATTERN = re.compile(
+    r'https?://[^\s<>"\'\])]+'
+)
```

Or for stricter matching:

```python
URL_PATTERN = re.compile(
    r'https?://(?:[\w\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9A-Fa-f]{2})+'
)
```
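To see why the flagged range is a real problem rather than a style nit, a quick check shows `[$-_]` spanning ASCII 36 through 95 and therefore matching digits, uppercase letters, and unrelated punctuation:

```python
import re

# The character class flagged by CodeQL: '$' (ASCII 36) to '_' (ASCII 95)
# is a range, so the class matches far more than the four literals intended.
permissive = re.compile(r'[$-_@.&+]')

assert permissive.match('7')   # digit, ASCII 55, inside 36-95
assert permissive.match(';')   # punctuation, ASCII 59
assert permissive.match('Q')   # uppercase letter, ASCII 81
```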

62-96: Consider simplifying _extract_text_from_message to reduce cognitive complexity.

Static analysis flags cognitive complexity of 24 (limit: 15). While the logic handles multiple message formats correctly, extracting helper functions could improve readability.

```diff
+def _extract_text_from_parts(parts, is_dict: bool) -> list[str]:
+    """Extract text from message parts."""
+    texts = []
+    for part in parts or []:
+        if is_dict:
+            if isinstance(part, dict) and part.get('type') == 'text' and part.get('text'):
+                texts.append(part['text'])
+        elif isinstance(part, TextUIPart) and part.text:
+            texts.append(part.text)
+    return texts
+
+
 def _extract_text_from_message(message) -> str:
-    ...
+    text_parts = []
+    if hasattr(message, 'parts'):
+        text_parts.extend(_extract_text_from_parts(message.parts, is_dict=False))
+        if hasattr(message, 'content') and message.content:
+            text_parts.append(message.content)
+    elif isinstance(message, dict):
+        text_parts.extend(_extract_text_from_parts(message.get('parts', []), is_dict=True))
+        if content := message.get('content', ''):
+            text_parts.append(content)
+    return ' '.join(text_parts)
```
src/backend/chat/tests/tools/test_fetch_url.py (1)

169-169: Remove unused mock_storage variable.

Static analysis correctly flags that mock_storage is captured but never used in assertions. If storage behavior doesn't need verification, use _ to indicate intentional ignorance.

```diff
 with patch("chat.tools.fetch_url.import_string") as mock_import, patch(
     "chat.tools.fetch_url.models.ChatConversationAttachment.objects.acreate", new_callable=AsyncMock
-) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save") as mock_storage:
+) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save"):
```
Or add assertions on storage if that behavior should be verified.

Also applies to: 273-273

Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Commits

Reviewing files that changed from the base of the PR and between a1ed561 and f50453b.

Files selected for processing (6)
  • .gitignore (1 hunks)
  • src/backend/chat/clients/pydantic_ai.py (7 hunks)
  • src/backend/chat/tests/tools/test_fetch_url.py (1 hunks)
  • src/backend/chat/tools/__init__.py (2 hunks)
  • src/backend/chat/tools/document_search_rag.py (1 hunks)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (5)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/models.py (1)
  • ChatConversationAttachment (78-133)
src/backend/chat/tests/tools/test_fetch_url.py (4)
src/backend/chat/factories.py (1)
  • ChatConversationFactory (12-18)
src/backend/chat/tools/fetch_url.py (2)
  • fetch_url (253-415)
  • detect_url_in_conversation (99-134)
src/backend/core/factories.py (1)
  • UserFactory (16-30)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
  • collection_id (56-62)
src/backend/chat/clients/pydantic_ai.py (4)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/tools/fetch_url.py (2)
  • fetch_url (253-415)
  • detect_url_in_conversation (99-134)
src/backend/chat/agent_rag/document_rag_backends/albert_rag_backend.py (1)
  • search (216-257)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
  • search (172-208)
src/backend/chat/tools/document_search_rag.py (1)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
  • collection_id (56-62)
src/backend/chat/tools/__init__.py (1)
src/backend/chat/tools/fetch_url.py (1)
  • fetch_url (253-415)
GitHub Check: CodeQL
src/backend/chat/tools/fetch_url.py

[warning] 27-27: Overly permissive regular expression range
Suspicious character range that is equivalent to [$%&'()*+,-./0-9:;<=>?@a-z\[\\]^_].

GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[failure] 62-62: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195


[warning] 27-27: Replace this character class by the character itself.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_y&open=AZrvDBPCJB0kz5OXAO_y&pullRequest=195


[failure] 280-280: Define a constant instead of duplicating this literal "docs.numerique.gouv.fr" 3 times.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_x&open=AZrvDBPCJB0kz5OXAO_x&pullRequest=195


[warning] 27-27: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_z&open=AZrvDBPCJB0kz5OXAO_z&pullRequest=195


[failure] 99-99: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_6&open=AZrvDBPCJB0kz5OXAO_6&pullRequest=195


[warning] 27-27: Remove duplicates in this character class.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_0&open=AZrvDBPCJB0kz5OXAO_0&pullRequest=195


[warning] 27-27: Use concise character class syntax '\d' instead of '[0-9]'.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_3&open=AZrvDBPCJB0kz5OXAO_3&pullRequest=195


[failure] 253-253: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195


[warning] 27-27: Remove or rework this redundant alternative.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_2&open=AZrvDBPCJB0kz5OXAO_2&pullRequest=195


[warning] 27-27: Remove duplicates in this character class.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_1&open=AZrvDBPCJB0kz5OXAO_1&pullRequest=195

src/backend/chat/tests/tools/test_fetch_url.py

[warning] 273-273: Remove the unused local variable "mock_storage".

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBMMJB0kz5OXAO_w&open=AZrvDBMMJB0kz5OXAO_w&pullRequest=195


[warning] 169-169: Remove the unused local variable "mock_storage".

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBMMJB0kz5OXAO_v&open=AZrvDBMMJB0kz5OXAO_v&pullRequest=195

Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test-e2e-chromium
  • GitHub Check: test-back
  • GitHub Check: build-and-push-backend
  • GitHub Check: build-and-push-frontend
Additional comments (11)
.gitignore (1)

83-84: Good practice to exclude development configuration files.

The addition of src/backend/conversations/configuration/llm/default_dev.json to .gitignore is appropriate, as development-specific configuration files (especially those containing local settings or credentials) should not be committed to version control.

src/backend/chat/clients/pydantic_ai.py (3)

395-414: URL detection and dynamic tool registration looks good.

The implementation correctly:

  • Checks the current message for URLs first (most relevant)
  • Falls back to conversation history detection
  • Uses a wrapper to preserve the original function's metadata
  • Applies appropriate retry configuration

510-541: Good refinement of RAG enablement and system prompt conditioning.

The changes correctly:

  • Enable RAG when URLs are detected (since fetch_url may store content there)
  • Only add the attached_documents_note when actual documents (not just URLs) are present, preventing hallucinations

579-579: Good deduplication of URL sources.

The _added_source_urls set effectively prevents duplicate sources from being emitted to the UI when the same URL is returned by multiple tool invocations.

Also applies to: 692-695

src/backend/chat/tools/fetch_url.py (5)

30-60: Header randomization looks reasonable.

The pool of User-Agent strings provides basic mitigation against simple blocking mechanisms. The approach is straightforward and extensible.


99-134: URL detection implementation is functional.

The function correctly extracts URLs from both ui_messages and messages fields. The nested extract_urls_from_messages function contributes to complexity but keeps the logic encapsulated.


137-180: Retry logic with randomized headers is well-implemented.

The strategy correctly:

  • Retries only on header-related status codes (403, 429)
  • Uses fresh random headers on each attempt
  • Propagates other errors immediately
  • Includes defensive code for type-checker satisfaction

183-250: RAG storage implementation handles edge cases well.

Good practices:

  • Creates collection lazily when needed
  • Uses slugified filename to avoid API errors while preserving original URL for citations
  • Creates markdown attachment for pipeline visibility
  • Properly async for database operations

341-415: General URL handling and error paths are well-structured.

The implementation correctly:

  • Uses trafilatura with raw text fallback
  • Detects binary/PDF content for RAG routing
  • Returns structured error responses for different failure modes
  • Uses appropriate logging levels (warning for HTTP errors, exception for unexpected errors)
src/backend/chat/tests/tools/test_fetch_url.py (2)

52-95: Good URL detection test coverage.

Tests cover single URL, multiple URLs, no URLs, and None conversation cases effectively.

Consider adding a test for URL detection in conversation.messages (not just ui_messages) to ensure both paths are covered.


98-353: Comprehensive fetch_url test coverage.

The test suite covers:

  • URL not detected (security check)
  • docs.numerique.gouv.fr success/large/empty cases
  • HTML extraction via trafilatura
  • PDF handling with RAG storage
  • HTTP errors and timeouts

Consider adding a test for the retry mechanism (simulating 403 on first attempt, success on retry) to verify _get_with_retry behavior end-to-end.

coderabbitai bot reviewed Dec 5, 2025
coderabbitai bot left a comment

Actionable comments posted: 3

Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)

28-31: Regex character range is overly permissive.

The character class [$-_@.&+] creates a range from $ (ASCII 36) to _ (ASCII 95), which unintentionally includes digits, uppercase letters, and various punctuation. Consider using an explicit character class or a well-tested URL regex pattern.

```diff
-URL_PATTERN = re.compile(
-    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
-)
+URL_PATTERN = re.compile(
+    r'https?://[^\s<>"\')\]}>]+'
+)
```

Alternatively, use a more robust approach with urllib.parse for validation after extraction.

Nitpick comments (4)
src/backend/chat/tools/fetch_url.py (4)

65-99: Consider reducing cognitive complexity.

Static analysis flags this function for high cognitive complexity (24 vs 15 allowed). The nested conditionals for different message types can be simplified.

```diff
+def _extract_text_from_ui_message(message) -> list[str]:
+    """Extract text parts from a UIMessage object."""
+    parts = []
+    for part in getattr(message, 'parts', None) or []:
+        if isinstance(part, TextUIPart) and part.text:
+            parts.append(part.text)
+    if content := getattr(message, 'content', None):
+        parts.append(content)
+    return parts
+
+
+def _extract_text_from_dict(message: dict) -> list[str]:
+    """Extract text parts from a dict-based message."""
+    parts = []
+    for part in message.get('parts', []):
+        if isinstance(part, dict) and part.get('type') == 'text' and part.get('text'):
+            parts.append(part['text'])
+    if content := message.get('content'):
+        parts.append(content)
+    return parts
+
+
 def _extract_text_from_message(message) -> str:
-    ...
+    if hasattr(message, 'parts'):
+        text_parts = _extract_text_from_ui_message(message)
+    elif isinstance(message, dict):
+        text_parts = _extract_text_from_dict(message)
+    else:
+        text_parts = []
+    return ' '.join(text_parts)
```

124-126: Remove redundant regex search.

URL_PATTERN.search() is called before findall(), but findall() already returns an empty list if there are no matches. This adds unnecessary overhead.

```diff
-        if text_content and URL_PATTERN.search(text_content):
-            matches = URL_PATTERN.findall(text_content)
-            found_urls.update(matches)
+        if text_content:
+            found_urls.update(URL_PATTERN.findall(text_content))
```

140-183: Consider adding delay for rate-limit retries.

The retry logic correctly handles 403/429 status codes, but retrying immediately after a 429 (rate limit) may not help. Consider adding a small delay (e.g., asyncio.sleep(1)) before retrying on 429.
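A pause-before-retry loop along those lines could be sketched as below. `fetch` is any coroutine returning an object with a `.status_code` attribute, and the exponential delays are an assumption for illustration, not the PR's actual values.

```python
import asyncio

RETRYABLE_STATUSES = {403, 429}


async def get_with_backoff(fetch, url, max_attempts=3, base_delay=1.0):
    """Retry on 403/429; sleep with exponential backoff before retrying a 429."""
    for attempt in range(max_attempts):
        response = await fetch(url)
        if response.status_code not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
            return response
        if response.status_code == 429:
            # Back off before retrying a rate-limited request.
            await asyncio.sleep(base_delay * 2 ** attempt)
    return response
```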


256-418: Consider extracting helper functions to reduce complexity.

This function has high cognitive complexity (23 vs 15 allowed) due to handling multiple content types and special cases. Consider extracting the docs.numerique.gouv.fr logic and general HTTP fetch logic into separate async helper functions.

Suggested structure:

```python
async def _fetch_docs_numerique(client, url, conversation, user) -> ToolReturn | None:
    """Handle docs.numerique.gouv.fr URLs. Returns None if not applicable."""
    ...


async def _fetch_general_url(client, url, conversation, user) -> ToolReturn:
    """Handle general URL fetching with trafilatura extraction."""
    ...


async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
    # Security check
    # Try docs.numerique.gouv.fr handler
    # Fall back to general handler
    ...
```
Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Commits

Reviewing files that changed from the base of the PR and between f50453b and 8a97448.

Files selected for processing (1)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (3)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/models.py (1)
  • ChatConversationAttachment (78-133)
src/backend/core/file_upload/enums.py (1)
  • AttachmentStatus (9-22)
GitHub Check: CodeQL
src/backend/chat/tools/fetch_url.py

[warning] 30-30: Overly permissive regular expression range
Suspicious character range that is equivalent to [$%&'()*+,-./0-9:;<=>?@a-z\[\\]^_].

GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[warning] 30-30: Remove duplicates in this character class.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_1&open=AZrvDBPCJB0kz5OXAO_1&pullRequest=195


[warning] 30-30: Remove or rework this redundant alternative.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_2&open=AZrvDBPCJB0kz5OXAO_2&pullRequest=195


[warning] 30-30: Replace this character class by the character itself.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_y&open=AZrvDBPCJB0kz5OXAO_y&pullRequest=195


[failure] 65-65: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195


[failure] 102-102: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_6&open=AZrvDBPCJB0kz5OXAO_6&pullRequest=195


[failure] 256-256: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195


[warning] 30-30: Remove duplicates in this character class.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_0&open=AZrvDBPCJB0kz5OXAO_0&pullRequest=195


[warning] 30-30: Use concise character class syntax '\d' instead of '[0-9]'.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_3&open=AZrvDBPCJB0kz5OXAO_3&pullRequest=195


[warning] 30-30: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_z&open=AZrvDBPCJB0kz5OXAO_z&pullRequest=195

Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test-e2e-chromium
  • GitHub Check: test-back
  • GitHub Check: build-and-push-backend
  • GitHub Check: build-and-push-frontend
Additional comments (2)
src/backend/chat/tools/fetch_url.py (2)

33-63: LGTM!

The header randomization approach is reasonable for mitigating basic bot detection.


275-279: Good security practice: URL validation.

The check ensuring the URL was detected in the conversation before fetching is a solid security measure that prevents arbitrary URL fetching via the tool.

coderabbitai bot reviewed Dec 5, 2025
coderabbitai bot left a comment

Actionable comments posted: 0

Duplicate comments (2)
src/backend/chat/tools/fetch_url.py (2)

187-254: Blocking I/O in async function.

This async function contains multiple synchronous blocking I/O operations that will block the event loop:

  1. Lines 226-234: document_store.parse_document() and document_store.store_document() make synchronous HTTP requests to the Albert API using requests.post().
  2. Line 250: default_storage.save() performs synchronous file I/O.

Under load, these blocking calls will degrade performance and delay other concurrent requests.

Wrap these blocking calls with sync_to_async:

```diff
+from asgiref.sync import sync_to_async
 ...
-    parsed_content = document_store.parse_document(
+    parsed_content = await sync_to_async(document_store.parse_document)(
         name=safe_rag_name,
         content_type=content_type,
         content=BytesIO(content_bytes),
     )

-    document_store.store_document(
+    await sync_to_async(document_store.store_document)(
         name=url,
         content=parsed_content
     )
 ...
-    default_storage.save(key, ContentFile(parsed_content.encode("utf8")))
+    await sync_to_async(default_storage.save)(key, ContentFile(parsed_content.encode("utf8")))
```

345-347: Blocking HTML extraction in async function.

trafilatura.extract() is a synchronous operation that parses and extracts content from HTML. For large documents, this can block the event loop and degrade performance under concurrent load.

Wrap the extraction call to run off the event loop:

+from asgiref.sync import sync_to_async
...
- extracted = trafilatura.extract(response.text) or response.text
+ extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text
Nitpick comments (4)
src/backend/chat/tools/fetch_url.py (4)

30-32: Simplify the regex subpattern.

The grouped subpattern (?:%[0-9a-fA-F]{2}) can be simplified to %[0-9a-fA-F]{2} since the group is not captured or referenced.

Apply this diff:

URL_PATTERN = re.compile(
- r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+ r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9a-fA-F]{2})+'
)

66-100: Consider refactoring to reduce cognitive complexity.

The function handles multiple message formats with nested conditionals, resulting in high cognitive complexity. Extracting helper functions for UIMessage and dict handling would improve readability.

Example refactor:

def _extract_text_from_message(message) -> str:
"""Extract all text content from a message."""
if hasattr(message, 'parts'):
return _extract_from_ui_message(message)
elif isinstance(message, dict):
return _extract_from_dict_message(message)
return ''

def _extract_from_ui_message(message) -> str:
"""Extract text from UIMessage (Pydantic model)."""
text_parts = []
for part in message.parts or []:
if isinstance(part, TextUIPart) and part.text:
text_parts.append(part.text)
if hasattr(message, 'content') and message.content:
text_parts.append(message.content)
return ' '.join(text_parts)

def _extract_from_dict_message(message: dict) -> str:
"""Extract text from dict-based message."""
text_parts = []
for part in message.get('parts', []):
if isinstance(part, dict) and part.get('type') == 'text':
if text := part.get('text', ''):
text_parts.append(text)
if content := message.get('content', ''):
text_parts.append(content)
return ' '.join(text_parts)

103-138: Consider simplifying the URL detection logic.

The nested function and multiple conditional checks contribute to elevated cognitive complexity. Flattening the structure would improve maintainability.

Example refactor:

def detect_url_in_conversation(conversation) -> list[str]:
"""Detect URLs present in the conversation messages."""
if not conversation:
return []

found_urls = set()

# Collect all message sources
message_sources = []
if hasattr(conversation, 'ui_messages') and conversation.ui_messages:
message_sources.append(conversation.ui_messages)
if hasattr(conversation, 'messages') and conversation.messages:
message_sources.append(conversation.messages)

# Extract URLs from all sources
for messages in message_sources:
for message in messages or []:
if message:
text_content = _extract_text_from_message(message)
if text_content:
found_urls.update(URL_PATTERN.findall(text_content))

if found_urls:
logger.info("URL detected in messages: %s", found_urls)

return list(found_urls)

257-419: Consider extracting helper functions to reduce complexity.

The fetch_url function handles multiple concerns: URL validation, special-case docs.numerique.gouv.fr handling, general URL fetching, content classification, and RAG storage decisions. Extracting helpers for docs.numerique.gouv.fr handling and content processing would improve maintainability.

Example structure:

async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
"""Main entry point - validates and delegates."""
deps = getattr(ctx, "deps", None)
conversation = getattr(deps, "conversation", None)
user = getattr(deps, "user", None)

# Validate URL was detected
if url not in detect_url_in_conversation(conversation):
return _url_not_detected_response(url)

try:
# Special case handling
if DOCS_HOST in url and "/docs/" in url:
return await _fetch_docs_numerique(url, conversation, user)

# General URL handling
return await _fetch_general_url(url, conversation, user)

except httpx.HTTPStatusError as e:
return _http_error_response(url, e)
except httpx.TimeoutException as e:
return _timeout_response(url, e)
except Exception as e:
return _generic_error_response(url, e)

async def _fetch_docs_numerique(url: str, conversation, user) -> ToolReturn:
"""Handle docs.numerique.gouv.fr URLs."""
# Extract existing lines 285-343
...

async def _fetch_general_url(url: str, conversation, user) -> ToolReturn:
"""Handle general URLs."""
# Extract existing lines 345-395
...
Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Commits

Reviewing files that changed from the base of the PR and between 8a97448 and 1ffbd23.

Files selected for processing (1)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (4)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
  • collection_id (56-62)
src/backend/chat/models.py (1)
  • ChatConversationAttachment (78-133)
src/backend/core/models.py (1)
  • save (61-64)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[failure] 66-66: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195


[failure] 103-103: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_6&open=AZrvDBPCJB0kz5OXAO_6&pullRequest=195


[warning] 31-31: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvJnDmVLC96644tEIS&open=AZrvJnDmVLC96644tEIS&pullRequest=195


[failure] 257-257: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195

Additional comments (3)
src/backend/chat/tools/fetch_url.py (3)

34-64: LGTM!

The randomized header strategy is a pragmatic approach to mitigate blocking by rate-limiters or bot detection. The implementation is clean and easily extensible.


141-184: LGTM!

The retry logic is well-designed, targeting only header-related errors (403, 429) and using randomized headers on each attempt. The implementation correctly handles exceptions and provides useful debug logging.


276-280: LGTM! Good security validation.

The check ensures that only URLs actually detected in the conversation can be fetched, preventing potential abuse where an agent might be tricked into fetching arbitrary URLs.

camilleAND assigned qbey Dec 8, 2025
eliott07 mentioned this pull request Dec 8, 2025
3 tasks
qbey reviewed Dec 9, 2025
Member

qbey left a comment


I really think we should not add a new code path to manage URL content; URLs should be treated as documents, to avoid duplicating code and logic.

coderabbitai bot reviewed Dec 11, 2025

coderabbitai bot left a comment


Actionable comments posted: 2

Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)

330-334: trafilatura.extract() is still synchronous and blocks the event loop.

This call performs CPU-intensive HTML parsing synchronously within an async function. For large documents, this can block the event loop. Wrap with sync_to_async:

+from asgiref.sync import sync_to_async
...
# Avoid trafilatura on PDFs
if is_pdf:
extracted = ""
else:
- extracted = trafilatura.extract(response.text) or response.text
+ extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text
Nitpick comments (8)
src/backend/chat/tests/tools/test_fetch_url.py (1)

172-172: Remove unused mock_storage variables.

The mock_storage variables in the patch context managers on lines 172 and 278 are never used. Remove them to clean up the code.

Apply this diff:

with patch("chat.tools.fetch_url.import_string") as mock_import, patch(
"chat.tools.fetch_url.models.ChatConversationAttachment.objects.acreate", new_callable=AsyncMock
- ) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save") as mock_storage:
+ ) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save"):

Also applies to: 278-278

src/backend/chat/agent_rag/document_converter/markitdown.py (1)

33-42: Minor cleanup: Remove unnecessary seek attempt.

The code creates a fresh BytesIO from the extracted bytes, so the seek(0) attempt on the original BytesIO is unnecessary and the conditional if hasattr(content, 'seek') adds complexity without benefit.

Simplify:

if isinstance(content, BytesIO):
- # Read the BytesIO content to bytes
content_bytes = content.read()
- # Reset position if possible (though we create a new BytesIO anyway)
- content.seek(0) if hasattr(content, 'seek') else None
else:
- # content is already bytes
content_io = BytesIO(content)

The normalization logic is correct otherwise.

src/backend/chat/document_storage.py (1)

73-77: Remove unnecessary seek attempt.

Similar to the markitdown.py file, the seek(0) attempt is unnecessary since a new BytesIO is always created from the extracted bytes. This also makes the comment on line 77 obsolete.

Apply this diff:

if isinstance(content, BytesIO):
- # Read the BytesIO content to bytes
content_bytes = content.read()
- # Reset position if possible (though we create a new BytesIO anyway)
- content.seek(0) if hasattr(content, 'seek') else None
elif isinstance(content, bytes):
content_bytes = content
src/backend/chat/tools/fetch_url.py (5)

31-33: URL regex pattern has unnecessary grouping and overly permissive character class.

The regex has an unnecessarily grouped subpattern (?:...) around the entire character class, and the character range includes more characters than typical URL patterns need. Per the static analysis hint, consider simplifying:

URL_PATTERN = re.compile(
- r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+ r'https?://[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]+'
)

The percent-encoding alternation (?:%[0-9a-fA-F]{2}) is redundant since %, hex digits, and the encoded characters are already in the main class.
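The claimed equivalence is easy to check directly; both patterns below are copied from the diff above, and the sample URLs are arbitrary:

```python
import re

ORIGINAL = re.compile(
    r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
)
SIMPLIFIED = re.compile(
    r'https?://[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]+'
)

# '%' and the hex digits already sit in the character class, so the
# percent-encoding alternation never matches anything new.
samples = [
    "https://example.com/path?q=a%20b#frag",
    "http://host/a,b;c=(d)",
]
for sample in samples:
    assert ORIGINAL.fullmatch(sample) and SIMPLIFIED.fullmatch(sample)
```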


67-101: Consider reducing cognitive complexity by extracting handlers.

Static analysis flags cognitive complexity of 24 (limit 15). Extract separate handlers for each message type:

+def _extract_text_from_ui_message(message) -> list[str]:
+ """Extract text from a UIMessage Pydantic model."""
+ text_parts = []
+ for part in message.parts or []:
+ if isinstance(part, TextUIPart) and part.text:
+ text_parts.append(part.text)
+ if hasattr(message, 'content') and message.content:
+ text_parts.append(message.content)
+ return text_parts
+
+
+def _extract_text_from_dict_message(message: dict) -> list[str]:
+ """Extract text from a dict-based message."""
+ text_parts = []
+ for part in message.get('parts', []):
+ if isinstance(part, dict) and part.get('type') == 'text':
+ if text := part.get('text', ''):
+ text_parts.append(text)
+ if content := message.get('content', ''):
+ text_parts.append(content)
+ return text_parts
+
+
def _extract_text_from_message(message) -> str:
- """
- Extract all text content from a message.
- ...
- """
- text_parts = []
-
- # Handle UIMessage objects (Pydantic models)
+ """Extract all text content from a message."""
if hasattr(message, 'parts'):
- for part in message.parts or []:
- if isinstance(part, TextUIPart) and part.text:
- text_parts.append(part.text)
- # Also check the deprecated content field
- if hasattr(message, 'content') and message.content:
- text_parts.append(message.content)
- # Handle dict-based messages (JSON deserialized)
+ return ' '.join(_extract_text_from_ui_message(message))
elif isinstance(message, dict):
- # Check parts
- parts = message.get('parts', [])
- for part in parts:
- if isinstance(part, dict) and part.get('type') == 'text':
- text = part.get('text', '')
- if text:
- text_parts.append(text)
- # Check deprecated content field
- content = message.get('content', '')
- if content:
- text_parts.append(content)
-
- return ' '.join(text_parts)
+ return ' '.join(_extract_text_from_dict_message(message))
+ return ''

146-170: Consider adding a delay between retries for 429 responses.

When retrying on HTTP 429 (rate limit), immediately retrying without delay may not help. Consider adding a small backoff:

+import asyncio
...
if (not should_retry) or is_last_attempt:
raise
+
+ # Small delay before retry, especially useful for rate limits
+ await asyncio.sleep(0.5 * (attempt + 1))

This is optional since the User-Agent rotation may be sufficient for 403 cases.
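A self-contained sketch of that retry shape (the retryable statuses come from the PR; the exception type, delay factor, and header rotation are stand-in assumptions for the real httpx-based `_get_with_retry`):

```python
import asyncio
import random

RETRYABLE_STATUSES = {403, 429}  # header-related blocks worth retrying


async def get_with_retry(fetch, url: str, attempts: int = 3, base_delay: float = 0.5):
    """Retry blocked requests with fresh headers and a linearly growing delay."""
    for attempt in range(attempts):
        # Rotate the User-Agent on every attempt.
        headers = {"User-Agent": f"agent-{random.randint(0, 999)}"}
        try:
            return await fetch(url, headers=headers)
        except RuntimeError as exc:  # stand-in for httpx.HTTPStatusError
            status = getattr(exc, "status", None)
            if status not in RETRYABLE_STATUSES or attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (attempt + 1))
```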


298-303: Hardcoded French strings limit internationalization.

Multiple French strings are embedded in ToolReturn values (lines 278, 298-303, 319, 357-362). If these are displayed to users or if multi-language support is needed, consider using Django's translation utilities:

from django.utils.translation import gettext as _

content=_(
"Le contenu de ce document est volumineux et a ete indexe dans "
"la base de documents de la conversation. ..."
)

If French-only is intentional for this deployment, this can be deferred.

Also applies to: 357-362


264-320: Extract docs.numerique.gouv.fr handling to a dedicated function.

This 55-line special case significantly contributes to cognitive complexity (flagged at 20 vs 15 allowed). Per previous review feedback, extract to a separate async function:

+async def _fetch_docs_numerique(
+ url: str,
+ docs_id: str,
+ conversation,
+ user,
+) -> ToolReturn:
+ """Fetch content from docs.numerique.gouv.fr API."""
+ url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
+
+ async with httpx.AsyncClient(timeout=settings.FETCH_URL_TIMEOUT, follow_redirects=True) as client:
+ response = await _get_with_retry(client, url_transformed)
+ data = response.json()
+ content = data.get('content', '')
+
+ if not content:
+ return ToolReturn(
+ return_value={"url": url, "error": "Content empty or private", ...},
+ )
+ # ... rest of the logic
+
+
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
...
try:
m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
if m:
- docs_id = m.group(1)
- url_transformed = ...
- # ... 50+ lines
+ return await _fetch_docs_numerique(url, m.group(1), conversation, user)
Review details


Commits

Reviewing files that changed from the base of the PR and between 1ffbd23 and 92d4938.

Files selected for processing (10)
  • src/backend/chat/agent_rag/document_converter/markitdown.py (2 hunks)
  • src/backend/chat/clients/pydantic_ai.py (9 hunks)
  • src/backend/chat/document_storage.py (1 hunks)
  • src/backend/chat/tests/tools/test_fetch_url.py (1 hunks)
  • src/backend/chat/tools/__init__.py (1 hunks)
  • src/backend/chat/tools/document_summarize.py (2 hunks)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
  • src/backend/conversations/settings.py (1 hunks)
  • src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1 hunks)
  • src/frontend/apps/conversations/src/i18n/translations.json (4 hunks)
Additional context used
Code graph analysis (5)
src/backend/chat/document_storage.py (4)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
  • collection_id (56-62)
src/backend/chat/models.py (1)
  • ChatConversationAttachment (78-133)
src/backend/core/models.py (1)
  • save (61-64)
src/backend/core/file_upload/enums.py (1)
  • AttachmentStatus (9-22)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
  • create_markdown_attachment (105-142)
  • ensure_collection_exists (23-40)
  • store_document_in_rag (43-102)
src/backend/chat/tools/__init__.py (1)
src/backend/chat/tools/fetch_url.py (1)
  • fetch_url (236-401)
src/backend/chat/clients/pydantic_ai.py (3)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/document_storage.py (1)
  • store_document_with_attachment (145-201)
src/backend/chat/tools/fetch_url.py (2)
  • fetch_url (236-401)
  • detect_url_in_conversation (104-130)
src/backend/chat/tests/tools/test_fetch_url.py (1)
src/backend/chat/tools/fetch_url.py (1)
  • fetch_url (236-401)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[failure] 67-67: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195


[failure] 236-236: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195


[warning] 32-32: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvJnDmVLC96644tEIS&open=AZrvJnDmVLC96644tEIS&pullRequest=195

src/backend/chat/tests/tools/test_fetch_url.py

[warning] 172-172: Remove the unused local variable "mock_storage".

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBMMJB0kz5OXAO_v&open=AZrvDBMMJB0kz5OXAO_v&pullRequest=195


[warning] 278-278: Remove the unused local variable "mock_storage".

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBMMJB0kz5OXAO_w&open=AZrvDBMMJB0kz5OXAO_w&pullRequest=195

Additional comments (14)
src/backend/chat/tools/document_summarize.py (1)

26-33: LGTM - Smart URL preference for document sources.

The logic correctly prioritizes the original URL (from conversion_from) when available for fetched content, which improves source traceability in summaries.

src/backend/chat/tools/__init__.py (1)

6-6: Clarify fetch_url import usage.

The fetch_url function is imported but not registered in get_pydantic_tools_by_name(). Based on the PR description, this tool is registered dynamically when URLs are detected in conversations.

If the import is solely for package-level access (e.g., from chat.tools import fetch_url), this is fine. However, if there's an expectation that it should be available via get_pydantic_tools_by_name(), it needs to be added to the tool_dict.

Can you confirm that dynamic registration (outside this module) is the intended usage pattern for fetch_url? If so, consider adding a comment explaining why it's imported but not in the registry:

from .fetch_url import fetch_url # Dynamically registered; not in static registry
src/frontend/apps/conversations/src/i18n/translations.json (1)

107-107: LGTM - Translation entries added correctly.

The new "Fetching URL..." translation has been added consistently across all four supported languages (FR, NL, RU, UK), aligning with the new fetch_url tool UI feedback.

Also applies to: 240-240, 373-373, 506-506

src/backend/conversations/settings.py (1)

844-849: Verify that 5-second timeout is sufficient for URL fetching.

The default FETCH_URL_TIMEOUT of 5 seconds may be too aggressive for:

  • Large documents (PDFs, long HTML pages)
  • Slow servers or high-latency networks
  • API endpoints that need to generate content before responding

Consider whether:

  1. This timeout applies only to connection establishment (appropriate) or the entire response (potentially too short)
  2. A higher default (e.g., 10-15 seconds) would be more reliable for production use
  3. The timeout should differentiate between connection and read operations

Review the fetch_url implementation to confirm timeout behavior and test with typical large documents to validate the 5-second default.
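One way to express point 3 with httpx is a split timeout configuration (a sketch only; the values are illustrative, not the PR's settings):

```python
import httpx

# Keep connection establishment strict, but give large PDFs time to stream:
# 5 s default, 5 s to connect, 20 s to read the response body.
timeout = httpx.Timeout(5.0, connect=5.0, read=20.0)
client_kwargs = {"timeout": timeout, "follow_redirects": True}
```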

src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1)

794-826: LGTM - Improved active tool detection logic.

The refactored logic correctly:

  • Identifies all tool invocations (excluding document_parsing)
  • Finds the most recent active invocation (state !== 'result')
  • Handles the new fetch_url tool with appropriate UI feedback
  • Maintains fallback behavior for other search tools

This approach properly handles multiple concurrent tool invocations by showing the status of the one currently in progress.

src/backend/chat/document_storage.py (1)

138-138: Verify thread-safety of synchronous storage call in async function.

The synchronous default_storage.save() call is made within an async function. While Django's storage backends are generally thread-safe, this could block the async event loop.

Consider one of the following:

  1. If the storage backend supports async operations, use await sync_to_async(default_storage.save)(...)
  2. Verify that the storage operations are fast enough that blocking is acceptable
  3. Document why synchronous storage access is safe in this context

Additionally, ensure proper error handling if storage.save() fails, as this would leave the attachment in an inconsistent state (database record exists but file doesn't).

src/backend/chat/tools/fetch_url.py (4)

35-65: LGTM!

The randomized User-Agent pool is a reasonable approach for mitigating simple header-based blocking.


104-130: LGTM!

The URL detection logic correctly handles both message formats and returns unique URLs.


179-233: LGTM - refactored to use async document storage utilities.

The function now properly delegates to store_document_in_rag and create_markdown_attachment from the document_storage module, addressing previous blocking I/O concerns at this layer.


378-401: LGTM - comprehensive error handling.

The error handling covers HTTP errors, timeouts, and unexpected exceptions appropriately. Each error type is logged at the correct level.

src/backend/chat/clients/pydantic_ai.py (4)

61-78: LGTM - imports support new functionality.

The new imports correctly bring in the document storage utility and fetch_url tool for dynamic registration.


253-291: LGTM - cleaner document handling with shared utility.

The refactoring to use store_document_with_attachment consolidates the document storage logic and properly handles BytesIO to bytes conversion. The attachment_key and conversion_from parameters preserve document provenance correctly.


490-521: LGTM - RAG enablement correctly extended for URL-fetched content.

The logic now enables RAG when URLs are detected (has_url_in_conversation), since fetch_url stores large content in RAG. The guard on attached_documents_note (lines 514-521) appropriately prevents misleading the model about attached documents when only URLs were fetched.


559-559: LGTM - source URL deduplication prevents UI clutter.

The _added_source_urls set correctly tracks emitted URLs to prevent duplicate SourceUIPart entries when multiple tool results reference the same URL.

Also applies to: 671-688
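The shape being approved is the standard seen-set dedup; the names below are assumptions mirroring `_added_source_urls`:

```python
def emit_source_parts(tool_results: list[dict]) -> list[dict]:
    """Emit one source part per distinct URL, preserving first-seen order."""
    added_source_urls: set[str] = set()
    parts: list[dict] = []
    for result in tool_results:
        url = result.get("url")
        if not url or url in added_source_urls:
            continue  # skip duplicates and results without a URL
        added_source_urls.add(url)
        parts.append({"type": "source", "url": url})
    return parts
```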

coderabbitai bot reviewed Dec 11, 2025

coderabbitai bot left a comment


Actionable comments posted: 3

Duplicate comments (2)
src/backend/chat/tools/fetch_url.py (2)

328-328: Wrap trafilatura.extract() to avoid blocking the event loop.

trafilatura.extract() is synchronous and performs CPU-intensive HTML parsing. For large documents, this blocks the event loop. This was flagged in a previous review but remains unaddressed.

+from asgiref.sync import sync_to_async
...
- extracted = trafilatura.extract(response.text) or response.text
+ extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text

258-314: Extract docs.numerique.gouv.fr handling to a dedicated function.

This special-case logic spans 55+ lines and increases cognitive complexity. Based on learnings, this should be extracted to a separate function to improve maintainability and reduce the complexity of fetch_url().

Apply this pattern:

+async def _fetch_docs_numerique(
+ url: str,
+ docs_id: str,
+ conversation,
+ user,
+) -> ToolReturn:
+ """Fetch content from docs.numerique.gouv.fr via their API."""
+ url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
+ # ... rest of docs-specific logic
+
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
...
try:
- # Special handling for docs.numerique.gouv.fr
- m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
- if m:
- docs_id = m.group(1)
- url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
- ...
+ m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
+ if m:
+ return await _fetch_docs_numerique(url, m.group(1), conversation, user)

Based on learnings, this refactor addresses the maintainer's feedback.

Review details


Commits

Reviewing files that changed from the base of the PR and between 92d4938 and 6f51b62.

Files selected for processing (1)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
  • create_markdown_attachment (105-142)
  • ensure_collection_exists (23-40)
  • store_document_in_rag (43-102)
GitHub Actions: Main Workflow
src/backend/chat/tools/fetch_url.py

[warning] 352-352: typo: 'ressource' -> 'resource' (codespell)


[warning] 353-353: typo: 'exemple' -> 'example' (codespell)


[warning] 356-356: typo: 'informations' -> 'information' (codespell)

GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[warning] 32-32: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvJnDmVLC96644tEIS&open=AZrvJnDmVLC96644tEIS&pullRequest=195


[failure] 67-67: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195


[failure] 230-230: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195

Additional comments (5)
src/backend/chat/tools/fetch_url.py (5)

35-65: LGTM!

The randomized header approach is well-implemented and helps mitigate blocking. The documentation clearly states the extensibility path for future enhancements.


133-176: LGTM!

The retry mechanism is well-designed with appropriate conditions for header-related blocks (403, 429) and good logging for observability.


179-227: LGTM!

The RAG storage and attachment creation are properly implemented with async/await patterns. The two-step process (safe filename for parsing, original URL for storage metadata) correctly handles API constraints.


272-272: Clarify if French UI strings are intentional.

The error message "Ce document Docs n'est pas public ou est vide." is in French. Verify if this is intentional for a French-language deployment or if internationalization should be used.


302-309: The inline content strategy is already properly implemented with appropriate size thresholds.

The current code routes large content (> 8000 chars) through RAG while keeping small content (<= 8000 chars) inline. At lines 276-300 and 334-360, content exceeding MAX_INLINE_CONTENT_CHARS is stored in RAG with a preview returned instead. Lines 302-309 and 363-371 return small content inline with explicit comments confirming this is the intended behavior: "Small textual content: keep the existing behaviour with inline content."

This threshold-based approach prevents context bloating for large documents while maintaining responsiveness for small content, and does not require changes.
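The routing described above reduces to a simple size check; this sketch uses the 8000-character threshold from the PR description, with assumed field names:

```python
MAX_INLINE_CONTENT_CHARS = 8000  # threshold quoted in the PR description


def route_content(content: str, is_binary: bool = False) -> dict:
    """Return small text inline; flag large or binary content for RAG storage."""
    if is_binary or len(content) > MAX_INLINE_CONTENT_CHARS:
        return {"stored_in_rag": True, "preview": content[:500]}
    return {"stored_in_rag": False, "content": content}
```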

Comment on lines +247 to +255
messages_for_detection = getattr(deps, "messages", None)
urls = detect_url_in_conversation(messages_for_detection)
logger.info("URLs authorized (extracted from messages): %s", urls)

# If messages are provided, enforce URL presence; otherwise skip the check.
if messages_for_detection is not None and url not in urls:
return ToolReturn(
return_value={"url": url, "error": "URL not detected in conversation", "content" : f"URL {url} not detected in conversation"},
)
coderabbitai bot Dec 11, 2025

Potential issue | Major

Use pydantic_messages for URL detection instead of frontend messages.

Line 247 retrieves messages from deps, but based on learnings, this field is Vercel SDK formatted for frontend use. The URL detection should use Pydantic AI messages from pydantic_messages instead.

Additionally, line 254 returns an error with a "content" field, which was questioned in past reviews as potentially incorrect for the expected return structure.

Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 247-255, the code reads
frontend-formatted messages via deps.messages for URL detection and returns an
error including a "content" field; replace deps.messages with
deps.pydantic_messages (the Pydantic AI message list) when calling
detect_url_in_conversation so detection uses the correct format, and update the
error return to match the ToolReturn schema (remove or rename the "content" key
to the expected field, e.g., keep only return_value with url and error or use
the proper message field) so the response structure is consistent with callers.

Comment on lines +351 to +356
"content":(
"Le contenu de cette ressource est volumineux ou de type fichier "
"(par exemple PDF). Il a ete indexe dans la base de documents de la "
"conversation. Pour l'exploiter, tu dois utiliser l'outil "
"`document_search_rag` avec une requete precise decrivant les "
"informations que tu souhaites retrouver."
coderabbitai bot Dec 11, 2025

Potential issue | Minor

Fix French typos flagged by pipeline.

The pipeline reports typos in the French instruction message:

  • "ressource" - "resource" (line 352)
  • "exemple" - "example" (line 353)
  • "informations" - "information" (line 356)

However, these are valid French words. If the message is intentionally in French, suppress the codespell warning or configure it to skip French strings. If this should be English, translate the message.

Tools
GitHub Actions: Main Workflow

[warning] 352-352: typo: 'ressource' -> 'resource' (codespell)


[warning] 353-353: typo: 'exemple' -> 'example' (codespell)


[warning] 356-356: typo: 'informations' -> 'information' (codespell)

Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 351 to 356, the pipeline
flagged typos in a French user-facing message; decide whether the string should
remain French or be English and implement one of the two fixes: if the message
is intentionally French, add a codespell suppression (e.g., wrap the string with
a comment or marker used by the repo to ignore codespell or add the specific
word exceptions in the codespell config) so the linter skips it; otherwise
translate the French string into clear English and replace the French text
accordingly (ensure punctuation and backticks around tool names are preserved).

coderabbitai bot reviewed Dec 11, 2025
coderabbitai bot left a comment

Actionable comments posted: 2

Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)

244-256: Use pydantic_messages for URL detection instead of frontend messages.

Line 248 retrieves deps.messages (Vercel SDK format for frontend), but per past review feedback, proper conversation history is in deps.pydantic_messages (Pydantic AI format). URL detection should use the correct message source to ensure reliable detection across the conversation history.

Additionally, the "content" field in the error return (line 255) was questioned in past reviews as potentially incorrect for the expected return structure.

Based on learnings, as per past review feedback from maintainers.

Apply this diff:

- messages_for_detection = getattr(deps, "messages", None)
+ messages_for_detection = getattr(deps, "pydantic_messages", None)
urls = detect_url_in_conversation(messages_for_detection)
logger.info("URLs authorized (extracted from messages): %s", urls)

# If messages are provided, enforce URL presence; otherwise skip the check.
if messages_for_detection is not None and url not in urls:
return ToolReturn(
- return_value={"url": url, "error": "URL not detected in conversation", "content" : f"URL {url} not detected in conversation"},
+ return_value={"url": url, "error": "URL not detected in conversation"},
)
Nitpick comments (2)
src/backend/chat/tools/fetch_url.py (2)

32-34: Consider simplifying the regex subpattern.

The regex contains an unnecessarily grouped subpattern (?:%[0-9a-fA-F]{2}) that could be unwrapped to %[0-9a-fA-F]{2} without changing behavior, making it slightly more readable.

Apply this diff:

URL_PATTERN = re.compile(
- r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+ r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9a-fA-F]{2})+'
)
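As a quick check that the unwrapped pattern behaves identically:

```python
import re

# Simplified pattern with the redundant grouping unwrapped, as suggested.
URL_PATTERN = re.compile(
    r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9a-fA-F]{2})+'
)

matches = URL_PATTERN.findall("visit https://example.com/page now")
```

Note that `%` already appears inside the first character class, so the `%[0-9a-fA-F]{2}` alternative is arguably redundant altogether; keeping it is harmless, but the pattern would match the same strings without it.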

231-397: Consider reducing cognitive complexity.

Static analysis flags this function's cognitive complexity at 20 (threshold 15). While the logic is correct and handles multiple content types and error cases appropriately, consider extracting helper functions for:

  • Docs.numerique.gouv.fr special handling (lines 259-315)
  • General URL content processing (lines 317-373)

This would improve maintainability and testability.

This is optional given the "Chill" review mode, but would make the code easier to maintain long-term.

Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Commits

Reviewing files that changed from the base of the PR and between 6f51b62 and 4c25584.

Files selected for processing (1)
  • src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
  • TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
  • create_markdown_attachment (105-142)
  • ensure_collection_exists (23-40)
  • store_document_in_rag (43-102)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py

[failure] 231-231: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_7&open=AZrvDBPCJB0kz5OXAO_7&pullRequest=195


[warning] 33-33: Unwrap this unnecessarily grouped subpattern.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvJnDmVLC96644tEIS&open=AZrvJnDmVLC96644tEIS&pullRequest=195


[failure] 68-68: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195

Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test-e2e-chromium
  • GitHub Check: test-back
  • GitHub Check: build-and-push-backend
  • GitHub Check: build-and-push-frontend
Additional comments (5)
src/backend/chat/tools/fetch_url.py (5)

36-66: LGTM!

The randomized User-Agent strategy is a good approach to mitigate blocking on header-sensitive endpoints.


105-131: LGTM conditionally.

The URL detection logic is sound. However, this function depends on _extract_text_from_message, which needs to be refactored to use Pydantic AI messages (see comment on lines 68-102). Once that refactor is complete, this function will automatically use the correct message format.


134-177: LGTM!

The retry logic is well-implemented with appropriate error handling and logging. The selective retry on 403/429 status codes with randomized headers is a good strategy to handle rate limiting and access restrictions.


180-228: LGTM!

The RAG storage implementation correctly handles the filename vs metadata issue with clear separation between safe parsing names and original URLs for citations. The async integration is proper.


374-397: LGTM!

Comprehensive error handling with appropriate logging levels for different error types. The structured error returns maintain consistency with the ToolReturn format.

Comment on lines +68 to +102
def _extract_text_from_message(message) -> str:
"""
Extract all text content from a message.

Args:
message: A message object (UIMessage or dict).

Returns:
str: All text content concatenated.
"""
text_parts = []

# Handle UIMessage objects (Pydantic models)
if hasattr(message, 'parts'):
for part in message.parts or []:
if isinstance(part, TextUIPart) and part.text:
text_parts.append(part.text)
# Also check the deprecated content field
if hasattr(message, 'content') and message.content:
text_parts.append(message.content)
# Handle dict-based messages (JSON deserialized)
elif isinstance(message, dict):
# Check parts
parts = message.get('parts', [])
for part in parts:
if isinstance(part, dict) and part.get('type') == 'text':
text = part.get('text', '')
if text:
text_parts.append(text)
# Check deprecated content field
content = message.get('content', '')
if content:
text_parts.append(content)

return ' '.join(text_parts)
coderabbitai bot Dec 11, 2025

Potential issue | Major

Refactor to extract text from Pydantic AI messages instead of frontend messages.

This function is designed to parse UIMessage objects and frontend-formatted dicts, but per past review feedback, the proper conversation history is in pydantic_messages (Pydantic AI format), not the Vercel SDK messages field. The deprecated content field checks (lines 86-87, 98-100) also add unnecessary complexity.

Based on learnings, messages is for frontend use and ui_messages will be removed. The function should be refactored to work with Pydantic AI message format instead.

This change would also reduce the cognitive complexity (currently 24, threshold 15) flagged by static analysis.

Tools
GitHub Check: SonarCloud Code Analysis

[failure] 68-68: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=suitenumerique_conversations&issues=AZrvDBPCJB0kz5OXAO_5&open=AZrvDBPCJB0kz5OXAO_5&pullRequest=195

Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 68 to 102, the function
currently extracts text from frontend/UI message shapes (UIMessage and dict
parts) and checks deprecated content fields; refactor it to read from the
Pydantic AI message format (pydantic_messages) instead: accept the Pydantic
message or list of Pydantic messages, iterate those models' canonical text
fields (e.g., parts or the model's text/content attributes used by the pydantic
schema), remove handling of frontend dict shapes and deprecated content checks,
and return the concatenated text; ensure type checks use the Pydantic model
classes/attributes, handle optional/missing fields defensively, and keep the
function small and focused to reduce cognitive complexity.
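A minimal sketch of the suggested refactor, using stand-in dataclasses that only mimic the `parts`/`content` shape of Pydantic AI messages (the real types, e.g. `TextPart` and `UserPromptPart`, live in `pydantic_ai.messages` and may differ in detail):

```python
from dataclasses import dataclass, field


# Stand-in classes for illustration only; not the pydantic_ai types.
@dataclass
class TextPart:
    content: str


@dataclass
class ModelMessage:
    parts: list = field(default_factory=list)


def extract_text(messages) -> str:
    """Concatenate every string `content` found in every message part."""
    chunks = []
    for message in messages:
        for part in getattr(message, "parts", None) or []:
            content = getattr(part, "content", None)
            if isinstance(content, str) and content:
                chunks.append(content)
    return " ".join(chunks)
```

Dropping the frontend dict handling and deprecated `content` checks is what brings the cognitive complexity back under the threshold.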


if not content:
return ToolReturn(
return_value={"url": url, "error": "Content empty or private", "content": "Ce document Docs n'est pas public ou est vide."},
coderabbitai bot Dec 11, 2025

Potential issue | Minor

French string flagged by codespell.

The French error message is intentional but will cause codespell warnings. Consider either translating to English or adding codespell suppressions for these strings. This also applies to other French strings in this function (lines 293-298, 314, 353-358).

If these messages should remain French, add a codespell ignore comment or configure the project to skip French strings. If they should be English, translate them.

Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around line 273 (and similarly lines
293-298, 314, 353-358), French error strings (e.g., "Ce document Docs n'est pas
public ou est vide.") trigger codespell warnings; either translate these
messages to English consistently (replace French texts with their English
equivalents) or add a codespell suppression for these specific string literals
or configure the project to ignore French strings. Apply the chosen approach
consistently to all flagged lines and ensure any suppression is narrowly scoped
to these messages.


sonarqubecloud bot commented Dec 11, 2025

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud


Reviewers

qbey left review comments

coderabbitai[bot] left review comments

Assignees

qbey

2 participants