(fetch_url) add tool to fetch urls and store in rag #195
camilleAND wants to merge 14 commits into main
Conversation
Purpose
Add a fetch_url tool that lets the agent automatically fetch and process content from URLs detected in conversations. The tool is added dynamically when URLs are present, without prior configuration.
Proposal
- Add `detect_url_in_conversation()` to detect URLs in messages (`ui_messages` and `messages`)
- Implement `fetch_url()` with support for multiple content types (HTML, PDF, markdown)
- Special handling for docs.numerique.gouv.fr via their API
- HTML text extraction using trafilatura
- PDF support via Albert API for parsing
- Smart content handling: inline return for small content (< 8000 characters), RAG storage for large or binary content
- Retry mechanism with randomized headers (User-Agent) to avoid blocking (403, 429)
- Security: validate that the URL was detected in the conversation before fetching
- Dynamic integration: automatically add the tool to the agent when URLs are detected
- Storage in RAG backend and creation of markdown attachments for large documents
- Error handling (HTTP, timeout, exceptions)
- Test suite covering different use cases (detected/undetected URLs, different content types, errors)
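The URL-detection step above can be sketched with a small stdlib-only example. The `detect_urls` helper and the exact pattern are illustrative stand-ins, not the PR's actual `detect_url_in_conversation()` implementation:

```python
import re

# Illustrative pattern: http(s) followed by non-delimiter characters
URL_PATTERN = re.compile(r'https?://[^\s<>"\'\])]+')

def detect_urls(texts):
    """Collect unique URLs found across a list of message texts."""
    found = set()
    for text in texts or []:
        if text:
            found.update(URL_PATTERN.findall(text))
    return sorted(found)
```

Deduplicating through a set mirrors the PR's behavior of reporting each detected URL once, regardless of how many messages mention it.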
Summary by CodeRabbit
- New Features
  - On-demand URL detection and fetching in conversations with fetched pages/PDFs stored as attachments, a new "Fetching URL..." UI state, and a configurable fetch timeout; translations added.
- Bug Fixes
  - Prevent duplicate URL sources in responses.
  - Skip document search when no document collection exists.
- Tests
  - Added comprehensive tests for URL detection, fetching, parsing, storage, and error scenarios.
Walkthrough
Adds a new `fetch_url` tool.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Agent as Pydantic AI Agent
    participant ToolReg as Tool Registry
    participant FetchURL as fetch_url Tool
    participant HTTP as Remote HTTP
    participant Parser as Trafilatura/Parser
    participant RAG as Document Store (RAG)
    participant UI as Chat UI
    User->>Agent: send message (may contain URL)
    Agent->>Agent: detect URLs in conversation
    alt URLs found
        Agent->>ToolReg: register/wrap fetch_url dynamically
        Agent->>FetchURL: invoke fetch_url(url)
        FetchURL->>HTTP: GET (random UA, retry on 403/429)
        HTTP-->>FetchURL: response (HTML / PDF / binary / error)
        alt HTML & small
            FetchURL->>Parser: extract text
            Parser-->>FetchURL: extracted content
            FetchURL-->>Agent: inline content + metadata
        else large or binary or PDF
            FetchURL->>RAG: store parsed document, create markdown attachment
            RAG-->>FetchURL: document reference / attachment key
            FetchURL-->>Agent: attachment reference + metadata
        else error/timeout
            FetchURL-->>Agent: error result
        end
    end
    Agent-->>UI: final response with inline text / attachment / error
```

Estimated code review effort: 4 (Complex) | ~45 minutes
Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches: Passed checks (3 passed)
Actionable comments posted: 1
Nitpick comments (6)
src/backend/chat/tools/document_search_rag.py (1)
27-31: Move the guard before instantiating the document store.

The `collection_id` check on line 29 happens after the `document_store` is already instantiated on line 27. This is inefficient since the backend is created unnecessarily when there's no collection to search. Apply this diff to optimize the control flow:
```diff
-    document_store_backend = import_string(settings.RAG_DOCUMENT_SEARCH_BACKEND)
-
-    document_store = document_store_backend(ctx.deps.conversation.collection_id)
-
     if not ctx.deps.conversation.collection_id:
         return ToolReturn(return_value=[], content="No documents to search in yet.")
+    document_store_backend = import_string(settings.RAG_DOCUMENT_SEARCH_BACKEND)
+    document_store = document_store_backend(ctx.deps.conversation.collection_id)
+
     rag_results = document_store.search(query)
```

src/backend/chat/clients/pydantic_ai.py (1)
406-406: Accessing protected member `_function_toolset` is fragile.

This relies on an internal implementation detail of `pydantic_ai.Agent`. If the library updates its internals, this check will break silently or raise an AttributeError. Consider wrapping this in a try-except or using a more robust approach:
```diff
-fetch_url_exists = "fetch_url" in self.conversation_agent._function_toolset.tools  # pylint: disable=protected-access
+# Check if fetch_url tool already exists
+try:
+    fetch_url_exists = "fetch_url" in self.conversation_agent._function_toolset.tools  # pylint: disable=protected-access
+except AttributeError:
+    fetch_url_exists = False
```

Alternatively, consider maintaining a local set of dynamically added tool names to avoid relying on internals.
src/backend/chat/tools/__init__.py (1)
21-27: Clarify the purpose of this registry entry.

The comment states `fetch_url` is added dynamically in `pydantic_ai.py` and "won't be used via prepare". If so, this entry may be redundant or could cause confusion. Consider either:
- Removing this entry if it's truly unused (dynamic registration handles everything)
- Using a `prepare` function to conditionally enable it (similar to web search tools)
- Updating the comment to explain why it's kept here (e.g., for discoverability, testing, or future configuration-based enablement)
src/backend/chat/tools/fetch_url.py (2)

26-28: Simplify and fix the URL regex pattern.

Static analysis flagged several issues with this regex:
- `[$-_@.&+]` is an overly permissive range matching ASCII 36-95 (includes `%&'()*+,-./:;<=>?@` etc.)
- Duplicate characters and unnecessary grouping
- `[0-9]` should use `\d`

Consider using a simpler, more robust URL pattern:

```diff
-URL_PATTERN = re.compile(
-    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
-)
+URL_PATTERN = re.compile(
+    r'https?://[^\s<>"\'\])]+'
+)
```

Or for stricter matching:

```python
URL_PATTERN = re.compile(
    r'https?://(?:[\w\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9A-Fa-f]{2})+'
)
```
62-96: Consider simplifying `_extract_text_from_message` to reduce cognitive complexity.

Static analysis flags cognitive complexity of 24 (limit: 15). While the logic handles multiple message formats correctly, extracting helper functions could improve readability.

```diff
+def _extract_text_from_parts(parts, is_dict: bool) -> list[str]:
+    """Extract text from message parts."""
+    texts = []
+    for part in parts or []:
+        if is_dict:
+            if isinstance(part, dict) and part.get('type') == 'text' and part.get('text'):
+                texts.append(part['text'])
+        elif isinstance(part, TextUIPart) and part.text:
+            texts.append(part.text)
+    return texts
+
+
 def _extract_text_from_message(message) -> str:
-    ...
+    text_parts = []
+    if hasattr(message, 'parts'):
+        text_parts.extend(_extract_text_from_parts(message.parts, is_dict=False))
+        if hasattr(message, 'content') and message.content:
+            text_parts.append(message.content)
+    elif isinstance(message, dict):
+        text_parts.extend(_extract_text_from_parts(message.get('parts', []), is_dict=True))
+        if content := message.get('content', ''):
+            text_parts.append(content)
+    return ' '.join(text_parts)
```

src/backend/chat/tests/tools/test_fetch_url.py (1)
169-169: Remove unused `mock_storage` variable.

Static analysis correctly flags that `mock_storage` is captured but never used in assertions. If storage behavior doesn't need verification, use `_` to indicate intentional ignorance.

```diff
 with patch("chat.tools.fetch_url.import_string") as mock_import, patch(
     "chat.tools.fetch_url.models.ChatConversationAttachment.objects.acreate", new_callable=AsyncMock
-) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save") as mock_storage:
+) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save"):
```

Or add assertions on storage if that behavior should be verified.

Also applies to: 273-273
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (6)
- .gitignore (1 hunks)
- src/backend/chat/clients/pydantic_ai.py (7 hunks)
- src/backend/chat/tests/tools/test_fetch_url.py (1 hunks)
- src/backend/chat/tools/__init__.py (2 hunks)
- src/backend/chat/tools/document_search_rag.py (1 hunks)
- src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (5)
src/backend/chat/tools/fetch_url.py (2)
- src/backend/chat/ai_sdk_types.py (1): `TextUIPart` (137-147)
- src/backend/chat/models.py (1): `ChatConversationAttachment` (78-133)
src/backend/chat/tests/tools/test_fetch_url.py (4)
- src/backend/chat/factories.py (1): `ChatConversationFactory` (12-18)
- src/backend/chat/tools/fetch_url.py (2): `fetch_url` (253-415), `detect_url_in_conversation` (99-134)
- src/backend/core/factories.py (1): `UserFactory` (16-30)
- src/backend/chat/agent_rag/document_search/albert_api.py (1): `collection_id` (56-62)
src/backend/chat/clients/pydantic_ai.py (4)
- src/backend/chat/ai_sdk_types.py (1): `TextUIPart` (137-147)
- src/backend/chat/tools/fetch_url.py (2): `fetch_url` (253-415), `detect_url_in_conversation` (99-134)
- src/backend/chat/agent_rag/document_rag_backends/albert_rag_backend.py (1): `search` (216-257)
- src/backend/chat/agent_rag/document_search/albert_api.py (1): `search` (172-208)
src/backend/chat/tools/document_search_rag.py (1)
- src/backend/chat/agent_rag/document_search/albert_api.py (1): `collection_id` (56-62)
src/backend/chat/tools/__init__.py (1)
- src/backend/chat/tools/fetch_url.py (1): `fetch_url` (253-415)
GitHub Check: CodeQL
src/backend/chat/tools/fetch_url.py
[warning] 27-27: Overly permissive regular expression range
Suspicious character range that is equivalent to [$%&'()*+,-./0-9:;<=>?@a-z\[\\]^_].
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[failure] 62-62: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
[warning] 27-27: Replace this character class by the character itself.
[failure] 280-280: Define a constant instead of duplicating this literal "docs.numerique.gouv.fr" 3 times.
[warning] 27-27: Unwrap this unnecessarily grouped subpattern.
[failure] 99-99: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.
[warning] 27-27: Remove duplicates in this character class.
[warning] 27-27: Use concise character class syntax '\d' instead of '[0-9]'.
[failure] 253-253: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.
[warning] 27-27: Remove or rework this redundant alternative.
[warning] 27-27: Remove duplicates in this character class.
src/backend/chat/tests/tools/test_fetch_url.py
[warning] 273-273: Remove the unused local variable "mock_storage".
[warning] 169-169: Remove the unused local variable "mock_storage".
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-e2e-chromium
- GitHub Check: test-back
- GitHub Check: build-and-push-backend
- GitHub Check: build-and-push-frontend
Additional comments (11)
.gitignore (1)
83-84: Good practice to exclude development configuration files.

The addition of `src/backend/conversations/configuration/llm/default_dev.json` to .gitignore is appropriate, as development-specific configuration files (especially those containing local settings or credentials) should not be committed to version control.

src/backend/chat/clients/pydantic_ai.py (3)
395-414: URL detection and dynamic tool registration looks good.

The implementation correctly:
- Checks the current message for URLs first (most relevant)
- Falls back to conversation history detection
- Uses a wrapper to preserve the original function's metadata
- Applies appropriate retry configuration
510-541: Good refinement of RAG enablement and system prompt conditioning.

The changes correctly:
- Enable RAG when URLs are detected (since `fetch_url` may store content there)
- Only add the `attached_documents_note` when actual documents (not just URLs) are present, preventing hallucinations
579-579: Good deduplication of URL sources.

The `_added_source_urls` set effectively prevents duplicate sources from being emitted to the UI when the same URL is returned by multiple tool invocations.

Also applies to: 692-695
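The pattern behind that set can be shown with a tiny sketch; the `SourceEmitter` class and its `emitted` list are hypothetical scaffolding, and only the `_added_source_urls` name comes from the PR:

```python
class SourceEmitter:
    """Emit each source URL to the UI at most once (illustrative sketch)."""

    def __init__(self):
        self._added_source_urls = set()
        self.emitted = []  # stand-in for the stream of UI source events

    def add_source(self, url):
        """Return True and record the URL only on its first occurrence."""
        if url in self._added_source_urls:
            return False
        self._added_source_urls.add(url)
        self.emitted.append(url)
        return True
```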
src/backend/chat/tools/fetch_url.py (5)
30-60: Header randomization looks reasonable.

The pool of User-Agent strings provides basic mitigation against simple blocking mechanisms. The approach is straightforward and extensible.
99-134: URL detection implementation is functional.

The function correctly extracts URLs from both `ui_messages` and `messages` fields. The nested `extract_urls_from_messages` function contributes to complexity but keeps the logic encapsulated.
137-180: Retry logic with randomized headers is well-implemented.

The strategy correctly:
- Retries only on header-related status codes (403, 429)
- Uses fresh random headers on each attempt
- Propagates other errors immediately
- Includes defensive code for type-checker satisfaction
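The strategy described above can be sketched as follows; `get_with_retry`, the injected `fetch` callable, and the dict-shaped responses are assumptions for illustration, not the PR's actual `_get_with_retry` signature:

```python
import random

# Small illustrative User-Agent pool; the PR keeps a similar list
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
RETRYABLE_STATUSES = {403, 429}

def get_with_retry(fetch, url, max_attempts=3):
    """Call fetch(url, headers) with fresh random headers on each attempt,
    retrying only on 403/429; any other status is returned immediately."""
    response = None
    for _ in range(max_attempts):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = fetch(url, headers)
        if response["status"] not in RETRYABLE_STATUSES:
            return response
    return response
```

Injecting `fetch` keeps the sketch testable without a network; the real tool presumably performs the GET through its HTTP client.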
183-250: RAG storage implementation handles edge cases well.

Good practices:
- Creates collection lazily when needed
- Uses slugified filename to avoid API errors while preserving original URL for citations
- Creates markdown attachment for pipeline visibility
- Properly async for database operations
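The slugified-name idea can be illustrated with a stdlib-only sketch; the function name and slug rules here are assumptions (the actual code presumably uses Django's `slugify`), while keeping the original URL separate for citations matches the review's description:

```python
import re
from urllib.parse import urlparse

def rag_name_for(url):
    """Derive a backend-safe document name from a URL; the original URL
    is kept separately for citations (illustrative sketch)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse anything outside [a-z0-9] into single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return slug or "document"
```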
341-415: General URL handling and error paths are well-structured.

The implementation correctly:
- Uses trafilatura with raw text fallback
- Detects binary/PDF content for RAG routing
- Returns structured error responses for different failure modes
- Uses appropriate logging levels (`warning` for HTTP errors, `exception` for unexpected errors)

src/backend/chat/tests/tools/test_fetch_url.py (2)
52-95: Good URL detection test coverage.

Tests cover single URL, multiple URLs, no URLs, and None conversation cases effectively.
Consider adding a test for URL detection in `conversation.messages` (not just `ui_messages`) to ensure both paths are covered.
98-353: Comprehensive fetch_url test coverage.

The test suite covers:
- URL not detected (security check)
- docs.numerique.gouv.fr success/large/empty cases
- HTML extraction via trafilatura
- PDF handling with RAG storage
- HTTP errors and timeouts
Consider adding a test for the retry mechanism (simulating 403 on first attempt, success on retry) to verify `_get_with_retry` behavior end-to-end.
Actionable comments posted: 3
Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)
28-31: Regex character range is overly permissive.

The character class `[$-_@.&+]` creates a range from `$` (ASCII 36) to `_` (ASCII 95), which unintentionally includes digits, uppercase letters, and various punctuation. Consider using an explicit character class or a well-tested URL regex pattern.

```diff
-URL_PATTERN = re.compile(
-    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
-)
+URL_PATTERN = re.compile(
+    r'https?://[^\s<>"\')\]}>]+'
+)
```

Alternatively, use a more robust approach with `urllib.parse` for validation after extraction.
Nitpick comments (4)
src/backend/chat/tools/fetch_url.py (4)
65-99: Consider reducing cognitive complexity.

Static analysis flags this function for high cognitive complexity (24 vs 15 allowed). The nested conditionals for different message types can be simplified.

```diff
+def _extract_text_from_ui_message(message) -> list[str]:
+    """Extract text parts from a UIMessage object."""
+    parts = []
+    for part in getattr(message, 'parts', None) or []:
+        if isinstance(part, TextUIPart) and part.text:
+            parts.append(part.text)
+    if content := getattr(message, 'content', None):
+        parts.append(content)
+    return parts
+
+
+def _extract_text_from_dict(message: dict) -> list[str]:
+    """Extract text parts from a dict-based message."""
+    parts = []
+    for part in message.get('parts', []):
+        if isinstance(part, dict) and part.get('type') == 'text' and part.get('text'):
+            parts.append(part['text'])
+    if content := message.get('content'):
+        parts.append(content)
+    return parts
+
+
 def _extract_text_from_message(message) -> str:
-    ...
+    if hasattr(message, 'parts'):
+        text_parts = _extract_text_from_ui_message(message)
+    elif isinstance(message, dict):
+        text_parts = _extract_text_from_dict(message)
+    else:
+        text_parts = []
+    return ' '.join(text_parts)
```
124-126: Remove redundant regex search.

`URL_PATTERN.search()` is called before `findall()`, but `findall()` already returns an empty list if there are no matches. This adds unnecessary overhead.

```diff
-        if text_content and URL_PATTERN.search(text_content):
-            matches = URL_PATTERN.findall(text_content)
-            found_urls.update(matches)
+        if text_content:
+            found_urls.update(URL_PATTERN.findall(text_content))
```
140-183: Consider adding delay for rate-limit retries.

The retry logic correctly handles 403/429 status codes, but retrying immediately after a 429 (rate limit) may not help. Consider adding a small delay (e.g., `asyncio.sleep(1)`) before retrying on 429.
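A sketch of what that could look like, with hypothetical names and an illustrative delay schedule:

```python
import asyncio

RETRYABLE_STATUSES = {403, 429}

async def get_with_backoff(fetch, url, max_attempts=3, base_delay=0.01):
    """Retry on 403/429, sleeping before each 429 retry so the remote
    rate limiter has time to recover (delay values are illustrative)."""
    response = None
    for attempt in range(max_attempts):
        response = await fetch(url)
        status = response["status"]
        if status not in RETRYABLE_STATUSES:
            return response
        if status == 429 and attempt < max_attempts - 1:
            # Linearly growing pause before the next attempt
            await asyncio.sleep(base_delay * (attempt + 1))
    return response
```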
256-418: Consider extracting helper functions to reduce complexity.

This function has high cognitive complexity (23 vs 15 allowed) due to handling multiple content types and special cases. Consider extracting the docs.numerique.gouv.fr logic and general HTTP fetch logic into separate async helper functions. Suggested structure:

```python
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
    """Main entry point - validates and delegates."""
    deps = getattr(ctx, "deps", None)
    conversation = getattr(deps, "conversation", None)
    user = getattr(deps, "user", None)

    # Security check
    if url not in detect_url_in_conversation(conversation):
        return _url_not_detected_response(url)

    try:
        # Try docs.numerique.gouv.fr handler
        if DOCS_HOST in url and "/docs/" in url:
            return await _fetch_docs_numerique(url, conversation, user)
        # Fall back to general handler
        return await _fetch_general_url(url, conversation, user)
    except httpx.HTTPStatusError as e:
        return _http_error_response(url, e)
    except httpx.TimeoutException as e:
        return _timeout_response(url, e)
    except Exception as e:
        return _generic_error_response(url, e)

async def _fetch_docs_numerique(url: str, conversation, user) -> ToolReturn:
    """Handle docs.numerique.gouv.fr URLs."""
    # Extract existing lines 285-343
    ...

async def _fetch_general_url(url: str, conversation, user) -> ToolReturn:
    """Handle general URL fetching with trafilatura extraction."""
    # Extract existing lines 345-395
    ...
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (3)
- src/backend/chat/ai_sdk_types.py (1): `TextUIPart` (137-147)
- src/backend/chat/models.py (1): `ChatConversationAttachment` (78-133)
- src/backend/core/file_upload/enums.py (1): `AttachmentStatus` (9-22)
GitHub Check: CodeQL
src/backend/chat/tools/fetch_url.py
[warning] 30-30: Overly permissive regular expression range
Suspicious character range that is equivalent to [$%&'()*+,-./0-9:;<=>?@a-z\[\\]^_].
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[warning] 30-30: Remove duplicates in this character class.
[warning] 30-30: Remove or rework this redundant alternative.
[warning] 30-30: Replace this character class by the character itself.
[failure] 65-65: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
[failure] 102-102: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.
[failure] 256-256: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.
[warning] 30-30: Remove duplicates in this character class.
[warning] 30-30: Use concise character class syntax '\d' instead of '[0-9]'.
[warning] 30-30: Unwrap this unnecessarily grouped subpattern.
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-e2e-chromium
- GitHub Check: test-back
- GitHub Check: build-and-push-backend
- GitHub Check: build-and-push-frontend
Additional comments (2)
src/backend/chat/tools/fetch_url.py (2)
33-63: LGTM! The header randomization approach is reasonable for mitigating basic bot detection.

275-279: Good security practice: URL validation. The check ensuring the URL was detected in the conversation before fetching is a solid security measure that prevents arbitrary URL fetching via the tool.
Actionable comments posted: 0
Duplicate comments (2)
src/backend/chat/tools/fetch_url.py (2)
187-254: Blocking I/O in async function.

This async function contains multiple synchronous blocking I/O operations that will block the event loop:
- Lines 226-234: `document_store.parse_document()` and `document_store.store_document()` make synchronous HTTP requests to the Albert API using `requests.post()`.
- Line 250: `default_storage.save()` performs synchronous file I/O.

Under load, these blocking calls will degrade performance and delay other concurrent requests. Wrap these blocking calls with `sync_to_async`:

```diff
+from asgiref.sync import sync_to_async
 ...
-    parsed_content = document_store.parse_document(
+    parsed_content = await sync_to_async(document_store.parse_document)(
         name=safe_rag_name,
         content_type=content_type,
         content=BytesIO(content_bytes),
     )
-    document_store.store_document(
+    await sync_to_async(document_store.store_document)(
         name=url,
         content=parsed_content
     )
 ...
-    default_storage.save(key, ContentFile(parsed_content.encode("utf8")))
+    await sync_to_async(default_storage.save)(key, ContentFile(parsed_content.encode("utf8")))
```
345-347: Blocking HTML extraction in async function.

`trafilatura.extract()` is a synchronous operation that parses and extracts content from HTML. For large documents, this can block the event loop and degrade performance under concurrent load. Wrap the extraction call to run off the event loop:

```diff
+from asgiref.sync import sync_to_async
 ...
-    extracted = trafilatura.extract(response.text) or response.text
+    extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text
```
Nitpick comments (4)
src/backend/chat/tools/fetch_url.py (4)
30-32: Simplify the regex subpattern.

The grouped subpattern `(?:%[0-9a-fA-F]{2})` can be simplified to `%[0-9a-fA-F]{2}` since the group is not captured or referenced. Apply this diff:

```diff
 URL_PATTERN = re.compile(
-    r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+    r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9a-fA-F]{2})+'
 )
```
66-100: Consider refactoring to reduce cognitive complexity.

The function handles multiple message formats with nested conditionals, resulting in high cognitive complexity. Extracting helper functions for UIMessage and dict handling would improve readability. Example refactor:

```python
def _extract_text_from_message(message) -> str:
    """Extract all text content from a message."""
    if hasattr(message, 'parts'):
        return _extract_from_ui_message(message)
    elif isinstance(message, dict):
        return _extract_from_dict_message(message)
    return ''

def _extract_from_ui_message(message) -> str:
    """Extract text from UIMessage (Pydantic model)."""
    text_parts = []
    for part in message.parts or []:
        if isinstance(part, TextUIPart) and part.text:
            text_parts.append(part.text)
    if hasattr(message, 'content') and message.content:
        text_parts.append(message.content)
    return ' '.join(text_parts)

def _extract_from_dict_message(message: dict) -> str:
    """Extract text from dict-based message."""
    text_parts = []
    for part in message.get('parts', []):
        if isinstance(part, dict) and part.get('type') == 'text':
            if text := part.get('text', ''):
                text_parts.append(text)
    if content := message.get('content', ''):
        text_parts.append(content)
    return ' '.join(text_parts)
```
103-138: Consider simplifying the URL detection logic.

The nested function and multiple conditional checks contribute to elevated cognitive complexity. Flattening the structure would improve maintainability. Example refactor:

```python
def detect_url_in_conversation(conversation) -> list[str]:
    """Detect URLs present in the conversation messages."""
    if not conversation:
        return []

    found_urls = set()

    # Collect all message sources
    message_sources = []
    if hasattr(conversation, 'ui_messages') and conversation.ui_messages:
        message_sources.append(conversation.ui_messages)
    if hasattr(conversation, 'messages') and conversation.messages:
        message_sources.append(conversation.messages)

    # Extract URLs from all sources
    for messages in message_sources:
        for message in messages or []:
            if message:
                text_content = _extract_text_from_message(message)
                if text_content:
                    found_urls.update(URL_PATTERN.findall(text_content))

    if found_urls:
        logger.info("URL detected in messages: %s", found_urls)

    return list(found_urls)
```
257-419: Consider extracting helper functions to reduce complexity.

The `fetch_url` function handles multiple concerns: URL validation, special-case docs.numerique.gouv.fr handling, general URL fetching, content classification, and RAG storage decisions. Extracting helpers for docs.numerique.gouv.fr handling and content processing would improve maintainability. Example structure:

```python
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
    """Main entry point - validates and delegates."""
    deps = getattr(ctx, "deps", None)
    conversation = getattr(deps, "conversation", None)
    user = getattr(deps, "user", None)

    # Validate URL was detected
    if url not in detect_url_in_conversation(conversation):
        return _url_not_detected_response(url)

    try:
        # Special case handling
        if DOCS_HOST in url and "/docs/" in url:
            return await _fetch_docs_numerique(url, conversation, user)
        # General URL handling
        return await _fetch_general_url(url, conversation, user)
    except httpx.HTTPStatusError as e:
        return _http_error_response(url, e)
    except httpx.TimeoutException as e:
        return _timeout_response(url, e)
    except Exception as e:
        return _generic_error_response(url, e)

async def _fetch_docs_numerique(url: str, conversation, user) -> ToolReturn:
    """Handle docs.numerique.gouv.fr URLs."""
    # Extract existing lines 285-343
    ...

async def _fetch_general_url(url: str, conversation, user) -> ToolReturn:
    """Handle general URLs."""
    # Extract existing lines 345-395
    ...
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (4)
- src/backend/chat/ai_sdk_types.py (1): `TextUIPart` (137-147)
- src/backend/chat/agent_rag/document_search/albert_api.py (1): `collection_id` (56-62)
- src/backend/chat/models.py (1): `ChatConversationAttachment` (78-133)
- src/backend/core/models.py (1): `save` (61-64)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[failure] 66-66: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
[failure] 103-103: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.
[warning] 31-31: Unwrap this unnecessarily grouped subpattern.
[failure] 257-257: Refactor this function to reduce its Cognitive Complexity from 23 to the 15 allowed.
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-e2e-chromium
- GitHub Check: test-back
- GitHub Check: build-and-push-backend
- GitHub Check: build-and-push-frontend
Additional comments (3)
src/backend/chat/tools/fetch_url.py (3)
34-64: LGTM! The randomized header strategy is a pragmatic approach to mitigate blocking by rate-limiters or bot detection. The implementation is clean and easily extensible.

141-184: LGTM! The retry logic is well-designed, targeting only header-related errors (403, 429) and using randomized headers on each attempt. The implementation correctly handles exceptions and provides useful debug logging.

276-280: LGTM! Good security validation. The check ensures that only URLs actually detected in the conversation can be fetched, preventing potential abuse where an agent might be tricked into fetching arbitrary URLs.
qbey
left a comment
I really think we should not add a new path in code to manage URL content, they should be considered as documents to avoid duplicating code and logic.
Actionable comments posted: 2
Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)
330-334: `trafilatura.extract()` is still synchronous and blocks the event loop.

This call performs CPU-intensive HTML parsing synchronously within an async function. For large documents, this can block the event loop. Wrap with `sync_to_async`:

```diff
+from asgiref.sync import sync_to_async
 ...
     # Avoid trafilatura on PDFs
     if is_pdf:
         extracted = ""
     else:
-        extracted = trafilatura.extract(response.text) or response.text
+        extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text
```
Nitpick comments (8)
src/backend/chat/tests/tools/test_fetch_url.py (1)
172-172: Remove unusedmock_storagevariables.The
mock_storagevariables in thepatchcontext managers on lines 172 and 278 are never used. Remove them to clean up the code.Apply this diff:
with patch("chat.tools.fetch_url.import_string") as mock_import, patch(
"chat.tools.fetch_url.models.ChatConversationAttachment.obje cts.acreate", new_callable=AsyncMock
- ) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save") as mock_storage:
+ ) as mock_attachment_create, patch("chat.tools.fetch_url.default_storage.save"):

Also applies to: 278-278
src/backend/chat/agent_rag/document_converter/markitdown.py (1)
33-42: Minor cleanup: remove the unnecessary seek attempt. The code creates a fresh `BytesIO` from the extracted bytes, so the `seek(0)` attempt on the original `BytesIO` is unnecessary, and the conditional `if hasattr(content, 'seek')` adds complexity without benefit. Simplify:
if isinstance(content, BytesIO):
- # Read the BytesIO content to bytes
content_bytes = content.read()
- # Reset position if possible (though we create a new BytesIO anyway)
- content.seek(0) if hasattr(content, 'seek') else None
else:
- # content is already bytes
content_io = BytesIO(content)

The normalization logic is correct otherwise.
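The normalization both diffs converge on can be sketched as a small helper; the `to_bytes` name is illustrative, not part of the PR:

```python
from io import BytesIO

def to_bytes(content) -> bytes:
    # Normalize bytes-or-BytesIO input to plain bytes. Callers that need a
    # file-like object can build a fresh BytesIO from the result, so no
    # seek() on the original stream is required.
    if isinstance(content, BytesIO):
        return content.read()
    if isinstance(content, bytes):
        return content
    raise TypeError(f"unsupported content type: {type(content)!r}")
```

This keeps a single conversion point instead of sprinkling `seek()` guards through the storage code.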
src/backend/chat/document_storage.py (1)
73-77: Remove the unnecessary seek attempt. As in markitdown.py, the `seek(0)` attempt is unnecessary since a new `BytesIO` is always created from the extracted bytes. This also makes the comment on line 77 obsolete. Apply this diff:
if isinstance(content, BytesIO):
- # Read the BytesIO content to bytes
content_bytes = content.read()
- # Reset position if possible (though we create a new BytesIO anyway)
- content.seek(0) if hasattr(content, 'seek') else None
elif isinstance(content, bytes):
content_bytes = content

src/backend/chat/tools/fetch_url.py (5)
31-33: The URL regex pattern has unnecessary grouping and an overly permissive character class. The regex wraps an unnecessarily grouped subpattern `(?:...)` around the entire character class, and the character range includes more characters than typical URL patterns need. Per the static-analysis hint, consider simplifying:

URL_PATTERN = re.compile(
- r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+ r'https?://[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]+'
The percent-encoding alternation `(?:%[0-9a-fA-F]{2})` is redundant, since `%`, hex digits, and the encoded characters are already in the main class.
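A quick sketch of how the simplified pattern behaves; the order-preserving `detect_urls` helper is illustrative, not the PR's `detect_url_in_conversation`:

```python
import re

# Simplified version of the pattern discussed above: the percent-encoding
# alternation is dropped because '%' and hex digits are already in the class.
URL_PATTERN = re.compile(
    r"https?://[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+"
)

def detect_urls(text: str) -> list[str]:
    # Return unique URLs in first-seen order.
    return list(dict.fromkeys(URL_PATTERN.findall(text)))

urls = detect_urls(
    "See https://docs.numerique.gouv.fr/docs/abc "
    "and https://docs.numerique.gouv.fr/docs/abc again."
)
print(urls)
```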
67-101: Consider reducing cognitive complexity by extracting handlers. Static analysis flags a cognitive complexity of 24 (limit 15). Extract separate handlers for each message type:
+def _extract_text_from_ui_message(message) -> list[str]:
+ """Extract text from a UIMessage Pydantic model."""
+ text_parts = []
+ for part in message.parts or []:
+ if isinstance(part, TextUIPart) and part.text:
+ text_parts.append(part.text)
+ if hasattr(message, 'content') and message.content:
+ text_parts.append(message.content)
+ return text_parts
+
+
+def _extract_text_from_dict_message(message: dict) -> list[str]:
+ """Extract text from a dict-based message."""
+ text_parts = []
+ for part in message.get('parts', []):
+ if isinstance(part, dict) and part.get('type') == 'text':
+ if text := part.get('text', ''):
+ text_parts.append(text)
+ if content := message.get('content', ''):
+ text_parts.append(content)
+ return text_parts
+
+
def _extract_text_from_message(message) -> str:
- """
- Extract all text content from a message.
- ...
- """
- text_parts = []
-
- # Handle UIMessage objects (Pydantic models)
+ """Extract all text content from a message."""
if hasattr(message, 'parts'):
- for part in message.parts or []:
- if isinstance(part, TextUIPart) and part.text:
- text_parts.append(part.text)
- # Also check the deprecated content field
- if hasattr(message, 'content') and message.content:
- text_parts.append(message.content)
- # Handle dict-based messages (JSON deserialized)
+ return ' '.join(_extract_text_from_ui_message(message))
elif isinstance(message, dict):
- # Check parts
- parts = message.get('parts', [])
- for part in parts:
- if isinstance(part, dict) and part.get('type') == 'text':
- text = part.get('text', '')
- if text:
- text_parts.append(text)
- # Check deprecated content field
- content = message.get('content', '')
- if content:
- text_parts.append(content)
-
- return ' '.join(text_parts)
+ return ' '.join(_extract_text_from_dict_message(message))
+ return ''
146-170: Consider adding a delay between retries for 429 responses. When retrying on HTTP 429 (rate limit), immediately retrying without delay may not help. Consider adding a small backoff:
+import asyncio
...
if (not should_retry) or is_last_attempt:
raise
+
+ # Small delay before retry, especially useful for rate limits
+ await asyncio.sleep(0.5 * (attempt + 1))This is optional since the User-Agent rotation may be sufficient for 403 cases.
298-303: Hardcoded French strings limit internationalization. Multiple French strings are embedded in `ToolReturn` values (lines 278, 298-303, 319, 357-362). If these are displayed to users, or if multi-language support is needed, consider using Django's translation utilities:

from django.utils.translation import gettext as _
content=_(
"Le contenu de ce document est volumineux et a ete indexe dans "
"la base de documents de la conversation. ..."
)If French-only is intentional for this deployment, this can be deferred.
Also applies to: 357-362
264-320: Extract docs.numerique.gouv.fr handling to a dedicated function. This 55-line special case significantly contributes to cognitive complexity (flagged at 20 vs 15 allowed). Per previous review feedback, extract to a separate async function:
+async def _fetch_docs_numerique(
+ url: str,
+ docs_id: str,
+ conversation,
+ user,
+) -> ToolReturn:
+ """Fetch content from docs.numerique.gouv.fr API."""
+ url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
+
+ async with httpx.AsyncClient(timeout=settings.FETCH_URL_TIMEOUT, follow_redirects=True) as client:
+ response = await _get_with_retry(client, url_transformed)
+ data = response.json()
+ content = data.get('content', '')
+
+ if not content:
+ return ToolReturn(
+ return_value={"url": url, "error": "Content empty or private", ...},
+ )
+ # ... rest of the logic
+
+
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
...
try:
m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
if m:
- docs_id = m.group(1)
- url_transformed = ...
- # ... 50+ lines
+ return await _fetch_docs_numerique(url, m.group(1), conversation, user)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (10)
- src/backend/chat/agent_rag/document_converter/markitdown.py (2 hunks)
- src/backend/chat/clients/pydantic_ai.py (9 hunks)
- src/backend/chat/document_storage.py (1 hunks)
- src/backend/chat/tests/tools/test_fetch_url.py (1 hunks)
- src/backend/chat/tools/__init__.py (1 hunks)
- src/backend/chat/tools/document_summarize.py (2 hunks)
- src/backend/chat/tools/fetch_url.py (1 hunks)
- src/backend/conversations/settings.py (1 hunks)
- src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1 hunks)
- src/frontend/apps/conversations/src/i18n/translations.json (4 hunks)
Additional context used
Code graph analysis (5)
src/backend/chat/document_storage.py (4)
src/backend/chat/agent_rag/document_search/albert_api.py (1)
- collection_id (56-62)
src/backend/chat/models.py (1)
- ChatConversationAttachment (78-133)
src/backend/core/models.py (1)
- save (61-64)
src/backend/core/file_upload/enums.py (1)
- AttachmentStatus (9-22)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
- TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
- create_markdown_attachment (105-142)
- ensure_collection_exists (23-40)
- store_document_in_rag (43-102)
src/backend/chat/tools/__init__.py (1)
src/backend/chat/tools/fetch_url.py (1)
- fetch_url (236-401)
src/backend/chat/clients/pydantic_ai.py (3)
src/backend/chat/ai_sdk_types.py (1)
- TextUIPart (137-147)
src/backend/chat/document_storage.py (1)
- store_document_with_attachment (145-201)
src/backend/chat/tools/fetch_url.py (2)
- fetch_url (236-401)
- detect_url_in_conversation (104-130)
src/backend/chat/tests/tools/test_fetch_url.py (1)
src/backend/chat/tools/fetch_url.py (1)
- fetch_url (236-401)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[failure] 67-67: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
[failure] 236-236: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.
[warning] 32-32: Unwrap this unnecessarily grouped subpattern.
src/backend/chat/tests/tools/test_fetch_url.py
[warning] 172-172: Remove the unused local variable "mock_storage".
[warning] 278-278: Remove the unused local variable "mock_storage".
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-e2e-chromium
- GitHub Check: test-back
- GitHub Check: build-and-push-backend
- GitHub Check: build-and-push-frontend
Additional comments (14)
src/backend/chat/tools/document_summarize.py (1)
26-33: LGTM - smart URL preference for document sources. The logic correctly prioritizes the original URL (from `conversion_from`) when available for fetched content, which improves source traceability in summaries.

src/backend/chat/tools/__init__.py (1)
6-6: Clarify the fetch_url import usage. The `fetch_url` function is imported but not registered in `get_pydantic_tools_by_name()`. Based on the PR description, this tool is registered dynamically when URLs are detected in conversations. If the import is solely for package-level access (e.g., `from chat.tools import fetch_url`), this is fine. However, if there's an expectation that it should be available via `get_pydantic_tools_by_name()`, it needs to be added to the `tool_dict`. Can you confirm that dynamic registration (outside this module) is the intended usage pattern for `fetch_url`? If so, consider adding a comment explaining why it's imported but not in the registry:

from .fetch_url import fetch_url  # Dynamically registered; not in static registry

src/frontend/apps/conversations/src/i18n/translations.json (1)
107-107: LGTM - translation entries added correctly. The new "Fetching URL..." translation has been added consistently across all four supported languages (FR, NL, RU, UK), aligning with the new fetch_url tool UI feedback.
Also applies to: 240-240, 373-373, 506-506
src/backend/conversations/settings.py (1)
844-849: Verify that a 5-second timeout is sufficient for URL fetching. The default `FETCH_URL_TIMEOUT` of 5 seconds may be too aggressive for:
- Large documents (PDFs, long HTML pages)
- Slow servers or high-latency networks
- API endpoints that need to generate content before responding
Consider whether:
- This timeout applies only to connection establishment (appropriate) or the entire response (potentially too short)
- A higher default (e.g., 10-15 seconds) would be more reliable for production use
- The timeout should differentiate between connection and read operations
Review the fetch_url implementation to confirm timeout behavior and test with typical large documents to validate the 5-second default.
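One way to make the connect-vs-read distinction explicit is to carry separate budgets in settings; the `FetchTimeouts` dataclass below is hypothetical, and the comment shows how it would map onto httpx's `Timeout` type:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FetchTimeouts:
    # Separate budgets: connection establishment vs. reading the body.
    # A slow server can pass a 5 s connect check yet need longer to
    # stream a large PDF, hence the more generous read budget.
    connect: float = 5.0
    read: float = 15.0

cfg = FetchTimeouts()
# With httpx this maps to:
#   httpx.Timeout(cfg.read, connect=cfg.connect, read=cfg.read)
print(cfg.connect, cfg.read)
```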
src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1)
794-826: LGTM - improved active tool detection logic. The refactored logic correctly:

- Identifies all tool invocations (excluding `document_parsing`)
- Finds the most recent active invocation (state !== 'result')
- Handles the new `fetch_url` tool with appropriate UI feedback
- Maintains fallback behavior for other search tools
This approach properly handles multiple concurrent tool invocations by showing the status of the one currently in progress.
src/backend/chat/document_storage.py (1)
138-138: Verify the thread-safety of the synchronous storage call in an async function. The synchronous `default_storage.save()` call is made within an async function. While Django's storage backends are generally thread-safe, this could block the async event loop. Consider one of the following:

- If the storage backend supports async operations, use `await sync_to_async(default_storage.save)(...)`
- Verify that the storage operations are fast enough that blocking is acceptable
- Document why synchronous storage access is safe in this context
Additionally, ensure proper error handling if `storage.save()` fails, as this would leave the attachment in an inconsistent state (database record exists but file doesn't).

src/backend/chat/tools/fetch_url.py (4)
35-65: LGTM! The randomized User-Agent pool is a reasonable approach for mitigating simple header-based blocking.

104-130: LGTM! The URL detection logic correctly handles both message formats and returns unique URLs.

179-233: LGTM - refactored to use async document storage utilities. The function now properly delegates to `store_document_in_rag` and `create_markdown_attachment` from the document_storage module, addressing previous blocking I/O concerns at this layer.

378-401: LGTM - comprehensive error handling. The error handling covers HTTP errors, timeouts, and unexpected exceptions appropriately. Each error type is logged at the correct level.
src/backend/chat/clients/pydantic_ai.py (4)
61-78: LGTM - imports support the new functionality. The new imports correctly bring in the document storage utility and the fetch_url tool for dynamic registration.

253-291: LGTM - cleaner document handling with a shared utility. The refactoring to use `store_document_with_attachment` consolidates the document storage logic and properly handles BytesIO-to-bytes conversion. The `attachment_key` and `conversion_from` parameters preserve document provenance correctly.

490-521: LGTM - RAG enablement correctly extended for URL-fetched content. The logic now enables RAG when URLs are detected (`has_url_in_conversation`), since `fetch_url` stores large content in RAG. The guard on `attached_documents_note` (lines 514-521) appropriately prevents misleading the model about attached documents when only URLs were fetched.

559-559: LGTM - source URL deduplication prevents UI clutter. The `_added_source_urls` set correctly tracks emitted URLs to prevent duplicate `SourceUIPart` entries when multiple tool results reference the same URL.

Also applies to: 671-688
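The deduplication pattern described here is simply an order-preserving seen-set; a standalone sketch (the `emit_sources` name is illustrative, not the PR's code):

```python
def emit_sources(urls):
    # Track already-emitted URLs in a set so each source appears once,
    # preserving first-seen order, like the _added_source_urls set above.
    seen = set()
    out = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        out.append(url)
    return out

sources = emit_sources([
    "https://a.example", "https://b.example", "https://a.example",
])
print(sources)
```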
Actionable comments posted: 3
Duplicate comments (2)
src/backend/chat/tools/fetch_url.py (2)
328-328: Wrap trafilatura.extract() to avoid blocking the event loop.
`trafilatura.extract()` is synchronous and performs CPU-intensive HTML parsing. For large documents, this blocks the event loop. This was flagged in a previous review but remains unaddressed.

+from asgiref.sync import sync_to_async
...
- extracted = trafilatura.extract(response.text) or response.text
+ extracted = await sync_to_async(trafilatura.extract)(response.text) or response.text
258-314: Extract docs.numerique.gouv.fr handling to a dedicated function. This special-case logic spans 55+ lines and increases cognitive complexity. Based on learnings, this should be extracted to a separate function to improve maintainability and reduce the complexity of `fetch_url()`. Apply this pattern:
+async def _fetch_docs_numerique(
+ url: str,
+ docs_id: str,
+ conversation,
+ user,
+) -> ToolReturn:
+ """Fetch content from docs.numerique.gouv.fr via their API."""
+ url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
+ # ... rest of docs-specific logic
+
async def fetch_url(ctx: RunContext, url: str) -> ToolReturn:
...
try:
- # Special handling for docs.numerique.gouv.fr
- m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
- if m:
- docs_id = m.group(1)
- url_transformed = f"https://{DOCS_HOST}/api/v1.0/documents/{docs_id}/content/?content_format=markdown"
- ...
+ m = re.search(r"https?://(?:www\.)?docs\.numerique\.gouv\.fr/docs/([^/?#]+)", url)
+ if m:
+ return await _fetch_docs_numerique(url, m.group(1), conversation, user)

Based on learnings, this refactor addresses the maintainer's feedback.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
- TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
- create_markdown_attachment (105-142)
- ensure_collection_exists (23-40)
- store_document_in_rag (43-102)
GitHub Actions: Main Workflow
src/backend/chat/tools/fetch_url.py
[warning] 352-352: typo: 'ressource' -> 'resource' (codespell)
[warning] 353-353: typo: 'exemple' -> 'example' (codespell)
[warning] 356-356: typo: 'informations' -> 'information' (codespell)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[warning] 32-32: Unwrap this unnecessarily grouped subpattern.
[failure] 67-67: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
[failure] 230-230: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.
Additional comments (5)
src/backend/chat/tools/fetch_url.py (5)
35-65: LGTM! The randomized header approach is well implemented and helps mitigate blocking. The documentation clearly states the extensibility path for future enhancements.
133-176: LGTM! The retry mechanism is well designed, with appropriate conditions for header-related blocks (403, 429) and good logging for observability.

179-227: LGTM! The RAG storage and attachment creation are properly implemented with async/await patterns. The two-step process (safe filename for parsing, original URL for storage metadata) correctly handles API constraints.
272-272: Clarify whether French UI strings are intentional. The error message "Ce document Docs n'est pas public ou est vide." is in French. Verify if this is intentional for a French-language deployment or if internationalization should be used.
302-309: The inline content strategy is already properly implemented with appropriate size thresholds. The current code routes large content (> 8000 chars) through RAG while keeping small content (<= 8000 chars) inline. At lines 276-300 and 334-360, content exceeding MAX_INLINE_CONTENT_CHARS is stored in RAG with a preview returned instead. Lines 302-309 and 363-371 return small content inline, with explicit comments confirming this is the intended behavior: "Small textual content: keep the existing behaviour with inline content."
This threshold-based approach prevents context bloating for large documents while maintaining responsiveness for small content, and does not require changes.
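The threshold routing described above can be sketched as follows, assuming the 8000-character limit from the PR; the RAG storage call is omitted, and `route_content` is an illustrative name:

```python
MAX_INLINE_CONTENT_CHARS = 8000  # matches the threshold described in the PR

def route_content(text: str) -> dict:
    # Small content is returned inline; large content would be stored in RAG
    # and only a short preview returned (storage call omitted in this sketch).
    if len(text) <= MAX_INLINE_CONTENT_CHARS:
        return {"mode": "inline", "content": text}
    return {"mode": "rag", "preview": text[:500]}

small = route_content("short page")
large = route_content("x" * 10_000)
print(small["mode"], large["mode"])
```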
messages_for_detection = getattr(deps, "messages", None)
urls = detect_url_in_conversation(messages_for_detection)
logger.info("URLs authorized (extracted from messages): %s", urls)

# If messages are provided, enforce URL presence; otherwise skip the check.
if messages_for_detection is not None and url not in urls:
    return ToolReturn(
        return_value={"url": url, "error": "URL not detected in conversation", "content": f"URL {url} not detected in conversation"},
    )
Use pydantic_messages for URL detection instead of frontend messages.
Line 247 retrieves messages from deps, but based on learnings, this field is Vercel SDK formatted for frontend use. The URL detection should use Pydantic AI messages from pydantic_messages instead.
Additionally, line 254 returns an error with a "content" field, which was questioned in past reviews as potentially incorrect for the expected return structure.
Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 247-255, the code reads
frontend-formatted messages via deps.messages for URL detection and returns an
error including a "content" field; replace deps.messages with
deps.pydantic_messages (the Pydantic AI message list) when calling
detect_url_in_conversation so detection uses the correct format, and update the
error return to match the ToolReturn schema (remove or rename the "content" key
to the expected field, e.g., keep only return_value with url and error or use
the proper message field) so the response structure is consistent with callers.
"content": (
    "Le contenu de cette ressource est volumineux ou de type fichier "
    "(par exemple PDF). Il a ete indexe dans la base de documents de la "
    "conversation. Pour l'exploiter, tu dois utiliser l'outil "
    "`document_search_rag` avec une requete precise decrivant les "
    "informations que tu souhaites retrouver."
Fix French typos flagged by pipeline.
The pipeline reports typos in the French instruction message:
- "ressource" -> "resource" (line 352)
- "exemple" -> "example" (line 353)
- "informations" -> "information" (line 356)
However, these are valid French words. If the message is intentionally in French, suppress the codespell warning or configure it to skip French strings. If this should be English, translate the message.
Tools
GitHub Actions: Main Workflow
[warning] 352-352: typo: 'ressource' -> 'resource' (codespell)
[warning] 353-353: typo: 'exemple' -> 'example' (codespell)
[warning] 356-356: typo: 'informations' -> 'information' (codespell)
Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 351 to 356, the pipeline
flagged typos in a French user-facing message; decide whether the string should
remain French or be English and implement one of the two fixes: if the message
is intentionally French, add a codespell suppression (e.g., wrap the string with
a comment or marker used by the repo to ignore codespell or add the specific
word exceptions in the codespell config) so the linter skips it; otherwise
translate the French string into clear English and replace the French text
accordingly (ensure punctuation and backticks around tool names are preserved).
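If the strings stay French, one narrow option (assuming codespell reads the project's `pyproject.toml`, which recent versions do) is a scoped ignore list; the entries below are illustrative and should match the repo's actual config location (`setup.cfg` and `.codespellrc` are also supported):

```toml
[tool.codespell]
# Valid French words that codespell misreads as English typos.
ignore-words-list = "ressource,exemple,informations"
```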
Actionable comments posted: 2
Duplicate comments (1)
src/backend/chat/tools/fetch_url.py (1)
244-256: Use pydantic_messages for URL detection instead of frontend messages. Line 248 retrieves `deps.messages` (Vercel SDK format for the frontend), but per past review feedback, the proper conversation history is in `deps.pydantic_messages` (Pydantic AI format). URL detection should use the correct message source to ensure reliable detection across the conversation history.

Additionally, the "content" field in the error return (line 255) was questioned in past reviews as potentially incorrect for the expected return structure.
Based on learnings, as per past review feedback from maintainers.
Apply this diff:
- messages_for_detection = getattr(deps, "messages", None)
+ messages_for_detection = getattr(deps, "pydantic_messages", None)
urls = detect_url_in_conversation(messages_for_detection)
logger.info("URLs authorized (extracted from messages): %s", urls)
# If messages are provided, enforce URL presence; otherwise skip the check.
if messages_for_detection is not None and url not in urls:
return ToolReturn(
- return_value={"url": url, "error": "URL not detected in conversation", "content" : f"URL {url} not detected in conversation"},
+ return_value={"url": url, "error": "URL not detected in conversation"},
)
Nitpick comments (2)
src/backend/chat/tools/fetch_url.py (2)
32-34: Consider simplifying the regex subpattern. The regex contains an unnecessarily grouped subpattern `(?:%[0-9a-fA-F]{2})` that could be unwrapped to `%[0-9a-fA-F]{2}` without changing behavior, making it slightly more readable. Apply this diff:
URL_PATTERN = re.compile(
- r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+'
+ r'https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&\'()*+,;=%]|%[0-9a-fA-F]{2})+'
)
231-397: Consider reducing cognitive complexity. Static analysis flags this function's cognitive complexity at 20 (threshold 15). While the logic is correct and handles multiple content types and error cases appropriately, consider extracting helper functions for:
- Docs.numerique.gouv.fr special handling (lines 259-315)
- General URL content processing (lines 317-373)
This would improve maintainability and testability.
This is optional given the "Chill" review mode, but would make the code easier to maintain long-term.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
src/backend/chat/tools/fetch_url.py (1 hunks)
Additional context used
Code graph analysis (1)
src/backend/chat/tools/fetch_url.py (2)
src/backend/chat/ai_sdk_types.py (1)
- TextUIPart (137-147)
src/backend/chat/document_storage.py (3)
- create_markdown_attachment (105-142)
- ensure_collection_exists (23-40)
- store_document_in_rag (43-102)
GitHub Check: SonarCloud Code Analysis
src/backend/chat/tools/fetch_url.py
[failure] 231-231: Refactor this function to reduce its Cognitive Complexity from 20 to the 15 allowed.
[warning] 33-33: Unwrap this unnecessarily grouped subpattern.
[failure] 68-68: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-e2e-chromium
- GitHub Check: test-back
- GitHub Check: build-and-push-backend
- GitHub Check: build-and-push-frontend
Additional comments (5)
src/backend/chat/tools/fetch_url.py (5)
36-66: LGTM! The randomized User-Agent strategy is a good approach to mitigating blocking on header-sensitive endpoints.

105-131: LGTM, conditionally. The URL detection logic is sound. However, this function depends on `_extract_text_from_message`, which needs to be refactored to use Pydantic AI messages (see the comment on lines 68-102). Once that refactor is complete, this function will automatically use the correct message format.

134-177: LGTM! The retry logic is well implemented, with appropriate error handling and logging. The selective retry on 403/429 status codes with randomized headers is a good strategy for handling rate limiting and access restrictions.

180-228: LGTM! The RAG storage implementation correctly handles the filename-vs-metadata issue, with clear separation between safe parsing names and original URLs for citations. The async integration is proper.

374-397: LGTM! Comprehensive error handling, with appropriate logging levels for different error types. The structured error returns maintain consistency with the ToolReturn format.
def _extract_text_from_message(message) -> str:
    """
    Extract all text content from a message.

    Args:
        message: A message object (UIMessage or dict).

    Returns:
        str: All text content concatenated.
    """
    text_parts = []

    # Handle UIMessage objects (Pydantic models)
    if hasattr(message, 'parts'):
        for part in message.parts or []:
            if isinstance(part, TextUIPart) and part.text:
                text_parts.append(part.text)
        # Also check the deprecated content field
        if hasattr(message, 'content') and message.content:
            text_parts.append(message.content)
    # Handle dict-based messages (JSON deserialized)
    elif isinstance(message, dict):
        # Check parts
        parts = message.get('parts', [])
        for part in parts:
            if isinstance(part, dict) and part.get('type') == 'text':
                text = part.get('text', '')
                if text:
                    text_parts.append(text)
        # Check deprecated content field
        content = message.get('content', '')
        if content:
            text_parts.append(content)

    return ' '.join(text_parts)
Refactor to extract text from Pydantic AI messages instead of frontend messages.
This function is designed to parse UIMessage objects and frontend-formatted dicts, but per past review feedback, the proper conversation history is in pydantic_messages (Pydantic AI format), not the Vercel SDK messages field. The deprecated content field checks (lines 86-87, 98-100) also add unnecessary complexity.
Based on learnings, messages is for frontend use and ui_messages will be removed. The function should be refactored to work with Pydantic AI message format instead.
This change would also reduce the cognitive complexity (currently 24, threshold 15) flagged by static analysis.
Tools
GitHub Check: SonarCloud Code Analysis
[failure] 68-68: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around lines 68 to 102, the function
currently extracts text from frontend/UI message shapes (UIMessage and dict
parts) and checks deprecated content fields; refactor it to read from the
Pydantic AI message format (pydantic_messages) instead: accept the Pydantic
message or list of Pydantic messages, iterate those models' canonical text
fields (e.g., parts or the model's text/content attributes used by the pydantic
schema), remove handling of frontend dict shapes and deprecated content checks,
and return the concatenated text; ensure type checks use the Pydantic model
classes/attributes, handle optional/missing fields defensively, and keep the
function small and focused to reduce cognitive complexity.
if not content:
    return ToolReturn(
        return_value={"url": url, "error": "Content empty or private", "content": "Ce document Docs n'est pas public ou est vide."},
French string flagged by codespell. The French error message is intentional but will cause codespell warnings. Consider either translating to English or adding codespell suppressions for these strings. This also applies to other French strings in this function (lines 293-298, 314, 353-358).
The French error message is intentional but will cause codespell warnings. Consider either translating to English or adding codespell suppressions for these strings. This also applies to other French strings in this function (lines 293-298, 314, 353-358).
If these messages should remain French, add a codespell ignore comment or configure the project to skip French strings. If they should be English, translate them.
Prompt for AI Agents
In src/backend/chat/tools/fetch_url.py around line 273 (and similarly lines
293-298, 314, 353-358), French error strings (e.g., "Ce document Docs n'est pas
public ou est vide.") trigger codespell warnings; either translate these
messages to English consistently (replace French texts with their English
equivalents) or add a codespell suppression for these specific string literals
or configure the project to ignore French strings--apply the chosen approach
consistently to all flagged lines and ensure any suppression is narrowly scoped
to these messages.
Quality Gate failed

Failed conditions