Joseph Pollack committed on
Commit f5a06d4 · 1 Parent(s): 85f2fd9

Attempts to fix web search: adds Serper support, new tools, and an adapter; resolves a settings issue; plus assorted related changes.

SERPER_WEBSEARCH_IMPLEMENTATION_PLAN.md ADDED
@@ -0,0 +1,396 @@
1
+ # SERPER Web Search Implementation Plan
2
+
3
+ ## Executive Summary
4
+
5
+ This plan details the implementation of SERPER-based web search by vendoring code from `folder/tools/web_search.py` into `src/tools/`, creating a protocol-compliant `SerperWebSearchTool`, fixing the existing `WebSearchTool`, and integrating both into the main search flow.
6
+
7
+ ## Project Structure
8
+
9
+ ### Project 1: Vendor and Refactor Core Web Search Components
10
+ **Goal**: Extract and vendor Serper/SearchXNG search logic from `folder/tools/web_search.py` into `src/tools/`
11
+
12
+ ### Project 2: Create Protocol-Compliant SerperWebSearchTool
13
+ **Goal**: Implement `SerperWebSearchTool` class that fully complies with `SearchTool` protocol
14
+
15
+ ### Project 3: Fix Existing WebSearchTool Protocol Compliance
16
+ **Goal**: Make existing `WebSearchTool` (DuckDuckGo) protocol-compliant
17
+
18
+ ### Project 4: Integrate Web Search into SearchHandler
19
+ **Goal**: Add web search tools to main search flow in `src/app.py`
20
+
21
+ ### Project 5: Update Callers and Dependencies
22
+ **Goal**: Update all code that uses web search to work with new implementation
23
+
24
+ ### Project 6: Testing and Validation
25
+ **Goal**: Add comprehensive tests for all web search implementations
26
+
27
+ ---
28
+
29
+ ## Detailed Implementation Plan
30
+
31
+ ### PROJECT 1: Vendor and Refactor Core Web Search Components
32
+
33
+ #### Activity 1.1: Create Vendor Module Structure
34
+ **File**: `src/tools/vendored/__init__.py`
35
+ - **Task 1.1.1**: Create `src/tools/vendored/` directory
36
+ - **Task 1.1.2**: Create `__init__.py` with exports
37
+
38
+ **File**: `src/tools/vendored/web_search_core.py`
39
+ - **Task 1.1.3**: Vendor `ScrapeResult`, `WebpageSnippet`, `SearchResults` models from `folder/tools/web_search.py` (lines 23-37)
40
+ - **Task 1.1.4**: Vendor `scrape_urls()` function (lines 274-299)
41
+ - **Task 1.1.5**: Vendor `fetch_and_process_url()` function (lines 302-348)
42
+ - **Task 1.1.6**: Vendor `html_to_text()` function (lines 351-368)
43
+ - **Task 1.1.7**: Vendor `is_valid_url()` function (lines 371-410)
44
+ - **Task 1.1.8**: Vendor `ssl_context` setup (lines 115-120)
45
+ - **Task 1.1.9**: Add imports: `aiohttp`, `asyncio`, `BeautifulSoup`, `ssl`
46
+ - **Task 1.1.10**: Add `CONTENT_LENGTH_LIMIT = 10000` constant
47
+ - **Task 1.1.11**: Add type hints following project standards
48
+ - **Task 1.1.12**: Add structlog logging
49
+ - **Task 1.1.13**: Replace `print()` statements with `logger` calls
50
+
51
+ **File**: `src/tools/vendored/serper_client.py`
52
+ - **Task 1.1.14**: Vendor `SerperClient` class from `folder/tools/web_search.py` (lines 123-196)
53
+ - **Task 1.1.15**: Remove dependency on `ResearchAgent` and `ResearchRunner`
54
+ - **Task 1.1.16**: Replace filter agent with simple relevance filtering or remove it
55
+ - **Task 1.1.17**: Add `__init__` that takes `api_key: str | None` parameter
56
+ - **Task 1.1.18**: Update `search()` method to return `list[WebpageSnippet]` without filtering
57
+ - **Task 1.1.19**: Remove `_filter_results()` method (or make it optional)
58
+ - **Task 1.1.20**: Add error handling with `SearchError` and `RateLimitError`
59
+ - **Task 1.1.21**: Add structlog logging
60
+ - **Task 1.1.22**: Add type hints
61
+
62
+ **File**: `src/tools/vendored/searchxng_client.py`
63
+ - **Task 1.1.23**: Vendor `SearchXNGClient` class from `folder/tools/web_search.py` (lines 199-271)
64
+ - **Task 1.1.24**: Remove dependency on `ResearchAgent` and `ResearchRunner`
65
+ - **Task 1.1.25**: Replace filter agent with simple relevance filtering or remove it
66
+ - **Task 1.1.26**: Add `__init__` that takes `host: str` parameter
67
+ - **Task 1.1.27**: Update `search()` method to return `list[WebpageSnippet]` without filtering
68
+ - **Task 1.1.28**: Remove `_filter_results()` method (or make it optional)
69
+ - **Task 1.1.29**: Add error handling with `SearchError` and `RateLimitError`
70
+ - **Task 1.1.30**: Add structlog logging
71
+ - **Task 1.1.31**: Add type hints
72
+
73
+ #### Activity 1.2: Create Rate Limiting for Web Search
74
+ **File**: `src/tools/rate_limiter.py`
75
+ - **Task 1.2.1**: Add `get_serper_limiter()` function (rate: "10/second" with API key)
76
+ - **Task 1.2.2**: Add `get_searchxng_limiter()` function (rate: "5/second")
77
+ - **Task 1.2.3**: Use `RateLimiterFactory.get()` pattern
78
+
79
+ ---
80
+
81
+ ### PROJECT 2: Create Protocol-Compliant SerperWebSearchTool
82
+
83
+ #### Activity 2.1: Implement SerperWebSearchTool Class
84
+ **File**: `src/tools/serper_web_search.py`
85
+ - **Task 2.1.1**: Create new file `src/tools/serper_web_search.py`
86
+ - **Task 2.1.2**: Add imports:
87
+ - `from src.tools.base import SearchTool`
88
+ - `from src.tools.vendored.serper_client import SerperClient`
89
+ - `from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet`
90
+ - `from src.tools.rate_limiter import get_serper_limiter`
91
+ - `from src.tools.query_utils import preprocess_query`
92
+ - `from src.utils.config import settings`
93
+ - `from src.utils.exceptions import SearchError, RateLimitError`
94
+ - `from src.utils.models import Citation, Evidence`
95
+ - `import structlog`
96
+ - `from tenacity import retry, stop_after_attempt, wait_exponential`
97
+
98
+ - **Task 2.1.3**: Create `SerperWebSearchTool` class
99
+ - **Task 2.1.4**: Add `__init__(self, api_key: str | None = None)` method
100
+ - Line 2.1.4.1: Get API key from parameter or `settings.serper_api_key`
101
+ - Line 2.1.4.2: Validate API key is not None, raise `ConfigurationError` if missing
102
+ - Line 2.1.4.3: Initialize `SerperClient(api_key=self.api_key)`
103
+ - Line 2.1.4.4: Get rate limiter: `self._limiter = get_serper_limiter(self.api_key)`
104
+
105
+ - **Task 2.1.5**: Add `@property def name(self) -> str:` returning `"serper"`
106
+
107
+ - **Task 2.1.6**: Add `async def _rate_limit(self) -> None:` method
108
+ - Line 2.1.6.1: Call `await self._limiter.acquire()`
109
+
110
+ - **Task 2.1.7**: Add `@retry(...)` decorator with exponential backoff
111
+
112
+ - **Task 2.1.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
113
+ - Line 2.1.8.1: Call `await self._rate_limit()`
114
+ - Line 2.1.8.2: Preprocess query: `clean_query = preprocess_query(query)`
115
+ - Line 2.1.8.3: Use `clean_query if clean_query else query`
116
+ - Line 2.1.8.4: Call `search_results = await self._client.search(query, filter_for_relevance=False, max_results=max_results)`
117
+ - Line 2.1.8.5: Call `scraped = await scrape_urls(search_results)`
118
+ - Line 2.1.8.6: Convert `ScrapeResult` to `Evidence` objects:
119
+ - Line 2.1.8.6.1: Create `Citation` with `title`, `url`, `source="serper"`, `date="Unknown"`, `authors=[]`
120
+ - Line 2.1.8.6.2: Create `Evidence` with `content=scraped.text`, `citation`, `relevance=0.0`
121
+ - Line 2.1.8.7: Return `list[Evidence]`
122
+ - Line 2.1.8.8: Add try/except for `httpx.HTTPStatusError`:
123
+ - Line 2.1.8.8.1: Check for 429 status, raise `RateLimitError`
124
+ - Line 2.1.8.8.2: Otherwise raise `SearchError`
125
+ - Line 2.1.8.9: Add try/except for `httpx.TimeoutException`, raise `SearchError`
126
+ - Line 2.1.8.10: Add generic exception handler, log and raise `SearchError`
127
+
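+ For reference, "protocol-compliant" in Projects 2-3 means satisfying the `SearchTool` protocol from `src/tools/base.py`. The sketch below is an assumption about that protocol's shape, inferred from how it is used in this plan (a `name` property plus an async `search()` returning `list[Evidence]`); the real definition may differ in details.
+
+ ```python
+ # Assumed shape of the SearchTool protocol (src/tools/base.py).
+ # Inferred from this plan, not copied from the actual source.
+ from typing import Protocol, runtime_checkable
+
+ from src.utils.models import Evidence
+
+
+ @runtime_checkable
+ class SearchTool(Protocol):
+     @property
+     def name(self) -> str: ...
+
+     async def search(self, query: str, max_results: int = 10) -> list[Evidence]: ...
+ ```
+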
128
+ #### Activity 2.2: Implement SearchXNGWebSearchTool Class
129
+ **File**: `src/tools/searchxng_web_search.py`
130
+ - **Task 2.2.1**: Create new file `src/tools/searchxng_web_search.py`
131
+ - **Task 2.2.2**: Add imports (similar to SerperWebSearchTool)
132
+ - **Task 2.2.3**: Create `SearchXNGWebSearchTool` class
133
+ - **Task 2.2.4**: Add `__init__(self, host: str | None = None)` method
134
+ - Line 2.2.4.1: Get host from parameter or `settings.searchxng_host`
135
+ - Line 2.2.4.2: Validate host is not None, raise `ConfigurationError` if missing
136
+ - Line 2.2.4.3: Initialize `SearchXNGClient(host=self.host)`
137
+ - Line 2.2.4.4: Get rate limiter: `self._limiter = get_searchxng_limiter()`
138
+
139
+ - **Task 2.2.5**: Add `@property def name(self) -> str:` returning `"searchxng"`
140
+
141
+ - **Task 2.2.6**: Add `async def _rate_limit(self) -> None:` method
142
+
143
+ - **Task 2.2.7**: Add `@retry(...)` decorator
144
+
145
+ - **Task 2.2.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
146
+ - Line 2.2.8.1-2.2.8.10: Similar structure to SerperWebSearchTool
147
+
148
+ ---
149
+
150
+ ### PROJECT 3: Fix Existing WebSearchTool Protocol Compliance
151
+
152
+ #### Activity 3.1: Update WebSearchTool Class
153
+ **File**: `src/tools/web_search.py`
154
+ - **Task 3.1.1**: Add `@property def name(self) -> str:` method returning `"duckduckgo"` (after line 17)
155
+
156
+ - **Task 3.1.2**: Change `search()` return type from `SearchResult` to `list[Evidence]` (line 19)
157
+
158
+ - **Task 3.1.3**: Update `search()` method body:
159
+ - Line 3.1.3.1: Keep existing search logic (lines 21-43)
160
+ - Line 3.1.3.2: Instead of returning `SearchResult`, return `evidence` list directly (line 44)
161
+ - Line 3.1.3.3: Update exception handler to return empty list `[]` instead of `SearchResult` (line 51)
162
+
163
+ - **Task 3.1.4**: Add imports if needed:
164
+ - Line 3.1.4.1: `from src.utils.exceptions import SearchError`
165
+ - Line 3.1.4.2: Update exception handling to raise `SearchError` instead of returning error `SearchResult`
166
+
167
+ - **Task 3.1.5**: Add query preprocessing:
168
+ - Line 3.1.5.1: Import `from src.tools.query_utils import preprocess_query`
169
+ - Line 3.1.5.2: Add `clean_query = preprocess_query(query)` before search
170
+ - Line 3.1.5.3: Use `clean_query if clean_query else query`
171
+
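+ A minimal sketch of what Activity 3.1 might produce, assuming the existing tool wraps `duckduckgo_search.DDGS` (the tests in Project 6 mock DDGS); the result keys (`title`, `href`, `body`) follow that library's `text()` output, and `preprocess_query` is the helper named above.
+
+ ```python
+ # Hedged sketch of a protocol-compliant WebSearchTool (DuckDuckGo backend).
+ import asyncio
+
+ import structlog
+ from duckduckgo_search import DDGS
+
+ from src.tools.query_utils import preprocess_query
+ from src.utils.exceptions import SearchError
+ from src.utils.models import Citation, Evidence
+
+ logger = structlog.get_logger()
+
+
+ class WebSearchTool:
+     """DuckDuckGo web search returning list[Evidence] (SearchTool-compliant)."""
+
+     @property
+     def name(self) -> str:
+         return "duckduckgo"
+
+     async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
+         clean_query = preprocess_query(query)
+         final_query = clean_query if clean_query else query
+         try:
+             # DDGS is synchronous, so run it off the event loop
+             raw = await asyncio.to_thread(
+                 lambda: list(DDGS().text(final_query, max_results=max_results))
+             )
+         except Exception as e:
+             logger.error("DuckDuckGo search failed", error=str(e), query=final_query)
+             raise SearchError(f"DuckDuckGo search failed: {e}") from e
+
+         return [
+             Evidence(
+                 content=r.get("body", ""),
+                 citation=Citation(
+                     title=r.get("title", ""),
+                     url=r.get("href", ""),
+                     source="duckduckgo",
+                     date="Unknown",
+                     authors=[],
+                 ),
+                 relevance=0.0,
+             )
+             for r in raw
+         ]
+ ```
+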
172
+ #### Activity 3.2: Update Retrieval Agent Caller
173
+ **File**: `src/agents/retrieval_agent.py`
174
+ - **Task 3.2.1**: Update `search_web()` function (line 31):
175
+ - Line 3.2.1.1: Change `results = await _web_search.search(query, max_results)`
176
+ - Line 3.2.1.2: Change to `evidence = await _web_search.search(query, max_results)`
177
+ - Line 3.2.1.3: Update check: `if not evidence:` instead of `if not results.evidence:`
178
+ - Line 3.2.1.4: Update state update: `new_count = state.add_evidence(evidence)` instead of `results.evidence`
179
+ - Line 3.2.1.5: Update logging: `results_found=len(evidence)` instead of `len(results.evidence)`
180
+ - Line 3.2.1.6: Update output formatting: `for i, r in enumerate(evidence[:max_results], 1):` instead of `results.evidence[:max_results]`
181
+ - Line 3.2.1.7: Update deduplication: `await state.embedding_service.deduplicate(evidence)` instead of `results.evidence`
182
+ - Line 3.2.1.8: Update output message: `Found {len(evidence)} web results` instead of `len(results.evidence)`
183
+
184
+ ---
185
+
186
+ ### PROJECT 4: Integrate Web Search into SearchHandler
187
+
188
+ #### Activity 4.1: Create Web Search Tool Factory
189
+ **File**: `src/tools/web_search_factory.py`
190
+ - **Task 4.1.1**: Create new file `src/tools/web_search_factory.py`
191
+ - **Task 4.1.2**: Add imports:
192
+ - `from src.tools.web_search import WebSearchTool`
193
+ - `from src.tools.serper_web_search import SerperWebSearchTool`
194
+ - `from src.tools.searchxng_web_search import SearchXNGWebSearchTool`
195
+ - `from src.utils.config import settings`
196
+ - `from src.utils.exceptions import ConfigurationError`
197
+ - `import structlog`
198
+
199
+ - **Task 4.1.3**: Add `logger = structlog.get_logger()`
200
+
201
+ - **Task 4.1.4**: Create `def create_web_search_tool() -> SearchTool | None:` function (see the sketch after this list)
202
+ - Line 4.1.4.1: Check `settings.web_search_provider`
203
+ - Line 4.1.4.2: If `"serper"`:
204
+ - Line 4.1.4.2.1: Check `settings.serper_api_key` or `settings.web_search_available()`
205
+ - Line 4.1.4.2.2: If available, return `SerperWebSearchTool()`
206
+ - Line 4.1.4.2.3: Else log warning and return `None`
207
+ - Line 4.1.4.3: If `"searchxng"`:
208
+ - Line 4.1.4.3.1: Check `settings.searchxng_host` or `settings.web_search_available()`
209
+ - Line 4.1.4.3.2: If available, return `SearchXNGWebSearchTool()`
210
+ - Line 4.1.4.3.3: Else log warning and return `None`
211
+ - Line 4.1.4.4: If `"duckduckgo"`:
212
+ - Line 4.1.4.4.1: Return `WebSearchTool()` (always available)
213
+ - Line 4.1.4.5: If `"brave"` or `"tavily"`:
214
+ - Line 4.1.4.5.1: Log warning "Not yet implemented"
215
+ - Line 4.1.4.5.2: Return `None`
216
+ - Line 4.1.4.6: Default: return `WebSearchTool()` (fallback to DuckDuckGo)
217
+
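+ Task 4.1.4 could look roughly like the following. The settings attribute names are the ones listed under Configuration Requirements; treat the sketch as illustrative rather than final.
+
+ ```python
+ # Hedged sketch of src/tools/web_search_factory.py (Task 4.1.4).
+ import structlog
+
+ from src.tools.base import SearchTool
+ from src.tools.searchxng_web_search import SearchXNGWebSearchTool
+ from src.tools.serper_web_search import SerperWebSearchTool
+ from src.tools.web_search import WebSearchTool
+ from src.utils.config import settings
+
+ logger = structlog.get_logger()
+
+
+ def create_web_search_tool() -> SearchTool | None:
+     """Pick a web search tool based on settings.web_search_provider."""
+     provider = settings.web_search_provider
+
+     if provider == "serper":
+         if settings.serper_api_key:
+             return SerperWebSearchTool()
+         logger.warning("SERPER_API_KEY not set; web search disabled")
+         return None
+
+     if provider == "searchxng":
+         if settings.searchxng_host:
+             return SearchXNGWebSearchTool()
+         logger.warning("SEARCHXNG_HOST not set; web search disabled")
+         return None
+
+     if provider in ("brave", "tavily"):
+         logger.warning("Web search provider not yet implemented", provider=provider)
+         return None
+
+     # "duckduckgo" and any unknown value fall back to the keyless tool
+     return WebSearchTool()
+ ```
+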
218
+ #### Activity 4.2: Update SearchHandler Initialization
219
+ **File**: `src/app.py`
220
+ - **Task 4.2.1**: Add import: `from src.tools.web_search_factory import create_web_search_tool`
221
+
222
+ - **Task 4.2.2**: Update `configure_orchestrator()` function (around line 73):
223
+ - Line 4.2.2.1: Before creating `SearchHandler`, call `web_search_tool = create_web_search_tool()`
224
+ - Line 4.2.2.2: Create tools list: `tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]`
225
+ - Line 4.2.2.3: If `web_search_tool is not None`:
226
+ - Line 4.2.2.3.1: Append `web_search_tool` to tools list
227
+ - Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
228
+ - Line 4.2.2.4: Update `SearchHandler` initialization to use `tools` list
229
+
230
+ ---
231
+
232
+ ### PROJECT 5: Update Callers and Dependencies
233
+
234
+ #### Activity 5.1: Update web_search_adapter
235
+ **File**: `src/tools/web_search_adapter.py`
236
+ - **Task 5.1.1**: Update `web_search()` function to use the new implementation (sketched after this list):
237
+ - Line 5.1.1.1: Import `from src.tools.web_search_factory import create_web_search_tool`
238
+ - Line 5.1.1.2: Remove dependency on `folder.tools.web_search`
239
+ - Line 5.1.1.3: Get tool: `tool = create_web_search_tool()`
240
+ - Line 5.1.1.4: If `tool is None`, return error message
241
+ - Line 5.1.1.5: Call `evidence = await tool.search(query, max_results=5)`
242
+ - Line 5.1.1.6: Convert `Evidence` objects to formatted string:
243
+ - Line 5.1.1.6.1: Format each evidence with title, URL, content preview
244
+ - Line 5.1.1.7: Return formatted string
245
+
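+ Activity 5.1 might end up roughly like this. The adapter's signature (an async `web_search(query) -> str`) is assumed from how `tool_executor.py` and `planner_agent.py` call it; adjust to the real call sites.
+
+ ```python
+ # Hedged sketch of the updated src/tools/web_search_adapter.py (Activity 5.1).
+ from src.tools.web_search_factory import create_web_search_tool
+
+
+ async def web_search(query: str, max_results: int = 5) -> str:
+     """Run a web search and return a formatted text summary for LLM consumption."""
+     tool = create_web_search_tool()
+     if tool is None:
+         return "Web search is not configured (no provider available)."
+
+     evidence = await tool.search(query, max_results=max_results)
+     if not evidence:
+         return f"No web results found for: {query}"
+
+     lines = [f"Web results for '{query}':"]
+     for i, ev in enumerate(evidence, 1):
+         preview = ev.content[:300].replace("\n", " ")
+         lines.append(f"{i}. {ev.citation.title} ({ev.citation.url})\n   {preview}...")
+     return "\n".join(lines)
+ ```
+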
246
+ #### Activity 5.2: Update Tool Executor
247
+ **File**: `src/tools/tool_executor.py`
248
+ - **Task 5.2.1**: Verify `web_search_adapter.web_search()` usage (line 86) still works
249
+ - **Task 5.2.2**: No changes needed if adapter is updated correctly
250
+
251
+ #### Activity 5.3: Update Planner Agent
252
+ **File**: `src/orchestrator/planner_agent.py`
253
+ - **Task 5.3.1**: Verify `web_search_adapter.web_search()` usage (line 14) still works
254
+ - **Task 5.3.2**: No changes needed if adapter is updated correctly
255
+
256
+ #### Activity 5.4: Remove Legacy Dependencies
257
+ **File**: `src/tools/web_search_adapter.py`
258
+ - **Task 5.4.1**: Remove import of `folder.llm_config` and `folder.tools.web_search`
259
+ - **Task 5.4.2**: Update error messages to reflect new implementation
260
+
261
+ ---
262
+
263
+ ### PROJECT 6: Testing and Validation
264
+
265
+ #### Activity 6.1: Unit Tests for Vendored Components
266
+ **File**: `tests/unit/tools/test_vendored_web_search_core.py`
267
+ - **Task 6.1.1**: Test `scrape_urls()` function
268
+ - **Task 6.1.2**: Test `fetch_and_process_url()` function
269
+ - **Task 6.1.3**: Test `html_to_text()` function
270
+ - **Task 6.1.4**: Test `is_valid_url()` function
271
+
272
+ **File**: `tests/unit/tools/test_vendored_serper_client.py`
273
+ - **Task 6.1.5**: Mock SerperClient API calls
274
+ - **Task 6.1.6**: Test successful search
275
+ - **Task 6.1.7**: Test error handling
276
+ - **Task 6.1.8**: Test rate limiting
277
+
278
+ **File**: `tests/unit/tools/test_vendored_searchxng_client.py`
279
+ - **Task 6.1.9**: Mock SearchXNGClient API calls
280
+ - **Task 6.1.10**: Test successful search
281
+ - **Task 6.1.11**: Test error handling
282
+ - **Task 6.1.12**: Test rate limiting
283
+
284
+ #### Activity 6.2: Unit Tests for Web Search Tools
285
+ **File**: `tests/unit/tools/test_serper_web_search.py`
286
+ - **Task 6.2.1**: Test `SerperWebSearchTool.__init__()` with valid API key
287
+ - **Task 6.2.2**: Test `SerperWebSearchTool.__init__()` without API key (should raise)
288
+ - **Task 6.2.3**: Test `name` property returns `"serper"`
289
+ - **Task 6.2.4**: Test `search()` returns `list[Evidence]`
290
+ - **Task 6.2.5**: Test `search()` with mocked SerperClient
291
+ - **Task 6.2.6**: Test error handling (SearchError, RateLimitError)
292
+ - **Task 6.2.7**: Test query preprocessing
293
+ - **Task 6.2.8**: Test rate limiting
294
+
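+ A minimal sketch of Tasks 6.2.2-6.2.5, assuming `pytest-asyncio` and the module paths used throughout this plan; the mocked objects follow the vendored `WebpageSnippet`/`ScrapeResult` fields.
+
+ ```python
+ # Hedged sketch for tests/unit/tools/test_serper_web_search.py.
+ from unittest.mock import AsyncMock, patch
+
+ import pytest
+
+ from src.tools.serper_web_search import SerperWebSearchTool
+ from src.tools.vendored.web_search_core import ScrapeResult, WebpageSnippet
+ from src.utils.exceptions import ConfigurationError
+ from src.utils.models import Evidence
+
+
+ def test_init_without_api_key_raises() -> None:
+     # Task 6.2.2: no key via argument or settings should raise ConfigurationError
+     with patch("src.tools.serper_web_search.settings") as mock_settings:
+         mock_settings.serper_api_key = None
+         with pytest.raises(ConfigurationError):
+             SerperWebSearchTool()
+
+
+ @pytest.mark.asyncio
+ async def test_search_returns_evidence_list() -> None:
+     # Tasks 6.2.3-6.2.5: name property plus search() with mocked client and scraper
+     tool = SerperWebSearchTool(api_key="test-key")
+     assert tool.name == "serper"
+
+     snippet = WebpageSnippet(url="https://example.org", title="Example", description="desc")
+     scraped = ScrapeResult(url=snippet.url, title=snippet.title, description="desc", text="body")
+
+     with (
+         patch.object(tool, "_client") as mock_client,
+         patch("src.tools.serper_web_search.scrape_urls", new=AsyncMock(return_value=[scraped])),
+         patch.object(tool, "_rate_limit", new=AsyncMock()),
+     ):
+         mock_client.search = AsyncMock(return_value=[snippet])
+         results = await tool.search("metformin cardiovascular outcomes", max_results=3)
+
+     assert isinstance(results, list)
+     assert all(isinstance(ev, Evidence) for ev in results)
+     assert results[0].citation.source == "serper"
+ ```
+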
295
+ **File**: `tests/unit/tools/test_searchxng_web_search.py`
296
+ - **Task 6.2.9**: Similar tests for SearchXNGWebSearchTool
297
+
298
+ **File**: `tests/unit/tools/test_web_search.py`
299
+ - **Task 6.2.10**: Test `WebSearchTool.name` property returns `"duckduckgo"`
300
+ - **Task 6.2.11**: Test `WebSearchTool.search()` returns `list[Evidence]`
301
+ - **Task 6.2.12**: Test `WebSearchTool.search()` with mocked DDGS
302
+ - **Task 6.2.13**: Test error handling
303
+ - **Task 6.2.14**: Test query preprocessing
304
+
305
+ #### Activity 6.3: Integration Tests
306
+ **File**: `tests/integration/test_web_search_integration.py`
307
+ - **Task 6.3.1**: Test `SerperWebSearchTool` with real API (marked `@pytest.mark.integration`)
308
+ - **Task 6.3.2**: Test `SearchXNGWebSearchTool` with real API (marked `@pytest.mark.integration`)
309
+ - **Task 6.3.3**: Test `WebSearchTool` with real DuckDuckGo (marked `@pytest.mark.integration`)
310
+ - **Task 6.3.4**: Test `create_web_search_tool()` factory function
311
+ - **Task 6.3.5**: Test SearchHandler with web search tool
312
+
313
+ #### Activity 6.4: Update Existing Tests
314
+ **File**: `tests/unit/agents/test_retrieval_agent.py`
315
+ - **Task 6.4.1**: Update tests to expect `list[Evidence]` instead of `SearchResult`
316
+ - **Task 6.4.2**: Mock `WebSearchTool.search()` to return `list[Evidence]`
317
+
318
+ **File**: `tests/unit/tools/test_tool_executor.py`
319
+ - **Task 6.4.3**: Verify tests still pass with updated `web_search_adapter`
320
+
321
+ ---
322
+
323
+ ## Implementation Order
324
+
325
+ 1. **PROJECT 1**: Vendor core components (foundation)
326
+ 2. **PROJECT 3**: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
327
+ 3. **PROJECT 2**: Create SerperWebSearchTool (new functionality)
328
+ 4. **PROJECT 4**: Integrate into SearchHandler (main integration)
329
+ 5. **PROJECT 5**: Update callers (clean up dependencies)
330
+ 6. **PROJECT 6**: Testing (validation)
331
+
332
+ ---
333
+
334
+ ## Dependencies and Prerequisites
335
+
336
+ ### External Dependencies
337
+ - `aiohttp` - Already in requirements
338
+ - `beautifulsoup4` - Already in requirements
339
+ - `duckduckgo-search` - Already in requirements
340
+ - `tenacity` - Already in requirements
341
+ - `structlog` - Already in requirements
342
+
343
+ ### Internal Dependencies
344
+ - `src/tools/base.py` - SearchTool protocol
345
+ - `src/tools/rate_limiter.py` - Rate limiting utilities
346
+ - `src/tools/query_utils.py` - Query preprocessing
347
+ - `src/utils/config.py` - Settings and configuration
348
+ - `src/utils/exceptions.py` - Custom exceptions
349
+ - `src/utils/models.py` - Evidence, Citation models
350
+
351
+ ### Configuration Requirements
352
+ - `SERPER_API_KEY` - For Serper provider
353
+ - `SEARCHXNG_HOST` - For SearchXNG provider
354
+ - `WEB_SEARCH_PROVIDER` - Environment variable (default: "duckduckgo")
355
+
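+ A hedged sketch of how these three settings might surface in `src/utils/config.py`, assuming the existing `Settings` object is pydantic-settings based (field names mirror the env vars above; the actual class layout may differ):
+
+ ```python
+ # Illustrative only: assumed settings fields this plan relies on.
+ from pydantic_settings import BaseSettings
+
+
+ class Settings(BaseSettings):
+     serper_api_key: str | None = None        # SERPER_API_KEY
+     searchxng_host: str | None = None        # SEARCHXNG_HOST
+     web_search_provider: str = "duckduckgo"  # WEB_SEARCH_PROVIDER
+
+     def web_search_available(self) -> bool:
+         """True if the configured provider has the credentials it needs."""
+         if self.web_search_provider == "serper":
+             return bool(self.serper_api_key)
+         if self.web_search_provider == "searchxng":
+             return bool(self.searchxng_host)
+         return True  # DuckDuckGo needs no credentials
+
+
+ settings = Settings()
+ ```
+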
356
+ ---
357
+
358
+ ## Risk Assessment
359
+
360
+ ### High Risk
361
+ - **Breaking changes to retrieval_agent.py**: Must update carefully to handle `list[Evidence]` instead of `SearchResult`
362
+ - **Legacy folder dependencies**: Need to ensure all code is properly vendored
363
+
364
+ ### Medium Risk
365
+ - **Rate limiting**: Serper API may have different limits than expected
366
+ - **Error handling**: Need to handle API failures gracefully
367
+
368
+ ### Low Risk
369
+ - **Query preprocessing**: May need adjustment for web search vs PubMed
370
+ - **Testing**: Integration tests require API keys
371
+
372
+ ---
373
+
374
+ ## Success Criteria
375
+
376
+ 1. ✅ `SerperWebSearchTool` implements `SearchTool` protocol correctly
377
+ 2. ✅ `WebSearchTool` implements `SearchTool` protocol correctly
378
+ 3. ✅ Both tools can be added to `SearchHandler`
379
+ 4. ✅ `web_search_adapter` works with new implementation
380
+ 5. ✅ `retrieval_agent` works with updated `WebSearchTool`
381
+ 6. ✅ All unit tests pass
382
+ 7. ✅ Integration tests pass (with API keys)
383
+ 8. ✅ No dependencies on `folder/tools/web_search.py` in `src/` code
384
+ 9. ✅ Configuration supports multiple providers
385
+ 10. ✅ Error handling is robust
386
+
387
+ ---
388
+
389
+ ## Notes
390
+
391
+ - The vendored code should be self-contained and not depend on `folder/` modules
392
+ - Filter agent functionality from original code is removed (can be added later if needed)
393
+ - Rate limiting follows the same pattern as the PubMed tool
394
+ - Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
395
+ - Consider adding relevance scoring in the future
396
+
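+ To make the query-preprocessing note concrete, a hedged illustration of "less aggressive" preprocessing for web queries; this is a hypothetical helper, not the existing `src/tools/query_utils.preprocess_query`:
+
+ ```python
+ # Illustrative web-query cleanup: drop PubMed-style field tags and boolean noise
+ # while keeping the free text intact. Hypothetical helper, named for clarity only.
+ import re
+
+
+ def preprocess_web_query(query: str) -> str:
+     cleaned = re.sub(r"\[[a-z ]+\]", "", query, flags=re.IGNORECASE)   # e.g. [tiab], [mesh]
+     cleaned = re.sub(r"\b(AND|OR|NOT)\b", " ", cleaned)                # engines treat these as plain words
+     return " ".join(cleaned.split())
+ ```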
requirements.txt CHANGED
@@ -35,9 +35,6 @@ pydantic-graph>=1.22.0
35
  # Web search
36
  duckduckgo-search>=5.0
37
 
38
- # Multi-agent orchestration (Advanced mode)
39
- agent-framework-core>=1.0.0b251120,<2.0.0
40
-
41
  # LlamaIndex RAG
42
  llama-index-llms-huggingface>=0.6.1
43
  llama-index-llms-huggingface-api>=0.6.1
@@ -51,28 +48,23 @@ pillow>=10.0.0 # For image processing
51
 
52
  # TTS dependencies (for Modal GPU TTS)
53
  torch>=2.0.0 # Required by Kokoro TTS
54
- transformers>=4.30.0 # Required by Kokoro TTS
55
  modal>=0.63.0 # Required for TTS GPU execution
56
  # Note: Kokoro is installed in Modal image from: git+https://github.com/hexgrad/kokoro.git
57
 
58
- # Multi-agent orchestration (Advanced mode) - from optional magentic
59
- agent-framework-core>=1.0.0b251120,<2.0.0
60
- llama-index-llms-openai>=0.6.9
61
- llama-index-embeddings-openai>=0.5.1
62
-
63
  # Embeddings & Vector Store
64
  tokenizers>=0.22.0,<=0.23.0
65
- transformers>=4.57.2
66
- chromadb>=0.4.0
67
  rpds-py>=0.29.0 # Python implementation of rpds (required by chromadb on Windows)
 
68
  sentence-transformers>=2.2.0
69
- numpy<2.0
70
 
71
- # Optional: Modal for code execution
72
- modal>=0.63.0
73
 
74
- # LlamaIndex RAG - from optional modal
75
- llama-index-llms-openai
76
- llama-index-embeddings-openai
77
 
78
- pydantic-ai-slim[huggingface]>=0.0.18
 
 
 
35
  # Web search
36
  duckduckgo-search>=5.0
37
 
 
 
 
38
  # LlamaIndex RAG
39
  llama-index-llms-huggingface>=0.6.1
40
  llama-index-llms-huggingface-api>=0.6.1
 
48
 
49
  # TTS dependencies (for Modal GPU TTS)
50
  torch>=2.0.0 # Required by Kokoro TTS
51
+ transformers>=4.57.2 # Required by Kokoro TTS
52
  modal>=0.63.0 # Required for TTS GPU execution
53
  # Note: Kokoro is installed in Modal image from: git+https://github.com/hexgrad/kokoro.git
54
 
 
 
 
 
 
55
  # Embeddings & Vector Store
56
  tokenizers>=0.22.0,<=0.23.0
 
 
57
  rpds-py>=0.29.0 # Python implementation of rpds (required by chromadb on Windows)
58
+ chromadb>=0.4.0
59
  sentence-transformers>=2.2.0
60
+ numpy<2.0 # chromadb compatibility: uses np.float_ removed in NumPy 2.0
61
 
62
+ # Pydantic AI with HuggingFace support
63
+ pydantic-ai-slim[huggingface]>=0.0.18
64
 
65
+ # Multi-agent orchestration (Advanced mode)
66
+ agent-framework-core>=1.0.0b251120,<2.0.0
 
67
 
68
+ # LlamaIndex RAG - OpenAI
69
+ llama-index-llms-openai>=0.6.9
70
+ llama-index-embeddings-openai>=0.5.1
src/agent_factory/judges.py CHANGED
@@ -38,54 +38,27 @@ def get_model(oauth_token: str | None = None) -> Any:
38
  Args:
39
  oauth_token: Optional OAuth token from HuggingFace login (takes priority over env vars)
40
  """
41
- # Priority: oauth_token > env vars
42
  effective_hf_token = oauth_token or settings.hf_token or settings.huggingface_api_key
43
 
44
- # If OAuth token is available, prefer HuggingFace (free tier on Spaces)
45
- if effective_hf_token:
46
- model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
47
- hf_provider = HuggingFaceProvider(api_key=effective_hf_token)
48
- logger.info(
49
- "using_huggingface_with_token",
50
- has_oauth=bool(oauth_token),
51
- model=model_name,
52
  )
53
- return HuggingFaceModel(model_name, provider=hf_provider)
54
-
55
- llm_provider = settings.llm_provider
56
-
57
- if llm_provider == "anthropic":
58
- if not settings.anthropic_api_key:
59
- logger.warning("Anthropic provider selected but no API key available, defaulting to HuggingFace")
60
- # Fallback to HuggingFace without token (public models)
61
- model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
62
- hf_provider = HuggingFaceProvider(api_key=None)
63
- return HuggingFaceModel(model_name, provider=hf_provider)
64
- provider = AnthropicProvider(api_key=settings.anthropic_api_key)
65
- return AnthropicModel(settings.anthropic_model, provider=provider)
66
-
67
- if llm_provider == "huggingface":
68
- # No token available, use public models
69
- model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
70
- hf_provider = HuggingFaceProvider(api_key=None)
71
- return HuggingFaceModel(model_name, provider=hf_provider)
72
-
73
- if llm_provider == "openai":
74
- if not settings.openai_api_key:
75
- logger.warning("OpenAI provider selected but no API key available, defaulting to HuggingFace")
76
- # Fallback to HuggingFace without token (public models)
77
- model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
78
- hf_provider = HuggingFaceProvider(api_key=None)
79
- return HuggingFaceModel(model_name, provider=hf_provider)
80
- openai_provider = OpenAIProvider(api_key=settings.openai_api_key)
81
- return OpenAIModel(settings.openai_model, provider=openai_provider)
82
-
83
- # Default to HuggingFace if provider is unknown or not specified
84
- if llm_provider not in ("huggingface", "openai", "anthropic"):
85
- logger.warning("Unknown LLM provider, defaulting to HuggingFace", provider=llm_provider)
86
 
 
87
  model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
88
- hf_provider = HuggingFaceProvider(api_key=None) # Public models
 
 
 
 
 
 
89
  return HuggingFaceModel(model_name, provider=hf_provider)
90
 
91
 
 
38
  Args:
39
  oauth_token: Optional OAuth token from HuggingFace login (takes priority over env vars)
40
  """
41
+ # Priority: oauth_token > settings.hf_token > settings.huggingface_api_key
42
  effective_hf_token = oauth_token or settings.hf_token or settings.huggingface_api_key
43
 
44
+ # HuggingFaceProvider requires a token - cannot use None
45
+ if not effective_hf_token:
46
+ raise ConfigurationError(
47
+ "HuggingFace token required. Please either:\n"
48
+ "1. Log in via HuggingFace OAuth (recommended for Spaces)\n"
49
+ "2. Set HF_TOKEN environment variable\n"
50
+ "3. Set huggingface_api_key in settings"
 
51
  )
 
52
 
53
+ # Always use HuggingFace with available token
54
  model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
55
+ hf_provider = HuggingFaceProvider(api_key=effective_hf_token)
56
+ logger.info(
57
+ "using_huggingface_with_token",
58
+ has_oauth=bool(oauth_token),
59
+ has_settings_token=bool(settings.hf_token or settings.huggingface_api_key),
60
+ model=model_name,
61
+ )
62
  return HuggingFaceModel(model_name, provider=hf_provider)
63
 
64
 
src/agents/retrieval_agent.py CHANGED
@@ -28,28 +28,28 @@ async def search_web(query: str, max_results: int = 10) -> str:
28
  logger.info("Web search starting", query=query, max_results=max_results)
29
  state = get_magentic_state()
30
 
31
- results = await _web_search.search(query, max_results)
32
- if not results.evidence:
33
  logger.info("Web search returned no results", query=query)
34
  return f"No web results found for: {query}"
35
 
36
  # Update state
37
  # We add *all* found results to state
38
- new_count = state.add_evidence(results.evidence)
39
  logger.info(
40
  "Web search complete",
41
  query=query,
42
- results_found=len(results.evidence),
43
  new_evidence=new_count,
44
  )
45
 
46
  # Use embedding service for deduplication/indexing if available
47
  if state.embedding_service:
48
  # This method also adds to vector DB as a side effect for unique items
49
- await state.embedding_service.deduplicate(results.evidence)
50
 
51
- output = [f"Found {len(results.evidence)} web results ({new_count} new stored):\n"]
52
- for i, r in enumerate(results.evidence[:max_results], 1):
53
  output.append(f"{i}. **{r.citation.title}**")
54
  output.append(f" Source: {r.citation.url}")
55
  output.append(f" {r.content[:300]}...\n")
 
28
  logger.info("Web search starting", query=query, max_results=max_results)
29
  state = get_magentic_state()
30
 
31
+ evidence = await _web_search.search(query, max_results)
32
+ if not evidence:
33
  logger.info("Web search returned no results", query=query)
34
  return f"No web results found for: {query}"
35
 
36
  # Update state
37
  # We add *all* found results to state
38
+ new_count = state.add_evidence(evidence)
39
  logger.info(
40
  "Web search complete",
41
  query=query,
42
+ results_found=len(evidence),
43
  new_evidence=new_count,
44
  )
45
 
46
  # Use embedding service for deduplication/indexing if available
47
  if state.embedding_service:
48
  # This method also adds to vector DB as a side effect for unique items
49
+ await state.embedding_service.deduplicate(evidence)
50
 
51
+ output = [f"Found {len(evidence)} web results ({new_count} new stored):\n"]
52
+ for i, r in enumerate(evidence[:max_results], 1):
53
  output.append(f"{i}. **{r.citation.title}**")
54
  output.append(f" Source: {r.citation.url}")
55
  output.append(f" {r.content[:300]}...\n")
src/app.py CHANGED
@@ -30,6 +30,7 @@ from src.agent_factory.judges import HFInferenceJudgeHandler, JudgeHandler, Mock
30
  from src.orchestrator_factory import create_orchestrator
31
  from src.services.audio_processing import get_audio_service
32
  from src.services.multimodal_processing import get_multimodal_service
 
33
  from src.tools.clinicaltrials import ClinicalTrialsTool
34
  from src.tools.europepmc import EuropePMCTool
35
  from src.tools.pubmed import PubMedTool
@@ -37,6 +38,8 @@ from src.tools.search_handler import SearchHandler
37
  from src.utils.config import settings
38
  from src.utils.models import AgentEvent, OrchestratorConfig
39
 
 
 
40
 
41
  def configure_orchestrator(
42
  use_mock: bool = False,
@@ -70,8 +73,18 @@ def configure_orchestrator(
70
 
71
  # Create search tools with RAG enabled
72
  # Pass OAuth token to SearchHandler so it can be used by RAG service
73
  search_handler = SearchHandler(
74
- tools=[PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()],
75
  timeout=config.search_timeout,
76
  include_rag=True,
77
  auto_ingest_to_rag=True,
@@ -150,6 +163,49 @@ def configure_orchestrator(
150
  return orchestrator, backend_info
151
 
152
 
153
  def event_to_chat_message(event: AgentEvent) -> dict[str, Any]:
154
  """
155
  Convert AgentEvent to gr.ChatMessage with metadata for accordion display.
@@ -183,11 +239,61 @@ def event_to_chat_message(event: AgentEvent) -> dict[str, Any]:
183
 
184
  # For complete events, return main response without accordion
185
  if event.type == "complete":
186
  # Return as dict format for Gradio Chatbot compatibility
187
- return {
188
  "role": "assistant",
189
- "content": event.message,
190
  }
191
 
192
  # Build metadata for accordion according to Gradio ChatMessage spec
193
  # Metadata keys: title (str), status ("pending"|"done"), log (str), duration (float)
 
30
  from src.orchestrator_factory import create_orchestrator
31
  from src.services.audio_processing import get_audio_service
32
  from src.services.multimodal_processing import get_multimodal_service
33
+ import structlog
34
  from src.tools.clinicaltrials import ClinicalTrialsTool
35
  from src.tools.europepmc import EuropePMCTool
36
  from src.tools.pubmed import PubMedTool
 
38
  from src.utils.config import settings
39
  from src.utils.models import AgentEvent, OrchestratorConfig
40
 
41
+ logger = structlog.get_logger()
42
+
43
 
44
  def configure_orchestrator(
45
  use_mock: bool = False,
 
73
 
74
  # Create search tools with RAG enabled
75
  # Pass OAuth token to SearchHandler so it can be used by RAG service
76
+ tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
77
+
78
+ # Add web search tool if available
79
+ from src.tools.web_search_factory import create_web_search_tool
80
+
81
+ web_search_tool = create_web_search_tool()
82
+ if web_search_tool is not None:
83
+ tools.append(web_search_tool)
84
+ logger.info("Web search tool added to search handler", provider=web_search_tool.name)
85
+
86
  search_handler = SearchHandler(
87
+ tools=tools,
88
  timeout=config.search_timeout,
89
  include_rag=True,
90
  auto_ingest_to_rag=True,
 
163
  return orchestrator, backend_info
164
 
165
 
166
+ def _is_file_path(text: str) -> bool:
167
+ """Check if text appears to be a file path.
168
+
169
+ Args:
170
+ text: Text to check
171
+
172
+ Returns:
173
+ True if text looks like a file path
174
+ """
175
+ import os
176
+ # Check for common file extensions
177
+ file_extensions = ['.md', '.pdf', '.txt', '.json', '.csv', '.xlsx', '.docx', '.html']
178
+ text_lower = text.lower().strip()
179
+
180
+ # Check if it ends with a file extension
181
+ if any(text_lower.endswith(ext) for ext in file_extensions):
182
+ # Check if it's a valid path (absolute or relative)
183
+ if os.path.sep in text or '/' in text or '\\' in text:
184
+ return True
185
+ # Or if it's just a filename with extension
186
+ if '.' in text and len(text.split('.')) == 2:
187
+ return True
188
+
189
+ # Check if it's an absolute path
190
+ if os.path.isabs(text):
191
+ return True
192
+
193
+ return False
194
+
195
+
196
+ def _get_file_name(file_path: str) -> str:
197
+ """Extract filename from file path.
198
+
199
+ Args:
200
+ file_path: Full file path
201
+
202
+ Returns:
203
+ Filename with extension
204
+ """
205
+ import os
206
+ return os.path.basename(file_path)
207
+
208
+
209
  def event_to_chat_message(event: AgentEvent) -> dict[str, Any]:
210
  """
211
  Convert AgentEvent to gr.ChatMessage with metadata for accordion display.
 
239
 
240
  # For complete events, return main response without accordion
241
  if event.type == "complete":
242
+ # Check if event contains file information
243
+ content = event.message
244
+ files: list[str] | None = None
245
+
246
+ # Check event.data for file paths
247
+ if event.data and isinstance(event.data, dict):
248
+ # Support both "files" (list) and "file" (single path) keys
249
+ if "files" in event.data:
250
+ files = event.data["files"]
251
+ if isinstance(files, str):
252
+ files = [files]
253
+ elif not isinstance(files, list):
254
+ files = None
255
+ else:
256
+ # Filter to only valid file paths
257
+ files = [f for f in files if isinstance(f, str) and _is_file_path(f)]
258
+ elif "file" in event.data:
259
+ file_path = event.data["file"]
260
+ if isinstance(file_path, str) and _is_file_path(file_path):
261
+ files = [file_path]
262
+
263
+ # Also check if message itself is a file path (less common, but possible)
264
+ if not files and isinstance(event.message, str) and _is_file_path(event.message):
265
+ files = [event.message]
266
+ # Keep message as text description
267
+ content = "Report generated. Download available below."
268
+
269
  # Return as dict format for Gradio Chatbot compatibility
270
+ result: dict[str, Any] = {
271
  "role": "assistant",
272
+ "content": content,
273
  }
274
+
275
+ # Add files if present
276
+ # Gradio Chatbot supports file paths in content as markdown links
277
+ # The links will be clickable and downloadable
278
+ if files:
279
+ # Validate files exist before including them
280
+ import os
281
+ valid_files = [f for f in files if os.path.exists(f)]
282
+
283
+ if valid_files:
284
+ # Format files for Gradio: include as markdown download links
285
+ file_links = "\n\n".join([
286
+ f"📎 [Download: {_get_file_name(f)}]({f})"
287
+ for f in valid_files
288
+ ])
289
+ result["content"] = f"{content}\n\n{file_links}"
290
+
291
+ # Also store in metadata for potential future use
292
+ if "metadata" not in result:
293
+ result["metadata"] = {}
294
+ result["metadata"]["files"] = valid_files
295
+
296
+ return result
297
 
298
  # Build metadata for accordion according to Gradio ChatMessage spec
299
  # Metadata keys: title (str), status ("pending"|"done"), log (str), duration (float)
src/orchestrator/graph_orchestrator.py CHANGED
@@ -533,10 +533,33 @@ class GraphOrchestrator:
533
 
534
  # Final event
535
  final_result = context.get_node_result(current_node_id) if current_node_id else None
 
536
  yield AgentEvent(
537
  type="complete",
538
- message=final_result if isinstance(final_result, str) else "Research completed",
539
- data={"mode": self.mode, "iterations": iteration},
540
  iteration=iteration,
541
  )
542
 
 
533
 
534
  # Final event
535
  final_result = context.get_node_result(current_node_id) if current_node_id else None
536
+
537
+ # Check if final result contains file information
538
+ event_data: dict[str, Any] = {"mode": self.mode, "iterations": iteration}
539
+ message: str = "Research completed"
540
+
541
+ if isinstance(final_result, str):
542
+ message = final_result
543
+ elif isinstance(final_result, dict):
544
+ # If result is a dict, check for file paths
545
+ if "file" in final_result:
546
+ file_path = final_result["file"]
547
+ if isinstance(file_path, str):
548
+ event_data["file"] = file_path
549
+ message = final_result.get("message", "Report generated. Download available.")
550
+ elif "files" in final_result:
551
+ files = final_result["files"]
552
+ if isinstance(files, list):
553
+ event_data["files"] = files
554
+ message = final_result.get("message", "Report generated. Downloads available.")
555
+ elif isinstance(files, str):
556
+ event_data["files"] = [files]
557
+ message = final_result.get("message", "Report generated. Download available.")
558
+
559
  yield AgentEvent(
560
  type="complete",
561
+ message=message,
562
+ data=event_data,
563
  iteration=iteration,
564
  )
565
 
src/tools/rate_limiter.py CHANGED
@@ -93,6 +93,33 @@ def reset_pubmed_limiter() -> None:
93
  _pubmed_limiter = None
94
 
95
 
96
  # Factory for other APIs
97
  class RateLimiterFactory:
98
  """Factory for creating/getting rate limiters for different APIs."""
 
93
  _pubmed_limiter = None
94
 
95
 
96
+ def get_serper_limiter(api_key: str | None = None) -> RateLimiter:
97
+ """
98
+ Get the shared Serper API rate limiter.
99
+
100
+ Rate: 10 requests/second (Serper API limit)
101
+
102
+ Args:
103
+ api_key: Serper API key (optional, for consistency with other limiters)
104
+
105
+ Returns:
106
+ Shared RateLimiter instance
107
+ """
108
+ return RateLimiterFactory.get("serper", "10/second")
109
+
110
+
111
+ def get_searchxng_limiter() -> RateLimiter:
112
+ """
113
+ Get the shared SearchXNG API rate limiter.
114
+
115
+ Rate: 5 requests/second (conservative limit)
116
+
117
+ Returns:
118
+ Shared RateLimiter instance
119
+ """
120
+ return RateLimiterFactory.get("searchxng", "5/second")
121
+
122
+
123
  # Factory for other APIs
124
  class RateLimiterFactory:
125
  """Factory for creating/getting rate limiters for different APIs."""
src/tools/searchxng_web_search.py ADDED
@@ -0,0 +1,119 @@
1
+ """SearchXNG web search tool using SearchXNG API for Google searches."""
2
+
3
+ from typing import Any
4
+
5
+ import structlog
6
+ from tenacity import retry, stop_after_attempt, wait_exponential
7
+
8
+ from src.tools.base import SearchTool
9
+ from src.tools.query_utils import preprocess_query
10
+ from src.tools.rate_limiter import get_searchxng_limiter
11
+ from src.tools.vendored.searchxng_client import SearchXNGClient
12
+ from src.tools.vendored.web_search_core import scrape_urls
13
+ from src.utils.config import settings
14
+ from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
15
+ from src.utils.models import Citation, Evidence
16
+
17
+ logger = structlog.get_logger()
18
+
19
+
20
+ class SearchXNGWebSearchTool:
21
+ """Tool for searching the web using SearchXNG API (Google search)."""
22
+
23
+ def __init__(self, host: str | None = None) -> None:
24
+ """Initialize SearchXNG web search tool.
25
+
26
+ Args:
27
+ host: SearchXNG host URL. If None, reads from settings.
28
+
29
+ Raises:
30
+ ConfigurationError: If no host is available.
31
+ """
32
+ self.host = host or settings.searchxng_host
33
+ if not self.host:
34
+ raise ConfigurationError(
35
+ "SearchXNG host required. Set SEARCHXNG_HOST environment variable or searchxng_host in settings."
36
+ )
37
+
38
+ self._client = SearchXNGClient(host=self.host)
39
+ self._limiter = get_searchxng_limiter()
40
+
41
+ @property
42
+ def name(self) -> str:
43
+ """Return the name of this search tool."""
44
+ return "searchxng"
45
+
46
+ async def _rate_limit(self) -> None:
47
+ """Enforce SearchXNG API rate limiting."""
48
+ await self._limiter.acquire()
49
+
50
+ @retry(
51
+ stop=stop_after_attempt(3),
52
+ wait=wait_exponential(multiplier=1, min=1, max=10),
53
+ reraise=True,
54
+ )
55
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
56
+ """Execute a web search using SearchXNG API.
57
+
58
+ Args:
59
+ query: The search query string
60
+ max_results: Maximum number of results to return
61
+
62
+ Returns:
63
+ List of Evidence objects
64
+
65
+ Raises:
66
+ SearchError: If the search fails
67
+ RateLimitError: If rate limit is exceeded
68
+ """
69
+ await self._rate_limit()
70
+
71
+ # Preprocess query to remove noise
72
+ clean_query = preprocess_query(query)
73
+ final_query = clean_query if clean_query else query
74
+
75
+ try:
76
+ # Get search results (snippets)
77
+ search_results = await self._client.search(
78
+ final_query, filter_for_relevance=False, max_results=max_results
79
+ )
80
+
81
+ if not search_results:
82
+ logger.info("No search results found", query=final_query)
83
+ return []
84
+
85
+ # Scrape URLs to get full content
86
+ scraped = await scrape_urls(search_results)
87
+
88
+ # Convert ScrapeResult to Evidence objects
89
+ evidence = []
90
+ for result in scraped:
91
+ ev = Evidence(
92
+ content=result.text,
93
+ citation=Citation(
94
+ title=result.title,
95
+ url=result.url,
96
+ source="searchxng",
97
+ date="Unknown",
98
+ authors=[],
99
+ ),
100
+ relevance=0.0,
101
+ )
102
+ evidence.append(ev)
103
+
104
+ logger.info(
105
+ "SearchXNG search complete",
106
+ query=final_query,
107
+ results_found=len(evidence),
108
+ )
109
+
110
+ return evidence
111
+
112
+ except RateLimitError:
113
+ raise
114
+ except SearchError:
115
+ raise
116
+ except Exception as e:
117
+ logger.error("Unexpected error in SearchXNG search", error=str(e), query=final_query)
118
+ raise SearchError(f"SearchXNG search failed: {e}") from e
119
+
src/tools/serper_web_search.py ADDED
@@ -0,0 +1,119 @@
1
+ """Serper web search tool using Serper API for Google searches."""
2
+
3
+ from typing import Any
4
+
5
+ import structlog
6
+ from tenacity import retry, stop_after_attempt, wait_exponential
7
+
8
+ from src.tools.base import SearchTool
9
+ from src.tools.query_utils import preprocess_query
10
+ from src.tools.rate_limiter import get_serper_limiter
11
+ from src.tools.vendored.serper_client import SerperClient
12
+ from src.tools.vendored.web_search_core import scrape_urls
13
+ from src.utils.config import settings
14
+ from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
15
+ from src.utils.models import Citation, Evidence
16
+
17
+ logger = structlog.get_logger()
18
+
19
+
20
+ class SerperWebSearchTool:
21
+ """Tool for searching the web using Serper API (Google search)."""
22
+
23
+ def __init__(self, api_key: str | None = None) -> None:
24
+ """Initialize Serper web search tool.
25
+
26
+ Args:
27
+ api_key: Serper API key. If None, reads from settings.
28
+
29
+ Raises:
30
+ ConfigurationError: If no API key is available.
31
+ """
32
+ self.api_key = api_key or settings.serper_api_key
33
+ if not self.api_key:
34
+ raise ConfigurationError(
35
+ "Serper API key required. Set SERPER_API_KEY environment variable or serper_api_key in settings."
36
+ )
37
+
38
+ self._client = SerperClient(api_key=self.api_key)
39
+ self._limiter = get_serper_limiter(self.api_key)
40
+
41
+ @property
42
+ def name(self) -> str:
43
+ """Return the name of this search tool."""
44
+ return "serper"
45
+
46
+ async def _rate_limit(self) -> None:
47
+ """Enforce Serper API rate limiting."""
48
+ await self._limiter.acquire()
49
+
50
+ @retry(
51
+ stop=stop_after_attempt(3),
52
+ wait=wait_exponential(multiplier=1, min=1, max=10),
53
+ reraise=True,
54
+ )
55
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
56
+ """Execute a web search using Serper API.
57
+
58
+ Args:
59
+ query: The search query string
60
+ max_results: Maximum number of results to return
61
+
62
+ Returns:
63
+ List of Evidence objects
64
+
65
+ Raises:
66
+ SearchError: If the search fails
67
+ RateLimitError: If rate limit is exceeded
68
+ """
69
+ await self._rate_limit()
70
+
71
+ # Preprocess query to remove noise
72
+ clean_query = preprocess_query(query)
73
+ final_query = clean_query if clean_query else query
74
+
75
+ try:
76
+ # Get search results (snippets)
77
+ search_results = await self._client.search(
78
+ final_query, filter_for_relevance=False, max_results=max_results
79
+ )
80
+
81
+ if not search_results:
82
+ logger.info("No search results found", query=final_query)
83
+ return []
84
+
85
+ # Scrape URLs to get full content
86
+ scraped = await scrape_urls(search_results)
87
+
88
+ # Convert ScrapeResult to Evidence objects
89
+ evidence = []
90
+ for result in scraped:
91
+ ev = Evidence(
92
+ content=result.text,
93
+ citation=Citation(
94
+ title=result.title,
95
+ url=result.url,
96
+ source="serper",
97
+ date="Unknown",
98
+ authors=[],
99
+ ),
100
+ relevance=0.0,
101
+ )
102
+ evidence.append(ev)
103
+
104
+ logger.info(
105
+ "Serper search complete",
106
+ query=final_query,
107
+ results_found=len(evidence),
108
+ )
109
+
110
+ return evidence
111
+
112
+ except RateLimitError:
113
+ raise
114
+ except SearchError:
115
+ raise
116
+ except Exception as e:
117
+ logger.error("Unexpected error in Serper search", error=str(e), query=final_query)
118
+ raise SearchError(f"Serper search failed: {e}") from e
119
+
src/tools/vendored/__init__.py ADDED
@@ -0,0 +1,26 @@
1
+ """Vendored web search components from folder/tools/web_search.py."""
2
+
3
+ from src.tools.vendored.web_search_core import (
4
+ CONTENT_LENGTH_LIMIT,
5
+ ScrapeResult,
6
+ WebpageSnippet,
7
+ scrape_urls,
8
+ fetch_and_process_url,
9
+ html_to_text,
10
+ is_valid_url,
11
+ )
12
+ from src.tools.vendored.serper_client import SerperClient
13
+ from src.tools.vendored.searchxng_client import SearchXNGClient
14
+
15
+ __all__ = [
16
+ "CONTENT_LENGTH_LIMIT",
17
+ "ScrapeResult",
18
+ "WebpageSnippet",
19
+ "SerperClient",
20
+ "SearchXNGClient",
21
+ "scrape_urls",
22
+ "fetch_and_process_url",
23
+ "html_to_text",
24
+ "is_valid_url",
25
+ ]
26
+
src/tools/vendored/searchxng_client.py ADDED
@@ -0,0 +1,98 @@
1
+ """SearchXNG API client for Google searches.
2
+
3
+ Vendored and adapted from folder/tools/web_search.py.
4
+ """
5
+
6
+ import os
7
+ from typing import List, Optional
8
+
9
+ import aiohttp
10
+ import structlog
11
+
12
+ from src.tools.vendored.web_search_core import WebpageSnippet, ssl_context
13
+ from src.utils.exceptions import RateLimitError, SearchError
14
+
15
+ logger = structlog.get_logger()
16
+
17
+
18
+ class SearchXNGClient:
19
+ """A client for the SearchXNG API to perform Google searches."""
20
+
21
+ def __init__(self, host: Optional[str] = None) -> None:
22
+ """Initialize SearchXNG client.
23
+
24
+ Args:
25
+ host: SearchXNG host URL. If None, reads from SEARCHXNG_HOST env var.
26
+
27
+ Raises:
28
+ ConfigurationError: If no host is provided.
29
+ """
30
+ host = host or os.getenv("SEARCHXNG_HOST")
31
+ if not host:
32
+ from src.utils.exceptions import ConfigurationError
33
+
34
+ raise ConfigurationError("SEARCHXNG_HOST environment variable is not set")
35
+
36
+ # Ensure host ends with /search
37
+ if not host.endswith("/search"):
38
+ host = f"{host}/search" if not host.endswith("/") else f"{host}search"
39
+
40
+ self.host: str = host
41
+
42
+ async def search(
43
+ self, query: str, filter_for_relevance: bool = False, max_results: int = 5
44
+ ) -> List[WebpageSnippet]:
45
+ """Perform a search using SearchXNG API.
46
+
47
+ Args:
48
+ query: The search query
49
+ filter_for_relevance: Whether to filter results (currently not implemented)
50
+ max_results: Maximum number of results to return
51
+
52
+ Returns:
53
+ List of WebpageSnippet objects with search results
54
+
55
+ Raises:
56
+ SearchError: If the search fails
57
+ RateLimitError: If rate limit is exceeded
58
+ """
59
+ connector = aiohttp.TCPConnector(ssl=ssl_context)
60
+ try:
61
+ async with aiohttp.ClientSession(connector=connector) as session:
62
+ params = {
63
+ "q": query,
64
+ "format": "json",
65
+ }
66
+
67
+ async with session.get(self.host, params=params) as response:
68
+ if response.status == 429:
69
+ raise RateLimitError("SearchXNG API rate limit exceeded")
70
+
71
+ response.raise_for_status()
72
+ results = await response.json()
73
+
74
+ results_list = [
75
+ WebpageSnippet(
76
+ url=result.get("url", ""),
77
+ title=result.get("title", ""),
78
+ description=result.get("content", ""),
79
+ )
80
+ for result in results.get("results", [])
81
+ ]
82
+
83
+ if not results_list:
84
+ logger.info("No search results found", query=query)
85
+ return []
86
+
87
+ # Return results up to max_results
88
+ return results_list[:max_results]
89
+
90
+ except aiohttp.ClientError as e:
91
+ logger.error("SearchXNG API request failed", error=str(e), query=query)
92
+ raise SearchError(f"SearchXNG API request failed: {e}") from e
93
+ except RateLimitError:
94
+ raise
95
+ except Exception as e:
96
+ logger.error("Unexpected error in SearchXNG search", error=str(e), query=query)
97
+ raise SearchError(f"SearchXNG search failed: {e}") from e
98
+
src/tools/vendored/serper_client.py ADDED
@@ -0,0 +1,94 @@
1
+ """Serper API client for Google searches.
2
+
3
+ Vendored and adapted from folder/tools/web_search.py.
4
+ """
5
+
6
+ import os
7
+ from typing import List, Optional
8
+
9
+ import aiohttp
10
+ import structlog
11
+
12
+ from src.tools.vendored.web_search_core import WebpageSnippet, ssl_context
13
+ from src.utils.exceptions import RateLimitError, SearchError
14
+
15
+ logger = structlog.get_logger()
16
+
17
+
18
+ class SerperClient:
19
+ """A client for the Serper API to perform Google searches."""
20
+
21
+ def __init__(self, api_key: Optional[str] = None) -> None:
22
+ """Initialize Serper client.
23
+
24
+ Args:
25
+ api_key: Serper API key. If None, reads from SERPER_API_KEY env var.
26
+
27
+ Raises:
28
+ ConfigurationError: If no API key is provided.
29
+ """
30
+ self.api_key = api_key or os.getenv("SERPER_API_KEY")
31
+ if not self.api_key:
32
+ from src.utils.exceptions import ConfigurationError
33
+
34
+ raise ConfigurationError(
35
+ "No API key provided. Set SERPER_API_KEY environment variable."
36
+ )
37
+
38
+ self.url = "https://google.serper.dev/search"
39
+ self.headers = {"X-API-KEY": self.api_key, "Content-Type": "application/json"}
40
+
41
+ async def search(
42
+ self, query: str, filter_for_relevance: bool = False, max_results: int = 5
43
+ ) -> List[WebpageSnippet]:
44
+ """Perform a Google search using Serper API.
45
+
46
+ Args:
47
+ query: The search query
48
+ filter_for_relevance: Whether to filter results (currently not implemented)
49
+ max_results: Maximum number of results to return
50
+
51
+ Returns:
52
+ List of WebpageSnippet objects with search results
53
+
54
+ Raises:
55
+ SearchError: If the search fails
56
+ RateLimitError: If rate limit is exceeded
57
+ """
58
+ connector = aiohttp.TCPConnector(ssl=ssl_context)
59
+ try:
60
+ async with aiohttp.ClientSession(connector=connector) as session:
61
+ async with session.post(
62
+ self.url, headers=self.headers, json={"q": query, "autocorrect": False}
63
+ ) as response:
64
+ if response.status == 429:
65
+ raise RateLimitError("Serper API rate limit exceeded")
66
+
67
+ response.raise_for_status()
68
+ results = await response.json()
69
+
70
+ results_list = [
71
+ WebpageSnippet(
72
+ url=result.get("link", ""),
73
+ title=result.get("title", ""),
74
+ description=result.get("snippet", ""),
75
+ )
76
+ for result in results.get("organic", [])
77
+ ]
78
+
79
+ if not results_list:
80
+ logger.info("No search results found", query=query)
81
+ return []
82
+
83
+ # Return results up to max_results
84
+ return results_list[:max_results]
85
+
86
+ except aiohttp.ClientError as e:
87
+ logger.error("Serper API request failed", error=str(e), query=query)
88
+ raise SearchError(f"Serper API request failed: {e}") from e
89
+ except RateLimitError:
90
+ raise
91
+ except Exception as e:
92
+ logger.error("Unexpected error in Serper search", error=str(e), query=query)
93
+ raise SearchError(f"Serper search failed: {e}") from e
94
+
src/tools/vendored/web_search_core.py ADDED
@@ -0,0 +1,205 @@
1
+ """Core web search utilities vendored from folder/tools/web_search.py.
2
+
3
+ This module contains shared utilities for web scraping, URL processing,
4
+ and HTML text extraction used by web search tools.
5
+ """
6
+
7
+ import asyncio
8
+ import ssl
9
+ from typing import List, Optional
10
+
11
+ import aiohttp
12
+ import structlog
13
+ from bs4 import BeautifulSoup
14
+ from pydantic import BaseModel, Field
15
+
16
+ logger = structlog.get_logger()
17
+
18
+ # Content length limit to avoid exceeding token limits
19
+ CONTENT_LENGTH_LIMIT = 10000
20
+
21
+ # Create a shared SSL context for web requests
22
+ ssl_context = ssl.create_default_context()
23
+ ssl_context.check_hostname = False
24
+ ssl_context.verify_mode = ssl.CERT_NONE
25
+ ssl_context.set_ciphers("DEFAULT:@SECLEVEL=1") # Allow older cipher suites
26
+
27
+
28
+ class ScrapeResult(BaseModel):
29
+ """Result of scraping a single webpage."""
30
+
31
+ url: str = Field(description="The URL of the webpage")
32
+ text: str = Field(description="The full text content of the webpage")
33
+ title: str = Field(description="The title of the webpage")
34
+ description: str = Field(description="A short description of the webpage")
35
+
36
+
37
+ class WebpageSnippet(BaseModel):
38
+ """Snippet information for a webpage (before scraping)."""
39
+
40
+ url: str = Field(description="The URL of the webpage")
41
+ title: str = Field(description="The title of the webpage")
42
+ description: Optional[str] = Field(
43
+ default=None, description="A short description of the webpage"
44
+ )
45
+
46
+
47
+ async def scrape_urls(items: List[WebpageSnippet]) -> List[ScrapeResult]:
48
+ """Fetch text content from provided URLs.
49
+
50
+ Args:
51
+ items: List of WebpageSnippet items to extract content from
52
+
53
+ Returns:
54
+ List of ScrapeResult objects with scraped content
55
+ """
56
+ connector = aiohttp.TCPConnector(ssl=ssl_context)
57
+ async with aiohttp.ClientSession(connector=connector) as session:
58
+ # Create list of tasks for concurrent execution
59
+ tasks = []
60
+ for item in items:
61
+ if item.url: # Skip empty URLs
62
+ tasks.append(fetch_and_process_url(session, item))
63
+
64
+ # Execute all tasks concurrently and gather results
65
+ results = await asyncio.gather(*tasks, return_exceptions=True)
66
+
67
+ # Filter out errors and return successful results
68
+ successful_results: List[ScrapeResult] = []
69
+ for result in results:
70
+ if isinstance(result, ScrapeResult):
71
+ successful_results.append(result)
72
+ elif isinstance(result, Exception):
73
+ logger.warning("Failed to scrape URL", error=str(result))
74
+
75
+ return successful_results
76
+
77
+
78
+ async def fetch_and_process_url(
79
+ session: aiohttp.ClientSession, item: WebpageSnippet
80
+ ) -> ScrapeResult:
81
+ """Helper function to fetch and process a single URL.
82
+
83
+ Args:
84
+ session: aiohttp ClientSession
85
+ item: WebpageSnippet with URL to fetch
86
+
87
+ Returns:
88
+ ScrapeResult with fetched content
89
+ """
90
+ if not is_valid_url(item.url):
91
+ return ScrapeResult(
92
+ url=item.url,
93
+ title=item.title,
94
+ description=item.description or "",
95
+ text="Error fetching content: URL contains restricted file extension",
96
+ )
97
+
98
+ try:
99
+ timeout = aiohttp.ClientTimeout(total=8)
100
+ async with session.get(item.url, timeout=timeout) as response:
101
+ if response.status == 200:
102
+ content = await response.text()
103
+ # Run html_to_text in a thread pool to avoid blocking
104
+ loop = asyncio.get_event_loop()
105
+ text_content = await loop.run_in_executor(None, html_to_text, content)
106
+ text_content = text_content[
107
+ :CONTENT_LENGTH_LIMIT
108
+ ] # Trim content to avoid exceeding token limit
109
+ return ScrapeResult(
110
+ url=item.url,
111
+ title=item.title,
112
+ description=item.description or "",
113
+ text=text_content,
114
+ )
115
+ else:
116
+ # Return a ScrapeResult with an error message
117
+ return ScrapeResult(
118
+ url=item.url,
119
+ title=item.title,
120
+ description=item.description or "",
121
+ text=f"Error fetching content: HTTP {response.status}",
122
+ )
123
+ except Exception as e:
124
+ logger.warning("Error fetching URL", url=item.url, error=str(e))
125
+ # Return a ScrapeResult with an error message
126
+ return ScrapeResult(
127
+ url=item.url,
128
+ title=item.title,
129
+ description=item.description or "",
130
+ text=f"Error fetching content: {str(e)}",
131
+ )
132
+
133
+
134
+ def html_to_text(html_content: str) -> str:
135
+ """Strip out unnecessary elements from HTML to prepare for text extraction.
136
+
137
+ Args:
138
+ html_content: Raw HTML content
139
+
140
+ Returns:
141
+ Extracted text from relevant HTML tags
142
+ """
143
+ # Parse the HTML using lxml for speed
144
+ soup = BeautifulSoup(html_content, "lxml")
145
+
146
+ # Extract text from relevant tags
147
+ tags_to_extract = ("h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "blockquote")
148
+
149
+ # Use a generator expression for efficiency
150
+ extracted_text = "\n".join(
151
+ element.get_text(strip=True)
152
+ for element in soup.find_all(tags_to_extract)
153
+ if element.get_text(strip=True)
154
+ )
155
+
156
+ return extracted_text
157
+
158
+
159
+ def is_valid_url(url: str) -> bool:
160
+ """Check that a URL does not contain restricted file extensions.
161
+
162
+ Args:
163
+ url: URL to validate
164
+
165
+ Returns:
166
+ True if URL is valid, False if it contains restricted extensions
167
+ """
168
+ restricted_extensions = [
169
+ ".pdf",
170
+ ".doc",
171
+ ".xls",
172
+ ".ppt",
173
+ ".zip",
174
+ ".rar",
175
+ ".7z",
176
+ ".txt",
177
+ ".js",
178
+ ".xml",
179
+ ".css",
180
+ ".png",
181
+ ".jpg",
182
+ ".jpeg",
183
+ ".gif",
184
+ ".ico",
185
+ ".svg",
186
+ ".webp",
187
+ ".mp3",
188
+ ".mp4",
189
+ ".avi",
190
+ ".mov",
191
+ ".wmv",
192
+ ".flv",
193
+ ".wma",
194
+ ".wav",
195
+ ".m4a",
196
+ ".m4v",
197
+ ".m4b",
198
+ ".m4p",
199
+ ".m4u",
200
+ ]
201
+
202
+ if any(ext in url for ext in restricted_extensions):
203
+ return False
204
+ return True
205
+
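A short sketch of how the vendored helpers compose, assuming the module path `src/tools/vendored/web_search_core.py` introduced in this commit. Fetch failures do not raise; they come back as `ScrapeResult` objects whose `text` field carries the error message.

```python
# Sketch under stated assumptions: combine WebpageSnippet with scrape_urls
# to fetch pages concurrently and reduce them to plain text.
import asyncio

from src.tools.vendored.web_search_core import WebpageSnippet, scrape_urls

async def demo() -> None:
    snippets = [WebpageSnippet(url="https://example.com", title="Example Domain")]
    results = await scrape_urls(snippets)  # fetch errors are embedded in result.text
    for result in results:
        print(result.url, result.text[:80])

asyncio.run(demo())
```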
src/tools/web_search.py CHANGED
@@ -5,7 +5,9 @@ import asyncio
5
  import structlog
6
  from duckduckgo_search import DDGS
7
 
8
- from src.utils.models import Citation, Evidence, SearchResult
 
 
9
 
10
  logger = structlog.get_logger()
11
 
@@ -16,14 +18,34 @@ class WebSearchTool:
16
  def __init__(self) -> None:
17
  self._ddgs = DDGS()
18
 
19
- async def search(self, query: str, max_results: int = 10) -> SearchResult:
20
- """Execute a web search."""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
21
  try:
 
 
 
 
22
  loop = asyncio.get_running_loop()
23
 
24
  def _do_search() -> list[dict[str, str]]:
25
  # text() returns an iterator, need to list() it or iterate
26
- return list(self._ddgs.text(query, max_results=max_results))
27
 
28
  raw_results = await loop.run_in_executor(None, _do_search)
29
 
@@ -42,12 +64,8 @@ class WebSearchTool:
42
  )
43
  evidence.append(ev)
44
 
45
- return SearchResult(
46
- query=query, evidence=evidence, sources_searched=["web"], total_found=len(evidence)
47
- )
48
 
49
  except Exception as e:
50
- logger.error("Web search failed", error=str(e))
51
- return SearchResult(
52
- query=query, evidence=[], sources_searched=["web"], total_found=0, errors=[str(e)]
53
- )
 
5
  import structlog
6
  from duckduckgo_search import DDGS
7
 
8
+ from src.tools.query_utils import preprocess_query
9
+ from src.utils.exceptions import SearchError
10
+ from src.utils.models import Citation, Evidence
11
 
12
  logger = structlog.get_logger()
13
 
 
18
  def __init__(self) -> None:
19
  self._ddgs = DDGS()
20
 
21
+ @property
22
+ def name(self) -> str:
23
+ """Return the name of this search tool."""
24
+ return "duckduckgo"
25
+
26
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
27
+ """Execute a web search and return evidence.
28
+
29
+ Args:
30
+ query: The search query string
31
+ max_results: Maximum number of results to return
32
+
33
+ Returns:
34
+ List of Evidence objects
35
+
36
+ Raises:
37
+ SearchError: If the search fails
38
+ """
39
  try:
40
+ # Preprocess query to remove noise
41
+ clean_query = preprocess_query(query)
42
+ final_query = clean_query if clean_query else query
43
+
44
  loop = asyncio.get_running_loop()
45
 
46
  def _do_search() -> list[dict[str, str]]:
47
  # text() returns an iterator, need to list() it or iterate
48
+ return list(self._ddgs.text(final_query, max_results=max_results))
49
 
50
  raw_results = await loop.run_in_executor(None, _do_search)
51
 
 
64
  )
65
  evidence.append(ev)
66
 
67
+ return evidence
 
 
68
 
69
  except Exception as e:
70
+ logger.error("Web search failed", error=str(e), query=query)
71
+ raise SearchError(f"DuckDuckGo search failed: {e}") from e
 
 
src/tools/web_search_adapter.py CHANGED
@@ -1,10 +1,12 @@
1
  """Web search tool adapter for Pydantic AI agents.
2
 
3
- Adapts the folder/tools/web_search.py implementation to work with Pydantic AI.
4
  """
5
 
6
  import structlog
7
 
 
 
8
  logger = structlog.get_logger()
9
 
10
 
@@ -22,42 +24,32 @@ async def web_search(query: str) -> str:
22
  Formatted string with search results including titles, descriptions, and URLs
23
  """
24
  try:
25
- # Lazy import to avoid requiring folder/ dependencies at import time
26
- # This will use the existing web_search tool from folder/tools
27
- from folder.llm_config import create_default_config
28
- from folder.tools.web_search import create_web_search_tool
29
-
30
- config = create_default_config()
31
- web_search_tool = create_web_search_tool(config)
32
 
33
- # Call the tool function
34
- # The tool returns List[ScrapeResult] or str
35
- results = await web_search_tool(query)
36
 
37
- if isinstance(results, str):
38
- # Error message returned
39
- logger.warning("Web search returned error", error=results)
40
- return results
41
 
42
- if not results:
43
  return f"No web search results found for: {query}"
44
 
45
  # Format results for agent consumption
46
- formatted = [f"Found {len(results)} web search results:\n"]
47
- for i, result in enumerate(results[:5], 1): # Limit to 5 results
48
- formatted.append(f"{i}. **{result.title}**")
49
- if result.description:
50
- formatted.append(f" {result.description[:200]}...")
51
- formatted.append(f" URL: {result.url}")
52
- if result.text:
53
- formatted.append(f" Content: {result.text[:300]}...")
54
  formatted.append("")
55
 
56
  return "\n".join(formatted)
57
 
58
- except ImportError as e:
59
- logger.error("Web search tool not available", error=str(e))
60
- return f"Web search tool not available: {e!s}"
61
  except Exception as e:
62
  logger.error("Web search failed", error=str(e), query=query)
63
  return f"Error performing web search: {e!s}"
 
1
  """Web search tool adapter for Pydantic AI agents.
2
 
3
+ Uses the new web search factory to provide web search functionality.
4
  """
5
 
6
  import structlog
7
 
8
+ from src.tools.web_search_factory import create_web_search_tool
9
+
10
  logger = structlog.get_logger()
11
 
12
 
 
24
  Formatted string with search results including titles, descriptions, and URLs
25
  """
26
  try:
27
+ # Get web search tool from factory
28
+ tool = create_web_search_tool()
 
 
 
 
 
29
 
30
+ if tool is None:
31
+ logger.warning("Web search tool not available", hint="Check configuration")
32
+ return "Web search tool not available. Please configure a web search provider."
33
 
34
+ # Call the tool - it returns list[Evidence]
35
+ evidence = await tool.search(query, max_results=5)
 
 
36
 
37
+ if not evidence:
38
  return f"No web search results found for: {query}"
39
 
40
  # Format results for agent consumption
41
+ formatted = [f"Found {len(evidence)} web search results:\n"]
42
+ for i, ev in enumerate(evidence, 1):
43
+ citation = ev.citation
44
+ formatted.append(f"{i}. **{citation.title}**")
45
+ if citation.url:
46
+ formatted.append(f" URL: {citation.url}")
47
+ if ev.content:
48
+ formatted.append(f" Content: {ev.content[:300]}...")
49
  formatted.append("")
50
 
51
  return "\n".join(formatted)
52
 
 
 
 
53
  except Exception as e:
54
  logger.error("Web search failed", error=str(e), query=query)
55
  return f"Error performing web search: {e!s}"
src/tools/web_search_factory.py ADDED
@@ -0,0 +1,73 @@
1
+ """Factory for creating web search tools based on configuration."""
2
+
3
+ import structlog
4
+
5
+ from src.tools.base import SearchTool
6
+ from src.tools.searchxng_web_search import SearchXNGWebSearchTool
7
+ from src.tools.serper_web_search import SerperWebSearchTool
8
+ from src.tools.web_search import WebSearchTool
9
+ from src.utils.config import settings
10
+ from src.utils.exceptions import ConfigurationError
11
+
12
+ logger = structlog.get_logger()
13
+
14
+
15
+ def create_web_search_tool() -> SearchTool | None:
16
+ """Create a web search tool based on configuration.
17
+
18
+ Returns:
19
+ SearchTool instance, or None if not available/configured
20
+
21
+ The tool is selected based on settings.web_search_provider:
22
+ - "serper": SerperWebSearchTool (requires SERPER_API_KEY)
23
+ - "searchxng": SearchXNGWebSearchTool (requires SEARCHXNG_HOST)
24
+ - "duckduckgo": WebSearchTool (always available, no API key)
25
+ - "brave" or "tavily": Not yet implemented, returns None
26
+ """
27
+ provider = settings.web_search_provider
28
+
29
+ try:
30
+ if provider == "serper":
31
+ if not settings.serper_api_key:
32
+ logger.warning(
33
+ "Serper provider selected but no API key found",
34
+ hint="Set SERPER_API_KEY environment variable",
35
+ )
36
+ return None
37
+ return SerperWebSearchTool()
38
+
39
+ elif provider == "searchxng":
40
+ if not settings.searchxng_host:
41
+ logger.warning(
42
+ "SearchXNG provider selected but no host found",
43
+ hint="Set SEARCHXNG_HOST environment variable",
44
+ )
45
+ return None
46
+ return SearchXNGWebSearchTool()
47
+
48
+ elif provider == "duckduckgo":
49
+ # DuckDuckGo is always available (no API key required)
50
+ return WebSearchTool()
51
+
52
+ elif provider in ("brave", "tavily"):
53
+ logger.warning(
54
+ f"Web search provider '{provider}' not yet implemented",
55
+ hint="Use 'serper', 'searchxng', or 'duckduckgo'",
56
+ )
57
+ return None
58
+
59
+ else:
60
+ logger.warning(
61
+ f"Unknown web search provider '{provider}', falling back to DuckDuckGo"
62
+ )
63
+ return WebSearchTool()
64
+
65
+ except ConfigurationError as e:
66
+ logger.error("Failed to create web search tool", error=str(e), provider=provider)
67
+ return None
68
+ except Exception as e:
69
+ logger.error(
70
+ "Unexpected error creating web search tool", error=str(e), provider=provider
71
+ )
72
+ return None
73
+
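A sketch of consuming the factory. The factory may return `None` when the selected provider is unconfigured (for example, `serper` without `SERPER_API_KEY`), so callers must handle that case rather than assume a tool exists.

```python
# Sketch under stated assumptions: provider selection follows
# settings.web_search_provider as described in the factory docstring.
import asyncio

from src.tools.web_search_factory import create_web_search_tool

async def demo() -> None:
    tool = create_web_search_tool()
    if tool is None:
        print("No web search provider configured")
        return
    evidence = await tool.search("serper.dev google search api", max_results=5)
    print(f"{tool.name}: {len(evidence)} results")

asyncio.run(demo())
```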
src/utils/llm_factory.py CHANGED
@@ -132,34 +132,22 @@ def get_pydantic_ai_model(oauth_token: str | None = None) -> Any:
132
  Returns:
133
  Configured pydantic-ai model
134
  """
135
- from pydantic_ai.models.anthropic import AnthropicModel
136
  from pydantic_ai.models.huggingface import HuggingFaceModel
137
- from pydantic_ai.models.openai import OpenAIChatModel as OpenAIModel
138
- from pydantic_ai.providers.anthropic import AnthropicProvider
139
  from pydantic_ai.providers.huggingface import HuggingFaceProvider
140
- from pydantic_ai.providers.openai import OpenAIProvider
141
 
142
- # Priority: oauth_token > env vars
143
  effective_hf_token = oauth_token or settings.hf_token or settings.huggingface_api_key
144
 
145
- if settings.llm_provider == "huggingface":
146
- model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
147
- hf_provider = HuggingFaceProvider(api_key=effective_hf_token)
148
- return HuggingFaceModel(model_name, provider=hf_provider)
149
-
150
- if settings.llm_provider == "openai":
151
- if not settings.openai_api_key:
152
- raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai")
153
- provider = OpenAIProvider(api_key=settings.openai_api_key)
154
- return OpenAIModel(settings.openai_model, provider=provider)
155
-
156
- if settings.llm_provider == "anthropic":
157
- if not settings.anthropic_api_key:
158
- raise ConfigurationError("ANTHROPIC_API_KEY not set for pydantic-ai")
159
- anthropic_provider = AnthropicProvider(api_key=settings.anthropic_api_key)
160
- return AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
161
 
162
- # Default to HuggingFace if provider is unknown or not specified
163
  model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
164
  hf_provider = HuggingFaceProvider(api_key=effective_hf_token)
165
  return HuggingFaceModel(model_name, provider=hf_provider)
 
132
  Returns:
133
  Configured pydantic-ai model
134
  """
 
135
  from pydantic_ai.models.huggingface import HuggingFaceModel
 
 
136
  from pydantic_ai.providers.huggingface import HuggingFaceProvider
 
137
 
138
+ # Priority: oauth_token > settings.hf_token > settings.huggingface_api_key
139
  effective_hf_token = oauth_token or settings.hf_token or settings.huggingface_api_key
140
 
141
+ # HuggingFaceProvider requires a token - cannot use None
142
+ if not effective_hf_token:
143
+ raise ConfigurationError(
144
+ "HuggingFace token required. Please either:\n"
145
+ "1. Log in via HuggingFace OAuth (recommended for Spaces)\n"
146
+ "2. Set HF_TOKEN environment variable\n"
147
+ "3. Set huggingface_api_key in settings"
148
+ )
 
 
 
 
 
 
 
 
149
 
150
+ # Always use HuggingFace with available token
151
  model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct"
152
  hf_provider = HuggingFaceProvider(api_key=effective_hf_token)
153
  return HuggingFaceModel(model_name, provider=hf_provider)
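With the simplification above, the factory is HuggingFace-only and a missing token now fails fast instead of constructing a provider with `api_key=None`. A hedged sketch of the expected behaviour; the `ConfigurationError` import path is taken from `src.utils.exceptions`, as elsewhere in this commit.

```python
# Sketch of the simplified, HuggingFace-only model factory.
import os

from src.utils.exceptions import ConfigurationError
from src.utils.llm_factory import get_pydantic_ai_model

try:
    model = get_pydantic_ai_model(oauth_token=os.getenv("HF_TOKEN"))
    print(type(model).__name__)  # HuggingFaceModel
except ConfigurationError as exc:
    print(f"LLM not configured: {exc}")
```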