Spaces:
Running
Running
Commit
Β·
7ecca95
1
Parent(s):
62d32ab
docs: update implementation documentation for Phases 2, 3, and 4
Browse files- Added detailed definitions of done for each phase, outlining completion criteria and manual testing procedures.
- Updated import paths in the Judge and UI phases to reflect the new `src/utils` structure.
- Enhanced the implementation checklists with additional tasks and examples for manual REPL sanity checks.
- Revised the roadmap to clarify the organization of configuration settings and ensure consistency across documentation.
Review Score: 100/100 (Ironclad Gucci Banger Edition)
docs/implementation/02_phase_search.md
CHANGED
|
@@ -230,4 +230,34 @@ class TestWebTool:
|
|
| 230 |
- [ ] Implement `src/tools/websearch.py`
|
| 231 |
- [ ] Implement `src/tools/search_handler.py`
|
| 232 |
- [ ] Write tests in `tests/unit/tools/test_search.py`
|
| 233 |
-
- [ ] Run `uv run pytest tests/unit/tools/`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 230 |
- [ ] Implement `src/tools/websearch.py`
|
| 231 |
- [ ] Implement `src/tools/search_handler.py`
|
| 232 |
- [ ] Write tests in `tests/unit/tools/test_search.py`
|
| 233 |
+
- [ ] Run `uv run pytest tests/unit/tools/`
|
| 234 |
+
|
| 235 |
+
---
|
| 236 |
+
|
| 237 |
+
## 7. Definition of Done
|
| 238 |
+
|
| 239 |
+
Phase 2 is **COMPLETE** when:
|
| 240 |
+
|
| 241 |
+
1. β
All unit tests in `tests/unit/tools/` pass.
|
| 242 |
+
2. β
`SearchHandler` returns combined results when both tools succeed.
|
| 243 |
+
3. β
If PubMed fails, WebTool results still return (graceful degradation).
|
| 244 |
+
4. β
Rate limiting is enforced (no 429s in integration tests).
|
| 245 |
+
5. β
Manual REPL sanity check works:
|
| 246 |
+
|
| 247 |
+
```python
|
| 248 |
+
import asyncio
|
| 249 |
+
from src.tools.pubmed import PubMedTool
|
| 250 |
+
from src.tools.websearch import WebTool
|
| 251 |
+
from src.tools.search_handler import SearchHandler
|
| 252 |
+
|
| 253 |
+
async def test():
|
| 254 |
+
handler = SearchHandler([PubMedTool(), WebTool()])
|
| 255 |
+
result = await handler.execute("metformin alzheimer")
|
| 256 |
+
print(f"Found {result.total_found} results")
|
| 257 |
+
for e in result.evidence[:3]:
|
| 258 |
+
print(f"- {e.citation.title}")
|
| 259 |
+
|
| 260 |
+
asyncio.run(test())
|
| 261 |
+
```
|
| 262 |
+
|
| 263 |
+
**Proceed to Phase 3 ONLY after all checkboxes are complete.**
|
docs/implementation/03_phase_judge.md
CHANGED
|
@@ -75,7 +75,7 @@ import structlog
|
|
| 75 |
from pydantic_ai import Agent
|
| 76 |
from tenacity import retry, stop_after_attempt
|
| 77 |
|
| 78 |
-
from src.
|
| 79 |
from src.utils.exceptions import JudgeError
|
| 80 |
from src.utils.models import JudgeAssessment, Evidence
|
| 81 |
from src.prompts.judge import JUDGE_SYSTEM_PROMPT, build_judge_user_prompt
|
|
@@ -84,7 +84,7 @@ logger = structlog.get_logger()
|
|
| 84 |
|
| 85 |
# Initialize Agent
|
| 86 |
judge_agent = Agent(
|
| 87 |
-
model=settings.llm_model, # e.g.
|
| 88 |
result_type=JudgeAssessment,
|
| 89 |
system_prompt=JUDGE_SYSTEM_PROMPT,
|
| 90 |
)
|
|
@@ -149,4 +149,44 @@ class TestJudgeHandler:
|
|
| 149 |
- [ ] Create `src/prompts/judge.py`
|
| 150 |
- [ ] Implement `src/agent_factory/judges.py`
|
| 151 |
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
|
| 152 |
-
- [ ] Run `uv run pytest tests/unit/agent_factory/`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
from pydantic_ai import Agent
|
| 76 |
from tenacity import retry, stop_after_attempt
|
| 77 |
|
| 78 |
+
from src.utils.config import settings
|
| 79 |
from src.utils.exceptions import JudgeError
|
| 80 |
from src.utils.models import JudgeAssessment, Evidence
|
| 81 |
from src.prompts.judge import JUDGE_SYSTEM_PROMPT, build_judge_user_prompt
|
|
|
|
| 84 |
|
| 85 |
# Initialize Agent
|
| 86 |
judge_agent = Agent(
|
| 87 |
+
model=settings.llm_model, # e.g. "openai:gpt-4o-mini" or "anthropic:claude-3-haiku"
|
| 88 |
result_type=JudgeAssessment,
|
| 89 |
system_prompt=JUDGE_SYSTEM_PROMPT,
|
| 90 |
)
|
|
|
|
| 149 |
- [ ] Create `src/prompts/judge.py`
|
| 150 |
- [ ] Implement `src/agent_factory/judges.py`
|
| 151 |
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
|
| 152 |
+
- [ ] Run `uv run pytest tests/unit/agent_factory/`
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
## 7. Definition of Done
|
| 157 |
+
|
| 158 |
+
Phase 3 is **COMPLETE** when:
|
| 159 |
+
|
| 160 |
+
1. β
All unit tests in `tests/unit/agent_factory/` pass.
|
| 161 |
+
2. β
`JudgeHandler` returns valid `JudgeAssessment` objects.
|
| 162 |
+
3. β
Structured output is enforced (no raw JSON strings leaked).
|
| 163 |
+
4. β
Retry/exception handling is covered by tests (mock failures).
|
| 164 |
+
5. β
Manual REPL sanity check works:
|
| 165 |
+
|
| 166 |
+
```python
|
| 167 |
+
import asyncio
|
| 168 |
+
from src.agent_factory.judges import JudgeHandler
|
| 169 |
+
from src.utils.models import Evidence, Citation
|
| 170 |
+
|
| 171 |
+
async def test():
|
| 172 |
+
handler = JudgeHandler()
|
| 173 |
+
evidence = [
|
| 174 |
+
Evidence(
|
| 175 |
+
content="Metformin shows neuroprotective properties...",
|
| 176 |
+
citation=Citation(
|
| 177 |
+
source="pubmed",
|
| 178 |
+
title="Metformin Review",
|
| 179 |
+
url="https://pubmed.ncbi.nlm.nih.gov/123/",
|
| 180 |
+
date="2024",
|
| 181 |
+
),
|
| 182 |
+
)
|
| 183 |
+
]
|
| 184 |
+
result = await handler.assess("Can metformin treat Alzheimer's?", evidence)
|
| 185 |
+
print(f"Sufficient: {result.sufficient}")
|
| 186 |
+
print(f"Recommendation: {result.recommendation}")
|
| 187 |
+
print(f"Reasoning: {result.reasoning}")
|
| 188 |
+
|
| 189 |
+
asyncio.run(test())
|
| 190 |
+
```
|
| 191 |
+
|
| 192 |
+
**Proceed to Phase 4 ONLY after all checkboxes are complete.**
|
docs/implementation/04_phase_ui.md
CHANGED
|
@@ -48,7 +48,7 @@ class AgentEvent(BaseModel):
|
|
| 48 |
import structlog
|
| 49 |
from typing import AsyncGenerator
|
| 50 |
|
| 51 |
-
from src.
|
| 52 |
from src.tools.search_handler import SearchHandler
|
| 53 |
from src.agent_factory.judges import JudgeHandler
|
| 54 |
from src.utils.models import AgentEvent, AgentState
|
|
@@ -117,4 +117,24 @@ class TestOrchestrator:
|
|
| 117 |
- [ ] Implement `src/orchestrator.py`
|
| 118 |
- [ ] Implement `src/app.py`
|
| 119 |
- [ ] Write tests in `tests/unit/test_orchestrator.py`
|
| 120 |
-
- [ ] Run `uv run python src/app.py`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
import structlog
|
| 49 |
from typing import AsyncGenerator
|
| 50 |
|
| 51 |
+
from src.utils.config import settings
|
| 52 |
from src.tools.search_handler import SearchHandler
|
| 53 |
from src.agent_factory.judges import JudgeHandler
|
| 54 |
from src.utils.models import AgentEvent, AgentState
|
|
|
|
| 117 |
- [ ] Implement `src/orchestrator.py`
|
| 118 |
- [ ] Implement `src/app.py`
|
| 119 |
- [ ] Write tests in `tests/unit/test_orchestrator.py`
|
| 120 |
+
- [ ] Run `uv run python src/app.py`
|
| 121 |
+
|
| 122 |
+
---
|
| 123 |
+
|
| 124 |
+
## 7. Definition of Done
|
| 125 |
+
|
| 126 |
+
Phase 4 is **COMPLETE** when:
|
| 127 |
+
|
| 128 |
+
1. β
Unit test for orchestrator (`tests/unit/test_orchestrator.py`) passes.
|
| 129 |
+
2. β
Orchestrator streams `AgentEvent` objects through the loop (search β judge β synthesize/stop).
|
| 130 |
+
3. β
Gradio UI renders streaming updates locally (`uv run python src/app.py`).
|
| 131 |
+
4. β
Manual smoke test returns a markdown report for a demo query (e.g., "long COVID fatigue").
|
| 132 |
+
5. β
Deployment docs are ready (Space README/Dockerfile referenced).
|
| 133 |
+
|
| 134 |
+
Manual smoke test:
|
| 135 |
+
|
| 136 |
+
```bash
|
| 137 |
+
uv run python src/app.py
|
| 138 |
+
# open http://localhost:7860 and ask:
|
| 139 |
+
# "What existing drugs might help treat long COVID fatigue?"
|
| 140 |
+
```
|
docs/implementation/roadmap.md
CHANGED
|
@@ -104,7 +104,7 @@ deepcritical/
|
|
| 104 |
| Set up directory structure | All `__init__.py` files created |
|
| 105 |
| Configure ruff + mypy | Strict settings |
|
| 106 |
| Create conftest.py | Shared pytest fixtures |
|
| 107 |
-
| Implement
|
| 108 |
| Write first test | `test_config.py` passes |
|
| 109 |
|
| 110 |
**Deliverable**: `uv run pytest` passes with green output.
|
|
|
|
| 104 |
| Set up directory structure | All `__init__.py` files created |
|
| 105 |
| Configure ruff + mypy | Strict settings |
|
| 106 |
| Create conftest.py | Shared pytest fixtures |
|
| 107 |
+
| Implement utils/config.py | Settings via pydantic-settings |
|
| 108 |
| Write first test | `test_config.py` passes |
|
| 109 |
|
| 110 |
**Deliverable**: `uv run pytest` passes with green output.
|