# P0 CRITICAL BUGS - Why DeepCritical Produces Garbage Results

**Date:** November 27, 2025
**Status:** CRITICAL - App is functionally useless
**Severity:** P0 (Blocker)

## TL;DR

The app produces garbage because:
1. **BioRxiv search doesn't work** - returns random papers
2. **Free tier LLM is too dumb** - can't identify drugs
3. **Query construction is naive** - no optimization for PubMed/CT.gov syntax
4. **Loop terminates too early** - 5 iterations isn't enough

---

## P0-001: BioRxiv Search is Fundamentally Broken

**File:** `src/tools/biorxiv.py:248-286`

**The Problem:**
The bioRxiv API **DOES NOT SUPPORT KEYWORD SEARCH**.

The code does this:
```python
# Fetch recent papers (last 90 days, first 100 papers)
url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
# Then filter client-side for keywords
```

**What Actually Happens:**
1. Fetches the first 100 papers posted to the configured server (bioRxiv or medRxiv) in the last 90 days, in chronological order
2. Filters those 100 random papers for query keywords
3. Returns whatever garbage matches

**Result:** For "Long COVID medications", you get random papers like:
- "Calf muscle structure-function adaptations"
- "Work-Life Balance of Ophthalmologists During COVID"

These papers contain "COVID" somewhere but have NOTHING to do with Long COVID treatments.

**Root Cause:** The `/0/json` pagination only returns 100 papers. You'd need to paginate through ALL papers (thousands) to do proper keyword filtering.

**Fix Options:**
1. **Remove BioRxiv entirely** - It's unusable without proper search API
2. **Use a different preprint aggregator** - Europe PMC has preprints WITH search
3. **Add pagination** - Fetch all papers (slow, expensive)
4. **Use Semantic Scholar API** - Has preprints and proper search
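
Option 2 is the most practical fix. Europe PMC exposes a real server-side search endpoint that covers preprints; a minimal sketch, assuming the `SRC:PPR` source filter (Europe PMC's code for its preprint subset) and illustrative function names:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUROPE_PMC_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_preprint_params(query: str, page_size: int = 25) -> dict:
    # SRC:PPR restricts results to preprints (bioRxiv, medRxiv, etc.)
    return {"query": f"({query}) AND SRC:PPR", "format": "json", "pageSize": page_size}

def search_preprints(query: str, page_size: int = 25) -> list[dict]:
    """Server-side keyword search over preprints, replacing the broken client-side filter."""
    url = f"{EUROPE_PMC_URL}?{urlencode(build_preprint_params(query, page_size))}"
    with urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("resultList", {}).get("result", [])
```

Unlike the `/0/json` pagination hack, the query here is evaluated against the whole corpus, not the 100 most recent papers.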

---

## P0-002: Free Tier LLM Cannot Perform Drug Identification

**File:** `src/agent_factory/judges.py:153-211`

**The Problem:**
Without an API key, the app uses `HFInferenceJudgeHandler` with:
- Llama 3.1 8B Instruct
- Mistral 7B Instruct

These are **7-8 billion parameter models**. They cannot:
- Reliably parse complex biomedical abstracts
- Identify drug candidates from scientific text
- Generate structured JSON output consistently
- Reason about mechanism of action

**Evidence of Failure:**
```python
# From MockJudgeHandler - the honest fallback when LLM fails
drug_candidates=[
    "Drug identification requires AI analysis",
    "Enter API key above for full results",
]
```

The team KNEW the free tier can't identify drugs and added this message.

**Root Cause:** Drug repurposing requires understanding:
- Drug mechanisms
- Disease pathophysiology
- Clinical trial phases
- Statistical significance

This requires GPT-4 / Claude Sonnet class models, an order of magnitude beyond what the free-tier 7-8B models can do reliably.

**Fix Options:**
1. **Require API key** - No free tier, be honest
2. **Use larger HF models** - Llama 70B or Mixtral 8x7B (expensive on free tier)
3. **Hybrid approach** - Use free tier for search, require paid for synthesis

---

## P0-003: PubMed Query Not Optimized

**File:** `src/tools/pubmed.py:54-71`

**The Problem:**
The query is passed directly to PubMed without optimization:
```python
search_params = self._build_params(
    db="pubmed",
    term=query,  # Raw user query!
    retmax=max_results,
    sort="relevance",
)
```

**What User Enters:** "What medications show promise for Long COVID?"

**What PubMed Receives:** `What medications show promise for Long COVID?`

**What PubMed Should Receive:**
```
("long covid"[Title/Abstract] OR "post-COVID"[Title/Abstract] OR "PASC"[Title/Abstract])
AND (drug[Title/Abstract] OR treatment[Title/Abstract] OR medication[Title/Abstract] OR therapy[Title/Abstract])
AND (clinical trial[Publication Type] OR randomized[Title/Abstract])
```

**Root Cause:** No query preprocessing or medical term expansion.

**Fix Options:**
1. **Add query preprocessor** - Extract medical entities, expand synonyms
2. **Use MeSH terms** - PubMed's controlled vocabulary for better recall
3. **LLM query generation** - Use LLM to generate optimized PubMed query

---

## P0-004: Loop Terminates Too Early

**File:** `src/app.py:42-45` and `src/utils/models.py`

**The Problem:**
```python
config = OrchestratorConfig(
    max_iterations=5,
    max_results_per_tool=10,
)
```

5 iterations is not enough to:
1. Search multiple variations of the query
2. Gather enough evidence for the Judge to synthesize
3. Refine queries based on initial results

**Evidence:** The user's output shows "Max Iterations Reached" with only 6 sources.

**Root Cause:** Conservative defaults to avoid API costs, but makes app useless.

**Fix Options:**
1. **Increase default to 10-15** - More iterations = better results
2. **Dynamic termination** - Stop when confidence > threshold, not iteration count
3. **Parallel query expansion** - Run more queries per iteration
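
Fix option 2 can be expressed as a termination predicate that checks evidence quality first and treats the iteration cap as a safety net. A sketch (thresholds and field names are illustrative, not the real `OrchestratorConfig` schema):

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    iteration: int = 0
    confidence: float = 0.0   # Judge's self-reported confidence, 0..1
    evidence_count: int = 0   # relevant sources gathered so far

def should_stop(state: LoopState, max_iterations: int = 15,
                confidence_threshold: float = 0.8, min_evidence: int = 10) -> bool:
    """Stop when the Judge is confident AND has enough sources; cap is a backstop."""
    if state.confidence >= confidence_threshold and state.evidence_count >= min_evidence:
        return True
    return state.iteration >= max_iterations
```

With this shape, a lucky early search can finish in 3 iterations, while a hard query gets the full budget instead of being cut off at 5.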

---

## P0-005: No Query Understanding Layer

**Files:** `src/orchestrator.py`, `src/tools/search_handler.py`

**The Problem:**
There's no NLU (Natural Language Understanding) layer. The system:
1. Takes raw user query
2. Passes directly to search tools
3. No entity extraction
4. No intent classification
5. No query expansion

For drug repurposing, you need to extract:
- **Disease:** "Long COVID" → [Long COVID, PASC, Post-COVID syndrome, chronic COVID]
- **Drug intent:** "medications" → [drugs, treatments, therapeutics, interventions]
- **Evidence type:** "show promise" → [clinical trials, efficacy, RCT]

**Root Cause:** No preprocessing pipeline between user input and search execution.

**Fix Options:**
1. **Add entity extraction** - Use BioBERT or PubMedBERT for medical NER
2. **Add query expansion** - Use medical ontologies (UMLS, MeSH)
3. **LLM preprocessing** - Use LLM to generate search strategy before searching
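
A toy version of that pipeline, with a hand-written lexicon standing in for BioBERT NER and MeSH/UMLS lookups (all trigger words and synonym lists below are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    """Structured output of the query-understanding step."""
    disease: list = field(default_factory=list)
    intervention: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

# Hand-written stand-in for medical NER + ontology expansion.
LEXICON = {
    "disease": {"long covid": ["Long COVID", "PASC", "Post-COVID syndrome", "chronic COVID"]},
    "intervention": {"medication": ["drugs", "treatments", "therapeutics", "interventions"]},
    "evidence": {"promise": ["clinical trial", "efficacy", "randomized"]},
}

def understand(raw_query: str) -> QueryPlan:
    q = raw_query.lower()
    plan = QueryPlan()
    for category, triggers in LEXICON.items():
        for trigger, expansion in triggers.items():
            if trigger in q:
                getattr(plan, category).extend(expansion)
    return plan
```

Even this naive version turns the raw question into separate disease/intervention/evidence term groups that the search tools can combine properly, which is the whole point of the missing layer.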

---

## P0-006: ClinicalTrials.gov Results Not Filtered

**File:** `src/tools/clinicaltrials.py`

**The Problem:**
ClinicalTrials.gov returns ALL matching trials including:
- Withdrawn trials
- Terminated trials
- Not yet recruiting
- Observational studies (not interventional)

For drug repurposing, you want:
- Interventional studies
- Phase 2+ (has safety/efficacy data)
- Completed or with results

**Root Cause:** No filtering of trial metadata.
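
A client-side filter over the returned records would implement the criteria above. Field names here are illustrative, not the actual ClinicalTrials.gov API schema; map them onto whatever the tool's parser emits:

```python
USABLE_STATUSES = {"COMPLETED", "ACTIVE_NOT_RECRUITING", "RECRUITING"}

def is_repurposing_candidate(trial: dict, min_phase: int = 2) -> bool:
    """Keep interventional, phase-2+ trials that are active or completed."""
    if trial.get("study_type") != "INTERVENTIONAL":
        return False  # drop observational studies
    if trial.get("status") not in USABLE_STATUSES:
        return False  # drop withdrawn / terminated / not-yet-recruiting
    return any(p >= min_phase for p in trial.get("phases", []))
```

Applying this before evidence reaches the Judge cuts out the trials that can never support a repurposing claim.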

---

## Summary: Why This App Produces Garbage

```
User Query: "What medications show promise for Long COVID?"
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ NO QUERY PREPROCESSING                                       │
│ - No entity extraction                                       │
│ - No synonym expansion                                       │
│ - No medical term normalization                              │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ BROKEN SEARCH LAYER                                          │
│ - PubMed: Raw query, no MeSH, gets 1 result                  │
│ - BioRxiv: Returns random papers (API doesn't support search)│
│ - ClinicalTrials: Returns all trials, no filtering           │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ GARBAGE EVIDENCE                                             │
│ - 6 papers, most irrelevant                                  │
│ - "Calf muscle adaptations" (mentions COVID once)            │
│ - "Ophthalmologist work-life balance"                        │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ DUMB JUDGE (Free Tier)                                       │
│ - Llama 8B can't identify drugs from garbage                 │
│ - JSON parsing fails                                         │
│ - Falls back to "Drug identification requires AI analysis"   │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────────────────────────────────┐
│ LOOP HITS MAX (5 iterations)                                 │
│ - Never finds enough good evidence                           │
│ - Never synthesizes anything useful                          │
└──────────────────────────────────────────────────────────────┘
    │
    ▼
    GARBAGE OUTPUT
```

---

## What Would Make This Actually Work

### Minimum Viable Fix (1-2 days)

1. **Remove BioRxiv** - It doesn't work
2. **Require API key** - Be honest that free tier is useless
3. **Add basic query preprocessing** - Strip question words, expand COVID synonyms
4. **Increase iterations to 10**
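
Item 3 needs no ML at all; a few lines suffice for the minimum viable version (the stopword list and synonym group are illustrative):

```python
import re

STOPWORDS = {"what", "which", "show", "shows", "promise", "for",
             "the", "a", "an", "do", "does", "are", "is"}
LONG_COVID_SYNONYMS = ["long covid", "post-COVID", "PASC"]

def preprocess(raw: str) -> str:
    """Strip question words, then expand 'long covid' into a synonym group."""
    words = [w for w in re.findall(r"[a-z0-9-]+", raw.lower()) if w not in STOPWORDS]
    query = " ".join(words)
    if "long covid" in query:
        group = " OR ".join(f'"{s}"' for s in LONG_COVID_SYNONYMS)
        query = query.replace("long covid", f"({group})")
    return query
```

This turns the failing example query into a boolean expression the search tools can actually use, which is a fraction of the proper NLU layer but enough to stop sending raw questions to PubMed.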

### Proper Fix (1-2 weeks)

1. **Query Understanding Layer**
   - Medical NER (BioBERT/SciBERT)
   - Query expansion with MeSH/UMLS
   - Intent classification (drug discovery vs mechanism vs safety)

2. **Optimized Search**
   - PubMed: Proper query syntax with MeSH terms
   - ClinicalTrials: Filter by phase, status, intervention type
   - Replace BioRxiv with Europe PMC (has preprints + search)

3. **Evidence Ranking**
   - Score by publication type (RCT > cohort > case report)
   - Score by journal impact factor
   - Score by recency
   - Score by citation count

4. **Proper LLM Pipeline**
   - Use GPT-4 / Claude for synthesis
   - Structured extraction of: drug, mechanism, evidence level, effect size
   - Multi-step reasoning: identify → validate → rank → synthesize
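
The ranking in step 3 can start as a simple weighted score. Weights, the decay window, and the citation cap below are illustrative and would need tuning; journal impact factor is omitted because it isn't present in the search results:

```python
# Publication-type prior: RCT > cohort > case report; unknown types get a middling score.
PUB_TYPE_SCORES = {"rct": 1.0, "cohort": 0.6, "case report": 0.2}

def evidence_score(pub_type: str, years_old: int, citations: int) -> float:
    """Weighted combination of publication type, recency, and citation count."""
    type_score = PUB_TYPE_SCORES.get(pub_type.lower(), 0.4)
    recency = max(0.0, 1.0 - years_old / 10)   # linear decay over 10 years
    citation_score = min(citations, 100) / 100  # cap so mega-cited reviews don't dominate
    return 0.5 * type_score + 0.3 * recency + 0.2 * citation_score
```

Sorting gathered evidence by this score before it reaches the Judge means the synthesis prompt sees RCTs first instead of whatever the search happened to return.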

---

## The Hard Truth

Building a drug repurposing agent that works is HARD. The state of the art is:

- **Drug2Disease (IBM)** - Uses knowledge graphs + ML
- **COVID-KG (Stanford)** - Dedicated COVID knowledge graph
- **Literature Mining at scale (PubMed)** - Millions of papers, not 10

This hackathon project is fundamentally a **search wrapper with an LLM prompt**. That's not enough.

To make it useful:
1. Either scope it down (e.g., "find clinical trials for X disease")
2. Or invest serious engineering in the NLU + search + ranking pipeline