# P0 CRITICAL BUGS - Why DeepCritical Produces Garbage Results

**Date:** November 27, 2025
**Status:** CRITICAL - App is functionally useless
**Severity:** P0 (Blocker)

## TL;DR

The app produces garbage because:

1. **BioRxiv search doesn't work** - returns random papers
2. **Free tier LLM is too dumb** - can't identify drugs
3. **Query construction is naive** - no optimization for PubMed/CT.gov syntax
4. **Loop terminates too early** - 5 iterations isn't enough

---
## P0-001: BioRxiv Search is Fundamentally Broken

**File:** `src/tools/biorxiv.py:248-286`

**The Problem:**

The bioRxiv API **DOES NOT SUPPORT KEYWORD SEARCH**.

The code does this:

```python
# Fetch recent papers (last 90 days, first 100 papers)
url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
# Then filter client-side for keywords
```

**What Actually Happens:**

1. Fetches the first 100 papers posted to medRxiv in the last 90 days (chronological order)
2. Filters those 100 random papers for query keywords
3. Returns whatever garbage matches

**Result:** For "Long COVID medications", you get random papers like:

- "Calf muscle structure-function adaptations"
- "Work-Life Balance of Ophthalmologists During COVID"

These papers contain "COVID" somewhere but have NOTHING to do with Long COVID treatments.

**Root Cause:** The `/0/json` pagination only returns 100 papers. You'd need to paginate through ALL papers (thousands) to do proper keyword filtering.

**Fix Options:**

1. **Remove BioRxiv entirely** - It's unusable without a proper search API
2. **Use a different preprint aggregator** - Europe PMC has preprints WITH search
3. **Add pagination** - Fetch all papers (slow, expensive)
4. **Use Semantic Scholar API** - Has preprints and proper search
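Fix option 2 is close to a drop-in replacement. A minimal sketch of building a Europe PMC preprint search request (assuming the public REST endpoint and the `SRC:PPR` preprint source filter; the function name and wiring are illustrative, not this project's API):

```python
import urllib.parse

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_preprint_search_url(query: str, page_size: int = 25) -> str:
    """Build a Europe PMC search URL restricted to preprints.

    Unlike the bioRxiv API, Europe PMC does server-side keyword
    search, so no client-side filtering of random papers is needed.
    SRC:PPR limits results to the preprint source (assumption:
    verify the exact field name against the Europe PMC docs).
    """
    params = urllib.parse.urlencode({
        "query": f"({query}) AND SRC:PPR",
        "format": "json",
        "pageSize": page_size,
    })
    return f"{EUROPE_PMC}?{params}"
```

Fetching that URL should return `resultList.result` entries with titles, abstracts, and DOIs, which can map onto the existing paper model.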
---

## P0-002: Free Tier LLM Cannot Perform Drug Identification

**File:** `src/agent_factory/judges.py:153-211`

**The Problem:**

Without an API key, the app uses `HFInferenceJudgeHandler` with:

- Llama 3.1 8B Instruct
- Mistral 7B Instruct

These are **7-8 billion parameter models**. They cannot:

- Reliably parse complex biomedical abstracts
- Identify drug candidates from scientific text
- Generate structured JSON output consistently
- Reason about mechanism of action

**Evidence of Failure:**

```python
# From MockJudgeHandler - the honest fallback when the LLM fails
drug_candidates=[
    "Drug identification requires AI analysis",
    "Enter API key above for full results",
]
```

The team KNEW the free tier can't identify drugs and added this message.

**Root Cause:** Drug repurposing requires understanding:

- Drug mechanisms
- Disease pathophysiology
- Clinical trial phases
- Statistical significance

This requires GPT-4 / Claude Sonnet class models (100B+ parameters).

**Fix Options:**

1. **Require API key** - No free tier; be honest
2. **Use larger HF models** - Llama 70B or Mixtral 8x7B (expensive on free tier)
3. **Hybrid approach** - Use the free tier for search, require a paid model for synthesis
---

## P0-003: PubMed Query Not Optimized

**File:** `src/tools/pubmed.py:54-71`

**The Problem:**

The query is passed directly to PubMed without optimization:

```python
search_params = self._build_params(
    db="pubmed",
    term=query,  # Raw user query!
    retmax=max_results,
    sort="relevance",
)
```

**What User Enters:** "What medications show promise for Long COVID?"

**What PubMed Receives:** `What medications show promise for Long COVID?`

**What PubMed Should Receive:**

```
("long covid"[Title/Abstract] OR "post-COVID"[Title/Abstract] OR "PASC"[Title/Abstract])
AND (drug[Title/Abstract] OR treatment[Title/Abstract] OR medication[Title/Abstract] OR therapy[Title/Abstract])
AND (clinical trial[Publication Type] OR randomized[Title/Abstract])
```

**Root Cause:** No query preprocessing or medical term expansion.

**Fix Options:**

1. **Add query preprocessor** - Extract medical entities, expand synonyms
2. **Use MeSH terms** - PubMed's controlled vocabulary for better recall
3. **LLM query generation** - Use an LLM to generate an optimized PubMed query
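Fix option 1 doesn't need ML to get started. A minimal, hand-rolled sketch that produces the field-tagged boolean query shown above (the synonym table is a toy placeholder for a real MeSH/UMLS lookup):

```python
# Toy synonym table - a real system would query MeSH/UMLS instead
DISEASE_SYNONYMS = {
    "long covid": ['"long covid"', '"post-COVID"', '"PASC"'],
}
INTERVENTION_TERMS = ["drug", "treatment", "medication", "therapy"]

def build_pubmed_query(disease: str) -> str:
    """Turn a disease name into a field-tagged PubMed boolean query."""
    synonyms = DISEASE_SYNONYMS.get(disease.lower(), [f'"{disease}"'])
    disease_clause = " OR ".join(f"{s}[Title/Abstract]" for s in synonyms)
    drug_clause = " OR ".join(
        f"{t}[Title/Abstract]" for t in INTERVENTION_TERMS
    )
    evidence_clause = (
        "clinical trial[Publication Type] OR randomized[Title/Abstract]"
    )
    return f"({disease_clause}) AND ({drug_clause}) AND ({evidence_clause})"
```

The resulting string goes straight into the existing `term=` parameter, replacing the raw user query.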
---

## P0-004: Loop Terminates Too Early

**File:** `src/app.py:42-45` and `src/utils/models.py`

**The Problem:**

```python
config = OrchestratorConfig(
    max_iterations=5,
    max_results_per_tool=10,
)
```

5 iterations is not enough to:

1. Search multiple variations of the query
2. Gather enough evidence for the Judge to synthesize
3. Refine queries based on initial results

**Evidence:** The user's output shows "Max Iterations Reached" with only 6 sources.

**Root Cause:** Conservative defaults to avoid API costs, but they make the app useless.

**Fix Options:**

1. **Increase the default to 10-15** - More iterations = better results
2. **Dynamic termination** - Stop when confidence exceeds a threshold, not at an iteration count
3. **Parallel query expansion** - Run more queries per iteration
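Fix option 2 turns the iteration cap into a safety net rather than the normal exit path. A sketch, assuming the judge can report a confidence score (the `JudgeVerdict` fields are hypothetical; the real judge output would need to expose something equivalent):

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    confidence: float   # judge's self-reported confidence, 0..1
    n_sources: int      # relevant sources gathered so far

def should_stop(verdict: JudgeVerdict,
                iteration: int,
                max_iterations: int = 15,
                confidence_threshold: float = 0.8,
                min_sources: int = 10) -> bool:
    """Stop on evidence quality, not just an iteration counter."""
    if iteration >= max_iterations:  # hard safety cap
        return True
    return (verdict.confidence >= confidence_threshold
            and verdict.n_sources >= min_sources)
```

With this, a run that finds strong evidence early stops early, and a run that hasn't converged keeps searching up to the (raised) cap.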
---

## P0-005: No Query Understanding Layer

**Files:** `src/orchestrator.py`, `src/tools/search_handler.py`

**The Problem:**

There is no NLU (Natural Language Understanding) layer. The system:

1. Takes the raw user query
2. Passes it directly to the search tools
3. No entity extraction
4. No intent classification
5. No query expansion

For drug repurposing, you need to extract:

- **Disease:** "Long COVID" → [Long COVID, PASC, Post-COVID syndrome, chronic COVID]
- **Drug intent:** "medications" → [drugs, treatments, therapeutics, interventions]
- **Evidence type:** "show promise" → [clinical trials, efficacy, RCT]

**Root Cause:** No preprocessing pipeline between user input and search execution.

**Fix Options:**

1. **Add entity extraction** - Use BioBERT or PubMedBERT for medical NER
2. **Add query expansion** - Use medical ontologies (UMLS, MeSH)
3. **LLM preprocessing** - Use an LLM to generate a search strategy before searching
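Even before wiring in BioBERT or UMLS, a dictionary-backed layer captures the mappings above. A sketch (the `ONTOLOGY` table is a toy stand-in for real ontology lookups):

```python
# Toy phrase -> expansion table; a real system would consult MeSH/UMLS
ONTOLOGY = {
    "long covid": ["long covid", "PASC", "post-COVID syndrome",
                   "chronic COVID"],
    "medications": ["drugs", "treatments", "therapeutics",
                    "interventions"],
    "show promise": ["clinical trial", "efficacy", "RCT"],
}

def expand_query(query: str) -> dict[str, list[str]]:
    """Map each recognized phrase in the query to its expansion set."""
    lowered = query.lower()
    return {
        phrase: expansions
        for phrase, expansions in ONTOLOGY.items()
        if phrase in lowered
    }
```

The expansion sets then feed the per-tool query builders (PubMed boolean clauses, CT.gov condition terms) instead of the raw question.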
---

## P0-006: ClinicalTrials.gov Results Not Filtered

**File:** `src/tools/clinicaltrials.py`

**The Problem:**

ClinicalTrials.gov returns ALL matching trials, including:

- Withdrawn trials
- Terminated trials
- Not yet recruiting
- Observational studies (not interventional)

For drug repurposing, you want:

- Interventional studies
- Phase 2+ (has safety/efficacy data)
- Completed or with results

**Root Cause:** No filtering of trial metadata.
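A post-fetch filter along those lines can be sketched as below. It assumes trial records already flattened into `status`, `study_type`, and `phases` keys; the real CT.gov v2 API nests these under `protocolSection`, so the field names here are illustrative:

```python
# Status/phase values mirroring CT.gov enumerations (verify against
# the API docs before relying on exact spellings)
WANTED_STATUSES = {"COMPLETED", "ACTIVE_NOT_RECRUITING", "RECRUITING"}
WANTED_PHASES = {"PHASE2", "PHASE3", "PHASE4"}

def filter_trials(trials: list[dict]) -> list[dict]:
    """Keep interventional, phase 2+ trials that aren't dead ends."""
    kept = []
    for t in trials:
        if t.get("study_type") != "INTERVENTIONAL":
            continue  # drop observational studies
        if t.get("status") not in WANTED_STATUSES:
            continue  # drop withdrawn/terminated/not-yet-recruiting
        if not WANTED_PHASES.intersection(t.get("phases", [])):
            continue  # drop phase 1 / unphased studies
        kept.append(t)
    return kept
```

Alternatively, the same constraints can be pushed into the API request itself so irrelevant trials never come over the wire.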
---

## Summary: Why This App Produces Garbage

```
User Query: "What medications show promise for Long COVID?"
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ NO QUERY PREPROCESSING                                        │
│ - No entity extraction                                        │
│ - No synonym expansion                                        │
│ - No medical term normalization                               │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ BROKEN SEARCH LAYER                                           │
│ - PubMed: Raw query, no MeSH, gets 1 result                   │
│ - BioRxiv: Returns random papers (API doesn't support search) │
│ - ClinicalTrials: Returns all trials, no filtering            │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ GARBAGE EVIDENCE                                              │
│ - 6 papers, most irrelevant                                   │
│ - "Calf muscle adaptations" (mentions COVID once)             │
│ - "Ophthalmologist work-life balance"                         │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ DUMB JUDGE (Free Tier)                                        │
│ - Llama 8B can't identify drugs from garbage                  │
│ - JSON parsing fails                                          │
│ - Falls back to "Drug identification requires AI analysis"    │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ LOOP HITS MAX (5 iterations)                                  │
│ - Never finds enough good evidence                            │
│ - Never synthesizes anything useful                           │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
        GARBAGE OUTPUT
```

---
## What Would Make This Actually Work

### Minimum Viable Fix (1-2 days)

1. **Remove BioRxiv** - It doesn't work
2. **Require API key** - Be honest that the free tier is useless
3. **Add basic query preprocessing** - Strip question words, expand COVID synonyms
4. **Increase iterations to 10**

### Proper Fix (1-2 weeks)

1. **Query Understanding Layer**
   - Medical NER (BioBERT/SciBERT)
   - Query expansion with MeSH/UMLS
   - Intent classification (drug discovery vs mechanism vs safety)
2. **Optimized Search**
   - PubMed: Proper query syntax with MeSH terms
   - ClinicalTrials: Filter by phase, status, intervention type
   - Replace BioRxiv with Europe PMC (has preprints + search)
3. **Evidence Ranking**
   - Score by publication type (RCT > cohort > case report)
   - Score by journal impact factor
   - Score by recency
   - Score by citation count
4. **Proper LLM Pipeline**
   - Use GPT-4 / Claude for synthesis
   - Structured extraction of: drug, mechanism, evidence level, effect size
   - Multi-step reasoning: identify → validate → rank → synthesize
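The evidence-ranking step (3) can start as a simple weighted score. The weights below are illustrative and uncalibrated, and journal impact factor is omitted since it needs an external lookup:

```python
# Illustrative weights - not calibrated against any gold standard
PUB_TYPE_WEIGHT = {
    "randomized controlled trial": 1.0,
    "cohort study": 0.6,
    "case report": 0.2,
}

def score_evidence(pub_type: str, year: int, citations: int,
                   current_year: int = 2025) -> float:
    """Combine publication type, recency, and citations into one score."""
    type_score = PUB_TYPE_WEIGHT.get(pub_type.lower(), 0.4)
    recency = max(0.0, 1.0 - (current_year - year) / 10)  # 10-year decay
    citation_score = min(1.0, citations / 100)  # saturates at 100 cites
    return 0.5 * type_score + 0.3 * recency + 0.2 * citation_score
```

Sorting the evidence pool by this score before handing it to the judge means the synthesis step sees RCTs first, not calf-muscle preprints.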
---
## The Hard Truth

Building a drug repurposing agent that works is HARD. The state of the art is:

- **Drug2Disease (IBM)** - Uses knowledge graphs + ML
- **COVID-KG (Stanford)** - A dedicated COVID knowledge graph
- **Literature mining at scale (PubMed)** - Millions of papers, not 10

This hackathon project is fundamentally a **search wrapper with an LLM prompt**. That's not enough.

To make it useful:

1. Either scope it down (e.g., "find clinical trials for X disease")
2. Or invest serious engineering in the NLU + search + ranking pipeline