Changelog

All notable changes to Thought-Retriever will be documented in this file.

[2.0.0] - 2025-05-19

Added

  • 🇨🇳 Chinese Embedding Engine: jieba tokenization + TF-IDF embedder (_JiebaTfidfEmbedder)

    • Selected automatically as the next backend in priority after sentence-transformers
    • 3-5x better semantic similarity on Chinese text than character n-gram TF-IDF
    • Fully offline, no model download required
    • Vocabulary caching for consistent embedding dimensions
  • 🇨🇳 Chinese Prompt Templates: thought_confidence_prompt_zh() and answer_generation_prompt_zh()

    • Auto language detection based on Chinese character ratio
    • Configurable via ThoughtConfig(language="zh")
  • 🧹 Smart Message Filtering: should_generate_thought() utility

    • Skips short or meaningless messages (e.g., "嗯" (mm-hm), "好" (okay), "ok")
    • 7 regex patterns for filtering noise
  • 🧹 Thought Content Cleaning: Auto-remove LLM output labels

    • Strips [知识点]:, [提炼的知识点] ("knowledge point", "extracted knowledge point") and similar labels from thought content
  • 📊 Language Configuration: ThoughtConfig.language and ThoughtConfig.thought_prompt_lang

    • auto: Auto-detect based on Chinese character ratio
    • zh: Force Chinese prompts
    • en: Force English prompts
  • 🔧 Public API Extensions:

    • ThoughtStore.clear_knowledge() public method
    • Export ThoughtStore, EmbeddingEngine, generate_id, timestamp_now, chunk_text from package
  • 🛡️ Robust Response Parsing: Enhanced _parse_thought_response()

    • Supports Chinese "是/否/有效/无效" ("yes/no/valid/invalid") in addition to "1/0"
    • Fallback: treats multi-line content as valid if it is ≥5 characters long
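
Two of the behaviours listed above, language auto-detection and tolerant response parsing, can be sketched as follows. This is an illustration only, not the package's implementation: the function names, the 0.3 ratio threshold, the use of the basic CJK Unified Ideographs block as the definition of "Chinese character", and the assumed response layout (a flag on the first line, the thought text on the following lines) are all assumptions.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
_CJK = re.compile(r"[\u4e00-\u9fff]")

# Flags named in the changelog: "1/0" plus Chinese yes/no/valid/invalid.
_VALID_FLAGS = {"1", "是", "有效"}
_INVALID_FLAGS = {"0", "否", "无效"}


def detect_language(text, threshold=0.3):
    """Return "zh" if the share of Chinese characters reaches `threshold`, else "en".

    The threshold value is hypothetical; the changelog does not state one.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "en"
    ratio = sum(1 for c in chars if _CJK.match(c)) / len(chars)
    return "zh" if ratio >= threshold else "en"


def parse_thought_response(response):
    """Return (is_valid, thought_text) from an LLM reply.

    Assumed layout: first non-empty line is the flag, the rest is the thought.
    """
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return False, ""
    head, body = lines[0], "\n".join(lines[1:])
    if head in _VALID_FLAGS:
        return True, body
    if head in _INVALID_FLAGS:
        return False, body
    # Fallback: no recognisable flag; accept multi-line content of >= 5 characters.
    content = "\n".join(lines)
    return (len(lines) > 1 and len(content) >= 5), content
```

With these definitions, `detect_language("今天天气很好")` yields `"zh"`, and a reply whose first line is `是` is treated the same way as one whose first line is `1`.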

Changed

  • Embedding Backend Priority: sentence-transformers → jieba_tfidf → tfidf → hash
  • JiebaTfidfEmbedder: Vocabulary and IDF are cached after the first encode call and reused in subsequent calls
  • setup.py: Added jieba>=0.42.1 to core dependencies
  • Minimum Python version: 3.8 (unchanged)
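
The backend priority chain above amounts to "use the first backend whose dependencies can be imported". A minimal sketch of that selection logic follows. The mapping from backend name to required module is an assumption (in particular, that the plain `tfidf` backend depends on scikit-learn), and `select_backend` is a hypothetical helper, not a function exported by the package.

```python
import importlib.util

# Priority order from the changelog. The module probed for each backend is an
# assumption made for this sketch, not taken from the package source.
BACKEND_REQUIREMENTS = [
    ("sentence-transformers", ["sentence_transformers"]),
    ("jieba_tfidf", ["jieba"]),
    ("tfidf", ["sklearn"]),
    ("hash", []),  # no third-party dependency, so it always qualifies
]


def _module_available(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def select_backend(is_available=_module_available):
    """Return the name of the highest-priority backend whose modules are present."""
    for backend, modules in BACKEND_REQUIREMENTS:
        if all(is_available(m) for m in modules):
            return backend
    return "hash"
```

Passing `is_available` as a parameter keeps the selection logic testable without installing or uninstalling any of the optional packages.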

Fixed

  • Fixed inconsistent embedding dimensions caused by jieba_tfidf rebuilding its vocabulary on every call
  • Fixed _parse_thought_response failing on Chinese LLM output formats
  • Fixed ThoughtMemory.clear() directly accessing private _knowledge attribute
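
To illustrate why caching the vocabulary fixes the dimension mismatch: if the vocabulary is rebuilt on every call, the vector length tracks whatever tokens appear in that call, so stored and query vectors cannot be compared. The sketch below freezes the vocabulary and IDF weights on the first `encode` call. It is a simplified stand-in, not the package's embedder: the class name is hypothetical, the IDF formula is one common smoothed variant, and whitespace splitting replaces jieba so the example runs with the standard library alone.

```python
import math
from collections import Counter


class CachedTfidfEmbedder:
    """TF-IDF embedder whose vocabulary is fixed after the first encode() call."""

    def __init__(self, tokenize=str.split):
        # Assumption: any callable str -> list[str] works here, e.g. jieba.lcut.
        self.tokenize = tokenize
        self.vocab = None  # token -> column index, frozen once built
        self.idf = None    # IDF weight per column, same order as vocab

    def _fit(self, texts):
        docs = [set(self.tokenize(t)) for t in texts]
        tokens = sorted(set().union(*docs))
        self.vocab = {tok: i for i, tok in enumerate(tokens)}
        n = len(docs)
        df = Counter(tok for doc in docs for tok in doc)
        # Smoothed IDF; the exact formula is an illustrative choice.
        self.idf = [math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in tokens]

    def encode(self, texts):
        if self.vocab is None:
            self._fit(texts)  # first call only: build and cache
        vectors = []
        for text in texts:
            vec = [0.0] * len(self.vocab)
            for tok, count in Counter(self.tokenize(text)).items():
                idx = self.vocab.get(tok)
                if idx is not None:  # tokens outside the cached vocabulary are ignored
                    vec[idx] = count * self.idf[idx]
            vectors.append(vec)
        return vectors
```

The trade-off of this design is that tokens first seen after the initial call contribute nothing to later vectors, which is the price of keeping every vector the same length.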

[1.0.0] - 2025-05-15

Added

  • Initial release based on TMLR 2026 paper
  • 5-step pipeline: Retrieval → Answer → Thought → Merge → Update
  • Dual filtering: confidence (c_i) + redundancy (s_i)
  • 3-tier embedding fallback: sentence-transformers → TF-IDF → hash
  • JSON file persistence
  • Trae AI IDE Skill integration
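
The dual filter listed above can be sketched as a single gate applied before a new thought is stored. The reading used here is an assumption: c_i is taken to be a boolean confidence flag, and s_i the highest cosine similarity between the new thought and any stored thought. The function names and the 0.9 cutoff are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_store(confident, new_vec, stored_vecs, max_similarity=0.9):
    """Keep a thought only if it is confident (c_i) and not redundant (s_i)."""
    if not confident:  # confidence filter
        return False
    # Redundancy filter: s_i is the closest match among stored thoughts.
    s_i = max((cosine(new_vec, v) for v in stored_vecs), default=0.0)
    return s_i < max_similarity
```

For example, `should_store(True, [1.0, 0.0], [[1.0, 0.0]])` rejects the thought as a duplicate, while the same vector checked against `[[0.0, 1.0]]` is accepted.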