Changelog

All notable changes to Thought-Retriever will be documented in this file.

[2.0.0] - 2025-05-19

🇨🇳 Chinese Embedding Engine: jieba tokenization + TF-IDF embedder (_JiebaTfidfEmbedder)
- Auto-selected as the second-priority backend, used when sentence-transformers is unavailable
- 3-5x better semantic similarity for Chinese text than character n-gram TF-IDF
- Fully offline, no model download required
- Vocabulary caching for consistent embedding dimensions
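
The vocabulary caching described above can be illustrated with a minimal sketch. This is not the library's _JiebaTfidfEmbedder: the class name CachedTfidfEmbedder, the smoothed IDF formula, and the character-bigram fallback used when jieba is not installed are all assumptions made for illustration.

```python
import math
from collections import Counter

try:
    import jieba

    def tokenize(text):
        # jieba.lcut returns a list of word tokens
        return [t for t in jieba.lcut(text) if t.strip()]
except ImportError:
    def tokenize(text):
        # Fallback for this sketch only: character bigrams
        chars = [c for c in text if not c.isspace()]
        if len(chars) < 2:
            return ["".join(chars)] if chars else []
        return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class CachedTfidfEmbedder:
    """Hypothetical stand-in showing vocabulary/IDF caching."""

    def __init__(self):
        self.vocab = None  # token -> column index, fixed after the first encode
        self.idf = None    # token -> IDF weight, fixed after the first encode

    def _fit(self, docs_tokens):
        df = Counter()
        for tokens in docs_tokens:
            df.update(set(tokens))
        self.vocab = {tok: i for i, tok in enumerate(sorted(df))}
        n = len(docs_tokens)
        # Smoothed IDF (an assumed formula, not taken from the project)
        self.idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in self.vocab}

    def encode(self, texts):
        docs_tokens = [tokenize(t) for t in texts]
        if self.vocab is None:
            self._fit(docs_tokens)  # only the first call builds the vocabulary
        vectors = []
        for tokens in docs_tokens:
            vec = [0.0] * len(self.vocab)
            for tok, count in Counter(tokens).items():
                idx = self.vocab.get(tok)
                if idx is not None:  # out-of-vocabulary tokens are ignored
                    vec[idx] = count * self.idf[tok]
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors
```

Because the vocabulary is fixed on the first call, every later call yields vectors of the same length, so stored and query embeddings stay comparable.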
🇨🇳 Chinese Prompt Templates: thought_confidence_prompt_zh() and answer_generation_prompt_zh()
- Auto language detection based on Chinese character ratio
- Configurable via ThoughtConfig(language="zh")
🧹 Smart Message Filtering: should_generate_thought() utility
- Skip short or meaningless messages (e.g., "嗯" ("mm"), "好" ("fine"), "ok")
- 7 regex patterns for filtering noise
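
A rough sketch of how such a filter might look follows. The three patterns, the min_chars parameter, and the function name should_generate_thought_sketch are hypothetical; the actual seven patterns used by should_generate_thought() are not listed in this changelog.

```python
import re

# Illustrative noise patterns only; the real list may differ.
_NOISE_PATTERNS = [
    re.compile(r"^[\s\W_]*$"),  # whitespace, punctuation, or emoji only
    re.compile(r"^(嗯+|哦+|好的?|行|对|是的?)[。.!！~～]*$"),  # short Chinese acknowledgements
    re.compile(r"^(ok(ay)?|yes|no|yep|thanks?|thx)[.!]*$", re.IGNORECASE),
]


def should_generate_thought_sketch(message, min_chars=2):
    """Return False for messages too short or too generic to yield a thought."""
    text = message.strip()
    if len(text) < min_chars:
        return False
    return not any(p.match(text) for p in _NOISE_PATTERNS)
```

Keeping the patterns in a module-level list makes it easy to add or remove a rule without touching the function body.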
🧹 Thought Content Cleaning: Auto-remove LLM output labels
- Strip labels such as [知识点]： ("knowledge point") and [提炼的知识点] ("extracted knowledge point") from thought content
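
As a minimal sketch of this kind of cleaning, assuming the labels always appear at the start of a line as a bracketed tag containing 知识点, optionally followed by a colon. The regex and the function name clean_thought are illustrative, not the library's implementation.

```python
import re

# Assumed label shape: [..知识点..] or 【..知识点..】 followed by an optional colon.
_LABEL = re.compile(r"^\s*[\[【][^\]】]*知识点[^\]】]*[\]】]\s*[:：]?\s*")


def clean_thought(text):
    """Remove leading label tags from each line and drop lines left empty."""
    lines = [_LABEL.sub("", line) for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())
```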
📊 Language Configuration: ThoughtConfig.language and ThoughtConfig.thought_prompt_lang
- auto: Auto-detect based on Chinese character ratio
- zh: Force Chinese prompts
- en: Force English prompts
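
The auto mode above can be sketched as follows. The 0.3 threshold, the restriction to the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the function names are assumptions; the changelog does not state the ratio the library actually uses.

```python
def chinese_ratio(text):
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def resolve_language(text, language="auto", threshold=0.3):
    """Map a language setting ("auto", "zh", "en") to a concrete prompt language."""
    if language in ("zh", "en"):
        return language  # forced by configuration
    # threshold is a hypothetical default, not the library's value
    return "zh" if chinese_ratio(text) >= threshold else "en"
```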
🔧 Public API Extensions:
- ThoughtStore.clear_knowledge() public method
- Export ThoughtStore, EmbeddingEngine, generate_id, timestamp_now, chunk_text from package
🛡 Robust Response Parsing: Enhanced _parse_thought_response()
- Support Chinese "是/否/有效/无效" (yes/no/valid/invalid) in addition to "1/0"
- Fallback: treat multi-line content as valid if ≥5 chars
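
One possible reading of this parsing logic is sketched below. It assumes the LLM puts the thought on the leading lines and the verdict on the last line; that layout, the return type, and the function name parse_thought_response are guesses for illustration and may not match _parse_thought_response().

```python
_POSITIVE = {"1", "是", "有效"}
_NEGATIVE = {"0", "否", "无效"}


def parse_thought_response(response):
    """Return (thought, is_valid), assuming the verdict is on the last line."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return "", False
    verdict = lines[-1].strip("。.!！ ")
    body = "\n".join(lines[:-1])
    if verdict in _POSITIVE:
        return body, bool(body)  # a verdict with no thought is not usable
    if verdict in _NEGATIVE:
        return body, False
    # Fallback: no recognizable verdict, so accept multi-line content of >= 5 chars
    content = "\n".join(lines)
    return content, len(lines) > 1 and len(content) >= 5
```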

Embedding Backend Priority: sentence-transformers → jieba_tfidf → tfidf → hash
_JiebaTfidfEmbedder: Vocabulary and IDF weights are cached after the first encode call and reused in subsequent calls
setup.py: Added jieba>=0.42.1 to core dependencies
Minimum Python version: 3.8 (unchanged)
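
The backend priority order listed above amounts to a first-available fallback chain. The sketch below shows one way to express it; the module names checked for each backend (in particular sklearn for tfidf) and the function name pick_backend are assumptions, not taken from the project.

```python
import importlib.util

# (backend name, module that must be importable, or None if always available)
_BACKENDS = [
    ("sentence-transformers", "sentence_transformers"),
    ("jieba_tfidf", "jieba"),
    ("tfidf", "sklearn"),  # assumed dependency for the plain TF-IDF backend
    ("hash", None),        # last resort, needs nothing extra
]


def pick_backend(is_available=None):
    """Return the name of the first backend whose dependency is importable."""
    if is_available is None:
        def is_available(module):
            return importlib.util.find_spec(module) is not None
    for name, module in _BACKENDS:
        if module is None or is_available(module):
            return name
    return "hash"
```

Passing is_available explicitly makes the selection easy to test without installing or uninstalling packages.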

Fixed inconsistent embedding dimensions caused by jieba_tfidf rebuilding its vocabulary on every call
Fixed _parse_thought_response failing on Chinese LLM output formats
Fixed ThoughtMemory.clear() directly accessing the private _knowledge attribute