# 🔎 KoE5
Introducing KoE5, a model with advanced retrieval abilities. It has shown remarkable performance in Korean text retrieval.
For details, visit the [KURE repository](https://github.com/nlpai-lab/KURE).
## Model Versions
| Model Name | Dimension | Sequence Length | Introduction |
|---|---|---|---|
| KURE-v1 | 1024 | 8192 | Fine-tuned BAAI/bge-m3 with Korean data via CachedGISTEmbedLoss |
| KoE5 | 1024 | 512 | Fine-tuned intfloat/multilingual-e5-large with ko-triplet-v1.0 via CachedMultipleNegativesRankingLoss |
## Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
- Developed by: NLP&AI Lab
- Language(s) (NLP): Korean, English
- License: MIT
- Finetuned from model: intfloat/multilingual-e5-large
- Finetuned dataset: ko-triplet-v1.0
## Example code
### Install Dependencies
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
### Python code
Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("nlpai-lab/KoE5")

# Run inference
sentences = [
    'query: 헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
    'passage: 4. 시사점과 개선방향 앞서 살펴본 바와 같이 우리 헌법과 「법원조직법」은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다. 더욱이 합의체로서의 대법원 원리를 채택하고 있는 것 역시 그 구성의 다양성을 요청하는 것으로 해석된다. 이와 같은 관점에서 볼 때 현직 법원장급 고위법관을 중심으로 대법원을 구성하는 관행은 개선할 필요가 있는 것으로 보인다.',
    'passage: □ 연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음 ○ 5인의 다수 재판관은 소송관계인의 인격권 보호, 공정한 절차의 보장과 방해받지 않는 법과 진실 발견 등을 근거로 하여 텔레비전 촬영에 대한 절대적인 금지를 헌법에 합치하는 것으로 보았음 ○ 그러나 나머지 3인의 재판관은 행정법원의 소송절차는 특별한 인격권 보호의 이익도 없으며, 텔레비전 공개주의로 인해 법과 진실 발견의 과정이 언제나 위태롭게 되는 것은 아니라면서 반대의견을 제시함 ○ 왜냐하면 행정법원의 소송절차에서는 소송당사자가 개인적으로 직접 심리에 참석하기보다는 변호사가 참석하는 경우가 많으며, 심리대상도 사실문제가 아닌 법률문제가 대부분이기 때문이라는 것임 □ 한편, 연방헌법재판소는 「연방헌법재판소법」(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a조에 따라 제한적이나마 재판에 대한 방송을 허용하고 있음 ○ 「연방헌법재판소법」 제17조에서 「법원조직법」 제14절 내지 제16절의 규정을 준용하도록 하고 있지만, 녹음이나 촬영을 통한 재판공개와 관련하여서는 「법원조직법」과 다른 내용을 규정하고 있음',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6721, 0.3897],
#         [0.6721, 1.0000, 0.3740],
#         [0.3897, 0.3740, 1.0000]])
```
## Training Details
### Training Data
- ko-triplet-v1.0
- Korean query–document–hard-negative triplets (open data)
- About 700,000+ examples used in total
### Training Procedure
- Loss: CachedMultipleNegativesRankingLoss from sentence-transformers (a sketch of the recipe follows this list)
- Batch size: 512
- Learning rate: 1e-05
- Epochs: 1
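For reference, the recipe above maps onto the sentence-transformers v3 trainer roughly as follows. This is a hedged sketch, not the authors' training script: the dataset column layout, `mini_batch_size`, and `output_dir` are assumptions.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large")

# ko-triplet-v1.0: (query, positive document, hard negative) triplets.
# The exact column names/order are an assumption here.
train_dataset = load_dataset("nlpai-lab/ko-triplet-v1.0", split="train")

# The cached loss decouples the effective batch size (512) from GPU memory:
# embeddings are computed in mini-batches with gradient checkpointing.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="koe5-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=512,
    learning_rate=1e-5,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```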
## Evaluation
### Metrics
- Recall, Precision, NDCG, F1 (see the sketch below)
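As a rough reference for how these metrics are computed per query at a cutoff k, here is a minimal sketch assuming binary relevance; the actual evaluation code lives in the KURE repository.

```python
import math

def retrieval_metrics_at_k(retrieved, relevant, k):
    """Per-query Recall/Precision/NDCG/F1 at cutoff k with binary relevance."""
    top = retrieved[:k]
    hits = sum(1 for doc in top if doc in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / k
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # DCG with binary gains; the ideal ranking puts all relevant docs first.
    dcg = sum(1.0 / math.log2(rank + 2) for rank, doc in enumerate(top) if doc in relevant)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg else 0.0
    return {"recall": recall, "precision": precision, "ndcg": ndcg, "f1": f1}

print(retrieval_metrics_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))
```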
### Benchmark Datasets
- Ko-StrategyQA: Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
- AutoRAGRetrieval: Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, medicine, law, and commerce
- MIRACLRetrieval: Korean document retrieval dataset based on Wikipedia
- PublicHealthQA: Korean document retrieval dataset for the medical and public health domain
- BelebeleRetrieval: Korean document retrieval dataset based on FLORES-200
- MrTidyRetrieval: Korean document retrieval dataset based on Wikipedia
- MultiLongDocRetrieval: Korean long-document retrieval dataset across various domains
- XPQARetrieval: Korean document retrieval dataset across various domains
### Results
Below are the average results for every model across all benchmark datasets. Detailed results are available on the KURE GitHub repository.
#### Top-k 1
| Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 |
|---|---|---|---|---|
| nlpai-lab/KURE-v1 | 0.52640 | 0.60551 | 0.60551 | 0.55784 |
| dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 |
| BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 |
| nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 |
| intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 |
| jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 |
| BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 |
| intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 |
| intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 |
| intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 |
| Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 |
| openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 |
| Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 |
| upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 |
| jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 |
#### Top-k 3
| Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
|---|---|---|---|---|
| nlpai-lab/KURE-v1 | 0.68678 | 0.28711 | 0.65538 | 0.39835 |
| dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 |
| BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 |
| intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 |
| nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 |
| BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 |
| jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 |
| intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 |
| Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 |
| intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 |
| intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 |
| openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 |
| Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 |
| upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 |
| jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 |
#### Top-k 5
| Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
|---|---|---|---|---|
| nlpai-lab/KURE-v1 | 0.73851 | 0.19130 | 0.67479 | 0.29903 |
| dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 |
| BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 |
| nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 |
| intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 |
| BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 |
| jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 |
| intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 |
| Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 |
| intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 |
| intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 |
| openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 |
| Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 |
| upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 |
| jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 |
#### Top-k 10
| Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
|---|---|---|---|---|
| nlpai-lab/KURE-v1 | 0.79682 | 0.10624 | 0.69473 | 0.18524 |
| dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 |
| BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | 0.68189 | 0.18260 |
| intfloat/multilingual-e5-large | 0.75902 | 0.10147 | 0.66370 | 0.17693 |
| nlpai-lab/KoE5 | 0.75296 | 0.09937 | 0.66012 | 0.17369 |
| BAAI/bge-multilingual-gemma2 | 0.76153 | 0.10364 | 0.65330 | 0.18003 |
| jinaai/jina-embeddings-v3 | 0.76277 | 0.10240 | 0.65290 | 0.17843 |
| intfloat/multilingual-e5-large-instruct | 0.74851 | 0.09888 | 0.64451 | 0.17283 |
| Alibaba-NLP/gte-multilingual-base | 0.75631 | 0.09938 | 0.64025 | 0.17363 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.74092 | 0.09607 | 0.63258 | 0.16847 |
| intfloat/multilingual-e5-base | 0.73512 | 0.09717 | 0.63216 | 0.16977 |
| intfloat/e5-mistral-7b-instruct | 0.73795 | 0.09777 | 0.63076 | 0.17078 |
| openai/text-embedding-3-large | 0.72946 | 0.09571 | 0.61670 | 0.16739 |
| Salesforce/SFR-Embedding-2_R | 0.71662 | 0.09546 | 0.60589 | 0.16651 |
| upskyy/bge-m3-korean | 0.71895 | 0.09583 | 0.60258 | 0.16712 |
| jhgan/ko-sroberta-multitask | 0.61225 | 0.07826 | 0.48687 | 0.13757 |
## FAQ
- Do I need to add the prefix "query: " and "passage: " to input texts?

  Yes, this is how the model was trained; otherwise you will see performance degradation.

  Here are some rules of thumb (see the sketch after this list):
  - Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
  - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
  - Use the "query: " prefix if you want to use embeddings as features, such as for linear-probing classification or clustering.
## Citation
If you find our paper or models helpful, please consider citing them as follows:
```bibtex
@misc{KURE,
  publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
  year = {2024},
  url = {https://github.com/nlpai-lab/KURE}
}

@misc{KoE5,
  author = {NLP & AI Lab and Human-Inspired AI research},
  title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
  year = {2024},
  publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/KoE5}}
}
```
## Limitations
Long texts will be truncated to at most 512 tokens.
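You can inspect or lower this limit through the standard sentence-transformers `max_seq_length` attribute; a small sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KoE5")
print(model.max_seq_length)  # 512; tokens beyond this are silently dropped

# Optionally lower the limit to speed up encoding of short texts.
model.max_seq_length = 256
```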