Question answering

Time to look at question answering! This task comes in many flavors, but the one we'll focus on in this section is called extractive question answering. It involves posing questions about a document and identifying the answers as spans of text in the document itself.

We will fine-tune a BERT model on the SQuAD dataset, which consists of questions posed by crowdworkers on a set of Wikipedia articles. This will give us a model that predicts each answer as a span of the given context.

A model trained with the code shown in this section has been uploaded to the Hub, where you can find it and double-check its predictions.

💡 Encoder-only models like BERT tend to be great at extracting answers to factoid questions like "Who invented the Transformer architecture?" but fare poorly when given open-ended questions like "Why is the sky blue?" In these more challenging cases, encoder-decoder models like T5 and BART are typically used to synthesize the information in a way that's quite similar to text summarization. If you're interested in this type of generative question answering, we recommend checking out our demo based on the ELI5 dataset.

Preparing the data

The dataset used most often as an academic benchmark for extractive question answering is SQuAD, so that's the one we'll use here. There is also a harder SQuAD v2 benchmark, which includes questions that don't have an answer. As long as your own dataset contains a column for contexts, a column for questions, and a column for answers, you should be able to adapt the steps below.

The SQuAD dataset

As usual, we can download and cache the dataset in just one step thanks to load_dataset():

from datasets import load_dataset

raw_datasets = load_dataset("squad")

We can then have a look at this object to learn more about the SQuAD dataset:

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

It looks like we have everything we need with the context, question, and answers fields, so let's print those for the first element of our training set:

print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context: 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

The context and question fields are very straightforward to use. The answers field is a bit trickier: it is a dictionary with two fields, text and answer_start, that are both lists. This is the format expected by the squad metric during evaluation; if you are using your own data, you don't necessarily need to put the answers in the same format. The text field is self-explanatory, and the answer_start field contains the starting character index of each answer in the context.
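As a quick illustration of how text and answer_start relate, here is a minimal sketch. The context and answer below are made up for the example (they are not an actual SQuAD record); the point is only that slicing the context from answer_start over the length of the answer gives back the answer text.

```python
# Hypothetical record in the SQuAD "answers" format (not real SQuAD data).
context = "The Grotto is a replica of the grotto at Lourdes, France."
answer_text = "Lourdes, France"
answers = {"text": [answer_text], "answer_start": [context.find(answer_text)]}

# answer_start is a character offset into the context, so slicing the
# context from that offset over the answer's length recovers the answer.
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])
recovered = context[start_char:end_char]
print(recovered)
```

The same start_char and end_char arithmetic is what the preprocessing later in this section relies on.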

During training, there is only one possible answer per question. We can double-check this with the Dataset.filter() method:

raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

For evaluation, however, there are several possible answers for each sample, which may be the same or different:

print(raw_datasets["validation"][0]["answers"])
print(raw_datasets["validation"][2]["answers"])

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}

We won't dive into the evaluation script, as it will all be wrapped up for us by a metric from the 🤗 Evaluate library. The short version is that some of the questions have several possible answers, and this script will compare a predicted answer to all the acceptable answers and take the best score. If we take a look at the sample at index 2, for instance:

print(raw_datasets["validation"][2]["context"])
print(raw_datasets["validation"][2]["question"])

'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.'
'Where did Super Bowl 50 take place?'

We can see that the answer can indeed be any of the three possible answers we saw before.
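To make the "compare against every acceptable answer and keep the best score" idea concrete, here is a deliberately simplified sketch. The helper names exact_match and best_score are hypothetical, and the comparison is a bare case-insensitive exact match; the real squad metric applies more normalization and also reports an F1 score, so treat this only as an illustration of the max-over-references step.

```python
def exact_match(prediction, reference):
    # Simplified comparison: case-insensitive, ignoring surrounding whitespace.
    return float(prediction.strip().lower() == reference.strip().lower())


def best_score(prediction, references):
    # Score the prediction against every acceptable answer and keep the best.
    return max(exact_match(prediction, ref) for ref in references)


# The three reference answers for the validation sample at index 2.
references = [
    "Santa Clara, California",
    "Levi's Stadium",
    "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
]

score_a = best_score("Levi's Stadium", references)  # matches one reference
score_b = best_score("San Francisco", references)  # matches none
print(score_a, score_b)
```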

Processing the training data

Let's start with preprocessing the training data. The hard part will be to generate labels for the question's answer, which will be the start and end positions of the tokens corresponding to the answer inside the context.

But let's not get ahead of ourselves. First, we need to convert the text in the input into IDs the model can make sense of, using a tokenizer:

from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

As mentioned previously, we will be fine-tuning a BERT model, but you can use any other model type as long as it has a fast tokenizer implemented. You can see all the architectures that come with a fast version in the big table of supported architectures in the 🤗 Transformers documentation, and to check that the tokenizer object you're using is indeed backed by 🤗 Tokenizers you can look at its is_fast attribute:

tokenizer.is_fast

True

We can pass our tokenizer the question and the context together, and it will properly insert the special tokens to form a sentence like this:

[CLS] question [SEP] context [SEP]

Let's double-check:

context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, '
'the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin '
'Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms '
'upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred '
'Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette '
'Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues '
'and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

The labels will then be the indices of the tokens starting and ending the answer, and the model will be tasked with predicting one start and one end logit per token in the input, with the theoretical labels being as follows:

One-hot encoded labels for question answering.

In this case the context is not too long, but some of the examples in the dataset have very long contexts that will exceed the maximum length we set (which is 384 in this case). As we saw in Chapter 6 when we explored the internals of the question-answering pipeline, we will deal with long contexts by creating several training features from one sample of our dataset, with a sliding window between them.
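As a rough illustration of the sliding-window idea, independent of any tokenizer, the sketch below splits a plain list of tokens into overlapping chunks. The function name split_with_overlap and its window parameter are hypothetical; this is not the tokenizer's actual implementation, only a picture of what "chunks that overlap by stride tokens" means.

```python
def split_with_overlap(tokens, window, stride):
    """Split tokens into chunks of at most `window` items overlapping by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        # Move forward so that the next chunk repeats the last `stride` tokens.
        start += window - stride
    return chunks


# Integers stand in for context tokens here.
context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, so an answer that straddles a chunk boundary still has a chance of appearing whole in some chunk.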

To see how this works using the current example, we can limit the length to 100 and use a sliding window of 50 tokens. As a reminder, we use:

max_length to set the maximum length (here 100)
truncation="only_second" to truncate the context (which is in the second position) when the question with its context is too long
stride to set the number of overlapping tokens between two successive chunks (here 50)
return_overflowing_tokens=True to let the tokenizer know we want the overflowing tokens

inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP]. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

As we can see, our example has been split into four inputs, each of them containing the question and some part of the context. Note that the answer to the question ("Saint Bernadette Soubirous") only appears in the third and last inputs, so by dealing with long contexts in this way we will create some training examples where the answer is not included in the context. For those examples, the labels will be start_position = end_position = 0 (so we predict the [CLS] token). We will also set those labels in the unfortunate case where the answer has been truncated so that we only have the start (or end) of it. For the examples where the answer is fully in the context, the labels will be the index of the token where the answer starts and the index of the token where the answer ends.

The dataset provides us with the start character of the answer in the context, and by adding the length of the answer, we can find the end character in the context. To map those to token indices, we will need to use the offset mappings we studied in Chapter 6. We can have our tokenizer return these by passing along return_offsets_mapping=True:

inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

As we can see, we get back the usual input IDs, token type IDs, and attention mask, as well as the offset mapping we required and an extra key, overflow_to_sample_mapping. The corresponding value will be of use to us when we tokenize several texts at the same time (which we should do to benefit from the fact that our tokenizer is backed by Rust). Since one sample can give several features, it maps each feature to the example it originated from. Because here we only tokenized one example, we get a list of 0s:

inputs["overflow_to_sample_mapping"]

[0, 0, 0, 0]

But if we tokenize more examples, this will become more useful:

inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

'The 4 examples gave 19 features.'
'Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3].'

As we can see, the first three examples (at indices 2, 3, and 4 in the training set) each gave four features and the last example (at index 5 in the training set) gave seven features.
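Counting by hand gets tedious with larger batches, so here is a small sketch that tallies the features per example. It copies the overflow_to_sample_mapping list printed above into a plain Python list (the variable names are just for illustration) and uses collections.Counter.

```python
from collections import Counter

# The overflow_to_sample_mapping printed above, copied as a plain list.
sample_map = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

# Number of features produced by each example in the batch.
features_per_sample = Counter(sample_map)

# The batch was the slice [2:6] of the training set, so batch index 0
# corresponds to training index 2, and so on.
first_train_index = 2
for sample_idx, n_features in sorted(features_per_sample.items()):
    print(f"training index {first_train_index + sample_idx}: {n_features} features")
```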

This information will be useful to map each feature we get to its corresponding label. As mentioned earlier, those labels are:

(0, 0) if the answer is not in the corresponding span of the context
(start_position, end_position) if the answer is in the corresponding span of the context, with start_position being the index of the token (in the input IDs) at the start of the answer and end_position being the index of the token (in the input IDs) where the answer ends

To determine which of these is the case and, if relevant, the positions of the tokens, we first find the indices that start and end the context in the input IDs. We could use the token type IDs to do this, but since those do not necessarily exist for all models (DistilBERT does not require them, for instance), we'll instead use the sequence_ids() method of the BatchEncoding our tokenizer returns.

Once we have those token indices, we look at the corresponding offsets, which are tuples of two integers representing the span of characters inside the original context. We can thus detect if the chunk of the context in this feature starts after the answer begins or ends before the answer ends (in which case the label is (0, 0)). If that's not the case, we loop to find the first and last token of the answer:

answers = raw_datasets["train"][2:6]["answers"]
start_positions = []
end_positions = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

start_positions, end_positions

([83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0],
 [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0])

Let's take a look at a few results to verify that our approach is correct. For the first feature we find (83, 85) as labels, so let's compare the theoretical answer with the decoded span of tokens from 83 to 85 (inclusive):

idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

'Theoretical answer: the Main Building, labels give: the Main Building'

So that's a match! Now let's check index 4, where we set the labels to (0, 0), which means the answer is not in the context chunk of that feature:

idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

'Theoretical answer: a Marian place of prayer and reflection, decoded example: [CLS] What is the Grotto at Notre Dame? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grot [SEP]'

Indeed, the answer does not appear in this chunk, which is cut off right after "the Grot".
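The labeling logic can also be tried on a toy input that is small enough to check by hand. The sketch below assumes a made-up feature whose tokens are whitespace-separated words with hand-written offsets; the function name char_span_to_token_span is hypothetical, but its body follows the same two loops used above.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully in the chunk."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return (0, 0)
    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return (start_token, end_token)


context = "The cat sat on the mat"
answer = "on the mat"
start_char = context.find(answer)
end_char = start_char + len(answer)

# Toy feature: [CLS] Where [SEP] The cat sat on the mat [SEP]
offsets = [(0, 0), (0, 5), (0, 0),
           (0, 3), (4, 7), (8, 11), (12, 14), (15, 18), (19, 22),
           (0, 0)]

full = char_span_to_token_span(offsets, 3, 8, start_char, end_char)
# Pretend the chunk was truncated after "the": the answer is no longer complete.
truncated = char_span_to_token_span(offsets, 3, 7, start_char, end_char)
print(full, truncated)
```

In the full chunk the answer maps to tokens 6 through 8 ("on", "the", "mat"); in the truncated chunk the label falls back to (0, 0).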

✏️ Your turn! When using the XLNet architecture, padding is applied on the left and the question and context are switched. Adapt all the code we just saw to the XLNet architecture (and add padding=True). Be aware that the [CLS] token may not be at the 0 position with padding applied.

Now that we have seen step by step how to preprocess our training data, we can group it in a function we will apply on the whole training dataset. We'll pad every feature to the maximum length we set, as most of the contexts will be long (and the corresponding samples will be split into several features), so there is no real benefit to applying dynamic padding here:

max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

Note that we defined two constants to determine the maximum length used as well as the length of the sliding window, and that we added a tiny bit of cleanup before tokenizing: some of the questions in the SQuAD dataset have extra spaces at the beginning and the end that don't add anything (and take up space when being tokenized if you use a model like RoBERTa), so we removed those extra spaces.

To apply this function to the whole training set, we use the Dataset.map() method with the batched=True flag. It's necessary here as we are changing the length of the dataset (since one example can give several training features):

train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)

(87599, 88729)

As we can see, the preprocessing added 1,130 features (88,729 features from 87,599 examples). Our training set is now ready to be used, so let's dig into the preprocessing of the validation set!

Processing the validation data

Preprocessing the validation data will be slightly easier as we don't need to generate labels (unless we want to compute a validation loss, but that number won't really help us understand how good the model is). The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we'll use that ID.

The only thing we'll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we're in the post-processing stage we won't have any way to know which part of the input IDs corresponded to the context and which part was the question (the sequence_ids() method we used is available for the output of the tokenizer only). So, we'll set the offsets corresponding to the question (along with those of special and padding tokens) to None:

def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
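To see what the offset cleanup does in isolation, here is a tiny sketch on hand-written values. The sequence_ids and offsets lists below are made up for a feature of the form [CLS] question [SEP] context [SEP]; only the list comprehension is taken from the function above.

```python
# Hypothetical sequence IDs: None for special tokens, 0 for the question, 1 for the context.
sequence_ids = [None, 0, 0, None, 1, 1, None]
# Hypothetical character offsets for the same seven tokens.
offset = [(0, 0), (0, 5), (6, 9), (0, 0), (0, 3), (4, 7), (0, 0)]

# Keep offsets only for context tokens; everything else becomes None.
masked = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)]
print(masked)
```

After this step, a None offset is enough to tell that a position lies outside the context, which is what the post-processing relies on.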

We can apply this function on the whole validation dataset like before:

validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)

(10570, 10822)

In this case we've only added 252 features (10,822 features from 10,570 examples), so it appears the contexts in the validation dataset are a bit shorter.

Now that we have preprocessed all the data, we can get to the training.

Fine-tuning the model with the Trainer API

The training code for this example will look a lot like the code in the previous sections; the hardest thing will be to write the compute_metrics() function. Since we padded all the samples to the maximum length we set, there is no data collator to define, so this metric computation is really the only thing we have to worry about. The difficult part will be to post-process the model predictions into spans of text in the original examples; once we have done that, the metric from the 🤗 Evaluate library will do most of the work for us.

Post-processing

The model will output logits for the start and end positions of the answer in the input IDs, as we saw during our exploration of the question-answering pipeline in Chapter 6. The post-processing step will be similar to what we did there, so here's a quick reminder of the actions we took:

We masked the start and end logits corresponding to tokens outside of the context.
We then converted the start and end logits into probabilities using a softmax.
We attributed a score to each (start_token, end_token) pair by taking the product of the corresponding two probabilities.
We looked for the pair with the maximum score that yielded a valid answer (e.g., a start_token that does not come after the end_token).

Here we will change this process slightly because we don't need to compute actual scores (just the predicted answer). This means we can skip the softmax step. To go faster, we also won't score all the possible (start_token, end_token) pairs, but only the ones corresponding to the highest n_best logits (with n_best=20). Since we will skip the softmax, those scores will be logit scores, and will be obtained by taking the sum of the start and end logits (instead of the product, because of the rule $\log(ab) = \log(a) + \log(b)$).
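The claim that skipping the softmax does not change the selected pair can be checked on a small example. Within one feature, the product of the two softmax probabilities equals exp(start_logit + end_logit) divided by a constant, so ranking pairs by the sum of logits gives the same winner. The logits below are made-up numbers, and the softmax helper is written out by hand for the illustration.

```python
import math
from itertools import product


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token feature.
start_logits = [0.1, 2.3, -1.0, 0.7]
end_logits = [-0.5, 0.2, 3.1, 1.4]
start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Only pairs where the start does not come after the end are valid.
pairs = [(s, e) for s, e in product(range(4), range(4)) if s <= e]

best_by_prob = max(pairs, key=lambda p: start_probs[p[0]] * end_probs[p[1]])
best_by_logit = max(pairs, key=lambda p: start_logits[p[0]] + end_logits[p[1]])
print(best_by_prob, best_by_logit)
```

Both criteria select the same (start_token, end_token) pair here, which is why the sum of logits is enough when only the predicted answer is needed.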

To demonstrate all of this, we will need some kind of predictions. Since we have not trained our model yet, we are going to use the default model for the QA pipeline to generate some predictions on a small part of the validation set. We can use the same processing function as before; because it relies on the global constant tokenizer, we just have to change that object to the tokenizer of the model we want to use temporarily:

small_eval_set = raw_datasets["validation"].select(range(100))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)

Now that the preprocessing is done, we change the tokenizer back to the one we originally picked:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

We then remove the columns of our eval_set that are not expected by the model, build a batch with all of that small validation set, and pass it through the model. If a GPU is available, we use it to go faster:

import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(
    device
)

with torch.no_grad():
    outputs = trained_model(**batch)

Since the Trainer will give us predictions as NumPy arrays, we grab the start and end logits and convert them to that format:

start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

Now, we need to find the predicted answer for each example in our small_eval_set. One example may have been split into several features in eval_set, so the first step is to map each example in small_eval_set to the corresponding features in eval_set:

import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

With this in hand, we can really get to work by looping through all the examples and, for each example, through all the associated features. As we said before, we'll look at the logit scores for the n_best start logits and end logits, excluding positions that give:

An answer that wouldn't be inside the context
An answer with negative length
An answer that is too long (we limit the possibilities at max_answer_length=30)

Once we have all the scored possible answers for one example, we just pick the one with the best logit score:

import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
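
The slice used on np.argsort() above is compact, so here is a small sketch of what it selects. The array toy_logits and the value toy_n_best are invented for illustration; only np.argsort() and the slicing pattern come from the code above.

```python
import numpy as np

toy_logits = np.array([0.1, 2.5, -1.0, 3.2, 0.7])
toy_n_best = 3

# argsort returns indices in ascending order of value; the reversed slice
# starts at the last (largest) entry and walks back toy_n_best positions.
top_indexes = np.argsort(toy_logits)[-1 : -toy_n_best - 1 : -1].tolist()

print(top_indexes)  # [3, 1, 4]
```

The result lists the positions of the three largest logits, best first, which is why the nested loops above visit the most promising start and end positions only.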

The final format of the predicted answers is the one expected by the metric we will use. As usual, we can load it with the help of the 🤗 Evaluate library:

import evaluate

metric = evaluate.load("squad")

This metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers in the format below (a list of dictionaries with one key for the ID of the example and one key for the possible answers):

theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in small_eval_set
]

We can now check that we get sensible results by looking at the first element of both lists:

print(predicted_answers[0])
print(theoretical_answers[0])

{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}

Not too bad! Now let's have a look at the score the metric gives us:

metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 83.0, 'f1': 88.25}

Again, that's rather good: according to its paper, DistilBERT fine-tuned on SQuAD obtains 79.1 (exact match) and 86.9 (F1) on the whole dataset, so we are in the right range.
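
To give a feel for what the two scores measure, here is a simplified sketch of a per-example exact match and token-level F1. The functions toy_exact_match and toy_f1 are hypothetical helpers written for this illustration, not the implementation behind evaluate.load("squad"): the real metric also normalizes answers (for instance by removing punctuation and articles) and takes the best score over all reference answers, so this sketch will not reproduce its numbers exactly.

```python
from collections import Counter


def toy_exact_match(prediction, reference):
    # 1.0 when the two strings are identical after lowercasing and trimming.
    return float(prediction.strip().lower() == reference.strip().lower())


def toy_f1(prediction, reference):
    # Token-level overlap between the prediction and a single reference.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(toy_exact_match("Denver Broncos", "Denver Broncos"))
print(round(toy_f1("the Denver Broncos", "Denver Broncos"), 2))
```

A prediction with one extra word is no longer an exact match, but it still earns partial credit under F1, which is why the F1 score is always at least as high as exact match.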

Now let's put everything we just did in a compute_metrics() function that we will use with the Trainer. Normally, that compute_metrics() function only receives a tuple eval_preds with logits and labels. Here we need a bit more, as we have to look in the dataset of features for the offsets and in the dataset of examples for the original contexts, so we won't be able to use this function to get regular evaluation results during training. We will only use it at the end of training to check the results.

The compute_metrics() function groups the same steps as before; we just add a small check in case we don't come up with any valid answers (in which case we predict an empty string).

from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

We can check it works on our predictions:

compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

{'exact_match': 83.0, 'f1': 88.25}

Looking good! Now let's use this to fine-tune our model.

Fine-tuning the model

We are now ready to train our model. Let's create it first, using the AutoModelForQuestionAnswering class like before:

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

As usual, we get a warning that some weights are not used (the ones from the pretraining head) and some others are randomly initialized (the ones for the question answering head). You should be used to this by now, but it means that this model is not ready to be used just yet and needs fine-tuning, which is exactly what we're about to do.

To be able to push our model to the Hub, we'll need to log in to Hugging Face. If you're running this code in a notebook, you can do so with the following utility function, which displays a widget where you can enter your login credentials:

from huggingface_hub import notebook_login

notebook_login()

If you aren't working in a notebook, just type the following line in your terminal:

huggingface-cli login

Once this is done, we can define our TrainingArguments. As we said when we defined our function to compute the metric, we won't be able to have a regular evaluation loop because of the signature of the compute_metrics() function. We could write our own subclass of Trainer to do this (an approach you can find in the question answering example script), but that's a bit too long for this section. Instead, we will only evaluate the model at the end of training here and show you how to do a regular evaluation in "A custom training loop" below.

This is really where the Trainer API shows its limits and the 🤗 Accelerate library shines: customizing the class to a specific use case can be painful, but tweaking a fully exposed training loop is easy.

Let's take a look at our TrainingArguments:

from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)

We've seen most of these before: we set some hyperparameters (the learning rate, the number of training epochs, and some weight decay) and indicate that we want to save the model at the end of every epoch, skip evaluation, and upload our results to the Model Hub. We also enable mixed-precision training with fp16=True, as it can speed up the training nicely on a recent GPU.

By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be "sgugger/bert-finetuned-squad". We can override this by passing a hub_model_id; for instance, to push the model to the huggingface-course organization we used hub_model_id="huggingface-course/bert-finetuned-squad" (which is the model we linked to at the beginning of this section).

💡 If the output directory you are using already exists, it needs to be a local clone of the repository you want to push to (so set a new name if you get an error when defining your Trainer).

Finally, we just pass everything to the Trainer class and launch the training:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

Note that while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you will be able to resume your training on another machine if necessary. The whole training takes a while (a little over an hour on a Titan RTX), so you can grab a coffee or reread some of the parts of the course that you've found more challenging while it proceeds. As soon as the first epoch is finished, you will see some weights uploaded to the Hub and you can start playing with your model on its page.

Once the training is complete, we can finally evaluate our model (and pray we didn't spend all that compute time on nothing). The predict() method of the Trainer returns a tuple whose first element is the predictions of the model (here a pair with the start and end logits). We send this to our compute_metrics() function:

predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets["validation"])

{'exact_match': 81.18259224219489, 'f1': 88.67381321905516}

Great! As a comparison, the baseline scores reported in the BERT article for this model are 80.8 and 88.5, so we're right where we should be.

Finally, we use the push_to_hub() method to make sure we upload the latest version of the model:

trainer.push_to_hub(commit_message="Training complete")

This returns the URL of the commit it just made, in case you want to inspect it:

'https://huggingface.co/sgugger/bert-finetuned-squad/commit/9dcee1fbc25946a6ed4bb32efb1bd71d5fa90b68'

The Trainer also drafts a model card with all the evaluation results and uploads it.

At this stage, you can use the inference widget on the Model Hub to test the model and share it with your friends, family, and favorite pets. You have successfully fine-tuned a model on a question answering task. Congratulations!

✏️ Your turn! Try another model architecture to see if it performs better on this task.

If you want to dive a bit more deeply into the training loop, we will now show you how to do the same thing using 🤗 Accelerate.

A custom training loop

Let's now have a look at the full training loop, so you can easily customize the parts you need. It will look a lot like the training loop in Chapter 3, with the exception of the evaluation loop. We will be able to evaluate the model regularly since we're not constrained by the Trainer class anymore.

Preparing everything for training

First we need to build the DataLoaders from our datasets. We set the format of those datasets to "torch" and remove the columns in the validation set that are not used by the model. Then, we can use the default_data_collator provided by Transformers as a collate_fn, shuffling the training set but not the validation set:

from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

Next we reinstantiate our model, to make sure we're not continuing the fine-tuning from before but starting from the BERT pretrained model again:

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Then we will need an optimizer. As usual we use the classic AdamW, which is like Adam but with a fix in the way weight decay is applied:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Once we have all those objects, we can send them to the accelerator.prepare() method. Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and avoid executing any cell that instantiates an Accelerator. We can force mixed-precision training by passing mixed_precision="fp16" to the Accelerator (older releases of 🤗 Accelerate accepted fp16=True instead). If you are executing the code as a script, just make sure to fill in the 🤗 Accelerate config appropriately.

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

As you should know from the previous sections, we can only use the train_dataloader length to compute the number of training steps after it has gone through the accelerator.prepare() method. We use the same linear schedule as in the previous sections:

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
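
As a rough illustration of what a linear schedule does to the learning rate, here is a pure-Python sketch. The function toy_linear_lr and its default values are hypothetical and written only for this example; it mimics the shape of the schedule (optional linear warmup, then linear decay to zero) rather than calling get_scheduler() itself.

```python
def toy_linear_lr(step, base_lr=2e-5, num_warmup_steps=0, num_training_steps=10):
    # During warmup the rate grows linearly from 0 to base_lr.
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    # Afterwards it decays linearly and reaches 0 at the last training step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


toy_lrs = [toy_linear_lr(step) for step in range(11)]
print(toy_lrs[0], toy_lrs[5], toy_lrs[10])
```

With no warmup, the rate starts at its full value, is halved midway through training, and is zero at the end, which is why the total number of training steps has to be known before the scheduler is created.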

To push our model to the Hub, we will need to create a Repository object in a working folder. First log in to the Hugging Face Hub, if you're not logged in already. We'll determine the repository name from the model ID we want to give our model (feel free to replace repo_name with your own choice; it just needs to contain your username, which is what the get_full_repo_name() function takes care of):

from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-squad-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'sgugger/bert-finetuned-squad-accelerate'

Then we can clone that repository in a local folder. If the folder already exists, it should be a clone of the repository we are working with:

output_dir = "bert-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

We can now upload anything we save in output_dir by calling the repo.push_to_hub() method. This will help us upload the intermediate models at the end of each epoch.

Training loop

We are now ready to write the full training loop. After defining a progress bar to follow how training goes, the loop has three parts:

The training itself, which is the classic iteration over the train_dataloader, a forward pass through the model, then a backward pass and an optimizer step.
The evaluation, in which we gather all the values for start_logits and end_logits before converting them to NumPy arrays. Once the evaluation loop is finished, we concatenate all the results. Note that we need to truncate because the Accelerator may have added a few samples at the end to ensure we have the same number of examples in each process.
Saving and uploading, where we first save the model and the tokenizer, then call repo.push_to_hub(). As we did before, we use the argument blocking=False to tell the 🤗 Hub library to push in an asynchronous process. This way, training continues normally and this (long) instruction is executed in the background.

Here's the complete code for the training loop:

from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]

    metrics = compute_metrics(
        start_logits, end_logits, validation_dataset, raw_datasets["validation"]
    )
    print(f"epoch {epoch}:", metrics)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
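
The truncation step in the evaluation part of the loop can look arbitrary, so here is a minimal sketch of why it is needed. The arrays below are invented, and the way the last batch is padded (by repeating one sample) is an assumption made for illustration only; the point is simply that the gathered results can be longer than the dataset.

```python
import numpy as np

toy_dataset_len = 5

# Hypothetical gathered batches of logits with shape (batch, 1). The last
# batch holds one real sample (4) plus one repeated sample (0) as padding.
toy_batches = [np.arange(0, 4).reshape(4, 1), np.array([[4], [0]])]

toy_logits = np.concatenate(toy_batches)
print(toy_logits.shape)  # (6, 1): one row more than the dataset has

# Keep only as many rows as there are features in the dataset.
toy_logits = toy_logits[:toy_dataset_len]
print(toy_logits.shape)  # (5, 1)
```

Without the truncation, the extra rows would not line up with any feature in validation_dataset when the logits are matched back to examples in compute_metrics().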

In case this is the first time you're seeing a model saved with 🤗 Accelerate, let's take a moment to inspect the three lines of code that go with it:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

The first line is self-explanatory: it tells all the processes to wait until everyone is at that stage before continuing. This is to make sure we have the same model in every process before saving. Then we grab the unwrapped_model, which is the base model we defined. The accelerator.prepare() method changes the model to work in distributed training, so it won't have the save_pretrained() method anymore; the accelerator.unwrap_model() method undoes that step. Lastly, we call save_pretrained() but tell that method to use accelerator.save() instead of torch.save().

Once this is done, you should have a model that produces results very similar to the one trained with the Trainer. You can check the model we trained using this code at huggingface-course/bert-finetuned-squad-accelerate. And if you want to test out any tweaks to the training loop, you can implement them directly by editing the code shown above!

Using the fine-tuned model

We've already shown you how to use the model we fine-tuned on the Model Hub with the inference widget. To use it locally in a pipeline, you just have to specify the model identifier:

from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/bert-finetuned-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

{'score': 0.9979003071784973,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}
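
The start and end values in this output are character offsets into the context, which can be checked with a plain string slice. The sketch below copies the context and the reported offsets from above; the names toy_context, toy_result, and recovered are hypothetical, and the long dash is written as the escape \u2014 so that it counts as a single character.

```python
# Same text as the context above, including its leading newline.
toy_context = """
🤗 Transformers is backed by the three most popular deep learning libraries \u2014 Jax, PyTorch and TensorFlow \u2014 with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

# Offsets and answer as reported by the pipeline output above.
toy_result = {"start": 78, "end": 105, "answer": "Jax, PyTorch and TensorFlow"}

recovered = toy_context[toy_result["start"] : toy_result["end"]]
print(recovered)  # Jax, PyTorch and TensorFlow
```

Slicing the context with the returned offsets gives back the answer string, which is the extractive behavior this section set out to build.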

Great! Our model is working as well as the default one for this pipeline!

Glossary

Question Answering (QA): Finding the answer to a question in a given text.
Extractive Question Answering: A type of QA in which the answer is extracted directly from the given document as a span of text.
Spans of Text: A contiguous run of words or characters within a text.
Document: A source containing textual information.
BERT Model: Bidirectional Encoder Representations from Transformers (BERT), a model created by Google. It is a pretrained language model based on the Transformer architecture.
Fine-tune: Further training a pretrained model on a comparatively small amount of data for a specific task.
SQuAD Dataset (Stanford Question Answering Dataset): A widely used dataset for question answering. It contains questions and answers about Wikipedia articles.
Crowdworkers: People who carry out small, repetitive tasks through online platforms.
Wikipedia Articles: Articles from the Wikipedia encyclopedia.
Predictions: The outputs a machine learning model produces from its input data.
Model Hub: Refers to the Hugging Face Hub, a central platform for finding, sharing, and using AI models.
Encoder-only Models: Models that use only the encoder part of the Transformer architecture (for example, BERT). They are good at understanding the input and producing contextual representations.
Factoid Questions: Fact-based questions with a direct answer.
Open-ended Questions: Questions without a single precise answer, which call for broader discussion or explanation.
Encoder-decoder Models: Models that use both the encoder and the decoder of the Transformer architecture (for example, T5 and BART). They are good at understanding an input and generating an output sequence.
Synthesize Information: Gathering and combining information from several sources.
Text Summarization: Producing a summary of a long text.
Generative Question Answering: A type of QA in which the model generates a new answer rather than extracting it directly from the given document.
ELI5 Dataset (Explain Like I'm 5): A dataset of questions and answers that call for long, explanatory responses.
Academic Benchmark: A standard dataset or task used in research projects to compare model performance.
SQuAD v2 Benchmark: A harder version of the SQuAD dataset that includes unanswerable questions.
Contexts: The passages of text that may contain the answer to a question.
Questions: The questions being asked.
Answers: The answers to the questions.
Column: A feature or attribute present in each record of a dataset.
load_dataset() Function: The function from the Hugging Face Datasets library used to download and cache datasets.
DatasetDict Object: An object that stores several dataset splits (such as the training, validation, and test sets) in dictionary form.
train Split: The part of the dataset used to train the model.
validation Split: The part of the dataset used to assess the model's performance during training.
features: A dictionary describing the types and other information of the dataset's columns.
num_rows: The number of rows (examples) in the dataset.
context Field: In the SQuAD dataset, the passage of text that may contain the answer to the question.
question Field: In the SQuAD dataset, the question being asked.
answers Field: In the SQuAD dataset, the answers to the question (containing the text and the start character index).
text Field (in answers): The text of the answer.
answer_start Field (in answers): The character index in the context at which the answer starts.
squad Metric: The metric that computes the evaluation scores for the SQuAD dataset (Exact Match and F1 score).
Dataset.filter() Method: A method of the 🤗 Datasets library used to keep only the rows of a dataset that satisfy a given condition.
Evaluation Script: Code used to assess a model's performance.
Preprocessing: The process of converting data into a form the model can understand and process.
Labels: The correct answers in the training data that the model has to predict.
Tokens: The smallest units into which text is split.
AutoTokenizer: A class of the Hugging Face Transformers library that automatically loads the right tokenizer from a model name.
model_checkpoint: The identifier of a pretrained model (for example, "bert-base-cased").
Fast Tokenizer: A tokenizer implemented in Rust, which is much faster than Python-based tokenizers.
is_fast Attribute: An attribute indicating whether a tokenizer object is a fast tokenizer.
🤗 Tokenizers: A Hugging Face library written in Rust that performs fast, efficient tokenization.
Special Tokens: Tokens with a special meaning for Transformer models when processing text (for example, [CLS] and [SEP]).
[CLS] Token: In BERT, the special token that marks the start of a sequence.
[SEP] Token: In BERT, the special token used to mark the end of a sentence or to separate two sentences.
inputs (from Tokenizer): The dictionary obtained after tokenization (input IDs, attention mask, token type IDs, and so on).
tokenizer.decode() Method: The tokenizer method that converts token IDs back into text.
input_ids: The numerical IDs the tokenizer produces, one for each token.
Logit: A raw, unnormalized score output by the last layer of a neural network.
Start Position Logit: The logit the model predicts for a token being the start of the answer.
End Position Logit: The logit the model predicts for a token being the end of the answer.
Max Length: The maximum allowed length of a tokenized sequence.
Sliding Window: Splitting long texts into smaller, overlapping chunks.
max_length (Argument): The tokenizer argument that sets the maximum sequence length.
truncation="only_second": Tells the tokenizer to truncate only the second sequence when there are two input sequences.
stride: The number of overlapping tokens between consecutive chunks in a sliding window.
return_overflowing_tokens=True: Tells the tokenizer to return the tokens left over after truncation.
return_offsets_mapping=True: Tells the tokenizer to return the offset mappings.
Offset Mappings: The start and end character indices, in the original text, of each token in the input sequence.
overflow_to_sample_mapping: When one original sample produces several features, the key that maps each feature back to its original sample.
sequence_ids() Method: The method of the tokenizer output (BatchEncoding) that returns which input sequence (question or context) each token belongs to.
BatchEncoding: The class of the tokenizer output.
DistilBERT: A smaller, faster version of the BERT model.
strip() Method: The Python string method that removes whitespace from the start and end of a string.
RoBERTa: A language model based on BERT with a modified pretraining procedure.
Dataset.map() Method: A method of the 🤗 Datasets library that applies a function to each element, or each batch, of a dataset.
batched=True: An argument of the map() method that applies the function to several elements of the dataset at once.
remove_columns: The columns to drop from the dataset returned by the map() method.
Validation Loss: The value of the loss function computed on the validation set.
Post-processing: The process of turning the model's outputs into their final usable form.
example_id: A unique ID for each example in the original dataset.
enumerate(): The Python function that yields both the index and the value while looping over an iterable.
AutoModelForQuestionAnswering: A class of the Hugging Face Transformers library that automatically loads the appropriate model for a question answering task.
TFAutoModelForQuestionAnswering: The TensorFlow equivalent of AutoModelForQuestionAnswering.
n_best: The number of top-scoring start positions and end positions considered when building candidate answers.
max_answer_length: The maximum length, in tokens, of a predicted answer.
np.argsort(): The NumPy function that returns the indices that would sort an array.
tolist() Method: The method that converts a NumPy array into a Python list.
logit_score: The sum of the start logit and the end logit of a candidate answer.
collections.defaultdict(list): A dictionary that uses an empty list as the default value for missing keys.
evaluate.load("squad"): Loads the SQuAD metric from the 🤗 Evaluate library.
exact_match: A metric measuring whether the predicted answer matches a ground-truth answer exactly.
f1: A metric computed as the harmonic mean of precision and recall.
Precision: The fraction of instances predicted as positive that are actually positive.
Recall: The fraction of actually positive instances that are correctly predicted as positive.
Ground Truth: The correct answers provided in the dataset.
Baseline Scores: သုတေသနစာတမ်းများတွင် ဖော်ပြထားသော မော်ဒယ်တစ်ခု၏ အခြေခံစွမ်းဆောင်ရည်။
compute_metrics() Function: model ၏ predictions များကို အကဲဖြတ်ပြီး metrics များကို တွက်ချက်ပေးသော function။
eval_preds: Trainer မှ compute_metrics() function သို့ ပေးပို့သော predictions နှင့် labels များပါဝင်သော tuple။
tqdm.auto: Progress bar ကို ဖန်တီးရန်အတွက် library။
Trainer Class: Hugging Face Transformers library မှ model များကို လေ့ကျင့်ရန်အတွက် မြင့်မားသောအဆင့် API။
Weights: Neural network model ၏ လေ့ကျင့်နိုင်သော parameters များ။
Pretraining Head: pretrained model ၏ နောက်ဆုံး layer များ။
Question Answering Head: Question Answering task အတွက် အထူးဒီဇိုင်းထုတ်ထားသော model ၏ နောက်ဆုံး layer များ။
huggingface_hub.notebook_login(): Jupyter/Colab Notebooks များတွင် Hugging Face Hub သို့ login ဝင်ရန် အသုံးပြုသော function။
huggingface-cli login: Command line interface မှတစ်ဆင့် Hugging Face Hub သို့ login ဝင်ရန် command။
TrainingArguments: Hugging Face Transformers library မှ training လုပ်ငန်းစဉ်အတွက် hyperparameters နှင့် အခြား arguments များကို သတ်မှတ်ရန် class။
evaluation_strategy="no": Specifies that no evaluation is run during training.
save_strategy="epoch": Specifies that the model is saved at the end of every epoch.
learning_rate: A parameter that controls how much the model's weights change at each update during training.
num_train_epochs: The number of times the model is trained over the entire training dataset.
weight_decay: A regularization technique applied to the model's weights to reduce overfitting.
fp16=True: Enables mixed-precision training (16-bit floating point).
Mixed-precision Training: Speeding up training by using two floating-point formats together (for example, FP16 and FP32).
push_to_hub=True: Specifies that the model is pushed to the Hugging Face Hub each time it is saved during training.
DefaultDataCollator: The default data collator class from the Transformers library, which gathers inputs into batches.
return_tensors="tf": Specifies that DefaultDataCollator returns its outputs as TensorFlow tensors.
model.prepare_tf_dataset(): A method that prepares a Hugging Face dataset as a tf.data.Dataset for a TensorFlow model.
collate_fn: The function that assembles the samples of a batch in a DataLoader or in prepare_tf_dataset.
shuffle=True: Specifies that the dataset is shuffled.
batch_size: The number of input samples passed to the model at each training step.
create_optimizer: A function in the Transformers library that creates an optimizer and a learning rate schedule.
init_lr: The initial learning rate.
num_warmup_steps: The number of steps over which the learning rate is gradually increased.
num_train_steps: The total number of training steps.
weight_decay_rate: The strength of the weight decay.
model.compile(): A method that configures a Keras model for training (it sets the optimizer, loss function, and metrics).
tf.keras.mixed_precision.set_global_policy("mixed_float16"): Sets the global mixed-precision policy in TensorFlow to mixed_float16.
model.fit(): A method that trains a Keras model.
PushToHubCallback: A Keras callback that uploads the model to the Hugging Face Hub during training.
output_dir: The directory where model files are saved.
hub_model_id: The ID of the model repository on the Hugging Face Hub (for example, "sgugger/bert-finetuned-squad").
Trainer.train(): The method that trains the model using the Trainer class.
Namespace: A username or organization name on the Hugging Face Hub.
Local Clone: A copy of a Git repository on your local machine.
Resume Training: Restarting training from the point where it stopped.
Compute Time: The processor time required for training or inference.
Titan RTX: A high-performance GPU made by NVIDIA.
Epoch: One full pass of the model over the entire training dataset.
Trainer.predict(): The method that obtains the model's predictions using the Trainer class.
trainer.push_to_hub(): The Trainer method that pushes the model to the Hugging Face Hub.
Commit Message: The text that describes a Git commit.
Inference Widget: An interactive tool on the Hugging Face Hub for trying out a model directly.
DataLoader: A PyTorch utility class that loads data from a dataset in batches.
default_data_collator: The default data collator function from the Hugging Face Transformers library, which gathers inputs into batches.
collate_fn: The function that assembles the samples of a batch in a DataLoader.
shuffle=True: Specifies that the training dataset is shuffled.
AdamW: The AdamW optimizer as used in PyTorch. It updates the model's parameters during training.
model.parameters(): A method that returns an iterator over the model's parameters (weights and biases).
lr: The learning rate.
accelerator.prepare() Method: A method in the 🤗 Accelerate library that prepares the model, optimizer, and dataloaders for distributed training.
TPUs (Tensor Processing Units): A type of processor designed by Google specifically for AI/ML workloads.
Accelerator: The main class of the 🤗 Accelerate library.
get_scheduler: A function in the Transformers library that creates a learning rate scheduler.
linear (Scheduler Type): A scheduler type that decreases the learning rate linearly over the training steps.
Repository Object: A class in the huggingface_hub library for working with Git repositories.
get_full_repo_name() Function: A function that returns the full name (username/repo_name) of a repository on the Hugging Face Hub.
clone_from: The Repository argument that specifies the remote repository to clone from.
blocking=False: Specifies that push_to_hub() runs the push asynchronously (in the background).
Asynchronous Process: A process that runs in the background while other work continues.
model.train(): Switches the model to training mode.
model.eval(): Switches the model to evaluation mode.
accelerator.backward(loss): The 🤗 Accelerate method that performs backpropagation from the loss.
optimizer.step(): The optimizer method that updates the model's parameters using the computed gradients.
lr_scheduler.step(): The method that advances the learning rate scheduler by one step.
optimizer.zero_grad(): Resets the gradients of the parameters managed by the optimizer.
torch.no_grad(): A PyTorch context manager that disables gradient computation.
accelerator.gather(): The 🤗 Accelerate method that gathers tensors from all processes in distributed training.
np.concatenate(): A function that joins NumPy arrays.
accelerator.wait_for_everyone(): The 🤗 Accelerate method that makes processes in distributed training wait until all of them have reached the same point.
accelerator.unwrap_model(model): The 🤗 Accelerate method that retrieves the base model from a model prepared for distributed training.
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save): Saves the model to the output directory (using the save function from 🤗 Accelerate).
accelerator.is_main_process: An attribute that indicates whether the current process is the main process.
tokenizer.save_pretrained(output_dir): A method that saves the tokenizer to the output directory.
pipeline("question-answering", model=model_checkpoint): Creates a question answering pipeline from the Hugging Face Transformers library for the given model checkpoint.
JAX: A high-performance numerical computation library created by Google.
PyTorch: An open-source machine learning library created by Facebook (now Meta) and used to build deep learning models.
TensorFlow: An open-source machine learning library created by Google and used to build deep learning models.
Seamless Integration: The ability of tools or systems to connect and work together smoothly.
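
The exact_match, f1, Precision, and Recall entries above can be made concrete with a minimal sketch. It assumes a simplified, SQuAD-style scoring over whitespace-separated tokens; the function names `exact_match` and `token_f1` are hypothetical and exist only for illustration. The metric loaded with evaluate.load("squad") additionally normalizes punctuation and articles and takes the best score over several reference answers, which this sketch leaves out.

```python
from collections import Counter


def exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if both answers have the same lowercased tokens, else 0.0."""
    return float(prediction.lower().split() == ground_truth.lower().split())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall (simplified)."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Tokens shared by both answers, counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(truth_tokens)  # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("the Eiffel Tower", "Eiffel Tower"))
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))
```

With the prediction "the Eiffel Tower" and the reference "Eiffel Tower", precision is 2/3 and recall is 1, so the answer is not an exact match but still earns partial credit through F1.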
