This guide is for anyone who wants to customize powerful language models such as Llama 2 and Mistral for their own projects. Even if you don't have a supercomputer at your disposal, we will walk step by step through fine-tuning these large language models (LLMs) using QLoRA.
Key point: a good model needs good data. We will cover training on existing data as well as how to create your own dataset. You will learn how to format the data for training, in particular the ChatML format. The code is kept simple, avoiding additional black boxes or training tools, using only basic PyTorch and Hugging Face packages.
What you will learn here:
- how to load existing datasets and how to create your own (from podcast transcripts or a book, for example)
- how to format the data in ChatML
- how to load and prepare the model and tokenizer
- how to tokenize and batch the data, and how to train with QLoRA
- how to merge the LoRA adapters back into the base model, and how to troubleshoot common issues
Let's get started!
Prerequisites
Before we begin, you need the latest Hugging Face tools. Run the following command in a terminal to install or update these packages:
pip install -U accelerate bitsandbytes datasets peft transformers
For reference, these are the specific versions used to put this tutorial together:
accelerate 0.24.1
bitsandbytes 0.41.1
datasets 2.14.6
peft 0.6.0
transformers 4.35.0
torch 2.1.0
01
Load an existing dataset or create your own
This section covers the key process of loading or crafting a dataset and then formatting it according to the ChatML structure. In the next section, we will dive into tokenization and batching.
Keep in mind that the quality of your dataset is essential: it will significantly influence your model's performance. It is crucial that your dataset fits your task.
Overall strategy
Datasets can be mixed from different sources. Take Open Hermes 2, a Mistral fine-tune, for example: it was trained on roughly 900,000 samples drawn from multiple datasets.
These datasets typically consist of question-answer pairs, formatted either as standalone pairs (a single sample corresponds to a single question and answer) or chained together into a conversational sequence (Q/A, Q/A, Q/A).
This section shows how to convert these datasets into a unified format compatible with the training scheme. To prepare for training, you have to settle on one format. I chose OpenAI's ChatML here because it is frequently adopted in recent model releases and may well become the new standard.
Here is an example of a ChatML-formatted conversation (from the Open Orca dataset):
<|im_start|>system
You are an AI assistant. User will you give you a task. Your goal is to
complete the task as faithfully as you can. While performing the task
think step-by-step and justify your steps.<|im_end|>
<|im_start|>user
Premise: A man is inline skating in front of a wooden bench. Hypothesis:
A man is having fun skating in front of a bench. .Choose the correct
answer: Given the premise, can we conclude the hypothesis?
Select from: a). yes b). it is not possible to tell c). no<|im_end|>
<|im_start|>assistant
b). it is not possible to tell Justification: Although the man is inline
skating in front of the wooden bench, we cannot conclude whether he is
having fun or not, as his emotions are not explicitly mentioned.<|im_end|>
The example above can be tokenized, batched, and fed into the training algorithm. Before we proceed, however, let's look at a few well-known datasets and how to prepare and format them.
How to load the Open Assistant data
Let's start with the Open Assistant dataset.
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")
Once loaded, the dataset comes pre-split into a training part (around 13k entries) and a test part (around 700 entries).
>>> dataset
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 12947
})
test: Dataset({
features: ['text'],
num_rows: 690
})
})
Let's look at the first entry:
>>> print(dataset["train"][0]["text"])
<|im_start|>user
Consigliami 5 nomi per il mio cucciolo di dobberman<|im_end|>
<|im_start|>assistant
Ecco 5 nomi per il tuo cucciolo di dobermann:
- Zeus
- Apollo
- Thor
- Athena
- Odin<|im_end|>
How convenient! This is already ChatML, so there is nothing left to do, except tell the tokenizer and model that the strings <|im_start|> and <|im_end|> are tokens that must not be split, and that <|im_end|> is a special token (eos, "end of sequence") marking the end of the model's answer; otherwise the model would generate forever and never stop. How to integrate these tokens with base models such as Llama 2 and Mistral is covered in detail below, when we load and prepare the model and tokenizer.
How to load the Open Orca data
Moving on to Open Orca: this dataset comprises 4.2 million entries and needs a train/test split after loading, which can be done with train_test_split.
from datasets import load_dataset
dataset = load_dataset("Open-Orca/OpenOrca")
dataset = dataset["train"].train_test_split(test_size=0.1)
Let's inspect the structure of the dataset. Here is the first entry:
{
'id': 'flan.2020759',
'system_prompt': 'You are an AI assistant. You will be given a task.
You must generate a detailed and long answer.',
'question': 'Ülke, bildirgeyi uygulamaya başlayan son ülkeler
arasında olmasına rağmen 46 ülke arasında 24. sırayı
aldı.
Could you please translate this to English?',
'response': 'Despite being one of the last countries to
implement the declaration, it ranked 24th out of 46 countries.'
}
It is a question-answer pair plus a system message describing the context in which the question has to be answered.
In contrast to the Open Assistant dataset, we have to format the Open Orca data into ChatML ourselves.
def format_conversation(row):
template="""<|im_start|>system
{sys}<|im_end|>
<|im_start|>user
{q}<|im_end|>
<|im_start|>assistant
{a}<|im_end|>"""
conversation=template.format(
sys=row["system_prompt"],
q=row["question"],
a=row["response"],
)
return {"text": conversation}
import os
dataset = dataset.map(
format_conversation,
remove_columns=dataset["train"].column_names, # remove all columns; only "text" will be left
num_proc=os.cpu_count() # multithreaded
)
The dataset is now ready to be tokenized and fed into the training pipeline.
Create a dataset from podcast transcripts
I previously trained llama1 on transcripts of the Lex Fridman podcast. The task was to turn a podcast known for its in-depth discussions into a training set that lets an AI imitate Lex's way of speaking. You can find the details of how the dataset was created in From Transcripts to AI Chat: An Experiment with the Lex Fridman Podcast.
from datasets import load_dataset
dataset = load_dataset("g-ronimo/lfpodcast")
dataset = dataset["train"].train_test_split(test_size=0.1)
Inspecting the first entry in the training set, you will see a JSON object like this:
>>> print(json.dumps(dataset["train"][0],indent=2))
{
"title": "Lex_Fridman_Podcast_-_114__Russ_Tedrake_Underactuated_Robotics_Control_Dynamics_and_Touch",
"episode": 114,
"speaker_ratio_lex-vs-guest": 0.44402311303719755,
"conversation": [
{
"from": "Guest",
"text": "I think the most beautiful motion of a robot has to be the
passive dynamic walkers. I think there's just something fundamentally
beautiful. (..) but what Steve and Andy did was they took it to
this beautiful conclusion. where they built something that had knees,
arms, a torso, the arms swung naturally, give it a little push,
and that looked like a stroll through the park."
},
{
"from": "Lex",
"text": "How do you design something like that? Is that art or science?"
},
(...)
This structure captures the essence of each podcast episode, but to prepare it for model training the conversation needs to be converted into ChatML format. We iterate through every message turn, apply the ChatML format, and then concatenate the messages so that the entire episode transcript ends up in a single text field. The roles Guest and Lex are reassigned to user and assistant respectively, to condition the language model to adopt Lex's curious and knowledgeable persona.
def format_conversation(row):
# Template for conversation turns in ChatML format
template="""<|im_start|>user
{q}<|im_end|>
<|im_start|>assistant
{a}<|im_end|>"""
turns=row["conversation"]
# If Lex is the first speaker, skip his turn to start with Guest's question
if turns[0]["from"]=="Lex":
turns=turns[1:]
conversation=[]
for i in range(0, len(turns), 2):
# Assuming the conversation always alternates between Guest and Lex
question=turns[i] # Guest
answer=turns[i+1] # Lex
conversation.append(
template.format(
q=question["text"],
a=answer["text"],
))
return {"text": "\n".join(conversation)}
import os
dataset = dataset.map(
format_conversation,
remove_columns=dataset["train"].column_names,
num_proc=os.cpu_count()
)
With these changes applied, the resulting dataset is ready to be tokenized and fed into the training pipeline, teaching the language model to converse in a way reminiscent of Lex Fridman's podcast discussions.
Create a dataset from a book
To dig deeper into the nuances of dataset creation, let's consider a case where we want to train an AI to reflect the voice and personality of a well-known figure. Say we choose to turn the autobiography of the famous American chef Anthony Bourdain into a dataset. He wrote Kitchen Confidential, which vividly describes the madness in the kitchen and the mind of a chef.
The process involves turning the narrative of Bourdain's book into an engaging dialogue, like a back-and-forth interview that captures his spirit.
The required steps: extract the text from the PDF, clean up line breaks, split the text into interview-sized passages, have an LLM formulate an interview question for each passage, and finally convert the resulting question/answer pairs to ChatML.
First, we assume you legally obtained a digital copy of the book, kc.pdf
mv anthony-bourdain-kitchen-confidential.pdf kc.pdf
pdftotext -nopgbrk kc.pdf
# fix line breaks within sentence
sed -r ':a /[a-zA-Z,\ ]$/N;s/(.)\n/\1 /;ta' kc.txt > kc_reformat.txt
Now take each paragraph n, together with paragraph n-1 as context, and feed it to any capable open-source LLM or to GPT-3.5/4. I used Open Hermes 2 to create an interview question for each paragraph.
# Gather paragraphs to target
with open("kc_reformat.txt") as f:
file_content = f.read()
chapters=file_content.split("\n\n")
# Define minimum and maximum lengths to ensure a good interview flow
passage_minlen=300 # if paragraph <300 chars -> merge with next
passage_maxlen=2000 # if paragraph >2k chars -> split
# Process the chapters into suitable interview passages
passages=[]
for chap in chapters:
passage=""
for par in chap.split("\n"):
if(len(passage)<passage_minlen) or not passage[-1]=="." and len(passage)<passage_maxlen:
passage ="\n" par
else:
passages.append(passage.strip().replace("\n", " "))
passage=par
# Ask Open Hermes
prompt_template="""<|im_start|>system
You are an expert interviewer who interviews an autobiography of a famous chef.
You formulate questions based on quotes from the autobiography. Below is one
such quote. Formulate a question that the quote would be the perfect answer to.
The question should be short and directed at the author of the autobiography
like in an interview. The question is short. Remember, make the question as
short as possible. Do not give away the answer in your question.
Also: If possible, ask for motivations, feelings, and perceptions rather than
events or facts.
Here is some context that might help you formulate the question regarding the quote:
{ctx}
<|im_end|>
<|im_start|>user
Quote:
{par}<|im_end|>
<|im_start|>assistant
Question:"""
prompts=[]
for i,p in enumerate(passages):
prompt=prompt_template.format(par=passages[i], ctx=passages[i-1])
prompts.append(prompt)
# Prompt smart LLM, parse results, store Q/A in .json
...
You can find the complete code that worked for me here.
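That last step is only sketched above. For illustration, here is a minimal, hedged version of what it could look like using transformers directly; the OpenHermes checkpoint name and the decoding settings are assumptions, not taken from the original code:
# Hypothetical sketch of the elided step: query a local LLM for each prompt
# and store question/answer pairs as JSON. Model repo and decoding settings
# are assumptions.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_model_path = "teknium/OpenHermes-2-Mistral-7B"  # assumed checkpoint
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_path)
gen_model = AutoModelForCausalLM.from_pretrained(
    gen_model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

interview = []
for passage, prompt in zip(passages, prompts):
    inputs = gen_tokenizer(prompt, return_tensors="pt").to(gen_model.device)
    out = gen_model.generate(**inputs, max_new_tokens=100, do_sample=False)
    # keep only the newly generated tokens and stop at the ChatML end tag
    question = gen_tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
    question = question.split("<|im_end|>")[0].strip()
    interview.append({"question": question, "answer": passage})

with open("kc_reformat_interview.json", "w") as f:
    json.dump({"interview": interview}, f, indent=2)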
The resulting .json file looks like this:
{
"question": "Why you choose to share your experiences and insights from
your career in the restaurant industry despite the angry or wanting
to horrify the dining public?",
"answer": "I'm not spilling my guts about everything I've seen, learned
and done in my long and checkered career as dishwasher, prep drone,
fry cook, grillardin, saucier, sous-chef and chef because I'm angry
at the business, or because I want to horrify the dining public. I'd
still like to be a chef, too, when this thing comes out, as this life
is the only life I really know. If I need a favor at four o'clock in
the morning, whether it's a quick loan, a shoulder to cry on, a sleeping
pill, bail money, or just someone to pick me up in a car in a bad
neighborhood in the driving rain, I'm definitely not calling up a fellow
writer. I'm calling my sous-chef, or a former sous-chef, or my saucier,
someone I work with or have worked with over the last twenty-plus years."
},
{
"question": "Why do you feel more comfortable sharing the \"dark recesses\"
of the restaurant underbelly instead of writing about your personal
experiences outside of the culinary world?",
"answer": "No, I want to tell you about the dark recesses of the restaurant
underbelly-a subculture whose centuries-old militaristic hierarchy and
ethos of 'rum, buggery and the lash' make for a mix of unwavering order
and nerve-shattering chaos-because I find it all quite comfortable, like
a nice warm bath. I can move around easily in this life. I speak the
language. In the small, incestuous community of chefs and cooks in New
York City, I know the people, and in my kitchen, I know how to behave
(as opposed to in real life, where I'm on shakier ground). I want the
professionals who read this to enjoy it for what it is: a straight look
at a life many of us have lived and breathed for most of our days and
nights to the exclusion of 'normal' social interaction. Never having had
a Friday or Saturday night off, always working holidays, being busiest
when the rest of the world is just getting out of work, makes for a
sometimes peculiar world-view, which I hope my fellow chefs and cooks
will recognize. The restaurant lifers who read this may or may not like
what I'm doing. But they'll know I'm not lying."
}
Finally, we convert the dataset into ChatML format once again:
from datasets import load_dataset

interview_fn="kc_reformat_interview.json"
dataset = load_dataset('json', data_files=interview_fn, field='interview')
dataset=dataset["train"].train_test_split(test_size=0.1)
# chatML template, from https://huggingface.co/docs/transformers/main/chat_templating
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
def format_interview(conv):
messages = [
{"role": "user", "content": conv["question"]},
{"role": "assistant", "content": conv["answer"]}
]
chat=tokenizer.apply_chat_template(messages, tokenize=False).strip()
return {"text": chat}
dataset = dataset.map(
format_interview,
remove_columns=dataset["train"].column_names
)
By transforming Bourdain's autobiography this way, we aim to generate an AI that echoes his narrative style and his views on the culinary industry, and embodies his life philosophy. The approach shown is very basic and would benefit from further refinement, such as removing low-content answers and stripping non-essential text elements like footnotes and page numbers. This would improve the quality of the model.
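As an illustration, a simple filtering pass along those lines could look like the sketch below; the length threshold and the page-number pattern are arbitrary choices, not from the original post, and such a pass would be applied to the raw question/answer pairs before the ChatML formatting step:
# Hedged refinement sketch (not from the original article): drop lines that
# are only page numbers, collapse whitespace, and keep only Q/A pairs with a
# reasonably substantial answer. The 200-character threshold is arbitrary.
import re

def refine(row):
    answer = re.sub(r"(?m)^\s*\d+\s*$", "", row["answer"])  # bare page numbers
    answer = re.sub(r"\s+", " ", answer).strip()            # collapse whitespace
    return {"question": row["question"], "answer": answer}

dataset = dataset.map(refine).filter(lambda row: len(row["answer"]) > 200)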
If you are curious, go talk to Mistral Bourdain. While the current output is a rudimentary imitation of Bourdain's voice, it serves as a proof of concept; better dataset curation would undoubtedly produce a more convincing simulation.
Create your own dataset
You get the idea by now. Beyond the examples above, GPT-4 can suggest plenty of other creative ideas for dataset creation.
02
Load and prepare the model and tokenizer
Before we start processing the data we just prepared, we need to load the model and tokenizer and make sure they handle the ChatML tags <|im_start|> and <|im_end|> correctly and recognize <|im_end|> as the (new) eos token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
modelpath="models/Mistral-7B-v0.1"
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
modelpath,
device_map="auto",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
),
torch_dtype=torch.bfloat16,
)
# Load (slow) Tokenizer, fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)
# Add tokens <|im_start|> and <|im_end|>, latter is special eos token
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
model.resize_token_embeddings(len(tokenizer))
model.config.eos_token_id = tokenizer.eos_token_id
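As a quick, optional sanity check (not part of the original walkthrough), you can verify that the ChatML tags now map to single token ids and that <|im_end|> is registered as the eos token:
# Optional sanity check (not in the original article): each ChatML tag should
# now map to exactly one token id, and <|im_end|> should be the eos token.
for tag in ["<|im_start|>", "<|im_end|>"]:
    ids = tokenizer(tag, add_special_tokens=False)["input_ids"]
    print(tag, "->", ids)
    assert len(ids) == 1, f"{tag} was split into multiple tokens"
print("eos token:", tokenizer.eos_token, tokenizer.eos_token_id)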
Since we are not training all parameters but only a subset, we have to add LoRA adapters to the model using Hugging Face peft. Make sure to use peft >= 0.6; otherwise 1) get_peft_model will be extremely slow and 2) training with Mistral will fail.
# Add LoRA adapters to model
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=64,
lora_alpha=16,
target_modules = ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
lora_dropout=0.1,
bias="none",
modules_to_save = ["lm_head", "embed_tokens"], # needed because we added new tokens to tokenizer/model
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.config.use_cache = False
03
Prepare the training data
Correct tokenization and batching are essential to ensure the data is processed properly.
Tokenization
Tokenize the text field of the dataset, without adding special tokens or padding, since we will do that manually.
def tokenize(element):
return tokenizer(
element["text"],
truncation=True,
max_length=2048,
add_special_tokens=False,
)
dataset_tokenized = dataset.map(
tokenize,
batched=True,
num_proc=os.cpu_count(), # multithreaded
remove_columns=["text"] # don't need the strings anymore, we have tokens from here on
)
max_length: specifies the maximum length of a sample (in number of tokens). Everything beyond 2048 tokens is truncated and not trained on. If your dataset only contains short question/answer pairs per sample (like Open Orca), this will be more than enough; if your samples are longer (like the podcast transcripts), you would ideally either increase max_length (which consumes more VRAM) or split your samples into several smaller ones. The maximum for llama2 is 4096. Mistral was "trained with 8k context length and fixed cache size, with a theoretical attention span of 128k tokens", but I never went beyond 4096.
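If you prefer splitting to truncation, one possible (assumed, not from the original article) approach is to re-tokenize without truncation and then chunk each tokenized sample into pieces of at most max_length tokens; note that naive chunking can cut a conversation mid-turn:
# Hedged sketch (not from the original article): split over-long samples into
# chunks of at most max_length tokens instead of truncating them. This assumes
# tokenization was done without truncation, otherwise there is nothing to split.
max_length = 2048

def chunk(element):
    input_ids, attention_mask = [], []
    for ids, mask in zip(element["input_ids"], element["attention_mask"]):
        for i in range(0, len(ids), max_length):
            input_ids.append(ids[i : i + max_length])
            attention_mask.append(mask[i : i + max_length])
    return {"input_ids": input_ids, "attention_mask": attention_mask}

dataset_chunked = dataset_tokenized.map(
    chunk,
    batched=True,  # a batched map may return more rows than it received
    remove_columns=dataset_tokenized["train"].column_names,
)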
Batching
The Hugging Face Trainer requires a collator function that transforms a list of samples into a dictionary containing a padded batch.
For this, we will adopt a simplified version of DataCollatorForCausalLM from the QLoRA repository.
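The collator code itself is not reproduced in this excerpt; a minimal sketch in that spirit (pad input_ids with the pad token, pad the labels with -100 so padded positions do not contribute to the loss) could look like this:
# Minimal sketch of a causal-LM collator in the spirit of the QLoRA repo's
# DataCollatorForCausalLM: pad input_ids with the pad token, pad labels with
# -100, and mask out padding in the attention mask.
import torch

def collate(elements):
    tokens = [e["input_ids"] for e in elements]
    tokens_maxlen = max(len(t) for t in tokens)  # longest sample in the batch

    input_ids, labels, attention_masks = [], [], []
    for t in tokens:
        pad_len = tokens_maxlen - len(t)
        input_ids.append(t + [tokenizer.pad_token_id] * pad_len)  # pad with pad token
        labels.append(t + [-100] * pad_len)                       # pad labels with -100
        attention_masks.append([1] * len(t) + [0] * pad_len)      # mask out padding

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_masks),
    }
The value -100 is what PyTorch's cross-entropy loss ignores by default, which is why the padded label positions are masked this way.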
04
Training hyperparameters
The choice of hyperparameters can significantly affect model performance. Here are the hyperparameters we chose for training:
bs=8 # batch size
ga_steps=1 # gradient acc. steps
epochs=5
steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)
args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=bs,
per_device_eval_batch_size=bs,
evaluation_strategy="steps",
logging_steps=1,
eval_steps=steps_per_epoch, # eval and save once per epoch
save_steps=steps_per_epoch,
gradient_accumulation_steps=ga_steps,
num_train_epochs=epochs,
lr_scheduler_type="constant",
optim="paged_adamw_32bit",
learning_rate=0.0002,
group_by_length=True,
fp16=True,
ddp_find_unused_parameters=False, # needed for training with accelerate
)
Let's train.
trainer = Trainer(
model=model,
tokenizer=tokenizer,
data_collator=collate,
train_dataset=dataset_tokenized["train"],
eval_dataset=dataset_tokenized["test"],
args=args,
)
trainer.train()
05
Example training run
Training and eval loss
Below are the wandb charts of a typical training run on the Open Assistant (OA) dataset, comparing fine-tunes of llama2-7b and Mistral-7b.
Training time and VRAM usage
Fine-tuning Llama2-7B or Mistral-7B on the Open Assistant dataset takes roughly 100 minutes per epoch on a single GPU with 24GB of VRAM.
Merge the LoRA adapter with the base model
The following code differs a bit from other scripts (such as the ones provided by TheBloke), because we added the ChatML tokens before training. We did not change the base model itself, though, which is why we have to add the new tokens to the base model and tokenizer before loading the adapter; otherwise we would be trying to merge an adapter that has two additional tokens into a model without those tokens (and that would fail).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_path="models/Mistral-7B-v0.1" # input: base model
adapter_path="out/checkpoint-606" # input: adapters
save_to="models/Mistral-7B-finetuned" # out: merged model ready for inference
base_model = AutoModelForCausalLM.from_pretrained(
base_path,
return_dict=True,
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_path)
# Add/set tokens (same 5 lines of code we used before training)
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.eos_token_id = tokenizer.eos_token_id
# Load LoRA adapter and merge
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.save_pretrained(save_to, safe_serialization=True, max_shard_size='4GB')
tokenizer.save_pretrained(save_to)
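To sanity-check the merged model, a quick generation test could look like the following; the prompt text and decoding parameters are placeholders, not taken from the original article:
# Quick smoke test of the merged model: load it and generate an answer to a
# ChatML-formatted prompt (prompt and decoding settings are illustrative only).
model = AutoModelForCausalLM.from_pretrained(
    save_to, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(save_to)

prompt = "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|> ends the answer
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))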
Troubleshooting
Challenges are an integral part of model training. Let's discuss a few common problems and their solutions.
OOM
If you run into out-of-memory (OOM) errors: reduce the batch size, lower max_length during tokenization, or increase gradient_accumulation_steps to keep the effective batch size while using less VRAM.
Training is too slow
If training seems slow: check that the model is actually running on the GPU, and consider increasing the batch size if VRAM allows.
The final model's quality is poor
The quality of the model reflects the quality of the dataset. To improve model quality: use cleaner, more task-appropriate data, remove low-content samples, and make sure the formatting (ChatML tags, eos token) is consistent.
06
Summary
Fine-tuning LLMs like Llama 2 and Mistral is a rewarding process, especially when you have the right dataset and training parameters. Remember to always keep an eye on your model's performance and be prepared to iterate and adapt.