利用微调神器 QLORA，训练重量级开源模型Llama2和Mistral（疯狂的厨子2）

本指南适用于任何想要为自己的项目定制强大的语言模型（如Llama 2和Mistral）的人。即使您没有可用的超级计算机，我们也将使用 QLoRA 逐步完成微调这些大型语言模型 (LLM) 的步骤。

关键点：好的模型需要好的数据。我们将介绍现有数据的培训以及如何创建自己的数据集。您将了解如何格式化数据以进行训练，特别是 ChatML 格式。代码保持简单，避免额外的黑盒或训练工具，仅使用基本的 PyTorch 和 Hugging Face 包。

您将在这里学到什么：

如何查找和准备数据集
将数据集转换为 ChatML 格式进行训练
加载基本模型的量化版本并插入 LoRA 适配器
选择正确的训练设置

让我们开始吧！

先决条件

在我们开始之前，您需要 Hugging Face 的最新工具。在终端中运行以下命令来安装或更新这些软件包：

pip install -U accelerate bitsandbytes datasets peft transformers

作为参考，这些是用于组合本教程的特定版本：

accelerate 0.24.1 bitsandbytes 0.41.1 Datasets 2.14.6 peft 0.6.0 transformers 4.35.0 torch 2.1.0

现有或创建自己的数据集

本节专门介绍加载或制作数据集以及随后根据 ChatML 结构对其进行格式化的关键过程。接下来，我们将在下一节中深入研究标记化和批处理领域。

请记住，数据集的质量至关重要——它将显著影响模型的性能。您的数据集必须适合您的任务，这一点至关重要。

总体策略

数据集可以混合来自不同来源。以Mistral 的 Open Hermes 2 微调为例，它使用来自多个数据集的约 900,000 个样本进行训练。

这些数据集通常包含问题-答案对，格式为独立对（单个样本相当于单个问题和答案）或以对话序列连接（格式为 Q/A、Q/A、Q/A）。

本节旨在指导您将这些数据集转换为与训练方案兼容的统一格式。为了准备培训，必须选择一种形式。我在这里选择了 OpenAI 的ChatML，因为它在最近的模型版本中被频繁采用，并且可能成为新标准。

以下是 ChatML 格式对话的示例（来自 Open Orca 数据集）：

<|im_start|>system You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.<|im_end|> <|im_start|>user Premise: A man is inline skating in front of a wooden bench. Hypothesis: A man is having fun skating in front of a bench. .Choose the correct answer: Given the premise, can we conclude the hypothesis? Select from: a). yes b). it is not possible to tell c). no<|im_end|> <|im_start|>assistant b). it is not possible to tell Justification: Although the man is inline skating in front of the wooden bench, we cannot conclude whether he is having fun or not, as his emotions are not explicitly mentioned.<|im_end|>

上面的例子可以被标记化、批处理并输入到训练算法中。然而，在继续之前，让我们先检查一些众所周知的数据集以及如何准备和格式化它们。

如何加载 Open Assistant 数据

让我们从Open Assistant 数据集开始。

from datasets import load_dataset dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")

加载后，数据集被预先分为训练（13k 条目）和测试部分（700 条目）。

>>> dataset DatasetDict({ train: Dataset({ features: ['text'], num_rows: 12947 }) test: Dataset({ features: ['text'], num_rows: 690 }) })

让我们看一下第一个条目：

如何加载 Open Orca 数据

转向Open Orca，该数据集包含 420 万个条目，需要在加载后进行训练/测试分割，这可以使用train_test_split.

from datasets import load_dataset dataset = load_dataset("Open-Orca/OpenOrca") dataset = dataset["train"].train_test_split(test_size=0.1)

让我们检查一下数据集的结构。这是第一个条目：

{ 'id': 'flan.2020759', 'system_prompt': 'You are an AI assistant. You will be given a task. You must generate a detailed and long answer.', 'question': 'Ülke, bildirgeyi uygulamaya başlayan son ülkeler arasında olmasına rağmen 46 ülke arasında 24. sırayı aldı. Could you please translate this to English?', 'response': 'Despite being one of the last countries to implement the declaration, it ranked 24th out of 46 countries.' }

它是一对问题答案和一条描述必须回答问题的上下文的系统消息。

与 Open Assistant 数据集相反，我们必须自己将 Open Orca 数据格式化为 ChatML。

def format_conversation(row): template="""<|im_start|>system {sys}<|im_end|> <|im_start|>user {q}<|im_end|> <|im_start|>assistant {a}<|im_end|>""" conversation=template.format( sys=row["system_prompt"], q=row["question"], a=row["response"], ) return {"text": conversation} import os dataset = dataset.map( format_conversation, remove_columns=dataset["train"].column_names # remove all columns; only "text" will be left num_proc=os.cpu_count() # multithreaded )

现在，数据集已准备好进行标记并输入到训练管道中。

根据播客转录创建数据集

我之前根据Lex Friedman 播客的记录对 llama1 进行了训练。这项任务涉及将一个以深入讨论而闻名的播客转变为一个训练集，让人工智能模仿 Lex 的说话方式。您可以在从转录到 AI 聊天：Lex Fridman 播客的实验中找到有关如何创建数据集的详细信息。

from datasets import load_dataset dataset = load_dataset("g-ronimo/lfpodcast") dataset = dataset["train"].train_test_split(test_size=0.1)

检查训练集中的第一个条目，您将看到如下 JSON 对象：

>>> print(json.dumps(dataset["train"][0],indent=2)) { "title": "Lex_Fridman_Podcast_-_114__Russ_Tedrake_Underactuated_Robotics_Control_Dynamics_and_Touch", "episode": 114, "speaker_ratio_lex-vs-guest": 0.44402311303719755, "conversation": [ { "from": "Guest", "text": "I think the most beautiful motion of a robot has to be the passive dynamic walkers. I think there's just something fundamentally beautiful. (..) but what Steve and Andy did was they took it to this beautiful conclusion. where they built something that had knees, arms, a torso, the arms swung naturally, give it a little push, and that looked like a stroll through the park." }, { "from": "Lex", "text": "How do you design something like that? Is that art or science?" }, (...)

此结构捕捉了每个播客剧集的精髓，但为了准备模型训练，对话需要转换为 ChatML 格式。我们需要遍历每个消息轮次，应用 ChatML 格式，然后连接消息以将整个剧集记录存储在单个text字段中。Guest和的角色Lex将分别重新分配给user和assistant，以调节语言模型以采用 Lex 好奇且知识渊博的角色。

def format_conversation(row): # Template for conversation turns in ChatML format template="""<|im_start|>user {q}<|im_end|> <|im_start|>assistant {a}<|im_end|>""" turns=row["conversation"] # If Lex is the first speaker, skip his turn to start with Guest's question if turns[0]["from"]=="Lex": turns=turns[1:] conversation=[] for i in range(0, len(turns), 2): # Assuming the conversation always alternates between Guest and Lex question=turns[i] # Guest answer=turns[i 1] # Lex conversation.append( template.format( q=question["text"], a=answer["text"], )) return {"text": "\n".join(conversation)} import os dataset = dataset.map( format_conversation, remove_columns=dataset["train"].column_names, num_proc=os.cpu_count() )

通过应用这些更改，生成的数据集将为标记化做好准备并输入到训练管道中，从而教导语言模型进行对话，让人想起 Lex Fridman 的播客讨论。

根据书籍创建数据集

为了更深入地了解数据集创建的细微差别，让我们考虑一个案例，我们想要训练人工智能来反映知名人物的声音和个性。比如，选择将美国著名厨师Anthony Bourdain的自传变成一个数据集。他写了《厨房秘笈》，生动地描述了厨房里的疯狂和厨师的心思。

这个过程涉及将布尔登书中的叙述转变为引人入胜的对话，就像捕捉他的精神的来回采访一样。

所需步骤：

将书本转换为文本
段落分析和分段：一旦书籍成为文本形式，我们就将其分段。短段落被合并，长段落被分割，以确保每个片段都可以独立存在，同时仍然对整体故事情节做出贡献。
生成面试问题：对于每个段落，我们构建一个人工面试场景，其中LLMs扮演面试官的角色，生成问题，引出自然适合书中给定段落的回答。目的是激发富有洞察力的对话，给人留下布尔登本人正在回答有关他的生活和经历的问题的印象。

首先，我们假设您合法获得了这本书的数字副本，kc.pdf

mv anthony-bourdain-kitchen-confidential.pdf kc.pdf pdftotext -nopgbrk kc.pdf # fix line breaks within sentence sed -r ':a /[a-zA-Z,\ ]$/N;s/(.)\n/\1 /;ta' kc.txt > kc_reformat.txt

现在使用每个段落n和段落n-1来参与任何智能开源 LLM 或 GPT-3.5/4。我使用Open Hermes 2为每个段落创建一个面试问题。

# Gather paragraphs to target with open("kc_reformat.txt") as f: file_content = f.read() chapters=file_content.split("\n\n") # Define minimum and maximum lengths to ensure a good interview flow passage_minlen=300 # if paragraph <300 chars -> merge with next passage_maxlen=2000 # if paragraph >2k chars -> split # Process the chapters into suitable interview passages passages=[] for chap in chapters: passage="" for par in chap.split("\n"): if(len(passage)<passage_minlen) or not passage[-1]=="." and len(passage)<passage_maxlen: passage ="\n" par else: passages.append(passage.strip().replace("\n", " ")) passage=par # Ask Open Hermes prompt_template="""<|im_start|>system You are an expert interviewer who interviews an autobiography of a famous chef. You formulate questions based on quotes from the autobiography. Below is one such quote. Formulate a question that the quote would be the perfect answer to. The question should be short and directed at the author of the autobiography like in an interview. The question is short. Remember, make the question as short as possible. Do not give away the answer in your question. Also: If possible, ask for motvations, feelings, and perceptions rather than events or facts. Here is some context that might help you formulate the question regarding the quote: {ctx} <|im_end|> <|im_start|>user Quote: {par}<|im_end|> <|im_start|>assistant Question:""" prompts=[] for i,p in enumerate(passages): prompt=prompt_template.format(par=passages[i], ctx=passages[i-1]) prompts.append(prompt) # Prompt smart LLM, parse results, store Q/A in .json ...

您可以在这里找到对我有用的完整代码。

生成的 .json 文件如下所示：

{ "question": "Why you choose to share your experiences and insights from your career in the restaurant industry despite the angry or wanting to horrify the dining public?", "answer": "I'm not spilling my guts about everything I've seen, learned and done in my long and checkered career as dishwasher, prep drone, fry cook, grillardin, saucier, sous-chef and chef because I'm angry at the business, or because I want to horrify the dining public. I'd still like to be a chef, too, when this thing comes out, as this life is the only life I really know. If I need a favor at four o'clock in the morning, whether it's a quick loan, a shoulder to cry on, a sleeping pill, bail money, or just someone to pick me up in a car in a bad neighborhood in the driving rain, I'm definitely not calling up a fellow writer. I'm calling my sous-chef, or a former sous-chef, or my saucier, someone I work with or have worked with over the last twenty-plus years." }, { "question": "Why do you feel more comfortable sharing the \"dark recesses\" of the restaurant underbelly instead of writing about your personal experiences outside of the culinary world?", "answer": "No, I want to tell you about the dark recesses of the restaurant underbelly-a subculture whose centuries-old militaristic hierarchy and ethos of 'rum, buggery and the lash' make for a mix of unwavering order and nerve-shattering chaos-because I find it all quite comfortable, like a nice warm bath. I can move around easily in this life. I speak the language. In the small, incestuous community of chefs and cooks in New York City, I know the people, and in my kitchen, I know how to behave (as opposed to in real life, where I'm on shakier ground). I want the professionals who read this to enjoy it for what it is: a straight look at a life many of us have lived and breathed for most of our days and nights to the exclusion of 'normal' social interaction. Never having had a Friday or Saturday night off, always working holidays, being busiest when the rest of the world is just getting out of work, makes for a sometimes peculiar world-view, which I hope my fellow chefs and cooks will recognize. The restaurant lifers who read this may or may not like what I'm doing. But they'll know I'm not lying." }

最后，我们再次将数据集转换为 ChatML 格式：

interview_fn="kc_reformat_interview.json" dataset = load_dataset('json', data_files=interview_fn, field='interview') dataset=dataset["train"].train_test_split(test_size=0.1) # chatML template, from https://huggingface.co/docs/transformers/main/chat_templating tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' message['role'] '\n' message['content'] '<|im_end|>' '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" def format_interview(conv): messages = [ {"role": "user", "content": conv["question"]}, {"role": "assistant", "content": conv["answer"]} ] chat=tokenizer.apply_chat_template(messages, tokenize=False).strip() return {"text": chat} dataset = dataset.map( format_conversation, remove_columns=dataset["train"].column_names )

通过改造 Bourdain 的自传，我们的目标是生成一个人工智能，呼应他的叙事风格和对烹饪行业的看法，并体现他的人生哲学。所提供的方法非常基础，将受益于进一步的细化，例如删除低内容答案，剥离脚注、页码等非必要文本元素。这将提高模型的质量。

如果您好奇，请与Mistral Bourdain交谈。尽管当前的输出是对 Bourdain 声音的基本模仿，但它可以作为概念证明；增强的数据集管理无疑会产生更有说服力的模拟。

创建您自己的数据集

你现在已经明白了。以下是 GPT-4 提出的一些关于创意数据集创建的其他想法：

历史人物演讲数据集。收集历史人物的演讲、信件和书面作品，创建反映他们说话和写作风格的数据集。这可以用来生成教育内容，例如对历史人物的模拟采访，或者创建叙事体验，让这些人物对现代事件进行评论。
虚构世界百科全书。根据各种奇幻和科幻小说创建数据集，详细说明这些故事中的世界构建元素，例如地理、政治体系、物种和技术。这可用于训练人工智能生成新的幻想世界或为游戏开发提供丰富的上下文信息。
情感对话数据集。分析电影剧本、戏剧和小说，创建带有相应情感基调的对话数据集。该数据集可用于训练人工智能系统，该系统可识别并生成具有微妙情感底蕴的对话，这有利于改善聊天机器人和虚拟助手的移情反应。
技术产品评论和规格数据集。编译来自各种来源的技术产品评论、规格和用户评论的综合数据集。该数据集可以为推荐引擎或人工智能系统提供动力，旨在为消费者提供购买建议。

加载并准备模型和分词器

import torch from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model modelpath="models/Mistral-7B-v0.1" # Load 4-bit quantized model model = AutoModelForCausalLM.from_pretrained( modelpath, device_map="auto", quantization_config=BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", ), torch_dtype=torch.bfloat16, ) # Load (slow) Tokenizer, fast tokenizer sometimes ignores added tokens tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False) # Add tokens <|im_start|> and <|im_end|>, latter is special eos token tokenizer.pad_token = "</s>" tokenizer.add_tokens(["<|im_start|>"]) tokenizer.add_special_tokens(dict(eos_token="<|im_end|>")) model.resize_token_embeddings(len(tokenizer)) model.config.eos_token_id = tokenizer.eos_token_id

由于我们不是训练所有参数而只是训练一个子集，因此我们必须使用 Huggingface 将 LoRA 适配器添加到模型中peft。确保使用peft>= 0.6，否则 1)get_peft_model会非常慢，2) Mistral 训练会失败。

# Add LoRA adapters to model model = prepare_model_for_kbit_training(model) config = LoraConfig( r=64, lora_alpha=16, target_modules = ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], lora_dropout=0.1, bias="none", modules_to_save = ["lm_head", "embed_tokens"], # needed because we added new tokens to tokenizer/model task_type="CAUSAL_LM" ) model = get_peft_model(model, config) model.config.use_cache = False

LoRA 等级r：指定要训练的参数数量。排名越高，您训练的参数越多，适配器文件也越大。通常是 8 到 128 之间的数字。最大可能值，即。hidden_size训练所有参数，对于 llama2-7b 和 Mistral (= in )将为 4096 config.json，并且无法达到添加适配器的目的。QLoRA 论文建议64使用guanaco（开放助理数据集），这对我来说效果很好。
target_modules：QLoRA 作者在论文中的另一个建议/发现：我们发现最关键的 LoRA 超参数是总共使用了多少个 LoRA 适配器，并且所有线性变压器块层上的 LoRA 都需要匹配完整的微调性能
modules_to_save：指定除 LoRA 层之外的模块设置为可训练并保存在最终检查点中。由于我们将 ChatML 标签作为标记添加到词汇表中，因此我们还需要训练并保存线性层lm_head和嵌入矩阵embed_tokens。这对于稍后将适配器合并回基本模型相关。

准备训练数据

正确的标记化和批处理对于确保正确处理数据至关重要。

代币化

text对数据集中的字段进行标记，无需添加特殊标记或填充，因为我们将手动执行此操作。

def tokenize(element): return tokenizer( element["text"], truncation=True, max_length=2048, add_special_tokens=False, ) dataset_tokenized = dataset.map( tokenize, batched=True, num_proc=os.cpu_count(), # multithreaded remove_columns=["text"] # don't need the strings anymore, we have tokens from here on )

max_length：指定样本的最大长度（以标记数为单位）。所有超过 2048 个令牌的内容都将被截断并且不会进行训练。如果您的数据集在单个样本中只有简短的问题/答案对（例如 Open Orca），那么这将绰绰有余，如果您的样本较长（例如播客转录本），则理想情况下您将增加（消耗 VRAM）或max_length拆分将您的样本分成几个较小的样本。llama2 的最大值是 4096。Mistral 是“使用 8k 上下文长度和固定缓存大小进行训练，理论注意力跨度为 128k 令牌”，但我从未超过 4096。

配料

Hugging Facetrainer需要一个整理器函数将样本列表转换为包含一批填充的字典

input_ids（标记化文本）
labels（目标文本，与相同input_ids）
和attention_masks（零和一的张量）。

为此，我们将采用QLoRA 存储库DataCollatorForCausalLM中的简化版本。

训练超参数

超参数的选择可以显着影响模型性能。以下是我们为训练选择的超参数：

bs=8 # batch size ga_steps=1 # gradient acc. steps epochs=5 steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps) args = TrainingArguments( output_dir="out", per_device_train_batch_size=bs, per_device_eval_batch_size=bs, evaluation_strategy="steps", logging_steps=1, eval_steps=steps_per_epoch, # eval and save once per epoch save_steps=steps_per_epoch, gradient_accumulation_steps=ga_steps, num_train_epochs=epochs, lr_scheduler_type="constant", optim="paged_adamw_32bit", learning_rate=0.0002, group_by_length=True, fp16=True, ddp_find_unused_parameters=False, # needed for training with accelerate )

batch size：尽可能高以提高速度。消耗VRAM，如果OOM则减少。
gradient_accumulation_steps：增加有效批量大小而不消耗额外的 VRAM，但会使训练速度变慢。有效批量大小为batch_size* gradient_accumulation_steps。
steps_per_epoch：如果您的数据集有 80 个样本，并且有效批量大小为 8（例如batch_size8 和gradient_accumulation_steps1），您将通过 10 个步骤（= 1 epoch）处理整个数据集。
num_train_epochs：训练多少个纪元取决于您的数据集。理想情况下，eval split 上的损失会告诉您何时停止训练以及哪个检查点是最好的 -但例如，训练guanaco 会导致在第 2 轮之后 eval_loss 增加，这表明对训练集的过度拟合，即使模型质量有所提高。有关此内容的更多信息以及 QLoRA 作者在github上和我之前的一篇文章中的官方回复。
总而言之：您只需查看哪个检查点最适合您的特定任务。通常，3-4 个 epoch 是一个好的开始。
learning_rate：我们将使用 QLoRA 作者建议的默认学习率，对于 7B（或 13 B）模型为 0.0002。对于参数较多的模型，建议较低的学习率：33B 和 65B 参数的模型为 0.0001。

我们来训练吧。

trainer = Trainer( model=model, tokenizer=tokenizer, data_collator=collate, train_dataset=dataset_tokenized["train"], eval_dataset=dataset_tokenized["test"], args=args, ) trainer.train()

训练运行示例

训练和评估损失

下面是 Open Assistant (OA) 数据集典型训练运行的 wandb 图表，比较了 llama2–7b 和 Mistral-7b 的微调。

训练时间和 VRAM 使用情况

在具有 24GB VRAM 的单个 GPU 上对 Open Assistant 数据集上的 Llama2–7B 和 Mistral-7B 进行微调每个周期大约需要 100 分钟。

显卡：NVIDIA GeForce RTX 3090
数据集“OpenAssistant/oasst_top1_2023–08–25”
批量大小 16， grad. acc. steps 1
样品max_length512

将 LoRA 适配器与基本模型合并

以下代码与其他脚本（例如TheBloke 提供的脚本）有点不同，因为我们在训练之前添加了 ChatML 的标记。不过，我们没有更改基本模型，这就是为什么在加载适配器之前我们必须将新标记添加到基本模型和标记生成器中；否则，我们将尝试将具有两个附加令牌的适配器合并到没有这些令牌的模型中（这将失败）。

from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel import torch base_path="models/Mistral-7B-v0.1" # input: base model adapter_path="out/checkpoint-606" # input: adapters save_to="models/Mistral-7B-finetuned" # out: merged model ready for inference base_model = AutoModelForCausalLM.from_pretrained( base_path, return_dict=True, torch_dtype=torch.bfloat16, device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained(base_path) # Add/set tokens (same 5 lines of code we used before training) tokenizer.pad_token = "</s>" tokenizer.add_tokens(["<|im_start|>"]) tokenizer.add_special_tokens(dict(eos_token="<|im_end|>")) base_model.resize_token_embeddings(len(tokenizer)) base_model.config.eos_token_id = tokenizer.eos_token_id # Load LoRA adapter and merge model = PeftModel.from_pretrained(base_model, adapter_path) model = model.merge_and_unload() model.save_pretrained(save_to, safe_serialization=True, max_shard_size='4GB') tokenizer.save_pretrained(save_to)

故障排除

挑战是模型训练的重要组成部分。让我们讨论一些常见问题及其解决方案。

OOM

如果遇到内存不足 (OOM) 错误：

考虑减少批量大小。
通过减少上下文长度（max_lengthin tokenize()）来缩短训练样本。

训练太慢

如果训练看起来缓慢：

增加批量大小。
多个 GPU，购买或租用（例如在runpod上）。此处提供的代码已准备好加速，并且可用于在多 GPU 设置中进行训练，只需使用而accelerate launch qlora.py不是启动即可python qlora.py。

最终模型的质量很差

模型的质量反映了数据集的质量。提高模型质量：

确保您的数据集丰富且相关。
研究增强数据集以获得更好表示的策略。

总结

明白你在做什么。有像axolotl这样优秀的训练工具，它可以让您专注于数据集创建，而不是编写自己的填充函数。尽管如此，对基本机制的扎实掌握仍然是无价的。这些知识使您能够自信地应对复杂性并排除故障。
增量方法：从使用小数据集的基本示例开始。逐步扩大规模并逐步调整参数，以揭示它们对模型性能的影响。
强调数据质量：高质量的数据是有效训练的基石。在组装数据集时要保持创新和勤奋。

对 Llama 2 和 Mistral 等 LLM 进行微调是一个有益的过程，特别是当您拥有正确的数据集和训练参数时。请记住始终关注模型的性能并准备好迭代和适应。

查看全文