How to Optimize Relation Extraction with Llama3 and Hugging Face (宝可梦sleep2024)

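Introduction

Relation extraction (RE) is the task of identifying the relationships between named entities in unstructured text. It works hand in hand with named entity recognition (NER) and is an essential step in a natural language processing pipeline. With the rise of large language models (LLMs), traditional supervised approaches, which require tagging entity spans and classifying the relationships between them, are being enhanced or even replaced by LLM-based approaches.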

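Llama3 is the latest major release in generative AI. The base model comes in two sizes, 8B and 70B, and a 400B model is expected soon. The models are available on the Hugging Face platform. The 70B version powers Meta's new chat site, Meta.ai, and performs on par with ChatGPT. The 8B version is the best performer in its class. The Llama3 architecture is similar to that of Llama2, and most of the performance gain comes from upgraded data. The model ships with a new tokenizer and a larger context window. Although it is billed as open source, what has been released is the model weights under Meta's license; the training data has not been made public. Overall, it is an excellent model, and I can't wait to try it.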

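Although Llama3-70B produces impressive results, its sheer size makes it impractical, expensive, and difficult to run on a local system. To put the model to good use, we therefore have Llama3-70B teach the smaller Llama3-8B how to extract relations from unstructured text.

Specifically, we use Llama3-70B to build a supervised fine-tuning dataset for relation extraction. We then use this dataset to fine-tune Llama3-8B and improve its relation extraction ability.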

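To reproduce the code in the Google Colab notebook that accompanies this post, you will need:

Hugging Face credentials (optional, for saving the fine-tuned model) and access to Llama3, which you can obtain by following the instructions on the model card;
A free GroqCloud account (you can sign in with a Google account) and the corresponding API key.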

Workspace Setup

For this project I used Google Colab Pro with an A100 GPU and the High-RAM setting.

We start by installing all the required libraries:

    !pip install -q groq
    !pip install -U accelerate bitsandbytes datasets evaluate
    !pip install -U peft transformers trl

Happily, even though the model is very new, the whole setup worked from the start with no dependency issues and without installing transformers from source.

We also need to give Google Colab access to the drive and files, and set the working directory:

    # For Google Colab settings
    from google.colab import userdata, drive

    # This will prompt for authorization
    drive.mount('/content/drive')

    # Clone the project
    !git clone https://github.com/mcks2000/llm_notebooks.git

    # Set the working directory
    %cd /content/llm_notebooks/knowlage_graph/llama3_re

For those who want to upload the model to the Hugging Face Hub, we need to load the Hub credentials. In my case they are stored in Google Colab secrets, which are accessible through the key button on the left. This step is optional.

    # For Hugging Face Hub settings
    from huggingface_hub import login

    # Read the Hugging Face token (it should have WRITE access) from Colab secrets
    HF = userdata.get('HF')

    # This is needed to upload the model to the Hugging Face Hub
    login(token=HF, add_to_git_credential=True)

I also added a few path variables to simplify file access:

    # Create a path variable for the data folder
    data_path = '/content/llm_notebooks/knowlage_graph/llama3_re/datas/'

    # Full fine-tuning dataset
    sft_dataset_file = f'{data_path}sft_train_data.json'

    # Data collected from the mini-test
    mini_data_path = f'{data_path}mini_data.json'

    # Test data containing all three outputs
    all_tests_data = f'{data_path}all_tests.json'

    # The adjusted training dataset
    train_data_path = f'{data_path}sft_train_data.json'

    # Create a path variable for the SFT model to be saved locally
    sft_model_path = '/content/llm_notebooks/knowlage_graph/llama3_re/Llama3_RE/'

With the workspace set up, we can move on to the first step: building a synthetic dataset for the relation extraction task.

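Creating a Synthetic Dataset for Relation Extraction with Llama3-70B

Several relation extraction datasets are available, the best known being CoNLL04. There are also other excellent datasets, such as web_nlg, available on Hugging Face, and SciREX, developed by AllenAI. However, most of these datasets come with restrictive licenses.

Inspired by the format of the web_nlg dataset, we will build our own. This approach is especially useful if we plan to fine-tune a model on our own data. To begin, we need a collection of short sentences for the relation extraction task. Such a corpus can be compiled in various ways.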

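Collecting a Set of Sentences

We will use databricks-dolly-15k, an open-source dataset generated by Databricks employees in 2023. It is designed for supervised fine-tuning and has four features: instruction, context, response, and category. After analyzing the eight categories, I decided to keep the first sentence of the context from the information_extraction category. The data parsing steps are shown below:

    from datasets import load_dataset

    # Load the dataset
    dataset = load_dataset("databricks/databricks-dolly-15k")

    # Choose the desired category from the dataset
    ie_category = [e for e in dataset["train"] if e["category"] == "information_extraction"]

    # Retain only the context from each instance
    ie_context = [e["context"] for e in ie_category]

    # Split the text into sentences (at the period) and keep the first sentence
    reduced_context = [text.split('.')[0] + '.' for text in ie_context]

    # Retain sequences of specified lengths only (use character length)
    sampler = [e for e in reduced_context if 30 < len(e) < 170]

This selection process yields a dataset of 1,041 sentences. Since this is a small project, I did not handpick the sentences, so some samples may not be well suited to our task. In a production project I would carefully select only the most suitable sentences. For the purposes of this project, however, this dataset is sufficient.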
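To see why some samples end up unsuitable, it helps to replay the two selection steps on a few strings. The sketch below is only an illustration: the helper names `first_sentence` and `keep_by_length` and the three contexts are made up for this example and are not part of the notebook. It shows that splitting at the first period truncates text starting with an abbreviation, and that the length filter happens to discard such fragments.

```python
def first_sentence(text: str) -> str:
    """Return the text up to the first period, with the period restored."""
    return text.split('.')[0] + '.'


def keep_by_length(sentences, min_chars=30, max_chars=170):
    """Keep sentences whose character length is strictly between the bounds."""
    return [s for s in sentences if min_chars < len(s) < max_chars]


# Hypothetical contexts, invented for illustration only
contexts = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris. It opened in 1889.",
    "Too short. Second sentence here.",
    "Dr. Smith founded the laboratory in Boston in 1950. It closed in 1990.",
]

firsts = [first_sentence(c) for c in contexts]
kept = keep_by_length(firsts)

print(firsts)  # the third entry is cut down to 'Dr.'
print(kept)    # only the first sentence passes the length filter
```

A sentence splitter that understands abbreviations would keep more of the corpus, at the cost of an extra dependency.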

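Formatting the Data

We first need to create a system message that defines the input prompt and tells the model how to generate its answer:

    system_message = """You are an experienced annotator.
    Extract all entities and the relations between them from the following text.
    Write the answer as a triple entity1|relationship|entity2.
    Do not add anything else.
    Example Text: Alice is from France.
    Answer: Alice|is from|France.
    """

Since this is an experimental phase, I keep the demands on the model to a minimum. I did test several other prompts, including some that asked for output in CoNLL format, in which the entities are categorized, and the model performed quite well. For simplicity, however, we stick to the basics for now.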
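Because the answers come back as plain text, it is handy to turn them into structured triples before doing anything else with them. The following is a minimal sketch, not code from the notebook: `parse_triples` is a hypothetical helper, and it assumes one triple per line in the entity1|relationship|entity2 format requested by the system message.

```python
def parse_triples(answer: str):
    """Split a model answer into (entity1, relationship, entity2) tuples.

    Lines that do not contain exactly three non-empty fields are skipped.
    """
    triples = []
    for line in answer.splitlines():
        # The prompt's own example ends with a period, so the model may copy it;
        # note that this also strips a legitimate trailing period (e.g. "Inc.").
        parts = [p.strip() for p in line.strip().rstrip('.').split('|')]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


example = "Alice|is from|France.\nnot a triple\nBob|works at|Acme"
print(parse_triples(example))
# [('Alice', 'is from', 'France'), ('Bob', 'works at', 'Acme')]
```

Skipping malformed lines instead of raising an error is a design choice that suits noisy LLM output; a stricter pipeline might log or count the rejected lines.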

We also need to convert the data into a conversational format:

    messages = [[
        {"role": "system", "content": f"{system_message}"},
        {"role": "user", "content": e}] for e in sampler]

Groq Client and API

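Llama3 was released only a few days ago, and API options are still limited. Although a chat interface is available for Llama3-70B, this project needs an API that can process my 1,000 sentences with a few lines of code. I found a helpful YouTube video that explains how to use the GroqCloud API for free; refer to that video for more details.

Just a reminder: you need to log in and retrieve a free API key from the GroqCloud website. My API key is already saved in Google Colab secrets. We start by initializing the Groq client:

    import os
    from groq import Groq

    gclient = Groq(
        api_key=userdata.get("GROQ"),
    )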

Next we define a couple of helper functions that let us interact with the GroqCloud chat completions API efficiently:

    import time
    from tqdm import tqdm

    def process_data(prompt):
        """Send one request and retrieve model's generation."""
        chat_completion = gclient.chat.completions.create(
            messages=prompt,          # input prompt to send to the model
            model="llama3-70b-8192",  # according to GroqCloud labeling
            temperature=0.5,          # controls diversity
            max_tokens=128,           # max number of tokens to generate
            top_p=1,                  # proportion of likelihood weighted options to consider
            stop=None,                # string that signals to stop generating
            stream=False,             # if set, partial messages are sent
        )
        return chat_completion.choices[0].message.content

    def send_messages(messages):
        """Process messages in batches with a pause between batches."""
        batch_size = 10
        answers = []
        for i in tqdm(range(0, len(messages), batch_size)):  # batches of size 10
            batch = messages[i:i + batch_size]  # get the next batch of messages
            for message in batch:
                output = process_data(message)
                answers.append(output)
            if i + batch_size < len(messages):  # check if there are batches left
                time.sleep(10)  # wait for 10 seconds
        return answers

The first function, process_data(), is a wrapper around the Groq client's chat completion function. The second function, send_messages(), processes the data in small batches. If you follow the Settings link on the Groq playground page, you will find a link to Limits, which details the conditions under which the free API can be used, including caps on the number of requests and generated tokens. To avoid exceeding these limits, I added a 10-second delay after each batch of 10 messages, although it was not strictly necessary in my case. You may want to experiment with these settings.
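A fixed pause is the simplest way to stay under a rate limit. An alternative worth experimenting with is to retry only when a request actually fails, waiting a little longer each time. The sketch below illustrates that idea under stated assumptions: `call_with_backoff` is a hypothetical helper, the stand-in function `flaky` simulates a failing API call, and catching a bare `Exception` is a simplification (real code would catch the client's specific rate-limit error).

```python
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"n": 0}


def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"ok: {prompt}"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, "hello", sleep=delays.append)
print(result, delays)  # ok: hello [1.0, 2.0]
```

In the batch loop, `call_with_backoff(process_data, message)` could take the place of the direct `process_data(message)` call.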

All that remains is to generate our relation extraction data and combine it with the initial dataset:

    # Data generation with Llama3-70B
    answers = send_messages(messages)

    # Combine input data with the generated dataset
    combined_dataset = [{'text': user, 'gold_re': output} for user, output in zip(sampler, answers)]

Evaluating Llama3-8B for Relation Extraction

Before fine-tuning the model, it is important to evaluate Llama3-8B on a few samples to determine whether fine-tuning is actually needed.

Building the Test Dataset

We select 20 samples from the dataset we just built for testing; the rest of the dataset is used for fine-tuning.

    import random
    random.seed(17)

    # Select 20 random entries
    mini_data = random.sample(combined_dataset, 20)

    # Build conversational format
    parsed_mini_data = [[{'role': 'system', 'content': system_message},
                         {'role': 'user', 'content': e['text']}] for e in mini_data]

    # Create the training set
    train_data = [item for item in combined_dataset if item not in mini_data]
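One detail of the split above: `item not in mini_data` compares dictionaries by value, so if the corpus contains duplicate records, every copy of a test sample is removed from the training set as well. That may be exactly what is wanted, but if an index-based split is preferred, it can be sketched as follows. `split_holdout` and the toy records are hypothetical and only illustrate the idea; this variant will not pick the same 20 samples as the code above.

```python
import random


def split_holdout(records, n_test, seed=17):
    """Return (test, train) lists, choosing n_test records by index."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    test_idx = set(rng.sample(range(len(records)), n_test))
    test = [r for i, r in enumerate(records) if i in test_idx]
    train = [r for i, r in enumerate(records) if i not in test_idx]
    return test, train


# Toy records standing in for combined_dataset
toy = [{"text": f"sentence {i}", "gold_re": f"a{i}|rel|b{i}"} for i in range(100)]
test, train = split_holdout(toy, n_test=20)
print(len(test), len(train))  # 20 80
```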

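We use the GroqCloud API and the utilities defined above, specifying model=llama3-8b-8192 while the rest of the function stays unchanged. In this case we can process our small dataset directly without worrying about exceeding the API limits.

Below is a sample output showing the original text, the gold_re generated by Llama3-70B, and the test_re generated by Llama3-8B.

    {'text': 'Long before any knowledge of electricity existed, people were aware of shocks from electric fish.',
     'gold_re': 'people|were aware of|shocks\nshocks|from|electric fish\nelectric fish|had|electricity',
     'test_re': 'electric fish|were aware of|shocks'}

For the full test dataset, please refer to the Google Colab notebook.

From this example alone, it is clear that Llama3-8B could benefit from some improvement in its relation extraction ability. Let's work on that.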
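Eyeballing one example is a reasonable first check, but the comparison can also be put into numbers. The sketch below is one possible way to do it, not the evaluation used in the notebook: it treats the Llama3-70B output as the reference and computes exact-match precision, recall, and F1 over triples. The helper names `to_triple_set` and `triple_prf` are made up for this illustration, and exact matching is deliberately strict, since paraphrased relations count as misses.

```python
def to_triple_set(answer: str):
    """Parse 'e1|rel|e2' lines into a set of lower-cased triples."""
    triples = set()
    for line in answer.splitlines():
        parts = tuple(p.strip().lower() for p in line.split('|'))
        if len(parts) == 3 and all(parts):
            triples.add(parts)
    return triples


def triple_prf(gold: str, pred: str):
    """Exact-match precision, recall and F1 between two triple lists."""
    g, p = to_triple_set(gold), to_triple_set(pred)
    tp = len(g & p)  # triples present in both
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = "people|were aware of|shocks\nshocks|from|electric fish\nelectric fish|had|electricity"
base = "electric fish|were aware of|shocks"

# The Llama3-8B answer from the sample above shares no triple with the reference
print(triple_prf(gold, base))  # (0.0, 0.0, 0.0)

# A prediction that recovers one of the three reference triples
print(triple_prf(gold, "people|were aware of|shocks"))
```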

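Supervised Fine-Tuning of Llama3-8B

We will rely on a full arsenal of techniques to assist us, including QLoRA and Flash Attention. I won't go into the details of choosing hyperparameters here, but if you want to explore further, check out the excellent references [4] and [5].

The A100 GPU supports Flash Attention and bfloat16 and has about 40 GB of memory, which is sufficient for our fine-tuning needs.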
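A quick back-of-the-envelope calculation shows why 40 GB is comfortable once the weights are quantized to 4 bits, as they are later in this section. The sketch below counts the memory for the weights alone and assumes roughly 8 billion parameters; it ignores quantization constants, activations, the LoRA weights, and optimizer state, so the real footprint is higher. `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters for Llama3-8B

for label, bits in [("bfloat16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB in bfloat16 to about 4 GB in 4-bit form, which leaves most of the GPU memory for activations and the adapter's training state.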

Preparing the SFT Dataset

We start by parsing the dataset into a conversational format that includes the system message, the input text, and the desired answer obtained from the Llama3-70B generation. We then save it as a Hugging Face dataset:

    from datasets import load_dataset, Dataset

    def create_conversation(sample):
        return {
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": sample["text"]},
                {"role": "assistant", "content": sample["gold_re"]}
            ]
        }

    train_dataset = Dataset.from_list(train_data)

    # Transform to conversational format
    train_dataset = train_dataset.map(create_conversation,
                                      remove_columns=train_dataset.features,
                                      batched=False)

Choosing the Model

    model_id = "meta-llama/Meta-Llama-3-8B"

Loading the Tokenizer

    from transformers import AutoTokenizer

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id,
                                              use_fast=True,
                                              trust_remote_code=True)

    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = 'left'

    # Set a maximum length
    tokenizer.model_max_length = 512

Choosing the Quantization Parameters

    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

Loading the Model

    import torch
    from transformers import AutoModelForCausalLM
    from peft import prepare_model_for_kbit_training
    from trl import setup_chat_format

    device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map=device_map,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
        quantization_config=bnb_config
    )

    model, tokenizer = setup_chat_format(model, tokenizer)
    model = prepare_model_for_kbit_training(model)

LoRA Configuration

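    from peft import LoraConfig

    # According to Sebastian Raschka's findings
    peft_config = LoraConfig(
        lora_alpha=128,  # 32 is a more standard value
        lora_dropout=0.05,
        r=256,  # 16 is a more standard value
        bias="none",
        target_modules=["q_proj", "o_proj", "gate_proj", "up_proj",
                        "down_proj", "k_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

The best results are obtained when targeting all the linear layers. If memory is a constraint, opting for more standard values such as alpha=32 and rank=16 can help, since these settings result in significantly fewer parameters.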
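To make "significantly fewer parameters" concrete, the number of trainable LoRA weights can be estimated by hand: for a linear layer with d_in inputs and d_out outputs, LoRA adds two matrices with r * (d_in + d_out) entries in total. The sketch below applies this to the seven target modules. The layer dimensions are assumptions taken from the published Llama3-8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 decoder layers), and `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama3-8B dimensions (from the published model configuration)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
N_LAYERS = 32          # number of decoder layers

# (in_features, out_features) of each targeted linear layer in one decoder block
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per layer: rank * (in + out)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_block * N_LAYERS


for r in (16, 256):
    print(f"r={r:>3}: {lora_param_count(r):,} trainable LoRA parameters")
# r= 16: 41,943,040 trainable LoRA parameters
# r=256: 671,088,640 trainable LoRA parameters
```

Under these assumptions, dropping the rank from 256 to 16 cuts the adapter from roughly 671 million to roughly 42 million trainable parameters, a 16-fold reduction.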

Training Arguments

    from transformers import TrainingArguments

    # Adapted from Phil Schmid's blog post
    args = TrainingArguments(
        output_dir=sft_model_path,         # directory to save the model and repository id
        num_train_epochs=2,                # number of training epochs
        per_device_train_batch_size=4,     # batch size per device during training
        gradient_accumulation_steps=2,     # number of steps before performing a backward/update pass
        gradient_checkpointing=True,       # use gradient checkpointing to save memory
        optim="adamw_8bit",                # choose paged_adamw_8bit if not enough memory
        logging_steps=10,                  # log every 10 steps
        save_strategy="epoch",             # save checkpoint every epoch
        learning_rate=2e-4,                # learning rate, based on QLoRA paper
        bf16=True,                         # use bfloat16 precision
        tf32=True,                         # use tf32 precision
        max_grad_norm=0.3,                 # max gradient norm based on QLoRA paper
        warmup_ratio=0.03,                 # warmup ratio based on QLoRA paper
        lr_scheduler_type="constant",      # use constant learning rate scheduler
        push_to_hub=True,                  # push model to Hugging Face hub
        hub_model_id="llama3-8b-sft-qlora-re",
        report_to="tensorboard",           # report metrics to tensorboard
    )

If you choose to save the model locally, you can omit the last three parameters. You may also need to adjust per_device_train_batch_size and gradient_accumulation_steps to prevent out-of-memory (OOM) errors.
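When trading batch size against gradient accumulation, the quantity to keep an eye on is the effective batch size, which is their product (times the number of devices). The sketch below works through the arithmetic. It assumes 1,021 training samples (the 1,041 sentences minus the 20 held out), and the step count is only an approximation of what the trainer reports; `training_steps` is a hypothetical helper.

```python
import math


def training_steps(n_samples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, approximate optimizer steps for the run)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    micro_batches = math.ceil(n_samples / (per_device_batch * n_devices))
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs


# Settings used in this post
print(training_steps(1021, per_device_batch=4, grad_accum=2, epochs=2))  # (8, 256)

# Halving the batch and doubling the accumulation keeps the effective batch at 8
print(training_steps(1021, per_device_batch=2, grad_accum=4, epochs=2))  # (8, 254)
```

The second configuration needs less memory per step while leaving the effective batch size, and therefore the optimization behavior, essentially unchanged.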

Initializing the Trainer and Training the Model

    from trl import SFTTrainer

    trainer = SFTTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # the dataset built with create_conversation above
        peft_config=peft_config,
        max_seq_length=512,
        tokenizer=tokenizer,
        packing=False,  # True if the dataset is large
        dataset_kwargs={
            "add_special_tokens": False,  # the template adds the special tokens
            "append_concat_token": False,  # no need to add additional separator token
        }
    )

    trainer.train()
    trainer.save_model()

Training, including saving the model, took about 10 minutes.

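We clear the memory to prepare for the inference tests. If you are using a GPU with less memory and run into CUDA out-of-memory (OOM) errors, you may need to restart the runtime.

    import torch
    import gc

    del model
    del tokenizer
    gc.collect()
    torch.cuda.empty_cache()

Inference with the SFT Model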

In this final step we load the base model in half precision together with the PEFT adapter. For this test I chose not to merge the model with the adapter.

    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer, pipeline
    import torch

    # HF model
    peft_model_id = "solanaO/llama3-8b-sft-qlora-re"

    # Load Model with PEFT adapter
    model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model_id,
        device_map="auto",
        torch_dtype=torch.float16,
        offload_buffers=True
    )

Next, we load the tokenizer:

    tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

We build the text generation pipeline:

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

We load the test dataset, which consists of the 20 samples we set aside earlier, and format the data in conversational style. This time, however, we omit the assistant message and format it as a Hugging Face dataset:

    from datasets import Dataset

    def create_input_prompt(sample):
        return {
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": sample["text"]},
            ]
        }

    test_dataset = Dataset.from_list(mini_data)

    # Transform to conversational format
    test_dataset = test_dataset.map(create_input_prompt,
                                    remove_columns=test_dataset.features,
                                    batched=False)

Single-Sample Test

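We generate the relation extraction output with the SFT Llama3-8B and compare it with the two previous outputs on a single instance:

    # Generate the input prompt
    prompt = pipe.tokenizer.apply_chat_template(test_dataset[2]["messages"][:2],
                                                tokenize=False,
                                                add_generation_prompt=True)

    # Generate the output (with do_sample=False decoding is greedy,
    # so the sampling parameters below have no effect)
    outputs = pipe(prompt,
                   max_new_tokens=128,
                   do_sample=False,
                   temperature=0.1,
                   top_k=50,
                   top_p=0.1,
                   )

    # Display the results
    # Note: test_sampler is not defined in this post; it is assumed to be the list of
    # test records holding 'gold_re' and 'test_re' (for example, loaded from all_tests_data)
    print(f"Question: {test_dataset[2]['messages'][1]['content']}\n")
    print(f"Gold-RE: {test_sampler[2]['gold_re']}\n")
    print(f"LLama3-8B-RE: {test_sampler[2]['test_re']}\n")
    print(f"SFT-Llama3-8B-RE: {outputs[0]['generated_text'][len(prompt):].strip()}")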

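We obtain the following results:

In this example we observe a significant improvement in the relation extraction ability of Llama3-8B through fine-tuning. Even though the fine-tuning dataset is neither very clean nor very large, the results are impressive.

For the full results on the 20-sample dataset, please refer to the Google Colab notebook. Note that the inference test takes longer because we load the model in half precision.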
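For readers curious about what the prompt string produced by apply_chat_template looks like in the single-sample test, the sketch below renders a conversation by hand. It assumes the ChatML layout that setup_chat_format installs by default; `render_chatml` is a hypothetical stand-in written for this illustration, and the authoritative string is always whatever the tokenizer itself returns.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in ChatML style."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # cue the model to start its answer
    return text


demo = [
    {"role": "system", "content": "Extract triples as entity1|relationship|entity2."},
    {"role": "user", "content": "Alice is from France."},
]
rendered = render_chatml(demo)
print(rendered)
```

Because the prompt ends with the opening of the assistant turn, slicing the generated text at len(prompt), as done in the test above, leaves only the model's answer.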

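Conclusion

In conclusion, by using Llama3-70B and an available dataset, we created a synthetic dataset and then used it to fine-tune Llama3-8B for a specific task. The process not only familiarized us with Llama3 but also let us apply straightforward techniques from Hugging Face. Working with Llama3 felt much like working with Llama2, the notable improvements being higher output quality and a more efficient tokenizer.

For those interested in pushing the model further, consider more complex tasks such as categorizing entities and relations and using those categories to build a knowledge graph.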

Resources

Dataset: https://huggingface.co/datasets/databricks/databricks-dolly-15k
GitHub repository: https://github.com/mcks2000/llm_notebooks/tree/main/knowlage_graph/llama3_re
Google Colab notebook: https://colab.research.google.com/drive/1PQvWvOyuLi69pnVGQRV3OEFmvIiStoQE?usp=sharing

References

[1] Somin Wadhwa, Silvio Amir, Byron C. Wallace, Revisiting Relation Extraction in the era of Large Language Models, arXiv:2305.05003 (2023): https://arxiv.org/pdf/2305.05003.pdf
[2] Meta, Introducing Meta Llama 3: The most capable openly available LLM to date, April 18, 2024: https://ai.meta.com/blog/meta-llama-3/
[3] Philipp Schmid, Omar Sanseviero, Pedro Cuenca, Younes Belkada, Leandro von Werra, Welcome Llama 3 - Meta's new open LLM, April 18, 2024: https://huggingface.co/blog/llama3
[4] Sebastian Raschka, Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation), Ahead of AI, November 19, 2023: https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
[5] Philipp Schmid, How to Fine-Tune LLMs in 2024 with Hugging Face, January 22, 2024: https://www.philschmid.de/fine-tune-llms-in-2024-with-trl
