RLHF with LoRA
Based on the open-source GitHub project LLM-Tuning
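I. Introduction
(1) RLHF (Reinforcement Learning from Human Feedback) has three stages:
- SFT (Supervised Fine-Tuning): supervised fine-tuning on ordinary instruction-following or dialogue samples, which teaches the model basic conversational ability and the ability to follow prompts;
- RM (Reward Modeling): training a scoring model on human preferences and annotations so that it imitates human preference;
- RL (Reinforcement Learning): starting from the SFT model, using the feedback provided by the RM to keep adjusting the model's behavior within the PPO reinforcement learning framework.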
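(2) LoRA: Low-Rank Adaptation of Large Language Models
- Adapting large language models to specialized domains and tasks is one of the important problems in natural language processing. As models keep growing, however, updating all of their parameters (full fine-tuning) becomes less and less feasible. Take GPT-3 with its 175B parameters: every new domain would require fully fine-tuning a new copy of the model, which is very expensive.
- Advantages: low training and compute cost, parallelizable, and no added inference latency once the low-rank update is merged into the original weights.
- A parallel low-rank branch is added alongside the selected weight matrices in each transformer block (typically the attention projections). The branch takes the same input as the layer it sits beside, and its output is added to that layer's output. With the pretrained model's parameters frozen, the branch learns the task-specific knowledge needed for the task.
- LoRA is implemented in PEFT, Hugging Face's open-source library for parameter-efficient fine-tuning of large models. LoRA fine-tuning needs two main parameters: r, the rank of the low-rank matrices, and alpha, a scaling parameter.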
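from peft import LoraConfig, PeftType, get_peft_model
from transformers import AutoModelForSequenceClassification

model_name_or_path = "./unsup-simcse-roberta-base"
peft_type = PeftType.LORA
# LoRA config: r is the rank, lora_alpha is the scaling parameter
peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
lr = 3e-4
# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True)
# Wrap it so that only the added LoRA weights are trainable
model = get_peft_model(model, peft_config)
# Print the number of trainable parameters
model.print_trainable_parameters()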
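To make the roles of r and alpha more concrete, here is a minimal pure-Python sketch of a single linear layer with a LoRA branch. It is an illustration under stated assumptions, not PEFT's implementation: the helper names (`matvec`, `lora_forward`, `lora_param_count`) and the toy sizes are invented for this example, while the alpha / r scaling and the initialization (small random A, zero B) follow the LoRA paper.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one LoRA branch: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Toy sizes, hypothetical and chosen only for illustration
d_in, d_out, r, alpha = 6, 4, 2, 16
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pretrained weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # small random init
B = [[0.0] * r for _ in range(d_out)]                                   # zero init
x = [rng.uniform(-1, 1) for _ in range(d_in)]

# With B initialized to zero, the branch contributes nothing at the start of training
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True

# For a 4096 x 4096 projection with r=8:
print(lora_param_count(4096, 4096, 8))  # 65536
print(4096 * 4096)                      # 16777216 parameters in the full matrix
```

Because the branch is just an additive term, B A can be folded into W after training, which is why the merged model has no extra inference cost.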
II. Selected study notes
(1) How do you change the structure of a large model, add LoRA, and train it? Example: SFT
from transformers import AutoTokenizer, AutoModelForCausalLM

# model path
model_checkpoint = "../llm/Baichuan2-7B-Base"

# init model
model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint, load_in_8bit=False, trust_remote_code=True,
    device_map="auto"  # the model's layers are automatically placed on different GPUs
)
print(model.hf_device_map)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()
# CastOutputToFloat and finetune_args are defined in the full script
model.lm_head = CastOutputToFloat(model.lm_head)

# setup peft: two modes, depending on whether previous LoRA weights are given
if finetune_args.previous_lora_weights is None:
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=finetune_args.lora_rank,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["W_pack"]  # print the model and look for the attention-related modules
    )
    model = get_peft_model(model, peft_config)
else:
    model = PeftModel.from_pretrained(model, finetune_args.previous_lora_weights)
    # see: https://github.com/huggingface/peft/issues/184
    for name, param in model.named_parameters():
        if 'lora' in name or 'Lora' in name:
            param.requires_grad = True
Full code:


from transformers.integrations import TensorBoardCallback
from torch.utils.tensorboard import SummaryWriter
from transformers import TrainingArguments
from transformers import Trainer, HfArgumentParser
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
import torch
import torch.nn as nn
from peft import get_peft_model, LoraConfig, TaskType, PeftModel
from dataclasses import dataclass, field
from typing import Optional
import datasets
import os
from pprint import pprint as print

model_checkpoint = "/data/liudianqing/llm/Baichuan2-7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)


@dataclass
class FinetuneArguments:
    tokenized_dataset: str = field(default=" ")  # folder of the tokenized dataset
    model_path: str = field(default=" ")
    lora_rank: int = field(default=8)
    # to continue training from an earlier LoRA, set this to its path
    previous_lora_weights: Optional[str] = field(default=None)


class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)


tokenizer.pad_token = tokenizer.unk_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# DataCollatorForLanguageModeling takes care of padding and of creating the labels.
# Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.
# Reference tutorial: https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt


class ModifiedTrainer(Trainer):
    # **kwargs absorbs extra keyword arguments that newer transformers versions pass in
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss

    def save_model(self, output_dir=None, _internal_call=False):
        # The model handed to Trainer is actually a PeftModel, so save_pretrained uses PeftModel's own
        # saving method and only the LoRA weights are saved
        self.model.save_pretrained(output_dir)
        # from transformers.trainer import TRAINING_ARGS_NAME
        # os.makedirs(output_dir, exist_ok=True)
        # torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        # saved_params = {
        #     k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        # }
        # torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))


def main():
    writer = SummaryWriter()
    finetune_args, training_args = HfArgumentParser(
        (FinetuneArguments, TrainingArguments)
    ).parse_args_into_dataclasses()

    # load dataset
    dataset = datasets.load_from_disk('data/tokenized_data/' + finetune_args.tokenized_dataset)
    # dataset = dataset.select(range(10000))
    print(f"\n{len(dataset)=}\n")

    # init model
    model = AutoModelForCausalLM.from_pretrained(
        model_checkpoint, load_in_8bit=False, trust_remote_code=True,
        device_map="auto"  # the model's layers are automatically placed on different GPUs
        # device_map={'':torch.cuda.current_device()}  # this setting was buggy for me: even the small Baichuan model ran out of memory on an 80G card; switching to "auto" fixed it immediately
    )
    print(model.hf_device_map)

    # .gradient_checkpointing_enable()
    # .enable_input_require_grads()
    # .is_parallelizable
    # These three are functions/attributes of transformers models (see transformers/modeling_utils.py)
    model.gradient_checkpointing_enable()
    # note: use gradient checkpointing to save memory at the expense of slower backward pass.
    model.enable_input_require_grads()
    # note: Enables the gradients for the input embeddings. This is useful for fine-tuning adapter weights while keeping the model weights fixed.
    # See https://github.com/huggingface/transformers/blob/ee88ae59940fd4b2c8fc119373143d7a1175c651/src/transformers/modeling_utils.py#L1190
    # model.is_parallelizable = True
    # note: A flag indicating whether this model supports model parallelization.
    # Setting it to True may turn on model parallelism and turn off data parallelism, splitting one model across several GPUs
    # TODO: oddly, the model is still model-parallel after setting this to False. Why?
    # model.model_parallel = True
    model.lm_head = CastOutputToFloat(model.lm_head)
    # model.config.use_cache = (
    #     False  # silence the warnings. Please re-enable for inference!
    # )

    # setup peft
    if finetune_args.previous_lora_weights is None:
        peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            inference_mode=False,
            r=finetune_args.lora_rank,
            lora_alpha=32,
            lora_dropout=0.1,
            target_modules=["W_pack"]  # print the model and look for the attention-related modules
        )
        model = get_peft_model(model, peft_config)
    else:
        model = PeftModel.from_pretrained(model, finetune_args.previous_lora_weights)
        # see: https://github.com/huggingface/peft/issues/184
        for name, param in model.named_parameters():
            if 'lora' in name or 'Lora' in name:
                param.requires_grad = True

    # start train
    # adapter_config.json is only generated by save_pretrained, so write one up front;
    # that way intermediate checkpoints can be tried before training finishes
    model.save_pretrained(training_args.output_dir)
    trainer = ModifiedTrainer(
        model=model,
        train_dataset=dataset,
        args=training_args,
        callbacks=[TensorBoardCallback(writer)],
        data_collator=data_collator,
    )
    trainer.train()
    writer.close()
    # save model
    model.save_pretrained(training_args.output_dir)


if __name__ == "__main__":
    main()
(2) How to load the trained LoRA and use the large model
# Check whether Baichuan has gained chat ability after training:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained("../llm/Baichuan2-7B-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("../llm/Baichuan2-7B-Base", device_map="auto", trust_remote_code=True)
# load LoRA:
### sft model
# model = PeftModel.from_pretrained(model, "weights/hc3_chatgpt_zh_specific_qa_baichuan-7B-1")
### rlhf model
model = PeftModel.from_pretrained(model, "/data/intern/LLM-Tuning-master/weightsstep_200")


def chat(text):
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # "问:" and "答:" are added to match the training data I constructed, which guides the model to answer
    inputs = tokenizer("问:" + text + "答:", return_tensors='pt')
    inputs = inputs.to('cuda:0')
    output = model.generate(**inputs, max_new_tokens=1024, repetition_penalty=1.1, streamer=streamer)
    # print(output[0])


def main():
    chat("你是谁?")  # "Who are you?"


if __name__ == "__main__":
    main()
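Since the prompt template has to match the one used to build the training data, it can help to keep it in one place. The helper below is a small sketch of that idea; `build_prompt` and its `separator` parameter are hypothetical names, not part of LLM-Tuning. The reward-model and PPO scripts later in this post put "\n\n" before "答:", so the separator is left explicit.

```python
def build_prompt(question, separator=""):
    """Wrap a user question in the 问:/答: template used by the scripts in this post."""
    return "问:" + question + separator + "答:"


# Same string that chat() builds
print(build_prompt("你是谁?"))
# Variant with a blank line before the answer marker
print(repr(build_prompt("你是谁?", separator="\n\n")))
```

Whichever variant is chosen, using the same one for SFT data, reward-model data, PPO queries, and inference avoids a silent mismatch between stages.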
(3) Reward model training code
"""
Mainly copied from https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/reward_modeling.py
Some changes:
- dataset preprocessing
- hyper-params
- Trainer: modify the save_model func, to only save the LoRA weights
"""
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import os
import evaluate
import numpy as np
import torch
import torch.nn as nn
from datasets import load_dataset,load_from_disk
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
HfArgumentParser,
PreTrainedTokenizerBase,
Trainer,
TrainerCallback,
TrainingArguments,
)
from transformers.utils import PaddingStrategy
from transformers.trainer import TRAINING_ARGS_NAME
from modeling_baichuan_for_cls import BaichuanForSequenceClassification
from sklearn.metrics import accuracy_score


# Define and parse arguments.
@dataclass
class ScriptArguments:
    """
    These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
    """

    local_rank: Optional[int] = field(default=-1, metadata={"help": "Used for multi-gpu"})
    resume_from_checkpoint: Optional[bool] = field(
        default=False,
        metadata={"help": "If you want to resume training where it left off."},
    )
    deepspeed: Optional[str] = field(
        default=None,
        metadata={
            "help": "Path to deepspeed config if using deepspeed. You may need this if the model that you want to train doesn't fit on a single GPU."
        },
    )
    per_device_train_batch_size: Optional[int] = field(default=4)
    per_device_eval_batch_size: Optional[int] = field(default=1)
    gradient_accumulation_steps: Optional[int] = field(default=1)
    learning_rate: Optional[float] = field(default=2e-5)
    weight_decay: Optional[float] = field(default=0.001)
    model_name: Optional[str] = field(
        default="gpt2",
        metadata={
            "help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
        },
    )
    lora_target_models: Optional[str] = field(
        default=None,
        metadata={
            "help": "target modules for LoRA config, join the names with '|', e.g. 'module1|module2'"
        },
    )
    tokenizer_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The tokenizer for your model, if left empty will use the default for your model",
        },
    )
    bf16: Optional[bool] = field(
        default=True,
        metadata={
            "help": "This essentially cuts the training time in half if you want to sacrifice a little precision and have a supported GPU."
        },
    )
    num_train_epochs: Optional[int] = field(
        default=1,
        metadata={"help": "The number of training epochs for the reward model."},
    )
    eval_steps: Optional[int] = field(
        default=500,
        metadata={"help": "eval_steps"},
    )
    save_steps: Optional[int] = field(
        default=500,
        metadata={"help": "save_steps"},
    )
    train_subset: Optional[int] = field(
        default=100000,
        metadata={"help": "The size of the subset of the training data to use"},
    )
    eval_subset: Optional[int] = field(
        default=50000,
        metadata={"help": "The size of the subset of the eval data to use"},
    )
    gradient_checkpointing: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables gradient checkpointing."},
    )
    optim: Optional[str] = field(
        default="adamw_hf",
        metadata={"help": "The optimizer to use."},
    )
    lr_scheduler_type: Optional[str] = field(
        default="linear",
        metadata={"help": "The lr scheduler"},
    )
    max_length: Optional[int] = field(default=512)
    eval_first_step: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether to run eval after the first step"},
    )


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]

# Load the human stack-exchange-paired dataset for tuning the reward model.
# train_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/reward", split="train")
# if script_args.train_subset > 0:
# train_dataset = train_dataset.select(range(script_args.train_subset))
# eval_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/evaluation", split="train")
# if script_args.eval_subset > 0:
# eval_dataset = eval_dataset.select(range(script_args.eval_subset))
# question_key = "question"
# good_key = "response_j"
# bad_key = "response_k"
# model_name_split = script_args.model_name.split("/")[-1]
# output_name = (
# f"{model_name_split}_peft_stack-exchange-paired_rmts__{script_args.train_subset}_{script_args.learning_rate}"
# )

# Define the section below yourself according to your dataset:
# load the reward dataset
# - `beyond/rlhf-reward-single-round` for English
# - `beyond/rlhf-reward-single-round-trans_chinese` for Chinese
# reward_dataset = load_from_disk('/data/intern/LLM-Tuning-master/data/rlhf-reward-single-round-trans_chinese')
# train_dataset = reward_dataset['train']
# eval_dataset = reward_dataset['test']
# train_dataset= load_dataset("json", data_files="/data/liudianqing/corpus/rlhf/beyond_trans/train.json",split="train", cache_dir=None)
# eval_dataset= load_dataset("json", data_files="/data/liudianqing/corpus/rlhf/beyond_trans/test.json", split="train",cache_dir=None)

train_dataset = load_dataset("parquet", data_files="/data/intern/LLM-Tuning-master/data/rlhf-reward-single-round-trans_chinese/train-00000-of-00001-789dc5dece0f1fc1.parquet", split="train", cache_dir=None)
eval_dataset = load_dataset("parquet", data_files="/data/intern/LLM-Tuning-master/data/rlhf-reward-single-round-trans_chinese/test-00000-of-00001-8ecd46436fadcf7f.parquet", split="train", cache_dir=None)

if script_args.train_subset > 0:
    train_dataset = train_dataset.select(range(script_args.train_subset))
if script_args.eval_subset > 0:
    eval_dataset = eval_dataset.select(range(script_args.eval_subset))
# In this dataset, the chosen field holds the better reply and rejected holds the worse one
question_key = "prompt"
good_key = 'chosen'
bad_key = 'rejected'
model_name_split = script_args.model_name.split("/")[-1]
output_name = (
f"../weights/{model_name_split}_beyond_reward_chinese_{script_args.train_subset}"
)

# Define the training args. Needs to be done before the model is loaded if you are using deepspeed.
training_args = TrainingArguments(
output_dir=output_name,
learning_rate=script_args.learning_rate,
per_device_train_batch_size=script_args.per_device_train_batch_size,
per_device_eval_batch_size=script_args.per_device_eval_batch_size,
num_train_epochs=script_args.num_train_epochs,
weight_decay=script_args.weight_decay,
evaluation_strategy="steps",
eval_steps=script_args.eval_steps,
save_strategy="steps",
save_steps=script_args.save_steps,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
gradient_checkpointing=script_args.gradient_checkpointing,
deepspeed=script_args.deepspeed,
local_rank=script_args.local_rank,
remove_unused_columns=False,
label_names=[],
bf16=script_args.bf16,
logging_strategy="steps",
logging_steps=10,
optim=script_args.optim,
lr_scheduler_type=script_args.lr_scheduler_type,
report_to="wandb", #"none"
save_total_limit = 5
)
# Load the value-head model and tokenizer.
tokenizer_name = script_args.tokenizer_name if script_args.tokenizer_name is not None else script_args.model_name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_auth_token=True,trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=None if script_args.lora_target_models is None else script_args.lora_target_models.split('|')
)

# model = AutoModelForSequenceClassification.from_pretrained(
# script_args.model_name, num_labels=1, torch_dtype=torch.bfloat16,trust_remote_code=True
# )
model = BaichuanForSequenceClassification.from_pretrained(
script_args.model_name, num_labels=1, torch_dtype=torch.bfloat16,trust_remote_code=True,
device_map="auto"
)
print(model.hf_device_map)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Need to do this for gpt2, because it doesn't have an official pad token.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
model.config.use_cache = not script_args.gradient_checkpointing
num_proc = 24  # Can adjust to be higher if you have more processors.
original_columns = train_dataset.column_names


# Turn the dataset into pairs of post + summaries, where text_j is the preferred question + answer and text_k is the other.
# Then tokenize the dataset.
def preprocess_function(examples):
    new_examples = {
        "input_ids_j": [],
        "attention_mask_j": [],
        "input_ids_k": [],
        "attention_mask_k": [],
    }
    for question, response_j, response_k in zip(examples[question_key], examples[good_key], examples[bad_key]):
        # "Question: " and "\n\nAnswer: " serve as the template here; replace them to suit your own model.
        # The template must match the one used in the SFT stage.
        # tokenized_j = tokenizer("Question: " + question + "\n\nAnswer: " + response_j, truncation=True)
        # tokenized_k = tokenizer("Question: " + question + "\n\nAnswer: " + response_k, truncation=True)
        # Chinese dataset:
        tokenized_j = tokenizer("问:" + question + "\n\n答:" + response_j, truncation=True)
        tokenized_k = tokenizer("问:" + question + "\n\n答:" + response_k, truncation=True)
        new_examples["input_ids_j"].append(tokenized_j["input_ids"])
        new_examples["attention_mask_j"].append(tokenized_j["attention_mask"])
        new_examples["input_ids_k"].append(tokenized_k["input_ids"])
        new_examples["attention_mask_k"].append(tokenized_k["attention_mask"])

    return new_examples


# preprocess the dataset and filter out QAs that are longer than script_args.max_length
train_dataset = train_dataset.map(
preprocess_function,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
train_dataset = train_dataset.filter(
lambda x: len(x["input_ids_j"]) <= script_args.max_length and len(x["input_ids_k"]) <= script_args.max_length
)

eval_dataset = eval_dataset.map(
preprocess_function,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
eval_dataset = eval_dataset.filter(
lambda x: len(x["input_ids_j"]) <= script_args.max_length and len(x["input_ids_k"]) <= script_args.max_length
)


# We need to define a special data collator that batches the data in our j vs k format.
# I think this is mainly for padding, since the default transformers data collator may not support inputs with this format and these field names
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        features_j = []
        features_k = []
        for feature in features:
            features_j.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            features_k.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        batch_j = self.tokenizer.pad(
            features_j,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch_k = self.tokenizer.pad(
            features_k,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = {
            "input_ids_j": batch_j["input_ids"],
            "attention_mask_j": batch_j["attention_mask"],
            "input_ids_k": batch_k["input_ids"],
            "attention_mask_k": batch_k["attention_mask"],
            "return_loss": True,
        }
        return batch


# Define the metric that we'll use for validation.
# accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, _ = eval_pred
    # Here, predictions is rewards_j and rewards_k.
    # We want to see how much of the time rewards_j > rewards_k.
    # How it is computed:
    # argmax gives the index of the larger value: 0 when rewards_j is larger, 1 when rewards_k is larger,
    # so the correct labels are all 0.
    #
    # Q: The model outputs a single score, so why can argmax be used here?
    # A: compute_loss below defines a new forward pass that takes two inputs and produces two outputs.
    #    Trainer stacks these two outputs, giving an array with two entries along axis=0, so argmax tells which one is larger.
    #    For details, see the parts of Trainer involving compute_loss/logits/training_step/prediction_step, and the _gather_and_numpify method.
    predictions = np.argmax(predictions, axis=0)
    labels = np.zeros(predictions.shape)
    # return accuracy.compute(predictions=predictions, references=labels)
    return {
        "accuracy": float(
            accuracy_score(labels, predictions, normalize=True, sample_weight=None)
        )
    }


class RewardTrainer(Trainer):
    # Define how to compute the reward loss. We use the InstructGPT pairwise logloss: https://arxiv.org/abs/2203.02155
    # **kwargs absorbs extra keyword arguments that newer transformers versions pass in
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
        rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss

    def save_model(self, output_dir=None, _internal_call=False):
        # The model handed to Trainer is actually a PeftModel, so save_pretrained uses PeftModel's own
        # saving method and only the LoRA weights are saved
        self.model.save_pretrained(output_dir)
        # os.makedirs(output_dir, exist_ok=True)
        # torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        # saved_params = {
        #     k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        # }
        # torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))


# Train the model, woohoo.
trainer = RewardTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=script_args.max_length),
)

if script_args.eval_first_step:

    class EvaluateFirstStepCallback(TrainerCallback):
        def on_step_end(self, args, state, control, **kwargs):
            if state.global_step == 1:
                control.should_evaluate = True

    trainer.add_callback(EvaluateFirstStepCallback())

trainer.train(script_args.resume_from_checkpoint)

print("Saving last checkpoint of the model")
model.save_pretrained(output_name + "_peft_last_checkpoint")
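The pairwise loss and the accuracy metric used above are easier to see on a few plain numbers. The sketch below re-implements both in pure Python for illustration only; `log_sigmoid`, `pairwise_loss`, and `pairwise_accuracy` are hypothetical helper names, and the reward values are made up.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_loss(rewards_j, rewards_k):
    """Mean of -log(sigmoid(r_j - r_k)), where r_j scores the chosen reply and r_k the rejected one."""
    losses = [-log_sigmoid(rj - rk) for rj, rk in zip(rewards_j, rewards_k)]
    return sum(losses) / len(losses)


def pairwise_accuracy(rewards_j, rewards_k):
    """Fraction of pairs in which the chosen reply gets the higher score.

    Exact ties count as wrong here; np.argmax in the script above would
    resolve a tie in favor of index 0, i.e. count it as correct.
    """
    correct = sum(1 for rj, rk in zip(rewards_j, rewards_k) if rj > rk)
    return correct / len(rewards_j)


# Made-up scores for three (chosen, rejected) pairs
rewards_j = [2.0, 0.5, 1.0]
rewards_k = [1.0, 1.5, 0.0]

print(pairwise_accuracy(rewards_j, rewards_k))  # 2 of the 3 pairs are ranked correctly
print(pairwise_loss(rewards_j, rewards_k))

# If the model cannot tell the two replies apart (equal scores), the loss is log(2)
print(pairwise_loss([0.0], [0.0]), math.log(2))
```

The loss depends only on the difference r_j - r_k, so the absolute scale of the reward is not pinned down by training. That is worth keeping in mind when the scores are later passed through a sigmoid and used as PPO rewards.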
(4) The RLHF process and its code implementation
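Before the full script, a toy sketch may help explain why the KL settings (init_kl_coef, target_kl) matter. In PPO-based RLHF the policy is usually not rewarded with the raw reward-model score alone: each generated token is also penalized for drifting away from the reference (SFT) model. The function below is a simplified illustration of that idea under my own assumptions, not a copy of TRL's code; `kl_penalized_rewards` and the numbers are hypothetical.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef):
    """Per-token rewards for one generated response.

    logprobs / ref_logprobs: log-probabilities that the policy and the reference
    model assign to each generated token. Every token is penalized by
    kl_coef * (logprob - ref_logprob), and the reward-model score is added on
    the last token.
    """
    rewards = [-kl_coef * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
    if rewards:  # an empty response has no token that could carry the score
        rewards[-1] += score
    return rewards


# Made-up numbers: the policy is more confident than the reference on the first token only
logprobs = [-1.0, -2.0]
ref_logprobs = [-1.5, -2.0]
print(kl_penalized_rewards(logprobs, ref_logprobs, score=0.8, kl_coef=0.5))  # [-0.25, 0.8]
```

A larger kl_coef makes it more costly for the policy to move away from the SFT model, which is the knob being turned when init_kl_coef is raised in the script.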
"""
Mainly copied from https://github.com/lvwerra/trl/blob/main/examples/stack_llama/scripts/rl_training.py
Some changes:
"""
from dataclasses import dataclass, field
from typing import Optional

import torch
from accelerate import Accelerator
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, PeftModel, PeftConfig
from tqdm import tqdm
from transformers import Adafactor, AutoTokenizer, HfArgumentParser, pipeline, AutoModelForCausalLM

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, set_seed, PreTrainedModelWrapper
from trl.core import LengthSampler

tqdm.pandas()


@dataclass
class ScriptArguments:
    """
    The name of the Causal LM model we wish to fine-tune with PPO
    """

    # NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode
    # models like gpt-neo* models are more suitable.
    # model_name: Optional[str] = field(default="", metadata={"help": "the model name"})
    base_model_name: Optional[str] = field(default="", metadata={"help": "the base model name/path"})
    merged_sft_model_path: Optional[str] = field(default="", metadata={"help": "merged_sft_model_path"})
    # tokenizer_name: Optional[str] = field(default="", metadata={"help": "the tokenizer name"})
    sft_model_lora_path: Optional[str] = field(default="", metadata={"help": "the SFT model LoRA path"})
    reward_model_lora_path: Optional[str] = field(default="", metadata={"help": "the Reward model LoRA path"})
    # reward_model_name: Optional[str] = field(default="", metadata={"help": "the reward model name"})
    log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    output_max_length: Optional[int] = field(default=128, metadata={"help": "maximum length for generation"})
    mini_batch_size: Optional[int] = field(default=1, metadata={"help": "the PPO minibatch size"})
    batch_size: Optional[int] = field(default=32, metadata={"help": "the batch size"})
    ppo_epochs: Optional[int] = field(default=4, metadata={"help": "the number of ppo epochs"})
    gradient_accumulation_steps: Optional[int] = field(
        default=4, metadata={"help": "the number of gradient accumulation steps"}
    )
    adafactor: Optional[bool] = field(default=False, metadata={"help": "whether to use the adafactor optimizer"})
    early_stopping: Optional[bool] = field(default=False, metadata={"help": "whether to early stop"})
    target_kl: Optional[float] = field(default=0.1, metadata={"help": "kl target for early stopping"})
    reward_baseline: Optional[float] = field(
        default=0.0,
        metadata={"help": "a baseline value that is subtracted from the reward"},
    )
    batched_gen: Optional[bool] = field(default=False, metadata={"help": "whether to use the batched text gen"})
    save_freq: Optional[int] = field(default=None, metadata={"help": "n steps to save the model"})
    output_dir: Optional[str] = field(default="runs/", metadata={"help": "the output directory"})
    seed: Optional[int] = field(default=0, metadata={"help": "the seed"})
    steps: Optional[int] = field(default=20000, metadata={"help": "number of training steps"})
    init_kl_coef: Optional[float] = field(
        default=0.5,
        metadata={"help": "Initial KL penalty coefficient (used for adaptive and linear control)"},
    )

    adap_kl_ctrl: Optional[bool] = field(default=False, metadata={"help": "Use adaptive KL control, otherwise linear"})


parser = HfArgumentParser(ScriptArguments)
script_args: ScriptArguments = parser.parse_args_into_dataclasses()[0]
# reward_model_name = script_args.reward_model_name

# train_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/rl", split="train")
# train_dataset = load_from_disk('../data/rlhf-reward-single-round-trans_chinese', split='train')
# train_dataset = train_dataset.select(range(100000))

tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name, trust_remote_code=True)
# GPT-2 tokenizer has a pad token, but it is not eos_token by default. We need to set it to eos_token.
# only for this model.
# tokenizer.pad_token = tokenizer.eos_token
if getattr(tokenizer, "pad_token", None) is None:
    tokenizer.pad_token = tokenizer.eos_token

# training dataset
# /data/intern/LLM-Tuning-master/data/rlhf-reward-single-round-trans_chinese
# dataset = load_from_disk('../data/rlhf-reward-single-round-trans_chinese')
# dataset = dataset['train']
dataset = load_dataset("parquet", data_files="/data/intern/LLM-Tuning-master/data/rlhf-reward-single-round-trans_chinese/train-00000-of-00001-789dc5dece0f1fc1.parquet", split="train", cache_dir=None)
original_columns = dataset.column_names
num_proc = 24


def preprocess_function(examples):
    new_examples = {
        "query": [],
        "input_ids": [],
    }
    # for question in examples["question"]:
    #     query = "Question: " + question + "\n\nAnswer: "
    #     tokenized_question = tokenizer(query, truncation=True)
    #     new_examples["query"].append(query)
    #     new_examples["input_ids"].append(tokenized_question["input_ids"])

    # rlhf-reward-single-round-trans_chinese:
    for question in examples["prompt"]:
        query = "问:" + question + "\n\n答:"
        tokenized_question = tokenizer(query, truncation=True)
        new_examples["query"].append(query)
        new_examples["input_ids"].append(tokenized_question["input_ids"])
    return new_examples


dataset = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=original_columns,
)
dataset = dataset.filter(lambda x: len(x["input_ids"]) < 512, batched=False)
dataset.set_format(type="torch")


def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])


config = PPOConfig(
steps=script_args.steps,
model_name=script_args.merged_sft_model_path,  # has no real effect; the corresponding model is not loaded from this
learning_rate=script_args.learning_rate,
log_with=script_args.log_with,
batch_size=script_args.batch_size,
mini_batch_size=script_args.mini_batch_size,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
optimize_cuda_cache=True,
early_stopping=script_args.early_stopping,
target_kl=script_args.target_kl,
ppo_epochs=script_args.ppo_epochs,
seed=script_args.seed,
init_kl_coef=script_args.init_kl_coef,
adap_kl_ctrl=script_args.adap_kl_ctrl,
)

# set seed before initializing value head for deterministic eval
set_seed(config.seed)

# Now let's build the model, the reference model, and the tokenizer.
current_device = Accelerator().local_process_index
print('Loading base model for ppo training...')

"""
Below is the original StackLLaMA implementation: it adds a new LoRA on top of a model that has already merged the SFT LoRA, which is rather laborious.
lora_config = LoraConfig(
r=8,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=['W_pack']
)
print('Loading base model for ppo training...')
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
load_in_8bit=False,
device_map="auto",
# device_map={"": current_device},
peft_config=lora_config,
trust_remote_code=True
)
"""
# Below, this is changed to an approach that needs no merge: keep training directly on top of the SFT LoRA.

# load the base model
base_model_for_PPO = AutoModelForCausalLM.from_pretrained(
script_args.base_model_name,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map='auto'
)
# install the lora modules
base_model_for_PPO_with_sft_lora = PeftModel.from_pretrained(
base_model_for_PPO,
script_args.sft_model_lora_path
)
# wrap with the AutoModelForCausalLMWithValueHead wrapper
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(
base_model_for_PPO_with_sft_lora
)
# make the lora modules trainable
for name, param in ppo_model.named_parameters():
    if 'lora' in name:
        param.requires_grad = True

optimizer = None
if script_args.adafactor:
    optimizer = Adafactor(
        filter(lambda p: p.requires_grad, ppo_model.parameters()),
        scale_parameter=False,
        relative_step=False,
        warmup_init=False,
        lr=config.learning_rate,
    )
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
script_args.merged_sft_model_path,
trust_remote_code=True
)

ppo_trainer = PPOTrainer(
config,
ppo_model, # model with value head
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset,
data_collator=collator,
optimizer=optimizer,
)

"""
# The code below merges the reward model directly into the base model and then loads it through a pipeline.
# I want the reward model to stay in LoRA form, so that approach is not used here.
# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a ` pipeline` bug
sentiment_pipe = pipeline(
"sentiment-analysis",
model=reward_model_name,
device_map={"": current_device},
model_kwargs={"load_in_8bit": True},
tokenizer=tokenizer,
return_token_type_ids=False,
)
"""
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a ` pipeline` bug

from modeling_baichuan_for_cls import BaichuanForSequenceClassification
from peft import PeftModel
print('Loading base model for reward model...')
base_model_for_RM = BaichuanForSequenceClassification.from_pretrained(
script_args.base_model_name, num_labels=1,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto",
# device_map={"": current_device},
)
reward_model = PeftModel.from_pretrained(base_model_for_RM, script_args.reward_model_lora_path)
# Next we need a function that returns the reward value
def get_reward_value(texts):
    output = reward_model(**tokenizer(texts, return_tensors='pt', padding=True, truncation=True))
    scores = torch.sigmoid(output.logits).view(-1).tolist()
    return scores


# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
generation_kwargs = {
# "min_length": -1,
# "top_k": 0.0,
# "top_p": 0.95,
"repetition_penalty": 1.1,
# "do_sample": True,
"do_sample": False,
"begin_suppress_tokens": [tokenizer.eos_token_id],
# "remove_invalid_values": True,
# "pad_token_id": tokenizer.pad_token_id,
# "eos_token_id": tokenizer.eos_token_id,
"max_new_tokens": 512
# "eos_token_id": 100_000, # why?
}
output_min_length = 32
output_max_length = script_args.output_max_length
output_length_sampler = LengthSampler(output_min_length, output_max_length)

# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the score for every label.
sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16,
    "truncation": True,
}

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if epoch >= config.total_ppo_epochs:
        break

    question_tensors = batch["input_ids"]

    try:
        """
        The generate step frequently raises a strange error:
            RuntimeError: probability tensor contains either inf, nan or element < 0
        It mostly shows up when model.generate is called with do_sample=True, but it
        is random and can occur at different iterations.
        With do_sample turned off the error does not occur.
        Possibly useful issues:
            https://github.com/huggingface/transformers/issues/15169
            https://github.com/huggingface/transformers/issues/23413
            https://github.com/huggingface/transformers/issues/22914

        Possible workarounds so far:
        1. Disable random sampling: do_sample=False. This almost never errors, but it
           seems to hurt PPO performance.
        2. Keep do_sample=True and also set remove_invalid_values=True. This still
           errors, and after the error the model appears to have collapsed: it keeps
           producing inf and nan.

        Update:
        The cause seems to be that after some iterations the model starts producing
        empty outputs that nevertheless receive a large reward, so the model learns
        worse and worse behaviour, collapses, and outputs nothing from then on.
        - Baichuan-base was previously recommended to run with repetition_penalty=1.1.
          It was not set earlier, so outputs repeated easily, and such outputs could
          still score highly. The same setting is now used here. It helps somewhat,
          but the model still degrades later on.
          Further observation: once a single empty reply received a very high reward
          (0.8, while the others were around 0.6), the next generation failed.
        - Lowering the learning rate from 1.4e-5 to 1e-5 seems to help a little. It
          delays the collapse, but replies gradually get shorter until the model
          outputs nothing, a slow death.
        - Raising init_kl_coef from 0.2 to 0.5 did not help either.
        - Setting begin_suppress_tokens to forbid generating the eos token at the
          start is the most effective fix so far. The model basically no longer
          collapses.

        The underlying problem appears to be that the reward model is too weak: it
        assigns high rewards to certain kinds of bad output, so the policy keeps
        getting worse and then collapses. The key is probably the quality of the
        reward model.
        """
        response_tensors = ppo_trainer.generate(
            question_tensors,
            return_prompt=False,
            # length_sampler=output_length_sampler,  # set either this or max_new_tokens in generation_kwargs, not both
            **generation_kwargs,
        )
        batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

        # Compute sentiment score
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        """The two lines below use the pipeline instead, but that approach is not used here.
        pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
        rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]
        """
        scores = get_reward_value(texts)
        rewards = [torch.tensor(score - script_args.reward_baseline) for score in scores]
        for q, r, s in zip(batch["query"], batch["response"], scores):
            print(epoch, 'query:', q)
            print('response:', r)
            print('score:', s)

        # Run PPO step
        stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

        if script_args.save_freq and epoch and epoch % script_args.save_freq == 0:
            ppo_trainer.save_pretrained(script_args.output_dir + f"step_{epoch}")

    except Exception as e:
        # Print the failing batch for debugging, then stop training.
        print('---------------------')
        print(e)
        print(epoch)
        print(question_tensors)
        print('---------------------')
        break
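
The notes inside the training loop describe a failure mode in which an empty reply receives a high reward and the policy then collapses. One possible guard, which is not part of the original script, is to override the reward for empty replies before calling the PPO step. The sketch below is only an illustration under that assumption: the function name `shape_rewards` and the parameters `min_chars` and `empty_penalty` are hypothetical, and it works on plain Python floats so that it runs without a GPU or a model.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw reward-model logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def shape_rewards(
    responses: List[str],
    logits: List[float],
    reward_baseline: float = 0.0,
    min_chars: int = 1,            # hypothetical: replies shorter than this count as empty
    empty_penalty: float = -1.0,   # hypothetical: fixed reward given to empty replies
) -> List[float]:
    """Return one reward per response.

    Non-empty replies get sigmoid(logit) - reward_baseline, mirroring the
    score-minus-baseline computation in the training loop. Empty or
    whitespace-only replies get a fixed penalty regardless of the logit,
    so a weak reward model cannot reward them.
    """
    rewards = []
    for response, logit in zip(responses, logits):
        if len(response.strip()) < min_chars:
            rewards.append(empty_penalty)
        else:
            rewards.append(sigmoid(logit) - reward_baseline)
    return rewards


if __name__ == "__main__":
    demo = shape_rewards(["a normal reply", "   ", ""], [0.0, 5.0, 5.0], reward_baseline=0.5)
    print(demo)
```

In the loop above, the shaped values could be wrapped with `torch.tensor(...)` in place of the current `rewards` list. This only masks the symptom; as the notes conclude, the more fundamental fix is likely a better reward model.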