[Hands-on LLM] BenTsao (本草) / HuaTuo (华驼): Fine-tuning LLaMA with Chinese Medical Knowledge
Background
https://arxiv.org/pdf/2304.06975.pdf
This is an applied, experimental paper from Harbin Institute of Technology on using Chinese medical knowledge to fine-tune the (English) LLaMA model.
My main goals in reading it:
1. The internal details: how knowledge from another domain is injected into LLaMA through fine-tuning.
2. How the English LLaMA can quickly be made to absorb information in another language (e.g., Chinese).
3. How structured and unstructured knowledge can be introduced into a large model like LLaMA.
Hua Tuo rides the camel; the camel carries Hua Tuo.
And if Hua Tuo rode an alpaca instead, with the alpaca carrying Hua Tuo, the picture would be even better...
A big thank-you to the authors for open-sourcing both the LoRA-based finetune code and the inference code.
For background on LoRA, see my earlier post: 迷途小书僮: [速读经典]LoRA-给大语言模型做Low-Rank低秩改造 (a quick read of the LoRA paper on low-rank adaptation of large language models).
Today the focus is mainly on reading the code.
I forked my own copy of the repo:
Setup
Personally I'm using an NVIDIA NeMo Docker image:
https://github.com/NVIDIA/NeMo/blob/main/Dockerfile
And then I promptly got stuck on the bitsandbytes package... The reason: I have two CUDA installations,
cuda-11 and cuda-12:
root@a7034605291e:/usr/local# ls -l
total 144
drwxr-xr-x 2 root root 4096 Feb 8 07:39 __pycache__
drwxr-xr-x 1 root root 4096 Apr 27 11:14 bin
lrwxrwxrwx 1 root root 22 Feb 2 05:14 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root 25 Feb 2 05:16 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 4 root root 4096 Feb 2 05:17 cuda-11.8.2
lrwxrwxrwx 1 root root 25 Feb 2 05:14 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Feb 3 22:11 cuda-12.0
The problem: bitsandbytes kept auto-detecting cuda-11 and then tried to load its
libcusparse.so.11, which failed with a file-not-found error.
A quick check confirmed that the only copy available is:
./cuda-12.0/targets/x86_64-linux/lib/libcusparse.so.12
i.e., just this one libcusparse.so.12.
Searching online for a long time didn't help.
So I resorted to a crude workaround:
renaming the cuda-11.8 directory to cuda-11.8.2, after which bitsandbytes could no longer auto-detect CUDA 11.8. Peace at last.
A quick test confirms it works:
root@a7034605291e:/usr/local# python -m bitsandbytes
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning:
WARNING: The following directories listed in your path were found to be non-existent:
{PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/cuda-11/lib64')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 120
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/lib/python3.8/dist-packages/bitsandbytes-0.38.1-py3.8.egg/bitsandbytes/libbitsandbytes_cuda120.so
/usr/local/lib/python3.8/dist-packages/bitsandbytes-0.38.1-py3.8.egg/bitsandbytes/libbitsandbytes_cuda118.so
/usr/local/cuda-12.0/compat/lib.real/libcuda.so
++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
/usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib CUDA PATHS
++++++++++ /usr/local/cuda/compat/lib CUDA PATHS +++++++++++
+ /usr/local/cuda-12.0/targets/x86_64-linux/lib CUDA PATHS +
/usr/local/cuda-12.0/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
/usr/local/lib/python3.8/dist-packages/torch/lib CUDA PATHS
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_linalg.so
/usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.0', '8.0', '8.0', '8.0', '8.0', '8.0', '8.0', '8.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Running a quick check that:
+ library is importable
+ CUDA function is callable
WARNING: Please be sure to sanitize sensible info from any such env vars!
SUCCESS!
Installation was successful!
Finetuning with LoRA
Now for the LoRA-based fine-tuning. Honestly, reading this was exciting, because it's the first time I've seen actual LoRA fine-tuning code. Going in, I was still fuzzy on the details of:
- the network structure [new A/B matrix pairs are added on top of the W_q, W_k, W_v linear layers, details below. Question: how many parameters do the new LoRA modules add? -> It turns out only W_q and W_v are adapted...];
- the tokenizer and vocabulary (the fine-tuning data is Chinese while the checkpoint is the original English LLaMA. Question: is the vocabulary extended with Chinese tokens? -> No);
- the loss function [cross-entropy. Question: how exactly is it computed? -> exactly the same way as in ordinary transformer-decoder training].
So, let the magical journey begin~~
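Before diving in, a quick refresher on the LoRA idea itself, written as a minimal self-contained toy sketch (my own illustration, not the peft implementation): the pretrained weight W stays frozen, and a low-rank bypass B·A, scaled by alpha/r, is added on top. A is initialized like a regular nn.Linear and B is initialized to zeros, so at step 0 the adapter contributes nothing.

import math
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    # Toy LoRA wrapper: y = W x + (alpha / r) * B(A(dropout(x)))
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base                                             # frozen pretrained W
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)     # e.g. 4096 -> 8
        self.lora_B = nn.Linear(r, base.out_features, bias=False)    # e.g. 8 -> 4096
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / r
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))  # nn.Linear's default scheme
        nn.init.zeros_(self.lora_B.weight)                            # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

layer = LoRALinearSketch(nn.Linear(4096, 4096, bias=False))
print(layer(torch.randn(1, 4096)).shape)   # torch.Size([1, 4096])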
Finetuning entry point
root@a7034605291e:/workspace/asr/Huatuo-Llama-Med-Chinese# bash scripts/finetune.sh
/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'ipdb.__main__' found in sys.modules after import of package 'ipdb',
but prior to execution of 'ipdb.__main__'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
> /workspace/asr/Huatuo-Llama-Med-Chinese/finetune.py(1)<module>()
----> 1 import os
2 import sys
3 from typing import List
Here I only focus on the core code logic.
I'm using an A100-80GB card:
ipdb> c
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING:
The following directories listed in your path were found to be non-existent:
{PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/cuda-11/lib64')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 120
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so...
> /workspace/asr/Huatuo-Llama-Med-Chinese/finetune.py(61)train()
60 import ipdb; ipdb.set_trace()
---> 61 if int(os.environ.get("LOCAL_RANK", 0)) == 0:
62 print(
Loading the 33 checkpoint shards:
ipdb> c
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./data/llama_data.json
output_dir: ./lora-llama-med-e1
batch_size: 128
micro_batch_size: 128
num_epochs: 10
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 500 # number of samples in the validation set
lora_r: 8 # LoRA rank, i.e. 4096 -> 8 and then 8 -> 4096
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: False
group_by_length: False
wandb_project: llama_med
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: med_template
Loading checkpoint shards: 100%|████████████████████████████████
███████████████| 33/33 [00:09<00:00, 3.63it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is
called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
/usr/local/lib/python3.8/dist-packages/peft/utils/other.py:76:
FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version.
Use prepare_model_for_kbit_training instead.
warnings.warn(
> /workspace/asr/Huatuo-Llama-Med-Chinese/finetune.py(182)train()
181 import ipdb; ipdb.set_trace()
--> 182 model = get_peft_model(model, config) # NOTE TODO
The model before PEFT
ipdb> model
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=31999)
(layers): ModuleList(
(0): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear8bitLt(in_features=11008, out_features=4096, bias=False)
(up_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(act_fn): SiLUActivation()
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
... ...
(31): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear8bitLt(in_features=11008, out_features=4096, bias=False)
(up_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(act_fn): SiLUActivation()
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
(norm): LlamaRMSNorm()
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
That's 32 transformer "decoder" layers in total (calling them decoders is a bit loose, since there is no cross-attention... each is more like a hybrid of an encoder layer and a decoder layer: self-attention with a causal mask, but no cross-attention).
I've already walked through these architectural details carefully before:
ipdb> sum([p.numel() for p in model.parameters()])
6738415616
ipdb> sum([p.numel() for p in model.parameters() if p.requires_grad])
So the total parameter count is 6,738,415,616 (about 6.7 billion), while the number of trainable parameters is 0 (all base-model weights were frozen by the int8 preparation step).
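As a sanity check (my own arithmetic, derived from the architecture dump above), the 6.7B figure decomposes exactly as follows:

# Back-of-the-envelope count for LLaMA-7B, matching the 6,738,415,616 printed above.
vocab, d, ffn, layers = 32000, 4096, 11008, 32
embed   = vocab * d            # token embedding table
attn    = 4 * d * d            # q/k/v/o projections per layer
mlp     = 3 * d * ffn          # gate/up/down projections per layer
norms   = 2 * d                # two RMSNorms per layer
lm_head = vocab * d
total = embed + layers * (attn + mlp + norms) + d + lm_head   # "+ d" is the final RMSNorm
print(total)                   # 6738415616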
get_peft_model
> /usr/local/lib/python3.8/dist-packages/peft/mapping.py(104)get_peft_model()
This is the method that rewires the model, inserting the two kinds of low-rank LoRA matrices:
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>,
base_model_name_or_path=None, task_type='CAUSAL_LM',
inference_mode=False, r=8,
target_modules=['q_proj', 'v_proj'], lora_alpha=16,
lora_dropout=0.05, fan_in_fan_out=False, bias='none', modules_to_save=None,
init_lora_weights=True)
Note that only q_proj and v_proj are adapted~~ and this is configurable via target_modules, which is nice.
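For reference, here is a minimal sketch of how such a config is built and applied, roughly what finetune.py does (not a verbatim copy; the model name follows the base_model printed above):

from transformers import LlamaForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,        # gives the Linear8bitLt layers seen in the dump above
    device_map="auto",
)
model = prepare_model_for_int8_training(model)   # freezes the base weights

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # only W_q and W_v get LoRA adapters
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # ~4.2M trainable out of ~6.7B total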
Once inside the method, model_config has been expanded:
ipdb> model_config
{'vocab_size': 32000, 'max_position_embeddings': 2048, 'hidden_size': 4096,
'intermediate_size': 11008, 'num_hidden_layers': 32, 'num_attention_heads': 32,
'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-06,
'use_cache': True, 'return_dict': True, 'output_hidden_states': False,
'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float16',
'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {},
'tie_word_embeddings': False, 'is_encoder_decoder': False, 'is_decoder': False,
'cross_attention_hidden_size': None, 'add_cross_attention': False,
'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0,
'do_sample': False, 'early_stopping': False, 'num_beams': 1,
'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0,
'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0,
'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0,
'bad_words_ids': None, 'num_return_sequences': 1, 'chunk_size_feed_forward': 0,
'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None,
'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None,
'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['LLaMAForCausalLM'],
'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'},
'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'tokenizer_class': None, 'prefix': None,
'bos_token_id': 0, 'pad_token_id': -1, 'eos_token_id': 1, 'sep_token_id': None,
'decoder_start_token_id': None, 'task_specific_params': None, 'problem_type': None,
'_name_or_path': 'decapoda-research/llama-7b-hf', 'transformers_version': '4.28.1',
'max_sequence_length': 2048, 'model_type': 'llama', 'quantization_config': {'load_in_8bit': True,
'llm_int8_threshold': 6.0, 'llm_int8_skip_modules': None, 'llm_int8_enable_fp32_cpu_offload': False}}
It then calls:
--> 120 return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config)
ipdb> s
--Call--
> /usr/local/lib/python3.8/dist-packages/peft/peft_model.py(669)__init__()
--> 669 def __init__(self, model, peft_config: PeftConfig, adapter_name="default"):
670 super().__init__(model, peft_config, adapter_name)
A more nicely formatted view of the config:
> /usr/local/lib/python3.8/dist-packages/peft/peft_model.py(92)__init__()
91 self.config = self.base_model.config
---> 92 self.modules_to_save = None
93 self.peft_config = {}
ipdb> self.config
LlamaConfig {
"_name_or_path": "decapoda-research/llama-7b-hf",
"architectures": [
"LLaMAForCausalLM"
"bos_token_id": 0,
"eos_token_id": 1,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"max_sequence_length": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"pad_token_id": -1,
"quantization_config": {
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_8bit": true
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.28.1",
"use_cache": true,
"vocab_size": 32000
Next we enter lora.py, where the LoRA initialization begins.
As a recap, the config that LoRA receives:
{'default': LoraConfig(peft_type=<PeftType.LORA: 'LORA'>,
base_model_name_or_path='decapoda-research/llama-7b-hf',
task_type='CAUSAL_LM', inference_mode=False, r=8,
target_modules=['q_proj', 'v_proj'], lora_alpha=16, lora_dropout=0.05,
fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True)}
So it specifies LORA as the PEFT type and CAUSAL_LM as the task.
add_adapter
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(157)add_adapter()
--> 162 self._find_and_replace(adapter_name)
174 loaded_in_8bit = getattr(self.model, "is_loaded_in_8bit", False)
kwargs = {'r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'fan_in_fan_out': False, 'init_lora_weights': True}
As follows:
188 key_list = [key for key, _ in self.model.named_modules()]
This collects the names of all modules, 454 in total. For example:
... 'model.layers.31.self_attn.o_proj', 'model.layers.31.self_attn.rotary_emb',
'model.layers.31.mlp', 'model.layers.31.mlp.gate_proj', 'model.layers.31.mlp.down_proj',
'model.layers.31.mlp.up_proj', 'model.layers.31.mlp.act_fn', 'model.layers.31.input_layernorm',
'model.layers.31.post_attention_layernorm', 'model.norm', 'lm_head']
Now it has found a matching linear layer:
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(195)_find_and_replace()
194 if target_module_found:
1-> 195 if not is_target_modules_in_base_model:
196 is_target_modules_in_base_model = True
ipdb> key
'model.layers.0.self_attn.q_proj'
--> 197 parent, target, target_name = _get_submodules(self.model, key)
This fetches the parent module, the target module, and the target_name:
ipdb> parent
LlamaAttention(
(q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
ipdb> target
Linear8bitLt(in_features=4096, out_features=4096, bias=False)
ipdb> target_name
'q_proj'
With those, a new 8-bit linear layer (with LoRA attached) can be constructed:
Linear8bitLt
ipdb> eightbit_kwargs
{'r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'fan_in_fan_out': False, 'init_lora_weights': True,
'has_fp16_weights': False, 'memory_efficient_backward': False, 'threshold': 6.0, 'index': None}
ipdb> l
215 "memory_efficient_backward": target.state.memory_efficient_backward,
216 "threshold": target.state.threshold,
217 "index": target.index,
218 }
219 )
2-> 220 new_module = Linear8bitLt(
221 adapter_name, target.in_features, target.out_features, bias=bias, **eightbit_kwargs
222 )
Initialization:
Honestly, from the __init__ alone it's hard to tell what is being initialized where; I'll come back to the forward pass later.
For now, let's keep following the initialization:
> /usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py(243)__init__()
242 class Linear8bitLt(nn.Linear):
--> 243 def __init__(self, input_features, output_features, bias=True, has_fp16_weights=True,
244 memory_efficient_backward=False, threshold=0.0, index=None):
--> 256 self.weight = Int8Params( self.weight.data , has_fp16_weights=has_fp16_weights, requires_grad=has_fp16_weights)
This yields:
ipdb> self.weight
Parameter containing:
Parameter(Int8Params([[ 1.2290e-02, 1.1910e-02, -2.3133e-03, ..., 1.3266e-02,
1.0815e-02, 2.1360e-03],
[-7.0645e-03, 1.1203e-02, -1.2051e-02, ..., -1.2900e-02,
1.5057e-02, -4.0810e-03],
[ 9.3443e-03, 5.9047e-04, -1.3865e-02, ..., -9.7122e-03,
-1.2443e-02, -1.3335e-02],
...,
[ 1.9260e-05, 2.8005e-03, 5.7622e-03, ..., 1.1858e-02,
1.2653e-02, 6.4236e-03],
[-1.3693e-02, -8.0863e-03, -2.2774e-03, ..., 1.2377e-02,
-4.1139e-03, -3.0408e-03],
[-5.2547e-03, 1.3315e-02, -1.3418e-02, ..., -1.1736e-02,
9.4526e-03, -5.1278e-03]]))
ipdb> self.weight.shape
torch.Size([4096, 4096])
ipdb> type(self.weight)
<class 'bitsandbytes.nn.modules.Int8Params'>
So this is just a linear layer, in int8.
Moving on:
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(701)__init__()
700 )
--> 701 LoraLayer.__init__(self, in_features=in_features, out_features=out_features)
ipdb> s
--Call--
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(438)__init__()
437 class LoraLayer:
--> 438 def __init__(
439 self,
Here, lora_A and lora_B are defined:
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(447)__init__()
446 self.lora_dropout = nn.ModuleDict({})
--> 447 self.lora_A = nn.ModuleDict({})
448 self.lora_B = nn.ModuleDict({})
class LoraLayer
Here is the class in full:
class LoraLayer:
    def __init__(
        self,
        in_features: int,
        out_features: int,
    ):
        self.r = {}
        self.lora_alpha = {}
        self.scaling = {}
        self.lora_dropout = nn.ModuleDict({})
        self.lora_A = nn.ModuleDict({})
        self.lora_B = nn.ModuleDict({})
        # For Embedding layer
        self.lora_embedding_A = nn.ParameterDict({})
        self.lora_embedding_B = nn.ParameterDict({})
        # Mark the weight as unmerged
        self.merged = False
        self.disable_adapters = False
        self.in_features = in_features
        self.out_features = out_features

    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
            self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

    def update_layer_embedding(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_embedding_A.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((r, self.in_features)))})
            )
            self.lora_embedding_B.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((self.out_features, r)))})
            )
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

    def reset_lora_parameters(self, adapter_name):
        if adapter_name in self.lora_A.keys():
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B[adapter_name].weight)
        if adapter_name in self.lora_embedding_A.keys():
            # initialize a the same way as the default for nn.linear and b to zero
            nn.init.zeros_(self.lora_embedding_A[adapter_name])
            nn.init.normal_(self.lora_embedding_B[adapter_name])
The key pieces are the two linear layers, lora_A and lora_B.
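So how are lora_A and lora_B actually used at forward time? A simplified paraphrase of the adapted layer's forward pass (dtype casting and the disable_adapters branch are omitted, and base_int8_forward is just a placeholder I use here for the original Linear8bitLt forward, i.e. the frozen int8 W x):

import torch

def lora_forward(layer, x: torch.Tensor, base_int8_forward) -> torch.Tensor:
    result = base_int8_forward(x)                 # frozen int8 base: x @ W^T
    a = layer.active_adapter                      # "default"
    if layer.r[a] > 0:
        lora_out = layer.lora_B[a](layer.lora_A[a](layer.lora_dropout[a](x)))
        result = result + lora_out * layer.scaling[a]
    return result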
After update_layer is called:
> /usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py(706)__init__()
705 init_lora_weights = kwargs.pop("init_lora_weights", True)
--> 706 self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
707 self.active_adapter = adapter_name
We get a complete LoRA-augmented layer:
ipdb> self
Linear8bitLt(
in_features=4096, out_features=4096, bias=False
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
This maps directly onto the LoRA diagram (blue pretrained weight on the left, orange A/B matrices on the right).
The blue part on the left (the original pretrained linear layer):
in_features=4096, out_features=4096, bias=False  # note: this is the original layer
The orange part on the right (the newly added low-rank adapters):
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
And with that, we've seen how a single LoRA layer is constructed internally.
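A quick back-of-the-envelope count of how many trainable parameters LoRA adds here; it matches the "trainable params: 4194304" line in the log just below:

r, d, n_layers, n_targets = 8, 4096, 32, 2                # rank, hidden size, layers, {q_proj, v_proj}
lora_params = n_layers * n_targets * (d * r + r * d)      # one lora_A (4096x8) + one lora_B (8x4096) each
print(lora_params)                                        # 4194304, i.e. ~0.062% of the ~6.7B total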
Loading the data:
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dd44ca5a42a67d5e
/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|█████████████████████████████████████
██████████████████████████████████████| 1/1 [00:00<00:00, 695.11it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-dd44ca5a42a67d5e/0.0.0
/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-0c907acf641947ff.arrow
and /root/.cache/huggingface/datasets/json/default-dd44ca5a42a67d5e/0.0.0/
fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-cc201360d6ced764.arrow
> /workspace/asr/Huatuo-Llama-Med-Chinese/finetune.py(271)train()
270 import ipdb; ipdb.set_trace()
--> 271 trainer.train(resume_from_checkpoint=resume_from_checkpoint) # NOTE
The model after PEFT
ipdb> model
PeftModelForCausalLM(
(base_model): LoraModel(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=31999)
(layers): ModuleList(
(0): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(
in_features=4096, out_features=4096, bias=False  # note: this is the original linear layer
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear8bitLt(
in_features=4096, out_features=4096, bias=False
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear8bitLt(in_features=11008, out_features=4096, bias=False)
(up_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(act_fn): SiLUActivation()
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
... ...
(31): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(
in_features=4096, out_features=4096, bias=False
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear8bitLt(
in_features=4096, out_features=4096, bias=False
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=8, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear8bitLt(in_features=11008, out_features=4096, bias=False)
(up_proj): Linear8bitLt(in_features=4096, out_features=11008, bias=False)
(act_fn): SiLUActivation()
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
(norm): LlamaRMSNorm()
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
training_step
> /usr/local/lib/python3.8/dist-packages/transformers/trainer.py(2695)training_step()
2694 import ipdb; ipdb.set_trace() # NOTE training_step
-> 2695 if is_sagemaker_mp_enabled():
2696 loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
Now it's time to compute the loss.
The inputs:
ipdb> inputs['input_ids'].shape
torch.Size([128, 256]) # 128 = batch size; 256 = sequence length
ipdb> inputs['attention_mask'].shape
torch.Size([128, 256])
ipdb> inputs['labels'].shape
torch.Size([128, 256])
ipdb> inputs['input_ids'][0]
tensor([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
29871, 30557, 30806, 30392, 30287, 30502, 31658, 31596, 30214, 31894,
30406, 232, 143, 190, 30415, 31043, 235, 178, 137, 30805,
30724, 31835, 30742, 234, 176, 151, 31302, 31658, 29889, 13,
2277, 29937, 29871, 31658, 31596, 29901, 13, 30287, 30956, 29946,
29945, 232, 181, 132, 30647, 30952, 233, 133, 166, 30767,
30544, 31424, 31066, 236, 155, 183, 30636, 233, 135, 162,
233, 162, 150, 30503, 30544, 235, 164, 131, 31184, 234,
154, 138, 31531, 30214, 31412, 233, 166, 131, 31213, 30910,
31424, 235, 133, 194, 234, 155, 167, 31666, 31835, 235,
178, 141, 30573, 31066, 236, 155, 183, 235, 135, 133,
235, 133, 173, 235, 133, 140, 234, 155, 167, 30214,
31088, 31658, 31751, 233, 133, 166, 30767, 30417, 232, 150,
173, 31959, 231, 187, 183, 232, 189, 141, 234, 154,
138, 31531, 30503, 30988, 232, 193, 132, 30882, 13, 2277,
29937, 29871, 30742, 234, 176, 151, 29901, 13, 31751, 233,
133, 166, 30767, 30682, 30815, 30417, 233, 135, 162, 233,
162, 150, 30503, 30544, 235, 164, 131, 31184, 231, 187,
183, 232, 189, 141, 234, 154, 138, 31531, 30503, 233,
189, 134, 234, 153, 164, 30330, 232, 136, 136, 235,
164, 131, 30330, 235, 133, 194, 234, 155, 167, 31184,
30988, 232, 193, 132, 30267, 0], device='cuda:0')
ipdb> inputs['labels'][0]
tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, 31751, 233,
133, 166, 30767, 30682, 30815, 30417, 233, 135, 162, 233,
162, 150, 30503, 30544, 235, 164, 131, 31184, 231, 187,
183, 232, 189, 141, 234, 154, 138, 31531, 30503, 233,
189, 134, 234, 153, 164, 30330, 232, 136, 136, 235,
164, 131, 30330, 235, 133, 194, 234, 155, 167, 31184,
30988, 232, 193, 132, 30267, 0], device='cuda:0')
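Notice that the leading positions of labels are all -100: since train_on_inputs=False, the left padding and the prompt part (instruction + question) are masked out, so only the answer tokens are supervised. A rough, illustrative sketch of the idea (made-up helper name; see the prompt tokenization in finetune.py for the real implementation):

def mask_prompt_labels(input_ids, num_masked):
    # num_masked = number of padding + prompt tokens at the front of the sequence;
    # those positions get -100 so the cross-entropy loss ignores them,
    # and the labels for the answer tokens are just the input_ids themselves.
    labels = list(input_ids)
    labels[:num_masked] = [-100] * num_masked
    return labels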
outputs:
Let's look inside the forward pass:
> /usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py(680)forward()
ipdb> hidden_states.shape
torch.Size([128, 256, 4096])
where 128 = batch size, 256 = sequence length, and 4096 = the hidden dimension of each token.
Then, after the lm_head, we get [128, 256, 32000].
Next the tensors are shifted and flattened:
shift_logits -> [32640, 32000],
shift_labels -> [32640]
(32640 = 128 × 255, since the last position has no next-token target), and the cross-entropy loss is computed on these.
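This is the standard Hugging Face causal-LM loss. A small runnable sketch of the shift-and-flatten step (tiny dummy shapes here; the real shapes are in the comments):

import torch
import torch.nn as nn

batch, seq, vocab = 2, 8, 11                       # real run: 128, 256, 32000
logits = torch.randn(batch, seq, vocab)            # what lm_head produces
labels = torch.randint(0, vocab, (batch, seq))
labels[:, :3] = -100                               # masked prompt / padding positions

shift_logits = logits[..., :-1, :].contiguous()    # position t predicts token t+1
shift_labels = labels[..., 1:].contiguous()
loss = nn.CrossEntropyLoss()(                      # ignore_index defaults to -100
    shift_logits.view(-1, vocab),                  # real run: [128*255, 32000] = [32640, 32000]
    shift_labels.view(-1),                         # real run: [32640]
)
print(loss)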
Cross-entropy
But the reference labels contain -100, so let's investigate:
ipdb> shift_logits
tensor([[ 0.3999, -23.2656, -0.0309, ..., 1.0137, 1.4453, 0.2018],
[ 0.3999, -23.2656, -0.0309, ..., 1.0137, 1.4453, 0.2018],
[ 0.3999, -23.2656, -0.0309, ..., 1.0137, 1.4453, 0.2018],
...,
[ -0.4448, -4.2188, 4.6992, ..., 3.9453, -0.2756, 3.9531],
[ -3.9668, -7.9531, 10.5547, ..., 7.5898, 2.6328, 8.0781],
[ -4.4062, -12.9141, 12.9141, ..., 6.7266, 2.4121, 7.8398]],
device='cuda:6', dtype=torch.float16, grad_fn=<ViewBackward0>)
ipdb> shift_labels
tensor([ -100, -100, -100, ..., 31559, 30267, 0], device='cuda:6')
ipdb> shift_logits.shape
torch.Size([32640, 32000])
ipdb> shift_logits2 = shift_logits[0]
ipdb> shift_logits2.shape
torch.Size([32000])
ipdb> shift_logits2 = shift_logits2.unsqueeze(0)
ipdb> shift_logits2.shape
torch.Size([1, 32000])
ipdb> shift_labels2 = shift_labels[0]
ipdb> shift_labels2 = shift_labels2.unsqueeze(0)
ipdb> shift_labels2.shape
torch.Size([1])
ipdb> loss_fct(shift_logits2, shift_labels2)
tensor(nan, device='cuda:6', grad_fn=<NllLossBackward0>) # with only the -100 label, we get nan
ipdb> shift_logits3 = shift_logits[-1].unsqueeze(0)
ipdb> shift_logits3.shape
torch.Size([1, 32000])
ipdb> shift_labels3 = shift_labels[-1].unsqueeze(0)
ipdb> shift_labels3.shape
torch.Size([1])
ipdb> loss_fct(shift_logits3, shift_labels3)
tensor(20.9531, device='cuda:6', grad_fn=<NllLossBackward0>) # !!!
ipdb> shift_logits4 = torch.cat((shift_logits2, shift_logits3), 0)
ipdb> shift_logits4.shape
torch.Size([2, 32000])
ipdb> shift_logits4
tensor([[ 0.3999, -23.2656, -0.0309, ..., 1.0137, 1.4453, 0.2018],
[ -4.4062, -12.9141, 12.9141, ..., 6.7266, 2.4121, 7.8398]],
device='cuda:6', dtype=torch.float16, grad_fn=<CatBackward0>)
ipdb> shift_labels4 = torch.cat((shift_labels2, shift_labels3), 0)
ipdb> shift_labels4.shape
torch.Size([2])
ipdb> shift_labels4
tensor([-100, 0], device='cuda:6')
ipdb> loss_fct(shift_logits4, shift_labels4)
tensor(20.9531, device='cuda:6', grad_fn=<NllLossBackward0>) # you can see the -100 entry was not counted # !!!
And the loss is the mean over the non-ignored positions:
ipdb> shift_labels4 = torch.cat((shift_labels2, shift_labels3, shift_labels3), 0)
ipdb> shift_labels4
tensor([-100, 0, 0], device='cuda:6')
ipdb> shift_logits4 = torch.cat((shift_logits2, shift_logits3, shift_logits3), 0)
ipdb> loss_fct(shift_logits4, shift_labels4)
tensor(20.9531, device='cuda:6', grad_fn=<NllLossBackward0>)
With the experiments above, the behaviour of -100 is clear.
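To wrap up the experiment in one self-contained snippet: nn.CrossEntropyLoss uses ignore_index=-100 by default, so those positions drop out of both the sum and the denominator of the mean.

import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()                   # ignore_index=-100 by default
logits = torch.randn(3, 32000)
labels = torch.tensor([-100, 17, 17])
print(loss_fct(logits, labels))                    # the -100 row contributes nothing
print(loss_fct(logits[1:], labels[1:]))            # identical value: mean over the 2 valid rows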
Setting the lm_head aside, we have:
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
ipdb> sum([p.numel() for p in self.parameters()])