Four Lightweight Fine-Tuning Frameworks for Large Models, Plus a PPT-to-Markdown Tool

Question 1: Four current fine-tuning frameworks

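1. Firefly

Repository: https://github.com/yangjianxin1/Firefly

The project describes its features and training data as follows:

Supports pre-training, instruction fine-tuning, and DPO, with full-parameter training as well as efficient LoRA and QLoRA training. Models are trained through configuration files, so beginners can get started quickly.
Supports Unsloth to accelerate training and reduce GPU memory usage.
Supports most mainstream open-source LLMs, such as Llama3, Gemma, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, and Bloom, and aligns with the template of each official chat model during training.
Curates and open-sources instruction fine-tuning datasets: firefly-train-1.1M, moss-003-sft-data, ultrachat, WizardLM_evol_instruct_V2_143k, and school_math_0.25M.
Open-sources the weights of the Firefly series of instruction-tuned models.
Validates the effectiveness of the QLoRA training pipeline on the Open LLM Leaderboard.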
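
To make the gap between full-parameter training and LoRA concrete, the sketch below counts trainable parameters for a single linear layer. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values taken from Firefly's configuration files. QLoRA trains the same kind of adapter while keeping the frozen base weights in 4-bit form.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


# Assumed sizes, chosen only for illustration
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = lora / full

print(f"full fine-tuning: {full} trainable weights")
print(f"LoRA (rank {rank}): {lora} trainable weights")
print(f"LoRA / full: {ratio:.4%}")
```

Under these assumptions the adapter holds 1/256 of the layer's weights, which is why LoRA-style training fits on much smaller GPUs.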

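2. LLaMA-Factory

Repository: https://github.com/hiyouga/LLaMA-Factory

Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, and more.
Integrated methods: (continuous) pre-training, (multimodal) supervised instruction fine-tuning, reward model training, PPO, DPO, KTO, ORPO, and more.
Multiple precisions: 16-bit full-parameter fine-tuning, freeze fine-tuning, LoRA fine-tuning, and 2/3/4/5/6/8-bit QLoRA fine-tuning based on AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
Advanced algorithms: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ, PiSSA, and Agent fine-tuning.
Practical tricks: FlashAttention-2, Unsloth, RoPE scaling, NEFTune, and rsLoRA.
Experiment monitoring: LlamaBoard, TensorBoard, Wandb, MLflow, and more.
Fast inference: an OpenAI-style API, a browser UI, and a command-line interface, all backed by vLLM.

The README also gives approximate hardware requirements for training.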
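
As a rough way to reason about why lower bit widths lower the hardware bar, the sketch below estimates the memory taken by the model weights alone. The 7B parameter count is an assumed example, and the numbers are not the project's published requirements: gradients, optimizer states, activations, and quantization overhead are all excluded.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GiB (no gradients or optimizer state)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumed model size, for illustration only
num_params = 7e9

for bits in (16, 8, 4):
    gib = weight_memory_gib(num_params, bits)
    print(f"{bits:>2}-bit weights: about {gib:.2f} GiB")
```

Real training needs noticeably more than this lower bound, so treat the result as a floor rather than a requirement.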

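3. unsloth

Its slogan is: Finetune Llama 3.1, Mistral, Phi-3 & Gemma 2-5x faster with 80% less memory!

Project features:

All kernels are written in OpenAI's Triton language, with a manual backpropagation engine.
Supports NVIDIA GPUs from 2018 onward. The minimum CUDA capability is 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.), so check your GPU. GTX 1070 and 1080 work, but are slower.
Works on Linux, and on Windows via WSL.
Supports 4-bit and 16-bit QLoRA/LoRA fine-tuning via bitsandbytes.
The open-source version trains up to 5x faster; see Unsloth Pro for training that is up to 30x faster.

Repository: https://github.com/unslothai/unsloth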
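
The "check your GPU" advice can be sketched as a comparison of (major, minor) compute capability tuples against the stated minimum of 7.0. The function name and the small lookup table are illustrative assumptions, not part of unsloth. With PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple of this shape for the current device.

```python
from typing import Tuple

# Minimum CUDA compute capability stated in the feature list
MIN_CAPABILITY: Tuple[int, int] = (7, 0)

# Compute capabilities of a few GPUs named in the list
KNOWN_GPUS = {
    "GTX 1080": (6, 1),
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "H100": (9, 0),
}


def meets_recommended_minimum(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) capability is at least the stated minimum."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) and (6, 1) < (7, 0)
    return capability >= MIN_CAPABILITY


for name, capability in KNOWN_GPUS.items():
    status = "meets minimum" if meets_recommended_minimum(capability) else "below minimum"
    print(f"{name}: {capability[0]}.{capability[1]} ({status})")
```

A card below the minimum, such as the GTX 1080, is reported as "below minimum" here, which matches the note that it works but runs slower.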

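4. SWIFT

SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) supports training (pre-training, fine-tuning, alignment), inference, evaluation, and deployment for 300+ LLMs and 50+ MLLMs (multimodal large models). Developers can apply the framework directly to their own research and production environments, covering the full pipeline from model training and evaluation to application.

Besides the lightweight training methods provided by PEFT, it also offers a complete Adapters library that supports recent training techniques such as NEFTune, LoRA+, and LLaMA-PRO. This adapter library can be used directly in your own custom workflow, independently of the training scripts.

Repository: https://github.com/modelscope/swift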
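
Of the techniques named above, NEFTune is the easiest to show in a few lines: during training it adds uniform noise to the token embeddings, scaled by alpha / sqrt(seq_len * dim). The pure-Python sketch below illustrates that formula only. It is not SWIFT's implementation, and the function name, the list-of-lists representation, and alpha = 5.0 are assumptions made for the example.

```python
import math
import random


def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Return a noisy copy of a seq_len x dim embedding matrix.

    Each entry receives Uniform(-1, 1) noise scaled by alpha / sqrt(seq_len * dim).
    """
    rng = rng or random.Random()
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


# Toy input: 4 tokens with 16-dimensional embeddings, all zeros
embeddings = [[0.0] * 16 for _ in range(4)]
noisy = neftune_noise(embeddings, alpha=5.0, rng=random.Random(0))

# With seq_len * dim = 64 the scale is 5 / 8, so every entry stays within 0.625
max_abs = max(abs(x) for row in noisy for x in row)
print(f"largest perturbation: {max_abs:.3f}")
```

The noise is applied only in training; at inference the embeddings are used unchanged.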

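Question 2: A document intelligence tool for PPT-to-Markdown conversion

pptx2md is a tool that converts PowerPoint pptx files to Markdown. The project README illustrates the result with the input source file, the generated table of contents, and the generated Markdown file (rendered by madoko).

Its execution logic can be found here:

https://github.com/ssine/pptx2md/blob/master/pptx2md/parser.py

It is built on the python-pptx package, with a set of hand-written rules for the different components: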

from __future__ import print_function
import collections
import collections.abc
import pptx
from pptx.enum.shapes import PP_PLACEHOLDER, MSO_SHAPE_TYPE
from pptx.enum.dml import MSO_COLOR_TYPE, MSO_THEME_COLOR
from pptx.util import Length
import numpy as np
from PIL import Image
import os
from rapidfuzz import process as fuze_process
from operator import attrgetter
from tqdm import tqdm
from pptx2md.global_var import g
from pptx2md import global_var
import pptx2md.outputter as outputter
from pptx2md.columns import is_two_column_text, assign_shapes
from pptx2md.utils_optim import normal_pdf, fit_column_model
Its execution steps are as follows:
# main
def parse(prs, outputer):
  global out
  out = outputer
  notes = []
  print("Starting conversion")
  # Adding inclusion of header for the first slide
  out.put_header()
  # Process the slides one by one
  for idx, slide in enumerate(tqdm(prs.slides, desc='Converting slides')):
    # If a single page was requested, skip every other slide
    if g.page is not None and idx + 1 != g.page:
      continue
    shapes = []
    try:
      # Flatten grouped shapes, then order them top-to-bottom and left-to-right
      shapes = sorted(ungroup_shapes(slide.shapes), key=attrgetter('top', 'left'))
    except Exception:
      print('Bad shapes encountered in this slide. Please check or move them and try again.')
      print('shapes:')
      try:
        for sp in slide.shapes:
          print(sp.shape_type)
          print(sp.top, sp.left, sp.width, sp.height)
      except Exception:
        print('failed to print all bad shapes.')
    # Process the shapes on this slide
    process_shapes(shapes, idx + 1)
    # Collect speaker notes unless they are disabled
    if not g.disable_notes and slide.has_notes_slide:
      text = slide.notes_slide.notes_text_frame.text
      if text:
        notes += process_notes(text, idx + 1)
    # Emit a separator between slides when slide mode is enabled
    if idx < len(prs.slides) - 1 and g.enable_slides:
      out.put_para("\n---\n")
  out.close()
  if len(notes) > 0:
    print('Process finished with notice:')
    for note in notes:
      print(note)
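
The reading order in `parse` comes from sorting shapes by `attrgetter('top', 'left')`. The sketch below reproduces only that ordering step with a hypothetical `Shape` namedtuple standing in for python-pptx shape objects; the names and coordinates are made up for illustration.

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for a python-pptx shape: only the fields used for ordering
Shape = namedtuple("Shape", ["name", "top", "left"])

shapes = [
    Shape("body-right", top=200, left=500),
    Shape("title", top=50, left=100),
    Shape("body-left", top=200, left=100),
]

# Sort by vertical position first, then by horizontal position for ties
ordered = sorted(shapes, key=attrgetter("top", "left"))
names = [shape.name for shape in ordered]
print(names)  # ['title', 'body-left', 'body-right']
```

Sorting on the (top, left) pair gives a simple top-to-bottom, left-to-right order, which is a reasonable default for single-column slides.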
The logic that organizes the Markdown output is in: https://github.com/ssine/pptx2md/blob/master/pptx2md/outputter.py
Repository: https://github.com/ssine/pptx2md
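
To show how a parser like `parse` can stay independent of the output format, here is a minimal sketch of an outputter object. `put_header`, `put_para`, and `close` are the calls that appear in `parse`; the class itself, its constructor, and `put_title` as written here are hypothetical simplifications, not the code in outputter.py.

```python
import io


class SketchOutputter:
    """Hypothetical, simplified outputter; not the class from outputter.py."""

    def __init__(self, stream):
        self.stream = stream

    def put_header(self):
        # Placeholder: a real outputter could emit front matter here
        pass

    def put_title(self, text, level):
        # Markdown heading: one '#' per level
        self.stream.write("#" * level + " " + text + "\n\n")

    def put_para(self, text):
        self.stream.write(text + "\n\n")

    def close(self):
        self.stream.flush()


buf = io.StringIO()
out = SketchOutputter(buf)
out.put_header()
out.put_title("Slide one", 1)
out.put_para("First paragraph.")
out.put_para("\n---\n")  # slide separator, the same string parse() passes
out.close()

markdown = buf.getvalue()
print(markdown)
```

Because the parser only calls methods on `out`, swapping in a different outputter would change the target format without touching the slide traversal.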
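This article looked at four current fine-tuning frameworks and at the implementation logic of pptx2md, a document intelligence tool that converts PPT to Markdown. The implementations are all fairly simple.
If you need fine-tuning or document processing, try running them; you should get something out of it.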
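- A very simple v1 version
    - Use langchain and pdfminer to split the PDF document into k chunks, setting parameters such as overlap
    - First use prompt1 to generate a summary for each text chunk, then use prompt2 to combine the summaries coherently, adding or removing content as needed
- Evaluation criteria: whether the information covers the PDF's main topics, whether the bullet points roughly match the PDF's first- and second-level headings, and whether the summary is coherent and fluent

Prompt1: section-by-section summary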
```python
prompt1 = '''You are a summary generator. Summarize the text below section by section. Please note:
            1. The input data is from