Master AI Summarization to Build a Study Assistant

1. Download the LaMini model

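We will use LaMini-LM from Hugging Face: a small model based on the Flan-T5 family. It has 248M parameters and is reported to perform on par with Alpaca-7B and LLaMA-7B on downstream NLP tasks (text generation, QnA, and summarization).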

Create a new folder for the project


    ai-summarization

Create a subfolder inside the project directory


    model

Go to the Hugging Face repository LaMini-Flan-T5-248M and download all the files in the directory into the model folder you just created.
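After downloading, a quick check can confirm that the files landed in the model folder. The following is a minimal sketch using only the standard library. The helper name missing_model_files and the file names in EXPECTED_FILES are assumptions for illustration; compare them with the actual file listing of the repository.

```python
from pathlib import Path

# Assumed file names: adjust to match the repository's actual file listing.
EXPECTED_FILES = ["config.json", "pytorch_model.bin", "spiece.model"]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # Run from the ai-summarization project folder.
    missing = missing_model_files("./model")
    print("Missing files:", missing or "none")
```

An empty list means every expected file was found.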

2. Prepare the Python environment and install dependencies

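There are quite a few libraries to install. Note that the core of interacting with Hugging Face models is torch and the Transformers library; for details, see the link mentioned above.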

In the ai-summarization directory, create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate # ubuntu/Mac
venv\Scripts\activate # windows
With the venv activated, install the following:
pip install mkl mkl-include # required on Mac when running on CPU
pip install torch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 # core
# Install the Hugging Face Transformers library, needed to interact with the LLM
pip install git+https://github.com/huggingface/transformers 
# These are used below to interact with documents
pip install langchain==0.0.173 
pip install faiss-cpu==1.7.4
pip install unstructured==0.6.8
pip install pytesseract==0.3.10
pip install pypdf==3.9.0
pip install pdf2image==1.16.3
pip install sentence_transformers==2.2.2
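# Only needed to run on CPU
pip install accelerate==0.19.0
# For the GUI and web application
pip install streamlit
In the project directory ai-summarization, create a new Python file main.py to verify that all the libraries are installed correctly.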
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
import torch
import streamlit
Go to the terminal and, with the venv activated, run python main.py. If you see no errors, everything was installed successfully.
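If an import does fail, it helps to see every missing library at once instead of stopping at the first error. Here is a small sketch of that idea using the standard importlib module; the function name find_missing_modules is an assumption for illustration.

```python
import importlib


def find_missing_modules(module_names):
    """Try to import each module and return the names that could not be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Demonstration with one standard-library module and one that does not exist.
print(find_missing_modules(["json", "no_such_module_xyz"]))
```

For this project the list to check would be ["torch", "transformers", "langchain", "streamlit"].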
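3. Test the summarization pipeline
There are many ways to do summarization: here we will use the pipeline approach. Pipelines from the Transformers library are tools dedicated to specific tasks (named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering). Update main.py with all the required imports and the pipeline initialization.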
########### GUI IMPORTS ################
import streamlit as st
#### IMPORTS FOR AI PIPELINES ###############
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
from transformers import AutoModel, T5Tokenizer, T5Model
from transformers import T5ForConditionalGeneration
from langchain.llms import HuggingFacePipeline
import torch
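The model is stored in checkpoint (for this project, the model directory). It is an encoder-decoder model, so initialize it accordingly:
# Set the model path
checkpoint = "./model/"  # it is actually LaMini-Flan-T5-248M
# Initialize the tokenizer and the model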
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
base_model = T5ForConditionalGeneration.from_pretrained(
                                            checkpoint,
                                            device_map='auto',
                                            torch_dtype=torch.float32)
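The pipeline specifies the task we want the LLM to perform: set the model and the tokenizer, and add some task-specific parameters (max_length and min_length of the summary).
# Initialize the pipeline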
pipe_sum = pipeline('summarization', 
                    model = base_model,
                    tokenizer = tokenizer,
                    max_length = 350, 
                    min_length = 25)
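The fixed values max_length = 350 and min_length = 25 may not suit every input, and a short text in particular can trigger length warnings. One possible way to derive the bounds from the input size is sketched below. The helper summary_length_bounds, the 0.3 ratio, and the use of word count as a crude stand-in for token count are all assumptions for illustration, not part of the original project.

```python
def summary_length_bounds(text, ratio=0.3, floor=25, ceiling=350):
    """Pick (min_length, max_length) for a summary from the size of the input.

    Word count is only a rough proxy for the tokenizer's token count.
    """
    n_words = len(text.split())
    # Keep max_length between floor + 1 and ceiling.
    max_length = max(floor + 1, min(ceiling, int(n_words * ratio)))
    # min_length must stay below max_length.
    min_length = min(floor, max_length - 1)
    return min_length, max_length


sample = "word " * 1000
print(summary_length_bounds(sample))  # (25, 300)
```

The resulting values could then be passed when calling the pipeline, for example pipe_sum(text, min_length=..., max_length=...).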
To test it properly, assign the following text to a string variable, then run the pipeline on it:
text = "Automatic text summarization with machine learning is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. It is a challenging task that requires extensive research in the NLP area. There are two different approaches for automatic text summaryization: extraction and abstraction. The extraction method involves identifying important sections of the text and stitching together portions of the content to produce a condensed version. The scoring function assigns a value to each sentence denoting the probability with which it will get picked up in the summary. The process involves constructing an intermediate representation of the input text and scoring the sentences based on the representation. A typical flow of extractive summarization systems involves constructing intermediate representations of the input text, scoring sentences based on the representation, and using Latent semantic analysis (LSA) to identify semantically important sentences. Recent studies have also applied deep learning in extractive text summaryization, such as Sukriti's approach for factual reports using a deep learning model, Yong Zhang's document summarizing framework using convolutional neural networks, and Y. Kim's regression process for sentence ranking. The neural architecture used in the paper is compounded by one single convolution layer built on top of pre-trained word vectors followed by a max-pooling layer. Experiments have shown the proposed model achieved competitive or even better performance compared with baselines. Abstractive summarization methods aim to produce summary by interpreting the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original text. They take advantage of recent developments in deep learning and use an attention-based encoder-decoder method for generating abstractive summaries. Recent studies have argued that attention to sequence models can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text. Junyang Lin et al proposes a gated unit on top of the encoder outputs at each time step to tackle this problem. The code to reproduce the experiments from the NAMAS paper can be found here. The Pointer Network is a neural attention-based sequence-to-sequence architecture that learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Other methods for abstractive summarization include Pointer-Generator, which allows copying words from the input sequence via pointing of specific positions, and a generator that generates words from a fixed vocabulary of 50k words. To overcome repetition problems, the paper adapts the coverage model of Tu et al. to overcome the lack of coverage of source words in neural machine translation models. To train the extractor on available document-summary pairs, the model uses a policy-based reinforcement learning (RL) with sentence-level metric rewards to connect both extractor and abstractor networks and to learn sentence saliency. The abstractor network is an emphasis-based encoder-decoder which compresses and paraphrases an extracted document sentence to a concise summary sentence. An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The extractor agent is a convolutional sentence encoder that computes representations for each sentence based on input embedded word vectors. An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The method incorporates abstractive approach advantages of concisely rewriting sentences and generating novel words from the full vocabulary, while adopting intermediate extractive behavior to improve the overall model's quality, speed, and stability. Recent studies have proposed a combination of adversarial processes and reinforcement learning to abstractive summarization. The extractive approach is easier because copying large chunks of text from the source document ensures good levels of grammaticality and accuracy, while the abstractive model generates new phrases, rephrasing or using words that were not in the original text. Recent developments in the deep learning area have allowed for more sophisticated abilities to be generated."
# Run the pipeline on the text and print the result
result = pipe_sum(text)
print(result)
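With the venv activated, run python main.py from the terminal to see the result.
Because some tuning is still needed, some error messages appear … the values should be chosen according to the length of the original text. We will do that later. Note that the pipeline result is a list containing a dictionary: to get only the text string, use [0] for the first item in the list and ['summary_text'] as the key of the value (a string) we want.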
print(result[0]['summary_text'])
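To make the result structure concrete without loading the model, the sketch below uses a hand-written stand-in for the pipeline output. The variable mock_result and the helper extract_summaries are illustrative names, and the summary string is made up; only the list-of-dictionaries shape and the 'summary_text' key come from the text above.

```python
# Stand-in for the pipeline output: a list with one dictionary per input text.
mock_result = [{"summary_text": "Text summarization condenses a document."}]


def extract_summaries(result):
    """Return the summary string of every item in a pipeline-style result."""
    return [item["summary_text"] for item in result]


# Index [0] selects the first dictionary, ['summary_text'] selects the string.
first_summary = mock_result[0]["summary_text"]
print(first_summary)
print(extract_summaries(mock_result))
```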
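4. Prepare and test the GUI with Streamlit
Now that the logic part is done (apart from the text splitter), we will dig into the Streamlit application.
Streamlit is a library for building data web applications without needing to know any front-end technology (such as HTML and CSS). To learn more, see the Streamlit documentation.
Create a Python file named LaMini-TextSummarizer.py: first build the skeleton of the GUI, then wire its elements to the logic.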
import streamlit as st
############# Displaying images on the front end #################
st.set_page_config(page_title="Mockup for single page webapp",
                   page_icon='💻',
                   layout="centered",  #or wide
                   initial_sidebar_state="expanded",
                   menu_items={
                        'Get Help': 'https://docs.streamlit.io/library/api-reference',
                        'Report a bug': "https://www.extremelycoolapp.com/bug",
                        'About': "# This is a header. This is an *extremely* cool app!"})
# Load image placeholder from the web
st.image('https://placehold.co/750x150', width=750)
# Set a Descriptive Title
st.title("Your Beautiful App Name")
st.divider()
your_future_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras rhoncus massa sit amet est congue dapibus. Duis dictum ac nulla sit amet sollicitudin. In non metus ac neque vehicula egestas. Vestibulum quis justo id enim vestibulum venenatis. Cras gravida ex vitae dignissim suscipit. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis efficitur, lorem ut fringilla commodo, lacus orci lobortis turpis, sit amet consequat ante diam ut libero."
st.text_area('Summarized text', your_future_text, 
              height = 150, key = 'result')
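After importing streamlit, the first Streamlit command must be set_page_config (placing it anywhere else in the program raises an error): its parameters set the overall layout of the web app page.
Then set the header image: for testing only, we use an image from the web, https://placehold.co/750x150.
st.text_area is another Streamlit widget: it creates a text area with a title and content. In this example, the content is filled with the text in the string your_future_text.
The last parameter is key = 'result': we will use it to access st.session_state (the way variables can be read and updated while the app is running).
# Set 2 columns to make the buttons wider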
col1, col2 = st.columns(2)
btn1 = col1.button(" :star: Click ME ", use_container_width=True, type="secondary")
btn2 = col2.button(" :smile: Click ME ", use_container_width=True, type="primary")
if btn1:
    st.warning('You pressed the wrong one!', icon="⚠️")
if btn2:
    st.success('Good Choice!', icon="⚠️")  
st.divider()
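For this example we define just 2 columns and place one button in each. Inside a container (a column), widgets are called on the container instead of with the st. prefix. use_container_width=True stretches the button to the full width of its column.
Save everything and, in the terminal, run the Streamlit app by typing: streamlit run LaMini-TextSummarizer.py
The default browser opens at the default address http://localhost:8501 and displays the app.
5. Wire the logic and the interface together
After this brief introduction to Streamlit, let's bring together the logic part (the AI pipeline) and the graphical user interface part (Streamlit). Don't worry about the code: you can find it in the GitHub repository.
Rename the previous file to LaMini-TextSummarizer_mockup.py and create a new file LaMini-TextSummarizer.py with the following code: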
########### GUI IMPORTS ################
import streamlit as st
import ssl
############# Displaying images on the front end #################
st.set_page_config(page_title="Summarize and Talk to your Text",
                   page_icon='📖',
                   layout="centered",  #or wide
                   initial_sidebar_state="expanded",
                   menu_items={
                        'Get Help': 'https://docs.streamlit.io/library/api-reference',
                        'Report a bug': "https://www.extremelycoolapp.com/bug",
                        'About': "# This is a header. This is an *extremely* cool app!"})
#### IMPORTS FOR AI PIPELINES ###############
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
from transformers import AutoModel, T5Tokenizer, T5Model
from transformers import T5ForConditionalGeneration
from langchain.llms import HuggingFacePipeline
import torch
import datetime
# SET THE MODEL PATH
checkpoint = "./model/"  #it is actually LaMini-Flan-T5-248M
# INITIALIZE TOKENIZER AND MODEL
# this part has been moved inside the AI_SummaryPL function
Nothing new so far. The following code blocks bring the function and the interactive Streamlit parts together, and explain the building blocks.
######################################################################
#     SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE       #
######################################################################
def AI_SummaryPL(checkpoint, text, chunks, overlap):
    """
    checkpoint: path to the model folder
        example: checkpoint = "/content/model/"  # it is actually LaMini-Flan-T5-248M, tested fine
    text: a long string, either pasted input or a document loaded into a string
    chunks: integer, length of each chunk when splitting
    overlap: integer, overlap between chunks, for correct attention and focus retrieval
    RETURNS full_summary (str), delta (datetime.timedelta) and reduction (str)
    USAGE EXAMPLE:
    post_summary, post_time, post_percentage = AI_SummaryPL(checkpoint, originalText, 3700, 500)
    """
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        # chunk_size and chunk_overlap are measured in characters (length_function = len)
        chunk_size = chunks,
        chunk_overlap  = overlap,
        length_function = len,
    )
    texts = text_splitter.split_text(text)
    tokenizer = T5Tokenizer.from_pretrained(checkpoint)
    base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
                                                        device_map='auto',
                                                        torch_dtype=torch.float32)
    ### INITIALIZING PIPELINE
    pipe_sum = pipeline('summarization', 
                        model = base_model,
                        tokenizer = tokenizer,
                        max_length = 350, 
                        min_length = 25
                        )
    ## START TIMER
    start = datetime.datetime.now()
    ## START CHUNKING: summarize each chunk and concatenate the partial summaries
    full_summary = ''
    for cnk in range(len(texts)):
        result = pipe_sum(texts[cnk])
        full_summary = full_summary + ' '+ result[0]['summary_text']
    stop = datetime.datetime.now()
    ## TIMER STOPPED AND RETURN DURATION
    delta = stop-start  
    ### Calculating Summarization PERCENTAGE (summary length relative to the original)
    reduction = '{:.1%}'.format(len(full_summary)/len(text))
    print(f"Completed in {delta}")
    print("Reduction percentage: ", reduction)
    return full_summary, delta, reduction
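This is the main function. It has to be a function because summarization starts when the right button is clicked, and that approach needs a function to call.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # chunk_size and chunk_overlap are measured in characters (length_function = len)
    chunk_size = chunks,
    chunk_overlap  = overlap,
    length_function = len,
)
texts = text_splitter.split_text(text)
The LangChain library is a very powerful toolbox: it lets you interact with language models using external documents and sources. LangChain offers more than one text splitter. RecursiveCharacterTextSplitter is the recommended one for splitting long generic text into small pieces (called chunks) that stay within the token limit. Here the length is measured in characters (length_function = len), so chunk_size has to be chosen conservatively.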
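To see what chunk_size and chunk_overlap mean, here is a deliberately simplified fixed-width splitter in plain Python. It is not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to break on separators such as paragraphs and sentences); the function split_with_overlap is a hypothetical illustration of the two parameters only.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap of 1.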
## START CHUNKING
full_summary = ''
for cnk in range(len(texts)):
   result = pipe_sum(texts[cnk])
   full_summary = full_summary + ' '+ result[0]['summary_text']
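The chunks are stored in a list: iterate over the items and feed each chunk to the summarization pipeline. Then concatenate all the resulting strings to get full_summary.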
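The same loop can be tried without the model by plugging in a stand-in summarizer. In this sketch, summarize_chunks, length_ratio, and first_sentence are hypothetical names; first_sentence simply keeps the first sentence of a chunk and only stands in for the real pipeline call. Note that the value the main function calls reduction is the summary length divided by the original length.

```python
def summarize_chunks(chunks, summarize):
    """Summarize each chunk and join the partial summaries with spaces."""
    return " ".join(summarize(chunk) for chunk in chunks)


def length_ratio(summary, original):
    """Summary length as a share of the original length, e.g. '25.0%'."""
    return "{:.1%}".format(len(summary) / len(original))


def first_sentence(chunk):
    # Stand-in summarizer: keeps only the first sentence of the chunk.
    return chunk.split(".")[0].strip() + "."


chunks = ["A b. C d.", "E f. G h."]
summary = summarize_chunks(chunks, first_sentence)
print(summary)                                 # A b. E f.
print(length_ratio(summary, " ".join(chunks)))
```

Unlike the original loop, joining with " ".join does not leave a leading space at the start of the summary.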
### HEADER section
st.image('Headline-text.jpg', width=750)
title = st.text_area('Insert here your Copy/Paste text', "", height = 350, key = 'copypaste')
btt = st.button("1. Start Summarization")
txt = st.empty()
timedelta = st.empty()
text_lenght = st.empty()
redux_bar = st.empty()
st.divider()
down_title = st.empty()
down_btn = st.button('2. Download Summarization') 
text_summary = ''
You can see several st.empty() calls. Each one is a placeholder: it reserves a spot in the page layout that will be filled in later.
def start_sum(text):
    if st.session_state.copypaste == "":
        st.warning('You need to paste some text...', icon="⚠️")
    else:
        with st.spinner('Initializing pipelines...'):
            st.success(' AI process started', icon="🤖")
            print("Starting AI pipelines")
            text_summary, duration, reduction = AI_SummaryPL(checkpoint,text,3700,500)
        txt.text_area('Summarized text', text_summary, height = 350, key='final')
        timedelta.write(f'Completed in {duration}')
        text_lenght.markdown(f"Initial length = {len(text.split(' '))} words / summarization = **{len(text_summary.split(' '))} words**")
        redux_bar.progress(len(text_summary)/len(text), f'Reduction: **{reduction}**')
        down_title.markdown(f"## Download your text Summarization")
This function is called when btt = st.button("1. Start Summarization") is pressed, and starts summarizing the text pasted into the text_area.
if btt:
    start_sum(st.session_state.copypaste)
if down_btn:
    def savefile(generated_summary, filename):
        st.write("Download in progress...")
        with open(filename, 'w') as t:
            t.write(generated_summary)
        t.close()
        st.success(f'AI Summarization saved in {filename}', icon="✅")
    savefile(st.session_state.final, 'text_summarization.txt')
    txt.text_area('Summarized text', st.session_state.final, height = 350)
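Note that the only argument passed to start_sum is a session_state value.
Session State is a way to share variables between reruns, for each user session. In addition to the ability to store and persist state, Streamlit also exposes the ability to manipulate state using callbacks. Session state also persists across apps inside a multipage app.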
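The download branch above writes the summary held in session state to a text file. That file-writing step can be sketched on its own, outside Streamlit, as follows. The function name save_summary and the explicit UTF-8 encoding are assumptions for illustration; the with statement (or write_text, as here) closes the file automatically, so no separate close() call is needed.

```python
from pathlib import Path


def save_summary(summary, filename):
    """Write the summary to filename as UTF-8 text and return the file path."""
    path = Path(filename)
    path.write_text(summary, encoding="utf-8")  # opens, writes, and closes the file
    return path


if __name__ == "__main__":
    saved = save_summary("A short generated summary.", "text_summarization.txt")
    print(f"AI Summarization saved in {saved}")
```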
With the venv active, run from the terminal:
streamlit run LaMini-TextSummarizer.py
Paste the text of the article you want to summarize, then press the button.




    

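Pipelines are amazing. Even with modest hardware, you can run everything we need on your own computer (LaMini-LM also runs on CPU only). Try different settings to improve the quality of the summary.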
  
 