1. Installing PyPDF2 with pip

PyPDF2 supports the following versions of the Python interpreter:

It can be installed directly with pip: pip install PyPDF2
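The method names used later in this post depend on the installed PyPDF2 version, so it can help to confirm what pip actually installed. The following is only a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for illustration and is not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("PyPDF2")
    if found is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print("PyPDF2 version:", found)
```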
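2. A simple example of extracting PDF content with PyPDF2

A research paper is used here to show how PyPDF2 extracts the content of a PDF file.
The paper, "ImageNet Classification with Deep Convolutional Neural Networks", has 9 pages in total; its first page is laid out as follows:

Python script: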
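from PyPDF2 import PdfReader
# Earlier versions called this class PdfFileReader; that name is deprecated and was renamed to PdfReader, see: https://pypdf2.readthedocs.io/en/latest/_modules/PyPDF2/_reader.html?highlight=PdfFileReader#
pdf_path = "paper.pdf"  # placeholder path (not given in the original post); point it at your own PDF file
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
# Before version 1.28.0 this was numPages, which is now deprecated, see: https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.numPages
print(number_of_pages)  # print the number of pages
page = reader.pages[0]
# Before version 1.28.0 this was getPage(pageNumber), which is now deprecated, see: https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.getPage
print(page)  # print the Page <PyPDF2._page.Page> object for the first page of the PDF
text = page.extract_text()
# Before version 1.28.0 this was extractText(), which is now deprecated, see: https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extractText
print(text)  # print the text extracted from the first page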
Output:
{'/Contents': IndirectObject(13, 0), '/Parent': IndirectObject(1, 0), '/Type': '/Page', '/Resources': IndirectObject(14, 0), '/MediaBox': [0, 0, 612, 792]}
ImageNet Classication with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of ve convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a nal 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efcient GPU implemen-
tation of the convolution operation. To reduce overtting in the fully-connected
layers we employed a recently-developed regularization method called ﬁdropoutﬂ
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overtting. Until recently, datasets of labeled images were relatively
small Š on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of tho usands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specied even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don't have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
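As the output shows, both the page count and the text of the PDF can be extracted. The text is not a perfect copy, though: the "fi" ligature is dropped (for example "Classication" and "efcient"), and the quotation marks and the dash come out as "ﬁ", "ﬂ" and "Š".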
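The extracted text above keeps the line breaks of the PDF layout, so words hyphenated at the end of a line (such as "dif-" / "ferent") stay split. The following is a minimal post-processing sketch using only the standard library. It is not a PyPDF2 feature, and the function name join_hyphenated_lines is an assumption made for illustration. The rule is deliberately simple and will also merge genuine compounds that happen to break at the hyphen (for example "current-" / "best"), so treat it as a starting point rather than a complete fix.

```python
import re


def join_hyphenated_lines(text):
    """Join words that were split by a hyphen at the end of a line."""
    # A hyphen right after a letter, followed by a newline and a lowercase letter,
    # is treated as a line-break hyphen and removed together with the newline.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


if __name__ == "__main__":
    sample = "images in the ImageNet LSVRC-2010 contest into the 1000 dif-\nferent classes."
    print(join_hyphenated_lines(sample))
```

Other line breaks are left untouched, so the paragraph structure of the page is preserved.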
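PDF is a widely used document format. With the Python package PyPDF2, PDF documents can be processed quickly and in bulk, including extracting text, splitting or merging PDF files, creating annotations, and encrypting and decrypting files. This post covers how to install PyPDF2 and the basics of using it.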
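To extend the first-page example to a whole document, the same two calls (reader.pages and page.extract_text()) can be applied in a loop. This is a hedged sketch: the helper names extract_all_text and extract_text_from_file are hypothetical, and the `or ""` fallback is a defensive assumption for pages that yield no text.

```python
def extract_all_text(reader):
    """Return a list holding the extracted text of every page of a reader object."""
    # Relies only on reader.pages and page.extract_text(), as in the first-page example.
    return [page.extract_text() or "" for page in reader.pages]


def extract_text_from_file(pdf_path):
    """Open a PDF with PyPDF2 and return its text page by page."""
    # Imported here so that extract_all_text can be used and tested without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return extract_all_text(PdfReader(pdf_path))
```

Calling extract_text_from_file on the 9-page paper used above would be expected to return a list with one string per page.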