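How to convert a PDF to clean HTML with Python

1 downvote

I want users to upload a PDF, convert that PDF to HTML, and insert the resulting markup into a <div> to display the document. I am using PDFMiner to parse the uploaded PDF. When I convert it to HTML, the markup is messy and the document renders incorrectly (see: HTML Mess). I have also tried the XML output, but that is unusable as well because the text comes out without spaces. How can I improve this? Thank you.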

from io import BytesIO

from flask import render_template
from markupsafe import Markup
from pdfminer.converter import HTMLConverter, TextConverter, XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def main():
    # `file` and `instructions` are defined elsewhere in the Flask view (not shown).
    contentRaw = convert_pdf(file.filename, 'html')
    contentOut = contentRaw
    return render_template('app.html', title=" App", filename=file.filename,
                           content=Markup(contentOut), instructions=instructions)


def convert_pdf(path, format='text', codec='utf-8', password=''):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos = set()
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # Close the device before reading the buffer so any trailing output is written.
    device.close()
    text = retstr.getvalue().decode(codec)
    retstr.close()
    return text
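
One way to sidestep the messy converter markup is to extract plain text (for example with convert_pdf(path, 'text') from the code above) and build simple HTML from it yourself. The following is a minimal sketch under that assumption, not a layout-preserving converter: the helper name text_to_clean_html is hypothetical, it uses only the standard library, and it treats blank lines as paragraph breaks and form feeds as page breaks, which may not match every PDF.

```python
import html
import re


def text_to_clean_html(text: str) -> str:
    """Wrap plain extracted text in <p> elements, one per blank-line-separated block."""
    # Normalise line endings; treat form feeds (assumed page separators) as paragraph breaks.
    text = text.replace("\r\n", "\n").replace("\f", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Join the wrapped lines of a block into one line separated by single spaces.
        line = " ".join(block.split())
        if line:
            # Escape <, > and & so the text cannot break the surrounding page.
            paragraphs.append(f"<p>{html.escape(line)}</p>")
    return "\n".join(paragraphs)


sample = "First line\nwrapped here.\n\nSecond <paragraph> & more.\f"
print(text_to_clean_html(sample))
```

The result can be passed to Markup(...) in the template call as before. All positioning, fonts and images are lost, so this only suits text-heavy documents.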

3 comments

Maximouse:

You should use a library like PDF.js to display the PDF instead of converting it to HTML. mozilla.github.io/pdf.js

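lukebaker:

@MaxiMouse I want to display the PDF inside another element on my page, such as a div. Can PDF.js help with that? Also, I thought PDF.js runs on the web server, and I am using Python for this project. Is there a library similar to PDF.js written in Python? I have not found anything reliable.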

Maximouse:

PDF.js runs in the browser. It renders the PDF onto a <canvas> element, which you can put inside a div.

python

html

pdf

flask

pdfminer

lukebaker

Posted on 2019-11-26

3 answers

Maksym Polshcha

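Posted on 2021-02-16

0 upvotes

PDF is a very broad format. Unlike HTML, it is not just a markup language.

Writing a PDF-to-HTML converter that preserves the look of the document is a fairly complex undertaking: your software has to understand all the commands and objects, maintain the graphics state, and so on. In other words, it has to do everything a conforming PDF reader or viewer does, and then convert the document content to HTML.

You can start from the PDF 1.7 specification.

I would suggest looking at the browser side and coding a custom PDF viewer, or otherwise processing the text and PDF commands it can extract.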
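
To give a feel for what "understanding commands and maintaining graphics state" involves, here is a deliberately tiny sketch that interprets a handful of PDF content-stream operators (q, Q, cm, BT, Td, Tj) and reports where each string would be placed. It is a toy under strong assumptions: the function name and the sample stream are made up, only translation-only cm matrices are handled, and fonts, text matrices, string escapes and every other operator are ignored. A real converter has to handle all of those.

```python
import re

# Tokens: literal strings "(...)" without nesting or escapes, or whitespace-separated words.
_TOKEN = re.compile(r"\(([^()]*)\)|(\S+)")


def extract_positioned_text(stream: str):
    """Return (x, y, text) tuples for a tiny subset of PDF content-stream operators."""
    offset = (0.0, 0.0)    # translation part of the current transformation matrix
    saved = []             # graphics-state stack maintained by q / Q
    text_pos = (0.0, 0.0)  # text position inside a BT ... ET block
    operands, placed = [], []
    for match in _TOKEN.finditer(stream):
        string, word = match.groups()
        if string is not None:
            operands.append(string)
            continue
        if word.startswith("/"):
            operands.append(word)  # a name operand such as /F1
            continue
        try:
            operands.append(float(word))
            continue
        except ValueError:
            pass  # not a number, so treat the word as an operator
        if word == "q":
            saved.append(offset)
        elif word == "Q":
            offset = saved.pop()
        elif word == "cm":
            # Operands are a b c d e f; only the translation (e, f) is applied here.
            offset = (offset[0] + operands[4], offset[1] + operands[5])
        elif word == "BT":
            text_pos = (0.0, 0.0)
        elif word == "Td":
            text_pos = (text_pos[0] + operands[0], text_pos[1] + operands[1])
        elif word == "Tj":
            placed.append((offset[0] + text_pos[0], offset[1] + text_pos[1], operands[0]))
        operands = []  # every operator consumes its operands
    return placed


sample_stream = """
q
1 0 0 1 50 700 cm
BT /F1 12 Tf 10 20 Td (Hello) Tj ET
Q
BT /F1 12 Tf 0 -14 Td (World) Tj ET
"""
for x, y, text in extract_positioned_text(sample_stream):
    print(x, y, text)
```

Note how the position of "World" depends on Q having restored the state saved by q. Losing track of that stack is one way a converter ends up with misplaced text.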

mphil4

Posted on 2021-02-16

0 upvotes

Why not try an existing PDF-to-HTML converter? Here is an example that uses an existing library.

import pdftables_api
c = pdftables_api.Client('my-api-key')
c.html('input.pdf', 'output.html')

Tilal Ahmad

Posted on 2021-02-16

0 upvotes

If you are interested in trying some other Python packages, I would suggest the Aspose.Words Cloud SDK for Python. It preserves formatting in PDF-to-HTML conversion.

# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile

# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id = 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx'
client_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id, client_secret)
words_api.api_client.configuration.host = 'https://api.aspose.cloud'

filename = 'C:/Temp/02_pages.pdf'
dest_name = 'C:/Temp/02_pages.html'

# Convert PDF to HTML
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='html')
result = words_api.convert_document(request)