Based on our conversation in the OP, here are some options for you to consider.
Option 1:
If you are using a PDF directly as your input file:
# The method names below follow the older camelCase PyMuPDF API;
# newer releases use snake_case (page_count, load_page, get_text).
import fitz  # PyMuPDF

input_file = '/path/to/your/pdfs/'
pdf_file = input_file
doc = fitz.open(pdf_file)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)
    pageTextblocks = page.getText('blocks')  # This creates a list of items (x0,y0,x1,y1,"line1\nline2\nline3...",...)
    pageTextblocks.sort(key=lambda block: block[3])  # order the blocks by their bottom edge (y1)
    for block in pageTextblocks:
        targetBlock = block[4]  # This gets to the content of each block and you can work your logic here to get relevant data
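As a rough sketch of the "work your logic here" step, the helper below takes a list of block tuples and returns their non-empty text lines in reading order. It only assumes that the first five items of each tuple are (x0, y0, x1, y1, text), as in the snippet above. The name `blocks_to_lines` and the sample data are made up for illustration and are not part of PyMuPDF.

```python
def blocks_to_lines(blocks):
    """Return the non-empty text lines of the blocks in reading order.

    Each block is assumed to be a tuple whose first five items are
    (x0, y0, x1, y1, text).
    """
    # Sort top-to-bottom first, then left-to-right for blocks at the same height.
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    lines = []
    for block in ordered:
        # A block holds several lines separated by "\n".
        for line in block[4].split("\n"):
            line = line.strip()
            if line:
                lines.append(line)
    return lines


# Hypothetical blocks standing in for the result of page.getText('blocks')
sample = [
    (50, 200, 300, 230, "Total: 42\n", 1, 0),
    (50, 100, 300, 130, "Invoice\nNo. 7\n", 0, 0),
]
print(blocks_to_lines(sample))
```

This sorts on the top edge (y0) and then on x0, which is one possible choice; sorting on block[3] as above works just as well when the blocks do not overlap vertically.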
Option 2:
If you are using an image as your input, you need to convert it to a PDF before processing it with the code snippet from Option 1.
doc = fitz.open(input_file)  # open the image file
pdfbytes = doc.convertToPDF()  # convert it to PDF data held in memory
pdf = fitz.open("pdf", pdfbytes)  # open that data as a PDF document
A useful tip when working with images in PyMuPDF: if the image is hard to read, you can use a zoom factor to increase the resolution.
zoom = 1.2 # scale the image by 120%
mat = fitz.Matrix(zoom,zoom)
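To pick a zoom value less arbitrarily, it may help to think in dpi. PyMuPDF renders a page at 72 dpi when no matrix is given, so the zoom factor is just the target resolution divided by 72. The two helper names below are my own and only do that arithmetic; the 300 dpi figure is the commonly quoted target for OCR, not something I have tuned for your files.

```python
BASE_DPI = 72  # resolution PyMuPDF renders at when no matrix is given


def zoom_for_dpi(target_dpi):
    """Zoom factor that renders a page at roughly target_dpi."""
    return target_dpi / BASE_DPI


def dpi_for_zoom(zoom):
    """Effective resolution produced by a given zoom factor."""
    return zoom * BASE_DPI


print(round(dpi_for_zoom(1.2), 1))   # resolution you get with zoom = 1.2
print(round(zoom_for_dpi(300), 2))   # zoom needed for about 300 dpi
```

You would then pass the result on as before, for example zoom = zoom_for_dpi(300) followed by mat = fitz.Matrix(zoom, zoom). Keep in mind that the image size grows with the square of the zoom factor.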
Option 3:
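Since you mentioned tesseract, here is a hybrid approach using PyMuPDF and pytesseract. I am not sure whether this approach fits your need to extract Hebrew, but it is an idea. This example is for PDFs.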
import fitz
import pytesseract
from PIL import Image  # needed for Image.open below

pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract/cmd'
input_file = '/path/to/pdfs'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)
zoom = 1.2
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)  # load the page by its index
    pix = page.getPixmap(matrix=mat)
    output = '/path/to/save/image' + str(pageNo) + '.png'  # writePNG produces PNG data
    pix.writePNG(output)
    print('Converting PDFs to Image ... ' + output)
    text_of_each_page = str(pytesseract.image_to_string(Image.open(output)))
    fullText += text_of_each_page  # collect the text of all pages
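For the Hebrew part, here is a hedged variant of the same idea that keeps the images in memory and passes a language code to Tesseract. It assumes a recent PyMuPDF (snake_case names such as get_pixmap), that the Hebrew language data for Tesseract is installed, and that 'heb' is the code you need. The function names `ocr_pdf` and `join_pages` and the default zoom are my own choices, and I have not tried this on your documents.

```python
def join_pages(page_texts):
    """Join per-page OCR results into one string, skipping empty pages."""
    return "\n\n".join(t.strip() for t in page_texts if t.strip())


def ocr_pdf(pdf_path, lang="heb", zoom=3.0):
    """OCR every page of a PDF and return the combined text."""
    # Imported here so join_pages can be used without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    doc = fitz.open(pdf_path)
    mat = fitz.Matrix(zoom, zoom)
    texts = []
    for page in doc:
        pix = page.get_pixmap(matrix=mat)  # getPixmap in older releases
        # A default pixmap is RGB without alpha, so its raw samples map to "RGB".
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        texts.append(pytesseract.image_to_string(img, lang=lang))
    return join_pages(texts)
```

You would call it as text = ocr_pdf('/path/to/your.pdf'), where the path is a placeholder. Building the PIL image straight from the pixmap avoids writing one file per page; if you want to inspect the rendered pages, saving them to disk as in the snippet above is still the easier route.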