import pyPdf
# Open the file in binary mode; a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages) # Process all the objects.
print pdf.resolvedObjects

Now I need to extract a non-standard object from the PDF file.

My object is named MYOBJECT and it is a string.

The part of the script's output that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The pdf file is this:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
584 0 obj
<</Length 8>>stream
1_22_4_1     --->>>>  this is the string I need to extract from the object
endstream
endobj

How can I follow the reference to object 584 to get at my string (using pyPdf, of course)?

If the information in my answer doesn't help, then, as Jehiah says, an example PDF file would make it easy to give you real code. Email it to tony.meyer@gmail.com if you don't want to post it publicly. – Tony Meyer Jan 14, 2009 at 9:34

Each element in pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing it on its own, or poke at it with help() and dir() at a Python prompt to learn how to get the string out of it.

Edit:

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT']; its value can be retrieved via getData().

The following function solves this more generically by searching recursively for the key in question:

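import types
import pyPdf

pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
pages = list(pdf.pages)  # walk the pages so pyPdf resolves their objects

def findInDict(needle, haystack):
    # Depth-first search for the key `needle` in nested dictionaries.
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be resolved.
            continue
        if key == needle:
            return value
        if type(value) == types.DictType or isinstance(value, pyPdf.generic.DictionaryObject):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None  # key not found anywhere below haystack

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()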

pdf.resolvedObjects[0][n] says KeyError: 0. This works for me: pdf.resolvedObjects[(0,n)] – stenci May 30, 2014 at 18:40
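
The comment above suggests that resolvedObjects is keyed in two different ways depending on the pyPdf version: nested as resolvedObjects[generation][idnum], or flat as resolvedObjects[(generation, idnum)]. A minimal sketch of a lookup that tolerates both layouts follows. It uses plain dictionaries in place of a real PDF, and lookup_resolved is a hypothetical helper, not part of pyPdf.

```python
def lookup_resolved(resolved, generation, idnum):
    """Return the cached object for (generation, idnum), or None if absent.

    Hypothetical helper: `resolved` stands in for pdf.resolvedObjects.
    """
    # Flat layout: keys are (generation, idnum) tuples.
    if (generation, idnum) in resolved:
        return resolved[(generation, idnum)]
    # Nested layout: resolved[generation][idnum].
    inner = resolved.get(generation)
    if isinstance(inner, dict):
        return inner.get(idnum)
    return None


# Stand-in data mimicking the two layouts described in this thread.
nested = {0: {558: {"/Type": "/Page"}}}
flat = {(0, 558): {"/Type": "/Page"}}

print(lookup_resolved(nested, 0, 558))
print(lookup_resolved(flat, 0, 558))
```

With a real file, one would pass pdf.resolvedObjects as the first argument after iterating over pdf.pages.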

An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.
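
To make the "link or alias" idea concrete, here is a toy model of an indirect reference. It is only an illustration of the concept, not pyPdf's implementation: ToyIndirectObject and the objects table are invented for this sketch, and only the method name getObject is borrowed from pyPdf.

```python
class ToyIndirectObject(object):
    """Toy stand-in for an indirect reference such as '584 0 R'."""

    def __init__(self, idnum, generation, table):
        self.idnum = idnum
        self.generation = generation
        self._table = table  # shared table of all objects in the "file"

    def getObject(self):
        # Follow the reference to the object it points at.
        return self._table[(self.generation, self.idnum)]

    def __repr__(self):
        return "IndirectObject(%d, %d)" % (self.idnum, self.generation)


# Object 584 holds the string; object 558 only holds a reference to it.
objects = {}
objects[(0, 584)] = "1_22_4_1"
objects[(0, 558)] = {"/MYOBJECT": ToyIndirectObject(584, 0, objects)}

ref = objects[(0, 558)]["/MYOBJECT"]
print(repr(ref))
print(ref.getObject())
```

The dictionary for object 558 stays small because it stores the pair (584, 0) rather than a copy of the data.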

If the object is a text object, then just doing a str() or unicode() on the object should get you the data inside of it.

Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object:

13 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj

Can be read with this:

>>> import pyPdf
>>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
>>> pages = list(pdf.pages)
>>> pdf.resolvedObjects
{0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
>>> pdf.resolvedObjects[0][13]
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}

Jehiah's method is good if looking everywhere for the object. My guess (looking at the PDF) is that it is always in the same place (the first page, in the 'MC0' property), and so a much simpler method of finding the string would be:

import pyPdf
pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))
pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()

How do I figure out the "filters" ['/Resources']['/Properties']['/MC0']['/MYOBJECT'] that you are referring to? – Sundeep Pidugu Apr 22, 2019 at 8:04
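
One way to discover a key path like the one asked about in the comment above is to walk the nested dictionaries and record the keys passed on the way to the target. The sketch below does this on plain nested dicts that mimic the page dictionary from the question. find_key_paths is a hypothetical helper, and it is an assumption that real pyPdf page dictionaries can be traversed the same way. It does not descend into lists and does not guard against reference cycles.

```python
def find_key_paths(needle, haystack, path=()):
    """Yield every tuple of keys that leads from haystack to needle."""
    if not isinstance(haystack, dict):
        return
    for key in haystack.keys():
        here = path + (key,)
        if key == needle:
            yield here
        # Recurse into the value; non-dict values end the walk.
        for found in find_key_paths(needle, haystack[key], here):
            yield found


# Stand-in for the page dictionary shown in the question.
page = {
    "/Resources": {
        "/Properties": {
            "/MC0": {"/MYOBJECT": "1_22_4_1"},
            "/MC1": {"/SubKey": "other"},
        },
    },
    "/Rotate": 0,
}

for found_path in find_key_paths("/MYOBJECT", page):
    print(found_path)
```

Each printed tuple lists the keys to apply in order, starting from the page dictionary.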
        
