I have to retrieve tables and the paragraphs immediately before and after them from a docx file, but I can't work out how to do this with python-docx.
I can get a list of paragraphs with document.paragraphs, and a list of tables with document.tables.
How can I get an ordered list of document elements, like this?
Paragraph1,
Paragraph2,
Table1,
Paragraph3,
Table3,
Paragraph4,
python-docx doesn't yet have API support for this; interestingly, the Microsoft Word API doesn't either.
But you can work around this with the following code. Note that it's a bit brittle because it makes use of python-docx internals that are subject to change, but I expect it will work just fine for the foreseeable future:
#!/usr/bin/env python
# encoding: utf-8

"""Testing iter_block_items()"""

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    # Locate the XML element whose direct children are the block items.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("parent must be a Document or _Cell object")

    # Wrap each <w:p> or <w:tbl> child in its proxy class; skip anything else.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
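Once the blocks are in document order, the paragraphs just before and after each table (what the question asks for) can be picked out in a single pass. This is only a minimal sketch, not part of python-docx: the function name tables_with_neighbors and its is_table / is_paragraph predicate parameters are hypothetical, and it is written against any sequence so it can be tried without a .docx file.

```python
def tables_with_neighbors(blocks, is_table, is_paragraph):
    """Return (previous_paragraph, table, next_paragraph) for each table.

    *blocks* is any iterable of block items in document order. A neighbor
    is None when the adjacent block is missing or is not a paragraph.
    """
    blocks = list(blocks)
    result = []
    for i, block in enumerate(blocks):
        if not is_table(block):
            continue
        prev_block = blocks[i - 1] if i > 0 else None
        next_block = blocks[i + 1] if i + 1 < len(blocks) else None
        result.append((
            prev_block if prev_block is not None and is_paragraph(prev_block) else None,
            block,
            next_block if next_block is not None and is_paragraph(next_block) else None,
        ))
    return result


# Stand-in data: strings play the role of paragraphs, dicts the role of tables.
demo = ["Paragraph1", "Paragraph2", {"name": "Table1"}, "Paragraph3", {"name": "Table2"}]
triples = tables_with_neighbors(
    demo,
    is_table=lambda b: isinstance(b, dict),
    is_paragraph=lambda b: isinstance(b, str),
)
for before, table, after in triples:
    print(before, table["name"], after)
```

With the generator above, the call would presumably look like tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table), lambda b: isinstance(b, Paragraph)). When two tables are adjacent, the corresponding neighbor comes back as None rather than as the other table.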
There is some more discussion of this here:
https://github.com/python-openxml/python-docx/issues/276
This is resolved in the pull request below as the property Document.story, which contains paragraphs and tables in document order:
https://github.com/python-openxml/python-docx/pull/395
document = Document('test.docx')
document.story
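Whether Document.story is available will depend on the python-docx build you have installed, so it may be safer to feature-detect it. The sketch below is an assumption-laden illustration: blocks_in_order is a hypothetical helper name, the story attribute is taken from the pull request above, and iter_inner_content() is a method I believe newer python-docx releases provide on Document, though you should confirm that against your version. The fallback argument is meant to be a function like iter_block_items from the earlier answer.

```python
def blocks_in_order(document, fallback):
    """Return the paragraphs and tables of *document* in document order.

    Tries, in turn:
    1. a `story` attribute (as described in the pull request; assumed iterable),
    2. an `iter_inner_content()` method (assumed to exist in newer releases),
    3. *fallback*, a callable taking the document, e.g. iter_block_items.
    """
    story = getattr(document, "story", None)
    if story is not None:
        return list(story)

    iter_inner = getattr(document, "iter_inner_content", None)
    if callable(iter_inner):
        return list(iter_inner())

    return list(fallback(document))
```

It would be called as blocks_in_order(document, iter_block_items), so code written against it keeps working whichever of the three routes the installed library supports.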