Retrieve document content with document structure with python-docx

I have to retrieve tables and the paragraphs immediately before and after them from a docx file, but I can't figure out how to do this with python-docx.

I can get a list of paragraphs with document.paragraphs.

I can get a list of tables with document.tables.

How can I get an ordered list of document elements, like this?

Paragraph1, Paragraph2, Table1, Paragraph3, Table2, Paragraph4

python-docx doesn't yet have API support for this; interestingly, the Microsoft Word API doesn't either.

But you can work around this with the following code. Note that it's a bit brittle because it makes use of python-docx internals that are subject to change, but I expect it will work just fine for the foreseeable future:

#!/usr/bin/env python
# encoding: utf-8
"""Testing iter_block_items()"""
from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    # Find the XML element whose direct children are the block items.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # Wrap each child in its proxy class; other child elements are skipped.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
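Since the question asks for the paragraphs before and after each table, here is a minimal sketch of how the ordered block sequence could be turned into (previous paragraph, table, next paragraph) triples. The helper name tables_with_neighbors and its is_table predicate are my own, not part of python-docx, and plain strings stand in for real Paragraph and Table objects so the sketch runs without a .docx file:

```python
def tables_with_neighbors(blocks, is_table):
    """Pair each table in *blocks* with the nearest paragraph on either side.

    *blocks* is any iterable of block items in document order and *is_table*
    is a predicate that returns True for table items. Returns a list of
    (previous_paragraph, table, next_paragraph) tuples; a neighbor is None
    when no paragraph exists on that side.
    """
    blocks = list(blocks)
    results = []
    for index, block in enumerate(blocks):
        if not is_table(block):
            continue
        # Walk backwards, skipping any adjacent tables, to the closest paragraph.
        previous_paragraph = next(
            (b for b in reversed(blocks[:index]) if not is_table(b)), None
        )
        # Walk forwards in the same way.
        next_paragraph = next(
            (b for b in blocks[index + 1:] if not is_table(b)), None
        )
        results.append((previous_paragraph, block, next_paragraph))
    return results


# Stand-in data: strings starting with "<" play the role of tables.
sample = ["Paragraph1", "Paragraph2", "<Table1>", "Paragraph3", "<Table2>", "Paragraph4"]
for prev, table, nxt in tables_with_neighbors(sample, lambda b: b.startswith("<")):
    print(prev, table, nxt)
# Paragraph2 <Table1> Paragraph3
# Paragraph3 <Table2> Paragraph4
```

With a real document, the call would presumably look like tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table)).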
There is some more discussion of this here:

https://github.com/python-openxml/python-docx/issues/276
Can you help me out on this question which is similar to this? [stackoverflow.com/questions/56787961/…
– Karthick Mohanraj, Jun 27, 2019 at 9:39
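Resolved by the Document.story property, which contains paragraphs and tables in document order. It comes from the pull request below, so it may not be present in the python-docx version you have installed:
https://github.com/python-openxml/python-docx/pull/395
from docx import Document
document = Document('test.docx')
document.story  # paragraphs and tables, in document order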