Python Convert HTML into JSON using Soup

1. The HTML will start with one of the following tags: <p>, <ol> or <ul>.

2. When any of the step 1 tags is found, its content will contain only the following tags: <em>, <strong> and <span style="text-decoration:underline">.

3. Map the step 2 tags as follows: <strong> becomes {"bold": True} in the JSON, <em> becomes {"italics": True}, and <span style="text-decoration:underline"> becomes {"decoration": "underline"}.

4. Any text found becomes {"text": "this is the text"} in the JSON.
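The mapping rules above can be sketched as a small lookup table. This is only an illustration: the names STYLE_MAP and style_for are hypothetical and do not appear in the question, and the underline check is a simple substring test on the style attribute.

```python
# Hypothetical helper names; only the tag-to-key mapping comes from the rules above.
STYLE_MAP = {
    "strong": {"bold": True},
    "em": {"italics": True},
}


def style_for(tag_name, style_attr=""):
    """Return the JSON properties for one inline tag, per the rules above."""
    if tag_name in STYLE_MAP:
        return dict(STYLE_MAP[tag_name])  # copy so callers can mutate safely
    if tag_name == "span":
        # Drop all whitespace so "text-decoration: underline;" also matches.
        compact = "".join(style_attr.split())
        if "text-decoration:underline" in compact:
            return {"decoration": "underline"}
    raise ValueError("unsupported tag or style: %r" % tag_name)


print(style_for("strong"))
print(style_for("span", "text-decoration: underline;"))
```

Raising ValueError for anything outside the three allowed tags keeps unexpected markup from being silently dropped.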

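Let's say I have some HTML and split it into its top-level tags like this:

from bs4 import BeautifulSoup as Soup  # assuming Soup is this alias

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]

This produces the following array: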
    <p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
    <p>2</p>,
    <p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
    <ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
Applying the rules above, this should be the result.
The first array element would be processed into this JSON:
    {"text": [
        "The name is not mine it is for the people",
        {"text": "stephen", "decoration": "underline", "bold": True, "italics": True},
        {"text": "how can", "bold": True, "italics": True},
        {"text": "name", "italics": True},
        {"text": "good", "bold": True},
        {"text": "his name", "italics": True},
        {"text": "moneuet", "decoration": "underline"},
        {"text": "please ", "italics": True},
        {"text": "forever", "decoration": "underline", "bold": True},
        {"text": "tomorrow", "italics": True},
        {"text": "USA", "bold": True, "italics": True}
    ]}
The second array element would be processed into this JSON:
    {"text": ["2"]}
The third array element would be processed into this JSON:
    {"text": [
        {"text": "moment", "bold": True},
        {"text": "Africa", "italics": True},
        {"text": "China", "italics": True},
        {"text": "home", "decoration": "underline"},
        {"text": "thomas", "italics": True},
        {"text": "nothing", "bold": True}
    ]}
The fourth array element would be processed into this JSON:
    {"ol": [
        "first item",
        {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
    ]}
This is my attempt so far. I am able to drill down into the tags, but I don't know how to process the arrayOfTextAndStyles array:
from bs4 import BeautifulSoup as Soup  # assuming Soup is this alias

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
for foundTag in allTags:
    foundTagStyles = [tag for tag in foundTag.find_all(recursive=True)]
    if len(foundTagStyles) > 0:
        if str(foundTag.name) == "p":
            # The paragraph's own direct text first, then one entry per nested tag.
            arrayOfTextAndStyles = [
                {"tag": foundTag.name,
                 "text": foundTag.find_all(text=True, recursive=False)}
            ] + [
                {"tag": tag.name,
                 "text": foundTag.find_all(text=True, recursive=False)}
                for tag in foundTag.find_all()
            ]
        elif str(foundTag.name) == "ol":
            pass  # not handled yet
        elif str(foundTag.name) == "ul":
            pass  # not handled yet
                You need to come up with a more consistent output format; why is the second paragraph not resulting in a list, while the others all do? Why doesn't the third paragraph have an initial text element before all the nested dictionaries?
– Martijn Pieters
                Sep 29, 2017 at 7:05
                Alternatively, why not wrap all text in a dictionary? So for the first example, the first element would be {"text": "The name is not mine it is for the people"}.
– Martijn Pieters
                Sep 29, 2017 at 7:07
                Where did can go in your first example? How should <em><strong> how can</strong>name </em> be handled, really? It's a nested structure with text at two levels.
– Martijn Pieters
                Sep 29, 2017 at 7:45
                There is also a space between '<strong>good</strong>' and <em>his name ..., followed by more nesting.
– Martijn Pieters
                Sep 29, 2017 at 7:46
I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:
from bs4 import NavigableString
def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}
def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}
    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')
    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result
You then parse your document:
for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)
Using pprint, this results in:
{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}
Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
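If a flat list of runs is preferred, closer to the format the question asked for, the nested result can be flattened afterwards by pushing each parent's styles down to its text leaves. The sketch below is not part of the original answer: flatten is a hypothetical helper, and it assumes the dictionary shape shown in the output above (a 'text' key holding either a string or a list of dictionaries). An 'ol' result would need its items flattened one at a time.

```python
def flatten(node, inherited=None):
    """Return a flat list of text runs, each carrying its ancestors' styles."""
    # Merge the parent's styles with this node's own (everything except 'text').
    styles = dict(inherited or {})
    styles.update({k: v for k, v in node.items() if k != "text"})
    content = node["text"]
    if isinstance(content, str):
        # Leaf: a single run of text with all accumulated styles.
        return [dict(styles, text=content)]
    runs = []
    for child in content:
        runs.extend(flatten(child, styles))
    return runs


# One entry copied from the pprint output above.
nested = {"italics": True,
          "text": [{"bold": True, "text": " how can"}, {"text": "name "}]}
for run in flatten(nested):
    print(run)
```

Because NavigableString is a subclass of str, the isinstance check should also work on the values produced by parse_text directly.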
The structure is also reasonably consistent: a 'text' key points either at a single string or at a list of dictionaries, and such a list never mixes types. You could improve on this still: have 'text' only ever point to a string, and use a different key, such as 'contains' or 'nested', to signify nested data, so that each dictionary uses just one or the other. All that would require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
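One way to get that stricter shape without touching the parser is to post-process the result, as sketched below. The helper name use_nested_key and the key name 'nested' are assumptions chosen for illustration; the answer only suggests that some such key could be used.

```python
def use_nested_key(node):
    """Return a copy where 'text' holds only strings and lists move to 'nested'."""
    converted = {k: v for k, v in node.items() if k != "text"}
    content = node["text"]
    if isinstance(content, str):
        converted["text"] = content
    else:
        # A list of child dictionaries: convert each child recursively.
        converted["nested"] = [use_nested_key(child) for child in content]
    return converted


# One entry in the shape produced by parse_text() above.
entry = {"italics": True,
         "text": [{"text": "tomorrow"}, {"bold": True, "text": "USA"}]}
print(use_nested_key(entry))
```

With this shape a consumer can test for 'nested' versus 'text' instead of checking the type of the 'text' value.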
                Is it possible to convert the entire result into a valid JSON array, for example with json.dumps(result)? The result looks very promising, but the final output is not in JSON format.
– Ernest Appiah
                Sep 29, 2017 at 8:36
                @ErnestAppiah: I've updated the answer to fix a small bug in handling nested elements with multiple children.
– Martijn Pieters
                Sep 29, 2017 at 8:46
                @ErnestAppiah: the final output is trivial to produce. Instead of print(result) in the last snippet (where I loop over soup.select('ol,p')), append the result to a list. Then use json.dumps(list_produced).
– Martijn Pieters
                Sep 29, 2017 at 8:46
                Thanks a lot man. You are awesome. How can I give you 500 stars? Thanks so much. I really appreciate your help.
– Ernest Appiah
                Sep 29, 2017 at 8:50
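The approach described in the comments, collecting each result in a list and serialising that list once, might look like the sketch below. To keep it self-contained, the parsed results are copied from the output above instead of being produced by parse(); the variable names are illustrative.

```python
import json

# Results as parse() printed them above (second paragraph and the list).
parsed_results = [
    {"text": [{"text": "2"}]},
    {"ol": [{"text": "first item"},
            {"bold": True, "decoration": "underline",
             "italics": True, "text": "second item"}]},
]

# In the real loop this would be collected.append(parse(candidate))
# in place of print(result).
collected = []
for result in parsed_results:
    collected.append(result)

# json.dumps writes Python's True as JSON's true and uses double quotes.
json_text = json.dumps(collected)
print(json_text)
```

The Python-style True used throughout the question is not valid JSON, so the serialised text contains true.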