I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))
                But what if your HTML has a quoted string, comment, JavaScript, or CDATA section containing </html>? Or what if the garbage at the end itself contains a </html>? Unless you can guarantee that none of those cases can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length: HTTP header).
– Adam Rosenfield
                Jul 13, 2012 at 18:16

No. Regular expressions in Python are handled by the re module.

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)
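To see why the original call appeared to do nothing, here is a minimal sketch. The sample article string is made up for illustration. str.replace searches for the literal text '</html>.+', which never occurs in the input, so the string comes back unchanged; re.sub treats the same text as a pattern.

```python
import re

# Hypothetical sample input: HTML followed by trailing garbage.
article = "<html>\n<body>Hi</body>\n</HTML>\ngarbage\nmore garbage"

# str.replace treats its first argument as literal text, not a pattern,
# so nothing matches and the string is returned unchanged.
unchanged = article.replace('</html>.+', '</html>')

# re.sub interprets the pattern as a regex; (?i) ignores case and
# (?s) lets . match newlines, so everything after the tag is consumed.
cleaned = re.sub(r'(?is)</html>.+', '</html>', article)

print(unchanged == article)
print(repr(cleaned))
```

Note that the replacement text is written in lowercase, so an uppercase closing tag in the input is normalized to '</html>' as a side effect.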
                Is the tag not lowercase, or is it followed by a '\n'? You can make it case-insensitive ((?i) flag) and make . match newlines ((?s) flag) with r'(?is)</html>.+'.
– MRAB
                Jul 13, 2012 at 18:32
                Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern.
– parvus
                Jul 8, 2021 at 5:14
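As a rough illustration of the flags suggestion in the comment above, the call might look like the sketch below. The input string is an assumed example, not taken from the question.

```python
import re

# Hypothetical input with an uppercase tag and multi-line trailing data.
article = "<html>ok</HTML>\nextra\nlines"

# The inline (?is) prefix is replaced by an explicit flags argument.
cleaned = re.sub(r'</html>.+', '</html>', article,
                 flags=re.DOTALL | re.IGNORECASE)

print(repr(cleaned))
```

The flags should be passed by keyword: the fourth positional parameter of re.sub is count, so passing the flags positionally would silently be interpreted as a replacement limit.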

re.sub replaces non-overlapping occurrences of the pattern with the text passed as the replacement string. If you need to analyze each match, for instance to extract information from specific capture groups, you can pass a function as the replacement argument instead of a string. The re.sub documentation has more details.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'
>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
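Passing a function as the replacement could look like the following sketch. The helper name double is hypothetical and chosen only for this example; re.sub calls it once per match with a match object and uses its return value as the replacement text.

```python
import re

def double(match):
    # match.group(1) is the run of digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

# Each numeric path segment is replaced by twice its value.
result = re.sub(r'/(\d+)', double, '/andre/23/abobora/43435')
print(result)
```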

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex based solution.
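One possible variation on the slicing approach, shown as a sketch: the helper name truncate_after is hypothetical. It uses len(tag) rather than a hard-coded 7, and str.find rather than str.index, because index raises ValueError when the tag is missing.

```python
def truncate_after(text, tag="</html>"):
    """Return text cut off just after the first occurrence of tag."""
    pos = text.find(tag)
    if pos == -1:
        # Tag not present: leave the text untouched.
        return text
    return text[:pos + len(tag)]

print(truncate_after("<html>a</html>junk"))
print(truncate_after("no closing tag here"))
```

Whether returning the input unchanged is the right behavior for a missing tag depends on the script; raising an error may be preferable if the tag is always expected.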

                @OleAnders Better, but then you're duplicating that string, which opens another possibility for error.
– Daniel Griscom
                Mar 3, 2018 at 14:30
                I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish.
– Julian
                Mar 3, 2018 at 18:42

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, like this:

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python
article = '''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff'''
se = '</html>'
# Keep everything before the first </html>, then re-append the tag.
with open('out.txt', 'w') as z:
    z.write(article.split(se)[0] + se)

outputs out.txt as

<html>Larala
Ponta Monta 
</html>
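The choice between split and rsplit matters when the tag appears more than once. The sketch below uses a made-up input in which '</html>' also occurs inside a comment, to show how the two differ.

```python
se = '</html>'
# Hypothetical input: the tag appears inside a comment and again at the end.
article = '<html><!-- </html> in a comment -->body</html>trailing'

# split cuts at the FIRST occurrence of the tag.
first = article.split(se, 1)[0] + se

# rsplit with maxsplit=1 cuts at the LAST occurrence of the tag.
last = article.rsplit(se, 1)[0] + se

print(first)
print(last)
```

Neither variant is safe for every input: if the tag is absent, both return the whole string with '</html>' appended, and if the trailing garbage itself contains the tag, rsplit keeps part of the garbage.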