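import re
from html import unescape

def html_to_plain_text(html):
    # Drop the entire <head> section (title, meta tags, inline styles).
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # Replace each opening <a ...> tag with the word HYPERLINK.
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # Strip every remaining tag.
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # Collapse runs of blank or whitespace-only lines into a single newline.
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # Decode HTML entities such as &nbsp; and &amp;.
    return unescape(text)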

For example, take the following HTML text:

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0"  ...

The result after conversion:

Newsletter Discover Tomorrow's Winners For Immediate Release Cal-Bay (Stock Symbol: CBYI) Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI. CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future. Put CBYI on your watch list, acquire a position TODAY. REASONS TO INVEST IN CBYI A profitable company and is on track to beat ALL earnings estimates! One of the FASTEST growing distributors in environmental & safety equipment instruments. Excellent management team, several EXCLUSIVE contracts. IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research. RAPIDLY GROWING INDUSTRY Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi ...
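To see what each substitution contributes, here is a minimal sketch that applies the same passes one at a time to a tiny input. The `sample` string and the `step1` to `step4` names are made up for illustration, and the anchor-tag pattern `<a\s.*?>` is an assumption about what the hyperlink pass matches.

```python
import re
from html import unescape

# Hypothetical miniature input: a head section, an entity, blank lines, and a link.
sample = (
    '<html><head><title>Ad</title></head>\n'
    '<body><b>OTC&nbsp;Newsletter</b>\n'
    '\n'
    '<a href="http://example.com">Click</a> here</body></html>'
)

# Pass 1: remove the whole <head>...</head> section.
step1 = re.sub(r'<head.*?>.*?</head>', '', sample, flags=re.M | re.S | re.I)
# Pass 2: turn each opening anchor tag into the marker word HYPERLINK.
step2 = re.sub(r'<a\s.*?>', ' HYPERLINK ', step1, flags=re.M | re.S | re.I)
# Pass 3: strip all remaining tags, keeping the text between them.
step3 = re.sub(r'<.*?>', '', step2, flags=re.M | re.S)
# Pass 4: collapse consecutive blank lines into one newline.
step4 = re.sub(r'(\s*\n)+', '\n', step3, flags=re.M | re.S)
# Final step: decode entities (&nbsp; becomes the U+00A0 character).
plain = unescape(step4)

print(repr(plain))
```

Note that the link text ("Click") survives because only the opening `<a ...>` tag is replaced; the closing `</a>` is removed by the generic tag pass. Regex-based stripping like this is a quick heuristic and may misbehave on malformed markup or on `<script>` and `<style>` blocks that sit outside `<head>`.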

An example that converts an HTML-format email to plain text:

def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)
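The function above can be exercised with the standard library's `email` package. The sketch below is only an illustration: the two messages, `RAW_HTML_ONLY` and the variable names are invented, and it assumes messages are parsed with `policy.default` so that each part is an `EmailMessage` with a `get_content()` method. Both functions are repeated so the block runs on its own.

```python
import re
from email import message_from_string, policy
from email.message import EmailMessage
from html import unescape


def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)


def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content  # a plain-text part wins as soon as it is seen
        html = content
    if html:
        return html_to_plain_text(html)


# Case 1: a hypothetical HTML-only message parsed from raw text.
RAW_HTML_ONLY = (
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><head><title>ignored</title></head>"
    "<body><p>Hello &amp; welcome</p></body></html>\n"
)
html_only_msg = message_from_string(RAW_HTML_ONLY, policy=policy.default)
html_only_text = email_to_text(html_only_msg)

# Case 2: a multipart/alternative message carrying both plain and HTML bodies.
multipart_msg = EmailMessage()
multipart_msg["Subject"] = "demo"
multipart_msg.set_content("plain body")
multipart_msg.add_alternative("<p>html body</p>", subtype="html")
multipart_text = email_to_text(multipart_msg)

print(repr(html_only_text))
print(repr(multipart_text))
```

In the second case the HTML part is never converted, because the loop returns the first `text/plain` part it meets. If a message contains neither a plain-text nor an HTML part, the function returns `None`, so callers may want to substitute an empty string.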