    ', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
For example, take the following HTML text:
<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b> Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners </b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" ...
The result after conversion:
Newsletter
Discover Tomorrow's Winners
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI. CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts. IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi ...
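The same kind of conversion can be tried on a much smaller input. The following is a minimal sketch, not necessarily the exact function used for the output above: the `<head>` removal, the tag stripping, the blank-line collapsing, and the `unescape` call come from the fragments shown in this post, while the pattern used to match anchor tags (`<a\s.*?>`) and the sample string are assumptions made for illustration.

```python
import re
from html import unescape


def html_to_plain_text(html):
    # Drop the whole <head> section (styles, meta tags, title).
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # Assumption: opening anchor tags are replaced by the word HYPERLINK.
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # Remove every remaining tag.
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # Collapse runs of blank lines into a single newline.
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # Turn entities such as &amp; back into characters.
    return unescape(text)


# Hypothetical sample input, much shorter than the newsletter above.
sample = (
    '<html><head><title>T</title></head><body>'
    '<p>Hello &amp; <a href="x">link</a></p>\n\n<p>World</p>'
    '</body></html>'
)
result = html_to_plain_text(sample)
print(result)
```

Regular expressions are a quick fit for spam-style HTML like the sample, but they do not understand nesting, so a real parser is usually a safer choice for arbitrary pages.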
An example that converts an HTML-format email to plain text:
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        # Only text parts are of interest; skip attachments and containers.
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    # No plain-text part was found; fall back to the converted HTML part.
    if html:
        return html_to_plain_text(html)
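A short usage sketch for email_to_text follows. It assumes the message is an `email.message.EmailMessage` (the class that provides `get_content()`), builds two hypothetical messages in memory, and uses a deliberately simplified stand-in for html_to_plain_text so that the block runs on its own.

```python
import re
from email.message import EmailMessage
from html import unescape


def html_to_plain_text(html):
    # Simplified stand-in (assumption): strip tags and decode entities only.
    return unescape(re.sub(r'<.*?>', '', html, flags=re.S)).strip()


def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        html = content
    if html:
        return html_to_plain_text(html)


# Hypothetical message 1: HTML only, so the HTML branch is used.
html_only = EmailMessage()
html_only.set_content("<p>Hello <b>world</b></p>", subtype="html")
html_result = email_to_text(html_only)

# Hypothetical message 2: plain text plus an HTML alternative.
# The plain-text part comes first in the walk, so it is returned as-is.
both = EmailMessage()
both.set_content("plain body")
both.add_alternative("<p>html body</p>", subtype="html")
both_result = email_to_text(both)

print(repr(html_result))
print(repr(both_result))
```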
A simple Python function that converts HTML to plain text:
import re
from html import unescape
def html_to_plain_text(html):
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = ...
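... folder, its subdirectories, and the files inside those subdirectories, collects every [.html] file under that directory, and returns them as a list object.
2. Once the traversal finishes you have a list of html files. Pass that list to the html_to_txt method, which loops over the html files one by one and extracts the text inside the specified tags.
3. Write the extracted text to a txt file. A replace step can be added here ...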
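The steps above can be sketched as follows. This is only an illustration under stated assumptions: the name html_to_txt comes from the text, while find_html_files, the default tag `p`, the regular-expression extraction, and the particular replace call (non-breaking spaces to ordinary spaces) are hypothetical choices, since the original code is not shown.

```python
import os
import re
from html import unescape


def find_html_files(root):
    """Step 1: walk root and all subdirectories, returning .html paths as a list."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".html"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def html_to_txt(html_files, tag="p"):
    """Steps 2 and 3: extract the text inside <tag> elements and write .txt files."""
    pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s>" % (tag, tag), re.S | re.I)
    written = []
    for path in html_files:
        with open(path, "r", encoding="utf-8") as f:
            html = f.read()
        pieces = []
        for inner in pattern.findall(html):
            # Remove nested tags, then decode entities such as &amp;.
            text = unescape(re.sub(r"<[^>]+>", "", inner))
            # Assumed replace step: normalise non-breaking spaces.
            text = text.replace("\xa0", " ").strip()
            if text:
                pieces.append(text)
        out_path = os.path.splitext(path)[0] + ".txt"
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(pieces))
        written.append(out_path)
    return written
```

Calling `html_to_txt(find_html_files(some_directory))` writes one .txt file next to each .html file and returns the list of written paths.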
This article walks through how to convert HTML to plain text in Python, shared here for reference. The details are as follows:
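Today a project required converting HTML to plain text. A quick web search showed that Python offers a wide variety of ways to do it.
Here are the two methods I tried myself today, for the benefit of later readers: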
1. Install nltk; it can be installed from PyPI.
(Note: it depends on the following packages: numpy, PyYAML)
2. Test code:
>>> import nltk
>>> aa = r'''
<b>Project:</b> DeHTML<br>
<b>Description</b>:<br
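After browsing a few blogs, I saw that some authors wrote their own function to convert html to text. Because the project schedule was tight, I did not want to spend the effort writing one myself.
Here I recommend the clean_html() function in the nltk module. It is used as follows: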
import nltk
html = """
<!DOCTYPE html>
<title>This is a title</title>
</head>
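Note that nltk.clean_html() is only usable in NLTK 2.x; in NLTK 3 the function raises NotImplementedError and points users to BeautifulSoup's get_text() instead. Where neither library is available, a rough equivalent can be sketched with the standard library's html.parser. The class and function names below (TextExtractor, clean_html_sketch) are hypothetical, and the behaviour is an approximation, not a reimplementation of what nltk did.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html_sketch(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample document.
sample = (
    "<!DOCTYPE html><html><head><title>This is a title</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Hello &amp; goodbye</p><script>var x = 1;</script></body></html>"
)
result = clean_html_sketch(sample)
print(result)
```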
$content = preg_replace("/<img\s*src=(\"|\')(.*?)\\1[^>]*>/is",'【图片】', $content);
$content = strip_tags($content);
$content = trim($content);
$content = ereg_rep
import os

file_names = os.listdir(file_path)
i = 1
with open(os.path.join(file_path, file_names[i]), 'r', encoding='utf-8') as f:
    txt = f.read()
The result is as follows:
'\ufeff<html&g
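The leading '\ufeff' in that result is a byte order mark (BOM) stored at the start of the file. One way to avoid it, sketched below, is to open the file with the `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if one is present. The helper name read_html is hypothetical.

```python
def read_html(path):
    # utf-8-sig behaves like utf-8 but strips a leading BOM when present,
    # so files with and without a BOM are both read cleanly.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()
```

With plain `encoding='utf-8'` the BOM is kept as the first character of the string, which is what the output above shows.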
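import urllib.request

# Fetch the html source
r = urllib.request.urlopen("http://www.w3school.com.cn/html5/index.asp")
# The response body is bytes, so the file is opened in binary mode
f = open(r"\Users\Desktop\123.txt", "wb")
f.write(r.read())  # write to the file
f.close()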