I am trying to convert the data in an HTML file to JSON using the code below.
import html_to_json
import json

def htmltojson():
    # Raw string so the backslashes in the Windows path are not read as escapes
    with open(r"C:\Extraction\Sample.html", "r") as html_file:
        html = html_file.read()
    # Keep element text, drop element attributes
    output_json = html_to_json.convert(html, capture_element_attributes=False, capture_element_values=True)
    with open('Final.json', 'w') as outfile:
        json.dump(output_json, outfile, indent=4)
    print(output_json)
The JSON I get contains html, span, and other tags; I only want each key and its value.
The JSON output I got:
{
"html": [
"head": [
"meta": [
"link": [
"title": [
"_value": "252"
"_values": [
"[if gte mso 9]><xml>\n <o:DocumentProperties>\n <o:Author>Sharon Kaufmann</o:Author>\n <o:Template>Normal</o:Template>\n <o:LastAuthor>Aman Pawar</o:LastAuthor>\n <o:Revision>2</o:Revision>\n <o:TotalTime>339</o:TotalTime>\n <o:LastPrinted>2019-11-07T16:41:00Z</o:LastPrinted>\n <o:Created>2022-09-21T22:16:00Z</o:Created>\n <o:LastSaved>2022-09-21T22:16:00Z</o:LastSaved>\n <o:Pages>1</o:Pages>\n <o:Words>1756</o:Words>\n <o:Characters>10014</o:Characters>\n <o:Company>AMS Inc</o:Company>\n <o:Lines>83</o:Lines>\n <o:Paragraphs>23</o:Paragraphs>\n <o:CharactersWithSpaces>11747</o:CharactersWithSpaces>\n <o:Version>16.00</o:Version>\n </o:DocumentProperties>\n <o:CustomDocumentProperties>\n <o:_NewReviewCycle dt:dt=\"string\"></o:_NewReviewCycle>\n </o:CustomDocumentProperties>\n <o:OfficeDocumentSettings>\n <o:RelyOnVML/>\n <o:AllowPNG/>\n </o:OfficeDocumentSettings>\n</xml><![endif]",
"[if gte mso 9]><xml>\n <w:WordDocument>\n <w:DocumentProtectionNotEnforced>ReadOnly</w:DocumentProtectionNotEnforced>\n <w:TrackMoves/>\n <w:TrackFormatting/>\n <w:DoNotHyphenateCaps/>\n <w:PunctuationKerning/>\n <w:DrawingGridHorizontalSpacing>5 pt</w:DrawingGridHorizontalSpacing>\n <w:DrawingGridVerticalSpacing>6 pt</w:DrawingGridVerticalSpacing>\n <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>\n <w:DisplayVerticalDrawingGridEvery>3</w:DisplayVerticalDrawingGridEvery>\n <w:ValidateAgainstSchemas/>\n <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>\n <w:IgnoreMixedContent>false</w:IgnoreMixedContent>\n <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>\n <w:DoNotPromoteQF/>\n <w:LidThemeOther>EN-US</w:LidThemeOther>\n <w:LidThemeAsian>X-NONE</w:LidThemeAsian>\n <w:LidThemeComplexScript>AR-SA</w:LidThemeComplexScript>\n <w:Compatibility>\n <w:BreakWrappedTables/>\n <w:SnapToGridInCell/>\n <w:WrapTextWithPunct/>\n <w:UseAsianBreakRules/>\n <w:DontGrowAutofit/>\n <w:SplitPgBreakAndParaMark/>\n <w:EnableOpenTypeKerning/>\n <w:DontFlipMirrorIndents/>\n <w:OverrideTableStyleHps/>\n </w:Compatibility>\n <m:mathPr>\n <m:mathFont m:val=\"Cambria Math\"/>\n <m:brkBin m:val=\"before\"/>\n <m:brkBinSub m:val=\"--\"/>\n <m:smallFrac m:val=\"off\"/>\n <m:dispDef/>\n <m:lMargin m:val=\"0\"/>\n <m:rMargin m:val=\"0\"/>\n <m:defJc m:val=\"centerGroup\"/>\n <m:wrapIndent m:val=\"1440\"/>\n <m:intLim m:val=\"subSup\"/>\n <m:naryLim m:val=\"undOvr\"/>\n </m:mathPr></w:WordDocument>\n</xml><![endif]",],
"body": [
"div": [
"p": [
"a": [
"span": [
"span": [
"span": [
"_value": "Performance Work Statement"
"span": [
"span": [
"span": [
"span": [
"_value": "UNITED STATES NAVAL ACADEMY (USNA)"
},
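The text in the output above sits under "_value" keys, buried inside lists of tag dictionaries. A minimal sketch of pulling out only those strings follows. It assumes the structure shown above (each tag maps to a list of dictionaries); `collect_values` and `sample` are hypothetical names, and `sample` is a hand-written stand-in for the real output, not data from the original file. Other underscore-prefixed keys such as "_values" are skipped here because in the output above they hold Office conditional-comment markup; that choice may need adjusting for other documents.

```python
def collect_values(node):
    """Recursively gather every "_value" string from html_to_json-style output."""
    found = []
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "_value":
                found.append(child)
            elif key.startswith("_"):
                # Skip "_values" and similar bookkeeping keys (an assumption, see above)
                continue
            else:
                found.extend(collect_values(child))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item))
    return found


# Hand-written sample that mimics the nesting shown above (hypothetical data)
sample = {
    "html": [{
        "body": [{
            "div": [{
                "p": [
                    {"span": [{"_value": "Performance Work Statement"}]},
                    {"span": [{"_value": "UNITED STATES NAVAL ACADEMY (USNA)"}]},
                ]
            }]
        }]
    }]
}

texts = collect_values(sample)
print(texts)
```

The result is a flat list of text fragments in document order, with the html, body, div, and span wrappers discarded.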
The expected output should look something like this.
Sample expected format:
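[{"key": "1", "value": "", "children": []}, {"key": "2", "value": "", "children": [{"key": "2.1", "value": "", "children": []}, {"key": "2.2", "value": "", "children": []}]}, {"key": "3", "value": "", "children": [{"key": "2.1", "value": "", "children": [{"key": "2.1.1", "value": "", "children": []}]}]}]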
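One way to arrive at a key/value/children shape like this is to nest flat (key, value) pairs by their dotted numbering. The sketch below assumes the keys are dotted section numbers (so "2.1" belongs under "2") and that every parent appears before its children; the question does not confirm either. `build_tree` and the sample pairs are hypothetical.

```python
import json


def build_tree(pairs):
    """Nest (key, value) pairs by dotted key: "2.1" becomes a child of "2"."""
    roots = []
    by_key = {}
    for key, value in pairs:
        node = {"key": key, "value": value, "children": []}
        by_key[key] = node
        # "2.1.1" -> parent "2.1"; a key with no dot has an empty parent key
        parent_key = key.rpartition(".")[0]
        parent = by_key.get(parent_key)
        if parent is not None:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots


# Hypothetical section numbers and titles, for illustration only
pairs = [
    ("1", "Introduction"),
    ("2", "Scope"),
    ("2.1", "Tasks"),
    ("2.1.1", "Details"),
    ("3", "Deliverables"),
]

tree = build_tree(pairs)
print(json.dumps(tree, indent=4))
```

A key whose parent has not been seen is treated as a top-level entry rather than raising an error, which keeps the sketch tolerant of gaps in the numbering.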
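Posted on 2022-09-21 23:22:21
Have you tried something like this, and just looked up the elements you want? https://www.w3schools.com/python/gloss_python_json_parse.asp
The Python documentation may also help: https://docs.python.org/3/library/json.html
May I ask why you want to encode HTML as JSON?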
Posted on 2022-10-04 13:28:11
OK, in case anyone wants a solution, I solved it with the following logic:
from html_to_draftjs import html_to_draftjs
import bleach, json
from bleach.css_sanitizer import CSSSanitizer

def htmltodraftjson():
    with open("WorkStatement.html", "r") as html_file:
        html = html_file.read()