相关文章推荐
博学的板凳  ·  Java设置POI XSSFCell ...·  5 天前    · 
不羁的饺子  ·  Java POI (3)—— ...·  5 天前    · 
傲视众生的酱肘子  ·  unable to retrieve ...·  5 月前    · 
八块腹肌的小熊猫  ·  基于 MongoDB ...·  1 年前    · 
忐忑的薯片  ·  Python--元组tuple - 知乎·  1 年前    · 
任性的小马驹  ·  HTML DOM Input Text ...·  1 年前    · 

# 读取Excel 使用 pandas 读取二进制文件 excel

pd_df = pd.read_excel(io=file_path)

# 填充 NaN 将 nan 值用空字符串替换 df = rpa_pandas.fillna(df=pd_df)

df = pd_df.fillna('')

return df

Python读取HTML类型的Excel文件并返回DataFrame对象

这里我直接将重新读取Excel的操作封装成函数放在 一系列函数 中使用

def get_data_from_html_excel(file_path, encoding='utf-8'):

从 html 格式的 excel 中获取数据

:param file_path:

:return: DataFrame

dfs = pd.read_html(file_path, encoding=encoding)

pd_df = dfs[0]

df = pd_df.fillna('')

return df

Python读取XML类型的Excel文件

思路:利用Python将Excel文件读取到内存中,再利用正则表达式从XML标签中取出Excel表格中的数据

def get_data_from_xml_excel(file_path):

解决 pandas.read_excel() 读取 excel 解析错误问题.

读取 excel 文件,将每行数据都放入 list 列表中。

返回值有二个,第一个是 excel 表头的长度,第二个是数据 list 集合

with open(file_path, 'rb') as tree:

content = tree.read().decode('utf-8')

# <ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">单位</ss:Data></ss:Cell>

# <ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">单位</ss:Data><ss:NamedCell ss:Name="Print_Titles" />

pattern1 = r"""<ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">(.+?)</ss:Data><ss:NamedCell ss:Name="Print_Titles" /></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">(.+?)</ss:Data></ss:Cell>"""

head_iter = re.finditer(pattern1, content)

head_list = []

for tpl in head_iter:

for i in range(len(tpl.groups())):

ele = tpl.groups()[i]

if ele is not None:

head_list.append(ele)

break

# <ss:Cell ss:StyleID="odd"><ss:Data ss:Type="String"></ss:Data></ss:Cell>

pattern2 = r"""<ss:Cell ss:StyleID="even"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="odd"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="oddfloat"><ss:Data ss:Type="Number">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="oddfloat"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="evenfloat"><ss:Data ss:Type="Number">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="evenfloat"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>"""

data_iter = re.finditer(pattern2, content)

data_list = []

count = 0

for tpl in data_iter:

groups_len = len(tpl.groups())

# print(groups_len, tpl.groups(), tpl.span())

for i in range(groups_len):

ele = tpl.groups()[i]

# 排除不需要的脏数据

if ele is not None and ele != 'nan' and ele != 'null':

data_list.append(ele)

break

count += 1

# 若为空数据,或者 nan, null ,用空字符串替换

if count == groups_len:

data_list.append('')

# 每次内循环后都要清零

count = 0

return len(head_list), data_list

except Exception as e:

print('解析 excel 发生异常 {}'.format(e))

return 0, []

def get_data_from_xml_excel(file_path):

解决 pandas.read_excel() 读取 excel 解析错误问题.

读取 excel 文件,将每行数据都放入 list 列表中。

返回值有二个,第一个是 excel 表头的长度,第二个是数据 list 集合

with open(file_path, 'rb') as tree:

content = tree.read().decode('utf-8')

# <ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">单位</ss:Data></ss:Cell>

# <ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">单位</ss:Data><ss:NamedCell ss:Name="Print_Titles" />

pattern1 = r"""<ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">(.+?)</ss:Data><ss:NamedCell ss:Name="Print_Titles" /></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="headercell"><ss:Data ss:Type="String">(.+?)</ss:Data></ss:Cell>"""

head_iter = re.finditer(pattern1, content)

head_list = []

for tpl in head_iter:

for i in range(len(tpl.groups())):

ele = tpl.groups()[i]

if ele is not None:

head_list.append(ele)

break

# <ss:Cell ss:StyleID="odd"><ss:Data ss:Type="String"></ss:Data></ss:Cell>

pattern2 = r"""<ss:Cell ss:StyleID="even"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="odd"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="oddfloat"><ss:Data ss:Type="Number">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="oddfloat"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="evenfloat"><ss:Data ss:Type="Number">([\s\S\w\W]*?)</ss:Data></ss:Cell>""" \

r"""|<ss:Cell ss:StyleID="evenfloat"><ss:Data ss:Type="String">([\s\S\w\W]*?)</ss:Data></ss:Cell>"""

data_iter = re.finditer(pattern2, content)

data_list = []

count = 0

for tpl in data_iter:

groups_len = len(tpl.groups())

# print(groups_len, tpl.groups(), tpl.span())

for i in range(groups_len):

ele = tpl.groups()[i]

# 排除不需要的脏数据

if ele is not None and ele != 'nan' and ele != 'null':

data_list.append(ele)

break

count += 1

# 若为空数据,或者 nan, null ,用空字符串替换

if count == groups_len:

data_list.append('')

# 每次内循环后都要清零

count = 0

return len(head_list), data_list

except Exception as e:

print('解析 excel 发生异常 {}'.format(e))

return 0, [] 返回搜狐,查看更多

责任编辑:

声明:该文观点仅代表作者本人,搜狐号系信息发布平台,搜狐仅提供信息存储空间服务。