1. There is a large volume of log or txt files.
2. Certain fields need to be extracted into CSV files.
3. Some fields must be grouped by a feature of their line (for example: for every line in every file, if it contains info:fx43, put those lines into the same CSV file and name the file after the info value).
Solution:
1. Based on the fields to be extracted, filter the needed lines out of the original files and save them to a new txt file.
2. Run a second filtering pass and save the data in JSON form, one object per line.
3. Read the JSON, pull out the fields, and save them to CSV files.

Data format:

2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result":
[{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"},
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"},
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
2023-04-01 00:01:02.456  INFO : "{"__AC":true,"error":null,"result":[]}
2023-04-01 00:01:02.456  INFO rxvdf:null

Goal: extract the {"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"} records; records that share the same info value go into the same CSV file.
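Before walking through the file-based pipeline, the end goal can be sketched in memory in a few lines: group records by their info value, one group per future CSV file. This is a minimal sketch using the three sample records above; the dsp values are kept as strings because leading-zero numbers such as 02 are not valid JSON or Python literals.

```python
from collections import defaultdict

# The three sample records from the log above (dsp kept as strings).
records = [
    {"reator": "fx01", "dsp": "02", "name": "fdxfgvbde", "info": "fxswerf211"},
    {"reator": "fx0234", "dsp": "032", "name": "fd234bde", "info": "fxdffverf211"},
    {"reator": "fwe1", "dsp": "0312", "name": "fd9090de", "info": "fx1"},
]

# Group rows by their "info" value: one group == one future CSV file.
groups = defaultdict(list)
for rec in records:
    groups[rec["info"]].append([rec["reator"], rec["dsp"], rec["name"], rec["info"]])

for info, rows in groups.items():
    print(f"{info}.csv -> {rows}")
```

Each key of `groups` becomes one CSV filename; the steps below reach the same shape via intermediate files.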

import os

folder=""      # data directory
stage_res1=""  # path of the filtered output file
pattern='"__AC":true,"error":null,"result"'   # marker for the lines we want
for name in os.listdir(folder):
	file=os.path.join(folder,name)
	count=0
	with open(file,"r") as read_file,open(stage_res1,"a+") as writer_file:
		for line in read_file:
			if pattern in line:
				writer_file.write(line)
				count+=1
	print("count :",count)

This step writes all the needed lines into a new file. A kept line looks like:
2023-04-01 00:01:02.456 INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": [{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
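As a quick check, the substring filter keeps exactly the lines containing the marker. A small self-contained demo on the three sample lines (truncated here for brevity):

```python
sample_lines = [
    '2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": [...]',
    '2023-04-01 00:01:02.456  INFO : "{"__AC":true,"error":null,"result":[]}',
    '2023-04-01 00:01:02.456  INFO rxvdf:null',
]
pattern = '"__AC":true,"error":null,"result"'
kept = [line for line in sample_lines if pattern in line]
print(len(kept))  # 2 of the 3 sample lines contain the marker
```

Note that the second line (an empty result array) also passes the filter, so later steps should tolerate empty matches.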

import re

data=[]
input_file=""   # the filtered txt produced by the previous step
stage_res1=""   # where to save the result
with open(input_file,"r") as read_file,open(stage_res1,"w") as writer_file:
	for line in read_file:
		match=re.search(r'\[(.*?)\]',line)  # capture everything inside the first [...] on the line
		if match:
			data.append(match.group(1))     # collect the array contents of every matching line
	writer_file.write(str(data))

Result: ['{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}']
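The non-greedy `\[(.*?)\]` stops at the first `]` it meets, which is worth keeping in mind if the array ever contains nested brackets. A small demo on a simplified, made-up line:

```python
import re

line = 'INFO RESULT RESPONSE: "result": [{"a":"x"},{"b":"y"}] trailing text'
match = re.search(r'\[(.*?)\]', line)  # non-greedy: shortest span from [ to ]
print(match.group(1))  # -> {"a":"x"},{"b":"y"}
```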

def split_data(filename,output_name):
	with open(filename,"r") as file:
		lines=file.readlines()
	separated_data=[]
	for line in lines:
		line=line.strip()      # strip whitespace (spaces, tabs, newlines) from both ends
		line=line[3:-3]        # drop the leading ['{ and the trailing }']
		data=line.split("},{") # split the concatenated objects apart on },{
		separated_data.extend(data)  # append the pieces to the new list
	with open(output_name,"w") as file:
		for item in separated_data:
			file.write("{"+item+"}"+'\n')  # re-wrap each piece in {} so each line is a JSON object
filename=""
output_name=""
split_data(filename,output_name)

Result after running:

{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}',  '{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
The output above still has problems, such as the stray ',  ' separators left over from str(data). The goal is exactly one JSON object per line, so the data needs one more pass.

def split_data(filename,output_name):
	with open(filename,"r") as file:
		lines=file.readlines()
	separated_data=[]
	for line in lines:
		line=line.strip()
		# the str(data) dump left "', '" between objects; turn it into a line break
		parts=line.replace("',  '","\n").replace("', '","\n").split("\n")
		separated_data.extend(parts)  # extend with a list of strings (extending a bare string would append one character at a time)
	with open(output_name,"w") as file:
		for item in separated_data:
			file.write(item.strip("'")+"\n")  # strip leftover quotes; one JSON object per line
filename=""
output_name=""
split_data(filename,output_name)

Result:

{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
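The two split passes above exist only to undo the str(data) dump from step 2. An alternative worth considering is to skip the intermediate format entirely and pull each object straight out of the filtered lines with re.findall. This is a sketch, assuming the objects contain no nested braces; the sample line is simplified and its dsp values are quoted to keep them JSON-valid.

```python
import re

line = ('2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": '
        '[{"reator":"fx01","dsp":"02","name":"fdxfgvbde","info":"fxswerf211"},'
        '{"reator":"fwe1","dsp":"0312","name":"fd9090de","info":"fx1"}]')

# \{[^{}]*\} matches one brace-delimited object containing no nested braces,
# so the outer "{"__AC"... wrapper is skipped and only the records match.
objects = re.findall(r'\{[^{}]*\}', line)
for obj in objects:
    print(obj)
```

With this approach each element of `objects` is already one JSON object per line, ready for the CSV step.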
import json,csv
def txt2csv(filename):
	with open(filename,"r") as file:
		lines=file.readlines()
	for line in lines:
		data=json.loads(line)
		reator=data.get("reator","")
		dsp=data.get("dsp","")
		name=data.get("name","")
		info=data.get("info","")
		if info:
			csv_file=f"{info}.csv"   # one CSV file per distinct info value
			with open(csv_file,"a",newline="") as out:  # append to the info-named file, not the input file
				writer=csv.writer(out)
				writer.writerow([reator,dsp,name,info])
filename=""
txt2csv(filename)

This parses each JSON line and appends it to the matching CSV file, with the columns in the same order as in the code. Note that numbers with leading zeros such as "dsp":02 are not valid JSON, so json.loads will raise an error on such lines; either normalize them beforehand or wrap the parse in try/except. The header row is still missing.
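One way to avoid a separate header pass is to write the header at the moment each per-info CSV file is first created. This is a sketch using os.path.exists, not the original code; the record at the bottom is hypothetical sample data with dsp quoted to keep it JSON-valid.

```python
import csv, json, os

def append_record(data, out_dir="."):
    """Append one parsed record to <info>.csv, writing the header row
    first if the file does not exist yet."""
    info = data.get("info", "")
    if not info:
        return
    csv_file = os.path.join(out_dir, f"{info}.csv")
    new_file = not os.path.exists(csv_file)
    with open(csv_file, "a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["reator", "dsp", "name", "info"])  # header once per file
        writer.writerow([data.get("reator", ""), data.get("dsp", ""),
                         data.get("name", ""), info])

# Hypothetical record (dsp as a string because leading zeros are invalid JSON).
append_record(json.loads('{"reator":"fx01","dsp":"02","name":"fdxfgvbde","info":"fxswerf211"}'))
```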

Adding the header row

import os

folder=""
header='reator,dsp,name,info'
for root,dirs,files in os.walk(folder):
	for name in files:
		file_path=os.path.join(root,name)
		with open(file_path,"r+") as fi:   # r+ so we can rewrite from the start ("a+" always writes at the end)
			content=fi.read()
			fi.seek(0,0)
			fi.write(header+"\n"+content)  # prepend the header line