1. There is a large volume of log or txt files.
2. Certain fields need to be extracted into CSV files.
3. Some fields must be grouped by a feature of their line (for example: for every line in every file, if it contains info:fx43, put those lines into the same CSV file and name the file after the info value).
Solution:
1. Based on the fields to be extracted, filter the needed lines out of the original files and save them to a new txt file.
2. Run a second filtering pass and save the data in JSON form, one object per line.
3. Read the JSON, pull out the fields, and save them to CSV files.

Data format:

2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result":
[{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"},
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"},
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
2023-04-01 00:01:02.456  INFO : "{"__AC":true,"error":null,"result":[]}
2023-04-01 00:01:02.456  INFO rxvdf:null

Goal: extract the {"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"} records; records that share the same info value go into the same CSV file.
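Before walking through the file-based pipeline, the end goal can be sketched in memory in a few lines: group records by their info value, one group per future CSV file. This is a minimal sketch using the three sample records above; the dsp values are kept as strings because leading-zero numbers such as 02 are not valid JSON or Python literals.

```python
from collections import defaultdict

# The three sample records from the log above (dsp kept as strings).
records = [
    {"reator": "fx01", "dsp": "02", "name": "fdxfgvbde", "info": "fxswerf211"},
    {"reator": "fx0234", "dsp": "032", "name": "fd234bde", "info": "fxdffverf211"},
    {"reator": "fwe1", "dsp": "0312", "name": "fd9090de", "info": "fx1"},
]

# Group rows by their "info" value: one group == one future CSV file.
groups = defaultdict(list)
for rec in records:
    groups[rec["info"]].append([rec["reator"], rec["dsp"], rec["name"], rec["info"]])

for info, rows in groups.items():
    print(f"{info}.csv -> {rows}")
```

Each key of `groups` becomes one CSV filename; the steps below reach the same shape via intermediate files.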

import os

folder=""      # data directory
stage_res1=""  # path of the filtered output file
pattern='"__AC":true,"error":null,"result"'   # marker for the lines we want
for name in os.listdir(folder):
	file=os.path.join(folder,name)
	count=0
	with open(file,"r") as read_file,open(stage_res1,"a+") as writer_file:
		for line in read_file:
			if pattern in line:
				writer_file.write(line)
				count+=1
	print("count :",count)

This step writes all the needed lines into a new file. A kept line looks like:
2023-04-01 00:01:02.456 INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": [{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
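As a quick check, the substring filter keeps exactly the lines containing the marker. A small self-contained demo on the three sample lines (truncated here for brevity):

```python
sample_lines = [
    '2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": [...]',
    '2023-04-01 00:01:02.456  INFO : "{"__AC":true,"error":null,"result":[]}',
    '2023-04-01 00:01:02.456  INFO rxvdf:null',
]
pattern = '"__AC":true,"error":null,"result"'
kept = [line for line in sample_lines if pattern in line]
print(len(kept))  # 2 of the 3 sample lines contain the marker
```

Note that the second line (an empty result array) also passes the filter, so later steps should tolerate empty matches.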

import re

data=[]
input_file=""   # the filtered txt produced by the previous step
stage_res1=""   # where to save the result
with open(input_file,"r") as read_file,open(stage_res1,"w") as writer_file:
	for line in read_file:
		match=re.search(r'\[(.*?)\]',line)  # capture everything inside the first [...] on the line
		if match:
			data.append(match.group(1))     # collect the array contents of every matching line
	writer_file.write(str(data))

Result: ['{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}']
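The non-greedy `\[(.*?)\]` stops at the first `]` it meets, which is worth keeping in mind if the array ever contains nested brackets. A small demo on a simplified, made-up line:

```python
import re

line = 'INFO RESULT RESPONSE: "result": [{"a":"x"},{"b":"y"}] trailing text'
match = re.search(r'\[(.*?)\]', line)  # non-greedy: shortest span from [ to ]
print(match.group(1))  # -> {"a":"x"},{"b":"y"}
```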

def split_data(filename,output_name):
	with open(filename,"r") as file:
		lines=file.readlines()
	separated_data=[]
	for line in lines:
		line=line.strip()      # strip whitespace (spaces, tabs, newlines) from both ends
		line=line[3:-3]        # drop the leading ['{ and the trailing }']
		data=line.split("},{") # split the concatenated objects apart on },{
		separated_data.extend(data)  # append the pieces to the new list
	with open(output_name,"w") as file:
		for item in separated_data:
			file.write("{"+item+"}"+'\n')  # re-wrap each piece in {} so each line is a JSON object
filename=""
output_name=""
split_data(filename,output_name)

Result after running:

{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}',  '{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
The output above still has problems, such as the stray ',  ' separators left over from str(data). The goal is exactly one JSON object per line, so the data needs one more pass.

def split_data(filename,output_name):
	with open(filename,"r") as file:
		lines=file.readlines()
	separated_data=[]
	for line in lines:
		line=line.strip()
		# the str(data) dump left "', '" between objects; turn it into a line break
		parts=line.replace("',  '","\n").replace("', '","\n").split("\n")
		separated_data.extend(parts)  # extend with a list of strings (extending a bare string would append one character at a time)
	with open(output_name,"w") as file:
		for item in separated_data:
			file.write(item.strip("'")+"\n")  # strip leftover quotes; one JSON object per line
filename=""
output_name=""
split_data(filename,output_name)

Result:

{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
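The two split passes above exist only to undo the str(data) dump from step 2. An alternative worth considering is to skip the intermediate format entirely and pull each object straight out of the filtered lines with re.findall. This is a sketch, assuming the objects contain no nested braces; the sample line is simplified and its dsp values are quoted to keep them JSON-valid.

```python
import re

line = ('2023-04-01 00:01:02.456  INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": '
        '[{"reator":"fx01","dsp":"02","name":"fdxfgvbde","info":"fxswerf211"},'
        '{"reator":"fwe1","dsp":"0312","name":"fd9090de","info":"fx1"}]')

# \{[^{}]*\} matches one brace-delimited object containing no nested braces,
# so the outer "{"__AC"... wrapper is skipped and only the records match.
objects = re.findall(r'\{[^{}]*\}', line)
for obj in objects:
    print(obj)
```

With this approach each element of `objects` is already one JSON object per line, ready for the CSV step.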
import json,csv
def txt2csv(filename):
	with open(filename,"r") as file:
		lines=file.readlines()
	for line in lines:
		data=json.loads(line)
		reator=data.get("reator","")
		dsp=data.get("dsp","")
		name=data.get("name","")
		info=data.get("info","")
		if info:
			csv_file=f"{info}.csv"   # one CSV file per distinct info value
			with open(csv_file,"a",newline="") as out:  # append to the info-named file, not the input file
				writer=csv.writer(out)
				writer.writerow([reator,dsp,name,info])
filename=""
txt2csv(filename)

This parses each JSON line and appends it to the matching CSV file, with the columns in the same order as in the code. Note that numbers with leading zeros such as "dsp":02 are not valid JSON, so json.loads will raise an error on such lines; either normalize them beforehand or wrap the parse in try/except. The header row is still missing.
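One way to avoid a separate header pass is to write the header at the moment each per-info CSV file is first created. This is a sketch using os.path.exists, not the original code; the record at the bottom is hypothetical sample data with dsp quoted to keep it JSON-valid.

```python
import csv, json, os

def append_record(data, out_dir="."):
    """Append one parsed record to <info>.csv, writing the header row
    first if the file does not exist yet."""
    info = data.get("info", "")
    if not info:
        return
    csv_file = os.path.join(out_dir, f"{info}.csv")
    new_file = not os.path.exists(csv_file)
    with open(csv_file, "a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["reator", "dsp", "name", "info"])  # header once per file
        writer.writerow([data.get("reator", ""), data.get("dsp", ""),
                         data.get("name", ""), info])

# Hypothetical record (dsp as a string because leading zeros are invalid JSON).
append_record(json.loads('{"reator":"fx01","dsp":"02","name":"fdxfgvbde","info":"fxswerf211"}'))
```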

Adding the header row

import os

folder=""
header='reator,dsp,name,info'
for root,dirs,files in os.walk(folder):
	for name in files:
		file_path=os.path.join(root,name)
		with open(file_path,"r+") as fi:   # r+ so we can rewrite from the start ("a+" always writes at the end)
			content=fi.read()
			fi.seek(0,0)
			fi.write(header+"\n"+content)  # prepend the header line