1. We have a large number of log or txt files.
2. Certain fields need to be extracted into CSV files.
3. Some fields must be grouped by a feature of their line (for example: take every line, in every file, that contains info:fx43, put those records into the same CSV file, and name that file after the info value).
Approach:
1. Based on the fields to extract, filter the needed lines out of the raw documents and save them to a new txt file.
2. Filter a second time and save the data in a JSON-like format.
3. Read that file, pull the fields out, and save them to CSV files.
Data format:
2023-04-01 00:01:02.456 INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result":
[{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"},
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"},
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
2023-04-01 00:01:02.456 INFO : "{"__AC":true,"error":null,"result":[]}
2023-04-01 00:01:02.456 INFO rxvdf:null
Goal: extract records such as {"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, and put records with the same info value into the same CSV file.
import os

folder = ""      # directory holding the raw log files
stage_res1 = ""  # intermediate txt file for the filtered lines

pattern = '"__AC":true,"error":null,"result"'
for name in os.listdir(folder):
    file = os.path.join(folder, name)
    count = 0
    with open(file, "r") as read_file, open(stage_res1, "a+") as writer_file:
        for line in read_file:
            if pattern in line:
                writer_file.write(line)
                count += 1
    print("count:", count)
This step writes all of the lines we need into a new file.
A captured line: 2023-04-01 00:01:02.456 INFO RESULT RESPONSE: "{"__AC":true,"error":null,"result": [{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"},"success":true,"ttxx":null]}
import re

stage_res1 = ""  # output of the previous step, used as input here
stage_res2 = ""  # next intermediate file

data = []
with open(stage_res1, "r") as read_file, open(stage_res2, "w") as writer_file:
    for line in read_file:
        match = re.search(r'\[(.*?)\]', line)  # grab the [...] payload
        if match:
            data.append(match.group(1))
    writer_file.write(str(data))
Result: ['{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}, {"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}, {"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}']
def split_data(filename, output_name):
    with open(filename, "r") as file:
        lines = file.readlines()
    separated_data = []
    for line in lines:
        line = line.strip()
        line = line[3:-3]         # drop the surrounding ['{ ... }']
        data = line.split("},{")  # split the concatenated objects
        separated_data.extend(data)
    with open(output_name, "w") as file:
        for item in separated_data:
            file.write("{" + item + "}" + "\n")

filename = ""
output_name = ""
split_data(filename, output_name)
Output after running:
{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}', '{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
The output above still has problems, such as the stray ', ' fragments. The goal is one JSON object per line, so we keep processing the data.
def split_data(filename, output_name):
    with open(filename, "r") as file:
        lines = file.readlines()
    separated_data = []
    for line in lines:
        # split on the "', '" fragments left over from str(list);
        # note that replace()-then-extend() would add characters one by one
        separated_data.extend(line.rstrip("\n").split("', '"))
    with open(output_name, "w") as file:
        for item in separated_data:
            file.write(item + "\n")

filename = ""
output_name = ""
split_data(filename, output_name)
Output:
{"reator":"fx01","dsp":02,"name":"fdxfgvbde","info":"fxswerf211"}
{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
{"reator":"fwe1","dsp":0312,"name":"fd9090de","info":"fx1"}{"reator":"fx0234","dsp":032,"name":"fd234bde","info":"fxdffverf211"}
import json, csv

def txt2csv(filename):
    with open(filename, "r") as file:
        lines = file.readlines()
    for line in lines:
        data = json.loads(line)
        reator = data.get("reator", "")
        dsp = data.get("dsp", "")
        name = data.get("name", "")
        info = data.get("info", "")
        if info:
            csv_file = f"{info}.csv"
            # append to the per-info csv, not back to the input file
            with open(csv_file, "a", newline="") as out_file:
                writer = csv.writer(out_file)
                writer.writerow([reator, dsp, name, info])

filename = ""
txt2csv(filename)
This parses the JSON on each line and appends it to a CSV file named after its info value; the column order matches the field order in the code. The header row is still missing, though.
Adding the header row:
import os

folder = ""  # directory holding the generated csv files
header = "reator,dsp,name,info"
for root, dirs, files in os.walk(folder):
    for name in files:
        file_path = os.path.join(root, name)
        with open(file_path, "r+") as fi:
            content = fi.read()
            fi.seek(0, 0)
            fi.write(header + "\n" + content)