Reading an unstructured text file in Python and turning it into structured data


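I have this attached text file (File), which contains unstructured data preceded by a few lines of header information. How can I structure this data, i.e. extract the information in a structured way? In the end I want several columns (five in this case), each holding the corresponding values: frame 50 contains 10 values, frame 51 contains 10 values, and so on. I would also like to get the values from the first 4 lines separately. I tried and came up with the following code, but the list/array it produces is not what I am after.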

frame =[]
frame1 =[]
flag = -1
counter = -1
counter_val = 0
f = open(filepath, "r")
for line in f:
    element = line.split(' ')
    if(len(element) == 4):
        if(element[1] == "Frame_Number") :
            # print(element[1])
            if(flag == 0):
                # print(len(frame1))
                frame.append(frame1)
            flag = 0
            counter = counter + 1
            counter_val = 0
            frame1 =[]
        continue
    if(flag == 0):   
        frame1.append(line)
        counter_val = counter_val + 1
print(frame[1])


5 comments


           
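I think it would be best to include a snippet of the input file and the corresponding data structure you want as output (e.g. dict, array, class).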


           
@sardok Let's say it is a CSV in which the first column contains the values and heading of Frame_Number# 50, and so on.


           
Wouldn't there be six columns rather than five, as in:

Values, Samples_per_Frame, Chirp_Time_sec, Pulse_Repetition_Time_sec, Frame_Period_sec, Frame_Number

?


           
@DarrylG No, for now I don't need the values above, only the values below Frame_Number. So the number of columns depends on how many frames I have.


           
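@AR. -- so what is in those five columns, i.e. 'So at the end I have multiple columns (in this case 5)'? Frame_Number would be one column, so what are the other four? Each frame has 10 data elements, so it is either 10 columns (one per value) or 1 (all the data in a single column).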


         python


         python-3.x


2 answers


          
           
           
            Chami Sangeeth Amarasinghe
           
          
          
Posted on 2020-05-05

Accepted

0 upvotes


          
Here is a pandas solution.

import pandas as pd
# Read in the data as a pandas Series
# (written for older pandas: the `squeeze` argument was removed in pandas 2.0)
df = pd.read_csv('testsd.txt', sep = '\n', header = None, squeeze = True)
# Get the names of the eventual columns ('# Frame_Number 50', ...)
colNames = df.loc[df.str.startswith('# Frame_Number')]
# Store the first few lines of metadata in another Series and drop them from the original
meta_df = df[: colNames.index.to_list()[0]]
df.drop(range(colNames.index.to_list()[0]), inplace = True)
# Drop the eventual column names
df.drop(colNames.index.to_list(), inplace = True)
What remains in the original object should be only the data. Now reshape it. Note that this only works if every column has the same number of entries.
df = pd.DataFrame(df.values.reshape(len(colNames), int(len(df) / len(colNames))).T, columns = colNames)
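The arguments to reshape are the desired number of rows and columns. It fills the result horizontally (row by row), so we transpose it afterwards. Finally, if you like, add the metadata we saved as a column of the DataFrame, although you really should store it elsewhere in a separate file.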
df['meta'] = meta_df
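To see why the transpose after reshape is needed, it can help to try the same two steps on a tiny array. This is only a minimal sketch with made-up values (two hypothetical frames of three samples each), not the data from the question:

```python
import numpy as np

# Hypothetical flat data: frame A's three samples followed by frame B's.
flat = np.array(["a1", "a2", "a3", "b1", "b2", "b3"])
n_frames = 2

# reshape fills row by row, so each row holds one frame ...
by_row = flat.reshape(n_frames, len(flat) // n_frames)
# ... and transposing turns each frame into a column.
by_col = by_row.T

print(by_row.tolist())  # [['a1', 'a2', 'a3'], ['b1', 'b2', 'b3']]
print(by_col.tolist())  # [['a1', 'b1'], ['a2', 'b2'], ['a3', 'b3']]
```

Reshaping directly to (samples, frames) without the transpose would instead interleave values from different frames within each column.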
Write the DataFrame to a file.
df.to_csv('testsd.csv')


           
            
             
              
               
Chami Sangeeth Amarasinghe:


           
            
             
              
               
OP, please note the edit, which applies a transpose after the reshape call.
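On a current pandas install the `read_csv` call in the answer above may not run as written: the `squeeze` argument was removed in pandas 2.0, and newer versions may also reject `'\n'` as a separator. The following is a minimal sketch of the same idea that reads the lines with plain Python first. The function name `frames_to_dataframe` is hypothetical, and it assumes, like the answer, that every frame has the same number of values:

```python
import pandas as pd


def frames_to_dataframe(path):
    """Return (meta, frames): header lines before the first frame, and one column per frame."""
    with open(path) as fh:
        lines = pd.Series([ln.strip() for ln in fh if ln.strip()])

    # Lines such as '# Frame_Number = 50' become the column names.
    is_frame = lines.str.startswith("# Frame_Number")
    col_names = lines[is_frame].str.lstrip("# ").tolist()

    # Everything before the first frame header is metadata.
    first = is_frame.idxmax()
    meta = lines.iloc[:first].tolist()

    # The remaining non-header lines are the samples, in file order.
    body = lines.iloc[first:]
    data = body[~body.str.startswith("# Frame_Number")].astype(float).to_numpy()

    # One row per frame after reshape; transpose so each frame is a column.
    table = data.reshape(len(col_names), -1).T
    return meta, pd.DataFrame(table, columns=col_names)
```

Calling `meta, frames = frames_to_dataframe('testsd.txt')` and then `frames.to_csv('testsd.csv', index=False)` would write the table, while `meta` keeps the header lines apart from the data.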


          
           
            
             
              
               Try the following
              
import csv
def convert_csv(filenm):
  " Produces structured data by converting to CSV file "
  with open(filenm, 'r') as fin, open('out.txt', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    frames = []
    frame_vals = []
    for line in fin:
      line = line.rstrip()
      if line:
        if line[0] == "#":
          field, value = line[1:].split('=')
          field, value = field.strip(), value.strip()
          if field == 'Frame_Number':
            frames.append(value)    # current frame number
            frame_vals.append([])   # new sublist for frame values
        else:
          frame_vals[-1].append(line.strip())  # append to current frame values
    # Write header
    fnames = ['Frame_' + str(v) for v in frames]
    csv_writer.writerow(fnames)
    # write other data
    for row in zip(*frame_vals):  # transposing to get each frame in a column
      csv_writer.writerow(row)
convert_csv('testd.txt')
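The function above skips header fields other than `Frame_Number`. Since the question also asks for the first four lines separately, one possible variation keeps them in a dict. This is only a sketch: `parse_frames` is a hypothetical name, the sample is an abbreviated excerpt of the input, and it assumes every `#` line has the form `# name = value`:

```python
def parse_frames(lines):
    """Split lines into (meta, frames): header fields and a dict of frame number -> values."""
    meta = {}
    frames = {}
    current = None  # list of values for the frame being read, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            field, _, value = line[1:].partition("=")
            field, value = field.strip(), value.strip()
            if field == "Frame_Number":
                current = frames.setdefault(value, [])  # start a new frame
            else:
                meta[field] = value                     # keep other header fields
        elif current is not None:
            current.append(float(line))
    return meta, frames


sample = [
    "# Samples_per_Frame = 8192",
    "# Frame_Period_sec = 0.2",
    "# Frame_Number = 50",
    "0.50061053",
    "0.49938953",
    "# Frame_Number = 51",
    "0.50061053",
]
meta, frames = parse_frames(sample)
print(meta)    # {'Samples_per_Frame': '8192', 'Frame_Period_sec': '0.2'}
print(frames)  # {'50': [0.50061053, 0.49938953], '51': [0.50061053]}
```

With a real file, the open handle can be passed directly, for example `with open('testd.txt') as fin: meta, frames = parse_frames(fin)`.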
Input: testd.txt
# Samples_per_Frame = 8192
# Chirp_Time_sec = 0.000133
# Pulse_Repetition_Time_sec = 0.00050355
# Frame_Period_sec = 0.2
# Frame_Number = 50
0.50061053
0.49938953
0.49426132
0.48962152
0.48791212
0.48937732
0.49523813
0.49914533
0.50158733
0.49914533
# Frame_Number = 51
0.50061053
0.49938953
0.49426132
0.48962152
0.48791212
0.48937732
0.49523813
0.49914533
0.50158733
0.49914533
# Frame_Number = 52
0.50793654
0.50647134
0.49841273
0.48937732
0.48644692
0.49035412
0.49768013
0.50647134
0.51282054
0.50940174
# Frame_Number = 53
0.49670333
0.49181932
0.4840049
0.48547012
0.48791212
0.49230772
0.49768013
0.49816853
0.49181932
0.48595852
# Frame_Number = 54
0.49352872
0.49597073
0.49987793
0.50354093
0.50402933
0.50036633
0.49841273
0.49743593
0.49865693
0.50012213
Output: out.txt
Frame_50 Frame_51 Frame_52 Frame_53 Frame_54
0.50061053 0.50061053 0.50793654 0.49670333 0.49352872
0.49938953 0.49938953 0.50647134 0.49181932 0.49597073
0.49426132 0.49426132 0.49841273 0.4840049 0.49987793
0.48962152 0.48962152 0.48937732 0.48547012 0.50354093
0.48791212 0.48791212 0.48644692 0.48791212 0.50402933
0.48937732 0.48937732 0.49035412 0.49230772 0.50036633