读取csv文件中的n个表到独立的pandas DataFrames上

1 人关注

我有一个单一的.csv文件,里面有四个表格,每个都是2001-1986年西南航空公司的不同财务报表。我知道我可以把每个表分开成单独的文件,但它们最初是作为一个文件下载的。

我想把每个表读到自己的pandas DataFrame中进行分析。这里是数据的一个子集。

Balance Sheet               
Report Date               12/31/2001    12/31/2000  12/31/1999  12/31/1998
Cash & cash equivalents   2279861       522995      418819      378511
Short-term investments    -             -           -            -
Accounts & other receivables    71283   138070      73448       88799
Inventories of parts...   70561          80564        65152     50035
Income Statement                
Report Date               12/31/2001    12/31/2000  12/31/1999  12/31/1998
Passenger revenues        5378702       5467965     4499360     3963781
Freight revenues          91270         110742      102990      98500
Charter & other           -              -           -           -
Special revenue adjustment  -            -           -           -
Statement of Retained Earnings              
Report Date              12/31/2001    12/31/2000   12/31/1999  12/31/1998
Previous ret earn...     2902007       2385854      2044975     1632115
Cumulative effect of..    -              -            -          -
Three-for-two stock split   117885  -   78076   -
Issuance of common..     52753           75952       45134       10184

每个表格都有17列,第一列是项目描述,但行数不同,例如,资产负债表是100行,而现金流量表是65行。

What I've Done

import pandas as pd
import numpy as np
# Lines that separate the various financial statements
lines_to_skip = [0, 102, 103, 158, 159, 169, 170]
with open('LUV.csv', 'r') as file:
    fin_statements = pd.read_csv(file, skiprows=lines_to_skip)
balance_sheet = fin_statements[0:100]

我看到有类似目的的帖子,指出要利用nrows和skiprows。我利用skiprows来读取整个文件,然后通过索引来创建各个财务报表。

我正在寻找评论和建设性的批评,以便以更好的Pythonic风格和最佳做法为每张表创建一个数据框架。

4 个评论
AMC
你已经知道每个表开始的行号了吗?
csv文件中的分隔符是什么? 你能把它作为原始文本发布吗?
你可以创建一个函数,把这个文件分割成不同的文件,然后你就可以正常读取它们。看来你可以用空行来识别表的结束。
@AMC 是的,我知道每个人开始的行号。这不是一个标准的格式,每次我下载不同公司的财务数据时都可能不同。
python
pandas
file
csv
dataframe
ExpertPomegranate
ExpertPomegranate
发布于 2020-01-13
2 个回答
Serge Ballesta
Serge Ballesta
发布于 2020-01-13
已采纳
0 人赞同

你想做的事情远远超出了 read_csv 所能做到的。如果你的输入文件结构可以被建模为。

REPEAT:
    Dataframe name
    Header line
    REPEAT:
        Data line
   BLANK LINE OR END OF FILE

IMHO,最简单的方法是解析这一行经手一行接一行,为每个数据框输入一个临时的csv文件,然后加载数据框。代码可以是。

df = {}        # dictionary of dataframes
def process(tmp, df_name):
'''Process the temporary file corresponding to one dataframe'''                
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)                # do not forget to remove the temp file
with open('LUV.csv') as file:
    df_name = "NONAME"                     # should never be in resulting dict...
    tmp = None
    for line in file:
        # print(line)                      # uncomment for debugging
        if len(line.strip()) == 0:         # close temp file on empty line
            process(tmp, df_name)          # and process it
            tmp = None
        elif tmp is None:                  # a new part: store the name
            df_name = line.strip()
            state = 1
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)                # just feed the temp file
    # process the last part if no empty line was present...
    process(tmp, df_name)

这其实并不高效,因为每一行都被写入一个临时文件,然后再被读取,但它是简单而强大的。

一个可能的改进是最初用csv模块解析部分(可以解析一个流,而pandas想要文件)。缺点是csv模块只能解析成字符串,而你失去了pandas自动转换为数字的功能。我的看法是,只有在文件很大,而且必须重复全部操作的情况下才值得这样做。

sammywemmy
sammywemmy
发布于 2020-01-13
0 人赞同

下面是我的解决方案。 我的假设是,每个报表都以一个指标开始('资产负债表'、'利润表'、'留存收益表'),我们可以根据这个来拆分表格,以获得单独的数据框架。这就是以下代码所基于的前提。让我知道这是否是一个有缺陷的假设。

import pandas as pd
import numpy as np
#i copied your data above and created a csv with it
df = pd.read_csv('csvtable_stackoverflow',header=None)
0   Balance Sheet
1   Report Date 12/31/2001 12/31/...
2   Cash & cash equivalents 2279861 522995...
3   Short-term investments - - ...
4   Accounts & other receivables 71283 138070...
5   Inventories of parts... 70561 80564...
6   Income Statement
7   Report Date 12/31/2001 12/31/...
8   Passenger revenues 5378702 546796...
9   Freight revenues 91270 110742...
10  Charter & other - - ...
11  Special revenue adjustment - - ...
12  Statement of Retained Earnings
13  Report Date 12/31/2001 12/31/2...
14  Previous ret earn... 2902007 2385854...
15  Cumulative effect of.. - - ...
16  Three-for-two stock split 117885 - 78076 -
17  Issuance of common.. 52753 75952...

下面的代码简单地使用numpy select来过滤出哪些行含有 资产负债表或利润表或现金流

https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html

bal_sheet = df[0].str.strip()=='Balance Sheet'
income_stmt = df[0].str.strip()=='Income Statement'
cash_flow_sheet = df[0].str.strip()=='Statement of Retained Earnings'
condlist = [bal_sheet, income_stmt, cash_flow_sheet]
choicelist = ['Balance Sheet', 'Income Statement', 'Statement of 
                                                   Retained Earnings']

下面的代码创建了一个表示工作表类型的列,将'0'转换为空,然后向下填充

df = (df.assign(sheet_type = np.select(condlist,choicelist))
      .assign(sheet_type = lambda x: x.sheet_type.replace('0',np.nan))
      .fillna(method='ffill')

最后一步是拉出各个数据帧

df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')
df_bal_sheet :     
         0                                            sheet_type
0   Balance Sheet                                    Balance Sheet
1   Report Date 12/31/2001 12/31/...                 Balance Sheet
2   Cash & cash equivalents 2279861 522995...        Balance Sheet
3   Short-term investments - - ...                   Balance Sheet
4   Accounts & other receivables 71283 138070...     Balance Sheet
5   Inventories of parts... 70561 80564...           Balance Sheet
df_income_sheet : 
           0                                     sheet_type
6   Income Statement                           Income Statement
7   Report Date 12/31/2001 12/31/...           Income Statement
8   Passenger revenues 5378702 546796...       Income Statement
9   Freight revenues 91270 110742...           Income Statement
10  Charter & other - - ...                    Income Statement
11  Special revenue adjustment - - ...         Income Statement
df_cash_flow:
              0                                         sheet_type
12  Statement of Retained Earnings              Statement of Retained Earnings
13  Report Date 12/31/2001 12/31/2...           Statement of Retained Earnings
14  Previous ret earn... 2902007 2385854...     Statement of Retained Earnings
15  Cumulative effect of.. - - ...              Statement of Retained Earnings