Why does pandas.read_json modify the value of long integers?

I don't know why the original values of id_1 & id_2 change when I print them.

I have a JSON file named test_data.json:

{
  "objects": {
    "value": {
      "1298543947669573634": {
        "timestamp": "Wed Aug 26 08:52:57 +0000 2020",
        "id_1": "1298543947669573634",
        "id_2": "1298519559306190850"
      }
    }
  }
}

Output

python test_data.py 
                  id_1                 id_2                 timestamp
0  1298543947669573632  1298519559306190848 2020-08-26 08:52:57+00:00

My code, in test_data.py, is:

import pandas as pd
import json
file = "test_data.json"
with open(file, "r") as f:
    all_data = json.loads(f.read())
data = pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
data = data.reset_index(drop=True)
print(data.head())

How can I fix this so that the numeric values are interpreted correctly?

1 comment
Using python 3.7.4 & pandas 0.25.1
python
pandas
json-normalize
Abhisek Chowdhury
Posted on 2020-09-06
2 answers
Trenton McKinney
Posted on 2021-05-11
Accepted
0 upvotes
  • Using python 3.8.5 and pandas 1.1.1
  • Current Implementation

  • First, the code reads the file in and converts it from a str type to a dict , with json.loads
  • with open(file, "r") as f:
        all_data = json.loads(f.read())
    
  • Then 'value' is converted back to a str
  • json.dumps(all_data['objects']['value'])
    
  • Using orient='index' makes each outer key a row label and each inner key ('timestamp', 'id_1', 'id_2') a column header.
  • read_json also converts the ID strings to int at this point, and the values change.
  • I'm guessing that a floating-point conversion happens somewhere in this step.
  • pandas issue: read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608
    Option 1

  • Use pandas.DataFrame.from_dict and then convert columns to numeric.
  • # use .from_dict
    data = pd.DataFrame.from_dict(all_data['objects']['value'], orient='index')
    # convert columns to numeric
    data[['id_1', 'id_2']] = data[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
    data = data.reset_index(drop=True)
    # display(data)
                            timestamp                 id_1                 id_2
    0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850
    print(data.info())
    [out]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   timestamp  1 non-null      object
     1   id_1       1 non-null      int64 
     2   id_2       1 non-null      int64 
    dtypes: int64(2), object(1)
    memory usage: 152.0+ bytes

    Option 2

  • Use pandas.json_normalize and then convert columns to numeric.
  • file = "test_data.json"
    with open(file, "r") as f:
        all_data = json.loads(f.read())
    # read all_data into a dataframe
    df = pd.json_normalize(all_data['objects']['value'])
    # rename the columns
    df.columns = [x.split('.')[1] for x in df.columns]
    # convert to numeric
    df[['id_1', 'id_2']] = df[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
    # display(df)
                            timestamp                 id_1                 id_2
    0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850
    print(df.info())
    [out]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   timestamp  1 non-null      object
     1   id_1       1 non-null      int64 
     2   id_2       1 non-null      int64 
    dtypes: int64(2), object(1)
    memory usage: 152.0+ bytes
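
Either option can be sanity-checked by comparing the resulting columns against the original ID strings. The following is a minimal sketch, not part of the original answer: `value` is an inline stand-in for `all_data['objects']['value']`, and `ids_preserved` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

# Inline stand-in for all_data['objects']['value'] from test_data.json
value = {
    "1298543947669573634": {
        "timestamp": "Wed Aug 26 08:52:57 +0000 2020",
        "id_1": "1298543947669573634",
        "id_2": "1298519559306190850",
    }
}


def ids_preserved(df, source, id_cols=("id_1", "id_2")):
    """Hypothetical helper: True if each ID column still matches the source strings."""
    records = list(source.values())
    for col in id_cols:
        expected = [rec[col] for rec in records]
        # Compare as strings so a silent cast to float shows up as a mismatch
        actual = [str(v) for v in df[col]]
        if actual != expected:
            return False
    return True


# Option 1 from the answer, applied to the inline data
df = pd.DataFrame.from_dict(value, orient="index").reset_index(drop=True)
df[["id_1", "id_2"]] = df[["id_1", "id_2"]].apply(pd.to_numeric, errors="coerce")

print(ids_preserved(df, value))
```

Comparing string forms rather than numbers is a deliberate choice here: if a column were ever cast to float, its string form would no longer match the source.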
        
    Thanks, man. Option 1 using from_dict works great. Even the .apply(pd.to_numeric, errors='coerce') turned out to be optional in my case, but I used it anyway. You saved me!
    Eli_B
    Posted on 2021-05-11
    0 upvotes

    This is caused by issue 20608, and it still occurs in the current 1.2.4 version of pandas.
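
The precision loss can be reproduced without pandas. The sketch below assumes the cause is a string -> float -> int round trip, as the issue describes. It does not reproduce pandas internals, and `via_float` is a hypothetical helper name. A float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable.

```python
# Illustrative only: round-trip the question's IDs through a float64.
ids = [1298543947669573634, 1298519559306190850]


def via_float(n: int) -> int:
    """Convert an int to a Python float (IEEE 754 double) and back to int."""
    return int(float(n))


for original in ids:
    # The round-tripped values match the altered IDs printed in the question
    print(original, "->", via_float(original))

# Every integer with magnitude up to 2**53 survives the round trip unchanged
EXACT_LIMIT = 2 ** 53
print(via_float(EXACT_LIMIT) == EXACT_LIMIT)
```

Both IDs lie between 2**60 and 2**61, where adjacent float64 values are 256 apart, so each ID is rounded to the nearest multiple of 256.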

    Here is my workaround, which on my data is even slightly faster than read_json.

    import pathlib

    import pandas as pd

    def broken_load_json(path):
        """There's an open issue: https://github.com/pandas-dev/pandas/issues/20608
        about read_json loading large integers incorrectly because it's converting
        from string to float to int, losing precision."""
        df = pd.read_json(pathlib.Path(path), orient='index')
        return df
    def orjson_load_json(path):
        import orjson  # The built-in json module would also work
        with open(path) as f:
            d = orjson.loads(f.read())
        df = pd.DataFrame.from_dict(d, orient='index')  # Builds the index from the dict's keys as strings, sadly
        # Fix the dtype of the index
        df = df.reset_index()