Python pandas read_json reads large integers stored as strings incorrectly


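I am trying to read tweets stored in a JSON file, and I use pandas to load the data. However, I found some strange behavior in the read_json function. I am providing the following information: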

On my machine this outputs the following:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002
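Update

The values are read correctly when the parameter dtype=int is specified explicitly, but I don't understand why. What changes when we specify the dtype?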


You can specify the dtype explicitly:

In [32]: df=pd.read_json(json_content,
    ...:                 orient='index', # read as transposed
    ...:                 convert_axes=False, # don't convert keys to dates
    ...:                 dtype='int64'   # <------- NOTE
    ...:         )
    ...: print(df.info())
    ...: print(df)
    ...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

So it looks like this is related to type inference, because by default dtype=True, which means:
"If True, infer dtypes"
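One plausible reading of the rounded values in the question, offered as an assumption rather than a confirmed description of pandas internals: if inference passes the strings through float64 before casting to int64, integers above 2**53 can no longer all be represented exactly. The sketch below uses only built-in Python (no pandas) to show that such a round trip reproduces the values printed in the question, while a direct int() conversion keeps them intact.

```python
# IDs from the question, stored as strings.
ids = ["9999999999999998", "9999999999999999",
       "10000000000000001", "10000000000000002"]

# Assumed inference path: string -> float64 -> int64.
# Python's float is an IEEE 754 double, the same format as float64.
via_float = [int(float(s)) for s in ids]

# Explicit integer path: string -> int, with no float in between.
direct = [int(s) for s in ids]

# float64 has a 53-bit significand, so odd integers above 2**53 get rounded.
print(2 ** 53)      # 9007199254740992
print(via_float)    # [9999999999999998, 10000000000000000, 10000000000000000, 10000000000000002]
print(direct)       # [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]
```

This would also be consistent with dtype='int64' working: under that assumption, the explicit dtype skips the float step.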

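Thanks for the information. I am also looking for an explanation of this behavior. I happened to discover that specifying dtype explicitly works just before you posted. It is a workaround, but it does not answer my question. (The downvote you just received was not mine.)

@UdayrajDeshmukh, it is related to type inference: if you pass integers instead of strings (e.g. "tid": 100000000000002 instead of "tid": "100000000000002"), it works correctly. PS: the default is dtype=True, i.e. "If True, infer dtypes".

Actually, I am reading from about 100 JSON files (a sample of a Twitter database), and those files already store the tid column as strings.

@UdayrajDeshmukh, yes, I am trying to find the reason. It appears to be a "type inference" issue.

The same issue was raised: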
import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize, tid_0, sys.maxsize // tid_0)    # 0 if tid_0 exceeds sys.maxsize (overflow possible)
# minimal case
tid = 10000000000000001
print(sys.maxsize, tid, sys.maxsize // tid)    # 0 if tid exceeds sys.maxsize (overflow possible)

#Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922
In [61]: %paste
json_content="""
{
    "1": {
        "tid": 9999999999999998
    },
    "2": {
        "tid": 9999999999999999
    },
    "3": {
        "tid": 10000000000000001
    },
    "4": {
        "tid": 10000000000000002
    }
}
"""

df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.dtypes)
print(df)

## -- End pasted text --
tid    int64
dtype: object
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
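
For files where tid is stored as a string, as described in the comments, one way to avoid type inference altogether is to parse the file with the standard json module and convert each ID with int(). This is only a sketch of an alternative, not something taken from the answer above; the sample data is a hypothetical stand-in for the question's files.

```python
import json

# Hypothetical sample mirroring the question's data, with IDs stored as strings.
json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""

records = json.loads(json_content)

# int() parses the decimal string exactly; no float is involved at any point.
tids = {key: int(row["tid"]) for key, row in records.items()}
print(tids)
```

The resulting dictionary holds plain Python integers, so it can be handed to pandas afterwards (for example as a Series) without the IDs having passed through a float first.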