Python pandas read_json reads large integers as strings incorrectly
I am trying to read tweets stored as JSON files, and I am using pandas to load the data. However, I found some strange behavior in the read_json function. Here is the relevant information:
On my computer this outputs the following:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid 4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
tid
1 9999999999999998
2 10000000000000000
3 10000000000000000
4 10000000000000002
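The changed values (9999999999999999 and 10000000000000001 both becoming 10000000000000000) match what happens when an integer above 2**53 passes through a 64-bit float. The sketch below uses only plain Python and assumes, without confirming it from the pandas source, that the inferred conversion goes through a float at some point. The `tids` list is reconstructed from the output above and is not the original input file.

```python
# Assumed input: the tweet IDs as strings, as they appear in the JSON files.
tids = [
    "9999999999999998",
    "9999999999999999",
    "10000000000000001",
    "10000000000000002",
]

# Converting through float64 rounds odd integers above 2**53 to a neighbor.
via_float = [int(float(t)) for t in tids]

# Converting the strings directly to int keeps every digit.
direct = [int(t) for t in tids]

print(via_float)
# [9999999999999998, 10000000000000000, 10000000000000000, 10000000000000002]
print(direct)
# [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]
```

The `via_float` list reproduces the values printed above, while `direct` matches what an explicit integer dtype returns.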
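Update:
It reads correctly when the parameter dtype=int is specified explicitly, but I don't understand why. What changes when we specify the dtype?
You can specify the dtype explicitly: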
In [32]: df=pd.read_json(json_content,
...: orient='index', # read as transposed
...: convert_axes=False, # don't convert keys to dates
...: dtype='int64' # <------- NOTE
...: )
...: print(df.info())
...: print(df)
...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid 4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
tid
1 9999999999999998
2 9999999999999999
3 10000000000000001
4 10000000000000002
So it looks related to type inference, because by default dtype=True, which means: if True, infer dtypes.
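If the inference step is the culprit, one way to sidestep it is to parse the file with the standard json module and convert the ID strings to integers yourself. This is only a sketch: the `json_content` literal is an assumed reconstruction of the input (the comments say the tid values are stored as strings), and the resulting dictionary would still have to be handed to pandas afterwards.

```python
import json

# Assumed shape of the input: tid values stored as strings.
json_content = """
{
  "1": {"tid": "9999999999999998"},
  "2": {"tid": "9999999999999999"},
  "3": {"tid": "10000000000000001"},
  "4": {"tid": "10000000000000002"}
}
"""

records = json.loads(json_content)

# int() on a decimal string is exact, so no digits are lost here.
tids = {key: int(row["tid"]) for key, row in records.items()}

for key, tid in tids.items():
    print(key, tid)
```

Python integers have arbitrary precision, so the conversion itself cannot round. Whether this is preferable to passing an explicit dtype to read_json depends on how much of the rest of the loading logic relies on pandas.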
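Thanks for the information. I am also looking for an explanation of this behavior. I happened to find that explicitly specifying dtype works before you posted. It is a workaround, but it does not answer my question. (The downvote you just got is not mine.)
@UdayrajDeshmukh, it is related to type inference: if you pass integers instead of strings (for example "tid": 100000000000002 instead of "tid": "100000000000002"), it works properly. PS: dtype defaults to True: "if True, infer dtypes".
Actually, I am reading from about 100 JSON files (a sample of a Twitter database), and these files already have the tid column as strings.
@UdayrajDeshmukh, yes, I am trying to find a reason; it looks like a "type inference" problem.
The same issue was raised: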
import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize, tid_0, sys.maxsize // tid_0)  # 0 if int64 overflow is possible
# minimal case
tid = 10000000000000001
print(sys.maxsize, tid, sys.maxsize // tid)  # 0 if int64 overflow is possible
# Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922
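The check above shows that both IDs fit comfortably in a signed 64-bit integer, so int64 overflow does not explain the result. A narrower bound may be the relevant one: a float64 represents every integer exactly only up to 2**53. A minimal sketch, where `survives_float64` is a hypothetical helper introduced here for illustration:

```python
INT64_MAX = 2**63 - 1          # largest signed 64-bit integer
FLOAT64_EXACT_LIMIT = 2**53    # every integer up to here is exact in float64


def survives_float64(n: int) -> bool:
    """Return True if n is unchanged by a round trip through float64."""
    return int(float(n)) == n


tid = 10000000000000001  # the minimal case from above

print(tid <= INT64_MAX)           # True: no int64 overflow
print(tid > FLOAT64_EXACT_LIMIT)  # True: beyond the exact float64 range
print(survives_float64(tid))      # False: the round trip changes the value
```

Values above 2**53 are not always altered (even numbers in this range can survive), which would explain why only two of the four IDs change.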
In [61]: %paste
json_content="""
{
  "1": {
    "tid": 9999999999999998
  },
  "2": {
    "tid": 9999999999999999
  },
  "3": {
    "tid": 10000000000000001
  },
  "4": {
    "tid": 10000000000000002
  }
}
"""
df=pd.read_json(json_content,
orient='index', # read as transposed
convert_axes=False, # don't convert keys to dates
)
print(df.dtypes)
print(df)
## -- End pasted text --
tid int64
dtype: object
tid
1 9999999999999998
2 9999999999999999
3 10000000000000001
4 10000000000000002