pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')
Tags: python, pandas, parquet, pyarrow, fastparquet
Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))
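This code fails with the pyarrow.lib.ArrowInvalid error quoted in the title (X is the repr of the object, Y is its type). The same error is raised by df.to_parquet('players.pq').

Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using parquet.write_table().
- Using Python 3.8.0, pandas 0.25.3 and pyarrow 0.13.0
- pandas.DataFrame.to_parquet() does not support multi-index, so a solution using pq.write_table(pa.Table.from_pandas(df)) is preferred

Thanks!

As far as I understand, this is related to __repr__: pyarrow does not know how to convert an arbitrary Python object, so convert each object to a string first. Try this approach (it works):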
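import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    # Return a plain string, which pyarrow can store as a string column
    def other(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm').other(),
    Player('Ryan', 18, 'm').other(),
    Player('Jane', 35, 'f').other(),
]
df = pd.DataFrame(data, columns=['player'])
print(df)
#         player
# 0  <Jack (21)>
# 1  <Ryan (18)>
# 2  <Jane (35)>
print(pa.Table.from_pandas(df))
# pyarrow.Table
# player: string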
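My suggestion is to insert the data into the DataFrame already serialized.

Best option - use a dataclass (python >= 3.7)
Define the Player class as a dataclass with the decorator and let dataclasses.asdict() do the conversion for you, then dump the resulting dict to JSON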
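import json
import pandas as pd
from dataclasses import dataclass, asdict

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

dataV2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# asdict() turns each instance into a plain dict that json.dumps() accepts
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
print(df_v2)

# Can still get an object's attributes by deserializing the record
json.loads(df_v2["player"][0])['name']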
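If you want the objects back rather than just their attributes, the same idea can be wrapped in a pair of helpers. This is a rough sketch using only the standard library; `to_json` and `from_json` are hypothetical helper names, not something the dataclasses module provides.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str


def to_json(player: PlayerV2) -> str:
    # asdict() converts the dataclass into a plain dict of its fields
    return json.dumps(asdict(player))


def from_json(text: str) -> PlayerV2:
    # The JSON keys match the field names, so they can be passed as keywords
    return PlayerV2(**json.loads(text))


original = PlayerV2(name='Jack', age=21, gender='m')
text = to_json(original)
restored = from_json(text)

print(text)                  # {"name": "Jack", "age": 21, "gender": "m"}
print(restored == original)  # True
```

The JSON strings are what would go into the DataFrame column, and from_json() would be applied after reading the column back.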
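Serialize the objects manually (python < 3.7)
Define a serialization function in the Player class and serialize each instance before creating the DataFrame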
import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

    # The serialization function for JSON; if for some reason you really need pickle, you can use it instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]
df = pd.DataFrame(data, columns=['player'])

# You can see all the data inserted as serialized JSON in the player column
print(df)

# Can still get an object's attributes by deserializing the record
json.loads(df["player"][0])['name']
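Since the question asked about pickle specifically, here is a hedged sketch of that variant using only the standard library. It is not from the original answer. Each object is pickled to a bytes value; putting those bytes into the DataFrame column instead of the JSON strings should give a binary column in Arrow, because pyarrow maps Python bytes to its binary type.

```python
import pickle


class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender


players = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]

# Serialize each object up front; these bytes are what would go in the column
blobs = [pickle.dumps(p) for p in players]

# After reading the column back, unpickle to get Player instances again
restored = [pickle.loads(b) for b in blobs]

print([(p.name, p.age, p.gender) for p in restored])
```

Keep in mind that a pickled column is opaque to every other Parquet reader and that pickle.loads() should only be run on data you trust, so JSON is usually the safer choice unless you really need the full objects.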
Can you open a JIRA issue with Apache Arrow? We don't really engage with users or developers on StackOverflow.
Did you ever figure this out?