pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))
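
This raises the pyarrow.lib.ArrowInvalid error quoted in the title. Calling df.to_parquet('players.pq') fails the same way.

Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution?

The pyarrow.Table will eventually be written to disk using pyarrow.parquet.write_table().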

  • Using Python 3.8.0, pandas 0.25.3 and pyarrow 0.13.0
  • pandas.DataFrame.to_parquet() does not support multi-index, so a solution using pq.write_table(pa.Table.from_pandas(df)) is preferred

Thanks!

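As far as I understand, this is because of __repr__. Try this approach (it works), which stores a plain string per player instead of the Player object:

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def other(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm').other(),
    Player('Ryan', 18, 'm').other(),
    Player('Jane', 35, 'f').other(),
]
df = pd.DataFrame(data, columns=['player'])
print(df)

        player
0  <Jack (21)>
1  <Ryan (18)>
2  <Jane (35)>

print(pa.Table.from_pandas(df))

pyarrow.Table
player: string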
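
The string column above keeps only the formatted text of each player. As an alternative that the answer does not propose, the objects could be flattened into one plain dict per player, so that every attribute ends up in its own column. The sketch below uses only the standard library; the names `players` and `records` are illustrative.

```python
class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender


players = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]

# vars() returns the attribute dict of an instance; dict() copies it so the
# rows stay independent of the original objects
records = [dict(vars(p)) for p in players]

for row in records:
    print(row)
```

Passing `records` to pd.DataFrame(records) should give name, age and gender columns holding ordinary strings and integers, which are value types Arrow can infer.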

My suggestion is to serialize the data before inserting it into the DataFrame.

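Best option: use a dataclass (python >= 3.7)
Define the Player class as a dataclass with the decorator. dataclasses.asdict() then turns each instance into a plain dict that json.dumps() can serialize, so no hand-written serialization method is needed.

import json
from dataclasses import asdict, dataclass

import pandas as pd

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'


data_v2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# asdict() gives a plain dict per instance, which json.dumps() can serialize
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in data_v2], columns=['player'])
print(df_v2)

# The object's attributes are still available by deserializing a record
print(json.loads(df_v2['player'][0])['name'])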
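
Reading a single attribute back is one use of the serialized column; rebuilding the full object is another. The following is a minimal sketch of that round trip, assuming the JSON keys match the dataclass field names. The helpers `to_json` and `from_json` are hypothetical names introduced here, not part of the answer.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str


def to_json(player):
    # asdict() converts the dataclass into a plain dict that json can encode
    return json.dumps(asdict(player))


def from_json(text):
    # The JSON keys match the field names, so they can be passed as keywords
    return PlayerV2(**json.loads(text))


original = PlayerV2(name='Jack', age=21, gender='m')
encoded = to_json(original)
restored = from_json(encoded)

print(encoded)
print(restored == original)
```

The comparison relies on the `__eq__` method that the dataclass decorator generates, which compares instances field by field.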
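
Serialize the objects manually (python < 3.7)
Define a serialization function in the Player class and serialize each instance before creating the DataFrame.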

import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'
    
    # Serialization function for JSON; if you really need pickle for some reason, you can use it here instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]
df = pd.DataFrame(data, columns=['player'])

# The player column now holds each object as a serialized JSON string
print(df)

# The object's attributes are still available by deserializing a record
print(json.loads(df["player"][0])['name'])
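
The comment in the code above mentions pickle as a substitute for JSON. A minimal sketch of that variant follows, assuming it is acceptable to pickle only the attribute dict rather than the instance itself. The helpers `to_bytes` and `from_bytes` are hypothetical names used for illustration.

```python
import pickle


class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender


def to_bytes(player):
    # Pickle only the attribute dict, so the stored bytes do not depend on
    # where the Player class is defined
    return pickle.dumps(vars(player))


def from_bytes(blob):
    # Only unpickle data you trust: pickle.loads can execute arbitrary code
    return Player(**pickle.loads(blob))


players = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
blobs = [to_bytes(p) for p in players]

restored = from_bytes(blobs[0])
print(restored.name, restored.age, restored.gender)
```

Building the frame with pd.DataFrame(blobs, columns=['player']) should then give a column of bytes values, which pyarrow can store as a binary column. Unlike the JSON variant, the stored values are not human-readable and are tied to Python.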

Can you open a JIRA issue with Apache Arrow? We don't really engage with users or developers on StackOverflow. Have you thought about it?