How can I easily use custom class methods in PySpark operations?


I have a class Age, a CSV file, and a PySpark runtime session.

ages.csv

Name;Age
alpha;noise20noise
beta;noi 3 sE 0
gamma;n 4 oi 0 se
phi;n50ise
detla;3no5ise
kappa;No 4 i 5 sE
omega;25noIsE

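After the Age column is parsed, each value is effectively read as the number hidden in the noise (for example, noise20noise is read as 20).

Class definition: Age

For example, I now want to get everyone who is 20 years old: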

>>> from age import Age
>>> ages.filter(ages.Age == Age(20)).show()

This is the error I get:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/column.py", line 116, in _
    njc = getattr(self._jc, name)(jc)
File "/opt/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/opt/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/opt/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/opt/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'Age' object has no attribute '_get_object_id'

As a second attempt, with Age declared as class Age(str):

>>> ages.filter(ages.Age == Age(20)).show()
+----+---+
|Name|Age|
+----+---+
+----+---+

Yet in plain Python we still have:

>>> 'noise20noise' == Age(20)
True

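As you can see, the error

AttributeError: 'Age' object has no attribute '_get_object_id'

disappears, but the filter does not compute the right answer. That is my second problem.

I tried once more, this time with a PySpark user-defined function: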

>>> import pyspark.sql.functions as F
>>> import pyspark.sql.types as T
>>> eq20 = F.udf(lambda c: c == Age(20), T.BooleanType())
>>> ages.filter(eq20(ages.Age)).show()
+-----+------------+
| Name|         Age|
+-----+------------+
|alpha|noise20noise|
+-----+------------+

Now this works. The problem is that I like the first idiom best:

>>> ages.filter(ages.Age == Age(20)).show()

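It is simpler and more expressive. I do not want to define functions such as eq20, eq21, less_than50, greater_than30 and so on every time.
I could put that definition in the Age class itself, but I do not know how. Still, this is what I have tried so far, using a Python decorator:
age.py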

# other imports here
...

import pyspark.sql.functions as F
import pyspark.sql.types as T

def connect_to_pyspark(function):
    return F.udf(function, T.BooleanType())

class Age(str):
    ...

    @connect_to_pyspark
    def __eq__(self, other):
        return self.age == self.__parse(other)

    ...
    # do the same decorator for the other comparative methods

Testing again:
>>> ages.filter(ages.Age == Age(20)).show()
+----+---+
|Name|Age|
+----+---+
+----+---+

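But it does not work. Or is my decorator badly written?
How can I solve all of this?
Is my solution to the first problem good enough? If not, what should I do instead? If so, how do I solve the second problem?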
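Getting ages.Age == Age(20) to work is going to be pretty hard, because Spark does not respect the Python convention for implementing __eq__. More on that below, but if you are OK with writing Age(20) == ages.Age, then you have some options. IMHO, the easiest approach is to wrap only the parsing logic in a udf: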
parse_udf = F.udf(..., T.IntegerType())
class Age:
    ...
    def __eq__(self, other: Column):
        return F.lit(self.age) == parse_udf(other)
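
The parsing function passed to F.udf is elided above. As a rough sketch of one possibility, assuming the age is simply every digit in the string joined together (as the __parse method shown later in this answer does), it could look like the following. The names parse_age and age_column are hypothetical, and pyspark is imported lazily only so that the parsing logic can be tested without a Spark installation.

```python
import re


def parse_age(raw):
    """Join every digit found in `raw` into an int, e.g. 'noi 3 sE 0' -> 30.

    Returns None for a null value or a string without digits
    (an assumption; the original does not say how to treat those).
    """
    if raw is None:
        return None
    digits = re.findall(r"\d", raw)
    return int("".join(digits)) if digits else None


def age_column(col):
    """Return an integer Column holding the parsed age (hypothetical helper)."""
    # Imported here so parse_age stays usable without pyspark installed.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    return F.udf(parse_age, T.IntegerType())(col)
```

Because age_column returns an ordinary integer Column, the column can stay on the left-hand side: ages.filter(age_column(ages.Age) == 20) or age_column(ages.Age).between(20, 30) should then be plain Column expressions, with no Age object involved at all.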

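Note that Age here is not a subclass of str; subclassing str can only do harm. If you want to use a decorator, the decorator should not return a udf. It should return a function that applies the udf, like this: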
import re
import pyspark.sql.functions as F
import pyspark.sql.types as T

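def connect_to_pyspark(function):
    # Return a function that applies a udf to a column, not the udf itself.
    def helper(age, other):
        # `age` is the Age instance (self); `other` is the Column being compared.
        my_udf = F.udf(lambda item_from_other: function(age, item_from_other), T.BooleanType())
        return my_udf(other)
    return helper

class Age:

    def __init__(self, age):
        self.age = age  # keep the age passed to the constructor

    def __parse(self, other):
        # Keep only the digits, e.g. 'noise20noise' -> 20.
        return int(''.join(re.findall(r'\d', other)))

    @connect_to_pyspark
    def __eq__(self, other):
        return self.age == self.__parse(other)

ages.withColumn("eq20", Age(20) == ages.Age).show()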
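
The question also asks about the other comparison methods. One way to avoid decorating each of them by hand is to generate them from the operator module. This is only a sketch under assumptions: compare_with and parse_age are hypothetical names, the digit-joining rule is the one used by __parse above, and the Spark branch has not been run against a cluster here.

```python
import operator
import re


def parse_age(raw):
    """Join the digits in `raw`, e.g. 'n50ise' -> 50."""
    return int("".join(re.findall(r"\d", raw)))


def compare_with(op):
    """Build a comparison method from `op` (operator.eq, operator.lt, ...)."""

    def method(self, other):
        own_age = self.age  # capture a plain int, not the whole Age object

        def check(raw):
            return op(own_age, parse_age(raw))

        if isinstance(other, str):
            # Ordinary Python string: compare right away.
            return check(other)
        # Otherwise assume `other` is a Column and apply the check row by row.
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        return F.udf(check, T.BooleanType())(other)

    return method


class Age:
    def __init__(self, age):
        self.age = age

    __eq__ = compare_with(operator.eq)
    __ne__ = compare_with(operator.ne)
    __lt__ = compare_with(operator.lt)
    __le__ = compare_with(operator.le)
    __gt__ = compare_with(operator.gt)
    __ge__ = compare_with(operator.ge)
```

As with the decorator above, this only helps when the Age object is the left operand, for example Age(20) < ages.Age, which reads as "20 is less than the parsed age".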

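More on why you need to write Age(20) == ages.Age. In Python, if you evaluate a == b and the class of a does not know how to compare itself with b, it should return NotImplemented, and Python will then try b.__eq__(a). Spark, however, never returns NotImplemented, so the __eq__ of Age is only called when the Age object comes first in the expression :(.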
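
The dispatch rule can be seen without Spark. In this small illustration, Polite and Greedy are made-up stand-ins rather than PySpark classes: Polite returns NotImplemented, so Python falls back to Age.__eq__, while Greedy always produces a result of its own, which is the behavior attributed to Spark columns above.

```python
calls = []  # records which __eq__ implementations were invoked


class Age:
    def __init__(self, age):
        self.age = age

    def __eq__(self, other):
        calls.append("Age.__eq__")
        return True


class Polite:
    """Left operand that admits it cannot compare itself with an Age."""

    def __eq__(self, other):
        calls.append("Polite.__eq__")
        return NotImplemented  # Python then tries other.__eq__(self)


class Greedy:
    """Stand-in for a column-like object: it never returns NotImplemented."""

    def __eq__(self, other):
        calls.append("Greedy.__eq__")
        return "expression built from " + type(other).__name__


Polite() == Age(20)
polite_calls = list(calls)  # ['Polite.__eq__', 'Age.__eq__']

calls.clear()
Greedy() == Age(20)
greedy_calls = list(calls)  # ['Greedy.__eq__']

calls.clear()
Age(20) == Greedy()
age_first_calls = list(calls)  # ['Age.__eq__']
```

Only the first and third comparisons ever reach Age.__eq__; when the greedy object is on the left, Age gets no say in the result.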
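This only works one way, Age(20) == df.Age, and not the other way, df.Age == Age(20). I still get the error AttributeError: 'Age' object has no attribute '_get_object_id'. The latter is what I really need, because then I could even do things like df.Age.between(Age(20), Age(30)). I hope this can be solved.
@mctrjalloh Bring it up on the Spark mailing list; it cannot be solved without changing pyspark :(
OK. I will try to raise the issue.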