How to create a PySpark UDF that iterates over an array column

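Beginner PySpark question here: how do I create a UDF that iterates over an array of strings in a column?

I have a DataFrame of about 6 million rows in which I have extracted elements into separate columns. Here is an example: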

from pyspark.sql.types import *
d = [{'1-ID': 'Alice', '2-Full_Classification': ["H02J 1/08 20060101AFI20151217BHEP", "B63H 21/17 20060101ALI20151217BHEP", 'B65D 39/12 20060101A I20051008RMEP'], '3-Section': ['H', 'B', 'B'],  '4-Class': ['02', '63', '65'], '5-SubClass': ['J', 'H', 'D']}]
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Full_Classification", ArrayType(StringType()), True),
    StructField("Section", ArrayType(StringType()), True),
    StructField("Class", ArrayType(StringType()), True),
    StructField("SubClass", ArrayType(StringType()), True)
    ])
df = spark.createDataFrame(d)
df.printSchema()
df.show(truncate=100)

Output:

root
 |-- 1-ID: string (nullable = true)
 |-- 2-Full_Classification: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 3-Section: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 4-Class: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 5-SubClass: array (nullable = true)
 |    |-- element: string (containsNull = true)
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
| 1-ID|                                                                               2-Full_Classification|3-Section|     4-Class|5-SubClass|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
|Alice|[H02J 1/08 20060101AFI20151217BHEP, B63H 21/17 20060101ALI20151217BHEP, B65D 39/12 20060101A I200...|[H, B, B]|[02, 63, 65]| [J, H, D]|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+

I want to create a UDF that computes an "agreement" score for each extracted list/array, as follows:

List1 = ['A', 'A', 'A', 'A'] #agreement score would be 100%
List2 = ['12', '12', '12', '13', '13', '13'] #agreement score would be 50%
List3 = ['C', 'D', 'E'] #agreement score would be 0%

Edit: the desired output in this case:

+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
| 1-ID|2-Full_Classification|3-Section|     4-Class|5-SubClass|Class-Score|Section-Score|SubClass-Score|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
|Alice| [H02J 1/08 200601...|[H, B, B]|[02, 63, 65]| [J, H, D]|        0.0|         0.33|           0.0|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+

Here is my UDF:

def agreement(col):
    unique_items = set(list(col))
    unique_items_count = len(unique_items)
    if  unique_items_count == 1:
        agreement = 1.0
    elif  unique_items_count == len(list(col)):
        agreement =  0.0
    else:
        agreement = 1.0 - (unique_items_count / len(list(col)))
    return agreement
agreement_udf = udf(agreement)
df = df.withColumn('Section_Score', agreement_udf('3-Section'))
df.show()

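However, when I run this code, I get one of the following error messages (the first is for my full 6M-row DataFrame, the last is for the example DataFrame above):

AttributeError: 'NoneType' object has no attribute '_jvm'

or

TypeError: 'NoneType' object is not iterable

or

TypeError: object of type 'type' has no len()

I'm confused, because each column contains an array of strings, and those should be iterable. Any suggestions?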
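One possible way to make the UDF more robust, sketched under assumptions rather than as a confirmed fix: the "'NoneType' object is not iterable" message suggests that some rows of the full DataFrame may hold a null array, so the function below returns None for null or empty input. It also passes DoubleType as the return type, because udf() defaults to StringType when none is given. The names safe_agreement and add_score_column are hypothetical and do not appear in the original post.

```python
def safe_agreement(values):
    """Agreement score for a list of strings; None for null or empty input."""
    if not values:  # null array (None) or empty array: no score can be computed
        return None
    unique_count = len(set(values))
    if unique_count == 1:  # every element agrees
        return 1.0
    # 1 - unique/total; evaluates to 0.0 when all elements differ
    return 1.0 - unique_count / float(len(values))


def add_score_column(df, source_col, score_col):
    """Return df with a new double column holding the agreement score."""
    # Imported here so the scoring function above can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    agreement_udf = udf(safe_agreement, DoubleType())
    return df.withColumn(score_col, agreement_udf(source_col))


# Usage, assuming df is the example DataFrame from the question:
# df = add_score_column(df, '3-Section', 'Section-Score')
```

Keeping the scoring logic in a plain Python function makes it easy to check on ordinary lists before running it over millions of rows.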

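Please state explicitly the agreement scores for the single row shown in your example (preferably add a few more rows with their expected results); also, you have not shown the definition of agreement_udf.

Good point! Both were added in a recent edit.

It is still unclear how the respective scores shown are obtained from Section, Class, and SubClass.

The "agreement score" for each field is based on how much the elements of the respective array "agree" with each other. For example, the "Section Score" is based on the "Section" array ['H', 'B', 'B']. According to the UDF, score = 1 - (count of unique values / count of all array values) = 1 - 2/3 = 0.33.

The Class array has 3 distinct values: ['02', '63', '65']. None of the three are the same (none agree), so the score is 0.0, from this branch of the UDF:

elif unique_items_count == len(list(col)): agreement = 0.0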
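The arithmetic in this explanation can be checked in plain Python, without Spark. This is only a minimal sketch: score is a hypothetical helper that mirrors the formula quoted above, and the input lists are copied from the example row.

```python
def score(values):
    """1 - (unique count / total count), with 1.0 when all elements are equal."""
    unique_count = len(set(values))
    if unique_count == 1:
        return 1.0
    return 1.0 - unique_count / float(len(values))


# Arrays from the example row (ID 'Alice')
row = {
    'Section': ['H', 'B', 'B'],
    'Class': ['02', '63', '65'],
    'SubClass': ['J', 'H', 'D'],
}

# Round to two decimals, as in the desired output table
scores = {name: round(score(values), 2) for name, values in row.items()}
print(scores)
```

Section comes out as 0.33, while Class and SubClass come out as 0.0, which matches the desired output table in the question.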