Python: How to create a PySpark UDF that iterates over an array column
Beginner PySpark question here: how do I create a UDF that iterates over an array of strings in a column? I have a DataFrame of about 6 million rows in which I have extracted elements into separate columns. Here is an example:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

d = [{'1-ID': 'Alice', '2-Full_Classification': ["H02J 1/08 20060101AFI20151217BHEP", "B63H 21/17 20060101ALI20151217BHEP", 'B65D 39/12 20060101A I20051008RMEP'], '3-Section': ['H', 'B', 'B'], '4-Class': ['02', '63', '65'], '5-SubClass': ['J', 'H', 'D']}]

# Note: this schema is defined but never passed to createDataFrame below,
# so the column names and types in the output are inferred from the dict keys.
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Full_Classification", ArrayType(StringType()), True),
    StructField("Section", ArrayType(StringType()), True),
    StructField("Class", ArrayType(StringType()), True),
    StructField("SubClass", ArrayType(StringType()), True)
])

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell).
df = spark.createDataFrame(d)
df.printSchema()
df.show(truncate=100)
Output:
root
|-- 1-ID: string (nullable = true)
|-- 2-Full_Classification: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 3-Section: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 4-Class: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 5-SubClass: array (nullable = true)
| |-- element: string (containsNull = true)
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
| 1-ID| 2-Full_Classification|3-Section| 4-Class|5-SubClass|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
|Alice|[H02J 1/08 20060101AFI20151217BHEP, B63H 21/17 20060101ALI20151217BHEP, B65D 39/12 20060101A I200...|[H, B, B]|[02, 63, 65]| [J, H, D]|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
I want to create a UDF that computes an "agreement" score for each extracted list/array, as follows:
List1 = ['A', 'A', 'A', 'A'] #agreement score would be 100%
List2 = ['12', '12', '12', '13', '13', '13'] #agreement score would be 50%
List3 = ['C', 'D', 'E'] #agreement score would be 0%
Edit: the desired output in this case:
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
| 1-ID|2-Full_Classification|3-Section| 4-Class|5-SubClass|Class-Score|Section-Score|SubClass-Score|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
|Alice| [H02J 1/08 200601...|[H, B, B]|[02, 63, 65]| [J, H, D]| 0.0| 0.33| 0.0|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
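As a quick sanity check, the scores in the desired output can be reproduced in plain Python without Spark. This is only a sketch: it assumes the score is one minus the ratio of distinct to total elements, with the single-value case pinned to 1.0 and the all-distinct case pinned to 0.0, which is the rule the UDF further down encodes. The helper name `expected_score` is hypothetical.

```python
def expected_score(values):
    """Return the agreement score of a list under the assumed rule."""
    distinct = len(set(values))
    total = len(values)
    if distinct == 1:        # every element is the same value
        return 1.0
    if distinct == total:    # no two elements match
        return 0.0
    return 1.0 - distinct / total


# The array columns of the single sample row shown above.
row = {
    "3-Section": ["H", "B", "B"],
    "4-Class": ["02", "63", "65"],
    "5-SubClass": ["J", "H", "D"],
}

scores = {name: round(expected_score(vals), 2) for name, vals in row.items()}
print(scores)
```

Under this rule the Section column scores 0.33 and the other two score 0.0, matching the table above.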
Here is my UDF:
from pyspark.sql.functions import udf

def agreement(col):
    unique_items = set(list(col))
    unique_items_count = len(unique_items)
    if unique_items_count == 1:
        agreement = 1.0
    elif unique_items_count == len(list(col)):
        agreement = 0.0
    else:
        agreement = 1.0 - (unique_items_count / len(list(col)))
    return agreement

agreement_udf = udf(agreement)
df = df.withColumn('Section_Score', agreement_udf('3-Section'))
df.show()
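However, when I run this code I get hit with the following error messages (the first message is for my entire 6M-row DataFrame, the last one is for the sample DataFrame above):
AttributeError: 'NoneType' object has no attribute '_jvm'
or
TypeError: 'NoneType' object is not iterable
or
TypeError: object of type 'type' has no len()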
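I'm confused, because each column contains an array of strings, which should be iterable. Any suggestions?

Please state explicitly the agreement score for the single row shown in your example (preferably adding a few more rows along with the expected result); also, you haven't shown the definition of agreement_udf.

Good point! Both were added in a recent edit.

It's still unclear how the respective scores shown are obtained from Section, Class, and SubClass.

The "agreement score" for each field is based on how much the elements of the respective array "agree" with each other. For example, the "Section Score" is based on the "Section" array ['H', 'B', 'B']. Per the UDF, score = (1 - (count of unique values / count of all array values)) = 1 - 2/3 = 0.33.

The Class array has 3 distinct values: ['02', '63', '65']. None of the three values are the same (none agree), so the score is 0.0 -> elif unique_items_count == len(list(col)): agreement = 0.0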
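For what it's worth, `TypeError: 'NoneType' object is not iterable` is exactly what `set(list(None))` raises, so null arrays somewhere in the 6M rows are one plausible cause. That is an assumption, not something confirmed in this thread, and it does not explain the other two messages. Below is a minimal sketch of a null-safe variant that also declares a `DoubleType` return type (without one, `udf` defaults to a string result). The names `agreement_score` and `with_score` are hypothetical.

```python
def agreement_score(values):
    """Agreement score for one array cell; None for a null or empty array."""
    if not values:  # covers both None (SQL NULL) and an empty list
        return None
    distinct = len(set(values))
    total = len(values)
    if distinct == 1:        # all elements agree
        return 1.0
    if distinct == total:    # no elements agree
        return 0.0
    return 1.0 - distinct / total


def with_score(df, source_col, target_col):
    """Return df with target_col holding the score of the array in source_col."""
    # Imported here so agreement_score stays testable without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Declare the return type explicitly so the new column is a double.
    score_udf = udf(agreement_score, DoubleType())
    return df.withColumn(target_col, score_udf(source_col))
```

With the sample DataFrame above, usage would look like `df = with_score(df, '3-Section', 'Section-Score')`, repeated for the Class and SubClass columns. Returning `None` for null or empty arrays is a design choice made for this sketch; a default such as 0.0 may suit the data better.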