
Python: how to create a PySpark UDF that iterates over an array column


Beginner PySpark question here: how do I create a UDF that iterates over an array of strings in a column?

I have a DataFrame of roughly 6 million rows, in which I have extracted elements into separate columns. Here is a sample:

from pyspark.sql.types import *
d = [{'1-ID': 'Alice', '2-Full_Classification': ["H02J 1/08 20060101AFI20151217BHEP", "B63H 21/17 20060101ALI20151217BHEP", 'B65D 39/12 20060101A I20051008RMEP'], '3-Section': ['H', 'B', 'B'],  '4-Class': ['02', '63', '65'], '5-SubClass': ['J', 'H', 'D']}]
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Full_Classification", ArrayType(StringType()), True),
    StructField("Section", ArrayType(StringType()), True),
    StructField("Class", ArrayType(StringType()), True),
    StructField("SubClass", ArrayType(StringType()), True)
    ])
df = spark.createDataFrame(d)
df.printSchema()
df.show(truncate=100)
Output:

root
 |-- 1-ID: string (nullable = true)
 |-- 2-Full_Classification: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 3-Section: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 4-Class: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 5-SubClass: array (nullable = true)
 |    |-- element: string (containsNull = true)
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
| 1-ID|                                                                               2-Full_Classification|3-Section|     4-Class|5-SubClass|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
|Alice|[H02J 1/08 20060101AFI20151217BHEP, B63H 21/17 20060101ALI20151217BHEP, B65D 39/12 20060101A I200...|[H, B, B]|[02, 63, 65]| [J, H, D]|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
I want to create a udf that calculates an "agreement" score for each extracted list/array, like so:

List1 = ['A', 'A', 'A', 'A'] #agreement score would be 100%
List2 = ['12', '12', '12', '13', '13', '13'] #agreement score would be 50%
List3 = ['C', 'D', 'E'] #agreement score would be 0%
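As a side note for readers, the scoring rule implied by the udf further below (1 minus the ratio of unique values to total values, with the all-same and all-distinct special cases) can be sketched as a plain Python function; `agreement_score` is a hypothetical name, not part of the question:

```python
def agreement_score(items):
    """Hypothetical sketch of the scoring rule: 1 - (unique count / total count)."""
    unique_count = len(set(items))
    total = len(items)
    if unique_count == 1:
        return 1.0   # all elements agree
    if unique_count == total:
        return 0.0   # all elements distinct, no agreement
    return 1.0 - unique_count / total

print(agreement_score(['A', 'A', 'A', 'A']))       # 1.0
print(agreement_score(['C', 'D', 'E']))            # 0.0
print(round(agreement_score(['H', 'B', 'B']), 2))  # 0.33
```

Note that under this exact formula, List2 would score 1 - 2/6 ≈ 0.67 rather than the 50% stated above, so the formula and the example scores do not fully agree.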
Edit: the desired output in this case:

+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
| 1-ID|2-Full_Classification|3-Section|     4-Class|5-SubClass|Class-Score|Section-Score|SubClass-Score|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
|Alice| [H02J 1/08 200601...|[H, B, B]|[02, 63, 65]| [J, H, D]|        0.0|         0.33|           0.0|
+-----+---------------------+---------+------------+----------+-----------+-------------+--------------+
Here is my udf:

from pyspark.sql.functions import udf  # import added; presumably present in the original session

def agreement(col):
    unique_items = set(list(col))
    unique_items_count = len(unique_items)
    if unique_items_count == 1:
        agreement = 1.0
    elif unique_items_count == len(list(col)):
        agreement = 0.0
    else:
        agreement = 1.0 - (unique_items_count / len(list(col)))  # fixed unbalanced parenthesis
    return agreement

agreement_udf = udf(agreement)
df = df.withColumn('Section_Score', agreement_udf('3-Section'))
df.show()
However, when I execute this code I get hit with the following error messages (the first one was for my full 6M-row DF, the last ones for the sample DF above):

AttributeError: 'NoneType' object has no attribute '_jvm'

TypeError: 'NoneType' object is not iterable

TypeError: object of type 'type' has no len()


I'm confused, because each column contains an array of strings, and those should be iterable. Any suggestions?

Please specify the agreement score for the single row shown in your example (ideally adding a few more rows with expected results); also, you have not shown the definition of `agreement_udf`.

Good point! Both were added in a recent edit. It is still unclear how you get the respective scores shown from `Section` and `SubClass`.

The "agreement score" for each field is based on how much the elements of the respective array "agree" with each other. For example, "Section-Score" is based on the "Section" array ['H', 'B', 'B']. Per the udf, score = 1 - (unique value count / total value count) = 1 - 2/3 = 0.33. The "Class" array has 3 distinct values - ['02', '63', '65']. None of the three values are the same (none agree), so the score is 0.0 -> `elif unique_items_count == len(list(col)): agreement = 0.0`
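For readers who hit the same errors: the `'NoneType' object has no attribute '_jvm'` message typically indicates that `udf` was invoked without an active SparkSession, and plain `udf(...)` with no return type defaults to string output. A hedged sketch of a corrected version follows; the scoring function is plain Python, while the Spark registration (which assumes a live `SparkSession` and the `df` from above) is shown in comments:

```python
def agreement(col):
    """Corrected scoring function: 1 - (unique count / total count)."""
    if col is None:                 # guard: a null array column arrives as None in a udf
        return None
    unique_items_count = len(set(col))
    total = len(col)
    if unique_items_count == 1:
        return 1.0                  # all elements agree
    if unique_items_count == total:
        return 0.0                  # all elements distinct
    return 1.0 - unique_items_count / total

# Spark wiring (requires a live SparkSession and the DataFrame `df` defined above):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# agreement_udf = udf(agreement, DoubleType())  # explicit return type instead of the string default
# df = df.withColumn('Section_Score', agreement_udf('3-Section'))
```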