
Apache Spark: passing a list as a parameter to a UDF in PySpark

Tags: apache-spark, pyspark, apache-spark-sql

My JSON schema looks like this:

 {
  "uid": "a7f2e98835c1fb67e9aa9f1fbaae5e98", 
  "gender": "F", 
  "click": [
    {
      "url": "htp://abc.com/1.html?utm_campaign=397"
    },
    {
      "url": "htp://qaz.com/1.html?utm_campaign=397"
    }
  ]
}
I have a UDF that cleans a visit URL; for example, my_udf("htp://abc.com/1.html?utm_campaign=397") returns abc.com.

I want to get a DataFrame with the cleaned URLs:

uid                              gender    urls
a7f2e98835c1fb67e9aa9f1fbaae5e98 F         [abc.com,qaz.com]
My code:

from pyspark.sql import functions as F
from pyspark.sql.types import *

import re
from urllib.parse import urlparse
from urllib.request import urlretrieve, unquote

clean = F.udf(lambda z: my_udf(z), ArrayType(StringType()))

def my_udf(url):
    url = re.sub('(http(s)*://)+', 'http://', url)
    parsed_url = urlparse(unquote(url.strip()))
    if parsed_url.scheme not in ['http','https']: return None
    netloc = re.search("(?:www\.)?(.*)", parsed_url.netloc).group(1)
    if netloc is not None: return str(netloc.encode('utf8')).strip()
    return None

dataFrame = spark.read.json('1.json') \
.withColumn("urls", clean(F.col("click.url"))) \
.select( F.col("uid"), F.col("gender"), F.col("urls") ) \
.show(3)
But I get this error:

TypeError: expected string or bytes-like object
What am I doing wrong?
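The cause, inferred from the fix below rather than stated outright in the thread: click.url selects an array column, so my_udf receives a whole Python list instead of a string, and re.sub inside it raises the TypeError. The same error reproduces in plain Python:

import re

# re.sub expects a string as its third argument; a list triggers
# the same TypeError that surfaces in the Spark stack trace
re.sub('(http(s)*://)+', 'http://', ['htp://abc.com/1.html?utm_campaign=397'])
# TypeError: expected string or bytes-like object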

I did this:

clean_str = F.udf(my_udf, StringType())  # my_udf returns a single string, so StringType, not ArrayType

dataFrame = spark.read.json('1.json') \
    .withColumn("urls_exploded", F.explode(F.col("click.url"))) \
    .withColumn("urls_cleaned", clean_str(F.col("urls_exploded"))) \
    .groupBy(F.col("uid"), F.col("gender")) \
    .agg(F.collect_set(F.col("urls_cleaned")).alias("urls")) \
    .select(F.col("uid"), F.col("gender"), F.col("urls")) \
    .show(1, truncate=False)
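An alternative sketch (not from the original thread) that matches the title more literally: pass the whole list to the UDF and map my_udf over it inside, so the ArrayType(StringType()) return type from the first attempt becomes correct and no explode/groupBy is needed:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# clean_list is a hypothetical name: the UDF receives the whole
# array<string> from click.url and cleans every element with my_udf
clean_list = F.udf(
    lambda urls: [h for h in (my_udf(u) for u in urls) if h is not None]
                 if urls is not None else None,
    ArrayType(StringType())
)

spark.read.json('1.json') \
    .withColumn("urls", clean_list(F.col("click.url"))) \
    .select("uid", "gender", "urls") \
    .show(3, truncate=False)

Unlike collect_set, this keeps duplicates and element order; wrap the column in F.array_distinct (Spark 2.4+) if unique hosts are required.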

There is a problem with your UDF definition: you don't need the lambda. Can you also show the source code of my_udf? Pass the function directly, as in F.udf(my_udf, ArrayType(StringType())).
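For concreteness, a sketch of the comment's suggestion. Note the comment predates seeing my_udf's source; since my_udf returns one string per URL, StringType() is the matching return type when the UDF is applied per element:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Pass the function itself; no lambda wrapper is needed
clean = F.udf(my_udf, StringType())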