Scala PySpark hangs when trying to make URL requests in parallel

Tags: scala, apache-spark, pyspark, apache-spark-sql, rdd

I have an RDD of URLs and I want to make the URL requests in parallel. I am running get_items with an RDD of URL strings. PySpark hangs when I try to collect() the items, take() the items, etc.

import json
import requests
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def format_json(response, slug):
    # Render the parsed dict as a string, drop the trailing '}', and
    # splice in the slug before re-closing the object
    clean_response = str(response)[:-1]
    clean_response += ", 'slug': '" + slug + "'}\n" # ', \"store-slug\": \"' + slug +'\"}'
    # Map Python literals onto their JSON spellings
    clean_response = clean_response.replace("None", "null").replace("True", "true").replace("False", "false")
    return clean_response

def get_all(url, slug):
    # json.loads expects a string, so parse response.text rather than
    # the requests.Response object itself
    response = requests.get(url)
    response = json.loads(response.text).get('data', {})
    clean_response = format_json(response, slug)
    return clean_response

def get_items(rdd_of_urls):
    # Wrap get_all in a UDF and map it over the RDD of URL strings
    get_items_udf = udf(lambda x: get_all(x), StringType())
    items = rdd_of_urls.map(get_items_udf)
    return items
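
(For context: pyspark.sql.functions.udf builds a column expression for the DataFrame API, and calling the wrapped function returns a Column rather than a plain value, so it is normally applied inside withColumn/select rather than passed to rdd.map. A minimal sketch of the usual usage, where spark, the url column, and the placeholder transform are illustrative assumptions:)

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("http://example.com/a",)], ["url"])

# Placeholder transform standing in for get_all
get_all_udf = udf(lambda u: u.upper(), StringType())
df_items = df.withColumn("item", get_all_udf(col("url")))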
I see many GC allocation failures, but I don't know how to interpret them:

2019-08-21T18:40:04.666+0000: [GC (Allocation Failure) 2019-08-21T18:40:04.666+0000: [ParNew: 34369K->1837K(36864K), 0.0029907 secs] 127948K->95416K(186848K), 0.0030414 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-08-21T18:40:04.951+0000: [GC (Allocation Failure) 2019-08-21T18:40:04.951+0000: [ParNew: 34605K->2576K(36864K), 0.0036551 secs] 128184K->96155K(186848K), 0.0037126 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
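
(Side note, not from the original post: lines of this shape are ordinary minor-GC records and typically appear when the JVM runs with GC logging enabled. A hedged sketch of turning that logging on for executors from PySpark, with the flag values as assumptions:)

from pyspark.sql import SparkSession

# GC log lines like the above appear when executor JVMs are started
# with GC logging flags, e.g.:
spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions",
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
         .getOrCreate())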

Check this: instead of map, try using mapPartitions in Spark.
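
A minimal sketch of that suggestion, assuming the RDD holds (url, slug) pairs; fetch_partition and the pair layout are illustrative assumptions rather than part of the original answer:

import json
import requests

def fetch_partition(pairs):
    # One Session per partition reuses TCP connections across all
    # URLs handled by this task, instead of one connection per record
    session = requests.Session()
    for url, slug in pairs:
        data = session.get(url).json().get('data', {})
        data['slug'] = slug
        # json.dumps already emits null/true/false, so no manual
        # string replacement is needed
        yield json.dumps(data)

items = rdd_of_urls.mapPartitions(fetch_partition)  # plain Python function, no udf
print(items.take(5))

Because mapPartitions takes a plain Python generator, the udf wrapper (a DataFrame-API construct) is no longer involved, and each partition's URLs are fetched inside a single task.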