Python: computing skewness with spark.sql and Cloudant


I have a problem with the following code:

def skewTemperature(cloudantdata, spark):
    # Skewness: (1/n) * sum((x - mean)^3) / sd^3
    return spark.sql(
        """SELECT (1/count(temperature)) * (sum(POW(temperature-%s,3))/pow(%s,3)) as skew from washing"""
        % (meanTemperature(cloudantdata, spark), sdTemperature(cloudantdata, spark))
    ).first().skew
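For context, a minimal sketch of what the two helpers referenced here might look like, assuming simple aggregates over the same washing table (hypothetical; the actual implementations may differ):

def meanTemperature(cloudantdata, spark):
    # Hypothetical helper: mean of the temperature column
    return spark.sql("SELECT AVG(temperature) AS mean FROM washing").first().mean

def sdTemperature(cloudantdata, spark):
    # Hypothetical helper: standard deviation of the temperature column
    return spark.sql("SELECT STDDEV(temperature) AS sd FROM washing").first().sd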
Both meanTemperature and sdTemperature work fine on their own, but with the query above I get the following error:

Py4JJavaError: An error occurred while calling o2849.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 315.0 failed 10 times, most recent failure: Lost task 3.9 in stage 315.0 (TID 1532, yp-spark-dal09-env5-0045): java.lang.RuntimeException: Database washing request error: {"error":"too_many_requests","reason":"You've exceeded your current limit of 5 requests per second for query class. Please try later.","class":"query","rate":5

Does anyone know how to resolve this?

The error indicates that you are exceeding the Cloudant API call threshold for the query class, which appears to be 5 requests per second on the service plan you are using. One possible fix is to limit the number of partitions by defining the jsonstore.rdd.partitions configuration property, as shown in the following Spark 2 example:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Cloudant Spark SQL Example in Python using dataframes")\
    .config("cloudant.host", "ACCOUNT.cloudant.com")\
    .config("cloudant.username", "USERNAME")\
    .config("cloudant.password", "PASSWORD")\
    .config("jsonstore.rdd.partitions", 5)\
    .getOrCreate()

Start with 5 and work your way down to 1 if the error persists. This setting essentially limits how many concurrent requests are sent to Cloudant. If a setting of 1 still does not resolve the problem, you may have to consider upgrading to a service plan with a larger threshold.
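A complementary mitigation (a sketch, not part of the original answer): because skewTemperature, meanTemperature, and sdTemperature each issue their own query, the washing table is read from Cloudant several times per call. Caching the DataFrame after the first read and letting Spark's built-in skewness aggregate do the whole computation in one pass reduces the number of requests; the sketch below assumes the same washing table and temperature column:

def skewTemperatureCached(cloudantdata, spark):
    # Cache the Cloudant-backed DataFrame so repeated queries are served
    # from memory instead of issuing new requests against Cloudant.
    cloudantdata.cache()
    cloudantdata.createOrReplaceTempView("washing")
    # Spark SQL ships a skewness aggregate, so mean, stddev, and skew no
    # longer need separate queries (its normalization may differ slightly
    # from the hand-rolled formula above).
    return spark.sql("SELECT skewness(temperature) AS skew FROM washing").first().skew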
