
PySpark randomSplit stack overflow error


Here, "data" is a PySpark DataFrame that contains far more negative samples than positive ones. I am trying to downsample the negative samples to (for example) 20%, but the randomSplit step keeps throwing java.lang.StackOverflowError. Below is my sample code:

def downsample(data, percent=0.2):
    datap = data.filter(data.label == 1)  # positive samples
    datan = data.filter(data.label == 0)  # negative samples
    (data1, _) = datan.randomSplit([percent, 1 - percent])  # randomly split the negatives
    ndata = datap.unionAll(data1)  # new training dataset
    return ndata
The datan.randomSplit call is what throws the StackOverflowError. The dataset is about 2 GB. I am using an Amazon EMR cluster with 5 nodes (1 master, 4 core nodes), all c4.xlarge. I believe there is enough memory, because I can train on the full data without downsampling; the error only appears when I try to downsample. How do I fix this? Thanks!
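
I am not sure of the root cause, but one variant that may be worth trying is to replace randomSplit with DataFrame.sample, which draws an approximate fraction of the negative rows directly instead of computing a two-way split. The sketch below assumes the same `data` DataFrame with a `label` column (1 = positive, 0 = negative); downsample_with_sample is a hypothetical helper, not part of the original code:

def downsample_with_sample(data, percent=0.2, seed=42):
    datap = data.filter(data.label == 1)  # keep every positive sample
    datan = data.filter(data.label == 0).sample(
        withReplacement=False, fraction=percent, seed=seed)  # keep roughly `percent` of the negatives
    return datap.unionAll(datan)  # rebalanced training set

It would be called the same way as downsample, e.g. ndata = downsample_with_sample(data, percent=0.2). Note that sample returns an approximate fraction rather than an exact split, which is usually acceptable for downsampling.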