How to print iteration values using a PySpark for loop


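I am trying to print the thresholds for the values of a dataframe using PySpark. Below is the R code I wrote, but I want to do this in PySpark and I don't know how. Any help would be greatly appreciated.

The values dataframe looks like this: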

vote
0.3
0.1
0.23
0.45
0.9
0.80
0.36
This is what I tried in PySpark, but I am stuck here:

for row in values.collect():
    print('iterations left:', row - i, 'Threshold:', ...)

Every language or tool handles this differently. Below is an answer along the lines of what you tried:

df = sqlContext.createDataFrame([
[0.3],
[0.1],
[0.23],
[0.45],
[0.9],
[0.80],
[0.36]
], ["vote"])

values = df.collect()
total_values = len(values)

# collect() does not return rows in sorted order, so sorted() is used here to order them by the vote column, ascending.
# To avoid sorting in Python, sort in Spark instead: values = df.sort("vote", ascending=True).collect()
# enumerate() supplies the index of each row.

for index, row in enumerate(sorted(values, key=lambda x: x.vote, reverse=False)):
    print('iterations left:', total_values - (index + 1), "Threshold:", row.vote)

iterations left: 6 Threshold: 0.1
iterations left: 5 Threshold: 0.23
iterations left: 4 Threshold: 0.3
iterations left: 3 Threshold: 0.36
iterations left: 2 Threshold: 0.45
iterations left: 1 Threshold: 0.8
iterations left: 0 Threshold: 0.9
Using collect() is discouraged: it brings every row back to the driver, so with big data it can break your program.


Thanks Rakesh, it worked. But my dataframe is large; if collect() is not recommended, how should I handle it?
@Tilo, that depends on what you want to get out of it.
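If the goal is still to walk through the sorted values one at a time, one possible approach is to let Spark do the sorting and counting, then stream rows to the driver with toLocalIterator() instead of collect(). The sketch below is only an illustration, not the answerer's method: the helper name iter_thresholds is hypothetical, and it assumes a dataframe with a numeric vote column like the one above.

```python
def iter_thresholds(df, column="vote"):
    """Yield (iterations_left, threshold) pairs in ascending order of `column`.

    `df` is assumed to be a PySpark DataFrame; only count(), sort() and
    toLocalIterator() are called on it.
    """
    total = df.count()         # counted by Spark, only an int reaches the driver
    ordered = df.sort(column)  # sorted by Spark, ascending by default
    # toLocalIterator() fetches rows one partition at a time, so the driver
    # never has to hold the whole dataframe in memory at once.
    for index, row in enumerate(ordered.toLocalIterator()):
        yield total - (index + 1), getattr(row, column)


# Usage with the df created above (hypothetical, left commented out here):
# for left, threshold in iter_thresholds(df):
#     print('iterations left:', left, "Threshold:", threshold)
```

Note that this still runs a count job and a full sort on the cluster, and the driver still sees every row eventually, just not all at once. If the per-row work can be expressed as column operations, keeping it inside Spark is likely to scale better than any driver-side loop.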