Python Spark: calculating the average of tuple values by key
I'm trying to find the average word length per paragraph. The data is read from a text file in the format 1|For more than five years..., where each line starts with a paragraph number.
This is my code so far:
from pyspark import SparkContext, SparkConf
sc = SparkContext('local', 'longest')
text = sc.textFile("walden.txt")
lines = text.map(lambda line: (line.split("|")[0],line))
lines = lines.filter(lambda kv: len(kv[1]) > 0)
words = lines.mapValues(lambda x: x.replace("1|","").replace("2|","").replace("3|",""))
words = words.mapValues(lambda x: x.split())
words = words.mapValues(lambda x: [(len(i),1) for i in x])
words = words.reduceByKey(lambda a,b: a+b)
words.saveAsTextFile("results")
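The current output follows this format:
('1', [(2,1),(6,1),(1,1), ...]), ('2', [(2,1),(6,1),(1,1), ...]), ('3', [(2,1),(6,1),(1,1), ...])
Here '1'/'2'/'3' are the paragraph IDs, and the tuples follow the format (word length, 1).
What I need to do is sum the tuple values by key (paragraph ID), so that (2,1), (6,1), (1,1) becomes (9,3), and then divide those values (9/3) to get the average word length in each paragraph.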
I've tried many different approaches, but none of them worked. Any help would be greatly appreciated.
For your RDD case, try this:
text = sc.textFile("test.txt")
lines = text.map(lambda line: (line.split("|")[0],line))
lines = lines.filter(lambda kv: len(kv[1]) > 0)
words = lines.mapValues(lambda x: x.replace("1|","").replace("2|","").replace("3|",""))
words = words.mapValues(lambda x: x.split())
words = words.mapValues(lambda x: [len(i) for i in x])
words = words.mapValues(lambda x: sum(x) / len(x))
words.collect()
[('1', 4.0), ('2', 5.4), ('3', 7.0)]
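The snippet above averages each line on its own, which is enough when every paragraph ID appears on exactly one line. If several lines could share an ID, the (sum, count) idea from the question may be a better fit. Below is a minimal sketch of that idea in plain Python, so it runs without Spark. The helper names `word_pairs` and `add_pairs` are made up for this illustration, and the PySpark wiring in the trailing comment is an assumption about how they would plug into `flatMapValues` and `reduceByKey`, not something taken from the thread.

```python
from functools import reduce

def word_pairs(text):
    """Turn a paragraph's text into one (word_length, 1) pair per word."""
    return [(len(word), 1) for word in text.split()]

def add_pairs(a, b):
    """Add two (length_sum, word_count) pairs element by element."""
    return (a[0] + b[0], a[1] + b[1])

# Local stand-in for the values that belong to one paragraph ID.
pairs = word_pairs("For more than five years")
total, count = reduce(add_pairs, pairs)
average = total / count

# Assumed PySpark wiring (not executed here), where `lines` is an RDD of
# (paragraph_id, text) pairs with the "id|" prefix already removed:
#   sums = lines.flatMapValues(word_pairs).reduceByKey(add_pairs)
#   averages = sums.mapValues(lambda p: p[0] / p[1])
```

Unlike `a + b` on two lists, which only concatenates them, `add_pairs` keeps a running total, so each key ends up with a single (sum, count) pair that can be divided directly.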
Using DataFrames, I got this:
import pyspark.sql.functions as f
df = spark.read.option("inferSchema","true").option("sep","|").csv("test.txt").toDF("col1", "col2")
df.show(10, False)
+----+---------------------------------------+
|col1|col2                                   |
+----+---------------------------------------+
|1   |For more than five years               |
|2   |For moasdre than five asdfyears        |
|3   |Fasdfor more thasdfan fidafve yearasdfs|
+----+---------------------------------------+
df.withColumn('array', f.split('col2', r'[ ][ ]*')) \
.withColumn('count_arr', f.expr('transform(array, x -> LENGTH(x))')) \
.withColumn('sum', f.expr('aggregate(array, 0, (sum, x) -> sum + LENGTH(x))')) \
.withColumn('size', f.size('array')) \
.withColumn('avg', f.col('sum') / f.col('size')) \
.show(10, False)
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
|col1|col2                                   |array                                        |count_arr      |sum|size|avg|
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
|1   |For more than five years               |[For, more, than, five, years]               |[3, 4, 4, 4, 5]|20 |5   |4.0|
|2   |For moasdre than five asdfyears        |[For, moasdre, than, five, asdfyears]        |[3, 7, 4, 4, 9]|27 |5   |5.4|
|3   |Fasdfor more thasdfan fidafve yearasdfs|[Fasdfor, more, thasdfan, fidafve, yearasdfs]|[7, 4, 8, 7, 9]|35 |5   |7.0|
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
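I know this is really a different approach, but it may be helpful.
Hey Lamanus, thank you very much. I believe I'm only allowed to use RDDs for this task. Could you possibly help with an RDD solution?
I've added a bit for your case.
I get an "unsupported operand type(s)" error for 'int' and 'tuple'.
Hmm, I get exactly the same result as with my other approach, and it works. Please check that your function is the same as mine.
Ah, as you pointed out, I found my mistake at [len(i) for i in x]. This gives the desired result. Thank you very much.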
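The error in the comments is consistent with calling `sum` on a list of tuples: `sum` starts from the integer 0 and tries to add the first tuple to it. Here is a small plain-Python reproduction, assuming the asker's list still held `(len(i), 1)` pairs when the averaging step ran (the thread does not show the exact code that failed):

```python
pairs = [(2, 1), (6, 1), (1, 1)]   # what [(len(i), 1) for i in x] produces

try:
    sum(pairs) / len(pairs)        # sum() starts at 0, and 0 + (2, 1) fails
    message = None
except TypeError as err:
    message = str(err)             # mentions both 'int' and 'tuple'

lengths = [n for n, _ in pairs]    # what [len(i) for i in x] produces
average = sum(lengths) / len(lengths)
```

With plain integer lengths, `sum` and `len` work as expected, which matches the fix the asker reports.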