PySpark aggregation with groupBy is very slow compared to Scala
I ported some Scala code to Python to do a simple aggregation:
from time import time
from utils import notHeader, parse, pprint
from pyspark import SparkContext
start = time()
src = "linkage"
sc = SparkContext("spark://aiur.local:7077", "linkage - Python")
rawRdd = sc.textFile(src)
noheader = rawRdd.filter(notHeader)
parsed = noheader.map(parse)
grouped = parsed.groupBy(lambda md: md.matched)
res = grouped.mapValues(lambda vals: len(vals)).collect()
for x in res: pprint(x)
diff = time() - start
mins, secs = diff / 60, diff % 60
print "Analysis took {} mins and {} secs".format(int(mins), int(secs))
sc.stop()
utils.py:
from collections import namedtuple

def isHeader(line):
    return line.find("id_1") >= 0

def notHeader(line):
    return not isHeader(line)

def pprint(s):
    print s

MatchedData = namedtuple("MatchedData", "id_1 id_2 scores matched")

def parse(line):
    pieces = line.split(",")
    return MatchedData(pieces[0], pieces[1], pieces[2:11], pieces[11])
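For reference, a plain-Python sketch of what notHeader/parse plus groupBy/mapValues(len) actually compute, run on a few made-up CSV lines (the sample values are invented, not from the real linkage dataset):

```python
from collections import Counter, namedtuple

MatchedData = namedtuple("MatchedData", "id_1 id_2 scores matched")

def notHeader(line):
    return "id_1" not in line

def parse(line):
    pieces = line.split(",")
    return MatchedData(pieces[0], pieces[1], pieces[2:11], pieces[11])

# Made-up sample: a header plus three records with a match flag in column 11.
lines = [
    "id_1,id_2,c1,c2,c3,c4,c5,c6,c7,c8,c9,is_match",
    "1,2,1,1,1,1,1,1,1,1,1,true",
    "3,4,0,0,0,0,0,0,0,0,0,false",
    "5,6,1,1,1,1,1,1,1,1,1,true",
]

parsed = [parse(line) for line in lines if notHeader(line)]
# groupBy(md.matched) followed by mapValues(len) is just a frequency count:
counts = Counter(md.matched for md in parsed)
print(counts)  # Counter({'true': 2, 'false': 1})
```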
And the Scala version:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val start: Long = System.currentTimeMillis / 1000
    val filePath = "linkage"
    val conf = new SparkConf()
      .setAppName("linkage - Scala")
      .setMaster("spark://aiur.local:7077")
    val sc = new SparkContext(conf)
    val rawblocks = sc.textFile(filePath)
    val noheader = rawblocks.filter(x => !isHeader(x))
    val parsed = noheader.map(line => parse(line))
    val grouped = parsed.groupBy(md => md.matched)
    grouped.mapValues(x => x.size).collect().foreach(println)
    val diff = System.currentTimeMillis / 1000 - start
    val (mins, secs) = (diff / 60, diff % 60)
    printf("Analysis took %d mins and %d secs%n", mins, secs)
    sc.stop()
  }

  def isHeader(line: String): Boolean = {
    line.contains("id_1")
  }

  def toDouble(s: String): Double = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }

  case class MatchData(id1: Int, id2: Int,
                       scores: Array[Double], matched: Boolean)

  def parse(line: String) = {
    val pieces = line.split(",")
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)
    val matched = pieces(11).toBoolean
    MatchData(id1, id2, scores, matched)
  }
}
The Scala version completes in 26 seconds, but the Python version takes about 6 minutes. The logs show a big difference in how long the respective collect() calls take.
Python:
17/01/25 16:22:10 INFO DAGScheduler: ResultStage 1 (collect at /Users/esamson/Hackspace/Spark/run_py/dcg.py:12) finished in 234.860 s
17/01/25 16:22:10 INFO DAGScheduler: Job 0 finished: collect at /Users/esamson/Hackspace/Spark/run_py/dcg.py:12, took 346.675760 s
Scala:
17/01/25 16:26:23 INFO DAGScheduler: ResultStage 1 (collect at Spark.scala:17) finished in 9.619 s
17/01/25 16:26:23 INFO DAGScheduler: Job 0 finished: collect at Spark.scala:17, took 22.022075 s
groupBy seems to be the only significant call here. So, is there any way to improve the performance of the Python code?

A comment on the question asked: is there a reason not to use the DataFrame API and its CSV reader? Performance would likely rise to Scala levels, if not better. (Reply: @maasg will definitely try that.)

The answer: you are using RDDs, so every transformation you apply to them (e.g. groupBy, map) takes a function. When you pass those functions in Scala, they simply run. When you do the same in Python, Spark has to serialize the functions and start a Python VM on each executor; then, every time a function has to run, it converts the Scala-side data to Python, hands it to the Python VM, and then passes back and converts the result. All of these conversions are a lot of work, which is why RDD jobs in PySpark are usually much slower than their Scala equivalents.

One possible way around this is to use the DataFrame API, which lets you use built-in functions (in pyspark.sql.functions) that execute Scala code under the hood. Assuming matched and vals are column names, it would look something like this (for Spark 2.0):
from pyspark.sql import SparkSession
from pyspark.sql.functions import size

src = "linkage"
spark = SparkSession.builder.master("spark://aiur.local:7077").appName("linkage - Python").getOrCreate()
df = spark.read.option("header", "true").csv(src)
res = df.groupby("matched").agg(size(df.vals)).collect()
...
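A further option, not mentioned in the answer above but a common pattern for this shape of job: since groupBy followed by len only needs a count per key, something like `parsed.map(lambda md: (md.matched, 1)).reduceByKey(operator.add)` (or `countByValue`) avoids shuffling the full groups across the network, even when staying on RDDs. A plain-Python sketch of the reduce-by-key idea, with a hypothetical local stand-in for RDD.reduceByKey:

```python
import operator

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: combine the values of each key with func."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

# The job only needs counts per key, so emit (key, 1) and sum the ones,
# instead of materializing the full per-key groups the way groupBy does.
records = ["true", "false", "true", "true"]
pairs = [(m, 1) for m in records]
print(reduce_by_key(pairs, operator.add))  # {'true': 3, 'false': 1}
```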