Scala if loop over a Spark DataFrame takes too long
I have a Spark DataFrame (df) that I have to convert into one row per sentence: basically, whenever a period (".") is found, it should detect a new sentence and start another row. I have written the code below for this, but it takes very long to run:
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").config("spark.scheduler.mode", "FAIR").getOrCreate()
import spark.implicits._

val count = df.count.toInt
var emptyDF = Seq.empty[(Int, Int, String)].toDF("start_time", "end_time", "Sentences")
var b = 0
for (a <- 1 to count) {
  // head(a) launches a full Spark job on every iteration just to look at row a-1
  if (df.select("words").head(a)(a - 1).toSeq.head == "." || a == (count - 1)) {
    // take the first a words and keep only those after the previous sentence break
    val myList = df.select("words").head(a).toArray.map(_.getString(0)).splitAt(b)._2
    val text = myList.mkString(" ")
    val end_time = df.select("end_time").head(a)(a - 1).toSeq.head.toString.toInt
    val start_time = df.select("start_time").head(a)(b).toSeq.head.toString.toInt
    // build a one-row DataFrame for this sentence and append it to the result
    val df1 = spark.sparkContext.parallelize(Seq(start_time)).toDF("start_time")
    val df2 = spark.sparkContext.parallelize(Seq(end_time)).toDF("end_time")
    val df3 = spark.sparkContext.parallelize(Seq(text)).toDF("Sentences")
    val df4 = df1.crossJoin(df2).crossJoin(df3)
    emptyDF = emptyDF.union(df4).toDF
    b = a
  }
}
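The loop above is slow because every call to head(a) launches a separate Spark job that scans a rows, so the total work grows roughly quadratically with the row count, and each union keeps growing the logical plan. A minimal driver-side sketch (my own, not from the question; it assumes the data is small enough to collect once) that does the same splitting in a single pass:

import spark.implicits._

// pull the words down to the driver once, instead of once per iteration
val rows = df.select("start_time", "end_time", "words").as[(Int, Int, String)].collect()

val sentences = scala.collection.mutable.ArrayBuffer.empty[(Int, Int, String)]
var current = Vector.empty[(Int, Int, String)]
for (r <- rows) {
  current = current :+ r
  if (r._3 == ".") { // end of sentence: emit (first start, last end, joined words)
    sentences += ((current.head._1, current.last._2, current.map(_._3).mkString(" ")))
    current = Vector.empty
  }
}
// flush a trailing sentence that has no closing "."
if (current.nonEmpty)
  sentences += ((current.head._1, current.last._2, current.map(_._3).mkString(" ")))

val result = sentences.toSeq.toDF("start_time", "end_time", "Sentences")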
Here is my attempt. You can use a window that separates the sentences by counting the "." occurrences in the rows that follow:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// every row of a sentence sees the same number of "." tokens between itself
// and the end of the data, so that count can serve as a sentence key
val w = Window.orderBy("start_time").rowsBetween(Window.currentRow, Window.unboundedFollowing)
val df = Seq((132, 135, "Hi"),
(135, 135, ","),
(143, 152, "I"),
(151, 152, "am"),
(159, 169, "working"),
(194, 197, "on"),
(204, 211, "hadoop"),
(211, 211, "."),
(218, 212, "This"),
(226, 229, "is"),
(234, 239, "Spark"),
(245, 249, "DF"),
(253, 258, "coding"),
(258, 258, "."),
(276, 276, "I")).toDF("start_time", "end_time", "words")
df.withColumn("count", count(when(col("words") === ".", true)).over(w))
.groupBy("count")
.agg(min("start_time").as("start_time"), max("end_time").as("end_time"), concat_ws(" ", collect_list("words")).as("Sentences"))
.drop("count").show(false)
This then gives the result below, although there is an extra space before each "," and ".":
+----------+--------+-----------------------------+
|start_time|end_time|Sentences |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+
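If the space before "," and "." is unwanted, one option (my addition, not part of the answer; it assumes the grouped DataFrame above is bound to a val, here called result) is to strip it afterwards with regexp_replace:

import org.apache.spark.sql.functions.{col, regexp_replace}

// drop the space that concat_ws inserts before "," and "."
val cleaned = result.withColumn("Sentences", regexp_replace(col("Sentences"), " ([.,])", "$1"))
cleaned.show(false)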
Here is my approach using a UDF instead of a window function:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((123, 245, "Hi"), (123, 245, "."), (123, 245, "Hi"), (123, 246, "I"), (123, 245, "."))
  .toDF("start", "end", "words")

// mutable state captured by the UDF: flag marks that the previous word was ".",
// so the counter bumps on the first word of each new sentence
var count = 0
var flag = false
val counterUdf = udf((dot: String) => {
  if (flag) {
    count += 1
    flag = false
  }
  if (dot == ".") flag = true
  count
})

val df1 = df.withColumn("counter", counterUdf(col("words")))
val df2 = df1.groupBy("counter")
  .agg(min("start").alias("start"), max("end").alias("end"),
       concat_ws(" ", collect_list("words")).alias("sentence"))
  .drop("counter")
df2.show()
+-----+---+--------+
|start|end|sentence|
+-----+---+--------+
| 123|246| Hi I .|
| 123|245| Hi .|
+-----+---+--------+
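One caveat worth adding (mine, not from the answer): the UDF keeps its counter in captured mutable variables, so the result is only deterministic when the rows are evaluated in their original order on a single partition. A hedged safeguard is to force that explicitly:

// force a single partition so the stateful UDF sees the rows in order
val df1 = df.coalesce(1).withColumn("counter", counterUdf(col("words")))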
Do your words have trailing spaces? Thanks, but @Nikk's answer above works perfectly for my use case. If you try with my code and data, the result will be the same. Yes, it should be.
Here is another way: flag the row that follows each "." with lag, then turn those flags into sentence numbers with a running sum:
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> df.show(false)
+----------+--------+--------+
|start_time|end_time|words |
+----------+--------+--------+
|132 |135 |Hi |
|135 |135 |, |
|143 |152 |I |
|151 |152 |am |
|159 |169 |working |
|194 |197 |on |
|204 |211 |hadoop |
|211 |211 |. |
|218 |222 |This |
|226 |229 |is |
|234 |239 |Spark |
|245 |249 |DF |
|253 |258 |coding |
|258 |258 |. |
|276 |276 |I |
+----------+--------+--------+
scala> val w = Window.orderBy("start_time", "end_time")
scala> df.withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
.groupBy("temp").agg(min("start_time").alias("start_time"), max("end_time").alias("end_time"),concat_ws(" ",collect_list(trim(col("words")))).alias("sentenses"))
.drop("temp")
.show(false)
+----------+--------+-----------------------------+
|start_time|end_time|sentences                    |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+
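A general note on both window-based answers: a window with orderBy but no partitionBy moves all rows into a single partition (Spark logs "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation"), which is fine at this data size but worth keeping in mind on larger DataFrames.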