Scala Spark: compare two DataFrames and find match counts
I have two Spark SQL DataFrames, neither of which has any unique column. The first DataFrame contains n-grams; the second contains long text strings (blog posts). I want to find the matches in df2 and add a match count to df1.
DF1
------------
words
------------
Stack
Stack Overflow
users
spark scala
DF2
--------
POSTS
--------
Hello, Stack overflow users , Do you know spark scala
Spark scala is very fast
Users in stack are good in spark, users
Expected output
--------------  -----------
words           match_count
--------------  -----------
Stack           2
Stack Overflow  1
users           3
spark scala     1
--------------  -----------
It seems a join plus groupBy and count will do it:

import org.apache.spark.sql.functions.{count, expr}

df1
  .join(df2, expr("lower(posts) rlike lower(words)"))
  .groupBy("words")
  .agg(count("*").as("match_count"))
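For intuition, here is a minimal plain-Python sketch of what that join condition computes (Python and in-memory lists purely for illustration, no Spark). Note that rlike reads the words column as a regular expression, so any regex metacharacters in a word would need escaping, and the groupBy counts matching rows, not occurrences within a row:

```python
import re

# In-memory stand-ins for df1 and df2.
words = ["Stack", "Stack Overflow", "users", "spark scala"]
posts = [
    "Hello, Stack overflow users , Do you know spark scala",
    "Spark scala is very fast",
    "Users in stack are good in spark, users",
]

# lower(posts) rlike lower(words): the lowercased word, read as a regex,
# must match somewhere in the lowercased post. Each post contributes at
# most 1 to a word's count, no matter how often the word occurs in it.
match_count = {
    w: sum(1 for p in posts if re.search(w.lower(), p.lower()))
    for w in words
}

print(match_count)
# {'Stack': 2, 'Stack Overflow': 1, 'users': 2, 'spark scala': 2}
```

"users" counts 2 here even though the last post contains it twice; whether per-row duplicates should count is exactly the edge case debated in the comments.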
You can use pandas functionality from PySpark. Below is my solution:
>>> from pyspark.sql import Row
>>> import pandas as pd
>>>
>>> rdd1 = sc.parallelize(['Stack','Stack Overflow','users','spark scala'])
>>> data1 = rdd1.map(lambda x: Row(x))
>>> df1=spark.createDataFrame(data1,['words'])
>>> df1.show()
+--------------+
| words|
+--------------+
| Stack|
|Stack Overflow|
| users|
| spark scala|
+--------------+
>>> rdd2 = sc.parallelize([
... 'Hello, Stack overflow users , Do you know spark scala',
... 'Spark scala is very fast',
... 'Users in stack are good in spark'
... ])
>>> data2 = rdd2.map(lambda x: Row(x))
>>> df2=spark.createDataFrame(data2,['posts'])
>>> df2.show()
+--------------------+
| posts|
+--------------------+
|Hello, Stack over...|
|Spark scala is ve...|
|Users in stack ar...|
+--------------------+
>>> dfPd1 = df1.toPandas()
>>> dfPd2 = df2.toPandas().apply(lambda x: x.str.lower())
>>>
>>> words = dict((x, 0) for x in dfPd1['words'])
>>>
>>> for i in words:
...     words[i] = dfPd2['posts'].str.contains(i.lower()).sum()
...
>>>
>>> words
{'Stack': 2, 'Stack Overflow': 1, 'users': 2, 'spark scala': 2}
>>>
>>> data = pd.DataFrame.from_dict(words, orient='index').reset_index()
>>> data.columns = ['words','match_count']
>>>
>>> df = spark.createDataFrame(data)
>>> df.show()
+--------------+-----------+
| words|match_count|
+--------------+-----------+
| Stack| 2|
|Stack Overflow| 1|
| users| 2|
| spark scala| 2|
+--------------+-----------+
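The loop above can be collapsed into a single pandas expression (a sketch on plain pandas frames standing in for df1.toPandas() / df2.toPandas(), no Spark; passing regex=False makes the check a literal substring match, whereas str.contains by default interprets its argument as a regex):

```python
import pandas as pd

# Plain-pandas stand-ins for the converted Spark DataFrames.
dfPd1 = pd.DataFrame({"words": ["Stack", "Stack Overflow", "users", "spark scala"]})
dfPd2 = pd.DataFrame({"posts": [
    "Hello, Stack overflow users , Do you know spark scala",
    "Spark scala is very fast",
    "Users in stack are good in spark",
]})

posts_lower = dfPd2["posts"].str.lower()
# For each word, count the posts that contain it as a literal substring.
dfPd1["match_count"] = dfPd1["words"].apply(
    lambda w: int(posts_lower.str.contains(w.lower(), regex=False).sum())
)

print(dfPd1.to_dict("records"))
```

The resulting frame can then be handed back to spark.createDataFrame as in the answer above.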
A brute-force approach in Scala follows. It works all in lowercase (lowercasing could be applied throughout, but that is for another day). Rather than trying to match raw strings, it defines the problem as what it really is, n-grams against n-grams: generate the n-grams on both sides, then JOIN and count, where only the inner join is relevant. Some extra data is added to prove the matching works.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType,ArrayType,LongType,StringType}
import spark.implicits._
// Sample data, duplicates and items to check it works.
val dfPostsInit = Seq(
( "Hello!!, Stack overflow users, Do you know spark scala users."),
( "Spark scala is very fast,"),
( "Users in stack are good in spark"),
( "Users in stack are good in spark"),
( "xy z"),
( "x yz"),
( "ABC"),
( "abc"),
( "XYZ,!!YYY@#$ Hello Bob..."))
.toDF("posting")
val dfWordsInit = Seq("Stack", "Stack Overflow", "users", "spark scala", "xyz", "xy", "not found", "abc").toDF("words")
val dfWords = dfWordsInit
  .withColumn("words_perm", regexp_replace(dfWordsInit("words"), " ", "^"))
  .withColumn("lower_words_perm", lower(regexp_replace(dfWordsInit("words"), " ", "^")))
val dfPostsTemp = dfPostsInit.map(r => (r.getString(0), r.getString(0).split("\\W+").toArray ))
// Tidy Up
val columnsRenamed = Seq("posting", "posting_array")
val dfPosts = dfPostsTemp.toDF(columnsRenamed: _*)
// Generate Ngrams up to some limit N - needs to be set. This so that we can count properly via a JOIN direct comparison. Can parametrize this in calls below.
// Not easy to find string matching over Array and no other answer presented.
def buildNgrams(inputCol: String = "posting_array", n: Int = 3) = {
val ngrams = (1 to n).map(i =>
new NGram().setN(i)
.setInputCol(inputCol).setOutputCol(s"${i}_grams")
)
new Pipeline().setStages((ngrams).toArray)
}
val suffix:String = "_grams"
var i_grams_Cols:List[String] = Nil
for(i <- 1 to 3) {
val iGCS = i.toString.concat(suffix)
i_grams_Cols = i_grams_Cols ::: List(iGCS)
}
// Generate data for checking against later from via rows only and thus not via columns, positional dependency counts, hence permutations.
val dfPostsNGrams = buildNgrams().fit(dfPosts).transform(dfPosts)
val dummySchema = StructType(
StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")))
}
val dfPostsNGrams3 = dfPostsNGrams2.withColumn("lower_phrase_concatenated",lower(regexp_replace(dfPostsNGrams2("phrase"), " ", "^")))
val result = dfPostsNGrams3.join(dfWords, col("lower_phrase_concatenated") ===
col("lower_words_perm"), "inner")
.groupBy("words_perm", "words")
.agg(count("*").as("match_count"))
result.select("words", "match_count").show(false)
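To see what the pipeline above computes, here is a plain-Python sketch of the same n-gram idea (no Spark, purely for illustration; the tokenization approximates split("\\W+") and n is capped at 3, as in buildNgrams). Unlike the rlike join, this counts every occurrence, so a post containing "users" twice contributes 2:

```python
import re
from collections import Counter

posts = [
    "Hello!!, Stack overflow users, Do you know spark scala users.",
    "Spark scala is very fast,",
    "Users in stack are good in spark",
    "Users in stack are good in spark",
    "xy z",
    "x yz",
    "ABC",
    "abc",
    "XYZ,!!YYY@#$ Hello Bob...",
]
words = ["Stack", "Stack Overflow", "users", "spark scala",
         "xyz", "xy", "not found", "abc"]

# Generate every 1-, 2- and 3-gram of each post (lowercased tokens,
# split on non-word characters) and count occurrences across all posts.
grams = Counter()
for p in posts:
    tokens = [t for t in re.split(r"\W+", p.lower()) if t]
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1

# "Inner join" against the search phrases: keep only words that occur.
result = {w: grams[w.lower()] for w in words if grams[w.lower()] > 0}
print(result)
```

With this sample data the counts reproduce the table below (e.g. users → 4, because one post contains it twice and it appears in two more posts), and "not found" drops out, just as an inner join would drop it.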
Comments:
- It seems not. Consider a row like: "Hello, Stack overflow users, Do you know spark scala users." It does not count "users" twice; in fact, what about users? -> edge case.
- @thebluephantom, you assume the OP wants to count words that occur multiple times within a post, but it is not clear from the question what exactly he wants. In fact, the example provided implies there is no such need.
- I am not a mind reader, so I think that needs resolving. I can understand why he asked the question, and I wonder about the -1.
- I just want the document frequency from the n-grams. So I have a DataFrame containing a set of terms or bigrams from the other df. All I have are the post descriptions. I want to take every word in df1 and see how many times it appears across all rows of the second df.
- Still hard for me to understand. I used this as the basis for an interesting (if not over-the-top) approach. Almost done.
- Hi Ali, thanks for your answer. I am looking for an Apache Spark Scala solution rather than PySpark; I will try to convert your code to Scala.
- @Gowri_ for the benefit of others, please post your effort. I am also working on a solution, so I am keen to compare. As we can see, the -1 was invalid.
- I am not an expert in Scala; maybe someone can convert it to Scala. That is why I added the Python code. I hope it helps.
- Any comments on my answer or the others? An interesting question that escaped some people's notice, and it shows the power of Spark/Scala.
+--------------+-----------+
|words |match_count|
+--------------+-----------+
|spark scala |2 |
|users |4 |
|abc |2 |
|Stack Overflow|1 |
|xy |1 |
|Stack |3 |
|xyz |1 |
+--------------+-----------+