Spark DataFrame does not execute the GROUP BY statement in the JDBC data source

I have registered a MySQL data source as follows:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "videos"))

jdbcDF.registerTempTable("videos")
I then run the following Spark SQL query:

select
   uploader, count(*) as items
from
   videos
where
   publisher_id = 154
group by
   uploader
order by
   items desc
This call actually executes the following query on the MySQL server:

SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154
The data is then loaded into the Spark cluster, and the group by is executed there as a Spark operation.


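This behavior is a problem: because the group by is not executed on the MySQL server, every matching row is sent over the network, which generates excessive traffic. Is there a way to force the DataFrame to run the query verbatim on the MySQL server?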

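Well, it depends. Spark can only push down predicates over JDBC, so it is not possible to dynamically execute an arbitrary query on the database side. Still, you can pass any valid query as the table argument, so you can do something like this: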

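val tableQuery =
  "(SELECT uploader, count(*) AS items FROM videos GROUP BY uploader) tmp"

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> tableQuery))
If this is not enough, you can try creating a custom...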
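To make the subquery approach a bit more concrete, here is a minimal sketch that also carries over the `publisher_id = 154` filter from the question, so that both the filter and the aggregation run inside MySQL. `PushdownQuery` and `asDbTable` are hypothetical names introduced only for illustration; they are not part of Spark. The Spark calls are shown as comments and assume the `sqlContext`, `url` and `driver` values from the question.

```scala
// Hypothetical helper (not part of Spark): wraps a query as an aliased
// derived table so it can be passed as the JDBC "dbtable" option.
object PushdownQuery {

  /** Wraps `sql` in parentheses and appends `alias`.
    * MySQL requires every derived table to have an alias. */
  def asDbTable(sql: String, alias: String = "tmp"): String =
    s"(${sql.trim.stripSuffix(";")}) $alias"

  // The aggregation from the question, including its filter.
  val groupedVideos: String = asDbTable(
    """SELECT uploader, COUNT(*) AS items
      |FROM videos
      |WHERE publisher_id = 154
      |GROUP BY uploader""".stripMargin)
}

// Usage with the Spark 1.3-style API from the question (assumed in scope):
//   val jdbcDF = sqlContext.load("jdbc", Map(
//     "url"     -> url,
//     "driver"  -> driver,
//     "dbtable" -> PushdownQuery.groupedVideos))
//   // Sort in Spark: row order from the subquery is not guaranteed to survive the load.
//   jdbcDF.orderBy(jdbcDF("items").desc).show()
```

With this, only one row per uploader should cross the network. Note that `sqlContext.load` belongs to the early 1.x API; on later Spark versions the same options map can be passed through `read.format("jdbc").options(...).load()` instead.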