How do I integrate Apache Spark with MySQL to read database tables as Spark DataFrames?
I want to run my existing application with Apache Spark and MySQL.
Based on this, try the following (assuming Java or Scala; I am not sure how this would work with Python):
Add the MySQL JDBC driver to the classpath of the Spark cluster.
Initialize the driver: Class.forName("com.mysql.jdbc.Driver")
Create the data source:
From PySpark, this worked for me:
dataframe_mysql = mySqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/my_bd_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="my_tablename",
    user="root",
    password="root").load()
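On Spark 2.x the same read can go through a SparkSession instead of a SQLContext. The following is a minimal sketch rather than a definitive implementation: the names jdbc_options, read_mysql_table and demo are made up for illustration, and it assumes the MySQL connector JAR is already on the Spark classpath.

```python
def jdbc_options(url, table, user, password, driver="com.mysql.jdbc.Driver"):
    """Collect the JDBC options used by the examples in this thread."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def read_mysql_table(spark, url, table, user, password,
                     driver="com.mysql.jdbc.Driver"):
    """Load one MySQL table as a DataFrame via spark.read (a DataFrameReader)."""
    opts = jdbc_options(url, table, user, password, driver)
    return spark.read.format("jdbc").options(**opts).load()


def demo():
    """Hypothetical usage; needs PySpark and a reachable MySQL server."""
    # Imported here so the helpers above load without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-read").getOrCreate()
    df = read_mysql_table(
        spark,
        url="jdbc:mysql://localhost:3306/my_bd_name",
        table="my_tablename",
        user="root",
        password="root",
    )
    df.show()
```

Calling demo() from a PySpark environment should print the first rows of the table. The URL, table name and credentials are the placeholder values from the answer above and need to be replaced with your own.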
With Scala, this worked for me.
Start spark-shell with the following command:
sudo -u root spark-shell --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar,/mnt/resource/lokeshtest/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar --packages com.databricks:spark-csv_2.10:1.2.0
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://Public_IP:3306/DB_NAME").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "tblage").option("user", "sqluser").option("password", "sqluser").load()
dataframe_mysql.show()
For Scala, this also works if you use sbt.
In your build.sbt file:
Then you only need to declare your use of the driver:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Load the MySQL driver class so that JDBC can find it
Class.forName("com.mysql.jdbc.Driver").newInstance
val conf = new SparkConf().setAppName("MY_APP_NAME").setMaster("MASTER")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val data = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<HOST>:3306/<database>")
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  .option("dbtable", "MYSQL_QUERY")
  .load()
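With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter. Use SparkSession.read to access a DataFrameReader and Dataset.write to access a DataFrameWriter.
The examples assume spark-shell.
Read example 3, for reading from the result of a query rather than from a table: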
val sql = """select * from db.your_table where id>1"""
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql:dbserver")
  .option("dbtable", s"( $sql ) t")
  .option("user", "username")
  .option("password", "password")
  .load()
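On the read path, the dbtable option accepts anything that is valid in a FROM clause, which is why the query above is wrapped in parentheses and given the alias t. A small helper can do that wrapping from PySpark. This is only a sketch: as_dbtable and read_query are hypothetical names, not part of Spark, and the helper assumes the query is a single SELECT statement from a trusted source.

```python
def as_dbtable(query, alias="t"):
    """Wrap a SELECT statement as a derived table for the JDBC dbtable option."""
    # Drop surrounding whitespace and any trailing semicolons.
    stripped = query.strip().rstrip(";").strip()
    if not stripped:
        raise ValueError("query must not be empty")
    if not alias.isidentifier():
        raise ValueError("alias must be a simple identifier")
    return "({}) {}".format(stripped, alias)


def read_query(spark, url, query, user, password):
    """Run the query on the MySQL side and load its result as a DataFrame."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", as_dbtable(query))
        .option("user", user)
        .option("password", password)
        .load()
    )
```

The alias matters because MySQL requires every derived table to have one.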
Example
For Spark 2.1.0 and Scala on Windows 7, the following code worked well for me:
import org.apache.spark.sql.SparkSession

object MySQL {
  def main(args: Array[String]): Unit = {
    // First create a SparkSession as the entry point of your app
    val spark: SparkSession = SparkSession
      .builder()
      .appName("JDBC")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "C:/Exp/")
      .getOrCreate()
    val dataframe_mysql = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost/feedback")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "person") // replace with your own
      .option("user", "root") // replace with your own
      .option("password", "vertrigo") // replace with your own
      .load()
    dataframe_mysql.show()
  }
}
For Java, this worked for me:
@Bean
public SparkConf sparkConf() {
    SparkConf sparkConf = new SparkConf()
            .setAppName(appName)
            .setSparkHome(sparkHome)
            .setMaster(masterUri);
    return sparkConf;
}

@Bean
public JavaSparkContext javaSparkContext() {
    return new JavaSparkContext(sparkConf());
}

@Bean
public SparkSession sparkSession() {
    return SparkSession
            .builder()
            .sparkContext(javaSparkContext().sc())
            .appName("Java Spark SQL basic example")
            .getOrCreate();
}
Of course, for MySQL I needed the connector:
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.6</version>
</dependency>
For Java with Maven, add the Spark dependencies and the SQL driver dependency to your pom.xml file:
<properties>
    <java.version>1.8</java.version>
    <spark.version>1.6.3</spark.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>6.0.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
</dependencies>
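Maybe this will help:
JdbcRDD is now discouraged. It is better to look at the DataFrame interface in Spark 1.4 and later.
@Mattingenshron That is true, although it was not available when the question was asked and answered.
Yes, understood. I only found this while searching for myself and others, so I updated it to make sure new users find the latest approach.
This code will stop connecting Spark with the database.
SQLContext.load is now deprecated and will be removed in 2.0.
Looks like an autocomplete error: com.imf.jdbc.Driver -> com.mysql.jdbc.Driver?
You are right! Thanks for catching that.
mySqlContext should be sqlContext
^ That is just a variable. You can name it whatever you like.
If I am using ODBC instead of JDBC, is it exactly the same with the two names swapped in the text above?
For Spark 2.x, use dataframe = spark_session.read.format("jdbc").options(...).load()
Works well and is clean! Thanks.
How can we delete records from MySQL over a Spark connection?
I needed to specify the driver option from the answer to make it work.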
Read example, loading a whole table:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
Write example, appending a DataFrame to a table:
import org.apache.spark.sql.SaveMode
val prop = new java.util.Properties()
prop.put("user", "username")
prop.put("password", "yourpassword")
val url = "jdbc:mysql://host:port/db_name"
// df is the DataFrame that contains the data you want to write
df.write.mode(SaveMode.Append).jdbc(url, "table_name", prop)
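From PySpark the write path looks much the same. The sketch below assumes df is an existing DataFrame and that the target table accepts appended rows. The name append_to_mysql is illustrative only and is not a Spark API.

```python
def append_to_mysql(df, url, table, user, password,
                    driver="com.mysql.jdbc.Driver"):
    """Append the rows of df to a MySQL table through DataFrameWriter.jdbc."""
    properties = {"user": user, "password": password, "driver": driver}
    # mode="append" adds rows and keeps the ones already in the table.
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
```

A call such as append_to_mysql(df, "jdbc:mysql://host:port/db_name", "table_name", "username", "yourpassword") mirrors the Scala snippet, with host, port and credentials still to be filled in.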
In Java, given a SparkSession named sparkSession, a join query can be read like this:
Properties properties = new Properties();
properties.put("user", "root");
properties.put("password", "root");
properties.put("driver", "com.mysql.cj.jdbc.Driver");
// Join example: the parenthesized query is passed in place of a table name
String joinQuery =
    "(SELECT books.BOOK_ID as BOOK_ID, books.BOOK_TITLE as BOOK_TITLE, "
    + "books.BOOK_AUTHOR as BOOK_AUTHOR, borrowers.BORR_NAME as BORR_NAME "
    + "FROM books LEFT OUTER JOIN borrowers ON books.BOOK_ID = borrowers.BOOK_ID) as t";
sparkSession.read()
    .jdbc("jdbc:mysql://localhost:3306/books?useSSL=false", joinQuery, properties)
    .show();
+-------+------------------+--------------+---------------+
|BOOK_ID|        BOOK_TITLE|   BOOK_AUTHOR|      BORR_NAME|
+-------+------------------+--------------+---------------+
|      1|        Gyűrű kúra|J.R.K. Tolkien|   Sára Sarolta|
|      2|     Kecske-eledel|     Mekk Elek|Maláta Melchior|
|      3|      Répás tészta| Vegán Eleazár|           null|
|      4|Krumpli és pityóka| Farmer Emília|           null|
+-------+------------------+--------------+---------------+
Sample Java code for the Maven setup above, using the Spark 1.6 SQLContext API:
SparkConf sparkConf = new SparkConf();
SparkContext sc = new SparkContext("local", "spark-mysql-test", sparkConf);
SQLContext sqlContext = new SQLContext(sc);
// Here you can run a SQL query
String sql = "(select * from table1 join table2 on table1.id=table2.table1_id) as test_table";
// Or use an existing table directly
// String sql = "table1";
DataFrame dataFrame = sqlContext
    .read()
    .format("jdbc")
    .option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
    .option("user", "root")
    .option("password", "password")
    .option("dbtable", sql)
    .load();
// Continue with your own logic
......