Querying data in Cassandra through Spark in a Java Maven project
I am trying to write a simple piece of code that creates a schema, inserts some rows into tables, and then extracts some information and prints it out. However, I am getting an error. I am using the DataStax Cassandra Spark connector. I have been using two examples to help me get this working, but the second example does not use the Cassandra Spark connector, or Spark at all. Here is my code:
package com.angel.testspark.test;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.google.common.base.Optional;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;
import java.io.Serializable;
import java.math.BigDecimal;
import java.text.MessageFormat;
import java.util.*;
import static com.datastax.spark.connector.CassandraJavaUtil.*;
public class App
{
    private transient SparkConf conf;

    private App(SparkConf conf) {
        this.conf = conf;
    }

    private void run() {
        JavaSparkContext sc = new JavaSparkContext(conf);
        createSchema(sc);
        sc.stop();
    }

    private void createSchema(JavaSparkContext sc) {
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        // Prepare the schema
        try (Session session = connector.openSession()) {
            session.execute("DROP KEYSPACE IF EXISTS tester");
            session.execute("CREATE KEYSPACE tester WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE tester.emp (id INT PRIMARY KEY, fname TEXT, lname TEXT, role TEXT)");
            session.execute("CREATE TABLE tester.dept (id INT PRIMARY KEY, dname TEXT)");
            session.execute(
                    "INSERT INTO tester.emp (id, fname, lname, role) " +
                    "VALUES (" +
                    "0001," +
                    "'Angel'," +
                    "'Pay'," +
                    "'IT Engineer'" +
                    ");");
            session.execute(
                    "INSERT INTO tester.emp (id, fname, lname, role) " +
                    "VALUES (" +
                    "0002," +
                    "'John'," +
                    "'Doe'," +
                    "'IT Engineer'" +
                    ");");
            session.execute(
                    "INSERT INTO tester.emp (id, fname, lname, role) " +
                    "VALUES (" +
                    "0003," +
                    "'Jane'," +
                    "'Doe'," +
                    "'IT Analyst'" +
                    ");");
            session.execute(
                    "INSERT INTO tester.dept (id, dname) " +
                    "VALUES (" +
                    "1553," +
                    "'Commerce'" +
                    ");");
            ResultSet results = session.execute("SELECT * FROM tester.emp " +
                    "WHERE role = 'IT Engineer';");
            for (Row row : results) {
                System.out.print(row.getString("fname"));
                System.out.print(" ");
                System.out.print(row.getString("lname"));
                System.out.println();
            }
            System.out.println();
        }
    }

    public static void main( String[] args )
    {
        if (args.length != 2) {
            System.err.println("Syntax: com.datastax.spark.demo.JavaDemo <Spark Master URL> <Cassandra contact point>");
            System.exit(1);
        }
        SparkConf conf = new SparkConf();
        conf.setAppName("Java API demo");
        conf.setMaster(args[0]);
        conf.set("spark.cassandra.connection.host", args[1]);
        App app = new App(conf);
        app.run();
    }
}
I believe this may just be a syntax error; I am just not sure where it is or what it is. Any help would be great, thanks. I have searched the internet and have not yet found a simple example of inserting and extracting data with Cassandra and Spark in Java.
****** EDIT: @BryceAtNetwork23 and @mikea were right about my syntax error, so I edited the question and fixed it. I am getting a new error, so I have pasted the new error and updated the code.

(Answer) Try running your CQL through cqlsh; you should get the same/similar error:
aploetz@cqlsh:stackoverflow> CREATE TABLE dept (id INT PRIMARY KEY, dname TEXT);
aploetz@cqlsh:stackoverflow> INSERT INTO dept (id, dname) VALUES (1553,Commerce);
<ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:50 no viable alternative at
input ')' (... dname) VALUES (1553,Commerce[)]...)">
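That INSERT fails because the text value Commerce is not single-quoted; in CQL, text literals must be wrapped in single quotes, with any embedded single quote doubled. A minimal, hypothetical helper (not part of the DataStax driver) sketching that quoting rule in plain Java:

```java
public class CqlQuote {
    // Quote a text value for a CQL statement: wrap in single quotes
    // and double any embedded single quotes.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        // Unquoted: INSERT INTO dept (id, dname) VALUES (1553,Commerce);   -> syntax error
        // Quoted:   INSERT INTO dept (id, dname) VALUES (1553,'Commerce'); -> valid
        String stmt = "INSERT INTO dept (id, dname) VALUES (1553," + quote("Commerce") + ");";
        System.out.println(stmt);
    }
}
```

In real code, prefer the driver's prepared statements with bind variables over building CQL strings by concatenation, which sidesteps quoting bugs like this one entirely.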
I am now getting another error.

Also try running it from cqlsh:
aploetz@cqlsh:stackoverflow> SELECT * FROM emp WHERE role = 'IT Engineer';
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
That is because role is not defined as a primary key. Cassandra does not allow you to query by arbitrary column values. The best way around this is to create an additional query table named empByRole, with role as the partition key. Like this:
CREATE TABLE empByRole
(id INT, fname TEXT, lname TEXT, role TEXT,
PRIMARY KEY (role,id)
);
aploetz@cqlsh:stackoverflow> INSERT INTO empByRole (id, fname, lname, role) VALUES (0001,'Angel','Pay','IT Engineer');
aploetz@cqlsh:stackoverflow> SELECT * FROM empByRole WHERE role = 'IT Engineer';
role | id | fname | lname
-------------+----+-------+-------
IT Engineer | 1 | Angel | Pay
(1 rows)
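The query-table idea above, duplicating the data keyed by the column you want to filter on, can be pictured in plain Java as a map from partition key to rows. This is a toy sketch of the access pattern, not the connector API:

```java
import java.util.*;

public class QueryTableSketch {
    // One row of the emp table.
    record Emp(int id, String fname, String lname, String role) {}

    // Group rows by role, mimicking an empByRole query table: the role
    // acts as the partition key, so a lookup by role is a direct fetch.
    static Map<String, List<Emp>> byRole(List<Emp> emps) {
        Map<String, List<Emp>> byRole = new HashMap<>();
        for (Emp e : emps) {
            byRole.computeIfAbsent(e.role(), k -> new ArrayList<>()).add(e);
        }
        return byRole;
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(
                new Emp(1, "Angel", "Pay", "IT Engineer"),
                new Emp(2, "John", "Doe", "IT Engineer"),
                new Emp(3, "Jane", "Doe", "IT Analyst"));

        // Equivalent of: SELECT * FROM empByRole WHERE role = 'IT Engineer';
        for (Emp e : byRole(emps).get("IT Engineer")) {
            System.out.println(e.fname() + " " + e.lname());
        }
    }
}
```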
Comments:

- It seems some process is already listening on 4040. Perhaps it is your application, or Spark is already running (possibly in another terminal session)? Use jps to see the Java processes and clean them up.
- @JacekLaskowski Thanks for the tip. I killed the running process, which got rid of the "Address already in use" error. However, I still have the second part of my error. I have edited my original post to show the errors I now get.
- It looks like that last insert is malformed: Commerce should be 'Commerce'.
- Thanks, you were right, but now I have a new error; something is wrong with my equality operator. Does it need to be "=="? Maybe I will try that in the code. Thanks for spotting my syntax error! I am now getting another error... "Exception in thread "main" com.datastax.driver.core.exceptions.InvalidQueryException: No indexed columns present in by-columns clause with Equal operator, at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)". I have updated my post with my new code and error!
- There was something wrong with the formatting of my last println, but after fixing that, it works great. Thank you very much.
- No problem. Glad I could help!