Apache Spark 2.2.0: how do I get the Cassandra CQL string for a given Apache Spark DataFrame?

I am trying to get the CQL string for a given DataFrame. I came across this, where I can do something like the following:

TableDef.fromDataFrame(df, "test", "hello", ProtocolVersion.NEWEST_SUPPORTED).cql()

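It seems to me that the library uses the first column as the partition key and does not care about clustering keys. So how do I specify that one particular set of DataFrame columns should be used as the partition key and another set as the clustering key?

It looks like I can create a new TableDef, but then I have to do the entire mapping myself, and in some cases the necessary pieces (such as ColumnType) are not accessible from Java. For example, I tried to create a new ColumnDef as follows: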

new ColumnDef("col5", new PartitionKeyColumn(), ColumnType is not accessible in Java)

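Goal: get a CQL create statement from a Spark DataFrame.

Input: My DataFrame can have any number of columns, each with its own Spark type. Suppose I have a Spark DataFrame with 100 columns, where col8 and col9 correspond to the Cassandra partition key columns and col10 corresponds to the Cassandra clustering key column.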

col1| col2| ...|col100

Now I would like to use the Spark Cassandra Connector library to give me a CQL create table statement based on the information above.

Desired output:

create table if not exists test.hello (
   col1 bigint,  -- whatever col1's type is in my DataFrame; I just picked bigint arbitrarily
   col2 varchar,
   col3 double,
   ...
   ...
   col100 bigint,
   primary key((col8, col9), col10)
) WITH CLUSTERING ORDER BY (col10 DESC);
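To make the shape of the target statement concrete, here is a minimal plain-Java sketch that assembles a CREATE TABLE string from an ordered column map plus partition and clustering key lists. It does not use the Spark Cassandra Connector at all; the class name CqlSketch, the method createTable, and the sample column types are hypothetical, and it assumes identifiers need no quoting. It also leaves out the WITH CLUSTERING ORDER BY clause.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CqlSketch {

    /**
     * Builds a CREATE TABLE statement by hand (illustrative only).
     * columns maps column name to CQL type and must preserve insertion order.
     */
    static String createTable(String keyspace, String table,
                              Map<String, String> columns,
                              List<String> partitionKey,
                              List<String> clusteringKey) {
        if (partitionKey.isEmpty()) {
            throw new IllegalArgumentException("At least one partition key column is required");
        }
        // Every key column has to exist among the declared columns.
        for (String key : partitionKey) requireColumn(columns, key);
        for (String key : clusteringKey) requireColumn(columns, key);

        StringBuilder sb = new StringBuilder();
        sb.append("CREATE TABLE IF NOT EXISTS ")
          .append(keyspace).append('.').append(table).append(" (\n");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            sb.append("  ").append(column.getKey()).append(' ')
              .append(column.getValue()).append(",\n");
        }
        // The partition key goes in an inner pair of parentheses;
        // clustering columns follow it inside PRIMARY KEY.
        sb.append("  PRIMARY KEY ((").append(String.join(", ", partitionKey)).append(")");
        for (String key : clusteringKey) {
            sb.append(", ").append(key);
        }
        sb.append(")\n)");
        return sb.toString();
    }

    private static void requireColumn(Map<String, String> columns, String name) {
        if (!columns.containsKey(name)) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("col1", "bigint");
        columns.put("col8", "bigint");
        columns.put("col9", "text");
        columns.put("col10", "double");
        System.out.println(createTable("test", "hello", columns,
                Arrays.asList("col8", "col9"), Arrays.asList("col10")));
    }
}
```

The point of the sketch is only the PRIMARY KEY layout: a composite partition key is wrapped in its own parentheses, and the clustering columns come after it.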

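Because the required components (PartitionKeyColumn and the instances of ColumnType) are objects in Scala, you need to use the following syntax to access their instances from Java: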

// imports
import com.datastax.spark.connector.cql.ColumnDef;
import com.datastax.spark.connector.cql.PartitionKeyColumn$;
import com.datastax.spark.connector.types.TextType$;

// actual code
ColumnDef a = new ColumnDef("col5",  
      PartitionKeyColumn$.MODULE$, TextType$.MODULE$);

See the connector source code for the full list of object and class names.
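For readers who are not used to Scala, the following plain-Java sketch approximates what the Scala compiler generates for a singleton object, which is why the Name$.MODULE$ syntax works from Java. The class Marker$ is hypothetical and much simplified; it is not part of the connector.

```java
// Simplified stand-in for the class that scalac emits for `object Marker`.
// The name Marker$ is hypothetical and used only for illustration.
final class Marker$ {
    // The single instance, exposed through a public static field named MODULE$.
    public static final Marker$ MODULE$ = new Marker$();

    // Private constructor: no other instance can be created.
    private Marker$() {}

    @Override
    public String toString() {
        return "Marker";
    }
}

class ScalaObjectAccessSketch {
    public static void main(String[] args) {
        // From Java, the singleton is reached the same way as PartitionKeyColumn$.MODULE$.
        Marker$ first = Marker$.MODULE$;
        Marker$ second = Marker$.MODULE$;
        System.out.println(first + " is a singleton: " + (first == second));
    }
}
```

The same pattern applies to any Scala object: append $ to the object name and read the static MODULE$ field.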

Update after the additional requirements: the code is long, but it should work.

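// imports (assuming Spark 2.2.x with a Scala 2.11 build of the connector)
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import com.datastax.driver.core.ProtocolVersion;
import com.datastax.spark.connector.cql.ClusteringColumn;
import com.datastax.spark.connector.cql.ColumnDef;
import com.datastax.spark.connector.cql.PartitionKeyColumn$;
import com.datastax.spark.connector.cql.RegularColumn$;
import com.datastax.spark.connector.cql.TableDef;

import scala.collection.JavaConverters;

// actual code
SparkSession spark = SparkSession.builder()
                .appName("Java Spark SQL example").getOrCreate();

// Names of the columns that should form the partition key
Set<String> partitionKeys = new TreeSet<String>() {{
                add("col1");
                add("col2");
        }};
// Clustering column name -> position within the clustering key
Map<String, Integer> clusteringKeys = new TreeMap<String, Integer>() {{
                put("col8", 0);
                put("col9", 1);
        }};

Dataset<Row> df = spark.read().json("my-test-file.json");
// Let the connector infer the column types from the DataFrame schema
TableDef td = TableDef.fromDataFrame(df, "test", "hello", 
                ProtocolVersion.NEWEST_SUPPORTED);

List<ColumnDef> partKeyList = new ArrayList<ColumnDef>();
List<ColumnDef> clusterColumnList = new ArrayList<ColumnDef>();
List<ColumnDef> regColumnList = new ArrayList<ColumnDef>();

// Re-assign each inferred column to the desired role, keeping its inferred type.
// Columns are visited in the order of the inferred definition, so key columns
// end up in that order as well.
scala.collection.Iterator<ColumnDef> iter = td.allColumns().iterator();
while (iter.hasNext()) {
        ColumnDef col = iter.next();
        String colName = col.columnName();
        if (partitionKeys.contains(colName)) {
                partKeyList.add(new ColumnDef(colName, 
                                PartitionKeyColumn$.MODULE$, col.columnType()));
        } else if (clusteringKeys.containsKey(colName)) {
                int idx = clusteringKeys.get(colName);
                clusterColumnList.add(new ColumnDef(colName, 
                                new ClusteringColumn(idx), col.columnType()));
        } else {
                regColumnList.add(new ColumnDef(colName, 
                                RegularColumn$.MODULE$, col.columnType()));
        }
}

// Convert the Java lists to Scala Seqs; a java.util.List cannot simply be
// cast to scala.collection.Seq.
TableDef newTd = new TableDef(td.keyspaceName(), td.tableName(), 
                JavaConverters.asScalaBufferConverter(partKeyList).asScala(),
                JavaConverters.asScalaBufferConverter(clusterColumnList).asScala(), 
                JavaConverters.asScalaBufferConverter(regColumnList).asScala(),
                td.indexes(), td.isView());
String cql = newTd.cql();
System.out.println(cql);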
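One detail worth checking in the loop above: key columns are collected in the order in which the inferred table definition lists them, not in the order you named them, and the order of key columns is part of the Cassandra table definition. If that matters, one option is to drive the assignment from ordered key lists instead. The sketch below shows the idea on plain column names; KeyOrderSketch and its methods are hypothetical helpers, not connector API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class KeyOrderSketch {

    /** Returns the requested keys in the caller's order, failing fast on unknown columns. */
    static List<String> inDeclaredOrder(List<String> dataFrameColumns, List<String> keys) {
        Set<String> known = new HashSet<>(dataFrameColumns);
        List<String> ordered = new ArrayList<>();
        for (String key : keys) {
            if (!known.contains(key)) {
                throw new IllegalArgumentException("Unknown column: " + key);
            }
            ordered.add(key);
        }
        return ordered;
    }

    /** Returns the columns that are in neither key list, in DataFrame order. */
    static List<String> regularColumns(List<String> dataFrameColumns,
                                       List<String> partitionKeys,
                                       List<String> clusteringKeys) {
        List<String> regular = new ArrayList<>(dataFrameColumns);
        regular.removeAll(partitionKeys);
        regular.removeAll(clusteringKeys);
        return regular;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("col1", "col2", "col3", "col8", "col9");
        List<String> partitionKeys = Arrays.asList("col2", "col1");  // deliberately not DataFrame order
        List<String> clusteringKeys = Arrays.asList("col9", "col8");

        System.out.println("partition:  " + inDeclaredOrder(columns, partitionKeys));
        System.out.println("clustering: " + inDeclaredOrder(columns, clusteringKeys));
        System.out.println("regular:    " + regularColumns(columns, partitionKeys, clusteringKeys));
    }
}
```

With the connector, the same ordering could be applied by looking each ColumnDef up by name and building the three lists from these ordered names.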

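Thank you very much, Alex! I'm not a Scala person, but what you said makes sense now! So should I assume that there is no easy way to get the CQL string for a given DataFrame? Because I feel that TableDef.fromDataFrame is almost there!! I need to look into it more...

As I understand it, you want to reuse the existing definition?

So I have about 100 columns in my Spark DataFrame (col1, col2, ... col100). Now I want to get a CQL create statement from it. I can call TableDef.fromDataFrame(df, "test", "hello", ProtocolVersion.NEWEST_SUPPORTED).cql() to get a CQL create statement, but the only problem is that I have not found a way to specify col8 and col9 as the partition key and col10 as the clustering key. If I need to create a new TableDef, then I need to create a ColDef for all 100 of my columns, which is a bit tedious, so I would keep that as my last option. If you can let me know, that would be a great help!

Could you update the question to explain in more detail what you ultimately want to achieve? Do I understand correctly that you want to generate a CQL statement from the table definition?

Hi Alex! I just edited my question and added all the details.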