Apache Spark: using a list to define the select columns in a query

apache-spark pyspark

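I need to query from a Parquet file whose column names are completely inconsistent. To work around this and make sure my model gets exactly the data it expects, I need to "prefetch" the list of columns and then apply some regular expression patterns to qualify which columns to retrieve. In pseudocode: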

PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert variable statements to check/qualify the columns against rules here
dfQualified = SELECT [PrefetchList] from parquet;

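I have searched around to see whether this can be done, but without any success. If this is syntactically correct (or close), or if anyone has other suggestions, I am open to them.

Thanks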

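You can use the schema method, or you can use the columns method instead. Note that the select method in Spark is a little odd: it is defined as def select(col: String, cols: String*), so you cannot pass select(fields: _*) to it and have to write df.select(fields.head, fields.tail: _*), which is a bit ugly. Fortunately, selectExpr(exprs: String*) is available as an alternative, so the following will work. It keeps only the columns whose names start with "user":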

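// Keep only the column names that match the pattern in full
val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp
// Expand the array into the String* varargs that selectExpr expects
val dfQualified = df.selectExpr(fields: _*)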

Of course, this assumes that df contains your DataFrame, loaded with sqlContext.read.parquet().
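Since the question is tagged pyspark, here is a rough Python sketch of the same idea. The helper names qualify_columns and select_qualified are made up for illustration, and the sample column names are hypothetical. It only assumes that df is a PySpark DataFrame, whose columns attribute is a plain list of strings and whose select method accepts column names as varargs.

```python
import re


def qualify_columns(columns, pattern):
    """Return the names in `columns` that match `pattern` in full."""
    rx = re.compile(pattern)
    # fullmatch mirrors Java's String.matches, which the Scala snippet relies on
    return [name for name in columns if rx.fullmatch(name)]


def select_qualified(df, pattern):
    """Select only the qualifying columns of a PySpark DataFrame (assumed)."""
    # df.columns lists the column names; unpack the filtered list into select
    return df.select(*qualify_columns(df.columns, pattern))


# Hypothetical column names, for illustration only
names = ["user_id", "userName", "ts", "USER_AGENT"]
print(qualify_columns(names, r"user.+"))
```

With a real DataFrame, the call would look something like select_qualified(sqlContext.read.parquet(my_parquet_file), r"user.+"). The match is case sensitive, so a name such as USER_AGENT is dropped unless the pattern allows for it.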

I see the pyspark tag, so I assume you are using Python, but if you are comfortable using Scala, case classes can solve your problem effectively.