Apache Spark PySpark: selecting columns with a prefix/suffix
I'm currently working with a Spark DataFrame (via PySpark) that represents a large number of tweets, with the following (trimmed) schema. I want to create a new DataFrame by selecting a number of columns from the original one, for example allProperties.text and allProperties.entities.hashtags. However, I also want to select the same fields for tweets that are retweets or quote tweets, which are indicated by the prefixes allProperties.retweeted_status and allProperties.quoted_status respectively.
Is there a way to select all of these columns without listing each column and each prefix in a pile of redundant lines? For example, by supplying some kind of regular expression that matches allProperties.text, allProperties.retweeted_status.text, and allProperties.quoted_status.text?
As a side note, I do want to keep the DataFrame at the top level, because I also want to include topic in the new DataFrame.
So far, I have managed to write a regex that matches the desired columns:
import re

def _keep_columns(self):
    def _regex_filter(x):
        # leaf fields to keep under each tweet prefix
        tweet_features = [
            'text',
            'entities.hashtags',
            'entities.media',
            'entities.urls',
        ]
        # match the features at the top level or under the retweet/quote
        # prefixes, plus the top-level topic column
        r = (r'(^allProperties\.(retweeted_status\.|quoted_status\.)?'
             r'(' + "|".join(tweet_features) + r')$)'
             r'|(^topic$)')
        return bool(re.match(r, x))

    df = self.tweets.select(*filter(_regex_filter, self.tweets.columns))
However, self.tweets.columns only returns the top-level column names, so the regex never sees the nested fields under allProperties. How can I match column names in a nested way?

You can use df.selectExpr("allProperties.*", "topic") (or any other approach) to flatten the struct columns, then call createOrReplaceTempView on the result and have Spark SQL select the matching columns from the temp view using a regex.
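The flattening step can also be done programmatically. Below is a minimal sketch (the helper name flatten_field_names is hypothetical, not part of PySpark) that walks the plain dict returned by df.schema.jsonValue() and collects the dotted names of all leaf fields, which you could then feed to the question's regex filter:

```python
def flatten_field_names(schema_json, prefix=""):
    """Recursively collect dotted names of all leaf fields from a
    schema dict of the shape returned by df.schema.jsonValue()."""
    names = []
    for field in schema_json["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        # nested structs appear as dicts with "type": "struct"
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            names.extend(flatten_field_names(ftype, name + "."))
        else:
            names.append(name)
    return names

# hypothetical trimmed schema mirroring the question's layout
schema = {"type": "struct", "fields": [
    {"name": "topic", "type": "string"},
    {"name": "allProperties", "type": {"type": "struct", "fields": [
        {"name": "text", "type": "string"},
        {"name": "retweeted_status", "type": {"type": "struct", "fields": [
            {"name": "text", "type": "string"},
        ]}},
    ]}},
]}

print(flatten_field_names(schema))
# ['topic', 'allProperties.text', 'allProperties.retweeted_status.text']
```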
Example:
#sample dataframe after flattening
df = spark.createDataFrame(
    [("a", "1", "b", "c")],
    ["allProperties.text", "allProperties.retweeted_status.text",
     "allProperties.quoted_status.text", "sample"])
df.show()
#+------------------+-----------------------------------+--------------------------------+------+
#|allProperties.text|allProperties.retweeted_status.text|allProperties.quoted_status.text|sample|
#+------------------+-----------------------------------+--------------------------------+------+
#|                 a|                                  1|                               b|     c|
#+------------------+-----------------------------------+--------------------------------+------+

df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show()

#the regex (allProperties(..*|).text) matches allProperties.text
#or allProperties.<anything>.text
spark.sql("select `(allProperties(..*|).text)` from tmp").show()
#+------------------+-----------------------------------+--------------------------------+
#|allProperties.text|allProperties.retweeted_status.text|allProperties.quoted_status.text|
#+------------------+-----------------------------------+--------------------------------+
#|                 a|                                  1|                               b|
#+------------------+-----------------------------------+--------------------------------+