Apache Spark PySpark: selecting columns with a prefix/suffix
I'm currently working with a Spark DataFrame (via PySpark) that represents a large number of tweets, with the following (trimmed) schema. I want to create a new DataFrame by selecting a number of columns from the original one, for example allProperties.text and allProperties.entities.hashtags. However, I also want to select the same fields for tweets that are retweets or quote tweets, which are indicated by the prefixes allProperties.retweeted_status and allProperties.quoted_status respectively.
Is there a way to select all of these columns without listing each column and each prefix in a pile of redundant lines? For example, by supplying some kind of regular expression that matches allProperties.text, allProperties.retweeted_status.text, and allProperties.quoted_status.text?
As a side note, I do want to keep the DataFrame at the top level, because I also want to include topic in the new DataFrame.
So far, I have managed to write a regex that matches the desired columns:
import re

def _keep_columns(self):
    def _regex_filter(x):
        # leaf fields to keep under each tweet prefix
        tweet_features = [
            'text',
            'entities.hashtags',
            'entities.media',
            'entities.urls',
        ]
        # match the features at the top level or under the retweet/quote
        # prefixes, plus the top-level topic column
        r = (r'(^allProperties\.(retweeted_status\.|quoted_status\.)?'
             r'(' + "|".join(tweet_features) + r')$)'
             r'|(^topic$)')
        return bool(re.match(r, x))

    df = self.tweets.select(*filter(_regex_filter, self.tweets.columns))
However, self.tweets.columns only returns the top-level column names, so the regex never sees the nested fields under allProperties. How can I match column names in a nested way?

You can use df.selectExpr("allProperties.*", "topic") (or any other approach) to flatten the struct columns, then call createOrReplaceTempView on the result and have Spark SQL select the matching columns from the temp view using a regex.
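The flattening step can also be done programmatically. Below is a minimal sketch (the helper name flatten_field_names is hypothetical, not part of PySpark) that walks the plain dict returned by df.schema.jsonValue() and collects the dotted names of all leaf fields, which you could then feed to the question's regex filter:

```python
def flatten_field_names(schema_json, prefix=""):
    """Recursively collect dotted names of all leaf fields from a
    schema dict of the shape returned by df.schema.jsonValue()."""
    names = []
    for field in schema_json["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        # nested structs appear as dicts with "type": "struct"
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            names.extend(flatten_field_names(ftype, name + "."))
        else:
            names.append(name)
    return names

# hypothetical trimmed schema mirroring the question's layout
schema = {"type": "struct", "fields": [
    {"name": "topic", "type": "string"},
    {"name": "allProperties", "type": {"type": "struct", "fields": [
        {"name": "text", "type": "string"},
        {"name": "retweeted_status", "type": {"type": "struct", "fields": [
            {"name": "text", "type": "string"},
        ]}},
    ]}},
]}

print(flatten_field_names(schema))
# ['topic', 'allProperties.text', 'allProperties.retweeted_status.text']
```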
Example:
#sample dataframe after flattening
df = spark.createDataFrame(
    [("a", "1", "b", "c")],
    ["allProperties.text", "allProperties.retweeted_status.text",
     "allProperties.quoted_status.text", "sample"])
df.show()
#+------------------+-----------------------------------+--------------------------------+------+
#|allProperties.text|allProperties.retweeted_status.text|allProperties.quoted_status.text|sample|
#+------------------+-----------------------------------+--------------------------------+------+
#|                 a|                                  1|                               b|     c|
#+------------------+-----------------------------------+--------------------------------+------+

df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show()

#the regex (allProperties(..*|).text) matches allProperties.text
#or allProperties.<anything>.text
spark.sql("select `(allProperties(..*|).text)` from tmp").show()
#+------------------+-----------------------------------+--------------------------------+
#|allProperties.text|allProperties.retweeted_status.text|allProperties.quoted_status.text|
#+------------------+-----------------------------------+--------------------------------+
#|                 a|                                  1|                               b|
#+------------------+-----------------------------------+--------------------------------+