Show all matched strings in a PySpark dataframe


I want to display all filtered results that match a similar string.

Code:

# Since most Stack Overflow askers and answerers are all super SMART and leave out all the necessary imports and setup code before using pyspark, so that readers can rack their brains researching instead of getting a direct answer, I share the code from the beginning, as below, for future reviewers.

# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext

import pandas as pd
import numpy as np

# Initiate the session
spark = SparkSession\
            .builder\
            .appName('Operations')\
            .getOrCreate()

# Get (or create) the SparkContext; not strictly needed when using SparkSession
sc = SparkContext.getOrCreate()

# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
    (1,2,'3'),
    (2,2,'yes')],
    ['a', 'b', 'c']
)

# Create dataframe 2
sdataframe_temp2 = spark.createDataFrame([
    (4,6,'yes'),
    (5,7,'yes')],
    ['a', 'b', 'c']
)

# Combine the two dataframes (union matches columns by position)
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

# Filter out the columns based on respective rules
sdataframe_temp\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()  # I wish to stick with dataframe methods if possible.
Output:

+---+---+
|  a|  b|
+---+---+
|  2|  2|
+---+---+
Expected output:

+---+---+
|  a|  b|
+---+---+
|  2|  2|
|  4|  6|
|  5|  7|
+---+---+

Can anyone suggest some improvements?

You should change the last line of your code. For the col function, you should import it from pyspark.sql.functions.

You must select the data from sdataframe_union_1_2, but you are selecting from sdataframe_temp; that is why you get only one record back. With col:

from pyspark.sql.functions import col

sdataframe_union_1_2\
    .filter(col('c') == 'yes')\
    .select(['a', 'b'])\
    .show()

Or, equivalently, with bracket notation:

sdataframe_union_1_2\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()

Comments:

- Change sdataframe_temp to sdataframe_union_1_2 in your last line.
- May I know what you mean? My last line combines sdataframe_temp and sdataframe_temp2.
- I mean the filter line. You are filtering only on the first dataframe, so you only get data from the first dataframe.
- Ah, I see, now I understand. My goodness, my mistake. Thank you so much for pointing it out. Could you also tell me how to add new values to a Spark dataframe? I cannot find the right information. I want to add a new column named 'e' with the values [1, 2, 3, 4] to sdataframe_union_1_2. (A sketch for this is at the end of the page.)
- Thanks! This is great. I want to learn new PySpark methods, so this is exactly what I was looking for. Thank you, sir; do you think you could help me with another question?

Here is another approach, using unionByName, which matches columns by name rather than by position:
df = (sdataframe_temp
      .unionByName(sdataframe_temp2)
      .where("c == 'yes'")
      .drop('c'))

df.show()

+---+---+
|  a|  b|
+---+---+
|  2|  2|
|  4|  6|
|  5|  7|
+---+---+
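
The follow-up in the comments asks how to attach a literal list such as [1, 2, 3, 4] as a new column 'e'. Spark has no withColumn variant that takes a plain Python list, so one common workaround is to give both sides a matching row index and join on it. Below is a minimal sketch under that assumption, reusing spark and sdataframe_union_1_2 from above; the names row_idx and new_values are illustrative, not from the original thread:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# Attach a consecutive row index (1, 2, 3, ...) to the combined dataframe.
# monotonically_increasing_id() is unique but not consecutive, so it is
# only used as an ordering key for row_number().
w = Window.orderBy(monotonically_increasing_id())
indexed = sdataframe_union_1_2.withColumn('row_idx', row_number().over(w))

# Build a two-column dataframe from the literal values with the same index.
new_values = spark.createDataFrame(
    [(i + 1, v) for i, v in enumerate([1, 2, 3, 4])],
    ['row_idx', 'e']
)

# Join on the index and drop it; this assumes the current row order is the
# order in which the values should be paired.
sdataframe_union_1_2 = indexed.join(new_values, on='row_idx').drop('row_idx')
sdataframe_union_1_2.show()

Note that a window with no partitionBy moves all rows to a single partition, which is fine for a toy dataframe like this one but does not scale to large data.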