
Python: search the remaining columns of a pyspark dataframe for the values in column1


Suppose there is a pyspark dataframe of the following form:

id   col1  col2  col3  col4
---------------------------
as1     4    10     4     6
as2     6     3     6     1
as3     6     0     2     1
as4     8     8     6     1
as5     9     6     6     9
Is there a way to search columns 2-4 of the pyspark dataframe for the values in column 1 and return the (row id, column name) pairs? For example:

In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as2, col3), (as1, col4), (as4, col3), (as5, col3)
In col1, 8 is found in (as4, col2)
In col1, 9 is found in (as5, col4)

Hint: assume the values in col1 form a set {4, 6, 8, 9}, i.e. they are unique.

Yes, you can leverage the Spark SQL .isin operator.

Let's first create the DataFrame from your example.

Part 1 - Creating the DataFrame

from pyspark.sql.types import StructType, StructField, IntegerType

# define the schema for the test DataFrame
cSchema = StructType([StructField("id", IntegerType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])


test_data = [[1,4,10,4,6],[2,6,3,6,1],[3,6,0,2,1],[4,8,8,6,1],[5,9,6,6,9]]


df = spark.createDataFrame(test_data,schema=cSchema)

df.show()

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|   4|  10|   4|   6|
|  2|   6|   3|   6|   1|
|  3|   6|   0|   2|   1|
|  4|   8|   8|   6|   1|
|  5|   9|   6|   6|   9|
+---+----+----+----+----+
Part 2 - A function to search for matching values

isin: a boolean expression that evaluates to true if the value of the expression is contained in the evaluated values of the arguments.
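
For example, here is a minimal sketch of .isin in action (assuming the df created in Part 1 above; values is just an illustrative name):

from pyspark.sql import functions as F

# collect the distinct col1 values on the driver
values = [row.col1 for row in df.select("col1").distinct().collect()]

# keep only the rows whose col3 value appears somewhere in col1
df.filter(F.col("col3").isin(values)).show()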

This should point you in the right direction. You can select just the id column, etc., or whatever you want to return. The function can easily be changed to take more columns to search. Hope this helps!

# assumes an active SparkSession bound to the name `spark`
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create the schema as a list of StructFields
cSchema = StructType([StructField("id", StringType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])

test_data = [['as1', 4, 10, 4, 6],
             ['as2', 6, 3, 6, 1],
             ['as3', 6, 0, 2, 1],
             ['as4', 8, 8, 6, 1],
             ['as5', 9, 6, 6, 9]]

# create the pyspark dataframe
df = spark.createDataFrame(test_data, schema=cSchema)

df.show()

# obtain the distinct items of col1
distinct_list = [i.col1 for i in df.select("col1").distinct().collect()]
# columns to return with each match
col_list = ['id', 'col2', 'col3', 'col4']
# columns to search (everything except id and col1)
search_cols = ['col2', 'col3', 'col4']

# search the remaining columns for each distinct value of col1
def search(distinct_list):
    for i in distinct_list:
        print(str(i) + ' found in:')

        for col in search_cols:
            df_search = df.select(*col_list) \
                .filter(df[col] == i)

            # only show a table when at least one row matched
            if len(df_search.head(1)) > 0:
                df_search.show()


search(distinct_list)
The complete example code is shown above.

Output:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|as1|   4|  10|   4|   6|
|as2|   6|   3|   6|   1|
|as3|   6|   0|   2|   1|
|as4|   8|   8|   6|   1|
|as5|   9|   6|   6|   9|
+---+----+----+----+----+
6 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5|   6|   6|   9|
+---+----+----+----+

+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as2|   3|   6|   1|
|as4|   8|   6|   1|
|as5|   6|   6|   9|
+---+----+----+----+

+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1|  10|   4|   6|
+---+----+----+----+

9 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5|   6|   6|   9|
+---+----+----+----+

4 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1|  10|   4|   6|
+---+----+----+----+

8 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as4|   8|   6|   1|
+---+----+----+----+
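
If you would rather get back the exact (id, column name) pairs the question asks for instead of printed tables, a minimal sketch along these lines should work; it reuses df, distinct_list, and search_cols from the example above, and matches is just an illustrative name:

# collect (value, id, column) triples instead of printing tables
matches = []
for v in distinct_list:
    for c in search_cols:
        for row in df.filter(df[c] == v).select('id').collect():
            matches.append((v, row.id, c))

for v, row_id, c in matches:
    print('In col1, {} is found in ({}, {})'.format(v, row_id, c))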
Thanks, Nadeem. As you rightly pointed out, it would be good if the function could be changed to search more columns. I have actually used the isin() method before; the drawback is that it only works for one-to-one column matching.
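
For reference, a minimal sketch of that one-to-one matching (assuming the df and distinct_list from the answer above): isin compares a single column against a list, so covering several columns means combining one condition per column:

from pyspark.sql import functions as F

# isin matches exactly one column against the list of values
df.filter(F.col('col3').isin(distinct_list)).select('id', 'col3').show()

# searching several columns requires OR-ing per-column conditions
cond = (F.col('col2').isin(distinct_list)
        | F.col('col3').isin(distinct_list)
        | F.col('col4').isin(distinct_list))
df.filter(cond).show()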