How do I drop an ambiguous column in PySpark?


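There are many questions similar to this one, but they ask something different: how to avoid duplicate columns in a join. That is not what I am asking here.

Given that I already have a DataFrame with ambiguous columns, how do I remove a specific column?

For example, given: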

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

df = spark.createDataFrame(
    spark.sparkContext.parallelize([
        [1, 0.0, "ext-0.0"],
        [1, 1.0, "ext-1.0"],
        [2, 1.0, "ext-2.0"],
        [3, 2.0, "ext-3.0"],
        [4, 3.0, "ext-4.0"],
    ]),
    StructType([
        StructField("id", IntegerType(), True),
        StructField("shared", DoubleType(), True),
        StructField("shared", StringType(), True),
    ])
)

I want to keep only the numeric columns.

However, attempting something like df.select("id", "shared").show() results in:

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'shared' is ambiguous, could be: shared, shared.;"

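Many of the related answers to this problem amount to "avoid getting into this situation", e.g. use ['joinkey'] in the join instead of a.joinkey == b.joinkey. I repeat that this is not the situation here; this concerns a DataFrame that has already ended up in this form.

The metadata in the DataFrame disambiguates these columns: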

$ df.dtypes
[('id', 'int'), ('shared', 'double'), ('shared', 'string')]

$ df.schema
StructType(List(StructField(id,IntegerType,true),StructField(shared,DoubleType,true),StructField(shared,StringType,true)))

So the information is retained internally... I just can't figure out how to use it.

How do I select one column rather than the other?

I would expect to be able to use something like col('shared#11') or similar... but I can't see anything like that.

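Is this simply not possible in Spark?

To answer this question, please post either a working code snippet that solves the problem above, or a link to something official from the Spark developers stating that this is simply not supported.

It does seem to be possible by replacing the schema using .rdd.toDF() on the DataFrame.

However, I will still accept any answer that is less convoluted and annoying than the one below: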

import random
import string
from pyspark.sql.types import DoubleType, LongType

def makeId():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))

def makeUnique(column):
    return "%s---%s" % (column.name, makeId())

def makeNormal(column):
    return column.name.split("---")[0]

unique_schema = list(map(makeUnique, df.schema))
df_unique = df.rdd.toDF(schema=unique_schema)
df_unique.show()

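# toDF() was given only column names, so the types were re-inferred from the
# data and the integer id column comes back as LongType.
numeric_cols = filter(lambda c: c.dataType.__class__ in [LongType, DoubleType], df_unique.schema)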
numeric_col_names = list(map(lambda c: c.name, numeric_cols))
df_filtered = df_unique.select(*numeric_col_names)
df_filtered.show()

normal_schema = list(map(makeNormal, df_filtered.schema))
df_fixed = df_filtered.rdd.toDF(schema=normal_schema)
df_fixed.show()

This gives:

+-----------+---------------+---------------+
|id---chjruu|shared---aqboua|shared---ehjxor|
+-----------+---------------+---------------+
|          1|            0.0|        ext-0.0|
|          1|            1.0|        ext-1.0|
|          2|            1.0|        ext-2.0|
|          3|            2.0|        ext-3.0|
|          4|            3.0|        ext-4.0|
+-----------+---------------+---------------+

+-----------+---------------+
|id---chjruu|shared---aqboua|
+-----------+---------------+
|          1|            0.0|
|          1|            1.0|
|          2|            1.0|
|          3|            2.0|
|          4|            3.0|
+-----------+---------------+

+---+------+
| id|shared|
+---+------+
|  1|   0.0|
|  1|   1.0|
|  2|   1.0|
|  3|   2.0|
|  4|   3.0|
+---+------+

Workaround: simply rename the columns by position, then do whatever you want to do.

renamed_df = df.toDF("id", "shared_double", "shared_string")
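The positional rename above can be generalized so that the new names do not have to be typed by hand. The following is a minimal sketch, not part of the original answer: it only computes the name lists from the output of df.dtypes, and the helper names `is_numeric` and `numeric_projection` as well as the temporary `_c<i>` naming are assumptions made for illustration.

```python
# Hypothetical helpers (not PySpark APIs): plan a positional rename so that
# duplicate column names can be selected safely, then restored.

# Type strings as they appear in df.dtypes for Spark's numeric types.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(dtype):
    """True for numeric type strings, including decimal(p,s)."""
    return dtype in NUMERIC_TYPES or dtype.startswith("decimal")


def numeric_projection(dtypes):
    """Given df.dtypes, return (temp_names, kept_temp_names, final_names)."""
    # One unique temporary name per column position.
    temp_names = [f"_c{i}" for i in range(len(dtypes))]
    # Keep only the positions whose type is numeric.
    kept = [(t, n) for t, (n, d) in zip(temp_names, dtypes) if is_numeric(d)]
    kept_temp = [t for t, _ in kept]
    final_names = [n for _, n in kept]
    return temp_names, kept_temp, final_names


# The dtypes of the DataFrame from the question.
dtypes = [("id", "int"), ("shared", "double"), ("shared", "string")]
temp_names, kept_temp, final_names = numeric_projection(dtypes)
print(temp_names)   # ['_c0', '_c1', '_c2']
print(kept_temp)    # ['_c0', '_c1']
print(final_names)  # ['id', 'shared']

# With a live DataFrame `df`, the plan would be applied as:
# df_fixed = df.toDF(*temp_names).select(*kept_temp).toDF(*final_names)
```

Because toDF() renames by position, this stays in the DataFrame API and avoids the round-trip through .rdd, so the original column types are kept rather than re-inferred.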

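The simplest solution to this problem is to rename the columns with df.toDF(...), but if you don't want to change the column names, you can group the duplicated columns by their type into a struct, as follows.

Note that the solution below is written in Scala, but logically similar code can be implemented in Python. This solution also works for all duplicated columns in a DataFrame.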

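1. Load the test data

import org.apache.spark.sql.functions.{col, expr, struct}
import org.apache.spark.sql.types.StructField
import spark.implicits._

val df = Seq((1, 2.0, "shared")).toDF("id", "shared", "shared")
df.show(false)
df.printSchema()
/**
  * +---+------+------+
  * |id |shared|shared|
  * +---+------+------+
  * |1  |2.0   |shared|
  * +---+------+------+
  *
  * root
  *  |-- id: integer (nullable = false)
  *  |-- shared: double (nullable = false)
  *  |-- shared: string (nullable = true)
  */

2. Get all the duplicated column names

// 1. Get all the duplicated column names
val findDupCols = (cols: Array[String]) => cols.map((_, 1)).groupBy(_._1).filter(_._2.length > 1).keys.toSeq
val dupCols = findDupCols(df.columns)
println(dupCols.mkString(", "))
// shared

3. Rename the duplicated columns, e.g. shared => shared:double, shared:string, without touching the other column names

val renamedDF = df
  // 2. Rename duplicated columns, e.g. shared => shared:double, shared:string
  .toDF(df.schema
    .map { case StructField(name, dt, _, _) =>
      if (dupCols.contains(name)) s"$name:${dt.simpleString}" else name
    }: _*)

4. Create a struct for each group of duplicated columns

// 3. Create a struct of all columns that share a name
val structCols = df.schema.map(f => f.name -> f).groupBy(_._1)
  .map { case (name, seq) =>
    if (seq.length > 1)
      struct(
        seq.map { case (_, StructField(fName, dt, _, _)) =>
          expr(s"`$fName:${dt.simpleString}` as ${dt.simpleString}")
        }: _*
      ).as(name)
    else col(name)
  }.toSeq
val structDF = renamedDF.select(structCols: _*)

structDF.show(false)
structDF.printSchema()
/**
  * +-------------+---+
  * |shared       |id |
  * +-------------+---+
  * |[2.0, shared]|1  |
  * +-------------+---+
  *
  * root
  *  |-- shared: struct (nullable = false)
  *  |    |-- double: double (nullable = false)
  *  |    |-- string: string (nullable = true)
  *  |-- id: integer (nullable = false)
  */

5. Use <column name>.<data type> to get a column by its type

// Use the DataFrame without losing any columns
structDF.selectExpr("id", "shared.double as shared").show(false)
/**
  * +---+------+
  * |id |shared|
  * +---+------+
  * |1  |2.0   |
  * +---+------+
  */

Hope this is useful to someone.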
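Since the answer notes that logically similar code can be written in Python, here is one possible sketch of that idea. It is not the answer author's code: the helper name `struct_plan` is hypothetical, it builds Spark SQL expression strings using the built-in named_struct function instead of the Scala struct/expr calls, and it assumes that columns sharing a name have distinct, simple types (such as double and string).

```python
from collections import Counter


def struct_plan(dtypes, sep=":"):
    """Hypothetical helper: given df.dtypes, return (renamed, exprs).

    renamed: unique positional names, e.g. shared -> shared:double
    exprs:   SQL expressions that pack duplicated columns into one struct
    """
    counts = Counter(name for name, _ in dtypes)
    # Suffix only the duplicated names with their type.
    renamed = [
        f"{name}{sep}{dtype}" if counts[name] > 1 else name
        for name, dtype in dtypes
    ]
    exprs = []
    seen = set()
    for name, _ in dtypes:
        if name in seen:
            continue  # this duplicate group has already been emitted
        seen.add(name)
        if counts[name] > 1:
            # One struct field per type, named after the type.
            fields = ", ".join(
                f"'{d}', `{n}{sep}{d}`" for n, d in dtypes if n == name
            )
            exprs.append(f"named_struct({fields}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return renamed, exprs


dtypes = [("id", "int"), ("shared", "double"), ("shared", "string")]
renamed, exprs = struct_plan(dtypes)
print(renamed)  # ['id', 'shared:double', 'shared:string']
print(exprs)
# ['`id`', "named_struct('double', `shared:double`, 'string', `shared:string`) AS `shared`"]

# With a live DataFrame `df`, the plan would be applied as:
# struct_df = df.toDF(*renamed).selectExpr(*exprs)
# struct_df.selectExpr("id", "shared.double AS shared").show()
```

Unlike the Scala groupBy, which does not preserve column order, this sketch keeps the columns in their original order. If two duplicated columns also share a type, the generated names would collide, so a positional suffix would be needed in that case.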