How to get a column by index instead of name?


python apache-spark pyspark


I have the following initial PySpark DataFrame:

+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|      686 |          [[686,520.70],[645,2]]|
|      685 |[[685,45.556],[678,23],[655,21]]|
|      693 |                              []|
+----------+--------------------------------+

df = sqlCtx.createDataFrame(
    [(686, [[686,520.70], [645,2]]), (685, [[685,45.556], [678,23],[655,21]]), (693, [])],
    ["product_PK", "products"]
)

The products column contains nested data. I need to extract the second value in each pair of values. I am running the following code:

temp_dataframe = dataframe.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))

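It works fine with one particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" may be named differently in some DataFrames, e.g. "col1" or "col2".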

For example:

DataFrame contents
root
 |-- product_PK: long (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: long (nullable = true)
 |    |    |-- _2: double (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- _1: long (nullable = true)
 |    |-- _2: double (nullable = true)

DataFrame contents
root
 |-- product_PK: long (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- product_PK: long (nullable = true)
 |    |    |-- col2: integer (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- product_PK: long (nullable = true)
 |    |-- col2: integer (nullable = true)

I tried using an index, as in getItem(1), but it says that the name of a column must be provided.

Is there any way to avoid specifying the column name, or to generalize this part of the code somehow?

My goal is for exploded to contain the second value of each pair in the nested data, i.e. _2, col1 or col2.

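It sounds like you are on the right track. I think the way to do this is to read the schema to determine the name of the field you want to extract. Instead of schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to work out the fields inside the struct. Here is an example: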

from pyspark.sql.functions import col, explode
from pyspark.sql.types import (
    ArrayType, FloatType, IntegerType, StringType, StructField, StructType
)

# Setup the test dataframe
data = [
    (686, [(686, 520.70), (645, 2.)]), 
    (685, [(685, 45.556), (678, 23.), (655, 21.)]), 
    (693, [])
]

schema = StructType([
    StructField("product_PK", StringType()),
    StructField("products", 
        ArrayType(StructType([
            StructField("_1", IntegerType()),
            StructField("col2", FloatType())
        ]))
    )
])

df = sqlCtx.createDataFrame(data, schema) 

# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]

# Do your explode using the field name
temp_dataframe = df.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))

Now, if you check the result, you get this:

>>> temp_dataframe.printSchema()
root
 |-- product_PK: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = true)
 |    |    |-- col2: float (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
 |    |-- col2: float (nullable = true)
 |-- score: float (nullable = true)

Is this what you want?

>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products                                                               |
+----------+-----------------------------------------------------------------------+
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693       |[]                                                                     |
+----------+-----------------------------------------------------------------------+

>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
...   .withColumn("exploded", F.col("exploded").getItem(1)) \
...   .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products                                                               |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |null    |
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |2       |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null    |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23      |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21      |
+----------+-----------------------------------------------------------------------+--------+

Assuming that the exploded column is a struct, as in
 |-- exploded: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
 |    |-- col2: float (nullable = true)

you can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded" , F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded."+temp_dataframe.select(F.col("exploded.*")).columns[1]))

You should get the following output:
+----------+--------------------------------------+------------+------+
|product_PK|products                              |exploded    |score |
+----------+--------------------------------------+------------+------+
|686       |[[686,520.7], [645,2.0]]              |[686,520.7] |520.7 |
|686       |[[686,520.7], [645,2.0]]              |[645,2.0]   |2.0   |
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0]  |23.0  |
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0]  |21.0  |
+----------+--------------------------------------+------------+------+

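I think that since a struct does not have to preserve the order of its fields (similar to a map), you would have to access it by finding the second item by type, i.e. integer. That might work. Also, you know the schema of the input DataFrame, so why not find the field name before the explode?

@JacekLaskowski: I can get the field names with temp_dataframe.schema.names. But how should I access the required fields in these schemas?

In both cases I get the error pyspark.sql.utils.AnalysisException: u"Field name should be String Literal, but it's 1;"

I am using Spark 2.2.0

If I run pyspark --version, I get Spark 2.2 and Scala 2.11.8

@Markus, that is strange. I am using Spark version 2.2.0.cloudera1 and Scala version 2.11.8. I really don't know why it doesn't work for me. RyanW's approach did not give me any errors. Let me test your solution.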