Python: How to get a column by index instead of name?

I have the following initial PySpark DataFrame:

+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+

which was created like this:

df = sqlCtx.createDataFrame(
    [(686, [[686, 520.70], [645, 2]]), (685, [[685, 45.556], [678, 23], [655, 21]]), (693, [])],
    ["product_PK", "products"]
)
The column products contains nested data and I need to extract the second value from each pair of values. I am running the following code:

temp_dataframe = dataframe.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well for this specific DataFrame. However, I would like to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure; the only difference is that the sub-column "_2" may be named differently in some of them, for example "col1" or "col2".

For example:

DataFrame content:
root
 |-- product_PK: long (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: long (nullable = true)
 |    |    |-- _2: double (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- _1: long (nullable = true)
 |    |-- _2: double (nullable = true)

DataFrame content:
root
 |-- product_PK: long (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- product_PK: long (nullable = true)
 |    |    |-- col2: integer (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- product_PK: long (nullable = true)
 |    |-- col2: integer (nullable = true)
I tried to use an index like getItem(1), but it says that the name of the column has to be provided.

Is there any way to avoid specifying the column name, or to somehow generalize this part of the code?


My goal is for exploded to contain the second value of each pair in the nested data, i.e. _2, col1, or col2.
It sounds like you're on the right track. I think the way to accomplish this is to read the schema to determine the name of the field to explode. Rather than schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Setup the test dataframe
data = [
    (686, [(686, 520.70), (645, 2.)]), 
    (685, [(685, 45.556), (678, 23.), (655, 21.)]), 
    (693, [])
]

schema = StructType([
    StructField("product_PK", StringType()),
    StructField("products", 
        ArrayType(StructType([
            StructField("_1", IntegerType()),
            StructField("col2", FloatType())
        ]))
    )
])

df = sqlCtx.createDataFrame(data, schema) 

# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]

# Do your explode using the field name
temp_dataframe = df.withColumn("exploded", explode(col("products"))) \
    .withColumn("score", col("exploded").getItem(target_field))
Now, if you check the result, you get this:

>>> temp_dataframe.printSchema()
root
 |-- product_PK: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = true)
 |    |    |-- col2: float (nullable = true)
 |-- exploded: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
 |    |-- col2: float (nullable = true)
 |-- score: float (nullable = true)
Is that what you want?
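
Since the goal was to put this into a function and run it on different DataFrames, here is a minimal sketch that wraps the schema lookup in a reusable helper. The function name, the array_col parameter, and the "score" output column are illustrative assumptions, not part of the original answer:

from pyspark.sql.functions import col, explode

def explode_second_field(dataframe, array_col="products"):
    """Explode an array-of-struct column and extract the struct's
    second field by position, whatever name it happens to have."""
    # Find the array field via schema.fields (schema.names only gives names).
    array_field = next(f for f in dataframe.schema.fields if f.name == array_col)
    # elementType is the struct; its .names lists the field names in order.
    second_field = array_field.dataType.elementType.names[1]
    return (dataframe
            .withColumn("exploded", explode(col(array_col)))
            .withColumn("score", col("exploded").getItem(second_field)))

temp_dataframe = explode_second_field(df)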

>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products                                                               |
+----------+-----------------------------------------------------------------------+
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693       |[]                                                                     |
+----------+-----------------------------------------------------------------------+

>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
...   .withColumn("exploded", F.col("exploded").getItem(1)) \
...   .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products                                                               |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |null    |
|686       |[WrappedArray(686, null), WrappedArray(645, 2)]                        |2       |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null    |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23      |
|685       |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21      |
+----------+-----------------------------------------------------------------------+--------+

Assuming the exploded column is a struct, as in

 |-- exploded: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
 |    |-- col2: float (nullable = true)

you can use the following logic to get the second element without knowing its name:

from pyspark.sql import functions as F

temp_dataframe = df.withColumn("exploded", F.explode(F.col("products")))
second_field = temp_dataframe.select(F.col("exploded.*")).columns[1]
temp_dataframe.withColumn("score", F.col("exploded." + second_field))
You should get the output as:

+----------+--------------------------------------+------------+------+
|product_PK|products                              |exploded    |score |
+----------+--------------------------------------+------------+------+
|686       |[[686,520.7], [645,2.0]]              |[686,520.7] |520.7 |
|686       |[[686,520.7], [645,2.0]]              |[645,2.0]   |2.0   |
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0]  |23.0  |
|685       |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0]  |21.0  |
+----------+--------------------------------------+------------+------+
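
As a side note, an equivalent way to find that second field name without selecting exploded.* is to read it straight from the DataFrame's schema. A small sketch, assuming exploded is a struct column as above (indexing schema by field name and the StructType .names attribute are standard PySpark, but verify on your version):

from pyspark.sql import functions as F

# schema["exploded"] returns the StructField; its dataType is the struct,
# whose .names holds the field names in declaration order.
second_name = temp_dataframe.schema["exploded"].dataType.names[1]
result = temp_dataframe.withColumn("score", F.col("exploded." + second_name))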

I think that since a struct does not have to preserve the order of its fields (similar to a map), you would have to access the second item by looking it up by type, i.e. integer. That might work. Also, you know the schema of the input DataFrame, so why not find the field name before the explode?

@JacekLaskowski: I can get the field names using temp_dataframe.schema.names, but how should I access the required field in those schemas? In both cases I get the error pyspark.sql.utils.AnalysisException: u"Field name should be String Literal, but it's 1;". I am using Spark 2.2.0.

If I run pyspark --version, I get Spark 2.2 and Scala 2.11.8.

@Markus, that's strange. I am using Spark version 2.2.0.cloudera1 and Scala version 2.11.8.

I really don't know why it doesn't work for me. RyanW's approach doesn't give me any errors. Let me test your solution.
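
For completeness, here is a small sketch of the by-type lookup suggested in the first comment above: instead of addressing the struct field by position or name, find the field whose type matches what you want (integer, for the second schema shown earlier). This is an illustration under that assumption, not a tested answer from the thread; the variable names are hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Pick the first field of the exploded struct whose type is IntegerType,
# rather than relying on field order or a hard-coded name.
struct_type = temp_dataframe.schema["exploded"].dataType
target = next(f.name for f in struct_type.fields if isinstance(f.dataType, IntegerType))
result = temp_dataframe.withColumn("score", F.col("exploded." + target))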