Python 3.x: Add a new column to a DataFrame whose value is based on a condition in PySpark


python-3.x dataframe apache-spark pyspark


I have JSON data like the following:

    {"images": [
        {"alt": null, "src": "link_1"},
        {"alt": null, "src": "link_2"},
        {"alt": "Apple", "src": "link_3"},
        {"alt": null, "src": "link_4"}
    ]}
    {"images": [
        {"alt": "Orange", "src": "link_1"},
        {"alt": null, "src": "link_2"}
    ]}

I need to add a new column to the DataFrame whose value is taken from src, subject to the following conditions:

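Never assign the value at the first position (for example, link_1).

If an alt is not NULL, assign the corresponding src to the new column. If several alt values are non-NULL, pick the first one after position 1.

If every alt is NULL, assign the src at the second position to the new column.

Note: images always contains more than one element.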

For the example above, the expected output is:

+--------------------+
|      new column    |
+--------------------+
|link_3              |
|link_2              |
+--------------------+

Can anyone help me get the expected output? Thanks in advance.

I solved this today:

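    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Written as a plain function here; inside a class, make it a staticmethod.
    def extractSecondaryImageUrl(images):
        # Spark passes an array<struct> column to a Python UDF as a list of Rows.
        if not images:
            return ''
        if len(images) == 1:
            # Only one image: fall back to its src.
            return images[0]['src']
        # Never consider the first image; slicing leaves the input unmodified.
        rest = images[1:]
        for image in rest:
            if image['alt'] is not None:
                return image['src']
        # Every remaining alt is NULL: use the src at the second position.
        return rest[0]['src']

    extractURL = udf(extractSecondaryImageUrl, StringType())

    productsDF = productsDF.select("*", extractURL("images").alias('new_column'))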
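The selection rule can be checked without a Spark session by running the same logic on plain Python dicts. The sketch below is only an illustration: `pick_secondary_src` is a hypothetical helper name, and the dicts stand in for the Row objects that Spark would pass to the UDF.

```python
def pick_secondary_src(images):
    """Hypothetical helper: same selection rule as the UDF, on plain dicts."""
    if not images:
        return ''
    if len(images) == 1:
        # Single image: fall back to its src.
        return images[0]['src']
    rest = images[1:]  # the first image is never a candidate
    for image in rest:
        if image['alt'] is not None:
            return image['src']
    # No alt set after position 1: use the second image.
    return rest[0]['src']


# The two sample records from the question.
record_1 = [
    {"alt": None, "src": "link_1"},
    {"alt": None, "src": "link_2"},
    {"alt": "Apple", "src": "link_3"},
    {"alt": None, "src": "link_4"},
]
record_2 = [
    {"alt": "Orange", "src": "link_1"},
    {"alt": None, "src": "link_2"},
]

results = [pick_secondary_src(record_1), pick_secondary_src(record_2)]
print(results)  # ['link_3', 'link_2']
```

In the second record, "Orange" sits at position 1 and is skipped, so the result falls back to link_2, matching the expected output.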

Can you post the expected output?

Yes, sure @Srinivaso.. You must...

How did you do it? What is your problem?

Updated...