Python 3.x: Introduce a new column in a DataFrame whose value is based on a condition, in PySpark
I have JSON data like the following:
{"images": [
    {"alt": null, "src": "link_1"},
    {"alt": null, "src": "link_2"},
    {"alt": "Apple", "src": "link_3"},
    {"alt": null, "src": "link_4"}
]}
{"images": [
    {"alt": "Orange", "src": "link_1"},
    {"alt": null, "src": "link_2"}
]}
I need to introduce a new column in the DataFrame whose value is one of the src values from each images array, chosen by a condition, so that the output looks like this:
+--------------------+
| new column |
+--------------------+
|link_3 |
|link_2 |
+--------------------+
Can anyone help me get the expected output? Thanks in advance.

I solved this myself today:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def extractSecondaryImageUrl(self, *htmlValue):
    for element in htmlValue:
        if len(element) == 0:
            return ''
        if len(element) >= 2:
            element.pop(0)                 # skip the first image
            for x in element:
                if x['alt'] is not None:   # first remaining image with a non-null alt wins
                    return x['src']
            a = element.pop(0)             # fallback: first remaining image
            return a['src']
        else:
            a = element.pop(0)             # single image: just take its src
            return a['src']

extractURL = udf(self.extractSecondaryImageUrl, StringType())
productsDF = productsDF.select("*", extractURL("images").alias('new_column'))
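For reference, the per-row selection rule that the UDF implements can be sketched in plain Python, without Spark. This is a minimal re-statement of the logic as I read it from the UDF (the function name `pick_secondary_src` is my own, not from the original code): skip the first image, return the src of the first remaining image with a non-null alt, and otherwise fall back to the first remaining image's src.

```python
def pick_secondary_src(images):
    """Mirror of the UDF's logic for a single 'images' array."""
    if not images:
        return ''
    if len(images) >= 2:
        rest = images[1:]          # skip the first image
        for img in rest:
            if img['alt'] is not None:
                return img['src']  # first remaining image with a non-null alt
        return rest[0]['src']      # fallback: first remaining image
    return images[0]['src']        # single image: just take its src

record1 = [
    {"alt": None, "src": "link_1"},
    {"alt": None, "src": "link_2"},
    {"alt": "Apple", "src": "link_3"},
    {"alt": None, "src": "link_4"},
]
record2 = [
    {"alt": "Orange", "src": "link_1"},
    {"alt": None, "src": "link_2"},
]

print(pick_secondary_src(record1))  # link_3
print(pick_secondary_src(record2))  # link_2
```

Running this on the two sample records reproduces the expected column values, link_3 and link_2.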