How do I parse and explode a list of dictionaries stored as a string in PySpark?

Tags: apache-spark, pyspark

I have some data stored in a CSV file. Sample data is available here -

I read the data using PySpark:

df = spark.read.csv("data.csv",header=True)
df.printSchema()
root
 |-- freeform_text: string (nullable = true)
 |-- entity_object: string (nullable = true)

>>> df.show(truncate=False)
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|freeform_text                    |entity_object                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Grapes are good. Bananas are bad.|[{'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}}, {'name': 'Bananas', 'type': 'OTHER', 'salience': '0.16648370027542114', 'sentiment_score': '-0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Bananas', 'mention_type': 'COMMON'}}]|
|the weather is not good today    |[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}]                                                                                                                                                                                                                                                                 |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now, I want to explode and parse the fields in the entity_object column of this DataFrame. Here is some more technical detail on what this column contains -

For every freeform_text stored in the Spark DataFrame, I wrote some logic to parse entities using Google's Natural Language API. When I ran this computation with pandas, the entities were stored as a list of dictionaries, which I then converted to a string before storing it in the database.
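For illustration, a minimal sketch of that stringification step, assuming a pandas DataFrame with the two columns described (the entity values here are abbreviated sample data, not real API output):

import pandas as pd

pdf = pd.DataFrame({
    "freeform_text": ["the weather is not good today"],
    "entity_object": [[{"name": "weather", "type": "OTHER", "salience": "1.0"}]],
})
# Convert the list-of-dicts column to its string representation before persisting.
pdf["entity_object"] = pdf["entity_object"].astype(str)
print(pdf.loc[0, "entity_object"])  # [{'name': 'weather', 'type': 'OTHER', 'salience': '1.0'}]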

This CSV is what I read into the Spark DataFrame, with two columns: freeform_text and entity_object.

The entity_object column, although stored as a string, is really a list of dictionaries. You can picture it as list [dict1, dict2] and so on. Depending on how many entities the API returns, some entity_object rows may have one element and others may have more than one. For example, the first row contains two entities - Grapes and Bananas - while the second row contains only the entity weather.
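Since each cell is just the str() of a Python list, one quick way to confirm the structure outside Spark is ast.literal_eval (an illustrative check, not part of the pipeline):

import ast

cell = "[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'metadata': {}}]"
entities = ast.literal_eval(cell)
print(len(entities))        # 1
print(entities[0]['name'])  # weather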

I want to explode this entity_object column so that one record of freeform_text can fan out into multiple records.
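As a minimal illustration of the explode behaviour being asked for (assuming an active SparkSession named spark, as in the snippets above), one input row with an array column becomes one output row per array element:

from pyspark.sql import functions as F

demo = spark.createDataFrame(
    [("Grapes are good. Bananas are bad.", ["Grapes", "Bananas"])],
    ["freeform_text", "entities"],
)
# Each element of the entities array becomes its own row.
demo.withColumn("entity", F.explode("entities")).show(truncate=False)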

Below is a screenshot of the output I am hoping for -


Would this be a working solution for you - let me know if it does not work -

Create the DataFrame here

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Each row is the string representation of a Python list of dictionaries,
# mirroring the entity_object column described in the question.
df_new = spark.createDataFrame([
  str([
    {'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}},
    {'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}
  ]),
  str([
    {'name': 'banana', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}
  ])
], T.StringType())
The logic here

# Parse the outer JSON array so each dictionary becomes one string element.
df = df_new.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
# Produce one row per dictionary in the array.
df = df.withColumn('explode_col', F.explode("col"))
# Parse each dictionary string into a map of string keys and string values.
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
# Lift the individual map entries into their own columns.
df = (df
      .withColumn("name", F.col("col").getItem("name"))
      .withColumn("type", F.col("col").getItem("type"))
      .withColumn("salience", F.col("col").getItem("salience"))
      .withColumn("sentiment_score", F.col("col").getItem("sentiment_score"))
      .withColumn("sentiment_magnitude", F.col("col").getItem("sentiment_magnitude"))
      .withColumn("mentions", F.col("col").getItem("mentions")))
df.select("name", "type", "salience", "sentiment_score", "sentiment_magnitude", "mentions").show(truncate=False)
Output

+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|name   |type |salience          |sentiment_score   |sentiment_magnitude|mentions                                          |
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|weather|OTHER|1.0               |-0.800000011920929|0.800000011920929  |{"mention_text":"weather","mention_type":"COMMON"}|
|Grapes |OTHER|0.8335162997245789|0.8999999761581421|0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|banana |OTHER|1.0               |-0.800000011920929|0.800000011920929  |{"mention_text":"weather","mention_type":"COMMON"}|
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
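A note on why this works: the first from_json call, using ArrayType(StringType()), splits the outer JSON array so that each dictionary becomes one string element; after the explode, the second from_json call, using MapType(StringType(), StringType()), turns each dictionary string into a key-value map. Spark's JSON parser accepts single-quoted field names and values by default (the allowSingleQuotes option), which is why the Python str() representation of the dictionaries parses at all.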
Update - instead of createDataFrame, use spark.read.csv() as shown below

df_new = spark.read.csv("/FileStore/tables/data.csv", header=True)
df_new.show(truncate=False)
# Same logic as above, applied to the entity_object column read from the CSV.
df = df_new.withColumn('col', F.from_json("entity_object", T.ArrayType(T.StringType())))
df = df.withColumn('explode_col', F.explode("col"))
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
df = (df
      .withColumn("name", F.col("col").getItem("name"))
      .withColumn("type", F.col("col").getItem("type"))
      .withColumn("salience", F.col("col").getItem("salience"))
      .withColumn("sentiment_score", F.col("col").getItem("sentiment_score"))
      .withColumn("sentiment_magnitude", F.col("col").getItem("sentiment_magnitude"))
      .withColumn("mentions", F.col("col").getItem("mentions")))
df.select("freeform_text", "name", "type", "salience", "sentiment_score", "sentiment_magnitude", "mentions").show(truncate=False)


+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|freeform_text                    |name   |type |salience           |sentiment_score    |sentiment_magnitude|mentions                                          |
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|Grapes are good. Bananas are bad.|Grapes |OTHER|0.8335162997245789 |0.8999999761581421 |0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|Grapes are good. Bananas are bad.|Bananas|OTHER|0.16648370027542114|-0.8999999761581421|0.8999999761581421 |{"mention_text":"Bananas","mention_type":"COMMON"}|
|the weather is not good today    |weather|OTHER|1.0                |-0.800000011920929 |0.800000011920929  |{"mention_text":"weather","mention_type":"COMMON"}|
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
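As a variation, here is a sketch of the same pipeline using an explicit struct schema instead of MapType, so each entity field is declared up front (the field names below are assumed from the sample data; metadata is simply left out of the schema, so Spark ignores it):

entity_schema = T.ArrayType(T.StructType([
    T.StructField("name", T.StringType()),
    T.StructField("type", T.StringType()),
    T.StructField("salience", T.StringType()),
    T.StructField("sentiment_score", T.StringType()),
    T.StructField("sentiment_magnitude", T.StringType()),
    T.StructField("mentions", T.MapType(T.StringType(), T.StringType())),
]))

# Parse once with the full schema, explode, and expand the struct into columns.
df = (df_new
      .withColumn("entity", F.explode(F.from_json("entity_object", entity_schema)))
      .select("freeform_text", "entity.*"))
df.show(truncate=False)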

Have you tried using ?

Thanks @dsk - can we also retain the original columns, like freeform_text in my example? I would like to carry that column through.

Yes, we can.. just pass the values along when creating the DataFrame... ideally, if you are reading a CSV file or similar, there will be no issue at all. If the answer helped you, could you please accept and upvote it?
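For reference, a minimal sketch of what that comment suggests, with freeform_text passed alongside entity_object when building the DataFrame manually (sample values taken from the question):

df_new = spark.createDataFrame(
    [("the weather is not good today",
      str([{'name': 'weather', 'type': 'OTHER', 'salience': '1.0'}]))],
    ["freeform_text", "entity_object"],
)
# The parsing logic above then carries freeform_text through the explode.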