Apache Spark: How to parse and explode a list of dictionaries stored as a string in PySpark?
I have some data stored in a CSV file. Sample data is available here - I read the data using pyspark
df = spark.read.csv("data.csv",header=True)
df.printSchema()
root
|-- freeform_text: string (nullable = true)
|-- entity_object: string (nullable = true)
>>> df.show(truncate=False)
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|freeform_text |entity_object |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Grapes are good. Bananas are bad.|[{'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}}, {'name': 'Bananas', 'type': 'OTHER', 'salience': '0.16648370027542114', 'sentiment_score': '-0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Bananas', 'mention_type': 'COMMON'}}]|
|the weather is not good today |[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}] |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now I want to explode and parse the fields in the entity_object column of this DataFrame. Here is some more background on what this column contains -
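For every freeform_text stored in the Spark DataFrame, I wrote some logic that parses entities using Google's Natural Language API. When I do the computation with pandas, these entities are stored as a list of dictionaries. I then convert them to a string before storing them in a database.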
This CSV is what I read into the Spark DataFrame as two columns: freeform_text and entity_object.
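The entity_object column, although stored as a string, is actually a list of dictionaries. You can picture it as a list [DICT1, DICT2] and so on. So, depending on the number of entities in the output, some entity_object rows may have one element and others more than one. For example, the first row has two entities, Grapes and Bananas, while the second row has only the entity weather.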
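As a minimal sketch of how such a cell comes about, the snippet below round-trips a list of dictionaries through a string. It assumes the entities were serialized with Python's built-in str(), which the single-quoted keys in the sample suggest. The helper names parse_entity_object and is_strict_json are hypothetical, and the values are shortened for readability.

```python
import ast
import json

# Two entities shaped like the first sample row (values shortened).
entities = [
    {"name": "Grapes", "type": "OTHER", "salience": "0.83", "metadata": {},
     "mentions": {"mention_text": "Grapes", "mention_type": "COMMON"}},
    {"name": "Bananas", "type": "OTHER", "salience": "0.17", "metadata": {},
     "mentions": {"mention_text": "Bananas", "mention_type": "COMMON"}},
]

# str() yields a Python literal with single quotes, not strict JSON.
entity_object = str(entities)


def parse_entity_object(text):
    """Hypothetical helper: turn the stored string back into a list of dicts."""
    return ast.literal_eval(text)


def is_strict_json(text):
    """Return True if the standard json module can parse the text as-is."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


parsed = parse_entity_object(entity_object)
print(len(parsed), [entity["name"] for entity in parsed])
print(is_strict_json(entity_object))
```

A strict JSON parser rejects the single-quoted form, so whatever reads this column has to tolerate Python-style quoting.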
I want to explode this entity_object column so that one record of freeform_text can be split into multiple records.
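The intended reshaping can be sketched in plain Python, independent of Spark: each input record yields one output record per entity, and freeform_text is repeated on every one. The function name explode_entities is hypothetical, ast.literal_eval is assumed to be an acceptable parser for this data, and the sample values are shortened.

```python
import ast


def explode_entities(rows):
    """Yield one flat record per entity, repeating freeform_text on each."""
    for row in rows:
        for entity in ast.literal_eval(row["entity_object"]):
            yield {"freeform_text": row["freeform_text"], **entity}


rows = [
    {"freeform_text": "Grapes are good. Bananas are bad.",
     "entity_object": "[{'name': 'Grapes', 'type': 'OTHER'}, "
                      "{'name': 'Bananas', 'type': 'OTHER'}]"},
    {"freeform_text": "the weather is not good today",
     "entity_object": "[{'name': 'weather', 'type': 'OTHER'}]"},
]

exploded = list(explode_entities(rows))
for record in exploded:
    print(record)
```

Two input records become three output records here, one per entity, which is the shape the question asks for.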
Below is a screenshot of the output I would like to get -
Here is a working solution for you. Let me know if it doesn't work. Create the DataFrame here
from pyspark.sql import functions as F, types as T

# Sample rows; the single string column is named "value" by default
df_new=spark.createDataFrame([
{
str({'name':'Grapes','type':'OTHER','salience':'0.8335162997245789','sentiment_score':'0.8999999761581421','sentiment_magnitude':'0.8999999761581421','metadata':{},'mentions':{'mention_text':'Grapes','mention_type':'COMMON'}}),
str(
{'name':'weather','type':'OTHER','salience':'1.0','sentiment_score':'-0.800000011920929','sentiment_magnitude':'0.800000011920929','metadata':{},'mentions':{'mention_text':'weather','mention_type':'COMMON'}}
)
},
{
str(
{'name':'banana','type':'OTHER','salience':'1.0','sentiment_score':'-0.800000011920929','sentiment_magnitude':'0.800000011920929','metadata':{},'mentions':{'mention_text':'weather','mention_type':'COMMON'}}
)
}
],T.StringType())
Logic here
# Parse the string into an array of strings, one per entity
df = df_new.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
# Produce one row per entity
df = df.withColumn('explode_col', F.explode("col"))
# Parse each entity string into a map with string keys and string values
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
# Pull the fields of interest out of the map into their own columns
fields = ["name", "type", "salience", "sentiment_score", "sentiment_magnitude", "mentions"]
for field in fields:
    df = df.withColumn(field, F.col("col").getItem(field))
df.select(*fields).show(truncate=False)
Output
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|name |type |salience |sentiment_score |sentiment_magnitude|mentions |
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|weather|OTHER|1.0 |-0.800000011920929|0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
|Grapes |OTHER|0.8335162997245789|0.8999999761581421|0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|banana |OTHER|1.0 |-0.800000011920929|0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
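To make the two parsing passes easier to follow, here is a plain-Python analogue of the steps above. It only illustrates the data flow and is not how Spark implements from_json: the helper names are hypothetical, and ast.literal_eval stands in for lenient parsing of the single-quoted text.

```python
import ast
import json


def to_json_text(value):
    """Render a nested value as compact JSON text; leave plain strings alone."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))


def split_into_elements(cell):
    """Pass 1: string -> list of strings, one JSON text per entity."""
    return [to_json_text(entity) for entity in ast.literal_eval(cell)]


def element_to_map(element):
    """Pass 2: one entity string -> dict whose values are all strings."""
    return {key: to_json_text(val) for key, val in json.loads(element).items()}


cell = ("[{'name': 'weather', 'salience': '1.0', 'metadata': {}, "
        "'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}]")

elements = split_into_elements(cell)  # an explode step would emit one row per element
maps = [element_to_map(element) for element in elements]
print(maps[0]["name"], maps[0]["mentions"])
```

Because every map value is a string, a nested object such as mentions comes out as JSON text, which matches the mentions column in the output above.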
Update: instead of createDataFrame, use spark.read.csv() as shown below
df_new = spark.read.csv("/FileStore/tables/data.csv", header=True)
df_new.show(truncate=False)
# Logic Here
# Parse the string into an array of strings, one per entity
df = df_new.withColumn('col', F.from_json("entity_object", T.ArrayType(T.StringType())))
# Produce one row per entity; freeform_text is repeated on each row
df = df.withColumn('explode_col', F.explode("col"))
# Parse each entity string into a map with string keys and string values
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
# Pull the fields of interest out of the map into their own columns
fields = ["name", "type", "salience", "sentiment_score", "sentiment_magnitude", "mentions"]
for field in fields:
    df = df.withColumn(field, F.col("col").getItem(field))
df.select("freeform_text", *fields).show(truncate=False)
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|freeform_text |name |type |salience |sentiment_score |sentiment_magnitude|mentions |
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|Grapes are good. Bananas are bad.|Grapes |OTHER|0.8335162997245789 |0.8999999761581421 |0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|Grapes are good. Bananas are bad.|Bananas|OTHER|0.16648370027542114|-0.8999999761581421|0.8999999761581421 |{"mention_text":"Bananas","mention_type":"COMMON"}|
|the weather is not good today |weather|OTHER|1.0 |-0.800000011920929 |0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
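Have you tried using ?
Thanks, @dsk. Can we keep the original column, like freeform_text in my example? I would like to carry that column along.
Yes, we can. Just pass the value when creating the DataFrame. Ideally, if you are reading from a CSV file or something similar, there won't be any problem. If the answer helped you, could you accept and upvote it?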