PySpark: Unable to write structs (DF -> Parquet)

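I have a data preprocessing pipeline that cleans data from tens of thousands of tweets. I would like to save my DataFrame in stages so that I can load these "save points" from later stages of the pipeline. I have read that saving a DataFrame in Parquet format is the most "efficient" way to write it, because it is fast, scalable, and so on. That is ideal for me, since I am trying to keep scalability in mind for this project.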

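However, I have run into a problem where I cannot seem to save fields that contain structs to a file. When I try to write the DataFrame out, I get a JSON error

json.decoder.JSONDecodeError: Expecting ',' delimiter: …

(more details below).

My DataFrame currently has the following format: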

+------------------+----------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------+
|                id| timestamp|          tweet_text|      tweet_hashtags|tweet_media|          tweet_urls|               topic|          categories|priority|
+------------------+----------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------+
|266269932671606786|1536170446|Eight dead in the...|                  []|         []|                  []|guatemalaEarthqua...|[Report-EmergingT...|     Low|
|266804609954234369|1536256997|Guys, lets help ... |[[Guatemala, [72,...|         []|[[http:url...       |guatemalaEarthqua...|[CallToAction-Don...|  Medium|
|266250638852243457|1536169939|My heart goes out...|[[Guatemala, [31,...|         []|                  []|guatemalaEarthqua...|[Report-EmergingT...|  Medium|
|266381928989589505|1536251780|Strong earthquake...|                  []|         []|[[http:url...       |guatemalaEarthqua...|[Report-EmergingT...|  Medium|
|266223346520297472|1536167235|Magnitude 7.5 Qua...|                  []|         []|                  []|guatemalaEarthqua...|[Report-EmergingT...|  Medium|
+------------------+----------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------+
only showing top 5 rows

For clarity, here is the schema:

root
 |-- id: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- tweet_hashtags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = false)
 |    |    |-- indices: array (nullable = false)
 |    |    |    |-- element: integer (containsNull = true)
 |-- tweet_media: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- type: string (nullable = false)
 |    |    |-- url: string (nullable = true)
 |    |    |-- media_url: string (nullable = true)
 |    |    |-- media_https: string (nullable = true)
 |    |    |-- display_url: string (nullable = true)
 |    |    |-- expanded_url: string (nullable = true)
 |    |    |-- indices: array (nullable = false)
 |    |    |    |-- element: integer (containsNull = true)
 |-- tweet_urls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- url: string (nullable = false)
 |    |    |-- display_url: string (nullable = true)
 |    |    |-- expanded_url: string (nullable = true)
 |    |    |-- indices: array (nullable = false)
 |    |    |    |-- element: integer (containsNull = true)
 |-- topic: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- priority: string (nullable = true)

I am trying to save this DataFrame in Parquet format with the following lines:

df.write.mode('overwrite').save(
    path=f'{DATA_DIR}/interim/feature_select.parquet',
    format='parquet')

I have also tried

df.write.parquet(f'{DATA_DIR}/interim/feature_select.parquet', mode='overwrite')

However, when trying to save the files, I get the error json.decoder.JSONDecodeError: Expecting ',' delimiter: … with the following traceback:

  File "features.py", line 207, in <lambda>
    entities_udf = F.udf(lambda s: _convert_str_to_arr(s), v)
  File "features.py", line 194, in _convert_str_to_arr
    arr = [json.loads(x) for x in arr]
  File "features.py", line 194, in <listcomp>
    arr = [json.loads(x) for x in arr]
  File "/media/ntfs/anaconda3/envs/py37/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/media/ntfs/anaconda3/envs/py37/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/media/ntfs/anaconda3/envs/py37/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 93 (char 92)

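The lines in the traceback also refer to an earlier UDF transformation that I applied to a number of columns (cols tweet.*). That transformation works fine when I remove the writer.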

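I could not find much information about specifying a delimiter for Parquet files. Is that possible? Or do I have to serialize any data that contains commas? Or do I even need to convert the Spark structs that I parsed and modified back to JSON in order to save the file?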

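This error is actually unrelated to Parquet. Transformations on a DataFrame are evaluated lazily: they are only applied once an action is executed (in this case, saving to Parquet). The error therefore does not surface until that point.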
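As a rough plain-Python analogy (not Spark itself), the built-in map() behaves in a similar way: it only records the work to do, and a bad input raises only when the result is consumed. The sample strings below are hypothetical and are not taken from the question's data.

```python
import json

# Hypothetical input: the second string is missing a comma between its fields.
rows = ['{"text": "ok"}', '{"text": "broken" "indices": [1, 2]}']

# Like a Spark transformation, map() is lazy: nothing is parsed on this line.
parsed = map(json.loads, rows)

# Consuming the iterator plays the role of a Spark action such as write.parquet().
try:
    results = list(parsed)
    error = None
except json.JSONDecodeError as exc:
    results = None
    error = exc
    print(f"Raised only when consumed: {exc}")
```

In the same way, removing the writer from the pipeline does not make the UDF correct; it only means the UDF is never actually run on the offending rows.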

From the traceback we can see that the actual problem is this line:

arr = [json.loads(x) for x in arr]

which runs inside the UDF transformation.
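One way to find out which value breaks that line is to wrap the parsing so that the failing string is reported along with the error. This is only a sketch: parse_json_items is a hypothetical helper, and it assumes, as the traceback suggests, that arr is a list of JSON strings. The body of the real _convert_str_to_arr is not shown in the question.

```python
import json

def parse_json_items(items):
    """Parse a list of JSON strings, naming the offending string on failure."""
    parsed = []
    for position, raw in enumerate(items):
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Re-raise with the raw input attached so the bad value is visible
            # in the Spark executor's traceback.
            raise ValueError(
                f"item {position} is not valid JSON "
                f"({exc.msg} at char {exc.pos}): {raw!r}"
            ) from exc
    return parsed
```

Calling this in place of the list comprehension should leave valid rows unchanged while making the failing input visible in the error message.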

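A json.decoder.JSONDecodeError is raised when there is a problem with the JSON being parsed. Two common causes are that the string is not valid JSON, or that there is a quoting problem. So:

- Confirm that the column contains valid JSON.
- Try replacing each \ with \\, which can be done with x.replace("\\", r"\\").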
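Both suggestions can be tried on a single value before touching the pipeline. The sketch below uses a hypothetical is_valid_json helper and made-up sample strings. It also shows a caveat: doubling every backslash repairs a lone backslash, but it breaks strings that already contain correctly escaped quotes.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical value containing a lone backslash: {"text": "C:\path"}
raw = '{"text": "C:\\path"}'
fixed = raw.replace("\\", r"\\")  # every backslash is doubled

raw_ok = is_valid_json(raw)        # False: \p is not a legal JSON escape
fixed_ok = is_valid_json(fixed)    # True
decoded = json.loads(fixed)["text"]

# Caveat: a string with properly escaped quotes is valid before the
# replacement and invalid after it: {"text": "say \"hi\""}
escaped = '{"text": "say \\"hi\\""}'
before_ok = is_valid_json(escaped)
after_ok = is_valid_json(escaped.replace("\\", r"\\"))

print(raw_ok, fixed_ok, decoded, before_ok, after_ok)
```

Because of that caveat, it is probably safer to check first which rows fail and why, and to apply the replacement only if the failures really are caused by unescaped backslashes.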

Did you manage to solve this? Was it related to the JSON, as I assumed?