Python: Parsing a custom CSV header in PySpark

python csv apache-spark parsing pyspark


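I am trying to read a CSV file as a stream in PySpark. However, the file begins with a custom header that precedes the actual multi-line CSV header. This custom header contains important information about the contents of the file.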

Example CSV file:

"custom-header-start"
"string of custom header"
"another string of custom header"
...
"custom-header-end"
"actual-csv-header line 1"
...
"actual-csv-header line n"
1;5;9;"any string"; 98.7;....
1;8;6;"any string"; 87.7;....
4;2;4;"any string"; 67.7;....
....
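For a file read outside of Spark, a layout like the one above could be split along the lines of the following minimal sketch. The names `split_sections`, `n_custom`, and `n_csv_header` are hypothetical, and the sketch assumes that the number of custom header lines and of CSV header lines is known in advance and that `;` is the field delimiter, as in the sample rows.

```python
import csv
from typing import List, NamedTuple


class SplitFile(NamedTuple):
    custom_header: List[str]   # lines of the custom header, quotes removed
    csv_header: List[str]      # lines of the actual (multi-line) CSV header
    rows: List[List[str]]      # parsed data rows


def split_sections(lines, n_custom, n_csv_header, delimiter=";"):
    """Split raw lines into custom header, CSV header, and parsed data rows.

    n_custom and n_csv_header are assumptions supplied by the caller;
    nothing here detects the section boundaries automatically.
    """
    def unquote(line):
        # Header lines in the sample are wrapped in double quotes.
        return line.strip().strip('"')

    custom_header = [unquote(line) for line in lines[:n_custom]]
    csv_header = [unquote(line) for line in lines[n_custom:n_custom + n_csv_header]]

    # skipinitialspace drops the blank that follows a delimiter, e.g. "; 98.7".
    reader = csv.reader(
        lines[n_custom + n_csv_header:],
        delimiter=delimiter,
        quotechar='"',
        skipinitialspace=True,
    )
    rows = [row for row in reader if row]  # ignore empty lines
    return SplitFile(custom_header, csv_header, rows)
```

All values come back as strings; converting them to numbers is left to the caller because the column types are not specified in the sample.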

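I know that the custom header always occupies the first 9 lines. In a batch job I would therefore fetch it with df.head(9) and parse it in plain Python to extract the relevant information. In a stream, however, calling df.head(9) would create a branch in the query, which Structured Streaming does not allow. How would you solve this problem of parsing a custom header before reading the actual data of the file? Is there any practical solution or workaround?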

Thanks in advance.
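One possible workaround, sketched here under several assumptions rather than offered as a definitive answer: stream the files as raw text lines, tag each line with its source file, and do the header parsing inside `foreachBatch`, where each micro-batch is an ordinary DataFrame and actions are permitted. The helper names (`uri_to_local_path`, `read_custom_header`, `process_batch`, `start_stream`) and the `source_file` column are hypothetical. The sketch assumes the files live on a filesystem that the driver can open directly, and it separates header lines from data rows with a heuristic taken from the sample (header lines start with a double quote, data rows do not), which may not hold for real files.

```python
from itertools import islice
from typing import List
from urllib.parse import unquote, urlparse

CUSTOM_HEADER_LINES = 9  # the question states the custom header is always 9 lines


def uri_to_local_path(uri: str) -> str:
    """Convert a file URI (as reported by input_file_name()) to a local path."""
    return unquote(urlparse(uri).path)


def read_custom_header(path: str, n_lines: int = CUSTOM_HEADER_LINES) -> List[str]:
    """Read only the first n_lines of a file and strip the surrounding quotes."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().strip('"') for line in islice(handle, n_lines)]


def process_batch(batch_df, batch_id):
    """Called once per micro-batch; batch_df is a regular, non-streaming DataFrame."""
    uris = [
        row.source_file
        for row in batch_df.select("source_file").distinct().collect()
    ]
    for uri in uris:
        # Assumption: the driver can open the file directly.
        header = read_custom_header(uri_to_local_path(uri))
        # Heuristic from the sample: header lines start with a quote, data rows do not.
        data_lines = batch_df.filter(
            (batch_df.source_file == uri) & ~batch_df.value.startswith('"')
        )
        print(batch_id, uri, header, data_lines.count())


def start_stream(spark, input_dir: str):
    """Stream files as raw text so the CSV parser never sees the header lines."""
    # Imported lazily so the pure-Python helpers above work without Spark installed.
    from pyspark.sql.functions import input_file_name

    lines = (
        spark.readStream.format("text")
        .load(input_dir)
        .withColumn("source_file", input_file_name())
    )
    return lines.writeStream.foreachBatch(process_batch).start()
```

The remaining `value` column still holds unparsed `;`-separated text; splitting it into typed columns would be a further step inside `process_batch`. For files on a distributed store that the driver cannot open with `open()`, the header would have to be fetched another way, for example with a batch read of the same path inside `process_batch`.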