Python: How do I read a text file and apply a schema with PySpark?


python apache-spark pyspark


The .txt file looks like this:

1234567813572468
1234567813572468
1234567813572468
1234567813572468
1234567813572468

When I read it in and split it into 3 separate columns, I get exactly the result I expect:

df = spark.read.option("header", "false")\
               .option("inferSchema", "true")\
               .text("fixed-width-2.txt")

sorted_df = df.select(
    df.value.substr(1, 4).alias('col1'),
    df.value.substr(5, 4).alias('col2'),
    df.value.substr(8, 4).alias('col3'),
).show()

However, if I read it again and apply a schema:

from pyspark.sql.types import *

schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])
df_new = spark.read.csv("fixed-width-2.txt", schema=schema)
df_new.printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: integer (nullable = true)

the data from the file is lost:

df_new.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
+----+----+----+

So my question is: how can I read this text file and apply a schema?

When you read with a schema that declares col1 as int, the value 1234567813572468 exceeds the maximum int value. Read it as LongType instead.
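To make the size argument concrete, here is a minimal plain-Python sketch (no Spark required) that compares the full line against the 32-bit and 64-bit signed ranges that Spark's IntegerType and LongType use. The helper names `fits_int` and `fits_long` are hypothetical and exist only for this illustration.

```python
# Bounds of a 32-bit signed integer (IntegerType) and a 64-bit signed integer (LongType).
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def fits_int(value: int) -> bool:
    """Return True if value fits in a 32-bit signed integer."""
    return INT_MIN <= value <= INT_MAX


def fits_long(value: int) -> bool:
    """Return True if value fits in a 64-bit signed integer."""
    return LONG_MIN <= value <= LONG_MAX


line = "1234567813572468"
whole_line = int(line)

print(fits_int(whole_line))   # False: the whole line is too large for IntegerType
print(fits_long(whole_line))  # True: it fits in LongType
# Each 4-character field on its own is small enough for IntegerType.
print(all(fits_int(int(line[i:i + 4])) for i in (0, 4, 8)))  # True
```

This is why the whole line needs LongType, while the individual 4-digit fields can stay IntegerType once the line has been split.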

Using the RDD API:

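An easier way is to read the fixed-width file with .textFile (which results in an RDD), apply the transformations with .map, and then convert to a DataFrame using the schema.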
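The per-line transformation passed to .map can be written as an ordinary Python function and checked without a Spark session. The sketch below assumes every field is an integer and that the field widths are known in advance; `parse_fixed_width` and its `widths` parameter are hypothetical names, not part of PySpark.

```python
from itertools import accumulate
from typing import Sequence, Tuple


def parse_fixed_width(line: str, widths: Sequence[int]) -> Tuple[int, ...]:
    """Split one fixed-width line into integer fields of the given widths."""
    ends = list(accumulate(widths))   # e.g. [4, 8, 12] for widths [4, 4, 4]
    starts = [0] + ends[:-1]          # e.g. [0, 4, 8]
    return tuple(int(line[s:e]) for s, e in zip(starts, ends))


print(parse_fixed_width("1234567813572468", [4, 4, 4]))  # (1234, 5678, 1357)
```

A function like this could be handed to the RDD as `.map(lambda x: parse_fixed_width(x, [4, 4, 4]))` in place of the hard-coded slices, which keeps the column widths in one place.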

Using the DataFrame API:


You can apply the new schema to the previous DataFrame: df_new = spark.createDataFrame(sorted_df.rdd, schema). You can't use spark.read.csv on data that has no delimiter.

I had considered that, but it returns an error saying that IntegerType cannot accept the object '1234'. I will also try the approach below and report back here.

This is exactly what I was looking for! Thank you for being so thorough and for providing a second approach, since it helped me understand that there are several ways to solve this problem. For others: I was running in a Jupyter notebook and found that 'col' could not be found in PySpark, as used in the example above. To fix this, I used this post:


Output of the first snippet in the question:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
+----+----+----+

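from pyspark.sql.types import StructType, StructField, LongType
# Read each whole line as a single 64-bit (long) column
schema = StructType([StructField('col1', LongType(), True)])
spark.read.csv("path", schema=schema).show()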
#+----------------+
#|            col1|
#+----------------+
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#+----------------+

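from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])

# Read each line as a string, slice out the three 4-character fields, and cast them to int
rdd = spark.sparkContext.textFile("fixed_width.csv") \
    .map(lambda x: (int(x[0:4]), int(x[4:8]), int(x[8:12])))

# Convert the RDD of tuples to a DataFrame using the schema
df = spark.createDataFrame(rdd, schema)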

df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#+----+----+----+

df.printSchema()
#root
# |-- col1: integer (nullable = true)
# |-- col2: integer (nullable = true)
# |-- col3: integer (nullable = true)

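from pyspark.sql.functions import col

# Read every line into a single string column named "value"
df = spark.read.option("header", "false")\
               .option("inferSchema", "true")\
               .text("path")

# substr is 1-based: substr(start, length)
sorted_df = df.select(
    df.value.substr(1, 4).alias('col1'),
    df.value.substr(5, 4).alias('col2'),
    # starts at position 8, so it overlaps the last character of col2 (see output)
    df.value.substr(8, 4).alias('col3'),
)

# Build the cast expressions dynamically, one per column
casting = [col(col_name).cast("int").alias(col_name) for col_name in sorted_df.columns]
sorted_df = sorted_df.select(casting)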

#required dataframe
sorted_df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#+----+----+----+

# Optional: apply an explicit schema in case you want to change the types
schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])

df=spark.createDataFrame(sorted_df.rdd,schema)
df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#+----+----+----+
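
The two approaches above do not return the same third column: the RDD version shows 1357, while the DataFrame version shows 8135. The difference comes from the offsets. Column.substr takes a 1-based start position and a length, whereas Python slices are 0-based with an exclusive end. The plain-Python sketch below mimics that behaviour for start positions of 1 or more; the `substr` helper is a hypothetical stand-in for illustration, not the PySpark method itself.

```python
def substr(value: str, pos: int, length: int) -> str:
    """Mimic a 1-based substr(pos, length) for pos >= 1."""
    return value[pos - 1 : pos - 1 + length]


line = "1234567813572468"

print(substr(line, 5, 4))  # 5678
print(substr(line, 8, 4))  # 8135 (starts on the last character of the previous field)
print(substr(line, 9, 4))  # 1357 (starts right after the previous field)
print(line[8:12])          # 1357 (the slice used in the RDD version)
```

If the three fields are meant to be non-overlapping, a start position of 9 for col3 would make the DataFrame version agree with the RDD version.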