Create a pyspark dataframe from lists?
So I have a list of headers, e.g.
Headers=["col1", "col2", "col3"]
and a list of rows
Body=[ ["val1", "val2", "val3"], ["val1", "val2", "val3"] ]
where val1 corresponds to the value that should fall under col1, etc.
If I try createDataFrame(data=Body) it gives the error "Can not infer schema for type str".
Is it possible to get a list of lists like this into a pyspark dataframe?
I have also tried appending the header onto the body, e.g.
body.append(header), and then using createDataFrame, but it throws the following error:
field _22: Can not merge type … and …
Here is the full code that generates the body and header:
Basically I use openpyxl to read an excel file; it skips the first x rows etc. and only reads sheets that have specific column names.
After the body and header are generated I want to read them straight into Spark.
We have a contractor who writes it out as a csv and then reads that with Spark, but putting it into Spark directly seems to make more sense.
At this point I want all of the columns to be strings.
import csv
from os import sys
from openpyxl import load_workbook  # needed for load_workbook below

excel_file = "/dbfs/{}".format(path)
wb = load_workbook(excel_file, read_only=True)
sheet_names = wb.get_sheet_names()

sheets = spark.read.option("multiline", "true").format("json").load(configPath)
if dataFrameContainsColumn(sheets, "sheetNames"):
    config_sheets = jsonReader(configFilePath, "sheetNames")
else:
    config_sheets = []

skip_rows = -1
# get a list of the required columns
required_fields_list = jsonReader(configFilePath, "requiredColumns")

for worksheet_name in sheet_names:
    count = 0
    sheet_count = 0
    second_break = False
    worksheet = wb.get_sheet_by_name(worksheet_name)
    # assign the sheet name to the object sheet
    # create empty header and body lists for each sheet
    header = []
    body = []
    # for each row in the sheet we need to append the cells to the header and body
    for i, row in enumerate(worksheet.iter_rows()):
        # if the row index is greater than skip_rows then read that row in as the header
        if i == skip_rows + 1:
            header.append([cell.value for cell in row])
        elif i > skip_rows + 1:
            count = count + 1
            if count == 1:
                header = header[0]
                header = [w.replace(' ', '_') for w in header]
                header = [w.replace('.', '') for w in header]
                if not all(elem in header for elem in required_fields_list):
                    second_break = True
                    break
                else:
                    count = 2
                    sheet_count = sheet_count + 1
            body.append([cell.value for cell in row])
There are several ways to create a dataframe from a list; see the examples below.
Yes, using the syntax you described. But given that it doesn't work, please show the whole code that produces the error.
spark.createDataFrame([["val1", "val2", "val3"], ["val1", "val2", "val3"]], ["col1", "col2", "col3"]).show()
I've now put in more of the code that generates the error @OliverW. @pissall I just tried that but got another error about the types? @Tiger_Stripes what is the error?
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
df = sc.parallelize(list_of_persons).toDF(['name', 'age', 'score'])
df.printSchema()
df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- score: double (nullable = true)
+-----+---+-----+
| name|age|score|
+-----+---+-----+
|Arike| 28| 78.6|
| Bob| 32|45.32|
|Corry| 65|98.47|
+-----+---+-----+
from pyspark.sql import Row

list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
rdd = sc.parallelize(list_of_persons)
person = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), score=float(x[2])))
schemaPeople = sqlContext.createDataFrame(person)
schemaPeople.printSchema()
schemaPeople.show()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- score: double (nullable = true)
+---+-----+-----+
|age| name|score|
+---+-----+-----+
| 28|Arike| 78.6|
| 32| Bob|45.32|
| 65|Corry|98.47|
+---+-----+-----+