Create a pyspark dataframe from lists?
So I have a list of headers, e.g.
Headers=["col1", "col2", "col3"]
and a list of rows
Body=[ ["val1", "val2", "val3"], ["val1", "val2", "val3"] ]
where val1 corresponds to the value that should fall under col1, etc.
If I try createDataFrame(data=Body) it gives the error "Can not infer schema for type str".
Is it possible to get a list of lists like this into a pyspark dataframe?
I have also tried appending the header onto the body, e.g.
body.append(header), and then using createDataFrame, but it throws the following error:
field _22: Can not merge type … and …
Here is the full code that generates the body and header:
Basically I use openpyxl to read an excel file; it skips the first x rows etc. and only reads sheets that have specific column names.
After the body and header are generated I want to read them straight into Spark.
We have a contractor who writes it out as a csv and then reads that with Spark, but putting it into Spark directly seems to make more sense.
At this point I want all of the columns to be strings.
import csv
from os import sys
from openpyxl import load_workbook  # needed for load_workbook below

excel_file = "/dbfs/{}".format(path)
wb = load_workbook(excel_file, read_only=True)
sheet_names = wb.get_sheet_names()

sheets = spark.read.option("multiline", "true").format("json").load(configPath)
if dataFrameContainsColumn(sheets, "sheetNames"):
    config_sheets = jsonReader(configFilePath, "sheetNames")
else:
    config_sheets = []

skip_rows = -1
# get a list of the required columns
required_fields_list = jsonReader(configFilePath, "requiredColumns")

for worksheet_name in sheet_names:
    count = 0
    sheet_count = 0
    second_break = False
    worksheet = wb.get_sheet_by_name(worksheet_name)
    # assign the sheet name to the object sheet
    # create empty header and body lists for each sheet
    header = []
    body = []
    # for each row in the sheet we need to append the cells to the header and body
    for i, row in enumerate(worksheet.iter_rows()):
        # if the row index is greater than skip_rows then read that row in as the header
        if i == skip_rows + 1:
            header.append([cell.value for cell in row])
        elif i > skip_rows + 1:
            count = count + 1
            if count == 1:
                header = header[0]
                header = [w.replace(' ', '_') for w in header]
                header = [w.replace('.', '') for w in header]
                if not all(elem in header for elem in required_fields_list):
                    second_break = True
                    break
                else:
                    count = 2
                    sheet_count = sheet_count + 1
            body.append([cell.value for cell in row])
There are several ways to create a dataframe from a list; see the examples below.
Yes, using the syntax you described. But given that it doesn't work, please show the whole code that produces the error.
spark.createDataFrame([["val1", "val2", "val3"], ["val1", "val2", "val3"]], ["col1", "col2", "col3"]).show()
I've now put in more of the code that generates the error @OliverW. @pissall I just tried that but got another error about the types? @Tiger_Stripes what is the error?
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
df = sc.parallelize(list_of_persons).toDF(['name', 'age', 'score'])
df.printSchema()
df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- score: double (nullable = true)
+-----+---+-----+
| name|age|score|
+-----+---+-----+
|Arike| 28| 78.6|
| Bob| 32|45.32|
|Corry| 65|98.47|
+-----+---+-----+
from pyspark.sql import Row

list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
rdd = sc.parallelize(list_of_persons)
person = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), score=float(x[2])))
schemaPeople = sqlContext.createDataFrame(person)
schemaPeople.printSchema()
schemaPeople.show()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- score: double (nullable = true)
+---+-----+-----+
|age| name|score|
+---+-----+-----+
| 28|Arike| 78.6|
| 32| Bob|45.32|
| 65|Corry|98.47|
+---+-----+-----+