Python: How to define a schema for PySpark createDataFrame(rdd, schema)?


I looked at a related SO question.

I read the gzipped JSON into an RDD:

rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
I want to convert it into a Spark DataFrame. The first approach from the linked SO question did not work. This is the first line of the file:

{"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000", "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "", "etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20", "odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "", "odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656, "topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}
How can I infer the schema?

The answer there was:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])

Why is range(32) used?

To answer your question, range(32) simply dictates how many columns the StructField class is applied to in the desired schema; in your case there are 30 columns. Based on your data, I was able to create a DataFrame using the following logic:

from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, LongType)

data_json = {"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000",
             "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event": "183", "eperiod_event": "",
             "etl_date": "2020-05-27", "event_no": 1, "group_no": 0, "name_event": "Ungaria Putto - 8/20", "name_event_short": "Ungaria Putto - 8/20",
             "odd_coefficient": 1, "odd_coefficient_entry": 1, "odd_coefficient_user": 1, "odd_ekey": "11", "odd_name": "11", "odd_status": "",
             "odd_type": "11", "odd_voidfactor": 0, "odd_win_types": "", "special_bet_value": "", "ticket_id": "899M-E2X93P", "id_update": 8000001036823656,
             "topic_group": "cwg5", "kafka_key": "899M-E2X93P", "kafka_epoch": 1590580609424, "kafka_partition": 0, "kafka_topic": "tickets-calculated_2"}

column_names = list(data_json.keys())
row_data = [tuple(data_json.values())]

# Pick a Spark type for each column from the Python type of its sample value:
# strings -> StringType, integers of up to 8 digits -> IntegerType, and any
# larger integer (e.g. id_update, kafka_epoch) -> LongType.
fields = []
for name in column_names:
    value = data_json[name]
    if isinstance(value, str):
        fields.append(StructField(name, StringType(), True))
    elif isinstance(value, int) and len(str(value)) <= 8:
        fields.append(StructField(name, IntegerType(), True))
    else:
        fields.append(StructField(name, LongType(), True))

schema = StructType(fields)
data = spark.createDataFrame(row_data, schema)
data.show()
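
To apply a schema built this way to the rdd1 from the question, each JSON line has to be parsed into a tuple in the same column order first. A minimal sketch, assuming the column_names and schema variables from the snippet above and an active SparkSession named spark:

import json

# Parse each line of the gzipped JSON into a tuple ordered like column_names;
# dict.get() returns None for missing keys, so incomplete rows still fit the schema.
def to_row(line, names=tuple(column_names)):
    record = json.loads(line)
    return tuple(record.get(name) for name in names)

df = spark.createDataFrame(rdd1.map(to_row), schema)
df.show(5)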

The range(32) in that example was just that: an example. They were generating a schema with 32 columns, each named by its number. If you really want to define the schema, you need to define every column explicitly:

from pyspark.sql.types import *

schema = StructType([
    StructField('code_event', IntegerType(), True),
    StructField('code_event_system', StringType(), True),
    ...
])

But a better approach is to avoid the RDD API and read the file directly into a DataFrame (see the docs):

>>> data = spark.read.json('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
>>> data.printSchema()
root
 |-- code_event: string (nullable = true)
 |-- code_event_system: string (nullable = true)
 |-- company_id: string (nullable = true)
 |-- date_event: string (nullable = true)
 |-- date_event_real: string (nullable = true)
 |-- ecode_class: string (nullable = true)
 |-- ecode_event: string (nullable = true)
 |-- eperiod_event: string (nullable = true)
 |-- etl_date: string (nullable = true)
....
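
The reader also accepts an explicit schema, which skips the inference pass over the data entirely. A minimal sketch, assuming a SparkSession named spark; the two fields shown are an illustrative subset of the full 30-column schema, not the complete definition:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Illustrative subset of the schema; extend with the remaining columns.
schema = StructType([
    StructField('code_event', StringType(), True),
    StructField('kafka_epoch', LongType(), True),
])

path = 's3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz'
df = spark.read.schema(schema).json(path)
df.printSchema()  # only the fields named in the schema are parsed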
