Python PySpark: flatten array column


I'm parsing Azure Event Hub Avro messages. The last column is an array, and I want to flatten it.

Before:

{"records":[{"time":"2020-01-28T04:50:20.0975886Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Start","resultSignature":"Started.","durationMs":"0","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"},{"time":"2020-01-28T04:50:20.1122888Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Success","resultSignature":"Succeeded.NoContent","durationMs":"14","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"}]}
This is what I've come up with, and I think I'm close. I can read the struct and pull out the top-level 'records' field, but I can't handle the array inside it.

from pyspark.sql.types import StringType, StructField, StructType, ArrayType
from pyspark.sql.functions import from_json, col, explode, flatten

# Creates a DataFrame from a specified directory
df = spark.read.format("avro").load("/mnt/test/xxxxxx/xxxxxxxx/31.avro")

# cast the binary column (Body) to a string
df = df.withColumn("Body", col("Body").cast("string"))

sourceSchema = StructType([
        StructField("records", ArrayType(
            StructType([
                StructField("time", StringType(), True),
                StructField("resourceId", StringType(), True),
                StructField("operationName", StringType(), True),
                StructField("category", StringType(), True),
                StructField("resultType", StringType(), True),
                StructField("resultSignature", StringType(), True),
                StructField("durationMs", StringType(), True),
                StructField("callerIpAddress", StringType(), True),
                StructField("correlationId", StringType(), True)
            ])
        ), True)
    ])

df = df.withColumn("Body", from_json(df.Body, sourceSchema))

# Flatten Body: this only promotes the struct's top-level fields,
# so "records" comes out as a still-nested array column
for c in df.schema['Body'].dataType:
    df2 = df.withColumn(c.name, col("Body." + c.name))

display(df2)
After:

[{"time":"2020-01-28T04:50:20.0975886Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Start","resultSignature":"Started.","durationMs":"0","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"},{"time":"2020-01-28T04:50:20.1122888Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Success","resultSignature":"Succeeded.NoContent","durationMs":"14","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"}]
Maybe try this:

import pandas as pd
# note: in pandas >= 1.0 this import is deprecated; use pd.json_normalize instead
from pandas.io.json import json_normalize
s = {"records":[{"time":"2020-01-28T04:50:20.0975886Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Start","resultSignature":"Started.","durationMs":"0","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"},{"time":"2020-01-28T04:50:20.1122888Z","resourceId":"/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME","operationName":"MICROSOFT.COMPUTE/DISKS/DELETE","category":"Administrative","resultType":"Success","resultSignature":"Succeeded.NoContent","durationMs":"14","callerIpAddress":"43.121.152.99","correlationId":"xxxxxxx"}]}
json_normalize(s).values
The result you get is:

array([[list([{'time': '2020-01-28T04:50:20.0975886Z', 'resourceId': '/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME', 'operationName': 'MICROSOFT.COMPUTE/DISKS/DELETE', 'category': 'Administrative', 'resultType': 'Start', 'resultSignature': 'Started.', 'durationMs': '0', 'callerIpAddress': '43.121.152.99', 'correlationId': 'xxxxxxx'}, {'time': '2020-01-28T04:50:20.1122888Z', 'resourceId': '/SUBSCRIPTIONS/xxxxxxxxxxxx/RESOURCEGROUPS/xxxxx-xxxxxxxI/PROVIDERS/MICROSOFT.COMPUTE/DISKS/7C3E07DE8xxxxxxx-0-SCRATCHVOLUME', 'operationName': 'MICROSOFT.COMPUTE/DISKS/DELETE', 'category': 'Administrative', 'resultType': 'Success', 'resultSignature': 'Succeeded.NoContent', 'durationMs': '14', 'callerIpAddress': '43.121.152.99', 'correlationId': 'xxxxxxx'}])]],
  dtype=object)
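That output still leaves the whole records list in a single cell. To get one column per field, a small sketch using json_normalize's record_path argument (my addition, building on the s defined above):

# sketch: treat each element of "records" as its own row,
# which yields one column per field
flat = json_normalize(s, record_path='records')
flat[['time', 'resourceId', 'resultType']]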

I've seen a lot of people run into this problem; hope this helps.

# Read Event Hub's stream
# if reading from a file instead: supported file formats are text, csv, json, orc, parquet
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, StructField, StructType, ArrayType

conf = {}
conf["eventhubs.connectionString"] = "Endpoint=sb://xxxxxxxxxxxx.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=jxxxxxxxxxxxx/xxxxxxxxxxxx=;EntityPath=eventhub"

# define dataframe for reading stream
read_df = (
  spark
    .readStream
    .format("eventhubs")
    .options(**conf)
    .option('multiLine', True)
    .option('mode', 'PERMISSIVE')
    .load()
)

# define struct for writing
sourceSchema = StructType([
        StructField("records", ArrayType(
            StructType([
                StructField("time", StringType(), True),
                StructField("resourceId", StringType(), True),
                StructField("operationName", StringType(), True),
                StructField("category", StringType(), True),
                StructField("resultType", StringType(), True),
                StructField("resultSignature", StringType(), True),
                StructField("durationMs", StringType(), True),
                StructField("callerIpAddress", StringType(), True),
                StructField("correlationId", StringType(), True)
            ])
        ), True)
    ])

# cast the binary body to a string and parse the JSON with the schema
decoded_df = read_df.select(F.from_json(F.col("body").cast("string"), sourceSchema).alias("payload"))

# write to memory
query1 = (
  decoded_df
    .writeStream
    .format("memory")
    .queryName("read_hub")
    .start()
)
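To flatten the parsed payload before it lands in the in-memory table, the same explode pattern applies; a sketch of my own (flat_df, query2, and read_hub_flat are assumed names, not part of the original answer):

# sketch: one row per record, one column per field
flat_df = (
  decoded_df
    .select(F.explode(F.col("payload.records")).alias("record"))
    .select("record.*")
)

query2 = (
  flat_df
    .writeStream
    .format("memory")
    .queryName("read_hub_flat")
    .start()
)

# inspect the in-memory table
display(spark.sql("select time, resourceId, operationName from read_hub_flat"))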

It's not clear what result you're after; maybe add an example by manually flattening part of the string?
I need each entry in a separate column: 'time' in one column, 'resourceId' in another, and so on. Did this answer your question?