Python pyspark - using createDataFrame to find max and min values in JSON streaming data


I have a set of JSON messages streamed from Kafka, each describing a website user. Using pyspark, I need to count the number of users per country in each streaming window and return the countries with the maximum and minimum user counts.

Here is an example of a streamed JSON message:

{"id":1,"first_name":"Barthel","last_name":"Kittel","email":"bkittel0@printfriendly.com","gender":"Male","ip_address":"130.187.82.195","date":"06/05/2018","country":"France"}
Here is my code:

import json

from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark import SparkContext
from pyspark.sql import SQLContext

fields = ['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'date', 'country']
schema =  StructType([
  StructField(field, StringType(), True) for field in fields
])

def parse(s, fields):
    try:
        d = json.loads(s[0])
        return [tuple(d.get(field) for field in fields)]
    except:
        return []

array_of_users = parsed.SQLContext.createDataFrame(parsed.flatMap(lambda s: parse(s, fields)), schema)

rdd = sc.parallelize(array_of_users)

# group by country and then substitute the list of messages for each country by its length, resulting into a rdd of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max using as comparison key the second element of the (country, length) tuple
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])
When I run it, I get the following message:

AttributeError                            Traceback (most recent call last)
<ipython-input-24-6e6b83935bc3> in <module>()
     16         return []
     17 
---> 18 array_of_users = parsed.SQLContext.createDataFrame(parsed.flatMap(lambda s: parse(s, fields)), schema)
     19 
     20 rdd = sc.parallelize(array_of_users)

AttributeError: 'TransformedDStream' object has no attribute 'SQLContext'

How can I fix this?

If I understand you correctly, you want to group the messages by country, count the number of messages in each group, and then select the groups with the minimum and maximum counts.

Off the top of my head, the code would look something like this:

# assuming the array_of_users is your array of messages
rdd = sc.parallelize(array_of_users)

# group by country and then substitute the list of messages for each country by its length, resulting into a rdd of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max using as comparison key the second element of the (country, length) tuple
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])
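
To run this per streaming window rather than on a static array, one option is foreachRDD, which hands you a plain RDD for each micro-batch so the counting above can be applied directly. This is only a minimal sketch, not from the original post, and it assumes parsed is the DStream of dicts from parsed = kafkaStream.map(lambda v: json.loads(v[1])) mentioned in the comments below:

# Minimal sketch (assumes parsed is a DStream of parsed JSON dicts).
def process_batch(time, rdd):
    # skip empty micro-batches
    if rdd.isEmpty():
        return
    # build (country, count) pairs for this batch
    counts = rdd.map(lambda d: (d.get('country'), 1)).reduceByKey(lambda a, b: a + b)
    # pick the countries with the fewest and the most users in this window
    country_min = counts.min(key=lambda kv: kv[1])
    country_max = counts.max(key=lambda kv: kv[1])
    print(time, country_min, country_max)

parsed.foreachRDD(process_batch)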

How do you get a window of data?

ssc = StreamingContext(sc, 60) (using PySpark)

I don't see that line, and I don't see where you define parsed in your code. Note: the Kafka streaming 0.8 library has been deprecated since Spark 2.3.0. You may have followed this blog, which uses the same variable names.

Thanks for the hint! As far as I know, my messages end up in parsed = kafkaStream.map(lambda v: json.loads(v[1])). How can I get from there to the array_of_users you suggest?

This might be useful, have a look at how transform is used:
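
For illustration only, a rough sketch of what that transform usage could look like, assuming parsed is the DStream of dicts defined above; it produces a new DStream of (country, count) pairs per batch:

# Hypothetical sketch: turn each batch of parsed JSON dicts into (country, count) pairs.
country_counts = parsed.transform(
    lambda rdd: rdd.map(lambda d: (d.get('country'), 1))
                   .reduceByKey(lambda a, b: a + b)
)
country_counts.pprint()  # print a few counts for each micro-batch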