Python: How to calculate a cumulative sum using sqlContext


python apache-spark pyspark


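I know we can use window functions to compute a cumulative sum. But window functions are supported only in HiveContext, not in SQLContext. I need to use SQLContext because HiveContext cannot be run in multiple processes.

Is there an efficient way to compute a cumulative sum using SQLContext? A simple approach is to load the data into the driver's memory and use numpy.cumsum, but the drawback is that the data must fit into memory.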

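Not sure if this is what you are looking for, but here are two examples of how to compute a cumulative sum with sqlContext:

First, when you want to partition it by some category: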

from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext

# `sc` and `sqlContext` are assumed to exist already (e.g. as provided by the pyspark shell)
rdd = sc.parallelize([
    ("Tablet", 6500),
    ("Tablet", 5500),
    ("Cell Phone", 6000),
    ("Cell Phone", 6500),
    ("Cell Phone", 5500)
    ])

schema = StructType([
    StructField("category", StringType(), False),
    StructField("revenue", LongType(), False)
    ])

df = sqlContext.createDataFrame(rdd, schema)

# Register the DataFrame so it can be queried by name in SQL
df.registerTempTable("test_table")

# Running total of revenue within each category, ordered by revenue
df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (PARTITION BY category ORDER BY revenue) as cumsum
FROM
test_table
""")

df2.collect()

Output:

[Row(category='Tablet', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=6500, cumsum=12000),
 Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Cell Phone', revenue=6000, cumsum=11500),
 Row(category='Cell Phone', revenue=6500, cumsum=18000)]

Second, when you only want the cumulative sum of one variable, without partitioning. Change df2 to:

df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (ORDER BY revenue, category) as cumsum
FROM
test_table
""")

Output:

[Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=5500, cumsum=11000),
 Row(category='Cell Phone', revenue=6000, cumsum=17000),
 Row(category='Cell Phone', revenue=6500, cumsum=23500),
 Row(category='Tablet', revenue=6500, cumsum=30000)]
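
One likely reason the second query orders by both revenue and category: when a window has ORDER BY but no explicit frame, Spark SQL's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows with equal ordering values are peers and receive the same running total. The plain-Python sketch below only emulates the two frame types to show the difference on tied revenues; `running_sum_rows` and `running_sum_range` are hypothetical helpers, not Spark APIs.

```python
from itertools import accumulate


def running_sum_rows(sorted_values):
    """ROWS-style frame: every row gets its own running total."""
    return list(accumulate(sorted_values))


def running_sum_range(sorted_values):
    """RANGE-style frame: rows with equal ordering values (peers)
    all share the total that includes every one of them."""
    totals = {}
    running = 0
    for value in sorted_values:
        running += value
        totals[value] = running  # the last peer's total applies to the whole tie
    return [totals[value] for value in sorted_values]


revenues = [5500, 5500, 6000, 6500, 6500]
print(running_sum_rows(revenues))   # [5500, 11000, 17000, 23500, 30000]
print(running_sum_range(revenues))  # [11000, 11000, 17000, 30000, 30000]
```

Adding category as a second ordering column makes every row's ordering key unique in this data set, which is why the output above increases row by row.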

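Hope this helps. Using np.cumsum after collecting the data is not very efficient, especially when the data set is large. Another approach you could explore is to use simple RDD transformations such as groupByKey(), then use map to compute the cumulative sum within each group by some key, and finally reduce it.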
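
The RDD route mentioned above could look roughly like the following sketch. The helper names `cumsum_within_group` and `cumulative_sum_by_key` are hypothetical, and ordering each group by value is an assumption about the desired ordering. Keep in mind that groupByKey() gathers all values of a key in one place, so this only suits groups of moderate size.

```python
from itertools import accumulate


def cumsum_within_group(values):
    """Return (value, running_total) pairs for one group, ordered by value."""
    ordered = sorted(values)
    return list(zip(ordered, accumulate(ordered)))


def cumulative_sum_by_key(rdd):
    """Given an RDD of (key, value) pairs, return an RDD of
    (key, value, cumsum) tuples using only RDD transformations."""
    return (
        rdd.groupByKey()  # yields (key, iterable of values)
           .flatMap(lambda kv: [(kv[0], value, total)
                                for value, total in cumsum_within_group(kv[1])])
    )


# The per-group logic is plain Python and can be checked without Spark:
print(cumsum_within_group([6500, 5500]))  # [(5500, 5500), (6500, 12000)]
```

With the `rdd` of (category, revenue) pairs from the first answer, `cumulative_sum_by_key(rdd).collect()` would be the way to materialize the result.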

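It is not true that window functions work only with HiveContext. You can use them with sqlContext as well:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

myPartition = Window.partitionBy(['col1', 'col2', 'col3'])
# F.sum is the Spark aggregate; Python's built-in sum would fail on a Column
temp = temp.withColumn("#dummy", F.sum(temp.col4).over(myPartition))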

Here is a simple example:

import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)

data = sqlcontext.createDataFrame(
    [("Bob", "M", "Boston", 1, 20),
     ("Cam", "F", "Cambridge", 1, 25),
     ("Lin", "F", "Cambridge", 1, 25),
     ("Cat", "M", "Boston", 1, 20),
     ("Sara", "F", "Cambridge", 1, 15),
     ("Jeff", "M", "Cambridge", 1, 25),
     ("Bean", "M", "Cambridge", 1, 26),
     ("Dave", "M", "Cambridge", 1, 21)],
    ["name", "gender", "city", "donation", "age"])

data.show()

Output:

+----+------+---------+--------+---+
|name|gender|     city|donation|age|
+----+------+---------+--------+---+
| Bob|     M|   Boston|       1| 20|
| Cam|     F|Cambridge|       1| 25|
| Lin|     F|Cambridge|       1| 25|
| Cat|     M|   Boston|       1| 20|
|Sara|     F|Cambridge|       1| 15|
|Jeff|     M|Cambridge|       1| 25|
|Bean|     M|Cambridge|       1| 26|
|Dave|     M|Cambridge|       1| 21|
+----+------+---------+--------+---+

Define the window:

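win_spec = (window.Window
            .partitionBy(['gender', 'city'])
            .rowsBetween(window.Window.unboundedPreceding, 0))

# window.Window.unboundedPreceding -- the first row of the group
# .rowsBetween(..., 0) -- 0 refers to the current row; if -2 is specified instead, the frame starts at most 2 rows before the current row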
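
To make the frame boundaries concrete, here is a plain-Python emulation of what a sum over rowsBetween(start, 0) produces for a single partition whose rows are already in order. It only illustrates the semantics and is not how Spark evaluates the window; `rows_between_sum` is a hypothetical helper, and `None` stands in for Window.unboundedPreceding.

```python
def rows_between_sum(values, start=None, end=0):
    """Sum each row's frame: the rows from offset `start` to offset `end`
    relative to the current row. start=None means unbounded preceding."""
    result = []
    for i in range(len(values)):
        lo = 0 if start is None else max(0, i + start)  # clamp at the first row
        hi = min(len(values), i + end + 1)              # slice end is exclusive
        result.append(sum(values[lo:hi]))
    return result


donations = [1, 1, 1, 1]
print(rows_between_sum(donations))            # [1, 2, 3, 4]
print(rows_between_sum(donations, start=-2))  # [1, 2, 3, 3]
```

The first call corresponds to a cumulative sum from the start of the group; the second caps the frame at the current row plus the two rows before it.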

Now, here is a catch:

temp = data.withColumn('cumsum', sum(data.donation).over(win_spec))

This raises an error, because sum here is Python's built-in sum, which tries to iterate over the Column:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))

/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
    238
    239     def __iter__(self):
--> 240         raise TypeError("Column is not iterable")
    241
    242     # string methods

TypeError: Column is not iterable

Using sum from pyspark.sql.functions (imported above as sf) instead:

temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
temp.show()

will give:

+----+------+---------+--------+---+--------------+
|name|gender|     city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara|     F|Cambridge|       1| 15|             1|
| Cam|     F|Cambridge|       1| 25|             2|
| Lin|     F|Cambridge|       1| 25|             3|
| Bob|     M|   Boston|       1| 20|             1|
| Cat|     M|   Boston|       1| 20|             2|
|Dave|     M|Cambridge|       1| 21|             1|
|Jeff|     M|Cambridge|       1| 25|             2|
|Bean|     M|Cambridge|       1| 26|             3|
+----+------+---------+--------+---+--------------+

After trying to solve a similar problem, I solved mine with this code. Not sure whether I am missing part of the OP's question, but this is a way to sum a SQLContext column:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext

# Build the configuration first so that it is actually applied to the context
conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")

sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

def sum_column(table, column):
    # Returns a one-row DataFrame holding the total of `column`
    sc_table = sqlContext.table(table)
    return sc_table.agg({column: "sum"})

sum_column("db.tablename", "column").show()

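"I need to use SQLContext because HiveContext cannot be run in multiple processes" - huh? Would you care to elaborate?

I have used window functions with sqlContext extensively.

@zero323 It is a limitation of HiveContext. I am facing the same issue.

Because this is not a limitation of HiveContext. You are simply using embedded Derby as the metastore, which is not intended for production. See my answer; no changes to the Spark code are required. But you need some DevOps skills.

Thanks, but your solution works with hiveContext, not sqlContext. Can you print sqlContext? It should show that it is a HiveContext.

Only on Spark 2.0+ can users use window functions with SQLContext. For Spark versions 1.4 to 1.6, it is necessary to use HiveContext.

No, they were introduced in Spark 1.4.

They have existed since 1.4, but before Spark 2 it was necessary to use HiveContext. However, in many distributions the default class of the "sqlContext" instance in spark-shell and pyspark is actually HiveContext, which can cause some confusion: people think they can use window functions with a plain sqlContext. You can refer to this question for more information:

win_spec is not defined in your example, could you add it? It would be very helpful for understanding your nice example.

Oops, my bad @Mike. I will try to dig through my codebase ;) fingers crossed.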