Spark Java code to Python


I have a piece of mock data, shown below:

from pyspark.sql.types import DateType
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

# column names and the corresponding mock rows
columnMock = ['date','user_id','session_id','page_id','action_time','search_keyword','click_category_id','click_product_id',
              'order_category_ids','order_product_ids','pay_category_ids','pay_product_ids','city_id']
valsMock = [
    ('2017-03-04','1','2984','54684','2017-03-0418:02:03','dog','fjsd3','jf94fj','fk430','f4j89','rebj89','fejq9','GZ'),
    ('2017-03-04','2','294','9242','2017-03-0418:07:03','apple','fr343','jf94fj','fk430','f4j89','rebj89','fejq9','SH'),
    ('2017-03-04','1','2984','425','2017-03-0418:51:03','car','fbyt3','jf94fj','fk430','f4j89','rebj89','fejq9','BJ'),
    ('2017-03-04','2','294','92356','2017-03-0419:02:03','water','dad93','jf94fj','fk430','f4j89','rebj89','fejq9','HZ'),
    ('2017-03-04','1','2984','4014','2017-03-0419:22:03','wine','brt3','jf94fj','fk430','f4j89','rebj89','fejq9','GZ'),
    ('2017-03-04','2','294','4562','2017-03-0419:55:03','tiger','s21493','jf94fj','fk430','f4j89','rebj89','fejq9','GZ'),
    ('2017-03-04','1','2984','567','2017-03-0420:02:03','camel','rb493','jf94fj','fk430','f4j89','rebj89','fejq9','GZ'),
    ('2017-03-04','2','294','5372','2017-03-0431:02:03','glass','325g93','jf94fj','fk430','f4j89','rebj89','fejq9','GZ')
]


df = sqlContext.createDataFrame(valsMock, columnMock)
df.createOrReplaceTempView("sessionLog")
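
As a quick sanity check, the registered sessionLog view can be queried with Spark SQL, for example to count the actions recorded per session:

# count actions per session using the sessionLog temp view registered above
sqlContext.sql("SELECT session_id, COUNT(*) AS action_count FROM sessionLog GROUP BY session_id").show()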
Since Python doesn't have generics the way Java does, how should this function be changed? I am thinking of translating it using key-value pairs in Python, so that I can call groupByKey on the result of getSessionid2ActionRDD (a JavaPairRDD&lt;String, Row&gt; in Java), similar to what the Java code does, but I don't know how to implement it.

The Java function:

public static JavaPairRDD<String, Row> getSessionid2ActionRDD(JavaRDD<Row> actionRDD) {


    return actionRDD.mapPartitionsToPair(new PairFlatMapFunction<Iterator<Row>, String, Row>() {

        private static final long serialVersionUID = 1L;

        @Override
        public Iterable<Tuple2<String, Row>> call(Iterator<Row> iterator)
                throws Exception {
            List<Tuple2<String, Row>> list = new ArrayList<Tuple2<String, Row>>();

            while(iterator.hasNext()) {
                Row row = iterator.next();
                list.add(new Tuple2<String, Row>(row.getString(2), row));  
            }

            return list;
        }

    });
}

The Java function translated to PySpark looks like this:

def createPairedTuple(partIter):
    # pair each row with its session_id (column index 2), like the Java Tuple2<String, Row>
    return [(row[2], row) for row in partIter]
You can apply it by converting the DataFrame to an RDD and using mapPartitions:

df.rdd.mapPartitions(lambda partitionIterator: createPairedTuple(partitionIterator))
Or, better still, you can embed the function directly within mapPartitions:

df.rdd.mapPartitions(lambda partitionIterator: [(row[2], row) for row in partitionIterator])
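
From there, a minimal sketch of the groupByKey step mentioned in the question, assuming the df and createPairedTuple defined above, could look like this:

# pair every row with its session_id, then group all rows of the same session together
sessionid2ActionRDD = df.rdd.mapPartitions(createPairedTuple)
sessionid2ActionsRDD = sessionid2ActionRDD.groupByKey()

# inspect a couple of sessions; each value is an iterable of Row objects
for session_id, rows in sessionid2ActionsRDD.take(2):
    print(session_id, list(rows))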
I hope the answer is helpful.