PySpark RDD split issue


I am trying to filter records from an RDD on the value '01-10-2019'.

Input:

Here, insuredatarepart is of class 'pyspark.rdd.RDD', and the dataset below shows its values:

Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'No', IssuerId='96601', IssuerId2='96601', MarketCoverage=u'SHOP (Small Group)', NetworkName=u'Select Network', NetworkURL=u'http://il.coventryproviders.com', SourceName=u'SERFF', StateCode=u'IL', custnum='13')
Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'Yes', IssuerId='37001', IssuerId2='37001', MarketCoverage=u'Individual', NetworkName=u'HumanaDental PPO/Traditional Preferred', NetworkURL=u'https://www.humana.com/finder/search?customerId=1085&pfpkey=317', SourceName=u'HIOS', StateCode=u'GA', custnum='13')
Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'No', IssuerId='54172', IssuerId2='54172', MarketCoverage=u'Individual', NetworkName=u'Molina Marketplace', NetworkURL=u'https://eportal.molinahealthcare.com/Provider/ProviderSearch?RedirectFrom=MolinaStaticWeb&State=fl&Coverage=MMP', SourceName=u'HIOS', StateCode=u'FL', custnum='14')
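For context, an RDD of Row objects like this typically comes from reading the source file through the DataFrame API and then taking its underlying RDD. A hypothetical reconstruction of that load path (the file path and read options are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insurance").getOrCreate()

# Reading via the DataFrame API yields Row objects (already split into
# named fields), not raw comma-separated strings, once .rdd is taken.
df = spark.read.csv("/path/to/insurance.csv", header=True)
insuredatarepart = df.rdd.repartition(8)  # matches the "increase partition to 8" step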
The exception looks as follows:

### Remove duplicates in merged RDD:
insuredata:  class 'pyspark.rdd.PipelinedRDD'
 Result Count after duplicates removed:  1407
 Result Count of duplicates removed:  1

### Increase partition to 8 in merged RDD:
insuredatarepart: class 'pyspark.rdd.RDD'

### Split RDD with business date field:
20/02/05 19:11:43 ERROR Executor: Exception in task 0.0 in stage 74.0 (TID 150)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1008, in <lambda>
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1008, in <genexpr>
  File "/home/hduser/sparkdata2/script/insurance_info2_new.py", line 294, in <lambda>
    rdd_201901001 = insuredatarepart.map(lambda y: y.split(",",-1)).filter(lambda x: u'01-10-2019' in x)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1502, in __getattr__
    raise AttributeError(item)
AttributeError: split

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
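
The key line is AttributeError: split. pyspark.sql.Row implements __getattr__ for field lookup and raises AttributeError for any name that is not one of its fields, which is exactly what calling .split on a Row triggers. A minimal reproduction of just that behaviour (a sketch, independent of the script above):

from pyspark.sql import Row

r = Row(BusinessDate=u'01-10-2019', StateCode=u'IL')
print(r.BusinessDate)  # field access works: 01-10-2019
r.split                # raises AttributeError: split, as in the traceback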

From the printout you provided, it appears that you have an RDD of Row objects:

Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'No', IssuerId='96601', IssuerId2='96601', MarketCoverage=u'SHOP (Small Group)', NetworkName=u'Select Network', NetworkURL=u'http://il.coventryproviders.com', SourceName=u'SERFF', StateCode=u'IL', custnum='13')
Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'Yes', IssuerId='37001', IssuerId2='37001', MarketCoverage=u'Individual', NetworkName=u'HumanaDental PPO/Traditional Preferred', NetworkURL=u'https://www.humana.com/finder/search?customerId=1085&pfpkey=317', SourceName=u'HIOS', StateCode=u'GA', custnum='13')
Row(BusinessDate=u'01-10-2019', DentalOnlyPlan=u'No', IssuerId='54172', IssuerId2='54172', MarketCoverage=u'Individual', NetworkName=u'Molina Marketplace', NetworkURL=u'https://eportal.molinahealthcare.com/Provider/ProviderSearch?RedirectFrom=MolinaStaticWeb&State=fl&Coverage=MMP', SourceName=u'HIOS', StateCode=u'FL', custnum='14')
Here you cannot call a split function to split the elements, because whatever process you used to obtain them appears to have already split them into multiple fields. You can access them by item index instead:

rdd_201901001 = insuredatarepart.filter(lambda x: u'01-10-2019' in x[0])
Note that the map has been removed, and that an index has been added in the filter clause, as in x[0].
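
Since the elements are Row objects, you could also match on the named field directly, which may read more clearly (a sketch using the BusinessDate field shown in the sample output):

# Filter on the named Row field instead of a positional index.
rdd_201901001 = insuredatarepart.filter(lambda row: row.BusinessDate == u'01-10-2019')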

If your rows had only one field of string type (which, per the shared output, they don't), you would still need to call split on the zeroth element rather than on the Row itself, and the statement might have been:

rdd_201901001 = insuredatarepart.map(lambda y: y[0].split(",",-1)).filter(lambda x: u'01-10-2019' in x[0])

Note that index values have been applied in both the map and the filter operations. This would produce an RDD of lists of strings that would need to be stitched back together.
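
If you did go that route and ended up with lists of strings, stitching each record back into a single comma-separated string would be a one-liner (a sketch, not part of the original answer):

# Rejoin each list of fields into one comma-separated string.
stitched = rdd_201901001.map(lambda fields: u",".join(fields))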


Welcome to SO! Please take a moment to read how to post Spark questions. It looks like the elements of your insuredatarepart are lists, and you are applying the split function to them; since a list has no split function, it is throwing the error.

Yes, it works now, except.... The explanation of RDDs was also good :-)