
Python: creating pairs from an RDD using the nth element of each row

Tags: python, apache-spark, bigdata, rdd, databricks

I used the following code:

def process_row(row):
    words = row.replace('"', '').split(' ')
    for i in range(len(words)):
        # if we find '-' we will replace it with '0'
        if words[i] == '-':
            words[i] = '0'
    return words
    # alternative: return a structured record instead
    # return [words[0], words[1], words[2], words[3], words[4], int(words[5])]

nasa = (
    nasa_raw.flatMap(process_row)
)
nasa.persist()
for row in nasa.take(10):
    print(row)
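As a sanity check, the splitting logic can be exercised locally without Spark (a minimal sketch; the sample log line is taken from the data below):

```python
def process_row(row):
    # strip quotes, split on spaces, and replace '-' fields with '0'
    words = row.replace('"', '').split(' ')
    for i in range(len(words)):
        if words[i] == '-':
            words[i] = '0'
    return words

line = 'uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0'
print(process_row(line))
# → ['uplherc.upl.com', '[01/Aug/1995:00:00:07]', 'GET', '/', '304', '0']
```

With `flatMap`, each of those tokens becomes its own record in the resulting RDD.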
It transforms this data:

in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt" 200 1839
 uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/ksclogo-medium.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/MOSAIC-logosmall.gif" 304 0
 uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/USA-logosmall.gif" 304 0
ix-esc-ca2-07.ix.netcom.com [01/Aug/1995:00:00:09] "GET /images/launch-logo.gif" 200 1713
uplherc.upl.com [01/Aug/1995:00:00:10] "GET /images/WORLD-logosmall.gif" 304 0
slppp6.intermind.net [01/Aug/1995:00:00:10] "GET /history/skylab/skylab.html" 200 1687
piweba4y.prodigy.com [01/Aug/1995:00:00:10] "GET /images/launchmedium.gif" 200 11853
slppp6.intermind.net [01/Aug/1995:00:00:11] "GET /history/skylab/skylab-small.gif" 200 9202
into this pipelined RDD:

in24.inetnebr.com
[01/Aug/1995:00:00:01]
 GET
 /shuttle/missions/sts-68/news/sts-68-mcc-05.txt
 200
 1839
 uplherc.upl.com
 [01/Aug/1995:00:00:07]
 GET
 /
I want to count the frequency of each address, e.g. uplherc.upl.com, by creating pairs:

pairs = nasa.map(lambda x: (x , 1))
count_by_resource = pairs.reduceByKey(lambda x, y : x + y)
count_by_resource =  count_by_resource.takeOrdered(10, key = lambda x: -x[1])
spark.createDataFrame(count_by_resource, ['Resource_location','Count']).show(10)
But the result gives the frequency of every element:

   +--------------------+-------+
   |   Resource_location|  Count|
   +--------------------+-------+
   |                 GET|1551681|
   |                 200|1398910|
   |                   0| 225418|

How should I reference just the element I am interested in?

When you are mainly interested in the domains, splitting each line on spaces and then flatMapping all of those values creates extra work, and certainly extra overhead and processing.
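That is exactly why every token shows up in the counts: `flatMap` emits one record per token, so the `(x, 1)` pairs count tokens, not addresses. The effect can be reproduced locally (a sketch with two sample lines from the data above):

```python
from collections import Counter

lines = [
    'in24.inetnebr.com [01/Aug/1995:00:00:01] GET /history/skylab/skylab.html 200 1839',
    'uplherc.upl.com [01/Aug/1995:00:00:07] GET / 304 0',
]
# flatMap-like flattening: one record per token
tokens = [tok for line in lines for tok in line.split(' ')]
print(Counter(tokens)['GET'])  # 2 -- every line contributes a 'GET' token
```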

Based on the sample data provided, the domain is the first item of each line. I also noticed that some of your lines start with a space, which produces an extra empty string segment. You may consider using the strip function to trim the lines before processing.
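A small illustration of that leading-space problem, and how `strip` avoids it:

```python
# note the leading space, as in some of the sample lines
line = ' uplherc.upl.com [01/Aug/1995:00:00:07] GET / 304 0'

print(line.split(' ')[0])           # '' -- empty token from the leading space
print(line.strip().split(' ')[0])   # 'uplherc.upl.com'
```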

You may consider modifying the process to return only the first piece of the string, or creating another `map` operation.


def extract_domain_from_row(row):
    # if the row is a string
    domain = row.strip().split(' ')[0]
    # if you passed a list, you can always extract the first item as the domain
    # domain = row[0]
    return domain.lower()

# intermediate rdd
nasa_domains = nasa_raw.map(extract_domain_from_row)

# continue working with nasa as desired
pairs = nasa_domains.map(lambda x: (x, 1))
count_by_resource = pairs.reduceByKey(lambda x, y: x + y)
count_by_resource = count_by_resource.takeOrdered(10, key=lambda x: -x[1])
spark.createDataFrame(count_by_resource, ['Resource_location', 'Count']).show(10)
Output:

If the first item is not a domain, you may want to filter the collection with a pattern that matches domains; see the suggestions here.

+--------------------+-----+
|   Resource_location|Count|
+--------------------+-----+
|     uplherc.upl.com|    5|
|slppp6.intermind.net|    2|
|   in24.inetnebr.com|    1|
|ix-esc-ca2-07.ix....|    1|
|piweba4y.prodigy.com|    1|
+--------------------+-----+
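One way to do that pattern filtering is a hypothetical sketch like the following; the regular expression below is an assumption for illustration, not from the original answer, and should be adjusted to your data:

```python
import re

# loose pattern for host names like 'uplherc.upl.com' (hypothetical; adjust as needed)
DOMAIN_RE = re.compile(r'^[a-z0-9.-]+\.[a-z]{2,}$')

def looks_like_domain(token):
    return bool(DOMAIN_RE.match(token))

# on the RDD this would be: nasa_domains.filter(looks_like_domain)
print(looks_like_domain('uplherc.upl.com'))  # True
print(looks_like_domain('GET'))              # False
```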