Extracting URL hit counts from a log file with MapReduce
I am trying to write the following job in Hadoop MapReduce. I have a log file that contains IP addresses and the URLs opened by each IP. It looks like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
Now I need to organize the results from this file so that the output lists each distinct IP address, followed by every URL that IP opened and the number of times it was opened. For example, if 192.168.72.224 opened www.yahoo.com 15 times across the whole log file, the output must contain:

192.168.72.224 www.yahoo.com 15

This should be done for every IP in the file, and the final output should look like:
192.168.72.224 www.yahoo.com 15
www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
www.gmail.com 19
....
...
..
.
The code I have tried is:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        // Emits every token (IPs and URLs alike) with a count of 1,
        // so this counts individual words rather than (IP, URL) pairs.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
I know this code has serious flaws. Please give me an idea of how to move forward. Thanks.

I would propose the following design:
rdd = sc.textFile('/sparkdemo/log.txt')
# Count each (ip, url) pair, then regroup the pair counts by IP.
counts = rdd.map(lambda line: line.split()) \
            .map(lambda parts: ((parts[0], parts[1]), 1)) \
            .reduceByKey(lambda x, y: x + y)
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print('  website: %s count: %d' % (url, cnt))
The output for your example input is:
IP: 192.168.72.224
website: www.facebook.com count: 2
website: www.m4maths.com count: 2
website: www.google.com count: 5
website: www.gmail.com count: 4
website: www.indiabix.com count: 8
website: www.yahoo.com count: 3
IP: 192.168.72.177
website: www.yahoo.com count: 14
website: www.google.com count: 3
website: www.facebook.com count: 3
website: www.m4maths.com count: 3
website: www.indiabix.com count: 1
IP: 192.168.198.92
website: www.facebook.com count: 4
website: www.m4maths.com count: 3
website: www.yahoo.com count: 3
website: www.askubuntu.com count: 2
website: www.indiabix.com count: 1
website: www.google.com count: 5
website: www.gmail.com count: 1
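If you want to sanity-check those counts without a cluster, the same (IP, URL) aggregation can be reproduced in plain Python with `collections.Counter`. This is just a local sketch over a few sample lines, not the full log file:

```python
from collections import Counter, defaultdict

# A handful of sample lines from the log (not the full file).
log = """192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.facebook.com"""

# Count each (ip, url) pair, then group the pair counts by IP.
pair_counts = Counter(tuple(line.split()) for line in log.splitlines())
by_ip = defaultdict(dict)
for (ip, url), cnt in pair_counts.items():
    by_ip[ip][url] = cnt

for ip, sites in by_ip.items():
    print('IP: %s' % ip)
    for url, cnt in sites.items():
        print('  website: %s count: %d' % (url, cnt))
```

Running this against the real log file (read with `open(...)` instead of the inline string) should match the Spark output above line for line, up to ordering.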
I wrote the same logic in Java:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        // Emit (IP, URL) only for lines that actually have both fields.
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Count how many times each URL appears for this IP.
        HashMap<String, Integer> urlCount = new HashMap<>();
        for (Text value : values) {
            String url = value.toString();
            urlCount.put(url, urlCount.getOrDefault(url, 0) + 1);
        }
        for (Entry<String, Integer> e : urlCount.entrySet())
            context.write(key, new Text(e.getKey() + " " + e.getValue()));
    }
}

public class UrlHitCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new UrlHitCount(), args);
    }

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        job.setJarByClass(UrlHitCount.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
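One thing to watch with this reducer: it buffers a HashMap of all URLs for a single IP in memory, which can hurt if one IP is very hot. An alternative design is to make the composite pair (IP, URL) the map output key with a count of 1 as the value; the framework's shuffle then groups identical pairs and a summing reducer (which can double as a combiner) produces the counts with no per-IP buffering. The flow of that design can be sketched in plain Python (the function names here are illustrative, not Hadoop API):

```python
from itertools import groupby

def map_phase(lines):
    # Map: emit ((ip, url), 1) for every well-formed log line.
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield (parts[0], parts[1]), 1

def reduce_phase(pairs):
    # Shuffle: sorting brings equal keys together, as Hadoop's
    # shuffle would. Reduce: sum the 1s for each (ip, url) key.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

lines = [
    "192.168.72.224 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.yahoo.com",
]
result = dict(reduce_phase(map_phase(lines)))
print(result)  # one count per (ip, url) pair
```

In Hadoop terms this would mean emitting a Text key of `ip + " " + url` with an IntWritable value and using a sum reducer, at the cost of the output no longer being grouped by IP in a single reducer call.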