Php 阅读前100行_Php_Web Services_Hadoop_Web Crawler_Common Crawl

Php 阅读前100行

php web-services hadoop web-crawler

Php 阅读前100行,php,web-services,hadoop,web-crawler,common-crawl,Php,Web Services,Hadoop,Web Crawler,Common Crawl,请查看以下代码： wcmapper.php（hadoop流媒体作业的映射器） #/usr/bin/php wcreducer.php（用于示例hadoop作业的reducer脚本） #/usr/bin/php 此代码用于在commoncrawl数据集上使用PHP进行Wordcount流式处理作业在这里，这些代码读取整个输入。这不是我需要的，我需要读取前100行并将其写入文本文件。我是Hadoop、CommonCrawl和PHP的初学者。那么，我该怎么做呢请帮助。在第一个循环中使用计数器，

请查看以下代码：

wcmapper.php（hadoop流媒体作业的映射器）

#/usr/bin/php

wcreducer.php（用于示例hadoop作业的reducer脚本）

#/usr/bin/php

此代码用于在commoncrawl数据集上使用PHP进行Wordcount流式处理作业

在这里，这些代码读取整个输入。这不是我需要的，我需要读取前100行并将其写入文本文件。我是Hadoop、CommonCrawl和PHP的初学者。那么，我该怎么做呢

请帮助。

在第一个循环中使用计数器，当计数器达到100时停止循环。然后，创建一个虚拟循环，该循环一直读取到输入结束，然后继续执行代码（将结果写入STDOUT）。写入结果也可以在虚拟循环之前读取，直到STDIN输入结束。示例代码如下：

...
// input comes from STDIN (standard input)
for ($i=1; $i<=100; $i++){
   // read the line from STDIN; you
   // can add a check to exit if done ($line == false)
   $line = fgets(STDIN); 
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       $word2count[$word] += 1;
   }
}

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

// Dummy loop (to consume all the mapper input; it may work
// without this loop but I am not sure if this will confuse the
// Hadoop framework; you can try it without this loop and see)
while (($line = fgets(STDIN)) !== false) {
}

。。。
//输入来自标准输入（标准输入）
对于（$i=1；$i$count）{
//制表符分隔
回显“$word\t$count\n”；
}
//虚拟循环（使用所有映射器输入；它可能工作
//没有这个循环，但我不确定这是否会混淆
//Hadoop框架；您可以在不使用此循环的情况下尝试它，然后查看）
while（（$line=fgets（STDIN））！==false）{
}

我不确定如何定义“行”，但如果您需要单词，可以这样做：

for ($count=0; $count<=100; $count++){
      echo $word2count[$count]\t$count\n";
}

对于（$count=0；$countwow），这很有效。您能告诉我如何将输出也写入文本文件吗？请帮助。@Dongle关于您的第二个问题，我不确定您在问什么，因为您的代码会将输出写入文本文件（在HDFS中）。请发布另一个问题，详细说明您想要实现的目标、您尝试过的内容以及为什么它不起作用。真的吗？但是这个文本文件在哪里？我找不到它。我如何将其写成.txt？@Dongle这就是我建议您发布另一个问题的原因。我们需要更多信息。要回答这个问题，您如何运行程序？“-输出”flag告诉Hadoop Streaming将输出存储在何处。无论使用何种扩展名，输出都已经是文本文件。但如果愿意，您可以执行“-output/path/in/hdfs/out.txt”。
...
// input comes from STDIN (standard input)
for ($i=1; $i<=100; $i++){
   // read the line from STDIN; you
   // can add a check to exit if done ($line == false)
   $line = fgets(STDIN); 
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       $word2count[$word] += 1;
   }
}

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

// Dummy loop (to consume all the mapper input; it may work
// without this loop but I am not sure if this will confuse the
// Hadoop framework; you can try it without this loop and see)
while (($line = fgets(STDIN)) !== false) {
}

for ($count=0; $count<=100; $count++){
      echo $word2count[$count]\t$count\n";
}