Java: How to process an XML file in a Hadoop Mapper
I have a large XML file in the format below. I can read it line by line and do some string manipulation, since I only need to extract the values of a few fields. But in general, how do we process files in the following format? I found the Mahout XML parser, but I don't think it works for the following format.
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="13" CreationDate="2010-09-13T19:16:26.763" Score="155" ViewCount="160162" Body="&lt;p&gt;This is a common question by those who have just rooted their phones. What apps, ROMs, benefits, etc. do I get from rooting? What should I be doing now?&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133" LastActivityDate="2013-09-03T05:57:21.440" Title="I've rooted my phone. Now what? What do I gain from rooting?" Tags="&lt;rooting&gt;&lt;root&gt;" AnswerCount="2" CommentCount="0" FavoriteCount="107" CommunityOwnedDate="2011-01-25T08:44:10.820" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="4" CreationDate="2010-09-13T19:17:17.917" Score="10" ViewCount="966" Body="&lt;p&gt;I have a Google Nexus One with Android 2.2. I didn't like the default SMS-application so I installed Handcent-SMS. Now when I get an SMS, I get notified twice. How can I fix this?&lt;/p&gt;&#xA;" OwnerUserId="7" LastEditorUserId="981" LastEditDate="2011-11-01T18:30:32.300" LastActivityDate="2011-11-01T18:30:32.300" Title="I installed another SMS application, now I get notified twice" Tags="&lt;2.2-froyo&gt;&lt;sms&gt;&lt;notifications&gt;&lt;handcent-sms&gt;" AnswerCount="3" FavoriteCount="2" />
</posts>
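The data you posted is from the SO data dump (I know because I am currently working with it on Hadoop). Below is the mapper I wrote, which creates a tab-delimited file from this file. You basically read the file line by line and use the JAXP API to parse each line and extract the information you need.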
import java.io.IOException;
import java.io.StringReader;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import com.google.common.base.Joiner;

public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text>
{
    private final Text outputKey = new Text();
    private final Text outputValue = new Text();
    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private DocumentBuilder builder;
    private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();
    // Example input timestamp: 2008-07-31T21:42:52.667
    private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
    private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        try
        {
            builder = factory.newDocumentBuilder();
        }
        catch (ParserConfigurationException e)
        {
            // Fail the task instead of continuing with a null builder.
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable inputKey, Text inputValue, Context context)
        throws IOException, InterruptedException
    {
        try
        {
            String entry = inputValue.toString();
            // Only <row ... /> lines are parsed; the XML declaration and <posts> tags are skipped.
            if (entry.contains("<row "))
            {
                Document doc = builder.parse(new InputSource(new StringReader(entry)));
                Element rootElem = doc.getDocumentElement();
                // getAttribute returns "" (never null) when an attribute is absent.
                String id = rootElem.getAttribute("Id");
                String postedBy = rootElem.getAttribute("OwnerUserId").trim();
                String viewCount = rootElem.getAttribute("ViewCount");
                String postTypeId = rootElem.getAttribute("PostTypeId");
                String score = rootElem.getAttribute("Score");
                String title = rootElem.getAttribute("Title");
                String tags = rootElem.getAttribute("Tags");
                String answerCount = rootElem.getAttribute("AnswerCount");
                String commentCount = rootElem.getAttribute("CommentCount");
                String favoriteCount = rootElem.getAttribute("FavoriteCount");
                String creationDate = rootElem.getAttribute("CreationDate");

                Date parsedDate = null;
                if (creationDate.trim().length() > 0)
                {
                    try
                    {
                        parsedDate = DATE_PARSER.parse(creationDate);
                    }
                    catch (ParseException e)
                    {
                        context.getCounter("Bad Record Counters", "Posts missing CreationDate").increment(1);
                    }
                }

                if (postedBy.length() == 0 || postedBy.equals("-1"))
                {
                    context.getCounter("Bad Record Counters", "Posts with either empty UserId or UserId contains '-1'")
                        .increment(1);
                    try
                    {
                        // Sentinel far-future date for posts without a usable owner.
                        // Note: with lenient parsing, month "00" resolves to December 2099.
                        parsedDate = DATE_BUILDER.parse("2100-00-01");
                    }
                    catch (ParseException e)
                    {
                        // ignore
                    }
                }

                // Tags look like "<rooting><root>" after XML unescaping.
                tags = tags.trim();
                String[] tagTokens = null;
                if (tags.length() > 1)
                {
                    tagTokens = tags.substring(1, tags.length() - 1).split("><");
                }
                else
                {
                    context.getCounter("Bad Record Counters", "Untagged Posts").increment(1);
                }

                // Emit an empty timestamp rather than failing when no date could be parsed.
                String timestamp = parsedDate == null ? "" : String.valueOf(parsedDate.getTime());

                outputKey.set(id);
                StringBuilder sb = new StringBuilder(postedBy).append("\t").append(timestamp).append("\t")
                    .append(postTypeId).append("\t").append(title).append("\t").append(viewCount).append("\t")
                    .append(score).append("\t");
                if (tagTokens != null)
                {
                    sb.append(TAG_JOINER.join(tagTokens));
                }
                sb.append("\t");
                sb.append(answerCount).append("\t").append(commentCount).append("\t").append(favoriteCount);
                outputValue.set(sb.toString());
                context.write(outputKey, outputValue);
            }
        }
        catch (SAXException e)
        {
            context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
        }
        finally
        {
            // Return the builder to its initial state so it can be reused for the next line.
            builder.reset();
        }
    }
}
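
The line-by-line approach works because each <row ... /> line is a complete, well-formed XML document by itself, so it can be handed to a JAXP DocumentBuilder without any Hadoop-specific XML input format. The following is a minimal standalone sketch of that single step, runnable without Hadoop. It assumes each row sits on one line with its attribute values XML-escaped; the class name RowLineDemo and its methods are hypothetical helpers for illustration, not part of the answer's mapper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class RowLineDemo {

    // Parses one self-contained <row ... /> line and returns its root element.
    static Element parseRow(String line) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(line))).getDocumentElement();
    }

    // Splits a tag string such as "<rooting><root>" into its tag names.
    // Returns an empty list when there is nothing between the angle brackets.
    static List<String> splitTags(String tags) {
        List<String> result = new ArrayList<>();
        String trimmed = tags.trim();
        if (trimmed.length() > 2) {
            for (String tag : trimmed.substring(1, trimmed.length() - 1).split("><")) {
                result.add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample line; the tag markup is escaped as it would be inside an attribute.
        String line = "<row Id=\"1\" OwnerUserId=\"10\" Tags=\"&lt;rooting&gt;&lt;root&gt;\" />";
        Element row = parseRow(line);
        // The parser unescapes attribute values, so Tags comes back as "<rooting><root>".
        String joinedTags = String.join(",", splitTags(row.getAttribute("Tags")));
        System.out.println(row.getAttribute("Id") + "\t" + row.getAttribute("OwnerUserId") + "\t" + joinedTags);
    }
}
```

Creating a new DocumentBuilder per line keeps the sketch short; inside a mapper it is cheaper to build it once in setup and call reset() after each record, as the mapper above does.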
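I found a solution similar to the string manipulation I mentioned, but it is inefficient or not reusable if I want to extract all the fields of the above format.

I think you could use the Avro format for the XML; Hadoop should be able to parse it efficiently.

@AngeloImmediata, could you explain that a bit more clearly?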
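
On the reusability concern raised in the comments: one possible way to extract every field without naming each attribute is to iterate over the element's attributes with the standard DOM NamedNodeMap API. The sketch below is an illustration under that assumption, not something proposed in the thread; the class name RowAttributes and the method toMap are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class RowAttributes {

    // Parses one <row ... /> line and returns all of its attributes as name/value pairs.
    // A TreeMap is used because the DOM does not guarantee any attribute order.
    static Map<String, String> toMap(String line) throws Exception {
        Element row = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(line)))
            .getDocumentElement();
        NamedNodeMap attrs = row.getAttributes();
        Map<String, String> fields = new TreeMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            fields.put(attr.getNodeName(), attr.getNodeValue());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample line in the same shape as the rows in the question.
        String line = "<row Id=\"2\" Score=\"10\" Title=\"a &amp; b\" />";
        for (Map.Entry<String, String> field : toMap(line).entrySet()) {
            System.out.println(field.getKey() + "=" + field.getValue());
        }
    }
}
```

A mapper could then write whichever keys it needs from the returned map, so adding a field does not require another getAttribute call.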