Java: How to process XML files in a Hadoop Mapper


I have large XML files in the format shown below. I can read them line by line and do some string manipulation, since I only need to extract the values of a few fields. But how, in general, do we process files in this format? I found the Mahout XML parser, but I don't think it works for this format.

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="13" CreationDate="2010-09-13T19:16:26.763" Score="155" ViewCount="160162" Body="&lt;p&gt;This is a common question by those who have just rooted their phones.  What apps, ROMs, benefits, etc. do I get from rooting?  What should I be doing now?&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133" LastActivityDate="2013-09-03T05:57:21.440" Title="I've rooted my phone.  Now what?  What do I gain from rooting?" Tags="&lt;rooting&gt;&lt;root&gt;" AnswerCount="2" CommentCount="0" FavoriteCount="107" CommunityOwnedDate="2011-01-25T08:44:10.820" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="4" CreationDate="2010-09-13T19:17:17.917" Score="10" ViewCount="966" Body="&lt;p&gt;I have a Google Nexus One with Android 2.2. I didn't like the default SMS-application so I installed Handcent-SMS. Now when I get an SMS, I get notified twice. How can I fix this?&lt;/p&gt;&#xA;" OwnerUserId="7" LastEditorUserId="981" LastEditDate="2011-11-01T18:30:32.300" LastActivityDate="2011-11-01T18:30:32.300" Title="I installed another SMS application, now I get notified twice" Tags="&lt;2.2-froyo&gt;&lt;sms&gt;&lt;notifications&gt;&lt;handcent-sms&gt;" AnswerCount="3" FavoriteCount="2" />
</posts>

The data you posted comes from the SO data dump (I know, because I'm currently working with it on Hadoop). Below is the mapper I wrote, which turns this file into a tab-separated file.

You basically read it line by line and use the JAXP API to parse out the information you need.
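As a minimal, Hadoop-free sketch of that idea (the class and method names here are illustrative, not from the mapper below): each line of the dump holds one complete, self-closing `<row>` element, so it can be handed to a DOM parser on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RowParseDemo {
    // Parse one self-contained <row .../> line and return "Id<TAB>tag1,tag2,...".
    static String parseRow(String line) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element row = builder.parse(new InputSource(new StringReader(line))).getDocumentElement();
        String id = row.getAttribute("Id");
        // The parser un-escapes the attribute, so Tags comes back as "<rooting><root>";
        // drop the outer angle brackets and split on "><" to get the individual tags.
        String tags = row.getAttribute("Tags");
        String[] tagTokens = tags.substring(1, tags.length() - 1).split("><");
        return id + "\t" + String.join(",", tagTokens);
    }

    public static void main(String[] args) throws Exception {
        String line = "<row Id=\"1\" PostTypeId=\"1\" Score=\"155\" "
                + "Tags=\"&lt;rooting&gt;&lt;root&gt;\" />";
        System.out.println(parseRow(line)); // 1	rooting,root
    }
}
```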

import java.io.IOException;
import java.io.StringReader;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import com.google.common.base.Joiner;

public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text>
{

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private DocumentBuilder builder;
    private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();
    // 2008-07-31T21:42:52.667
    private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
    private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        try
        {
            builder = factory.newDocumentBuilder();
        }
        catch (ParserConfigurationException e)
        {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable inputKey, Text inputValue, Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException
    {
        try
        {
            String entry = inputValue.toString();
            if (entry.contains("<row "))
            {
                Document doc = builder.parse(new InputSource(new StringReader(entry)));
                Element rootElem = doc.getDocumentElement();

                String id = rootElem.getAttribute("Id");
                String postedBy = rootElem.getAttribute("OwnerUserId").trim();
                String viewCount = rootElem.getAttribute("ViewCount");
                String postTypeId = rootElem.getAttribute("PostTypeId");
                String score = rootElem.getAttribute("Score");
                String title = rootElem.getAttribute("Title");
                String tags = rootElem.getAttribute("Tags");
                String answerCount = rootElem.getAttribute("AnswerCount");
                String commentCount = rootElem.getAttribute("CommentCount");
                String favoriteCount = rootElem.getAttribute("FavoriteCount");
                String creationDate = rootElem.getAttribute("CreationDate");

                Date parsedDate = null;
                if (creationDate != null && creationDate.trim().length() > 0)
                {
                    try
                    {
                        parsedDate = DATE_PARSER.parse(creationDate);
                    }
                    catch (ParseException e)
                    {
                        context.getCounter("Bad Record Counters", "Posts with unparsable CreationDate").increment(1);
                    }
                }

                if (postedBy.length() == 0 || postedBy.trim().equals("-1"))
                {
                    context.getCounter("Bad Record Counters", "Posts with either empty UserId or UserId contains '-1'")
                            .increment(1);
                    try
                    {
                        // Sentinel date for posts without a valid owner ("00" is not a valid month)
                        parsedDate = DATE_BUILDER.parse("2100-01-01");
                    }
                    catch (ParseException e)
                    {
                        // ignore
                    }
                }

                tags = tags.trim();
                String[] tagTokens = null;

                if (tags.length() > 1)
                {
                    tagTokens = tags.substring(1, tags.length() - 1).split("><");
                }
                else
                {
                    context.getCounter("Bad Record Counters", "Untagged Posts").increment(1);
                }

                outputKey.set(id); // set() replaces the previous contents, so no clear() is needed

                // Guard against a null parsedDate (missing or unparsable CreationDate) to avoid an NPE
                StringBuilder sb = new StringBuilder(postedBy).append("\t")
                        .append(parsedDate != null ? parsedDate.getTime() : -1).append("\t")
                        .append(postTypeId).append("\t").append(title).append("\t").append(viewCount).append("\t").append(score)
                        .append("\t");

                if (tagTokens != null)
                {
                    sb.append(TAG_JOINER.join(tagTokens)).append("\t");
                }
                else
                {
                    sb.append("").append("\t");
                }
                sb.append(answerCount).append("\t").append(commentCount).append("\t").append(favoriteCount);

                outputValue.set(sb.toString());

                context.write(outputKey, outputValue);
            }
        }
        catch (SAXException e)
        {
            context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
        }
        finally
        {
            builder.reset();
        }
    }
}
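The mapper above is not a complete job on its own. A driver along these lines would run it (the class name and job name are illustrative, assuming the Hadoop 2.x `mapreduce` API); note it is a map-only job using the default `TextInputFormat`, which is what delivers the file to the mapper one line at a time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StackoverflowDataWranglerJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "stackoverflow-posts-to-tsv");
        job.setJarByClass(StackoverflowDataWranglerJob.class);
        job.setMapperClass(StackoverflowDataWranglerMapper.class);
        job.setNumReduceTasks(0);          // map-only: each mapper writes its TSV lines directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the dump's Posts.xml
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One caveat with the line-at-a-time approach: it relies on each `<row>` element occupying exactly one physical line, which holds for the SO data dump but not for XML in general.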
Comments: I found a solution similar to the string manipulation I mentioned in the question, but it is inefficient and not reusable if I want to extract all of the fields in the above format. — I think you can use the Avro format for the XML; Hadoop should be able to parse it efficiently. @AngeloImmediata, could you explain that in a bit more detail?