如何使用Java从ATOM提要中提取XHTML?
我试图从RSS提要中提取一些XHTML,以便将其放在Web视图中。讨论中的RSS提要有一个名为如何使用Java从ATOM提要中提取XHTML?,java,android,xml,atom-feed,Java,Android,Xml,Atom Feed,我试图从RSS提要中提取一些XHTML,以便将其放在Web视图中。讨论中的RSS提要有一个名为的标记,内容中的字符是XHTML。(我正在删除的网站是一个博客订阅源) 拉取此内容的最佳方式是什么?我试图用XPath攻击它。你喜欢这个工作吗 public static String parseAtom (InputStream atomIS) throws Exception { // Below should yield the second content block S
的标记,内容中的字符是XHTML。(我正在删除的网站是一个博客订阅源)
拉取此内容的最佳方式是什么?
我试图用XPath攻击它。你喜欢这个工作吗
public static String parseAtom (InputStream atomIS)
throws Exception {
// Below should yield the second content block
String xpathString = "(//*[starts-with(name(),"content")])[2]";
// or, String xpathString = "//*[name() = 'content'][2]";
// remove the '[2]' to get all content tags or get the count,
// if needed, and then target specific blocks
//String xpathString = "count(//*[starts-with(name(),"content")])";
// note the evaluate expression below returns a glob and not a node set
XPathFactory xpf = XPathFactory.newInstance ();
XPath xpath = xpf.newXPath ();
XPathExpression xpathCompiled = xpath.compile (xpathString);
// use the first to recast and evaluate as NodeList
//Object atomOut = xpathCompiled.evaluate (
// new InputSource (atomIS), XPathConstants.NODESET);
String atomOut = xpathCompiled.evaluate (
new InputSource (atomIS), XPathConstants.STRING);
System.out.println (atomOut);
return atomOut;
}
我可以在这里看到您的问题,这些解析器之所以不能产生正确的结果,是因为您的
标记的内容没有包装到
中,在找到更合适的解决方案之前我会做什么我会使用快速而肮脏的技巧:
private void parseFile(String fileName) throws IOException {
String line;
BufferedReader br = new BufferedReader(new FileReader(new File(fileName)));
StringBuilder sb = new StringBuilder();
boolean match = false;
while ((line = br.readLine()) != null) {
if(line.contains("<content")){
sb.append(line);
sb.append("\n");
match = true;
continue;
}
if(match){
sb.append(line);
sb.append("\n");
match = false;
}
if(line.contains("</content")){
sb.append(line);
sb.append("\n");
}
}
System.out.println(sb.toString());
}
private void parseFile(字符串文件名)引发IOException{
弦线;
BufferedReader br=新的BufferedReader(新文件读取器(新文件名));
StringBuilder sb=新的StringBuilder();
布尔匹配=假;
而((line=br.readLine())!=null){
if(line.contains(“不太好看,但这是我用来解析Blogger使用的ATOM提要的(本质)。代码很恶心,但它来自一个真实的应用程序。无论如何,你可能会得到它的一般风格
final String TAG_FEED = "feed";
public int parseXml(Reader reader) {
XmlPullParserFactory factory = null;
StringBuilder out = new StringBuilder();
int entries = 0;
try {
factory = XmlPullParserFactory.newInstance();
factory.setNamespaceAware(true);
XmlPullParser xpp = factory.newPullParser();
xpp.setInput(reader);
while (true) {
int eventType = xpp.next();
if (eventType == XmlPullParser.END_DOCUMENT) {
break;
} else if (eventType == XmlPullParser.START_DOCUMENT) {
out.append("Start document\n");
} else if (eventType == XmlPullParser.START_TAG) {
String tag = xpp.getName();
// out.append("Start tag " + tag + "\n");
if (TAG_FEED.equalsIgnoreCase(tag)) {
entries = parseFeed(xpp);
}
} else if (eventType == XmlPullParser.END_TAG) {
// out.append("End tag " + xpp.getName() + "\n");
} else if (eventType == XmlPullParser.TEXT) {
// out.append("Text " + xpp.getText() + "\n");
}
}
out.append("End document\n");
} catch (XmlPullParserException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
// return out.toString();
return entries;
}
private int parseFeed(XmlPullParser xpp) throws XmlPullParserException, IOException {
int depth = xpp.getDepth();
assert (depth == 1);
int eventType;
int entries = 0;
xpp.require(XmlPullParser.START_TAG, null, TAG_FEED);
while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT) && (xpp.getDepth() > depth)) {
// loop invariant: At this point, the parser is not sitting on
// end-of-document, and is at a level deeper than where it started.
if (eventType == XmlPullParser.START_TAG) {
String tag = xpp.getName();
// Log.d("parseFeed", "Start tag: " + tag); // Uncomment to debug
if (FeedEntry.TAG_ENTRY.equalsIgnoreCase(tag)) {
FeedEntry feedEntry = new FeedEntry(xpp);
feedEntry.persist(this);
entries++;
// Log.d("FeedEntry", feedEntry.title); // Uncomment to debug
// xpp.require(XmlPullParser.END_TAG, null, tag);
}
}
}
assert (depth == 1);
return entries;
}
class FeedEntry {
String id;
String published;
String updated;
// Timestamp lastRead;
String title;
String subtitle;
String authorName;
int contentType;
String content;
String preview;
String origLink;
String thumbnailUri;
// Media media;
static final String TAG_ENTRY = "entry";
static final String TAG_ENTRY_ID = "id";
static final String TAG_TITLE = "title";
static final String TAG_SUBTITLE = "subtitle";
static final String TAG_UPDATED = "updated";
static final String TAG_PUBLISHED = "published";
static final String TAG_AUTHOR = "author";
static final String TAG_CONTENT = "content";
static final String TAG_TYPE = "type";
static final String TAG_ORIG_LINK = "origLink";
static final String TAG_THUMBNAIL = "thumbnail";
static final String ATTRIBUTE_URL = "url";
/**
* Create a FeedEntry by pulling its bits out of an XML Pull Parser. Side effect: Advances
* XmlPullParser.
*
* @param xpp
*/
public FeedEntry(XmlPullParser xpp) {
int eventType;
int depth = xpp.getDepth();
assert (depth == 2);
try {
xpp.require(XmlPullParser.START_TAG, null, TAG_ENTRY);
while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
&& (xpp.getDepth() > depth)) {
if (eventType == XmlPullParser.START_TAG) {
String tag = xpp.getName();
if (TAG_ENTRY_ID.equalsIgnoreCase(tag)) {
id = Util.XmlPullTag(xpp, TAG_ENTRY_ID);
} else if (TAG_TITLE.equalsIgnoreCase(tag)) {
title = Util.XmlPullTag(xpp, TAG_TITLE);
} else if (TAG_SUBTITLE.equalsIgnoreCase(tag)) {
subtitle = Util.XmlPullTag(xpp, TAG_SUBTITLE);
} else if (TAG_UPDATED.equalsIgnoreCase(tag)) {
updated = Util.XmlPullTag(xpp, TAG_UPDATED);
} else if (TAG_PUBLISHED.equalsIgnoreCase(tag)) {
published = Util.XmlPullTag(xpp, TAG_PUBLISHED);
} else if (TAG_CONTENT.equalsIgnoreCase(tag)) {
int attributeCount = xpp.getAttributeCount();
for (int i = 0; i < attributeCount; i++) {
String attributeName = xpp.getAttributeName(i);
if (attributeName.equalsIgnoreCase(TAG_TYPE)) {
String attributeValue = xpp.getAttributeValue(i);
if (attributeValue
.equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_HTML)) {
contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_HTML;
} else if (attributeValue
.equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_XHTML)) {
contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_XHTML;
} else {
contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_TEXT;
}
break;
}
}
content = Util.XmlPullTag(xpp, TAG_CONTENT);
extractPreview();
} else if (TAG_AUTHOR.equalsIgnoreCase(tag)) {
// Skip author for now -- it is complicated
int authorDepth = xpp.getDepth();
assert (authorDepth == 3);
xpp.require(XmlPullParser.START_TAG, null, TAG_AUTHOR);
while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
&& (xpp.getDepth() > authorDepth)) {
}
assert (xpp.getDepth() == 3);
xpp.require(XmlPullParser.END_TAG, null, TAG_AUTHOR);
} else if (TAG_ORIG_LINK.equalsIgnoreCase(tag)) {
origLink = Util.XmlPullTag(xpp, TAG_ORIG_LINK);
} else if (TAG_THUMBNAIL.equalsIgnoreCase(tag)) {
thumbnailUri = Util.XmlPullAttribute(xpp, tag, null, ATTRIBUTE_URL);
} else {
@SuppressWarnings("unused")
String throwAway = Util.XmlPullTag(xpp, tag);
}
}
} // while
} catch (XmlPullParserException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
assert (xpp.getDepth() == 2);
}
}
public static String XmlPullTag(XmlPullParser xpp, String tag)
throws XmlPullParserException, IOException {
xpp.require(XmlPullParser.START_TAG, null, tag);
String itemText = xpp.nextText();
if (xpp.getEventType() != XmlPullParser.END_TAG) {
xpp.nextTag();
}
xpp.require(XmlPullParser.END_TAG, null, tag);
return itemText;
}
public static String XmlPullAttribute(XmlPullParser xpp,
String tag, String namespace, String name)
throws XmlPullParserException, IOException {
assert (!TextUtils.isEmpty(tag));
assert (!TextUtils.isEmpty(name));
xpp.require(XmlPullParser.START_TAG, null, tag);
String itemText = xpp.getAttributeValue(namespace, name);
if (xpp.getEventType() != XmlPullParser.END_TAG) {
xpp.nextTag();
}
xpp.require(XmlPullParser.END_TAG, null, tag);
return itemText;
}
你能提供一些示例xhtml吗?@Waqas我用一个示例更新了我的问题。或者另一个库@ant它似乎与Android不兼容,或者我错了吗?不太正常,我想你不能提供一个更健壮的示例吗?我不太熟悉Xpath,所以可能我犯了一个愚蠢的错误。我将更新我的示例如果您愿意查看,请使用我的测试的pastebin提问。抱歉,我已将XPath表达式更改为在给定场景中工作。感谢更新,当我有一些停机时间时,我将不得不使用您的XPath进行测试。但是,我能够获得我所关注的内容节点的句柄(问题是命名空间)但它仍然没有完全退出xhtml。仍然支持您的帮助:)非常感谢。这对我帮助很大!感谢您对CDATA的了解,我最后只是做了一个字符串替换来在内容后添加CDATA标记。SAX仍然无法处理它,但DOM正在按照我的要求读取它。可能不是最佳解决方案,但它满足了我的要求。不真实。xml中没有要求元素内容需要包装,我n CDATA。这仅在数据未正确编码为XML时才有必要。他正在处理服务器提供的内容,不必强制更改结果集。@ingyhere是的,我不是说我对解决方案感到满意,我只是说至少它可以工作:-)我对答案也不满意,因此又快又脏
。这些问题永远不会持续下去,应该不惜一切代价避免这些问题感谢伟大的答案!这里有很多值得深入了解的地方,这看起来比我最终使用的解决方案好得多,也更健壮,所以我迫不及待地想在停机后尝试解决这些问题。
feedEntry.persist(this);