Ignoring HTML special characters when reading a website with Java DocumentBuilder()
Tags: java, sax, android-xml

I am trying to read a website (HTML) with Java DocumentBuilder(). It reads the page, but when the HTML contains `&pound;`, `&ldquo;` or any other HTML special character, it stops reading anything after the special character and returns null instead. Many others have asked similar questions, but none of them received a constructive answer. If anyone knows a way to solve this, please let me know. Please find my code here.
Increased from £488 to £600
Ronals said: “Schools in this area are being educated”
To read these, I wrote the following code:
private String extractTheTitle(String responseBody) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    ByteArrayInputStream encXML = new ByteArrayInputStream(responseBody.getBytes("UTF8"));
    Document embeddedDoc = builder.parse(encXML);
    NodeList titleNodes = embeddedDoc.getElementsByTagName("p");
    if (titleNodes != null && titleNodes.getLength() > 0) {
        for (int i = 0; i < titleNodes.getLength(); i++) {
            Element aTitleElement = (Element) titleNodes.item(i);
            aTitleElement.normalize();
            Node titleContent = aTitleElement.getFirstChild();
            String nodeText = titleContent.getNodeValue();
            myArrlist.add(i, "<p>" + nodeText + "</p>");
        }
    }
}
Each aTitleElement (<p>..</p>) contains several nodes, one of which is an entity. So instead of calling getFirstChild, you have to iterate over all child nodes; normalize does not help here.
StringBuilder pText = new StringBuilder();
NodeList children = aTitleElement.getChildNodes();
for (int j = 0; j < children.getLength(); ++j) {
    Node child = children.item(j);
    if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
        ...
    }
    pText.append(child.getNodeValue());
}
nodeText = pText.toString();
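To make the point about getFirstChild concrete, here is a minimal sketch. It assumes a <p> element built by hand with createEntityReference, so it does not depend on how a particular parser reports entities. The class name ChildTextDemo and both method names are hypothetical, and writing the reference back as &name; is only one possible treatment, not necessarily what the answer had in mind.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class; not part of the original post.
class ChildTextDemo {

    /** Concatenates the text of every child of the element, not just the first one. */
    static String collectText(Element element) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int j = 0; j < children.getLength(); ++j) {
            Node child = children.item(j);
            if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                // An entity reference has no node value; keep it visible by name.
                text.append('&').append(child.getNodeName()).append(';');
            } else if (child.getNodeValue() != null) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    /** Builds <p> with three children: text, an entity reference, text. */
    static Element sampleParagraph() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("from "));
        p.appendChild(doc.createEntityReference("pound"));
        p.appendChild(doc.createTextNode("488"));
        return p;
    }

    public static void main(String[] args) {
        Element p = sampleParagraph();
        // Only the text before the entity is seen through getFirstChild().
        System.out.println(p.getFirstChild().getNodeValue());
        // Iterating over all children recovers the rest of the paragraph.
        System.out.println(collectText(p));
    }
}
```

In this sketch getFirstChild() yields only the leading text node, which matches the symptom in the question: everything after the special character is lost unless all children are visited.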
The code that extracts the data is given below. Substitute the URL of the website.
*New code
Have you tried URLEncoder and URLDecoder?
I just tested your code and it works without problems. Could you explain how you obtain responseBody? I suspect the problem may be there.
I get the response body with an HttpGet request and then pass it to the function. This is the code I use to get responseBody: HttpGet getMethod = new HttpGet(pageURL); ResponseHandler<String> responseHandler = new BasicResponseHandler(); String responseBody = client.execute(getMethod, responseHandler);
DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document embeddedDoc = builder.parse(new File("/home/joop/test.html"));
NodeList pNodes = embeddedDoc.getElementsByTagName("p");
StringBuilder pText = new StringBuilder();
for (int i = 0; i < pNodes.getLength(); ++i) {
    Element pElement = (Element) pNodes.item(i);
    NodeList children = pElement.getChildNodes();
    for (int j = 0; j < children.getLength(); ++j) {
        Node child = children.item(j);
        String value = child.getNodeValue();
        if (value == null) {
            System.out.println("node name=" + child.getNodeName()
                    + ": " + child.getNodeType());
        }
        pText.append(value);
    }
    pText.append("\n");
}
String text = pText.toString();
System.out.println("FOUND TEXT:");
System.out.println(text);

FOUND TEXT:
Saluton,£“ mondo!
public void process() {
    HttpGet getMethod = new HttpGet("URL OF THE WEB SITE GOES HERE");
    try {
        ResponseHandler<String> responseHandler = new BasicResponseHandler();
        // client is defined elsewhere (not shown in the post)
        String websiteBody = client.execute(getMethod, responseHandler);
        String title = extractBody(websiteBody);
    } catch (Exception e) {
        // report the failure instead of silently dropping it
        e.printStackTrace();
    }
}

private String extractBody(String responseBody) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document embeddedDoc = builder.parse(new InputSource(new StringReader(responseBody)));
    //ByteArrayInputStream encXML = new ByteArrayInputStream(responseBody.getBytes("UTF8"));
    //Document embeddedDoc = builder.parse(encXML);
    //Document embeddedDoc = builder.parse(new File("/home/joop/test.html"));
    NodeList pNodes = embeddedDoc.getElementsByTagName("p");
    StringBuilder pText = new StringBuilder();
    for (int i = 0; i < pNodes.getLength(); ++i) {
        Element pElement = (Element) pNodes.item(i);
        NodeList children = pElement.getChildNodes();
        for (int j = 0; j < children.getLength(); ++j) {
            Node child = children.item(j);
            String value = child.getNodeValue();
            if (value == null) {
                // No node value (for example an entity reference): resolve it by name.
                System.out.println("node name=" + child.getNodeName()
                        + ": " + child.getNodeType());
                // convert(...) is a helper that is not shown in the post
                value = convert(child.getNodeName());
            }
            System.out.println(value);
            pText.append(value);
        }
        pText.append("\n");
    }
    String text = pText.toString();
    System.out.println("FOUND TEXT:");
    System.out.println(text);
    return text;
}
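The convert helper called in the code above is not shown in the post. As an illustration only, it might look something like the sketch below, which assumes that the helper receives the entity name (such as pound or ldquo) and returns the matching character. The class name EntityNames, the small lookup table, and the fallback of returning &name; for unknown names are all assumptions, not the asker's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the convert(...) helper that the post does not show.
class EntityNames {

    // A deliberately small table; a real page may use many more named entities.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("pound", "\u00a3"); // pound sign
        NAMED.put("ldquo", "\u201c"); // left double quotation mark
        NAMED.put("rdquo", "\u201d"); // right double quotation mark
        NAMED.put("nbsp", "\u00a0");  // no-break space
    }

    /** Maps an entity name to its character, or returns "&name;" when the name is unknown. */
    static String convert(String entityName) {
        String replacement = NAMED.get(entityName);
        return replacement != null ? replacement : "&" + entityName + ";";
    }

    public static void main(String[] args) {
        System.out.println(convert("pound") + "488");
        System.out.println(convert("unknown"));
    }
}
```

Returning the raw reference for unknown names keeps the text readable and makes missing table entries easy to spot, rather than appending the string "null" to the output.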