Java - Parsing HTML - Getting text

java html parsing

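I am trying to get text from a website. When you change the language, the URL gets an "/en" inside it, but the page that contains the information I want does not.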

http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92

html tags: (the text contains the description of the photo)
<div id="redx_gallery_pic_title"> text text </div>

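The problem is that the website is in German and I want the English text, but my script only gets the German version.

Any idea how I can do this?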

java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine = null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
    theText.append(inputLine + "\n");
String html = theText.toString();
in.close();

String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");

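The site is internationalized, with German as the default. You need to tell the server which language you accept by specifying the desired language code in the Accept-Language request header.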

URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
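
The `// ...` above still has to read the response body. A minimal sketch of one way to do that with only the JDK follows. The class and method names (`LocalizedFetch`, `open`, `readAll`) are made up for illustration, and decoding as UTF-8 is an assumption about the site's encoding rather than something read from the response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of any library.
class LocalizedFetch {

    // Prepares a connection that asks the server for the given language,
    // e.g. "en" or "ro". No request is sent until the response is read.
    public static URLConnection open(String url, String languageCode) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Accept-Language", languageCode);
        return connection;
    }

    // Reads the whole stream as text, decoding with the charset passed in
    // instead of the platform default. Each line is terminated with '\n'.
    public static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(input, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

With these helpers, fetching the page would look like `String html = LocalizedFetch.readAll(LocalizedFetch.open(url, "en").getInputStream(), StandardCharsets.UTF_8);`. Passing the charset explicitly keeps the result independent of the machine the code runs on.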

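Unrelated to the concrete problem, I suggest taking a look at Jsoup as an HTML parser. Its jQuery-like CSS selector syntax is more convenient, so the result is much simpler than your attempt: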

String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3

That's all.

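Which programming language are you using? Which API are you using to parse the HTML? Show the code you have so far for fetching the HTML content.

I posted an answer, but in the future you should really mention and tag this. There are countless ways to parse HTML from a website, and you did not tell us anything about yours.

But what if I want the text in Romanian? If I use "ro" instead of "en", I don't get the special characters.

That is because you are relying on the platform default encoding to read the response body. You need to use the other InputStreamReader constructor, the one that takes the charset as its second argument, and pass "UTF-8". Jsoup takes care of this fully transparently, by the way :)

You are right, Jsoup is easier, but I still don't know how to set the charset (for the Jsoup code).

You don't need to. As said, it takes care of this fully transparently; it is smart enough to work it out from the HTTP response headers. The problem is where you display or save the characters. Are you displaying them with System.out.println() in an IDE like Eclipse? If so, set the Eclipse console encoding via Window > Preferences > General > Workspace and set Text file encoding to UTF-8. Otherwise it will use the platform default. For more hints, see also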