java.util.Scanner and Wikipedia

java

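I am trying to use java.util.Scanner to fetch Wikipedia pages and use their content for a word-based search. It mostly works, but for some words I get errors. After going through the code and running a few checks, it turns out that for certain words the encoding is not recognized (or something along those lines) and the content is no longer readable. This is the code used to fetch the page: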

// Start

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
// End

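The problem shows up with the word "pubblico" on the Italian Wikipedia. The println output for pubblico looks like this (cut): èèè½]Ksr>èè½~E ï½ïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïï

Do you have any idea why? Judging from the page source and the headers, this page looks the same as the others and has the same encoding.

It turned out that the content is compressed. Can I tell Wikipedia not to send me their pages compressed, or is decompressing them the only way? Thanks.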

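Try using a Scanner with an explicitly specified charset:

public Scanner(InputStream source, String charsetName)

For the single-argument constructor Scanner(InputStream source), the documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.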
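
To make the difference concrete, here is a minimal sketch that decodes the same UTF-8 bytes twice, once with the right charset name and once with the wrong one. The class name `CharsetScannerDemo`, the helper `readAll`, and the sample text are made up for illustration; only the two-argument `Scanner` constructor comes from the answer above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class CharsetScannerDemo {

    // Hypothetical helper: reads the whole stream into one String,
    // decoding the bytes with the charset named by charsetName.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\Z"); // treat the entire input as a single token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Sample text with accented letters, written with Unicode escapes
        byte[] utf8 = "Perch\u00e9 il pubblico \u00e8 cos\u00ec".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were actually written in
        System.out.println(readAll(new ByteArrayInputStream(utf8), "UTF-8"));

        // Decoded with a mismatched charset: the accented letters come out garbled
        System.out.println(readAll(new ByteArrayInputStream(utf8), "ISO-8859-1"));
    }
}
```

Note that this only addresses character decoding. If the bytes are still gzip-compressed, no charset name will make them readable.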

Try using a Reader instead of an InputStream. I think it works like this:

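connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// Content-Type looks like "text/html; charset=UTF-8"; the header may be missing entirely
String ctype = connection.getContentType();
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    // Decode with the charset the server declared
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8).trim()));
else
    // No charset declared: fall back to the platform default
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");   // read the whole response as a single token
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);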

You can also pass the charset directly to the Scanner constructor, as shown in the other answer.
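
Taking everything after "charset=" can pick up quotes or further parameters if the header carries them. A slightly more defensive way to pull the charset out of a Content-Type value might look like the sketch below. The class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, and the fallback charset is whatever the caller chooses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a parameter starting with "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset can be handed straight to `new InputStreamReader(stream, charset)`.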

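You need to use a URLConnection so that you can inspect the Content-Type header of the response. That header should tell you which character encoding to use when creating the Scanner.

Specifically, look at the "charset" parameter of the Content-Type header.

To suppress gzip compression, set the Accept-Encoding request header to "identity".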
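
As a rough sketch of that last point, the header has to be set before the connection is actually opened. The class `IdentityRequest` and the method `openUncompressed` are names invented for this example, and the URL is just the page from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class IdentityRequest {

    // Hypothetical helper: prepares (but does not yet connect) a connection
    // that asks the server for an uncompressed response body.
    static URLConnection openUncompressed(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            // setRequestProperty replaces any earlier value for this header
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openUncompressed("http://it.wikipedia.org/wiki/pubblico");
        // No network traffic has happened yet; this only shows the header that will be sent
        System.out.println(connection.getRequestProperty("Accept-Encoding"));
    }
}
```

It is still worth checking getContentEncoding() on the response, since the request header only expresses a preference.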

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding does not change. Why?

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

// Needs GZIPInputStream, InflaterInputStream and Inflater from java.util.zip
InputStream resultingInputStream;                     // stream the downloaded page is read from
String encoding = connection.getContentEncoding();    // content coding sent by the server: identity, gzip or deflate (may be null)
// Pick the matching decompressor for the response body
if ("gzip".equals(encoding)) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if ("deflate".equals(encoding)) {
    // Inflater(true) expects raw deflate data without the zlib header
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that pulls the whole page out of the stream into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();

This really works.

Don't use Content-Encoding for this. It specifies the compression that was applied and has nothing to do with the character encoding.

I have updated my answer to address your gzip problem.
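
To illustrate how the two headers play different roles, the sketch below first undoes the compression named by Content-Encoding and only then decodes the bytes with the charset from Content-Type. Everything here is an assumption made for the demo: the class `DecodeBody`, the helpers `decode` and `gzip`, and the in-memory test data that stands in for a real HTTP response. Only gzip is handled, to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DecodeBody {

    // Step 1: undo the compression (Content-Encoding).
    // Step 2: turn the resulting bytes into characters (charset from Content-Type).
    static String decode(InputStream raw, String contentEncoding, Charset charset) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(raw)
                    : raw; // null or "identity": bytes are used as they are
            try (Scanner scanner = new Scanner(new InputStreamReader(in, charset))) {
                scanner.useDelimiter("\\Z"); // whole body as a single token
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo-only helper: gzip a string in memory to imitate a compressed response body
    static byte[] gzip(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = gzip("Il pubblico \u00e8 qui", StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(body), "gzip", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the arguments would come from connection.getInputStream(), connection.getContentEncoding(), and the charset parsed out of connection.getContentType().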