Java Jsoup response: every second character is garbage (encoding problem?)
Tags: java, character-encoding, jsoup


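As part of an Android app, I use a Java/Jsoup-based HTML crawler. It worked well until a few weeks ago, but now I get very strange results when parsing the response. This is the page I am crawling (all of the errors occur before I log in):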

This is how I obtain the Jsoup response object:

Connection connection = Jsoup
                .connect("https://stine.uni-hamburg.de/")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0")
                .referrer("https://www.stine.uni-hamburg.de/scripts/mgrqispi.dll")
                .method(postData == null ? Connection.Method.GET : Connection.Method.POST)
                .timeout(10000)
                .cookies(m_cookies);
Connection.Response response = connection.execute();
I also tried removing every part of the connection setup that is not strictly necessary, and I tried a plain

Jsoup.parse(new URL("https://www.stine.uni-hamburg.de"), 10000);

System.out.println("resp: " + response);
System.out.println("status: " + response.statusCode());
System.out.println("content-type: " + response.contentType());
System.out.println("header: " + response.headers().toString());
System.out.println("content: " + response.parse().body().toString());
With either approach, the result looks like this:

resp: org.jsoup.helper.HttpConnection$Response@756535fa
status: 200
content-type: text/html
header: {ETag="09cd9ca88eccf1:0", Date=Fri, 28 Nov 2014 20:53:49 GMT, Vary=Accept-Encoding, Content-Length=1210, Last-Modified=Mon, 20 Oct 2014 17:10:48 GMT, Content-Encoding=gzip, Accept-Ranges=bytes, Content-Type=text/html, X-Powered-By=ASP.NET, Server=Microsoft-IIS/7.5}
content: ��<��!��D��O��C��T��Y��P��E�� ��H��T��M��L�� ��P��U��B��L��I��C�� ��"��-��/��/��W��3��C��/��/��D��T��D�� ��H��T��M��L�� ��4��.��0��1��/��/��E��N��"�� ��"��h��t��t��p��:��/��/��w��w��w��.��w��3��.��o��r��g��/��T��R��/��h��t��m��l��4��/��s��t��r��i��c��t��.��d��t��d��"��>��
��<��h��t��m��l��>��
��  ��<��h��e��a��d��>��
��  ��
��  ��<��!��-��-��
��  ����� ��D��A��T��E��N��L��O��T��S��E��N�� ��I��N��F��O��R��M��A��T��I��O��N��S��S��Y��S��T��E��M��E�� ��G��M��B��H��
��  ��e��-��m��a��i��l��:�� ��  ��  ��i��n��f��o��@��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
��  ��w��e��b��:�� ��   ��  ��  ��h��t��t��p��:��/��/��w��w��w��.��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
��  ��
��  ��c��u��s��t��o��m��e��r��:�� ��    ��  ��u��h��h��
��  ��v��e��r��s��i��o��n��:�� ��   ��  ��5��.��4��0��.��0��0��8��
��  ��f��i��l��e��n��a��m��e��:��   ��  ��i��n��d��e��x��.��h��t��m��
��/��/��-��-��>��
��
��  ��
��  ��  ��<��t��i��t��l��e��>��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��<��/��t��i��t��l��e��>��
��  ��  ��  ��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��X��-��U��A��-��C��o��m��p��a��t��i��b��l��e��"�� ��c��o��n��t��e��n��t��=��"��I��E��=��E��m��u��l��a��t��e��I��E��9��"�� ��/��>�� ��<��!��-��-�� ��I��E��9�� ��d��o��c��u��m��e��n��t�� ��m��o��d��e�� ��o��n��l��y�� ��-��-��>��
��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��c��a��c��h��e��-��c��o��n��t��r��o��l��"��  �� ��   ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��e��x��p��i��r��e��s��"�� �� ��  ��  ��  ��c��o��n��t��e��n��t��=��"��-��1��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� ��    ��  ��  ��  ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��
��  ��  ��<��m��e��t��a�� ��n��a��m��e��=��"��v��i��e��w��p��o��r��t��"�� ��c��o��n��t��e��n��t��=��"��w��i��d��t��h��=��d��e��v��i��c��e��-��w��i��d��t��h��,�� ��i��n��i��t��i��a��l��-��s��c��a��l��e��=��1��,��u��s��e��r��-��s��c��a��l��a��b��l��e��=��0��"�� ��/��>��
��  ��
��  ��  ��  ��  ��
��  ��  ��<��l��i��n��k�� ��r��e��l��=��"��a��p��p��l��e��-��t��o��u��c��h��-��i��c��o��n��"�� ��h��r��e��f��=��"��/��g��f��x��/��u��h��h��/��i��c��o��n��s��/��i��p��h��o��n��e��_��t��o��u��c��h��_��i��c��o��n��.��p��n��g��"�� ��t��y��p��e��=��"��i��m��a��g��e��/��g��i��f��"�� ��/��>��
��
��  ��  ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��r��e��f��r��e��s��h��"�� ��c��o��n��t��e��n��t��=��"��0��;�� ��U��R��L��=��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"�� ��/��>��
��  ��  ��  ��  ��
��  ��  ��<��l��i��n��k�� �� �� ��h��r��e��f��=��"��/��c��s��s��/��_��d��e��f��a��u��l��t��/��d��l��.��s��t��a��r��t��p��a��g��e��.��c��s��s��"��   ��  ��  ��r��e��l��=��"��s��t��y��l��e��s��h��e��e��t��"�� �� ��t��y��p��e��=��"��t��e��x��t��/��c��s��s��"��   ��/��>��
��  ��  ��
��  ��  ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��j��q��u��e��r��y��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
��  ��  ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��o��n��m��e��d��i��a��q��u��e��r��y��.��m��i��n��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
��
��  ��<��/��h��e��a��d��>��
��  ��
��  ��<��b��o��d��y��>��    ��  ��
��  ��  ��<��d��i��v�� ��i��d��=��"��w��r��a��p��p��e��r��"��>��
��  ��  ��  ��<��a�� ��h��r��e��f��=��"��h��t��t��p��:��/��/��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"�� ��t��i��t��l��e��=��"��e��x��t��e��r��n�� ��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"��>��
��  ��  ��  ��  ��<��i��m��g�� ��b��o��r��d��e��r��=��"��0��"�� ��i��d��=��"��l��o��g��o��"�� ��s��r��c��=��"��/��g��f��x��/��u��h��h��/��l��o��g��o��.��p��n��g��"�� ��a��l��t��=��"��L��o��g��o�� ��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��"�� ��/��>��
��  ��  ��  ��<��/��a��>��
��  ��  ��  ��
��  ��  ��  ��<��u��l�� ��i��d��=��"��l��a��n��g��M��e��n��u��"��>��
��  ��  ��  ��  �� �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��1�� ��G��e��r��m��a��n�� ��/��/�� ��-��-��>��
��  ��  ��  ��  ��  ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��G��e��r��m��a��n��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"��>��d��e��<��/��a��>��<��/��l��i��>��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��2�� ��E��n��g��l��i��s��h�� ��/��/�� ��-��-��>��
��  ��  ��  ��  ��  ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��E��n��g��l��i��s��h��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��2��"��>��e��n��<��/��a��>��<��/��l��i��>��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  ��  ��  ��  ��  ��
��  ��  ��  ��  ��  ��  ��  ��<��/��u��l��>��
��  ��  ��  ��
��  ��  ��<��/��d��i��v��>��
��  ��<��/��b��o��d��y��>��
��<��/��h��t��m��l��>��
��
Thanks for your help.

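Edit: I noticed something else: the error only occurs when I retrieve the base URL directly (i.e. ""). If, on the other hand, I call Jsoup on BASE_URL/scripts/mgrqispi.dll, I receive a valid result (with the same settings).
However, I also need the base URL to be handled correctly, because the page uses this forwarding hassle when it creates a session.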
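
The raw output above shows that the forwarding is done with a `<meta http-equiv="refresh">` tag in the start page. As a rough sketch only, the target of such a tag could be pulled out of the (correctly decoded) HTML and resolved against the base URL as shown here. The class name `MetaRefresh`, the method `refreshTarget`, and the regular expression are assumptions made for illustration; the pattern expects the attributes in the same order as in the output above and is not a general HTML parser.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class MetaRefresh {
    // Assumes http-equiv comes before content, as in the page shown above.
    private static final Pattern REFRESH = Pattern.compile(
            "<meta\\s+http-equiv=\"refresh\"\\s+content=\"\\s*\\d+\\s*;\\s*url=([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute forwarding target, or null if no refresh tag is found.
    static String refreshTarget(String html, String baseUrl) {
        Matcher m = REFRESH.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Resolve a relative target such as "/scripts/..." against the base URL.
        return URI.create(baseUrl).resolve(m.group(1).trim()).toString();
    }
}
```

The returned URL could then be requested with the same connection settings and cookies as the first request.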

I think I ran into something similar a while ago; if I can find the code, I will post it here.

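But yes, it is probably related to character encoding. Have a look at whether you can read the page as an input stream with the charset set explicitly.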

//from docs - http://jsoup.org/apidocs/
parse(InputStream inputstream, String charsetName, String baseUri) 


//example
Jsoup.parse(in, "ISO-8859-2", url);
You need to try reading the URL as an InputStream. Something like this:

InputStream inputstream = new URL(url).openStream();
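
If the right charset is not known in advance, one option is to look at the first bytes of the stream for a byte order mark (BOM) before decoding. The following is a minimal sketch using only the standard library; the class `BomAwareDecoder` and its methods are hypothetical names, and it assumes the server actually sends a BOM, which the thread does not confirm. The fallback charset is used when no BOM is present.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup.
final class BomAwareDecoder {
    // Picks a charset from a leading byte order mark, else returns the fallback.
    static Charset sniff(byte[] b, Charset fallback) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    // Decodes the bytes with the sniffed charset and drops the BOM character.
    static String decode(byte[] bytes, Charset fallback) {
        String text = new String(bytes, sniff(bytes, fallback));
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    // Reads the whole stream into memory, then decodes it.
    static String readAll(InputStream in, Charset fallback) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return decode(buffer.toByteArray(), fallback);
    }
}
```

The decoded string could then be passed to `Jsoup.parse(String html, String baseUri)`. With the `Connection.Response` from the question, the raw bytes should also be available through `response.bodyAsBytes()`, so `decode` could be applied to them directly.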

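I was able to work around the problem by using the first URL that the crawled page forwards to as the entry point for my script. I still do not understand how this strange response came about. If I had used the wrong character encoding, I would have expected some characters not to be displayed, but not this... However, since I have it working now, I consider this thread closed. If you think you know a solution, feel free to post it; I will test it so that it can help future readers.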
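
On the open question of how the wrong encoding can produce this pattern: one explanation that fits the output in the question, although nobody in the thread confirmed it, is that the start page is stored as UTF-16 while the `Content-Type` header (`text/html`) names no charset, so the bytes get decoded as UTF-8 or a similar encoding. UTF-16 uses two bytes for every ASCII character, one of which is zero, so every real character ends up next to a filler character. The sketch below reproduces that effect in isolation; the class `Utf16Mojibake`, the method `misdecode`, and the choice of little-endian byte order are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not talk to the server from the question.
final class Utf16Mojibake {
    // Encodes the text as UTF-16LE, then decodes those bytes as UTF-8.
    static String misdecode(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(utf16, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wrong = misdecode("<html>");
        // Make the NUL filler characters visible by printing them as '?'.
        System.out.println(wrong.replace('\u0000', '?')); // prints <?h?t?m?l?>?
        System.out.println(wrong.length());               // prints 12
    }
}
```

The string doubles in length, and every second character is junk, which is the same shape as the output in the question. That would also explain why the text stays readable between the garbage instead of only a few characters going missing.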

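Thank you for your reply. I tried your suggestion, but it did not solve the problem. However, I noticed something else and added the information to the question.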