Java Jsoup response: every second character is garbage (encoding problem?)
As part of an Android app, I use an HTML scraper based on Java/Jsoup. It worked well until a few weeks ago, but now I get very strange results when parsing the response. The page I am scraping is the one requested in the code below (all errors occur before I log in). This is how I obtain the Jsoup response object:
Connection connection = Jsoup
.connect("https://stine.uni-hamburg.de/")
.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0")
.referrer("https://www.stine.uni-hamburg.de/scripts/mgrqispi.dll")
.method(postData == null ? Connection.Method.GET : Connection.Method.POST)
.timeout(10000)
.cookies(m_cookies);
Connection.Response response = connection.execute();
I also tried removing every part of the connection parameters that is not strictly necessary, and tried the simple approach
Jsoup.parse(new URL("https://www.stine.uni-hamburg.de"), 10000);
System.out.println("resp: " + response);
System.out.println("status: " + response.statusCode());
System.out.println("content-type: " + response.contentType());
System.out.println("header: " + response.headers().toString());
System.out.println("content: " + response.parse().body().toString());
With either approach, the result looks like this:
resp: org.jsoup.helper.HttpConnection$Response@756535fa
status: 200
content-type: text/html
header: {ETag="09cd9ca88eccf1:0", Date=Fri, 28 Nov 2014 20:53:49 GMT, Vary=Accept-Encoding, Content-Length=1210, Last-Modified=Mon, 20 Oct 2014 17:10:48 GMT, Content-Encoding=gzip, Accept-Ranges=bytes, Content-Type=text/html, X-Powered-By=ASP.NET, Server=Microsoft-IIS/7.5}
content: ��<��!��D��O��C��T��Y��P��E�� ��H��T��M��L�� ��P��U��B��L��I��C�� ��"��-��/��/��W��3��C��/��/��D��T��D�� ��H��T��M��L�� ��4��.��0��1��/��/��E��N��"�� ��"��h��t��t��p��:��/��/��w��w��w��.��w��3��.��o��r��g��/��T��R��/��h��t��m��l��4��/��s��t��r��i��c��t��.��d��t��d��"��>��
��<��h��t��m��l��>��
�� ��<��h��e��a��d��>��
�� ��
�� ��<��!��-��-��
�� ����� ��D��A��T��E��N��L��O��T��S��E��N�� ��I��N��F��O��R��M��A��T��I��O��N��S��S��Y��S��T��E��M��E�� ��G��M��B��H��
�� ��e��-��m��a��i��l��:�� �� �� ��i��n��f��o��@��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
�� ��w��e��b��:�� �� �� �� ��h��t��t��p��:��/��/��w��w��w��.��d��a��t��e��n��l��o��t��s��e��n��.��d��e��
�� ��
�� ��c��u��s��t��o��m��e��r��:�� �� �� ��u��h��h��
�� ��v��e��r��s��i��o��n��:�� �� �� ��5��.��4��0��.��0��0��8��
�� ��f��i��l��e��n��a��m��e��:�� �� ��i��n��d��e��x��.��h��t��m��
��/��/��-��-��>��
��
�� ��
�� �� ��<��t��i��t��l��e��>��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��<��/��t��i��t��l��e��>��
�� �� �� ��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��X��-��U��A��-��C��o��m��p��a��t��i��b��l��e��"�� ��c��o��n��t��e��n��t��=��"��I��E��=��E��m��u��l��a��t��e��I��E��9��"�� ��/��>�� ��<��!��-��-�� ��I��E��9�� ��d��o��c��u��m��e��n��t�� ��m��o��d��e�� ��o��n��l��y�� ��-��-��>��
��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��c��a��c��h��e��-��c��o��n��t��r��o��l��"�� �� �� ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��e��x��p��i��r��e��s��"�� �� �� �� �� ��c��o��n��t��e��n��t��=��"��-��1��"�� ��/��>��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� �� �� �� �� ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� �� �� �� �� ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��p��r��a��g��m��a��"�� �� �� �� �� ��c��o��n��t��e��n��t��=��"��n��o��-��c��a��c��h��e��"�� ��/��>��
��
�� �� ��<��m��e��t��a�� ��n��a��m��e��=��"��v��i��e��w��p��o��r��t��"�� ��c��o��n��t��e��n��t��=��"��w��i��d��t��h��=��d��e��v��i��c��e��-��w��i��d��t��h��,�� ��i��n��i��t��i��a��l��-��s��c��a��l��e��=��1��,��u��s��e��r��-��s��c��a��l��a��b��l��e��=��0��"�� ��/��>��
�� ��
�� �� �� �� ��
�� �� ��<��l��i��n��k�� ��r��e��l��=��"��a��p��p��l��e��-��t��o��u��c��h��-��i��c��o��n��"�� ��h��r��e��f��=��"��/��g��f��x��/��u��h��h��/��i��c��o��n��s��/��i��p��h��o��n��e��_��t��o��u��c��h��_��i��c��o��n��.��p��n��g��"�� ��t��y��p��e��=��"��i��m��a��g��e��/��g��i��f��"�� ��/��>��
��
�� �� ��<��m��e��t��a�� ��h��t��t��p��-��e��q��u��i��v��=��"��r��e��f��r��e��s��h��"�� ��c��o��n��t��e��n��t��=��"��0��;�� ��U��R��L��=��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"�� ��/��>��
�� �� �� �� ��
�� �� ��<��l��i��n��k�� �� �� ��h��r��e��f��=��"��/��c��s��s��/��_��d��e��f��a��u��l��t��/��d��l��.��s��t��a��r��t��p��a��g��e��.��c��s��s��"�� �� �� ��r��e��l��=��"��s��t��y��l��e��s��h��e��e��t��"�� �� ��t��y��p��e��=��"��t��e��x��t��/��c��s��s��"�� ��/��>��
�� �� ��
�� �� ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��j��q��u��e��r��y��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
�� �� ��<��s��c��r��i��p��t�� ��t��y��p��e��=��"��t��e��x��t��/��j��a��v��a��s��c��r��i��p��t��"�� ��s��r��c��=��"��/��j��s��/��m��o��b��i��l��e��_��m��a��s��t��e��r��/��o��n��m��e��d��i��a��q��u��e��r��y��.��m��i��n��.��j��s��"��>��<��/��s��c��r��i��p��t��>��
��
�� ��<��/��h��e��a��d��>��
�� ��
�� ��<��b��o��d��y��>�� �� ��
�� �� ��<��d��i��v�� ��i��d��=��"��w��r��a��p��p��e��r��"��>��
�� �� �� ��<��a�� ��h��r��e��f��=��"��h��t��t��p��:��/��/��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"�� ��t��i��t��l��e��=��"��e��x��t��e��r��n�� ��w��w��w��.��u��n��i��-��h��a��m��b��u��r��g��.��d��e��"��>��
�� �� �� �� ��<��i��m��g�� ��b��o��r��d��e��r��=��"��0��"�� ��i��d��=��"��l��o��g��o��"�� ��s��r��c��=��"��/��g��f��x��/��u��h��h��/��l��o��g��o��.��p��n��g��"�� ��a��l��t��=��"��L��o��g��o�� ��U��n��i��v��e��r��s��i��t�����t�� ��H��a��m��b��u��r��g��"�� ��/��>��
�� �� �� ��<��/��a��>��
�� �� �� ��
�� �� �� ��<��u��l�� ��i��d��=��"��l��a��n��g��M��e��n��u��"��>��
�� �� �� �� �� �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��1�� ��G��e��r��m��a��n�� ��/��/�� ��-��-��>��
�� �� �� �� �� ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��G��e��r��m��a��n��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��1��"��>��d��e��<��/��a��>��<��/��l��i��>��
�� �� �� �� �� �� �� �� ��
�� �� �� �� �� ��<��!��-��-�� ��/��/�� ��F��O��R��W��A��R��D��I��N��G�� ��0��0��2�� ��E��n��g��l��i��s��h�� ��/��/�� ��-��-��>��
�� �� �� �� �� ��<��l��i��>��<��a�� ��c��l��a��s��s��=��"��i��m��g�� ��i��m��g��_��L��a��n��g��E��n��g��l��i��s��h��"�� ��h��r��e��f��=��"��/��s��c��r��i��p��t��s��/��m��g��r��q��i��s��p��i��.��d��l��l��?��A��P��P��N��A��M��E��=��C��a��m��p��u��s��N��e��t��&��P��R��G��N��A��M��E��=��S��T��A��R��T��P��A��G��E��_��D��I��S��P��A��T��C��H��&��A��R��G��U��M��E��N��T��S��=��-��N��0��0��0��0��0��0��0��0��0��0��0��0��0��0��2��"��>��e��n��<��/��a��>��<��/��l��i��>��
�� �� �� �� �� �� �� �� ��
�� �� �� �� �� �� �� �� ��
�� �� �� �� �� �� �� ��<��/��u��l��>��
�� �� �� ��
�� �� ��<��/��d��i��v��>��
�� ��<��/��b��o��d��y��>��
��<��/��h��t��m��l��>��
��
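Thanks for your help.
Edit: I noticed something else: this error only occurs when I retrieve the base URL directly. If I call Jsoup on BASE_URL/scripts/mgrqispi.dll instead, I receive a valid result (with the same settings).
However, I also need the base URL to be handled correctly, because the page goes through this forwarding hassle when it creates a session.

I think I ran into something similar a while ago; if I can find the code, I will post it here. But yes, it could well be related to character encoding. Check whether you can read the page as an input stream with the character set specified explicitly: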
// from the docs - http://jsoup.org/apidocs/
parse(InputStream inputstream, String charsetName, String baseUri)
// example
Jsoup.parse(in, "ISO-8859-2", url);
You need to try reading the URL as an InputStream. Something like this:
InputStream inputstream = new URL(url).openStream();
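To make the idea of an explicit charset concrete, here is a minimal, stdlib-only sketch. It rests on an assumption that this thread does not confirm: that the garbage after every readable character comes from a two-byte encoding such as UTF-16 being decoded as if it were a one-byte encoding. The class name CharsetSketch, the helper decode and the sample bytes are made up for illustration. With Jsoup, the raw bytes could come from response.bodyAsBytes(), or the stream could be passed to the parse(InputStream, String, String) overload quoted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Decode raw response bytes with an explicitly chosen charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Stand-in for the raw response body; UTF-16LE is an assumption for illustration.
        byte[] body = "<html>".getBytes(StandardCharsets.UTF_16LE);

        // One-byte charset: every character is followed by a NUL (U+0000) character.
        String wrong = decode(body, StandardCharsets.ISO_8859_1);
        // Matching charset: the text comes back intact.
        String right = decode(body, StandardCharsets.UTF_16LE);

        System.out.println("wrong length: " + wrong.length()); // 12 instead of 6
        System.out.println("right: " + right);                 // <html>
    }
}
```

If the bytes only decode cleanly with one particular charset, that charset name is the one to pass as charsetName.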
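I was able to work around this by using the first URL that the scraped page forwards to as the entry point of my script. I still do not understand how this strange response comes about. If I were using the wrong character encoding, I would expect some characters to be undisplayable, but not something like this. Since it works for me now, I consider this thread closed. If you think you know a solution, feel free to post it; I will test it so that it can help future readers.

Thanks for your reply. I tried your suggestion, but it did not solve the problem. I did notice something else, though, and added that information to the question.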
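For future readers, the workaround (starting from the URL that the base page forwards to) might be sketched roughly as follows. It assumes the forwarding comes from the meta refresh tag visible in the output in the question. The class RefreshTargetSketch and the method resolveRefresh are hypothetical names, and the content string in main is reconstructed from that output.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefreshTargetSketch {
    // Matches a meta refresh content value such as "0; URL=/path" and captures the URL part.
    private static final Pattern REFRESH =
            Pattern.compile("(?i)^\\s*\\d+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$");

    // Returns the absolute forwarding target, or null if the value has no URL part.
    static String resolveRefresh(String baseUrl, String content) {
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return null;
        }
        // Resolve the (possibly relative) target against the page's own URL.
        return URI.create(baseUrl).resolve(m.group(1)).toString();
    }

    public static void main(String[] args) {
        String content = "0; URL=/scripts/mgrqispi.dll?APPNAME=CampusNet"
                + "&PRGNAME=STARTPAGE_DISPATCH&ARGUMENTS=-N000000000000001";
        System.out.println(resolveRefresh("https://stine.uni-hamburg.de/", content));
    }
}
```

The resolved URL can then be used as the entry point for Jsoup.connect(...) in place of the base URL.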