Java HttpUrlConnection: fetching a page's title returns "Moved Permanently"


java http groovy


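This is code I wrote in Groovy to get the page title from a URL. For some sites, however, I get "Moved Permanently", which I think is caused by a 301 redirect. How can I avoid this and have HttpURLConnection follow the redirect to the correct URL so that I get the correct page title?

For example, for this site I get "Moved Permanently" instead of the correct page title.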

You need to call setInstanceFollowRedirects(true) on the HttpURLConnection, i.e. insert the following after the first line:

con.setInstanceFollowRedirects(true)
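
Whether that call changes anything depends on what the connection's default already is. A minimal Java sketch for checking it is below; the class name `FollowRedirectsCheck` and the `example.com` address are placeholders rather than anything from the original post, and no network request is made because `openConnection()` only creates the connection object.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class FollowRedirectsCheck {
    /** Reports the per-instance redirect setting of a freshly opened, not yet connected, connection. */
    static boolean followsRedirectsByDefault(String address) {
        try {
            // openConnection() does not contact the server; it only builds the object.
            HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
            return con.getInstanceFollowRedirects();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder address, used only to obtain an HttpURLConnection instance.
        System.out.println(followsRedirectsByDefault("http://example.com/"));
    }
}
```

If this reports true, calling setInstanceFollowRedirects(true) makes no difference and the cause lies elsewhere. Note also that the JDK's HttpURLConnection does not automatically follow a redirect that switches protocol (for example from http to https), whatever this flag is set to.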

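I can get it to work if I manage the redirects myself.

I think the problem is that the site expects a cookie to be sent back partway through the redirect chain, and if it doesn't get that cookie it sends you to a login page.

This code obviously needs some cleaning up (and there is probably a better way), but it shows how to extract the title: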

@Grab( 'net.sourceforge.htmlcleaner:htmlcleaner:2.2' )
@Grab( 'commons-lang:commons-lang:2.6' )
import org.apache.commons.lang.StringEscapeUtils
import org.htmlcleaner.*

String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'
String cookie = null
String pageContent = ''

while( location ) {
  new URL( location ).openConnection().with { con ->
    // We'll do redirects ourselves
    con.instanceFollowRedirects = false

    // If we got a cookie last time round, then add it to our request
    if( cookie ) con.setRequestProperty( 'Cookie', cookie )
    con.connect()

    // Get the response code, and the location to jump to (in case of a redirect)
    int responseCode = con.responseCode
    location = con.getHeaderField( "Location" )

    // Try and get a cookie the site will set, we will pass this next time round
    cookie = con.getHeaderField( "Set-Cookie" )

    // Read the HTML and close the inputstream
    pageContent = con.inputStream.withReader { it.text }
  }
}

// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner()

TagNode node = cleaner.clean( pageContent )
TagNode titleNode = node.findElementByName( 'title', true )

def title = titleNode.text.toString()
title = StringEscapeUtils.unescapeHtml( title ).trim()
title = title.replace( "\n", "" )

println title

Hope that helps.
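
Two details of the loop above are easy to trip over: a Location header may be relative, and a raw Set-Cookie value carries attributes (Path, Expires, and so on) that should not be sent back. The following is a minimal Java sketch of helpers for both cases; `RedirectHelpers`, `resolveLocation` and `toCookieHeader` are hypothetical names introduced for illustration, and the URLs and cookie values in `main` are made up.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RedirectHelpers {
    /** Resolves a Location header, which may be relative, against the URL that was just requested. */
    static String resolveLocation(String currentUrl, String location) {
        try {
            // The two-argument URL constructor treats the first URL as the base for the second.
            return new URL(new URL(currentUrl), location).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Builds a Cookie request header from raw Set-Cookie values, keeping only each name=value pair. */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Everything after the first ';' is an attribute and is not sent back to the server.
            String pair = value.split(";", 2)[0].trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Made-up inputs, only to show the shape of the results.
        System.out.println(resolveLocation("http://example.com/a/b.html", "/login?next=b"));
        System.out.println(toCookieHeader(Arrays.asList(
                "sid=abc123; Path=/; HttpOnly", "lang=en; Max-Age=3600")));
    }
}
```

In the loop, the list of cookie values could come from `con.getHeaderFields().get("Set-Cookie")`, which returns every Set-Cookie header rather than just one and is null when the response sets no cookie, so a null check is needed before calling the helper.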

I tried that, but it still doesn't work. I believe setInstanceFollowRedirects(true) is the default. But thank you very much for your reply.

Yes, I should have tried it before posting. I did reproduce your symptoms, but I don't know why yet.

I tried HttpBuilder instead of HttpURLConnection, and it follows the redirects with no extra configuration. But I haven't been able to pass the resulting content to HtmlCleaner yet.

This isn't something caused by the New York Times paywall, is it?

That might be the reason. But any other ideas? I pasted that URL into Facebook and Facebook can recognize the page, so maybe I'm missing something?