Getting the second-level domain of a URL (Java)

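I'd like to know whether there is a parser or library in Java for extracting the second-level domain (SLD) from a URL or, failing that, an algorithm or regex that does the same. For example:

    URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");
    String host = uri.getHost();
    System.out.println(host);

which prints:

    www.mydomain.ltd.uk

What I want to do now is reliably identify the SLD component ("ltd.uk"). Any ideas?

Edit: I'm looking for a solution that is ideally general, so I would match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk", and ".com" in "amazon.com".

Thanks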

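If you want the second-level domain, you can split the string on "." and take the last two parts. Of course, this assumes the second-level part is always a generic one rather than site-specific (which sounds like what you want).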
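A minimal sketch of that split-based approach follows. The class and method names are made up for illustration. Note how it returns "police.uk" for a host under "police.uk", which is exactly the case the question's edit wants handled differently.

```java
final class NaiveDomainSplit {
    /** Returns the last two dot-separated labels of host, or host itself if it has at most two. */
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.amazon.com"));      // amazon.com
        System.out.println(lastTwoLabels("www.mydomain.ltd.uk")); // ltd.uk
        System.out.println(lastTwoLabels("www.police.uk"));       // police.uk
    }
}
```

Because the method only counts labels, it cannot tell a generic suffix such as "ltd.uk" apart from a registered name such as "police.uk"; that distinction needs a list of known suffixes.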

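I don't have an answer for your specific case; as Jonathan's comment points out, you should probably reframe your question.

That said, I'd still suggest looking at the relevant class in the Restlet project. It has a lot of useful methods. Since Restlet is open source, you don't have to use the whole library: you can download the source and add just that one class to your project.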

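I don't know your purpose, but the second-level domain may not mean much to you. You probably need to find the public suffix; the domain directly beneath it is the one you are looking for.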
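As a rough sketch of that idea, the code below looks up the longest known suffix and then keeps one more label to its left. The suffix set is a tiny hand-picked sample and the class and method names are hypothetical; a real implementation would load the full public suffix list, including its wildcard and exception rules.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SuffixLookupSketch {
    // Hypothetical sample only; the real list has thousands of entries.
    static final Set<String> SAMPLE_SUFFIXES =
            new HashSet<String>(Arrays.asList("com", "uk", "co.uk", "ltd.uk"));

    /** Longest sample suffix that host ends with on a label boundary, or "" if none matches. */
    static String publicSuffix(String host) {
        String candidate = host.toLowerCase();
        while (true) {
            // Labels are stripped from the left, so the first hit is the longest match.
            if (SAMPLE_SUFFIXES.contains(candidate)) {
                return candidate;
            }
            int dot = candidate.indexOf('.');
            if (dot == -1) {
                return "";
            }
            candidate = candidate.substring(dot + 1);
        }
    }

    /** The suffix plus the one label to its left, or "" if host is a bare suffix or unknown. */
    static String domainBelowSuffix(String host) {
        String h = host.toLowerCase();
        String suffix = publicSuffix(h);
        if (suffix.isEmpty() || suffix.length() == h.length()) {
            return "";
        }
        String left = h.substring(0, h.length() - suffix.length() - 1);
        return left.substring(left.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(publicSuffix("www.mydomain.ltd.uk"));      // ltd.uk
        System.out.println(domainBelowSuffix("www.mydomain.ltd.uk")); // mydomain.ltd.uk
        System.out.println(publicSuffix("police.uk"));                // uk
    }
}
```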

Apache HttpComponents (HttpClient 4) ships with classes that handle this:

    org.apache.http.impl.cookie.PublicSuffixFilter
    org.apache.http.impl.cookie.PublicSuffixListParser

You will need to download the public suffix list from publicsuffix.org.

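  • The list above, plus reading the updates on Wikipedia, gives you a TLD list that is about 98% correct.
  • Browse through the NICs and check each one for the latest news, and you will find the remaining 2% (such as .com.aq and .gov.an).
  • Unfortunately, the big "free web space" providers are another factor to consider, e.g. the countless *.blogspot.com domains. If you download the Alexa top 100,000 (a free CSV file), you at least get a good overview of the most heavily used ones, which should cover a fair percentage (for example when comparing Alexa rankings against StumbleUpon page views and Delicious bookmarks). Alexa sometimes records only the top domain, whereas Delicious actually MD5-hashes every URL, so 1 Alexa entry --> many Delicious MD5 hashes.
  • Beyond that, sometimes, as in the case of Twitter, whatever follows the / also matters if you are looking for unique ratings.
  • Here is a list from the Alexa top 40,000 with the real TLDs filtered out, to give you a feel for it (meaning Alexa does not aggregate the rankings of the following domains):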

    bp.blogspot.com---espn.go.com---files.wordpress.com---abcnews.go.com---disney.go.com---troktiko.blogspot.com---en.wordpress.com---api.ning.com---abc.go.com---220.181.38.82---213.174.154.20---abclocal.go.com---feedproxy.google com/~r---forums.wordpress.com---GoogleBlogspot.com---1.cnm999.com/user/10008---213.174.143.196.51---glewebmastercentral.blogspot.com--myespn.go.com--213.174.143.197--61.132.221.146--support.wordpress.com--dashboard.wordpress.com--sethgodin.typepad.com--paygo.17zhifu.com/user/10005--go2.wordpress.com--1.1.1.1--movies.go.com--home.com--home.com--home.com--comcast.net--googlesystem.blogspot.com--abcfamily.com--home.com--live.237--1961--01.com/~record--xhamster.com/user/video--gold-oil-commodity.blogspot.com--journeyplanner.tfl.gov.uk/user/XSLT_-TRIP_-REQUEST2--206.108.48.238--blog.wordpress.com--67.220.92.21--183.101.80.130--211.94.190.80--youtube-global.blogspot.com--uta-net.com/user/phplib 3satu.blogspot.com--119.global.com--global.com--global.global.blogspot.com--global.com--global.net.com--user/phplibb--global.com--phplibj4u.or.jp/~dyo--220.181.6.19--toontown.go.com--signup.wordpress.com--TheArtorialist.blogspot.com--analytics.blogspot.com--ss.iij4u.or.jp/~ceh2--67.220.92.23--Gmail博客.blogspot.com--183.99.121.86--vgorode.ru/user/create--61.132.216.243--217.175.53.72--labnol.blogspot.com--adsense.com--blogspot.com--订阅博客--spot.com---creators.ning.com---sarkari naukri.blogspot.com---search.wordpress.com---orange hiyoko.blogspot.com---cashewmaniakpop.wordpress.com---pixiehollow.go.com---adwords.blogspot.com---202.53.226.102---lorelle.wordpress.com---homestead.com/~site---multiply.com/user/signout---221.231.148.249---.183.101.80.77---windowsliveintro.livespot.live.com------124.228.254.234---streaming web.blogspot.com---id.tianya.cn/user/message---familyfun.go.com---tro ma ktiko.blogspot.com---about.ning.com---paygo.17zhifu.com/user/10020---tutututina.blogspot.com---toolserver.org/~geohack------superjob.ru/user/resume---ejobs.ro/user/rocari de 
munca---gnula.blogspot.com------alles.or.jp/~uir---chiark/~sgtatham---woork.blogspot.com---88.208.32.218---webstreamingmania.blogspot.com---spaces.live.com---youtube.com/user/RayWilliamJohnson---cloob.com/user/login---asstr.org/~Kristen---getclicky.com/user/login---gussermuff.blogspot.com---211.98.70.195---222.73.105.196------pp.iij4u.or.jp/~taakii----unsoloclic.blogspot.com----photoshopspot.com----21883.161.253--217.16.18.163--217.16.18.207--217.16.28.104--222.73.105.210--youtube.com/user/OldSpice--hubbages.com/user/new--pelisdvdripdd.blogspot.com--95.143.193.60--es.wordpress.com--217 16.18.206--61.147 116.146--damncoolpics--blogspot.com--family.go.com--81.blogspot.162 news--gutterspot.235--m---faisalardhy.blogspot.com---67.220.92.14---goodreads.com/user/show---116.228.55.34---profile.typepad.com---kaixin001.com/~truth---linkbuildersassociated.ning.com---nicotto.jp/user/mypage---ritemail.blogspot.com---hyperboleandhalf.blogspot.com---carscoop.blogspot.com------tubemogul.com/user/dash---press----gr.blogspot.com---.com---.235.164------o、 com--208.98.30.69--trelokouneli.blogspot.com--help.ning.com--id.tianya.cn/user/register--slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8--printable-coupons.blogspot.com--unics877.blogspot.com--globaleconomicanalysis.blogspot.com--183 101.80.68--221.194.33.60--doujin-game8
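To make the point about hosting providers and per-user paths concrete, here is a small sketch that derives a "site key" from a host and a path. The provider sets are hypothetical samples, not an authoritative list, and the class and method names are invented for this illustration. Hosts are matched exactly for the path rule, so "www.twitter.com" would not be covered by this sample.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SiteKeySketch {
    // Hypothetical samples: providers where each subdomain is a separate site.
    static final Set<String> SUBDOMAIN_PROVIDERS =
            new HashSet<String>(Arrays.asList("blogspot.com", "wordpress.com"));
    // Hypothetical samples: hosts where the first path segment identifies the site.
    static final Set<String> PATH_PROVIDERS =
            new HashSet<String>(Arrays.asList("twitter.com"));

    /** Returns a key that separates sites sharing one provider; falls back to the host. */
    static String siteKey(String host, String path) {
        String h = host.toLowerCase();
        for (String provider : SUBDOMAIN_PROVIDERS) {
            if (h.endsWith("." + provider)) {
                // Keep only the label directly left of the provider domain.
                String left = h.substring(0, h.length() - provider.length() - 1);
                return left.substring(left.lastIndexOf('.') + 1) + "." + provider;
            }
        }
        if (PATH_PROVIDERS.contains(h)) {
            // For "/user/x" the split yields ["", "user", "x"].
            String[] segments = path.split("/");
            if (segments.length > 1 && !segments[1].isEmpty()) {
                return h + "/" + segments[1];
            }
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(siteKey("www.labnol.blogspot.com", "/"));       // labnol.blogspot.com
        System.out.println(siteKey("twitter.com", "/someuser/status/1")); // twitter.com/someuser
    }
}
```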
    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    public class TopLevelDomainChecker {
        private Set<String> exceptions;
        private Set<String> suffixes;
    
        public void setPublicSuffixes(Collection<String> suffixes) {
            this.suffixes = new HashSet<String>(suffixes);
        }
        public void setExceptions(Collection<String> exceptions) {
            this.exceptions = new HashSet<String>(exceptions);
        }
    
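        /**
         * Checks whether the domain is a public suffix (called a "TLD" in this class).
         * @param domain the domain to test, with or without a leading dot
         * @return true if the domain matches a suffix rule and no exception rule
         */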
        public boolean isTLD(String domain) {
            if (domain.startsWith(".")) 
                domain = domain.substring(1);
    
            // An exception rule takes priority over any other matching rule.
            // Exceptions are ones that are not a TLD, but would match a pattern rule
            // e.g. bl.uk is not a TLD, but the rule *.uk means it is. Hence there is an exception rule
            // stating that bl.uk is not a TLD. 
            if (this.exceptions != null && this.exceptions.contains(domain)) 
                return false;
    
    
            if (this.suffixes == null) 
                return false;
    
            if (this.suffixes.contains(domain)) 
                return true;
    
            // Try patterns. ie *.jp means that boo.jp is a TLD
            int nextdot = domain.indexOf('.');
            if (nextdot == -1)
                return false;
            domain = "*" + domain.substring(nextdot);
            if (this.suffixes.contains(domain)) 
                return true;
    
            return false;
        }
    
    
        public String extractSLD(String domain)
        {
            String last = domain;
            boolean anySLD = false;
            do
            {
                if (isTLD(domain))
                {
                    if (anySLD)
                        return last;
                    else
                        return "";
                }
                anySLD = true;
                last = domain;
                int nextDot = domain.indexOf(".");
                if (nextDot == -1)
                    return "";
                domain = domain.substring(nextDot+1);
            } while (domain.length() > 0);
            return "";
        }
    }
    
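    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Collection;

    /**
     * Parses the list from <a href="http://publicsuffix.org/">publicsuffix.org</a>.
     * Copied from http://svn.apache.org/repos/asf/httpcomponents/httpclient/trunk/httpclient/src/main/java/org/apache/http/impl/cookie/PublicSuffixListParser.java
     */
    public class TopLevelDomainParser {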
        private static final int MAX_LINE_LEN = 256;
        private final TopLevelDomainChecker filter;
    
        TopLevelDomainParser(TopLevelDomainChecker filter) {
            this.filter = filter;
        }
        public void parse(Reader list) throws IOException {
            Collection<String> rules = new ArrayList<String>();
            Collection<String> exceptions = new ArrayList<String>();
            BufferedReader r = new BufferedReader(list);
            StringBuilder sb = new StringBuilder(256);
            boolean more = true;
            while (more) {
                more = readLine(r, sb);
                String line = sb.toString();
                if (line.length() == 0) continue;
                if (line.startsWith("//")) continue; //entire lines can also be commented using //
                if (line.startsWith(".")) line = line.substring(1); // A leading dot is optional
                // An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule
                boolean isException = line.startsWith("!"); 
                if (isException) line = line.substring(1);
    
                if (isException) {
                    exceptions.add(line);
                } else {
                    rules.add(line);
                }
            }
    
            filter.setPublicSuffixes(rules);
            filter.setExceptions(exceptions);
        }
        private boolean readLine(Reader r, StringBuilder sb) throws IOException {
            sb.setLength(0);
            int b;
            boolean hitWhitespace = false;
            while ((b = r.read()) != -1) {
                char c = (char) b;
                if (c == '\n') break;
                // Each line is only read up to the first whitespace
                if (Character.isWhitespace(c)) hitWhitespace = true;
                if (!hitWhitespace) sb.append(c);
                if (sb.length() > MAX_LINE_LEN) throw new IOException("Line too long"); // prevent excess memory usage
            }
            return (b != -1);
        }
    }
    
        FileReader fr = new FileReader("effective_tld_names.dat.txt");
        TopLevelDomainChecker checker = new TopLevelDomainChecker();
        TopLevelDomainParser parser = new TopLevelDomainParser(checker);
        parser.parse(fr);
        boolean result;
        result = checker.isTLD("com"); // true
        result = checker.isTLD("com.au"); // true
        result = checker.isTLD("ltd.uk"); // true
        result = checker.isTLD("google.com"); // false
        result = checker.isTLD("google.com.au"); // false
        result = checker.isTLD("metro.tokyo.jp"); // false
        String sld;
        sld = checker.extractSLD("com"); // ""
        sld = checker.extractSLD("com.au"); // ""
        sld = checker.extractSLD("google.com"); // "google.com"
        sld = checker.extractSLD("google.com.au"); // "google.com.au"
        sld = checker.extractSLD("www.google.com.au"); // "google.com.au"
        sld = checker.extractSLD("www.google.com"); // "google.com"
        sld = checker.extractSLD("foo.bar.hokkaido.jp"); // "foo.bar.hokkaido.jp"
        sld = checker.extractSLD("moo.foo.bar.hokkaido.jp"); // "foo.bar.hokkaido.jp"
    
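    import com.google.common.net.InternetDomainName;
    import java.util.HashSet;
    import java.util.Set;

    // Returns the host's labels that are not part of its public suffix, e.g. "www"
    // and "google" for "www.google.com.au". The result is an unordered Set.
    // publicSuffix() returns null when the host has no recognised public suffix.
    Set<String> nonePublicDomainParts(String uriHost) {
        InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
        InternetDomainName publicDomainName = fullDomainName.publicSuffix();
        Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
        nonePublicParts.removeAll(publicDomainName.parts());
        return nonePublicParts;
    }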
    
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>10.0.1</version>
            <scope>compile</scope>
        </dependency>
    
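    import com.google.common.net.InternetDomainName;

    // Despite its name, this returns the top private domain (the public suffix
    // plus one label), e.g. "google.com.au" for "www.google.com.au".
    // The argument must be a host name, not a full URI.
    public static String getTopLevelDomain(String uri) {
        InternetDomainName fullDomainName = InternetDomainName.from(uri);
        InternetDomainName publicDomainName = fullDomainName.topPrivateDomain();
        StringBuilder topDomain = new StringBuilder();
        for (String part : publicDomainName.parts()) {
            if (topDomain.length() > 0) topDomain.append('.');
            topDomain.append(part);
        }
        return topDomain.toString();
    }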