Java: How to remove the subdomain part of a URL

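I am trying to remove the subdomain and keep only the domain name and its extension.

Finding the subdomain is hard because I don't know how many dots a URL will contain. For example, some URLs end in .com and others in .co.uk.

How can I safely remove the subdomain so that foo.bar.com becomes bar.com and foo.bar.co.uk becomes bar.co.uk?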


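What you need is a Public Suffix List, such as the publicly available one. Basically, no algorithm can tell you which suffixes are public, so you need a list. You are better off using one that is public and well maintained.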
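To illustrate why a list is enough once you have one, here is a minimal sketch of a longest-suffix lookup. The class name `SuffixLookup`, the method `registrableDomain`, and the three-entry `SUFFIXES` set are assumptions made for this example; a real list is far larger and also contains wildcard and exception rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SuffixLookup {
    // Hypothetical, tiny stand-in for a real public suffix list
    public static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk"));

    // Returns the longest known suffix plus one more label,
    // or the host unchanged if no suffix matches
    public static String registrableDomain(String host) {
        String[] labels = host.split("\\.");
        // Candidates run from longest to shortest, so the first hit is the longest suffix
        for (int i = 1; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("foo.bar.com"));   // bar.com
        System.out.println(registrableDomain("foo.bar.co.uk")); // bar.co.uk
    }
}
```

Checking the longest candidate first matters: with both "uk" and "co.uk" in the set, matching "uk" first would wrongly yield co.uk for foo.bar.co.uk.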

I just solved this problem and decided to write the following function.

Example input -> output:

http://example.com  -> http://example.com
http://www.example.com  -> http://example.com
ftp://www.a.example.com -> ftp://example.com
SFTP://www.a.example.com    -> SFTP://example.com
http://www.a.b.example.com  -> http://example.com
http://www.a.c.d.example.com    -> http://example.com
http://example.com/ -> http://example.com/
https://example.com/aaa -> https://example.com/aaa
http://www.example.com/aa/bb../d    -> http://example.com/aa/bb../d
FILE://www.a.example.com/ddd/dd/../ff   -> FILE://example.com/ddd/dd/../ff
HTTPS://www.a.b.example.com/index.html?param=value  -> HTTPS://example.com/index.html?param=value
http://www.a.c.d.example.com/#yeah../..!    -> http://example.com/#yeah../..!

The same goes for second-level domains:
http://some.thing.co.uk/?ke -> http://thing.co.uk/?ke
something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk -> something.co.uk
https://www.something.co.uk -> https://something.co.uk
Code:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubdomainRemover {

    // Matches a leading scheme such as "http://" or a protocol-relative "//"
    private static final Pattern PROTOCOL = Pattern.compile("^([a-zA-Z]*://|//)");

    public static String removeSubdomains(String url, List<String> secondLevelDomains) {
        // We need our URL in three parts: protocol - domain - path
        String protocol = getProtocol(url);
        url = url.substring(protocol.length());
        String urlDomain = url;
        String path = "";
        int slashPos = urlDomain.indexOf('/');
        if (slashPos >= 0) {
            path = urlDomain.substring(slashPos);
            urlDomain = urlDomain.substring(0, slashPos);
        }
        // Done, now let us count the dots
        int dotCount = countOccurrences(urlDomain, '.');
        // subdomain.example.com <-- default case, we keep everything after the 2nd last dot
        int dotOffset = 2;
        // However, there are second-level domains such as co.uk. For something.co.uk we must not
        // cut away "something", since it is the actual domain and not a subdomain.
        // Take the longest matching entry; testing "." + entry avoids false hits like "fooco.uk".
        String lowerDomain = urlDomain.toLowerCase(Locale.ROOT);
        String longestMatch = "";
        for (String secondLevelDomain : secondLevelDomains) {
            if (secondLevelDomain.length() > longestMatch.length()
                    && lowerDomain.endsWith("." + secondLevelDomain)) {
                longestMatch = secondLevelDomain;
            }
        }
        // Increase the dot offset by the number of dots in the matched entry (co.uk = +1)
        dotOffset += countOccurrences(longestMatch, '.');
        // example.com, or something.co.uk with an offset of 3 but only 2 dots: nothing to remove
        if (dotOffset > dotCount) {
            return protocol + urlDomain + path;
        }
        // sub.something.co.uk: an offset of 3 and 3 dots, so we remove "sub"
        int pos = nthLastIndexOf(dotOffset, '.', urlDomain) + 1;
        return protocol + urlDomain.substring(pos) + path;
    }

    public static String getProtocol(String url) {
        Matcher m = PROTOCOL.matcher(url);
        return m.find() ? m.group() : "";
    }

    // The two helpers below stand in for the Strng utility class used by the original code
    private static int countOccurrences(String s, char c) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                count++;
            }
        }
        return count;
    }

    // Index of the n-th occurrence of c counted from the end, or -1 if there are fewer than n
    private static int nthLastIndexOf(int n, char c, String s) {
        int pos = s.length();
        for (int i = 0; i < n && pos >= 0; i++) {
            pos = s.lastIndexOf(c, pos - 1);
        }
        return pos;
    }

    public static List<String> getPublicSuffixList(boolean loadFromPublicSuffixOrg) {
        if (!loadFromPublicSuffixOrg) {
            // Small built-in fallback list; far from complete
            return new ArrayList<>(Arrays.asList(
                    "co.uk", "ac.uk", "gov.uk", "ltd.uk",
                    "co.at", "or.at", "ac.at", "gv.at",
                    "fed.us", "isa.us", "nsn.us", "dni.us",
                    "ac.ru", "com.ru", "edu.ru", "gov.ru", "int.ru", "mil.ru", "net.ru", "org.ru", "pp.ru",
                    "com.au", "net.au", "org.au", "edu.au", "gov.au"));
        }
        List<String> secondLevelDomains = new ArrayList<>();
        // Plain JDK download; the original used its own URLHelpers.getHTTP helper here
        try (InputStream in = new URL("https://publicsuffix.org/list/public_suffix_list.dat").openStream();
             Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                // Skip comments, wildcard rules, exception rules and single-label suffixes
                if (!line.startsWith("//") && !line.startsWith("*")
                        && !line.startsWith("!") && line.contains(".")) {
                    secondLevelDomains.add(line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return secondLevelDomains;
    }
}

Does the URL in question always have a subdomain? If so, you only need to remove everything between the // and the first dot.

Google's Guava library packages this functionality neatly in its InternetDomainName class (topPrivateDomain()).
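The suggestion to remove everything between the // and the first dot can be sketched in a few lines of plain Java. The class name `FirstLabelStripper` and the method `stripFirstLabel` are made up for this example, and the sketch only holds under the stated assumption that every input really has a subdomain.

```java
public class FirstLabelStripper {
    // Naive approach: delete the text between "//" and the first dot that follows it.
    // Assumes the URL has a scheme separator and that the host always carries a subdomain.
    public static String stripFirstLabel(String url) {
        int start = url.indexOf("//");
        if (start < 0) {
            return url; // no "//" found, leave the input untouched
        }
        start += 2;
        int dot = url.indexOf('.', start);
        if (dot < 0) {
            return url; // no dot after "//", nothing to strip
        }
        return url.substring(0, start) + url.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(stripFirstLabel("http://www.example.com/a")); // http://example.com/a
        System.out.println(stripFirstLabel("http://example.com"));       // http://com
    }
}
```

The second call shows why this only works when a subdomain is guaranteed: without one, the domain itself is cut away, which is the case the suffix-list approach is meant to handle.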