Java: changing a difficult string with unknown substrings


java string


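Update: I am using Jsoup to parse the text.

While parsing a site I ran into a problem: when I get the HTML text, some links are corrupted by a space at a random position. For example:

What a pretty flower! <a href="www.goo gle.com/...">here</a> and <a href="w ww.google.com...">here</a>

As you may notice, the position of the space is completely random, but one thing is certain: it is inside the href attribute. Of course, I could use the replace(" ", "") method, but there may be two or more links.

How can I solve this problem?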

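This is an old-fashioned solution, but I would try parsing your HTML with the old, retired Apache ECS. Then, for the href links only, you can remove the spaces and rebuild everything :-) If I remember correctly, there is a way to build an ECS "DOM" from HTML:

http://svn.apache.org/repos/asf/jakarta/ecs/branches/ecs/src/java/org/apache/ecs/html2ecs/Html2Ecs.java

Another option is to fetch the hrefs selectively with something like XPath, but then you have to deal with malformed HTML (you could give Tidy a chance).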

You can use a regular expression to find and "clean up" the URLs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLRegex {

    // Matches a sequence of one or more spaces.
    // Compiled once here, so we don't have to do it in every iteration of the loop below.
    private static final Pattern SPACES_PATTERN = Pattern.compile("\\u0020+");

    // This regular expression is very primitive and does not really check whether the URL is valid.
    // Only very simple URLs are matched. If a URL includes other protocols, account credentials, ... it is not matched.
    // The character class contains the space (\u0020), so a match runs on until the closing quote.
    // For more sophisticated regular expressions have a look at: http://stackoverflow.com/questions/161738/
    private static final Pattern PATTERN_A_HREF = Pattern.compile("https?://[A-Za-z0-9\\.\\-\\u0020\\?&\\=#/]+");

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        final String INPUT = "Hello World <a href=\"http://ww w.google.com\">Google</a> Second " +
                             "Hello World <a href=\"http://www.wiki pedia.org\">Wikipedia</a> Test" +
                             "<a href=\"https://www.example.o rg\">Example</a> Test Test";
        System.out.println(INPUT);

        Matcher m = PATTERN_A_HREF.matcher(INPUT);

        // Iterate through all matching strings:
        while (m.find()) {
            String urlThatMightContainSpaces = m.group();   // Get the current match
            Matcher spaceMatcher = SPACES_PATTERN.matcher(urlThatMightContainSpaces);
            System.out.println(spaceMatcher.replaceAll(""));  // Print the URL with all spaces removed.
        }

    }
}
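The code above only prints the cleaned URLs. If the goal is to get the whole HTML string back with the links repaired, one possible approach is to rewrite each match in place with Matcher.appendReplacement. The following is a minimal sketch, not a definitive implementation: the class name HrefSpaceFixer and the method fixHrefs are made up for illustration, and it assumes every href value is wrapped in double quotes and written exactly as href="...".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefSpaceFixer {

    // Group 1: the literal href=" prefix, group 2: the attribute value, group 3: the closing quote.
    // Assumption: href values are always double-quoted.
    private static final Pattern HREF = Pattern.compile("(href=\")([^\"]*)(\")");

    /** Returns a copy of html in which spaces inside double-quoted href values are removed. */
    static String fixHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group(2).replace(" ", "");
            // quoteReplacement keeps '$' and '\' in a URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + cleaned + m.group(3)));
        }
        m.appendTail(out);  // Copy everything after the last match unchanged.
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "What a pretty flower! <a href=\"www.goo gle.com/...\">here</a>"
                     + " and <a href=\"w ww.google.com...\">here</a>";
        System.out.println(fixHrefs(input));
    }
}
```

Because only the captured attribute value is passed to replace, spaces in the visible text ("What a pretty flower!") are left alone, and any number of links in the same string is handled by the loop.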


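What is wrong with using replace(" ", "") on all the href values? Also, why try to fix data from a site that returns garbage?

If you only want to use replace on the links, you can also use a regex to identify the links. Or (see)

Yes, I am using Jsoup for parsing, but changing a substring will not change the initial string, right? I will give it a try, thanks.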
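On the last point: Java strings are immutable, so replace never changes the string it is called on. It returns a new string, which has to be assigned somewhere and then written back into the surrounding text or document. A small illustration, with a hypothetical class name and helper:

```java
public class ImmutableStringDemo {

    /** Hypothetical helper: returns a new string with all spaces removed. */
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    public static void main(String[] args) {
        String href = "www.goo gle.com";

        href.replace(" ", "");               // Result is discarded; href itself is unchanged.
        System.out.println(href);            // prints "www.goo gle.com"

        String cleaned = stripSpaces(href);  // Keep the returned string instead.
        System.out.println(cleaned);         // prints "www.google.com"
    }
}
```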