Java: Replacing German umlauts inside a specific attribute of heading HTML tags

java regex

I have a large HTML file with many lines like these:

<h1 id="anwendungsfälle-und--funktionen">Anwendungsfälle und -funktionen</h1> 
<h1 id="öl">Öl</h1>

I need to replace all umlaut characters (ü, ö, ä), but only those inside the tag brackets (so only in the heading id, not anywhere else):

<h1 id="anwendungsfaelle-und--funktionen">Anwendungsfälle und -funktionen</h1> 
<h1 id="oel">Öl</h1>

The ids may contain digits and single or double hyphens. I have run out of ideas for building a Java regex that matches these ids.

I tried something like this:

(<h)\d\s(id=")[A-Za-z0-9]*([-]{1}[A-Za-z0-9]*)*(">)

But this does not work (I know it is not written as a Java regex, it is only an example).

Your regex needs to look something like this:

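(?<=\Wid="[^"]{0,100})ä(?=[^"]*") // -> ae
(?<=\Wid="[^"]{0,100})ö(?=[^"]*") // -> oe
(?<=\Wid="[^"]{0,100})ü(?=[^"]*") // -> ue
(?<=\Wid="[^"]{0,100})Ä(?=[^"]*")
(?<=\Wid="[^"]{0,100})Ö(?=[^"]*")
(?<=\Wid="[^"]{0,100})Ü(?=[^"]*")
(?<=\Wid="[^"]{0,100})ß(?=[^"]*") // -> ss

The lookbehind checks that the character comes after id=" with no closing quote in between, and the lookahead checks that a closing quote follows. Older Java versions reject an unbounded * inside a lookbehind, so {0,100} (an assumed maximum id length) is used there. In a Java string literal, double each backslash and escape each quote.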
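
As a rough sketch of how lookaround patterns of this kind could be applied, the class below builds one pattern per character and runs String.replaceAll for each. The class name, the method name, the 100-character bound on the lookbehind and the uppercase replacements (Ae, Oe, Ue) are assumptions made for illustration, not part of the answer. Unicode escapes are used so the sketch does not depend on the source file encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class LookaroundUmlauts {
    // Assumed upper bound on id length; it keeps the lookbehind finite.
    private static final int MAX_ID_LENGTH = 100;

    // Character -> replacement. Uppercase targets are an assumption.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u00e4", "ae"); // a-umlaut
        REPLACEMENTS.put("\u00f6", "oe"); // o-umlaut
        REPLACEMENTS.put("\u00fc", "ue"); // u-umlaut
        REPLACEMENTS.put("\u00c4", "Ae"); // capital a-umlaut
        REPLACEMENTS.put("\u00d6", "Oe"); // capital o-umlaut
        REPLACEMENTS.put("\u00dc", "Ue"); // capital u-umlaut
        REPLACEMENTS.put("\u00df", "ss"); // sharp s
    }

    // Replaces the mapped characters only when they sit inside an id="..." value.
    static String fixIds(String html) {
        String result = html;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            String regex = "(?<=\\Wid=\"[^\"]{0," + MAX_ID_LENGTH + "})"
                    + Pattern.quote(e.getKey())
                    + "(?=[^\"]*\")";
            result = result.replaceAll(regex, e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fixIds("<h1 id=\"\u00f6l\">\u00d6l</h1>"));
    }
}
```

Note that these patterns look at any id attribute, not only those of heading tags, and each character costs one full pass over the input.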

You can use JSoup:
Document doc = Jsoup.parse(html); // Init the DOM structure
Elements hs = doc.select("*[id]");   // Find all tags with `id` attribute
for(int i = 0; i < hs.size(); i++){  // Iterate through the tags 
    Element h = hs.get(i);           // Get the current element
    if (h.tagName().matches("h\\d+")) { // If its tag is a heading tag
        String new_val = h.attr("id").replace("ä", "ae").replace("ö", "oe").replace("ü", "ue");
        h.attr("id",new_val);  // Replace the id attribute with a new one
    }
}
System.out.println(doc.toString());

Or a regex:
Map<String, String> dictionary = new HashMap<String, String>();
dictionary.put("ä", "ae");
dictionary.put("ö", "oe");
dictionary.put("ü", "ue");
String s = "<h1 id=\"anwendungsfälle-und--funktionen\">Anwendungsfälle und -funktionen</h1> \n<h1 id=\"öl\">Öl</h1>";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("(\\G(?!^)|<h\\d+\\s+id=\")([^\"]*?)([üöä])").matcher(s);
while (m.find()) {
    m.appendReplacement(result, m.group(1) + m.group(2) + dictionary.get(m.group(3)));
}
m.appendTail(result);
System.out.println(result.toString());
// => <h1 id="anwendungsfaelle-und--funktionen">Anwendungsfälle und -funktionen</h1> 
// <h1 id="oel">Öl</h1>
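
A variant of the same idea, sketched under a few assumptions: instead of continuing with \G from one umlaut to the next, it captures the whole id value once per heading and transliterates it with plain String.replace. The class and method names are hypothetical, the uppercase and ß mappings are assumed, and like the pattern above it expects id to be the first attribute of the heading tag. Unicode escapes keep the source independent of file encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingIdTransliterator {
    // Group 1: the heading tag up to and including id=" ; group 2: the id value.
    private static final Pattern HEADING_ID =
            Pattern.compile("(<h\\d+\\s+id=\")([^\"]*)");

    // Character -> replacement. Uppercase and sharp-s entries are assumptions.
    private static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("\u00e4", "ae"); // a-umlaut
        DICTIONARY.put("\u00f6", "oe"); // o-umlaut
        DICTIONARY.put("\u00fc", "ue"); // u-umlaut
        DICTIONARY.put("\u00c4", "Ae"); // capital a-umlaut
        DICTIONARY.put("\u00d6", "Oe"); // capital o-umlaut
        DICTIONARY.put("\u00dc", "Ue"); // capital u-umlaut
        DICTIONARY.put("\u00df", "ss"); // sharp s
    }

    // Applies every dictionary entry to a single id value.
    static String transliterate(String value) {
        String result = value;
        for (Map.Entry<String, String> e : DICTIONARY.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Rewrites only the id values of heading tags; all other text is copied as is.
    static String fixHeadingIds(String html) {
        Matcher m = HEADING_ID.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + transliterate(m.group(2));
            // quoteReplacement guards against '$' or '\' in the matched text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixHeadingIds("<h1 id=\"\u00f6l\">\u00d6l</h1>"));
    }
}
```

Keeping the transliteration in its own method means the mapping can grow without touching the regex.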

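Have you considered a solution that does not involve regex? Parsing HTML with regex has proven to be …

Here is a start: s.replaceAll(\\G(?…

What about uppercase umlauts (Ä, Ü, Ö) and ß? Can they not occur in the ids?

This was the first thing that came to my mind. If there is a simpler solution, I would welcome it.

All ids are lowercase and contain no ß, or at least I have not stumbled upon any.

The simpler solution is JSoup: select all h1 elements, check them, and correct them where needed.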
(\G(?!^)|<h\d+\s+id=")([^"]*?)([üöä])