Java: matching specific HTML attribute values

I want to match all attribute values of id, class, name and for. I wrote a simple function for this task:
private Collection<String> getAttributes(final String htmlContent) {
    final Set<String> attributes = new HashSet<>();
    final Pattern pattern = Pattern.compile("(class|id|for|name)=\\\"(.*?)\\\"");
    final Matcher matcher = pattern.matcher(htmlContent);
    while (matcher.find()) {
        attributes.add(matcher.group(2));
    }
    return attributes;
}

Example HTML content:
<input id="test" name="testName" class="aClass bClass" type="input" />



How can I split the HTML class values with a regular expression so that I get the following result set:

test
testName
aClass
bClass

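Is there any way to improve my code? I really don't like the loop.

If you take a look at an HTML parsing library such as jsoup, you will find useful tools for HTML parsing and manipulation.
For example: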
Document doc = ...//create HTML document
Elements htmlElements = doc.children();
htmlElements.traverse(new MyHtmlElementVisitor());

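The class MyHtmlElementVisitor only has to implement the library's node visitor interface (NodeVisitor in jsoup); it can then access each node and its attributes.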

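Although you may find a good regular expression that does the same job, that approach has several drawbacks. To name just a few:

It is hard to find a fail-safe regular expression for every possible HTML document.
It is hard to read, so it is hard to spot bugs and to implement changes.
Regular expressions are usually not reusable.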
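
To make the first drawback concrete, here is a minimal sketch that runs a pattern equivalent to the one in the question against a few inputs it was not written for. The class name RegexPitfalls and the sample inputs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPitfalls {
    // Equivalent to the pattern from the question: attribute name, '=', double-quoted value.
    static final Pattern PATTERN = Pattern.compile("(class|id|for|name)=\"(.*?)\"");

    // Collects the raw attribute values in match order (no whitespace splitting).
    static List<String> match(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = PATTERN.matcher(html);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Single-quoted values are valid HTML but are not matched.
        System.out.println(match("<div id='main'></div>"));      // prints []
        // data-id is a different attribute, yet its "id=" suffix matches.
        System.out.println(match("<div data-id=\"42\"></div>")); // prints [42]
        // Attribute-like text inside a comment is matched as well.
        System.out.println(match("<!-- id=\"old\" -->"));        // prints [old]
    }
}
```

Each of these cases can be patched in the pattern, but every patch makes the expression longer and harder to read, which is the second drawback in the list.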
Seriously.
If your document is actually XHTML, you can use XPath:
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(
    "//@*["
        + "local-name()='class'"
        + " or local-name()='id'"
        + " or local-name()='for'"
        + " or local-name()='name'"
    + "]",
    new InputSource(new StringReader(htmlContent)),
    XPathConstants.NODESET);
int count = nodes.getLength();
for (int i = 0; i < count; i++) {
    Collections.addAll(attributes,
        nodes.item(i).getNodeValue().split("\\s+"));
}
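
For reference, the XPath fragment above can be wrapped into a self-contained method roughly as follows. This is only a sketch: the class name XPathAttributes, the sorted TreeSet result and the exception wrapping are choices made here, not part of the original snippet, and the input must be well-formed XML.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAttributes {
    // Returns the whitespace-separated tokens of all class, id, for and name attributes.
    static Set<String> getAttributes(String xhtml) {
        Set<String> attributes = new TreeSet<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Select every attribute node whose local name is one of the four names.
            NodeList nodes = (NodeList) xpath.evaluate(
                "//@*[local-name()='class' or local-name()='id'"
                    + " or local-name()='for' or local-name()='name']",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // class="aClass bClass" contributes two separate entries.
                for (String token : nodes.item(i).getNodeValue().split("\\s+")) {
                    if (!token.isEmpty()) {
                        attributes.add(token);
                    }
                }
            }
        } catch (XPathExpressionException e) {
            // Thrown, for example, when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
        return attributes;
    }

    public static void main(String[] args) {
        String html = "<input id=\"test\" name=\"testName\""
            + " class=\"aClass bClass\" type=\"input\" />";
        System.out.println(getAttributes(html)); // prints [aClass, bClass, test, testName]
    }
}
```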

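As for avoiding the loop, I don't think that is possible. In any case, any implementation will use one or more loops internally.
If your document is not XHTML, the Swing HTML parser (HTMLEditorKit) can be used instead: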
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
    private final Object[] attributesOfInterest = {
        HTML.Attribute.CLASS,
        HTML.Attribute.ID,
        "for",
        HTML.Attribute.NAME,
    };

    private void addAttributes(AttributeSet attr) {
        for (Object a : attributesOfInterest) {
            Object value = attr.getAttribute(a);
            if (value != null) {
                Collections.addAll(attributes,
                    value.toString().split("\\s+"));
            }
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag,
                               MutableAttributeSet attr,
                               int pos) {
        addAttributes(attr);
        super.handleStartTag(tag, attr, pos);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
                                MutableAttributeSet attr,
                                int pos) {
        addAttributes(attr);
        super.handleSimpleTag(tag, attr, pos);
    }
};

HTMLDocument doc = (HTMLDocument)
    new HTMLEditorKit().createDefaultDocument();
doc.getParser().parse(new StringReader(htmlContent), callback, true);
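
If the objection to the loop is mainly about style, the explicit while loop from the question can be hidden behind a stream. The following is a sketch only, assuming Java 9 or later for Matcher.results(); the class name StreamAttributes is made up, and the pattern is the one from the question, so it keeps all the limitations of matching HTML with a regular expression. The iteration still happens inside the stream.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StreamAttributes {
    // Equivalent to the pattern from the question.
    private static final Pattern ATTR =
        Pattern.compile("(class|id|for|name)=\"(.*?)\"");

    static Set<String> getAttributes(String html) {
        return ATTR.matcher(html).results()                // one MatchResult per attribute (Java 9+)
            .map(m -> m.group(2).trim())                   // the quoted attribute value
            .filter(value -> !value.isEmpty())             // skip empty values such as class=""
            .flatMap(value -> Arrays.stream(value.split("\\s+"))) // "aClass bClass" becomes two entries
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String html = "<input id=\"test\" name=\"testName\""
            + " class=\"aClass bClass\" type=\"input\" />";
        System.out.println(getAttributes(html)); // prints [aClass, bClass, test, testName]
    }
}
```

The TreeSet is used here only to get a stable, sorted output; a HashSet as in the question works just as well when order does not matter.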