Java: how to remove all tags except whitelisted ones with jsoup
Tags: java, html, parsing, jsoup

Removing a listed set of tags by selector:
Elements els = doc.select(toDoRemoveTAG);
for (Element e : els) {
    e.remove();
}

You can use the jsoup HTML Cleaner with a configuration specified by a Whitelist:
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>

Could we drop the select-and-remove loop and use toDoRemoveTAG to build a whitelist instead, then clean with it? That is, collect every tag that occurs in the document and build a whitelist from them, leaving out all tags and attributes named in toDoRemoveTAG.
I mean something like the complete program listed further down.
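Cleaner only unwraps the content of tags that are not listed in the whitelist. I need the same behavior as remove(), which deletes the whole enclosing tag and leaves nothing behind, not even the text from that tag.
@Saym Thanks for the detailed reply. With toDoRemoveTAG I can remove the listed tags and allow the rest, but I don't want to pass toDoRemoveTAG: the list is long, and I can't even collect every tag that should be removed. I'd rather do it the other way round and pass only allowTagList, so that every other tag is removed. I can use your code with the condition changed, though, and it works perfectly.
@HybrisHelp Were you able to work out the allowTagList implementation for this answer, based on the comments above?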
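The comments above ask for the reverse configuration: pass only allowTagList and get rid of everything else, either by unwrapping (text kept) or by removing the whole element (text gone). Here is a minimal sketch of both variants. It assumes jsoup is on the classpath in a version that still ships org.jsoup.safety.Whitelist (later releases name the class Safelist). The class name AllowListSketch and both method names are made up for illustration and are not part of the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AllowListSketch {

    /**
     * Variant 1: let the Cleaner do the work. Tags outside the list are dropped,
     * but their text content is kept (the "unwrap" behavior from the comments).
     * No attributes are allowed here; add them with addAttributes if needed.
     */
    public static String unwrapDisallowed(String html, String allowTagList) {
        Whitelist whitelist = new Whitelist().addTags(allowTagList.split(","));
        return Jsoup.clean(html, whitelist);
    }

    /**
     * Variant 2: remove every element whose tag is not in the list, together
     * with everything inside it, including allowed descendants and text.
     * Attributes on the surviving elements are NOT filtered by this method.
     */
    public static String removeDisallowed(String html, String allowTagList) {
        Set<String> allowed = new HashSet<>(Arrays.asList(allowTagList.split(",")));
        Document doc = Jsoup.parseBodyFragment(html);
        Element body = doc.body();
        // select("*") also matches the body element itself, so skip it explicitly.
        for (Element e : body.select("*")) {
            if (e != body && !allowed.contains(e.tagName())) {
                e.remove();
            }
        }
        return body.html();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello <b>bold</b></p><canvas>fallback text</canvas></div>";
        String allowTagList = "p,span,b,i,u,div,br,a";
        System.out.println(unwrapDisallowed(html, allowTagList));
        System.out.println(removeDisallowed(html, allowTagList));
    }
}
```

Variant 2 leaves attributes such as onclick untouched, so for untrusted input it seems safer to pass its result through Jsoup.clean as well. It also drops an allowed tag whenever that tag sits inside a disallowed one, which may or may not be what you want.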
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Collector;
import org.jsoup.select.Evaluator;
public class MatrixMultiplication {

    public static void main(String[] args) {
        String html = "<video width='320' height='240' controls autoplay> <source src='movie.ogg' type='video/ogg'> "
                + "<source src='movie.mp4' type='video/mp4'> <object data='movie.mp4' width='320' height='240'> "
                + "<embed width='320' height='240' src='movie.swf'> </object></video>"
                + "<canvas id='myCanvas' width='200' height='100' style='border:1px solid #000000;'>"
                + "Your browser does not support the HTML5 canvas tag.</canvas><article> <header> "
                + "<h1>Internet Explorer 9</h1> <p><time pubdate datetime='2011-03-15'></time></p> "
                + "</header> <p>Windows Internet Explorer 9 (abbreviated as IE9) was released to the public on March 14, 2011 at 21:00 PDT.....</p>"
                + "</article><footer> <p>Posted by: Hege Refsnes</p> <p>Contact information: <a href='mailto:someone@example.com'> someone@example.com</a>.</p>"
                + "</footer> <nav> <a href='/html/'>HTML</a> | <a href='/css/'>CSS</a> | <a href='/js/'>JavaScript</a> | "
                + "<a href='/jquery/'>jQuery</a></nav> <section> <h1>WWF</h1> <p>The World Wide Fund for Nature (WWF) is....</p></section><datalist id='browsers'>"
                + " <option value='Internet Explorer'> <option value='Firefox'> <option value='Chrome'> <option value='Opera'> <option value='Safari'></datalist>"
                + " <audio controls> <source src='horse.ogg' type='audio/ogg'> <source src='horse.mp3' type='audio/mpeg'>Your browser does not support the audio element.</audio>"
                + " <progress value='22' max='100'>teasdklfjashdfjkl</progress> ";

        String toDoRemoveTAG = "style,img,script,noscript,hr,input";
        String allowTagList = "p,span,b,i,u,div,br,a"; // not used by this program

        Document doc = Jsoup.parse(html);
        // Names are upper-cased here and again on lookup, so the comparison ignores case.
        Whitelist whitelist = buildWhiteList(doc, Arrays.asList(toDoRemoveTAG.toUpperCase().split(",")));
        Cleaner cleaner = new Cleaner(whitelist);
        doc = cleaner.clean(doc);
        System.out.println(doc.select("body").html());
    }

    // Builds a whitelist from every tag and attribute that occurs in the document,
    // except those whose name appears in toDoRemoveTAG.
    private static Whitelist buildWhiteList(Document doc, List<String> toDoRemoveTAG) {
        Whitelist whitelist = new Whitelist();
        Set<String> allowedTags = new HashSet<String>();
        Map<String, Set<String>> allowedAttributes = new HashMap<String, Set<String>>();

        // Instantiate the evaluator directly instead of via the reflective Class.newInstance().
        for (Element e : Collector.collect(new Evaluator.AllElements(), doc)) {
            String tag = e.tagName();
            if (!toDoRemoveTAG.contains(tag.toUpperCase())) {
                allowedTags.add(tag);
                for (Attribute attr : e.attributes()) {
                    // An attribute named like a listed tag (for example style) is dropped too.
                    if (!toDoRemoveTAG.contains(attr.getKey().toUpperCase())) {
                        allowedAttributes.computeIfAbsent(tag, k -> new HashSet<String>()).add(attr.getKey());
                    }
                }
            }
        }

        whitelist.addTags(allowedTags.toArray(new String[0]));
        for (Entry<String, Set<String>> e : allowedAttributes.entrySet()) {
            whitelist.addAttributes(e.getKey(), e.getValue().toArray(new String[0]));
        }
        return whitelist;
    }
}
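The program above splits toDoRemoveTAG with toUpperCase().split(",") and looks names up in a List, which breaks on stray spaces ("img, hr") and scans the list on every lookup. Here is a small sketch of a tag-list helper that normalizes the names once and keeps them in a Set. It uses only the JDK. TagLists and its method names are hypothetical, and lower-casing is chosen on the assumption that the HTML was parsed with jsoup's default settings, which report tag names in lower case.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class TagLists {

    /** Parses a comma-separated list such as " p, SPAN,,b " into {p, span, b}. */
    public static Set<String> parse(String csv) {
        Set<String> tags = new LinkedHashSet<>();
        for (String raw : csv.split(",")) {
            String tag = raw.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag); // empty entries from ",," are skipped
            }
        }
        return tags;
    }

    /** Allow-list variant: of the tags found in a document, keep only the allowed ones. */
    public static Set<String> keepOnly(Set<String> found, Set<String> allow) {
        Set<String> result = new LinkedHashSet<>(found);
        result.retainAll(allow);
        return result;
    }

    /** Removal-list variant (what the program above does): found tags minus the listed ones. */
    public static Set<String> dropListed(Set<String> found, Set<String> remove) {
        Set<String> result = new LinkedHashSet<>(found);
        result.removeAll(remove);
        return result;
    }

    public static void main(String[] args) {
        Set<String> found = parse("video,p,canvas,b");
        System.out.println(keepOnly(found, parse("p,span,b,i,u,div,br,a")));
        System.out.println(dropListed(found, parse("style,img,script,noscript,hr,input")));
    }
}
```

Either resulting set can be handed to addTags via toArray(new String[0]); switching from the removal list to allowTagList then only changes which helper is called.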