C# - Remove unwanted characters

I made a proxy scraper, but I am facing a problem:

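// Requires: using System.Linq; using System.Net; using System.Text.RegularExpressions;
Regex rgx = new Regex(@"[a-zA-Z<>/]"); // characters I want to strip from each match (not used yet)
WebClient wc = new WebClient();
foreach (string s in urls)
{
    try
    {
        string html = wc.DownloadString(s); // page source of the proxy list
        foreach (string reg in regex) // regex is a list of patterns read from a text file
        {
            MatchCollection mc = Regex.Matches(html, reg);
            foreach (Match m in mc)
            {
                Proxies.Add(m.Value);
            }
        }
    }
    catch (WebException)
    {
        // Skip URLs that fail to download
    }
}
Proxies = Proxies.Distinct().ToList(); // remove duplicate entries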


The regexes I use:

\d{1,4}[.]\d{1,4}[.]\d{1,4}[.]\d{1,4}[:0-9]+
\d{1,8}[.]\d{1,8}[.]\d{1,8}\d{1,8}[.]\d{1,8}<\/td><td>[0-9]+


Because I don't know how to write a regex that works for every website, I had to use a text file so that users can write their own regexes.
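A minimal sketch of how such user-supplied patterns might be loaded, assuming one pattern per line. The class name `PatternLoader` and its `Compile` method are hypothetical and do not come from the original post; the sketch only illustrates skipping blank lines and rejecting patterns that do not parse.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternLoader
{
    // Hypothetical helper: compiles one pattern per line, skipping blank lines
    // and lines that are not valid regular expressions.
    public static List<Regex> Compile(IEnumerable<string> lines)
    {
        var patterns = new List<Regex>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.Length == 0)
            {
                continue;
            }

            try
            {
                patterns.Add(new Regex(trimmed));
            }
            catch (ArgumentException)
            {
                // Invalid user-supplied pattern: ignore it rather than abort the scrape.
            }
        }
        return patterns;
    }
}
```

It could be fed with something like `PatternLoader.Compile(File.ReadAllLines("regex.txt"))`, where the file name is only an example.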

The problem with the second regex is that it scrapes results like this:

1.1.1.1</td><td>8080


I want it to replace the unwanted characters (`<`, `>`, `/` and the letters a-z, as in `rgx` above) with "…"
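One possible way to do this replacement, as a sketch rather than a definitive fix. It assumes the desired result is the usual `ip:port` form, which is the shape the first regex matches. The `:` separator is that assumption, and the `ProxyCleaner` class is hypothetical. Instead of replacing each unwanted character one by one, it collapses any run of characters that are neither digits nor dots into a single `:`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProxyCleaner
{
    // Any run of characters that are neither digits nor dots, e.g. "</td><td>".
    private static readonly Regex Separator = new Regex(@"[^0-9.]+");

    // Hypothetical helper: replaces each such run with a single ':' (assumed separator).
    public static string Normalize(string rawMatch)
    {
        return Separator.Replace(rawMatch.Trim(), ":");
    }
}
```

In the scraping loop this would be used as `Proxies.Add(ProxyCleaner.Normalize(m.Value));`. Matches that are already in `ip:port` form pass through unchanged, because the only non-digit, non-dot run is the colon itself.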

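I would advise against using regular expressions to clean HTML. HTML is not a regular language, so you will keep running into this kind of problem. Instead of stripping characters from the entire web page, focus on reading the part of the HTML you are actually targeting. That will be much easier. Let the user specify the ID or name of the section to target.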
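A rough sketch of that idea using only the standard library. It assumes the markup is well-formed enough to parse as XML, which real pages often are not; a tolerant HTML parser would be needed in that case. The `ProxyTableReader` class, the `tableId` parameter, and the assumption that each row holds the IP in its first cell and the port in its second are illustrative and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ProxyTableReader
{
    // Hypothetical helper: reads "ip:port" pairs from the rows of the element
    // whose id attribute equals tableId. Assumes the markup parses as XML.
    public static List<string> ReadProxies(string markup, string tableId)
    {
        XDocument doc = XDocument.Parse(markup);

        // Locate the user-specified section instead of scanning the whole page.
        XElement table = doc.Descendants()
            .FirstOrDefault(e => (string)e.Attribute("id") == tableId);
        if (table == null)
        {
            return new List<string>();
        }

        // Take the first two cells of each data row as IP and port.
        return table.Descendants("tr")
            .Select(row => row.Elements("td").Select(td => td.Value.Trim()).ToList())
            .Where(cells => cells.Count >= 2)
            .Select(cells => cells[0] + ":" + cells[1])
            .Distinct()
            .ToList();
    }
}
```

With this shape the user supplies an element ID rather than a regex, and header rows (which contain no `td` cells) are skipped automatically.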