Regex: a regular expression to match the strings with the largest numeric value

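I am trying to find a way, using the regular expression below, to match only the strings with the largest numeric value in each group.

Input strings: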

    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt

    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt

    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt

    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv

Expected output:

    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv

I am trying the expression below, but it matches all of the URLs. How can I restrict the result to the ones I want?

    "https://subdomain.domain.com/([^,:"]+?([_\d]*?)).(txt|csv)"

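You can use a negated character class `[^,:"]+` to match anything except a comma, colon, or double quote. I don't think you have to use …

Then match 1+ digits followed by an underscore, and use an alternation to match any of the listed numbers: `(?:500|1280|980)`.

For the example data you don't have to match an underscore or a digit 0+ times non-greedily with `[_\d]*?`; you can instead match 1+ digits with `\d+`, followed by an underscore.

Note: escape the dot as `\.` to match it literally.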

https://subdomain\.domain\.com/[^,:"]+\d+_(?:500|1280|980)\.(?:txt|csv)
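
As a quick check, the pattern above can be run against a few of the sample URLs from the question. The following is only a minimal C# sketch: the class name `HardCodedMatch` and the method name `Filter` are made up for illustration, and the `^`/`$` anchors are an addition so that each URL is tested as a whole. Because the numbers are hard-coded, the `_250` URL of the third group is not matched, and adding `250` to the alternation would also match the `_250` URL of the first group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HardCodedMatch
{
    // The pattern from the answer above; the ^ and $ anchors are an
    // assumption added here so that a whole URL must match.
    private static readonly Regex Pattern = new Regex(
        @"^https://subdomain\.domain\.com/[^,:""]+\d+_(?:500|1280|980)\.(?:txt|csv)$");

    // Hypothetical helper: keeps only the URLs whose trailing number
    // is one of the hard-coded values 500, 1280 or 980.
    public static List<string> Filter(IEnumerable<string> urls)
    {
        return urls.Where(u => Pattern.IsMatch(u)).ToList();
    }

    public static void Main()
    {
        var urls = new[]
        {
            "https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt",
            "https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt",
            "https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt",
            "https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv",
        };

        foreach (var url in Filter(urls))
            Console.WriteLine(url);
    }
}
```

With these four inputs, only the `_500.txt` and `_980.csv` URLs are printed. The `_250.txt` URL is dropped even though it is the largest in its group, which is why a fixed alternation only works when the wanted numbers are known in advance and do not occur in other groups.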

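When I learned that it is almost impossible to achieve this with Regex alone, I implemented it in C# using LINQ, without Regex. Thanks to Burdui, I came up with this while trying out your suggestion: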

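    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Extracts the number between the last '_' and the following '.',
    // e.g. 500 in "..._500.txt". Assumes every URL ends in "_<number>.<ext>".
    private static int TrailingNumber(string url)
    {
        int start = url.LastIndexOf('_') + 1;
        return Int32.Parse(url.Substring(start, url.IndexOf('.', start) - start));
    }

    public List<string> FindUnique(List<string> urls)
    {
        return urls
            .Distinct()
            // Group on everything up to and including the last underscore.
            .GroupBy(u => u.Substring(0, u.LastIndexOf('_') + 1))
            // Keep the URL with the largest trailing number in each group.
            // Comparing parsed numbers avoids false hits from Contains(),
            // which could also match digits elsewhere in the URL.
            .Select(g => g.OrderByDescending(TrailingNumber).First())
            .ToList();
    }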

If your blocks really are grouped as in your question, this is easy to do with a regex:

    @"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)"

Explanation

 (?m)
 (?:                           # Cluster group for block
      ^                             # BOL
      [^\S\r\n]*                    # Optional horizontal whitespace
      ( https?:// \S+? _ )          # (1), Location
      ( \d+ )                       # (2), Number
      \. 
      ( txt | csv )                 # (3), Extension
      [^\S\r\n]*                    # Optional horizontal whitespace
      $ \r? \n                      # EOL plus linebreak
 )+                            # End cluster, 1 to many times
 (?= \s* \r \n | $ )           # Lookahead to determine where the end of block is

C# code example

using System;
using System.Text.RegularExpressions;

var str =
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt\n" + 
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt\n" +
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt\n" +
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt\n" +
"\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt\n" +
"\n" +
"    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt\n" +
"    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt\n" +
"\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv\n" +
"\n";

// This regex matches a block each time
var RxBlock = new Regex(@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)");

Match M = RxBlock.Match(str);
while (M.Success)
{
    CaptureCollection ccFileLoc = M.Groups[1].Captures;  // location
    CaptureCollection ccFileNum = M.Groups[2].Captures;  // number
    CaptureCollection ccFileExt = M.Groups[3].Captures;  // extension

    String Loc = ccFileLoc[0].Value;
    String Ext = ccFileExt[0].Value;
    int Largest = 0;
    bool bValid = true;

    if (Int32.TryParse(ccFileNum[0].Value, out Largest))
    {
        int cur_num = 0;
        int cnt = ccFileLoc.Count;

        for (int i = 0; bValid && i < cnt; i++)
        {
            if (!Int32.TryParse(ccFileNum[i].Value, out cur_num) || ccFileLoc[i].Value != Loc)
                bValid = false;
            else
            if (cur_num > Largest)
            {
                Largest = cur_num;
                Ext = ccFileExt[i].Value;
            }
        }
    }
    else
        bValid = false;

    if ( bValid )
        Console.WriteLine("{0}{1}.{2} ", Loc, Largest, Ext);

    M = M.NextMatch();
}

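Even if your data is not sorted, you can still use a regex this way. The lines must be sorted first, and then the regex needs a slight modification. If you want to go that way, let me know and I can show you how.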

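If the only difference is the number at the end, then try `https://subdomain\.domain\.com/[^,:"]+(?:500|1280|980)\.(?:txt|csv)`

Can you explain what criteria you use to filter? I can write the regex `980\.csv$|((25|50|128)0\.txt)$`, which will filter your input, depending on the language.

How can you all answer without knowing which language he is using? I think it is Python, but I'm not sure.

What you want is beyond what regular expressions can do. Use a programming language to group the results by the first capture group, then extract the maximum by the second capture group.

Did you say what you wanted? I didn't see that sentence.

That is a detailed and wonderful explanation. Thank you for your effort; it helped me solve this problem in several ways. Thanks again.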

Output of the C# block-matching example:

    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
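
One of the comments suggests grouping the results by the first capture group and then taking the maximum of the second capture group. A minimal C# sketch of that idea might look like the following. The names `LargestPerGroup` and `Pick` are hypothetical, and the pattern assumes that every URL ends in `_<number>.txt` or `_<number>.csv`, as in the question's sample data. Unlike the block-matching regex, this does not require the input lines to be sorted or grouped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LargestPerGroup
{
    // Group 1: everything up to and including the last underscore.
    // Group 2: the trailing number. Group 3: the extension.
    private static readonly Regex UrlPattern =
        new Regex(@"^(https?://\S+_)(\d+)\.(txt|csv)$");

    // Hypothetical helper: returns, for each distinct prefix (group 1),
    // the URL whose number (group 2) is the largest.
    public static List<string> Pick(IEnumerable<string> urls)
    {
        return urls
            .Select(u => UrlPattern.Match(u.Trim()))
            .Where(m => m.Success)                      // skip lines that do not fit the pattern
            .GroupBy(m => m.Groups[1].Value)            // group by first capture group
            .Select(g => g
                .OrderByDescending(m => int.Parse(m.Groups[2].Value))
                .First()                                // largest second capture group
                .Value)
            .ToList();
    }

    public static void Main()
    {
        // Deliberately unsorted input taken from the question's sample data.
        var urls = new[]
        {
            "https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt",
            "https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt",
            "https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt",
            "https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt",
            "https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv",
            "https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt",
        };

        foreach (var url in Pick(urls))
            Console.WriteLine(url);
    }
}
```

With this input the sketch prints the `_500.txt`, `_1280.txt` and `_980.csv` URLs, in the order in which each group first appears. The regex is used only to split each URL into prefix, number and extension; the "largest per group" logic lives in LINQ, since that comparison is not something a regular expression can express on its own.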