Regex: matching the string with the largest numeric value
I'm trying to find a way to match strings from the list below using a regular expression. Input strings:
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
Expected output:
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
I'm trying the expression below, but it matches every URL. How can I limit the result to the ones I want?
"https://subdomain.domain.com/([^,:"]+?([_\d]*?)).(txt|csv)"
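You could use a negated character class [^,:"]+ to match anything other than a comma, colon, or double quote. I don't think you need the ? after it.
Then match 1+ digits followed by an underscore and one of the listed numbers, using an alternation: (?:500|1280|980)
For the example data you don't need the non-greedy [_\d]*? that matches an underscore or a digit 0+ times; you can match 1+ digits and an underscore instead: \d+_
Note: escape the dot as \. to match it literally.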
https://subdomain\.domain\.com/[^,:"]+\d+_(?:500|1280|980)\.(?:txt|csv)
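To check which of the sample URLs this pattern selects, here is a minimal C# sketch. It assumes the input is one newline-separated string; the class name FixedListDemo and the method name Select are illustrative and not part of the answer. Because the alternation is hard-coded, only URLs ending in a listed number can match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FixedListDemo
{
    // Pattern from the answer above; the accepted numbers are hard-coded.
    private static readonly Regex UrlPattern = new Regex(
        @"https://subdomain\.domain\.com/[^,:""]+\d+_(?:500|1280|980)\.(?:txt|csv)");

    // Returns every substring of the input that the pattern matches,
    // in the order the matches occur.
    public static List<string> Select(string input)
    {
        return UrlPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

On the sample input this should pick the _500.txt, _1280.txt and _980.csv URLs, but not the _250.txt one, since 250 is not in the alternation. Any new maximum has to be added to the list by hand.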
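Once I learned that this is nearly impossible to achieve with regex alone, I implemented it in C# using LINQ instead of regex. Thanks to Burdui, I came up with this while trying out your suggestion: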
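using System;
using System.Collections.Generic;
using System.Linq;

// Returns, for each URL prefix (everything before the last '_'),
// the URL whose trailing number is the largest.
// Assumes every URL contains '_' followed by digits and an extension.
public static List<string> FindUnique(List<string> urls)
{
    return urls
        .Distinct()
        .GroupBy(u => u.Substring(0, u.LastIndexOf('_')))
        // Compare the parsed numbers directly instead of searching the URL
        // for the maximum as text, which could also hit digits in the hash.
        .Select(g => g.OrderByDescending(TrailingNumber).First())
        .ToList();
}

// Parses the number between the last '_' and the extension, e.g. 500 in "..._500.txt".
private static int TrailingNumber(string url)
{
    string tail = url.Substring(url.LastIndexOf('_') + 1);
    return int.Parse(tail.Split('.')[0]);
}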
If your blocks really are grouped the way they are in your question, this is easy to do with a regex:
@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)"
Explanation
(?m)
(?: # Cluster group for block
^ # BOL
[^\S\r\n]* # Optional horizontal whitespace
( https?:// \S+? _ ) # (1), Location
( \d+ ) # (2), Number
\.
( txt | csv ) # (3), Extension
[^\S\r\n]* # Optional horizontal whitespace
$ \r? \n # EOL plus linebreak
)+ # End cluster, 1 to many times
(?= \s* \r \n | $ ) # Lookahead to determine where the end of block is
C# code example
var str =
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt\n" +
"\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt\n" +
"\n" +
" https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt\n" +
" https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt\n" +
"\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv\n" +
"\n";
// This regex matches a block each time
var RxBlock = new Regex(@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)");
Match M = RxBlock.Match(str);
while (M.Success)
{
CaptureCollection ccFileLoc = M.Groups[1].Captures; // location
CaptureCollection ccFileNum = M.Groups[2].Captures; // number
CaptureCollection ccFileExt = M.Groups[3].Captures; // extension
String Loc = ccFileLoc[0].Value;
String Ext = ccFileExt[0].Value;
int Largest = 0;
bool bValid = true;
if (Int32.TryParse(ccFileNum[0].Value, out Largest))
{
int cur_num = 0;
int cnt = ccFileLoc.Count;
for (int i = 0; bValid && i < cnt; i++)
{
if (!Int32.TryParse(ccFileNum[i].Value, out cur_num) || ccFileLoc[i].Value != Loc)
bValid = false;
else
if (cur_num > Largest)
{
Largest = cur_num;
Ext = ccFileExt[i].Value;
}
}
}
else
bValid = false;
if ( bValid )
Console.WriteLine("{0}{1}.{2} ", Loc, Largest, Ext);
M = M.NextMatch();
}
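Even if your data isn't sorted, you can still use a regex this way.
The lines have to be sorted first.
After that, the regex needs a slight modification. If you want to go that way, let me know and I can show you how.
If the only difference is the number at the end, then try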
https://subdomain\.domain\.com/[^,:"]+(?:500|1280|980)\.(?:txt|csv)
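Can you explain what criteria you're using to filter? I could write the regex 980\.csv$|((25|50|128)0\.txt)$ and it would filter your input.
It depends on the language. How can you all answer without knowing which language he's using? I think it's Python, but I'm not sure.
What you want is beyond what regex can do. Use a programming language to group the results by the first capture group, then take the maximum of the second capture group.
Did you say what you wanted? I don't see that sentence.
That was a detailed and wonderful explanation. Thank you for the effort; it helped me solve this in several ways. Thanks again.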
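The suggestion to group by the first capture group and take the maximum of the second can be sketched in C# roughly as follows. This is only a sketch under assumptions: the pattern is a simplified single-line variant of the one used in the block-matching answer, and the names LargestPerGroup and Select are made up for illustration. It does not require the input to be sorted or grouped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LargestPerGroup
{
    // Group 1: the URL up to and including the last '_' before the number.
    // Group 2: the number. Group 3: the extension.
    private static readonly Regex UrlPattern = new Regex(
        @"^(https?://\S+_)(\d+)\.(txt|csv)$");

    // Keeps one URL per distinct group-1 prefix: the one with the largest number.
    // Lines that do not match the pattern are skipped.
    public static List<string> Select(IEnumerable<string> urls)
    {
        return urls
            .Select(u => UrlPattern.Match(u.Trim()))
            .Where(m => m.Success)
            .GroupBy(m => m.Groups[1].Value)
            .Select(g => g.OrderByDescending(m => int.Parse(m.Groups[2].Value))
                          .First()
                          .Value)
            .ToList();
    }
}
```

Since GroupBy yields groups in order of first appearance, the results come back in the order the prefixes first occur in the input. The extension is not part of the grouping key, so a .csv and a .txt file with the same prefix compete on the number alone.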