C++: QRegExp for HTML image tags

c++ regex qt

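First off, let me say that I understand that using regexes on HTML is a bad idea. I'm only using one to grab the URLs of all the images on a web page. However, I only seem to get the first result. Is it my regex, or the way I'm using it? My regex skills are rusty, so I may be missing something obvious.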
QRegExp imgTagRegex("(<img.*>)+", Qt::CaseInsensitive); //Grab the entire <img> tag
imgTagRegex.setMinimal(true);
imgTagRegex.indexIn(pDocument);
QStringList imgTagList = imgTagRegex.capturedTexts();
imgTagList.removeFirst();   //the first is always the total captured text

foreach (QString imgTag, imgTagList) //now we want to get the source URL
{
    QRegExp urlRegex("src=\"(.*)\"", Qt::CaseInsensitive);
    urlRegex.setMinimal(true);
    urlRegex.indexIn(imgTag);
    QStringList resultList = urlRegex.capturedTexts();
    resultList.removeFirst();
    imageUrls.append(resultList.first());
}

This gets me what I want, but I know there are more image tags on the page... Any idea why I only get the first one back?

UPDATE
With the help of Sebastian Lange, I've gotten this far:
QRegExp imgTagRegex("<img.*src=\"(.*)\".*>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlMatches;
QStringList imgMatches;
int offset = 0;
while(offset >= 0)
{
    offset = imgTagRegex.indexIn(pDocument, offset);
    offset += imgTagRegex.matchedLength();

    QString imgTag = imgTagRegex.cap(0);
    if (!imgTag.isEmpty())
        imgMatches.append(imgTag); // Should hold complete img tag

    QString url = imgTagRegex.cap(1);
    if (!url.isEmpty())
    {
        url = url.split("\"").first(); //ehhh....
        if (!urlMatches.contains(url))
            urlMatches.append(url); // Should hold only src property
    }
}

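The split at the end is a hacky way to strip the non-src parts out of the capture. It works, but only because I couldn't find the correct way to do it. I also added some normalization.

QRegExp normally gives you only one match. The list returned by capturedTexts() holds all the captures of that single match! A single regex can contain several pairs of capturing parentheses. To solve your problem, you need to do something like this: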
QRegExp imgTagRegex("\\<img[^\\>]*src\\s*=\\s*\"([^\"]*)\"[^\\>]*\\>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlmatches;
QStringList imgmatches;
int offset = 0;
while( (offset = imgTagRegex.indexIn(pDocument, offset)) != -1){
    offset += imgTagRegex.matchedLength();
    imgmatches.append(imgTagRegex.cap(0)); // Should hold complete img tag
    urlmatches.append(imgTagRegex.cap(1)); // Should hold only src property
}
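
For comparison, the same "search again from where the last match ended" idea can be sketched without Qt, using std::regex from the standard library. This is only a sketch under assumptions: the helper name extractImgSrcs is made up for illustration, the pattern is the one above with the escapes on < and > dropped (they are not needed in ECMAScript syntax), and std::regex::icase stands in for Qt::CaseInsensitive. std::sregex_iterator resumes after each match, which plays the role of the offset variable.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): returns the src value of every
// <img> tag found in `html`, in document order.
std::vector<std::string> extractImgSrcs(const std::string& html)
{
    // Same shape as the QRegExp pattern above; icase mirrors Qt::CaseInsensitive.
    static const std::regex imgTag(
        R"re(<img[^>]*src\s*=\s*"([^"]*)"[^>]*>)re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    // Each increment of the iterator searches again from the end of the
    // previous match, so every tag is visited once.
    for (std::sregex_iterator it(html.begin(), html.end(), imgTag), end; it != end; ++it)
        urls.push_back((*it)[1].str()); // capture group 1 is the src value
    return urls;
}
```

Like the QRegExp version, this only recognizes src values in double quotes; a tag with a single-quoted or unquoted src is skipped.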

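EDIT: Changed the capture regular expression (the code above reflects the change).
EDIT2: Allowed for possible whitespace around the = of the src attribute; this is the \\s*=\\s* part of the pattern above.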
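Thanks Sebastian, I'll give it a try and get back to you. It looks like indexIn returned -1 on the first run.
Try it with the updated pattern.
I quickly checked it myself: the new regex captures the src correctly, so you should be able to use this solution now. At least for the given img tags, it works fine. If there may be whitespace around the = in src="url", you will eventually want to replace the pattern with one that allows for it, i.e. with \\s*=\\s* between src and the opening quote.
There seems to be a bracket missing two lines before the last line.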
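
The whitespace point from the comments can be checked with a small Qt-free sketch. It assumes std::regex with ECMAScript syntax rather than QRegExp, and the names findsSrc, strictPattern and lenientPattern are invented for this illustration. The two patterns differ only in the \s* on either side of the equals sign.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` finds a match anywhere in `tag`.
bool findsSrc(const std::string& pattern, const std::string& tag)
{
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(tag, re);
}

// Without \s*, the '=' must touch both the attribute name and the quote.
const std::string strictPattern  = R"re(<img[^>]*src="([^"]*)"[^>]*>)re";
// With \s*, optional whitespace on either side of '=' is accepted.
const std::string lenientPattern = R"re(<img[^>]*src\s*=\s*"([^"]*)"[^>]*>)re";
```

With these definitions, a tag written as `<img src = "a.png">` is found by lenientPattern but not by strictPattern, while `<img src="a.png">` is found by both.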