Regex: How to extract an author's name and publication date using regular expressions?
I am trying to extract the author's name and the publication date from this HTML text. Here is what I have so far:
(authorName)=("………………")
However, this only works for this specific case, and I am looking for a general approach. Could I get some advice on how to handle this?
teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";
Tags: regex, python-3.x, beautifulsoup
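You can use this regex to capture the author's name in group 1:
authorName\s+=\s+"([^"]*)"
This regex matches the literal text authorName, then one or more whitespace characters, then =, then one or more whitespace characters, then a double quote ". It then captures everything up to the next double quote and stores it in group 1, which you can read with m.group(1).
See this Python code for how to capture the data from group 1: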
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
m = re.search(r'authorName\s+=\s+"([^"]*)"',s)
if m:
    print(m.group(1))
This prints only the author's name:
Heather Knight
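Edit: Thanks to Onyanbu for pointing out the publication date.
In the same way as for authorName, you can take the regex above, replace authorName with publicationDate, and use this regex to capture the publication date: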
publicationDate\s+=\s+"([^"]*)"
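Since the two patterns differ only in the variable name, the name can be passed in as a parameter. The following is a minimal sketch, not part of the original answer: the helper name extract_omni_var and the shortened sample string are made up for illustration. It also uses \s* instead of \s+, on the assumption that spacing around = can vary, as it does for omni_sourceSite ="sfgate" in the question's text.

```python
import re

def extract_omni_var(text, name):
    """Return the quoted value assigned to the variable `name`, or None if absent.

    Hypothetical helper for illustration only.
    """
    # re.escape guards against regex metacharacters in the name;
    # \s* tolerates missing spaces around '=' (e.g. omni_sourceSite ="sfgate").
    pattern = re.escape(name) + r'\s*=\s*"([^"]*)"'
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Shortened excerpt of the text from the question.
sample = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
          'var omni_sourceSite ="sfgate";'
          'var omni_authorName = "Heather Knight";'
          'var omni_authorTitle = "";')

print(extract_omni_var(sample, "authorName"))       # Heather Knight
print(extract_omni_var(sample, "publicationDate"))  # 2019-01-25T12:00:00+00:00
print(extract_omni_var(sample, "sourceSite"))       # sfgate
```

Returning None for a missing variable keeps the caller's check as simple as the if m: test used in the answer.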
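If you want to extract both values with a single regex, you can use the one below. Note that it only matches when publicationDate appears before authorName in the text, as it does here: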
(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"
Python code:
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
m = re.search(r'(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"',s)
if m:
    print('Publication Date:', m.group(1))
    print('Author Name:', m.group(2))
This prints:
Publication Date: 2019-01-25T12:00:00+00:00
Author Name: Heather Knight
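The single regex above depends on the two variables appearing in a fixed order. If a more general approach is wanted, one option is to collect every var omni_... = "..." assignment into a dictionary and look up whichever names are needed. This is only a sketch under assumptions: the function name parse_omni_vars is hypothetical, and it assumes every value is a double-quoted string with no embedded double quotes, as in the sample text.

```python
import re

def parse_omni_vars(text):
    """Map each omni_* variable name (prefix stripped) to its quoted value.

    Hypothetical helper for illustration only.
    """
    # findall returns one (name, value) tuple per assignment;
    # \s* allows for inconsistent spacing around '='.
    pairs = re.findall(r'var\s+omni_(\w+)\s*=\s*"([^"]*)"', text)
    return dict(pairs)

# Shortened excerpt of the text from the question.
sample = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
          'var omni_sourceSite ="sfgate";'
          'var omni_authorName = "Heather Knight";'
          'var omni_authorTitle = "";')

omni = parse_omni_vars(sample)
print('Publication Date:', omni.get('publicationDate'))
print('Author Name:', omni.get('authorName'))
```

Using dict.get means a variable that is missing from a page yields None rather than raising an error.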
Can you add some more samples of what you want to capture?
I want to capture this part: var omni_authorName = "Heather Knight"; but, if possible, only the author's name.
(?i)(?:authorname|publicationdate)\s=\s\"[^\"]+
@Onyanbu: Ah, I only half answered and forgot about publicationdate. Thanks for pointing that out. Let me update my answer to cover publicationdate.