Regex: How to extract the author's name and publication date using regex?

Tags: regex, python-3.x, beautifulsoup

I am trying to extract the author's name and the publication date from this HTML text.

Here is what I have so far: (authorName)=(“………………”)

However, this only works for this specific case, and I am looking for a general approach. Could I get some advice on how to tackle this?

teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";


You can use this regex to capture the author's name in group 1:

authorName\s+=\s+"([^"]*)"
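
This regex matches the literal text authorName, then one or more whitespace characters, then =, then one or more whitespace characters again, then a double quote. It then captures everything up to the next double quote and stores it in group 1, which you can read with m.group(1).

The following Python code shows how to read the captured data from group 1: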

import re

s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'

# Search for the assignment and capture the quoted value in group 1
m = re.search(r'authorName\s+=\s+"([^"]*)"', s)
if m:
    print(m.group(1))

This prints only the author's name:

Heather Knight
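
Edit: Thanks to Onyanbu for pointing out the publication date.

As with authorName, you can take the regex above, replace authorName with publicationDate, and use the resulting regex to capture publicationDate: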

publicationDate\s+=\s+"([^"]*)"
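
Since the two patterns differ only in the variable name, the substitution can be wrapped in a small helper. The sketch below is an illustration and not part of the original answer: `extract_omni_var` is a hypothetical name, and it uses `\s*` instead of `\s+` on the assumption that some assignments (such as `omni_sourceSite ="sfgate"` in the sample) have no whitespace on one side of the `=`.

```python
import re

def extract_omni_var(text, name):
    """Return the quoted value assigned to omni_<name>, or None if absent.

    Hypothetical helper for illustration; assumes values are double-quoted
    and contain no embedded double quotes.
    """
    # re.escape guards against regex metacharacters in the variable name.
    pattern = r'omni_' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Shortened version of the sample text from the question
s = ('var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
     'var omni_sourceSite ="sfgate";'
     'var omni_authorName = "Heather Knight";'
     'var omni_authorTitle = "";')

print(extract_omni_var(s, 'authorName'))
print(extract_omni_var(s, 'publicationDate'))
print(extract_omni_var(s, 'sourceSite'))
```

Returning None for a missing variable lets the caller tell "not present" apart from an empty value such as omni_authorTitle.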

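If you want to extract both values with a single regex, you can use the one below. Note that it expects publicationDate to appear before authorName in the text: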

(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"

Python code:

import re

s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'

# Group 1 holds the publication date, group 2 the author's name
m = re.search(r'(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"', s)
if m:
    print('Publication Date:', m.group(1))
    print('Author Name:', m.group(2))

This prints:

Publication Date: 2019-01-25T12:00:00+00:00
Author Name: Heather Knight
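
For the more general approach the question asks about, one option is to collect every omni_ assignment into a dictionary so that the order of the variables no longer matters. This is a minimal sketch under stated assumptions, not the answerer's method: `extract_omni_vars` is a hypothetical name, values are assumed to be double-quoted without embedded quotes, and `\s*` is used so that assignments like `omni_sourceSite ="sfgate"` also match.

```python
import re
from datetime import datetime

def extract_omni_vars(text):
    """Collect every omni_<name> = "<value>" pair into a dict (hypothetical helper)."""
    # findall returns a list of (name, value) tuples, one per assignment.
    pairs = re.findall(r'omni_(\w+)\s*=\s*"([^"]*)"', text)
    # If a name occurs more than once, the last occurrence wins.
    return dict(pairs)

# Shortened version of the sample text from the question
s = ('var omni_bizObjectId = "13560483";'
     'var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
     'var omni_sourceSite ="sfgate";'
     'var omni_authorName = "Heather Knight";')

fields = extract_omni_vars(s)
author = fields.get('authorName')
published = fields.get('publicationDate')

print('Author Name:', author)
if published:
    # datetime.fromisoformat (Python 3.7+) parses this ISO 8601 timestamp
    print('Publication Date:', datetime.fromisoformat(published))
```

Parsing the date into a datetime object is optional; it is shown only because the sample value is already in ISO 8601 form.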

Can you add a few more samples showing what you want to capture?
I want to capture this part: var omni_authorName = "Heather Knight"; but, if possible, only the author's name.
(?i)(?:authorname|publicationdate)\s=\s\"[^\"]+
@Onyanbu: Ah, I had only half-answered and forgot publicationdate. Thanks for pointing it out. Let me update my answer to cover publicationdate.