Getting data from complex HTML markup with Python BeautifulSoup

I have the following HTML data:

<div class="display-info">
    <div class="record-icon pubtype"><span class="pubtype-icon pt-academicJournal" title="Academic Journal"> </span>
        <p class="caption">Academic Journal</p>
    </div>By: Stein, Mark. <strong>Organization Studies</strong>. 2007, Vol. 28 Issue 8, p1223-1241. 19p. Abstract: While the literature on front-line service work utilizes a variety of productive images, I argue that these images do not capture certain of the more problematic experiences of front-line service employees. Drawing on words used by these workers themselves, and using concepts from psychoanalysis and its application to organizational dynamics, I therefore propose a new image, that of toxicity. I argue that — especially when under severe pressure from customers — front-line workers may have the unconscious fantasy that they have been polluted by toxic substances. The unconscious experience of the entry of toxic material is likely to result in further <strong>contagion</strong> of relationships such as those among employees and between employees and customers. This may also result in workers retaliating against customers by exacting revenge on them. A downward spiralling of relationships may follow, with the result that large parts of the work environment are experienced as toxic. The implications for theory are explored. In conclusion, I argue that the theme of toxicity helps us connect the employee-customer interface with a deep reservoir of primordial human experience that links the body with emotions. [ABSTRACT FROM AUTHOR] DOI: 10.1177/0170840607079527. (<cite>AN: 26198405</cite>)
    <p class="subjectResults"><strong>Subjects:
    </strong>Industrial relations; Personnel management; Customer relations; Corporate image; Public relations; Consumer behavior; Sales personnel; Administration of Human Resource Programs (except Education, Public Health, and Veterans' Affairs Programs); Human Resources Consulting Services; Public Relations Agencies; Psychoanalysis; Social interaction</p><span class="record-additional"><span class="item add-to-folder"><a class="folder-toggle item-not-in-folder" data-folder='{"db":"bth","uiTerm":"26198405","uiTag":"AN","ebookFormat":"false","abookFormat":"false","title":"Toxicity and the Unconscious Experience of the Body at the Employee--Customer Interface. ","resultID":"50","doid":"","segid":""}' data-isaddtofolder="true" data-itemid="50" href="#" id="add_50" name="addToFolder" title="To print, e-mail, or save multiple items">Add to folder</a> <a class="folder-toggle item-in-folder" data-folder='{"db":"bth","uiTerm":"26198405","uiTag":"AN","ebookFormat":"false","abookFormat":"false","title":"Toxicity and the Unconscious Experience of the Body at the Employee--Customer Interface. ","resultID":"50","doid":"","segid":""}' data-isaddtofolder="false" data-itemid="50" href="#" id="added_50" style="display: none;" title="Remove result from folder">Remove from folder</a></span><span class="result-list-cite-ref-label"><a data-title="Cited References" href="javascript:__doLinkPostBack('','sl~~ref||su~~50','_top');" id="references50" title="Cited References">Cited References: (92) </a></span><span class="result-list-cite-link"><a data-title="Times Cited in this Database" href="javascript:__doLinkPostBack('','sl~~cit||su~~50','_top');" id="citations50" title="Times Cited in this Database">Times Cited in this Database: (20) </a></span> </span>
    <div class="record-formats-wrapper externalLinks"><span><span class="custom-link"><a class="ils-link" href="/ehost/SmartLink/OpenIlsLink?sid=42487fcc-c655-469f-b8ed-2802260b3983@sessionmgr102&amp;vid=15&amp;sl=smartlink&amp;st=ilslink_new&amp;sv=sdbn%253Dbth%2526pbt%253DAcademic%2520Journal%2526issn%253D01708406%2526ttl%253DOrganization%252520Studies%2526stp%253DC%2526asi%253DY%2526ldc%253DCheck%252520full%252520text%252520availability%2526lna%253DFull%252520Text%252520Finder%252520%25252D%252520INSEAD%2526lca%253DfullText%2526lo%255Fan%253D26198405&amp;su=http%3A%2F%2Fresolver%2Eebscohost%2Ecom%2Fopenurl%3Fcustid%3Ds8362180%26group%3Dmain%26authtype%3Dip%2Cuid%26sid%3DEBSCO%3Abth%26genre%3Darticle%26issn%3D01708406%26ISBN%3D%26volume%3D28%26issue%3D8%26date%3D20070801%26spage%3D1223%26pages%3D1223%2D1241%26title%3DOrganization%20Studies%26atitle%3DToxicity%2520and%2520the%2520Unconscious%2520Experience%2520of%2520the%2520Body%2520at%2520the%2520Employee%2D%2DCustomer%2520Interface%2E%26aulast%3DStein%252C%2520Mark%26id%3DDOI%3A10%2E1177%2F0170840607079527" id="linkILSLink50_1" onblur="self.status='';return true" onfocus="self.status='check full text availability.';return true" onmouseout="self.status='';return true" onmouseover="self.status='check full text availability.';return true" target="_new" title="check full text availability."><img align="middle" alt="check full text availability." border="0" class="icon-image" data-defer-image="https://s3.amazonaws.com/libapps/customers/2023/images/logo-INSEAD_blanc-sur-vert_250.jpg" id="imgILSLink50_1" src="https://if.ebsco-content.com/interfacefiles/17.232.0.2749/blank.gif"/>Check full text availability</a></span></span>
    </div>
</div>

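For this task, it is best to use the re module together with bs4 (BeautifulSoup).

If the variable txt contains the HTML text from the question, then this script: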

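import re
from textwrap import wrap

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# Flatten the record to plain text, one line per text node
txt = soup.select_one('.display-info').get_text(strip=True, separator='\n')

# The author is the rest of the line that starts with "By:"
author = re.findall(r'By:.*', txt)[0]
# The abstract runs up to (but not including) the "[ABSTRACT FROM AUTHOR]" marker
abstract = re.findall(r'Abstract:.*?(?=\[ABSTRACT FROM AUTHOR\])', txt, flags=re.S)[0]

print(author)
# wrap() is only used here to pretty-print the long abstract
print(*wrap(abstract.replace('\n', ' ')), sep='\n')

# or, on Python 2, just:
# print author
# print abstract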
Prints:

By: Stein, Mark.
Abstract: While the literature on front-line service work utilizes a
variety of productive images, I argue that these images do not capture
certain of the more problematic experiences of front-line service
employees. Drawing on words used by these workers themselves, and
using concepts from psychoanalysis and its application to
organizational dynamics, I therefore propose a new image, that of
toxicity. I argue that — especially when under severe pressure from
customers — front-line workers may have the unconscious fantasy that
they have been polluted by toxic substances. The unconscious
experience of the entry of toxic material is likely to result in
further contagion of relationships such as those among employees and
between employees and customers. This may also result in workers
retaliating against customers by exacting revenge on them. A downward
spiralling of relationships may follow, with the result that large
parts of the work environment are experienced as toxic. The
implications for theory are explored. In conclusion, I argue that the
theme of toxicity helps us connect the employee-customer interface
with a deep reservoir of primordial human experience that links the
body with emotions.
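
The script above indexes re.findall(...)[0] directly, which raises IndexError for a record that has no "By:" line or no abstract. A minimal sketch of a more defensive variant follows, assuming the input is the plain text that get_text(strip=True, separator='\n') produces. The function name extract_fields and the shortened sample string are made up for illustration and are not part of the original answer.

```python
import re

# Author: the rest of the line that starts with "By:"
AUTHOR_RE = re.compile(r"^By:\s*(?P<author>.+)$", re.MULTILINE)
# Abstract: everything between "Abstract:" and the "[ABSTRACT FROM AUTHOR]" marker
ABSTRACT_RE = re.compile(
    r"Abstract:\s*(?P<abstract>.*?)\s*\[ABSTRACT FROM AUTHOR\]", re.DOTALL
)


def extract_fields(text):
    """Return author and abstract, using None for any field that is missing."""
    author = AUTHOR_RE.search(text)
    abstract = ABSTRACT_RE.search(text)
    return {
        "author": author.group("author").strip() if author else None,
        # Collapse the line breaks left behind by nested tags such as <strong>
        "abstract": " ".join(abstract.group("abstract").split()) if abstract else None,
    }


# Shortened, hypothetical stand-in for the flattened record text
sample = (
    "Academic Journal\n"
    "By: Stein, Mark.\n"
    "Organization Studies\n"
    ". 2007, Vol. 28 Issue 8, p1223-1241. 19p. Abstract: While the literature\n"
    "contagion\n"
    "of relationships. [ABSTRACT FROM AUTHOR] DOI: 10.1177/0170840607079527. ("
)

fields = extract_fields(sample)
print(fields["author"])
print(fields["abstract"])
```

Unlike the original script, this sketch drops the "By:" and "Abstract:" labels from the returned values, which may be more convenient if the fields are stored rather than printed.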


Use the following regular expressions:

from bs4 import BeautifulSoup
import re
html='''<div class="display-info">
    <div class="record-icon pubtype"><span class="pubtype-icon pt-academicJournal" title="Academic Journal"> </span>
        <p class="caption">Academic Journal</p>
    </div>By: Stein, Mark. <strong>Organization Studies</strong>. 2007, Vol. 28 Issue 8, p1223-1241. 19p. Abstract: While the literature on front-line service work utilizes a variety of productive images, I argue that these images do not capture certain of the more problematic experiences of front-line service employees. Drawing on words used by these workers themselves, and using concepts from psychoanalysis and its application to organizational dynamics, I therefore propose a new image, that of toxicity. I argue that — especially when under severe pressure from customers — front-line workers may have the unconscious fantasy that they have been polluted by toxic substances. The unconscious experience of the entry of toxic material is likely to result in further <strong>contagion</strong> of relationships such as those among employees and between employees and customers. This may also result in workers retaliating against customers by exacting revenge on them. A downward spiralling of relationships may follow, with the result that large parts of the work environment are experienced as toxic. The implications for theory are explored. In conclusion, I argue that the theme of toxicity helps us connect the employee-customer interface with a deep reservoir of primordial human experience that links the body with emotions. [ABSTRACT FROM AUTHOR] DOI: 10.1177/0170840607079527. (<cite>AN: 26198405</cite>)
    <p class="subjectResults"><strong>Subjects:
    </strong>Industrial relations; Personnel management; Customer relations; Corporate image; Public relations; Consumer behavior; Sales personnel; Administration of Human Resource Programs (except Education, Public Health, and Veterans' Affairs Programs); Human Resources Consulting Services; Public Relations Agencies; Psychoanalysis; Social interaction</p><span class="record-additional"><span class="item add-to-folder"><a class="folder-toggle item-not-in-folder" data-folder='{"db":"bth","uiTerm":"26198405","uiTag":"AN","ebookFormat":"false","abookFormat":"false","title":"Toxicity and the Unconscious Experience of the Body at the Employee--Customer Interface. ","resultID":"50","doid":"","segid":""}' data-isaddtofolder="true" data-itemid="50" href="#" id="add_50" name="addToFolder" title="To print, e-mail, or save multiple items">Add to folder</a> <a class="folder-toggle item-in-folder" data-folder='{"db":"bth","uiTerm":"26198405","uiTag":"AN","ebookFormat":"false","abookFormat":"false","title":"Toxicity and the Unconscious Experience of the Body at the Employee--Customer Interface. ","resultID":"50","doid":"","segid":""}' data-isaddtofolder="false" data-itemid="50" href="#" id="added_50" style="display: none;" title="Remove result from folder">Remove from folder</a></span><span class="result-list-cite-ref-label"><a data-title="Cited References" href="javascript:__doLinkPostBack('','sl~~ref||su~~50','_top');" id="references50" title="Cited References">Cited References: (92) </a></span><span class="result-list-cite-link"><a data-title="Times Cited in this Database" href="javascript:__doLinkPostBack('','sl~~cit||su~~50','_top');" id="citations50" title="Times Cited in this Database">Times Cited in this Database: (20) </a></span> </span>
    <div class="record-formats-wrapper externalLinks"><span><span class="custom-link"><a class="ils-link" href="/ehost/SmartLink/OpenIlsLink?sid=42487fcc-c655-469f-b8ed-2802260b3983@sessionmgr102&amp;vid=15&amp;sl=smartlink&amp;st=ilslink_new&amp;sv=sdbn%253Dbth%2526pbt%253DAcademic%2520Journal%2526issn%253D01708406%2526ttl%253DOrganization%252520Studies%2526stp%253DC%2526asi%253DY%2526ldc%253DCheck%252520full%252520text%252520availability%2526lna%253DFull%252520Text%252520Finder%252520%25252D%252520INSEAD%2526lca%253DfullText%2526lo%255Fan%253D26198405&amp;su=http%3A%2F%2Fresolver%2Eebscohost%2Ecom%2Fopenurl%3Fcustid%3Ds8362180%26group%3Dmain%26authtype%3Dip%2Cuid%26sid%3DEBSCO%3Abth%26genre%3Darticle%26issn%3D01708406%26ISBN%3D%26volume%3D28%26issue%3D8%26date%3D20070801%26spage%3D1223%26pages%3D1223%2D1241%26title%3DOrganization%20Studies%26atitle%3DToxicity%2520and%2520the%2520Unconscious%2520Experience%2520of%2520the%2520Body%2520at%2520the%2520Employee%2D%2DCustomer%2520Interface%2E%26aulast%3DStein%252C%2520Mark%26id%3DDOI%3A10%2E1177%2F0170840607079527" id="linkILSLink50_1" onblur="self.status='';return true" onfocus="self.status='check full text availability.';return true" onmouseout="self.status='';return true" onmouseover="self.status='check full text availability.';return true" target="_new" title="check full text availability."><img align="middle" alt="check full text availability." border="0" class="icon-image" data-defer-image="https://s3.amazonaws.com/libapps/customers/2023/images/logo-INSEAD_blanc-sur-vert_250.jpg" id="imgILSLink50_1" src="https://if.ebsco-content.com/interfacefiles/17.232.0.2749/blank.gif"/>Check full text availability</a></span></span>
    </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
divtext = soup.find('div', class_='display-info')
# Raw strings keep \s and \[ from being treated as string escapes
print(re.findall(r"By:?\s.*Mark\.", divtext.text)[0])
print(re.findall(r"Abstract:?\s.*\[", divtext.text)[0][:-1])
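
One caveat about the abstract pattern above: .*\[ is greedy, so it runs to the last "[" on the line rather than the first. That works for this record, but it may overshoot if the text contains other brackets. The sketch below compares it with a lazy match anchored on the "[ABSTRACT FROM AUTHOR]" marker. The text string is an invented example, not data from the question.

```python
import re

# Invented example text with extra brackets before and after the marker
text = "Abstract: Sample text [sic] continues. [ABSTRACT FROM AUTHOR] DOI: ... [note]"

# Greedy: .* extends to the last "[" in the line, then [:-1] drops that bracket
greedy = re.findall(r"Abstract:?\s.*\[", text)[0][:-1]

# Lazy: .*? stops at the first position where the literal marker follows
lazy = re.search(
    r"Abstract:\s*(.*?)\s*\[ABSTRACT FROM AUTHOR\]", text, flags=re.S
).group(1)

print(repr(greedy))
print(repr(lazy))
```

Here the greedy result still contains the marker and the DOI fragment, while the lazy result is just the abstract text.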

Comments:

What are you trying to find? On which element? Could you try editing the HTML to make it more readable? I tried to fix it for you, but it looks very difficult.

The HTML looks unfortunate, but I can add a link and specify the HTML tags.

I edited the HTML.

This would be a bad place to consider using regular expressions to extract the data.

Thanks for the edit. My bad, you did say what you needed.

@edyvedy13 Syntax error? Are you using Python 2? You can just use print author and print abstract without wrap; I only used it here for pretty-printing.

No, but I think the problem was (abstract.replace('\n', ' '), sep='\n'). It works. In the end I will save them to CSV; do you think that will cause a problem?

@edyvedy13 I don't think so. Just use a proper delimiter and quote character.
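
The last comment's advice about delimiters and quote characters can be sketched with the standard csv module, which quotes fields containing commas, quotes, or line breaks. This is only an illustration under assumptions: the write_records helper, the column names, and the sample record are invented here, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_records(records, fileobj):
    """Write (author, abstract) pairs as CSV with a header row."""
    writer = csv.writer(
        fileobj, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
    )
    writer.writerow(["author", "abstract"])
    writer.writerows(records)


# Invented record whose abstract contains commas, quotes, and a line break
records = [
    ("By: Stein, Mark.", 'Abstract: text with, commas, "quotes"\nand a line break')
]

buffer = io.StringIO()
write_records(records, buffer)

# Read the data back to check that the fields survive the round trip
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(rows[1] == list(records[0]))
```

When writing to a real file instead of a buffer, open it with newline="" as the csv documentation recommends, for example open("records.csv", "w", newline="", encoding="utf-8"), where the file name is a placeholder.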