
Python: empty tags returned for all filenames in a BeautifulSoup document


I want to parse a large .txt file and pull out bits and pieces of data based on their parent tags. The problem is that, for example, class="ro" matches hundreds of different text and number fragments, most of which are useless.

import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')

# parse the full submission text
soup = BeautifulSoup(data.text, 'html.parser')

# print every table row carrying one of the report row classes
for tr in soup.find_all('tr', {'class': ['rou', 'ro', 're', 'reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)
As I said before, this grabs all of those tags, but 95% of what comes back is useless. I'd like to filter with a for loop or something similar based on the filename... "for every file whose filename is R2, R3, etc."... grab all the tags with class "ro", "rou", and so on. Everything I've tried so far returns empty tags... Can anyone help? Thanks in advance.

<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
**<FILENAME>R2.htm** <------- for everything with this filename
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text">&#160;<span></span> <------ return this tag
</td>
.....removed for brevity
</tr>

Not sure how you want the output, but with bs4 4.7.1+ you can use the :contains pseudo-class to filter on the filename tags:

import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
soup = BeautifulSoup(data.text, 'lxml')

filenames = ['R2.htm', 'R3.htm']

for filename in filenames:
    print('-----------------------------')
    i = 1
    # match each <filename> tag whose text contains the target name
    for item in soup.select('filename:contains("' + filename + '")'):
        print(filename, ' ', 'result' + str(i))
        # <FILENAME> is never closed in the EDGAR file, so the parser nests
        # the rest of that document section inside it, which is why the
        # table rows are reachable from here
        for tr in item.find_all('tr', {'class': ['rou', 'ro', 're', 'reu']}):
            db = [td.text.strip() for td in tr.find_all('td')]
            print(db)
        i += 1
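
One caveat: newer releases of soupsieve (the selector engine bundled with bs4) deprecate the bare :contains alias in favor of :-soup-contains. A minimal sketch of the same filtering under that newer syntax, assuming the same filename list as above; note that EDGAR may also reject requests without a declared User-Agent header these days, so the headers value here is a placeholder you would fill in:

import requests
from bs4 import BeautifulSoup

# EDGAR may require a declared User-Agent; this value is a placeholder
headers = {'User-Agent': 'your-name your-email@example.com'}
data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt', headers=headers)
soup = BeautifulSoup(data.text, 'lxml')

# same filtering as above with the newer :-soup-contains pseudo-class
for filename in ['R2.htm', 'R3.htm']:
    for item in soup.select(f'filename:-soup-contains("{filename}")'):
        for tr in item.find_all('tr', {'class': ['rou', 'ro', 're', 'reu']}):
            print(filename, [td.text.strip() for td in tr.find_all('td')])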