
Python: Scraping a table / getting a specific column when the first column is not always the same

I am trying to extract the second column of the table below, i.e. the names of the muscles:

Here is my code so far:

    import requests
    import time
    from bs4 import BeautifulSoup as soup

    url = "http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html"
    links = []
    time.sleep(1)  # be polite to the server
    print(url)
    page = requests.get(url)
    text = soup(page.text, 'html.parser')
    table = text.select('table')[1]   # the second table on the page
    rows = table.find_all('tr')[2:]   # skip the header rows

    names = []
    for row in rows:
        # take the second cell of each row and strip newlines
        names.append(row.find_all('td')[1].text.replace('\n', ''))

    print(names)
The problem is that sometimes this gives me the second column of a row and sometimes the third, depending on whether the first column extends over two rows. That makes sense, but I don't know how to solve it.

Thanks for any ideas!

Try this:

    import pandas as pd

    url = 'http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html'

    tables = pd.read_html(url)  # parse every <table> on the page into a list of DataFrames
    print(tables[1][1])         # second table, second column

The output is the column titled "Musculus - muscle (anatomical term)".
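If you want a plain Python list rather than a pandas Series (assuming the table and column indices from the snippet above are right for this page; note that pandas.read_html also needs an HTML parser such as lxml installed), something like this should work:

    names = tables[1][1].tolist()  # convert the muscle-name column to a list
    print(names)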

You could take advantage of the fact that the cell in the second column always has a specific width: width="15%". You could try selecting the cell with this width in each row (note that the last column sometimes has the same attribute, so you should take only the first match).
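A minimal sketch of that idea, reusing the requests/BeautifulSoup setup from the question (the width="15%" assumption comes from this answer and is not otherwise verified):

    import requests
    from bs4 import BeautifulSoup as soup

    page = requests.get("http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html")
    text = soup(page.text, 'html.parser')
    table = text.select('table')[1]

    names = []
    for row in table.find_all('tr')[2:]:
        # cells in the name column are assumed to carry width="15%";
        # the last column can match too, so keep only the first hit
        cells = row.select('td[width="15%"]')
        if cells:
            names.append(cells[0].get_text(strip=True))

    print(names)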

You can combine an attribute selector with a type selector to target the anchor (a) tag elements that have a name attribute. Lighter than pandas, especially if you only want those muscle names:

    from bs4 import BeautifulSoup as bs
    import requests

    r = requests.get('http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html')
    soup = bs(r.content, 'lxml')
    # select <a> tags that carry a name attribute and read that attribute
    muscles = [a['name'] for a in soup.select('a[name]')]
    print(muscles)

Wow, that was really easy. I haven't worked with pandas yet, but it looks like I really should look into it! Thanks @JuRoSch - Yes, I love pandas! Glad it worked for you.