Python web scraping of an unstructured HTML table

html python-3.x pandas web-scraping

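I am trying to extract some information from a table on a web page, but the table is unstructured: the rows hold the headers and the columns hold the content, as shown below. (I apologize for not disclosing the web page.)

The table basically looks like this:

However, I am somewhat stuck on how to do this in Python. I can't seem to wrap my head around getting the data. The result I want is as follows:

Any help would be greatly appreciated. Thank you very much.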

You can use beautifulsoup to parse the HTML. For example:

import pandas as pd
from bs4 import BeautifulSoup


txt = '''<table class="table-detail">
            <tbody>
                <tr>
                    <td colspan="4" class="noborder">General Information
                    </td>
                </tr>
                <tr>
                    <th>Full name</th>
                    <td>
                        James Smith
                    </td>
                    <th>Year of birth</th>
                    <td>1992</td>
                </tr>
                <tr>
                    <th>Gender</th>
                    <td>Male</td>
                </tr>
                <tr>
                    <th>Place of birth</th>
                    <td>TTexas, USA</td>
                    <td>&nbsp;</td>
                    <td>&nbsp;</td>
                </tr>
                <tr>
                    <th>Address</th>
                    <td>Texas, USA</td>
                    <td>&nbsp;</td>
                    <td></td>
                </tr>'''


soup = BeautifulSoup(txt, 'html.parser')

row = {}
for h in soup.select('th:has(+td)'):
    row[h.text] = h.find_next('td').get_text(strip=True)

df = pd.DataFrame([row])
print(df)
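
The selector `th:has(+td)` in the example keeps only the `th` cells whose next sibling element is a `td`. As a rough sketch of the same pairing without the CSS selector, and assuming the page contains several such detail tables, each table could be turned into one record. The helper name `table_to_record`, the second table, and the name Jane Doe are made up for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two detail tables in the style of the question
html = """
<table class="table-detail">
  <tr><td colspan="4" class="noborder">General Information</td></tr>
  <tr><th>Full name</th><td> James Smith </td><th>Year of birth</th><td>1992</td></tr>
  <tr><th>Gender</th><td>Male</td></tr>
</table>
<table class="table-detail">
  <tr><th>Full name</th><td>Jane Doe</td><th>Year of birth</th><td>1990</td></tr>
  <tr><th>Gender</th><td>Female</td></tr>
</table>
"""


def table_to_record(table):
    """Pair every <th> with the <td> that directly follows it (hypothetical helper)."""
    record = {}
    for th in table.find_all("th"):
        # True matches any tag, so whitespace text nodes between cells are skipped
        cell = th.find_next_sibling(True)
        if cell is not None and cell.name == "td":
            record[th.get_text(strip=True)] = cell.get_text(strip=True)
    return record


soup = BeautifulSoup(html, "html.parser")
records = [table_to_record(t) for t in soup.select("table.table-detail")]
print(records)
```

Passing `records` to `pd.DataFrame` would then give one row per table, with the header cells as column names.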

I spent hours trying to figure this out. Your code works like a charm. Thank you so much for your help!

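import io

import pandas as pd
import requests

# Placeholder URL: the real page was not disclosed. requests needs the scheme.
url = "https://example.com"

r = requests.get(url)
# read_html returns a list of DataFrames, one per <table> found in the page
df_list = pd.read_html(io.StringIO(r.text))
df = df_list[0]
print(df.head())

# utf-8-sig writes a BOM so that Excel detects the encoding
df.to_csv('myfile.csv', encoding='utf-8-sig')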
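
When the header/value pairs are already available as a two-column table, a transpose may be enough to reach the one-row layout asked for in the question. The following is only a sketch under that assumption: the column names `field` and `value` are invented here, and the pairs are typed in by hand rather than scraped.

```python
import pandas as pd

# Hand-written header/value pairs standing in for scraped data (assumption)
pairs = [
    ("Full name", "James Smith"),
    ("Year of birth", "1992"),
    ("Gender", "Male"),
]

# Long layout: one row per header, as in the source table
long_df = pd.DataFrame(pairs, columns=["field", "value"])

# Wide layout: headers become columns, values become a single row
wide_df = long_df.set_index("field").T.reset_index(drop=True)
wide_df.columns.name = None  # drop the leftover "field" label on the column axis

print(wide_df)
```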

The BeautifulSoup example above prints:

     Full name Year of birth Gender Place of birth     Address
0  James Smith          1992   Male    TTexas, USA  Texas, USA