Python: reshaping read_html results into a simpler structure
Tags: python, pandas, python-3.5, bs4
I am hoping someone can suggest how to create a pandas DataFrame that contains only the text of the second column, without the first two rows or the left-hand column. The solution needs to handle multiple similar tables.
I had thought that
pd.read_html(LOTable.prettify(),skiprows=2,flavor='bs4')
which creates a list of DataFrames from the HTML (skipping 2 rows), was the way to go, but the resulting data structure is too confusing for a newbie to understand or to reshape into something simpler.
Does anyone have a way to work with the resulting structure, or can recommend another way to refine the data, so that I end up with a single column containing only the text I need?
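Sample table
Learning Outcomes
On successful completion of this module the learner will be able to:
LO1
Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
LO2
Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
LO3
Understand the various formats in which information in relation to transactions or events is recorded and classified.
LO4
Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the posting of recorded information to the T accounts in the Nominal Ledger.
LO5
Prepare and present the financial statements of a Sole Trader in prescribed format from a Trial Balance accompanies by notes with additional information.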
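The first option uses iloc. This should do it, by having iloc drop the first column. Note that read_html returns a list of DataFrames, so pick the table with [0] first:
pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1:]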
Explanation
...iloc[:, 1:]
#       ^  ^
#       |   \
#  says to   says to take columns
#  take all  starting with one and on
#  rows
import pandas as pd
htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th colspan="2">
Learning Outcomes
</th>
</tr>
<tr>
<td class="info" colspan="2">
On successful completion of this module the learner will be able to:
</td>
</tr>
<tr>
<td style="width:10%;">
LO1
</td>
<td>
Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
</td>
</tr>
<tr>
<td style="width:10%;">
LO2
</td>
<td>
Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
</td>
</tr>
<tr>
<td style="width:10%;">
LO3
</td>
<td>
Understand the various formats in which information in relation to transactions or events is recorded and classified.
</td>
</tr>
<tr>
<td style="width:10%;">
LO4
</td>
<td>
Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the posting of recorded information to the T accounts in the Nominal Ledger.
</td>
</tr>
<tr>
<td style="width:10%;">
LO5
</td>
<td>
Prepare and present the financial statements of a Sole Trader in prescribed format from a Trial Balance accompanies by notes with additional information.
</td>
</tr>
</table> """
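# read_html returns a list of DataFrames: [0] picks the first table, iloc[:, 1:] drops the label column.
# Newer pandas versions deprecate passing a literal HTML string; there, wrap it as io.StringIO(htm).
pd.read_html(htm, skiprows=2, flavor='bs4')[0].iloc[:, 1:]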
You can also take just a single column:
pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1]
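To make the difference between the two forms concrete, here is a small sketch. It assumes a hand-built DataFrame (the name df and its contents are hypothetical) standing in for one parsed table, so it runs without any HTML parser installed. The slice 1: keeps a one-column DataFrame, while the plain integer 1 returns a Series.

```python
import pandas as pd

# Hypothetical stand-in for one parsed table: column 0 holds the labels
# (LO1, LO2, ...), column 1 holds the outcome text.
df = pd.DataFrame([
    ["LO1", "First outcome text"],
    ["LO2", "Second outcome text"],
])

as_frame = df.iloc[:, 1:]   # slice: all rows, columns from position 1 onward
as_series = df.iloc[:, 1]   # integer: all rows, only the column at position 1

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

Which one is more convenient depends on what you do next: a DataFrame concatenates cleanly with other one-column frames, while a Series is handy if you only want the list of strings.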
The working code I ran is the htm example shown above, after the explanation.
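Can you explain what your improvement means? What if the code needs to go into a loop, so that at the end you have a single column built from 5, 34 or 106 similar separate tables, which all start the same way but can have fewer or more rows?
Actually, the last part of my question was about handling the loop, but declaring an empty DataFrame
InfoDF = pd.DataFrame()
before the loop, and then concatenating the DataFrame your code produces onto it, gives the result I need:
InfoDF = pd.concat([InfoDF, pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1:]])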
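For the loop case, a possible variation on the approach above is to collect the pieces in a list and call pd.concat once at the end, since concatenating inside the loop copies the accumulated data on every pass. This is only a sketch: the helper name collect_second_column and the two sample tables are hypothetical, and it assumes each table has already been parsed into a DataFrame (for example with pd.read_html(...)[0]).

```python
import pandas as pd

def collect_second_column(frames):
    """Stack everything from column position 1 onward of each frame into one DataFrame."""
    pieces = [frame.iloc[:, 1:] for frame in frames]
    if not pieces:
        # pd.concat raises ValueError on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all tables.
    return pd.concat(pieces, ignore_index=True)

# Hypothetical parsed tables with different numbers of rows.
table_a = pd.DataFrame([["LO1", "First outcome"], ["LO2", "Second outcome"]])
table_b = pd.DataFrame([["LO1", "Another outcome"]])

info_df = collect_second_column([table_a, table_b])
print(info_df)
```

In the real loop you would append pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0] for each table to the list instead of the hand-built frames.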