Python: How do I loop when the loop size is not known in advance?
Tags: python, pandas, loops

I have a large DataFrame containing chapter numbers, titles, subtitles and text, all stored as strings. I want to filter out, in order, specific pieces of text that sit between the titles and subtitles, but the number of subtitles per chapter is not fixed, so I do not know the bounds of the loop in advance.

I am able to find the index of every title and subtitle, and to locate and extract the specific text I need, but only when I type in each subtitle string by hand.
import pandas as pd
# Example of the contents of the file
series = (["1.1.1.1", "lots of useless text", "more useless text", "I want this text", "1.1.1.2","I want this text","Not this text","1.1.1.3","1.1.2.1","some lines of text","1.2.1.1","Interesting text","1.2.1.2" ])
# These two operations are to get the same structure as I have in my imported file
df2 = pd.DataFrame(series)
df2 = df2.iloc[:,0]
# Start of finding the first chapter
title = 1
subtitle = 1
# Change to string to find the location of the string
string_title = "1."+ str(title)+"."+str(subtitle)
process_loc = df2[df2.str.contains(string_title, na=False)]
idx = process_loc.index
#Locate text I want
true_text = df2.str[0] == "I"
# Locate text for the subtitle.
text_range = df2.loc[idx[0]:idx[2]]
text_list = text_range[true_text == True]
#Loop over all subtitles to get all the subtitles and text I want in 1 DataFrame
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
new_df2 = pd.concat([process_loc.to_frame(name='Ordered'), text_list.to_frame(name='Ordered')])
Desired output:
- 1.1.1.1
- I want this text
- 1.1.1.2
- I want this text
- 1.1.1.3
- 1.1.2.1
- 1.2.1.1
- Interesting text
- 1.2.1.2
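You can use a regular expression to select the rows that start with "I", or the rows that start with a digit followed by a dot:
df2[df2.str.match(r'^I.*|^\d\..*')]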
Output:
0 1.1.1.1
3 I want this text
4 1.1.1.2
5 I want this text
7 1.1.1.3
8 1.1.2.1
10 1.2.1.1
11 Interesting text
12 1.2.1.2
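For reference, a self-contained version of the filter above might look like the sketch below. It assumes the sample data from the question; the variable names s, pattern and filtered are placeholders of mine, and the shortened pattern relies on str.match anchoring at the start of each string.

```python
import pandas as pd

# Sample data copied from the question
s = pd.Series([
    "1.1.1.1", "lots of useless text", "more useless text", "I want this text",
    "1.1.1.2", "I want this text", "Not this text", "1.1.1.3", "1.1.2.1",
    "some lines of text", "1.2.1.1", "Interesting text", "1.2.1.2",
])

# Keep rows that start with "I" or that begin with a digit followed by a dot.
# str.match only matches at the start of each string, so no "^" is required.
pattern = r"I|\d\."
filtered = s[s.str.match(pattern, na=False)]

print(filtered)
```

Because the whole Series is filtered in one step, the number of subtitles per chapter never has to be known.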
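Have you tried using regular expressions? If you know what you are looking for, you can use the pandas loc function together with a regex to collect what you want.

Thanks! Had I known this, I could have saved myself a few hours right from the start.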
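The loc-plus-regex idea mentioned above could look roughly like this. It is only a sketch under assumptions taken from the sample data, not from the real file: every heading is four dot-separated numbers, the wanted text always starts with a capital "I", and the first row is a heading. The names is_heading, wanted, ordered and sections are hypothetical.

```python
import pandas as pd

# Sample data copied from the question
s = pd.Series([
    "1.1.1.1", "lots of useless text", "more useless text", "I want this text",
    "1.1.1.2", "I want this text", "Not this text", "1.1.1.3", "1.1.2.1",
    "some lines of text", "1.2.1.1", "Interesting text", "1.2.1.2",
])

# Assumption: a heading is exactly four dot-separated numbers, e.g. "1.1.1.2"
is_heading = s.str.fullmatch(r"\d+(?:\.\d+){3}", na=False)
# Assumption: the wanted text always starts with a capital "I"
wanted = s.str.startswith("I", na=False)

# loc with a boolean mask keeps headings and wanted text in their original order
ordered = s.loc[is_heading | wanted]

# A running count of headings labels each section, so iterating over the
# groups needs no knowledge of how many subtitles a chapter has.
sections = {}
for _, block in s.groupby(is_heading.cumsum()):
    heading = block.iloc[0]  # first row of each group is its heading
    sections[heading] = block[wanted.loc[block.index]].tolist()

print(ordered)
print(sections)
```

The grouping step is only needed if the text has to be processed per section; for a flat, ordered result the single loc call is enough.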