Python: how to loop when the size of the loop is not known in advance?


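I have a large DataFrame containing chapter numbers, titles, subtitles and text, all stored as strings. I want to extract, in order, specific pieces of text that sit between the titles and subtitles, but the number of subtitles per chapter is not fixed, so I don't know the bounds of the loop.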

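I can find the indices of all titles and subtitles, and I can locate and extract the specific text I need, but only by typing in each subtitle string by hand.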

import pandas as pd

# Example of the contents of the file
series = (["1.1.1.1", "lots of useless text", "more useless text", "I want this text", "1.1.1.2","I want this text","Not this text","1.1.1.3","1.1.2.1","some lines of text","1.2.1.1","Interesting text","1.2.1.2" ])

# These two operations are to get the same structure as I have in my imported file
df2 = pd.DataFrame(series)
df2 = df2.iloc[:,0]

# Start of finding the first chapter
title = 1
subtitle = 1

# Change to string to find the location of the string
string_title = "1."+ str(title)+"."+str(subtitle)
process_loc = df2[df2.str.contains(string_title, na=False)]
idx = process_loc.index

#Locate text I want
true_text   = df2.str[0] == "I"
# Locate text for the subtitle.
text_range  = df2.loc[idx[0]:idx[2]]
text_list   = text_range[true_text == True]

# Combine the subtitles and the text I want into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
new_df2 = pd.concat([process_loc.to_frame(), text_list.to_frame()])
The output I want:

  • 1.1.1.1
  • I want this text
  • 1.1.1.2
  • I want this text
  • 1.1.1.3
  • 1.1.2.1
  • 1.2.1.1
  • Interesting text
  • 1.2.1.2
Is it possible to do this in a loop, or do I have to look up all the subtitle numbers by hand?

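You can use str.match to find the rows that match your criteria, for example all rows that start with I, or rows where a digit is followed by a dot: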

df2[df2.str.match(r'^I.*|^\d\..*')]
Output:

0              1.1.1.1
3     I want this text
4              1.1.1.2
5     I want this text
7              1.1.1.3
8              1.1.2.1
10             1.2.1.1
11    Interesting text
12             1.2.1.2
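
For reference, the one-liner above can be run end to end on the sample data from the question. This is a minimal sketch, assuming the data is held in a plain Series called df2 as in the question; the name pattern is introduced here for illustration only. Because the filter is applied to the whole Series at once, no loop bounds are needed, however many subtitles a chapter has.

```python
import pandas as pd

# Sample data from the question.
series = ["1.1.1.1", "lots of useless text", "more useless text",
          "I want this text", "1.1.1.2", "I want this text", "Not this text",
          "1.1.1.3", "1.1.2.1", "some lines of text", "1.2.1.1",
          "Interesting text", "1.2.1.2"]
df2 = pd.Series(series)

# Raw string so that \d and \. reach the regex engine unchanged.
# str.match anchors at the start of each string, so this keeps rows that
# begin with "I" or with a single digit followed by a dot.
pattern = r"^I|^\d\."
filtered = df2[df2.str.match(pattern)]
print(filtered)
```

If chapter numbers can have more than one leading digit (for example 10.1.1.1), the second alternative would probably need to be `^\d+\.` instead.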

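Have you tried using regular expressions? If you know what you are looking for, you can use the pandas loc function together with a regular expression to collect what you want.

Thanks! Knowing this would have saved me a few hours from the start.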
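
The loc suggestion in the comment can be sketched as follows. This is one possible reading, not the commenter's exact code: the names is_heading, is_wanted, section and wanted are hypothetical, and the heading pattern assumes that section numbers consist only of digits separated by dots. The forward fill labels each wanted line with the subtitle it belongs to, which avoids looping over an unknown number of subtitles.

```python
import pandas as pd

df2 = pd.Series(["1.1.1.1", "lots of useless text", "more useless text",
                 "I want this text", "1.1.1.2", "I want this text",
                 "Not this text", "1.1.1.3", "1.1.2.1", "some lines of text",
                 "1.2.1.1", "Interesting text", "1.2.1.2"])

# Boolean masks built with regular expressions / string tests.
is_heading = df2.str.match(r"^\d+(?:\.\d+)*$")  # e.g. "1.1.1.2"
is_wanted = df2.str.startswith("I")             # the text to keep

# .loc with a combined mask selects headings and wanted text in one step.
result = df2.loc[is_heading | is_wanted]

# Optional: attach each wanted line to the most recent heading above it.
section = df2.where(is_heading).ffill()
wanted = pd.DataFrame({"section": section[is_wanted], "text": df2[is_wanted]})
print(result)
print(wanted)
```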