How can I split a text string into sections based on the headings in a PDF's text, using Python?

python regex pandas dataframe

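I'm still very new to Python, so I don't have a good grasp of the language yet.

I'm trying to extract text from PDFs of research articles and separate it by heading into a DataFrame.

The headings are standard (Abstract, Introduction, Methods, Results, Discussion, References), and I only need three columns: 1) file name, 2) abstract, 3) text, where "text" is everything between the abstract and the references (so I want the text string from the introduction through the end of the discussion in a single group).

I started with the following code: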

import glob

import pandas as pd
from pdfminer.high_level import extract_text

pdf_dir = "C:/Users/dmari/Documents/Python/HeteroTA/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

rows = []

for file in pdf_files:
  # extract_text returns the whole document as a single string
  cleanText = extract_text(file)

  # split() breaks the string into a list of whitespace-separated words
  text = cleanText.split()
  rows.append({'FileName': file, 'Text': text})

# One row per PDF: the file path and its list of words
output_data = pd.DataFrame(rows, columns = ['FileName','Text'])

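to get output that looks like this:

[output image not available]

I want to split this text further, but I can't seem to find any code online that fits my needs. I tried using the following code: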

import re

import pandas as pd

# Create a list with all the strings
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2",
            "Name: Bird_Box Year: 2018 Rating: 6.8",
            "Name: Fight_Club Year: 1999 Rating: 8.8"]

# Create a dictionary with the required columns
# Used later to convert to DataFrame
movies = {"Name":[], "Year":[], "Rating":[]}

for item in movie_data:

    # For Name field
    name_field = re.search(r"Name: .*", item)
    if name_field is not None:
        name = re.search(r"\w*\s\w*", name_field.group())
    else:
        name = None
    # Append None instead of failing when the field is missing
    movies["Name"].append(name.group().strip() if name else None)

    # For Year field
    year_field = re.search(r"Year: .*", item)
    if year_field is not None:
        year = re.search(r"\s\d\d\d\d", year_field.group())
    else:
        year = None
    movies["Year"].append(year.group().strip() if year else None)

    # For Rating field (the dot is escaped so it matches a literal ".")
    rating_field = re.search(r"Rating: .*", item)
    if rating_field is not None:
        rating = re.search(r"\s\d\.\d", rating_field.group())
    else:
        rating = None
    movies["Rating"].append(rating.group().strip() if rating else None)

# Creating DataFrame
df = pd.DataFrame(movies)
print(df)

to reproduce the movie data example with my own files, but I could not get it to return any output.

Any ideas?

Thanks in advance.
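One possible way to get the three columns described in the question is sketched below. It assumes that each heading (Abstract, Introduction, References) sits alone on its own line in the extracted text, which is often not true of real PDFs. The names `find_heading` and `split_sections`, the sample string, and the file name `example.pdf` are all hypothetical, so treat this as a starting point rather than a finished solution.

```python
import re

# Assumption: each heading sits alone on its own line, optionally numbered
# (for example "1. Introduction"). Real extractions may need a looser pattern.
HEADING_TEMPLATE = r"^[ \t]*(?:\d+\.?[ \t]*)?{}[ \t]*$"


def find_heading(name, text, start=0):
    """Return the regex match for the heading line `name`, or None."""
    pattern = re.compile(HEADING_TEMPLATE.format(name), re.IGNORECASE | re.MULTILINE)
    return pattern.search(text, start)


def split_sections(text):
    """Return (abstract, body); body runs from Introduction up to References."""
    abstract_m = find_heading("Abstract", text)
    intro_m = find_heading("Introduction", text, abstract_m.end() if abstract_m else 0)
    refs_m = find_heading("References", text, intro_m.end() if intro_m else 0)
    if not (abstract_m and intro_m and refs_m):
        return None, None  # at least one heading was not found
    abstract = text[abstract_m.end():intro_m.start()].strip()
    body = text[intro_m.start():refs_m.start()].strip()
    return abstract, body


# Hypothetical stand-in for the string returned by extract_text(file)
sample = """Some Title
Abstract
We study things.
Introduction
Background here.
Methods
We did stuff.
Results
It worked.
Discussion
It matters.
References
[1] Someone.
"""

abstract, body = split_sections(sample)
rows = [{"FileName": "example.pdf", "Abstract": abstract, "Text": body}]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=["FileName", "Abstract", "Text"])` would give one row per file. Note that this works on the full string from `extract_text`, not on the word list produced by `.split()`, because the line breaks are needed to find the headings.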

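If your strings are always in this order, you could use re.split to split them into a list, for example with \s*\w+:\s* as the pattern.

Could you show me some sample code that uses that pattern? I have never used regular expressions (or much Python at all) before, so I'm not sure what it should look like.

Python isn't my forte, but I'll see what I can come up with to give you an idea of how to approach this =)

Thank you, JvdV!! I really appreciate it!

PDF is a complex format and it is hard to recognize sections: it may store every word as a separate item with its own x,y position, and then it is hard to tell which words belong to a heading and which to the content. In one question, I tried using distance to identify larger gaps between lines, which may indicate a heading.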
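For what it's worth, the re.split suggestion from the comments could look roughly like this on the movie strings from the question. It is only a sketch: it assumes every string contains exactly the three labels Name, Year and Rating in that order, and the names `LABEL` and `rows` are made up for the example.

```python
import re

movie_data = [
    "Name: The_Godfather Year: 1972 Rating: 9.2",
    "Name: Bird_Box Year: 2018 Rating: 6.8",
    "Name: Fight_Club Year: 1999 Rating: 8.8",
]

# Each "Label:" together with its surrounding whitespace acts as a separator.
LABEL = re.compile(r"\s*\w+:\s*")

rows = []
for item in movie_data:
    # The string starts with a label, so the first piece is empty; discard it.
    _, name, year, rating = LABEL.split(item)
    rows.append({"Name": name, "Year": year, "Rating": rating})

print(rows[0])
```

All values come back as strings, and `pd.DataFrame(rows)` would turn the list into a table. The unpacking raises a ValueError if a string has more or fewer than three labels, which is a reasonable early warning that the input does not follow the assumed order.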

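The distance idea from the last comment can be illustrated without a real PDF. The sketch below takes text lines with their vertical coordinates and flags the lines that follow an unusually large gap. The tuple layout, the function name `lines_after_large_gaps`, the sample coordinates and the 1.5 threshold are all assumptions for illustration; reading the coordinates out of a PDF (for example from pdfminer's layout objects) is not shown.

```python
from statistics import median


def lines_after_large_gaps(lines, factor=1.5):
    """Return the text of lines preceded by a gap larger than factor * median gap.

    `lines` is a list of (y0, y1, text) tuples ordered top to bottom, where y0 is
    the bottom edge and y1 the top edge of a line (PDF y grows upwards).
    """
    # Gap = bottom edge of the previous line minus top edge of the current line.
    gaps = [prev[0] - cur[1] for prev, cur in zip(lines, lines[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    return [cur[2] for gap, cur in zip(gaps, lines[1:]) if gap > factor * typical]


# Hypothetical coordinates for five lines on one page.
page_lines = [
    (700, 712, "...end of the abstract."),
    (670, 682, "Introduction"),      # 18-unit gap above
    (656, 668, "First sentence."),   # 2-unit gap above
    (642, 654, "Second sentence."),  # 2-unit gap above
    (612, 624, "Methods"),           # 18-unit gap above
]

candidates = lines_after_large_gaps(page_lines)
print(candidates)
```

A large gap is only a hint: paragraph breaks, figures and page breaks produce similar gaps, so the candidates would still need to be checked against the expected heading names.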