Searching Word tables for specific text with python-docx
I have some code that reads a table from a Word document and builds a DataFrame from it:
import numpy as np
import pandas as pd
from docx import Document

#### Time for some old fashioned user functions ####
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    table = document.tables[table_loc]
    data = []  # one dict per body row
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            # the first row holds the column names
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame(data)
    return df

SHRD_filename = "SHRD - 12485.docx"
SHDD_filename = "SHDD - 12485.docx"
df_SHRD = make_dataframe(SHRD_filename, 30)
df_SHDD = make_dataframe(SHDD_filename, -60)
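The table index is hard-coded because the files differ: SHRD has 32 tables and the one I want is the second to last, while SHDD has 280 tables and the one I want is the 60th from the end. But that may not always be the case.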
How can I search the tables in the document and start processing the one where cell [0,0] == 'tagnumbers'?
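You can iterate over the tables and check the text in the first cell. I changed the function to return a list of DataFrames, in case more than one table is found. If no table matches, it returns an empty list.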
def make_dataframe(f_name, first_cell_string='tag number'):
    document = Document(f_name)

    # create a list of all of the table objects whose first cell
    # has text equal to `first_cell_string`
    tables = [t for t in document.tables
              if t.cell(0, 0).text.lower().strip() == first_cell_string]

    # in the case that more than one table is found
    out = []
    for table in tables:
        data = []  # reset for each table so rows do not accumulate
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        out.append(pd.DataFrame(data))
    return out
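To see what the header-and-zip step does in isolation, here is a minimal sketch on plain lists of strings. The function name `rows_to_records` and the sample values are hypothetical and only for illustration; no Word file is needed to run it.

```python
def rows_to_records(rows):
    """Turn rows of cell text into dicts keyed by the first (header) row."""
    rows = iter(rows)
    try:
        keys = tuple(next(rows))  # first row supplies the column names
    except StopIteration:
        return []  # no rows at all
    # pair each remaining row with the header, position by position
    return [dict(zip(keys, row)) for row in rows]


# hypothetical table contents, standing in for the cell text of a Word table
sample = [
    ["Tag Number", "Description"],
    ["T-001", "Pump"],
    ["T-002", "Valve"],
]
records = rows_to_records(sample)
print(records)
```

With python-docx, the rows could be supplied as `([c.text for c in row.cells] for row in table.rows)`, and the resulting list of dicts passed to `pd.DataFrame`.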
Thanks. The only thing I needed to add was "first_cell_string = first_cell_string.lower().strip()", so that the search string is normalized the same way as the cell text.
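One way to keep both sides of the comparison consistent is to run them through the same helper. The sketch below is an assumption-laden variation, not the answerer's code: `normalize` and `find_tables` are hypothetical names, and collapsing inner whitespace goes beyond the `lower().strip()` used above. `find_tables` accepts any iterable of objects with a `cell(row, col)` method returning something with a `.text` attribute, which is what the tables in `Document(f_name).tables` provide.

```python
def normalize(text):
    """Lowercase, trim, and collapse runs of whitespace to a single space."""
    return " ".join(text.split()).lower()


def find_tables(tables, first_cell_string="tag number"):
    """Return the tables whose top-left cell matches first_cell_string.

    Both the search string and the cell text go through normalize(),
    so 'Tag Number ' and ' tag  number' compare equal.
    """
    wanted = normalize(first_cell_string)
    return [t for t in tables if normalize(t.cell(0, 0).text) == wanted]
```

A call would look like `find_tables(Document(f_name).tables, "Tag Number")`; the matching tables can then be converted to DataFrames as in the answer.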