Python: extracting text between multiple strings from multiple files

python python-2.7

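Python gurus, I need to extract all the text from each List tag through the URL tag that follows it; an example of the pattern is shown below. I would also like the script to loop over all the files in a folder.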

 .....
 .....
 <List>Product Line</List>
 <URL>http://teamspace.abb.com/sites/Product</URL>
 ...
 ...
 <List>Contact Number</List>
 <URL>https://teamspace.abb.com/sites/Contact</URL>
 ....
 ....

Expected output

<List>Product Line</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Contact Number</List>
<URL>https://teamspace.abb.com/sites/Contact</URL>

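I have written a script that loops over all the files in a folder and extracts every List line, but I have not been able to include the URL lines. Any help is much appreciated.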

import os

# defining location of parent folder (raw strings keep the backslashes literal)
BASE_DIRECTORY = r'C:\D_Drive\Projects\Test'
output_file = open(r'C:\D_Drive\Projects\Test\Output.txt', 'w')
output = {}
file_list = []

# scanning through sub folders
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if 'xml' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)

# collecting the <List> lines of each file
for f in file_list:
    print f
    txtfile = open(f, 'r')
    output[f] = []
    for line in txtfile:
        if '<List>' in line:
            output[f].append(line)
tabs = []
for tab in output:
    tabs.append(tab)

# writing the results, one section per file
tabs.sort()
for tab in tabs:
    output_file.write(tab + '\n')
    output_file.write('\n')
    for row in output[tab]:
        output_file.write(row + '')
    output_file.write('\n')
    output_file.write('----------------------------------------------------------\n')

raw_input()

Try:

Output.txt

will give you:

<List>Emove</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Asset_KWT</List>
<URL>https://teamspace.slb.com/sites/Contact</URL>

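Your attempt is basically correct; the only change needed is to create an iterator over the file, so that the line following each List line can be pulled with next(). You could use ElementTree or BeautifulSoup instead, but understanding this kind of iteration also helps when the file is not XML or HTML.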

txtfile = iter(open(f, 'r'))  # change here
output[f] = []
for line in txtfile:
    if '<List>' in line:
        output[f].append(line)
        output[f].append(next(txtfile))  # and here
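
The same idea can be sketched as a self-contained helper plus a folder scan. This is only an illustration, not the answerer's code: the names `list_url_lines` and `scan_folder` are hypothetical, the sketch assumes the URL line always sits directly after its List line (as in the sample), and it passes a default to next() so that a List on the very last line does not raise StopIteration. It avoids print so that it should behave the same on Python 2.7 and Python 3.

```python
import os


def list_url_lines(lines):
    """Return every line containing '<List>' plus the line right after it."""
    it = iter(lines)  # one shared iterator, so next() advances the for loop too
    found = []
    for line in it:
        if '<List>' in line:
            found.append(line)
            following = next(it, None)  # None when <List> was the last line
            if following is not None:
                found.append(following)
    return found


def scan_folder(base_dir):
    """Map the path of each .xml file under base_dir to its extracted lines."""
    results = {}
    for dirpath, dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            # assumption: only files whose name ends in .xml are of interest
            if name.lower().endswith('.xml'):
                path = os.path.join(dirpath, name)
                with open(path, 'r') as handle:
                    results[path] = list_url_lines(handle)
    return results
```

Calling `scan_folder` on the parent folder would return a dictionary shaped like the `output` dictionary in the question, which the existing writing loop could then dump to Output.txt.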

You can use filter or an equivalent list comprehension:

tgt=('URL', 'List')
with open('file') as f:  
    print filter(lambda line: any(e in line for e in tgt), (line for line in f))

Or:

with open('/tmp/file') as f:  
    print [line for line in f if any(e in line for e in tgt)]

Either one prints:

[' <List>Product Line</List>\n', ' <URL>http://teamspace.abb.com/sites/Product</URL>\n', ' <List>Contact Number</List>\n', ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n']

The input and expected output look the same. Try to improve your question.

Why reinvent the wheel? Just use an XML parser, such as ...

Please update the indentation.

Great! Thanks a lot for the information. I will look into the XML element approach.

Thanks for the comment, I will look into it.
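
For the XML-parser route mentioned in this thread, a minimal sketch with the standard library's xml.etree.ElementTree might look like the following. It rests on assumptions the question does not confirm: that each file is well-formed XML, and that List and URL appear in document order as in the sample. The `Root` and `Item` wrapper elements and the name `list_url_pairs` are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only <List> and <URL> come from the question,
# the <Root> and <Item> wrappers are assumed.
SAMPLE = """<Root>
  <Item>
    <List>Product Line</List>
    <URL>http://teamspace.abb.com/sites/Product</URL>
  </Item>
  <Item>
    <List>Contact Number</List>
    <URL>https://teamspace.abb.com/sites/Contact</URL>
  </Item>
</Root>"""


def list_url_pairs(xml_text):
    """Pair each <List> text with the text of the next <URL> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    current = None
    for elem in root.iter():  # walks the tree in document order
        if elem.tag == 'List':
            current = (elem.text or '').strip()
        elif elem.tag == 'URL' and current is not None:
            pairs.append((current, (elem.text or '').strip()))
            current = None
    return pairs


pairs = list_url_pairs(SAMPLE)
```

Unlike the line-based filters, this returns the element text without the surrounding tags, so the tags would have to be written back out if the output must match the expected output in the question exactly.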