Python: stripping words from a list of sentences based on a specific condition


My starting file is a .txt file that looks like this:

https://www.website.com/something1/id=39494 notes !!!! other notes
https://www.website2.com/something1/id=596774 ... notes2 !! other notes2

and so on. It is very messy.

To clean it up, I did:

import re

with open('file.txt', 'r') as filehandle:
    places = [current_place.rstrip() for current_place in filehandle.readlines()]

filtered = [x for x in places if x.strip()]

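This gives me a list of the website lines (with the blank lines in between removed), but each string still contains its notes.

My goal is first to get a "clean" list of websites without any notes: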

https://www.website.com/something1/id=39494 
https://www.website2.com/something1/id=596774

To do this, I thought of targeting the space after the end of each website and stripping everything that follows it:

for s in filtered:
    f = re.search('\s')

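This raises an error, and even if it worked, it would not return what I have in mind.

The second step is to strip some characters from the website and then combine it like this:

But that will come later.

I just want to know how to accomplish the first step: strip the notes after each website and end up with a clean list.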

If each line contains a URL followed by a space and any other text, you can simply split on the space and take the first element of each line:

urls = []
with open('file.txt') as filehandle:
    for line in filehandle:
        line = line.strip()  # drop the trailing newline and surrounding whitespace
        if not line:
            continue  # skip empty lines
        urls.append(line.split(" ")[0])  # keep only the text before the first space

# now the variable `urls` should contain all the URLs you are looking for
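
To try the same idea without a file on disk, here is a minimal sketch that applies the split to an in-memory list of lines. The function name `extract_urls` and the `sample` list are illustrative assumptions; the sample lines are modelled on the file shown in the question.

```python
def extract_urls(lines):
    """Return the first space-separated token of every non-blank line."""
    urls = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        urls.append(stripped.split(" ")[0])
    return urls


# Hypothetical sample data, modelled on the file in the question.
sample = [
    "https://www.website.com/something1/id=39494 notes !!!! other notes\n",
    "\n",
    "https://www.website2.com/something1/id=596774 ... notes2 !! other notes2\n",
]
print(extract_urls(sample))
```

Because the function accepts any iterable of strings, an open file handle could be passed in place of `sample`.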

Edit: second step

for url in urls:
  print('<iframe src="{}"></iframe>'.format(url))

You can use the following:

# read the lines
with open('file.txt', 'r') as f:
    strlist = f.readlines()
# list to store the URLs
webs = []
for x in strlist:
    if x.strip():  # skip blank lines
        webs.append(x.split()[0])  # first whitespace-separated token, without the newline
print(webs)

If the URL is not always at the beginning of the line, you can try this pattern:

https?:\/\/www\.\w+\.com\/\w+\/id=(\d+)

Then you can use the match groups to get the URL and the id.

Code example

import re

with open('file.txt') as file:
    for line in file:
        # re.search finds the URL anywhere in the line; re.match only checks the start
        m = re.search(r'https?:\/\/www\.\w+\.com\/\w+\/id=(\d+)', line)
        if m:
            print("URL=%s" % m.group(0))
            print("ID=%d" % int(m.group(1)))
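
The following is a small sketch of the same pattern applied to a single string, so the behaviour can be checked without a file. The helper name `find_url_and_id` and the sample line are assumptions made for illustration; the pattern itself is the one given in this answer. It uses `search` so that the URL may appear anywhere in the line.

```python
import re

# Pattern from the answer: group 0 is the whole URL, group 1 the numeric id.
URL_PATTERN = re.compile(r'https?:\/\/www\.\w+\.com\/\w+\/id=(\d+)')


def find_url_and_id(line):
    """Return (url, id) for the first matching URL in the line, or None."""
    m = URL_PATTERN.search(line)  # scans the whole line, not just its start
    if m is None:
        return None
    return m.group(0), int(m.group(1))


# Hypothetical line in which the notes come before the URL.
print(find_url_and_id("notes !! https://www.website.com/something1/id=39494 more notes"))
```

Note that this pattern only covers `www.<name>.com` hosts with a single path segment before `id=`, as in the sample file; other URL shapes would need a broader pattern.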

Try:

with open('file.txt', 'r') as f:
    for line in f:
        if line.strip().startswith('http'):
            print(line.strip().split()[0])

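Note that I pass a string containing a single space to split(), but you can also omit it, because splitting on whitespace is the default behavior. I added it to be more explicit.

Thanks, also for the second step! I noticed that the added content is inside print(); is it possible to write it to a list instead?

@Steven Sure, you can also write the HTML to a list, a long string, or a file. In each of those cases, take the argument passed to print() and use the list append method, string concatenation, or the file write function instead.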
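
As a rough sketch of that last suggestion, the iframe strings can be collected in a list and then joined into one long string. The function name `build_iframes` and the `urls` list below are assumptions for illustration; the format string is the one used in the second step above.

```python
def build_iframes(urls):
    """Return one <iframe> tag per URL, collected in a list instead of printed."""
    return ['<iframe src="{}"></iframe>'.format(url) for url in urls]


# Hypothetical input: the cleaned URLs from the first step.
urls = [
    "https://www.website.com/something1/id=39494",
    "https://www.website2.com/something1/id=596774",
]
iframes = build_iframes(urls)  # list of HTML strings
html = "\n".join(iframes)      # the same content as one long string
print(html)
```

To save the result, the `html` string could be passed to the `write` method of a file opened in `'w'` mode.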