Python: Extracting words from a text file

python awk

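I'm working with a recursive neural network and need to process an input text file that contains trees in order to extract the words. The input file looks like this: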

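(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))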

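As output, I want the list of words in a new text file, like this:

The

Rock

is

destined

(Ignore the blank lines between the words.)

I tried to do this in Python but could not find a solution. I have also read that awk can be used for text processing, but I was unable to produce any working code. Any help is much appreciated.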

You can use re.findall.

Running it on your text produces the following output:

make
not
gorgeously
the
Conan
than
so
huge
and
co-writer/director
Peter
st
is
can
Schwarzenegger
expanded
even
trilogy
Middle-earth
Segal
continuation
column
vision
's
he
''
Damme
adequately
that
greater
Steven
Rock
Jackson
Rings
a
Tolkien
Van
be
words
going
to
new
Jean-Claud
or
elaborate
of
splash
Lord
The
Arnold
describe
destined
J.R.R.
Century
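
Tokens such as co-writer/director, Jean-Claud and 's in the listing show that the pattern has to accept more than plain letters. The following is a minimal sketch of one way to do that, and not necessarily the code this answer used: capture whatever sits between a label and the parenthesis that closes it, then deduplicate with a set. The sample string and the helper name `leaf_tokens` are made up for illustration, and the sketch assumes every word appears as `(label word)` with a numeric label.

```python
import re

# Made-up sample in the assumed "(label word)" format.
sample = "(3 (2 (2 Jean-Claud) (2 Van)) (2 (2 he) (2 's)) (2 (2 Van) (2 Damme)))"


def leaf_tokens(text):
    """Return every token that sits directly inside a '(label token)' leaf."""
    # \(\d+\s+     opening parenthesis, numeric label, whitespace
    # ([^\s()]+)   the token: everything up to the next space or parenthesis
    # \)           the parenthesis that closes the leaf
    return re.findall(r"\(\d+\s+([^\s()]+)\)", text)


tokens = leaf_tokens(sample)
unique_tokens = set(tokens)  # unordered, duplicates removed

print(tokens)
print(sorted(unique_tokens))
```

Because this pattern keeps every leaf token, including punctuation and tokens that contain digits, its result on the question's text may differ from the listing above.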

You can use a regular expression:

import re

my_string = "..."  # placeholder: paste the tree text from the question here
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# With the question's text this prints:
# ['The',
#  'Rock',
#  'is',
#  'destined',
#  'to',
#  'be',
#  'the',
# ...

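Note that this returns a list of matches, so if you want to print them all as a single sentence, you can use:

' '.join(results)

or any other character you would like to separate the words with instead of a space.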
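
Since the question asks for the words in a new text file, the same join idea works with a newline as the separator. This is a minimal sketch: the helper `write_words`, the file name `words.txt` and the stand-in `results` list are assumptions made for illustration.

```python
import os
import tempfile


def write_words(words, path):
    """Write one word per line to path (UTF-8)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")


# Stand-in for the list returned by re.findall.
results = ["The", "Rock", "is", "destined"]

# A temporary directory keeps the example self-contained; use your own path in practice.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "words.txt")  # hypothetical file name
    write_words(results, out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```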

Breaking down the regex pattern, we have:

pattern = r"""
           \(           # match opening parenthesis
             \d         # match a number. If the numbers can be >9, use \d+
               \s+      # match one or more white space characters
                  (     # begin capturing group (only return stuff inside these parentheses)
                   '?   # match zero or one apostrophes (so we don't miss possessives)
                   \w+  # match one or more word characters
                  )     # end capture group
           """
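
An annotated pattern like this only behaves as a regex when it is compiled with the `re.VERBOSE` flag, which makes the engine ignore the layout whitespace and the `#` comments. The sketch below checks that the verbose form and the compact one-line form find the same words. The sample string is made up for illustration.

```python
import re

verbose_pattern = re.compile(
    r"""
    \(        # opening parenthesis
    \d        # a single-digit label (use \d+ if labels can exceed 9)
    \s+       # one or more whitespace characters
    (         # start of the capturing group
      '?      # optional apostrophe, so possessives are not missed
      \w+     # one or more word characters
    )         # end of the capturing group
    """,
    re.VERBOSE,
)

compact_pattern = re.compile(r"\(\d\s+('?\w+)")

sample = "(3 (2 (2 The) (2 Rock)) (2 (2 is) (2 's)))"  # made-up sample

verbose_words = verbose_pattern.findall(sample)
compact_words = compact_pattern.findall(sample)

print(verbose_words)
print(verbose_words == compact_words)
```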

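You can use re.compile:

For the record, we can choose what to throw away rather than what to keep. For example, we can split on parentheses, whitespace and digits. What remains consists of the words and the punctuation. This can be handy for non-Latin scripts and special characters.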

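import re

string_to_split = "..."  # placeholder: the tree text from the question

# split on parens, numbers and spaces (raw string avoids invalid-escape warnings)
spl = re.compile(r"\(|\s|[0-9]|\)")
# filter(None, ...) drops the empty strings left between adjacent delimiters;
# list() is needed on Python 3, where filter returns a lazy iterator
words = list(filter(None, spl.split(string_to_split)))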
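
One side effect of treating digits as delimiters is that digits inside a token are removed as well, so a token such as 21st comes out as st. The sketch below contrasts that behaviour with a variant that splits only on parentheses and whitespace and then discards the purely numeric tokens, assuming those are always labels. That assumption also drops any real word made only of digits. The sample string is made up for illustration.

```python
import re

sample = "(2 (2 the) (2 (2 21st) (2 Century)))"  # made-up sample

# Original idea: digits are delimiters, so "21st" loses its digits.
split_all = [t for t in re.split(r"\(|\s|[0-9]|\)", sample) if t]

# Variant: split on parentheses and whitespace only, then drop
# tokens that consist solely of digits (assumed to be labels).
split_keep = [t for t in re.split(r"[()\s]+", sample) if t and not t.isdigit()]

print(split_all)
print(split_keep)
```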

Thanks for your answer. One more thing: some words contain ' and -, like he's and pre-emptive. How can I get those? And will this give me only the unique words in the output file?

@user129129 Please see my latest edit. This solution will return a unique list.

Thank you so much. You're a lifesaver.

Thanks for your answer. Will re.findall give me unique instances? If not, how can I get them?

What do you mean by unique instances? That if a word appears several times in the file, you only want to read it once? In that case, if order doesn't matter, you can use set(results), or use np.unique(results) and add import numpy as np at the top of the file.

Thanks for the reply. Yes, order doesn't matter, so I'll use set(results).
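
As a small illustration of the deduplication discussed in these comments: `set` removes duplicates but does not preserve order, while `dict.fromkeys` keeps each word at its first position if the order ever does matter. The `results` list below is a stand-in for the output of re.findall.

```python
# Stand-in for the list returned by re.findall.
results = ["the", "Rock", "is", "the", "Rock", "new"]

unique_unordered = set(results)                # duplicates removed, arbitrary order
unique_ordered = list(dict.fromkeys(results))  # duplicates removed, first-seen order (Python 3.7+)

print(sorted(unique_unordered))
print(unique_ordered)
```

`np.unique(results)` also removes duplicates, but it returns them as a sorted NumPy array rather than in their original order.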