Python: lowercase the first element of each tuple in a list of tuples

I have a list of documents, each labeled with its category:

documents = [(list(corpus.words(fileid)), category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

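This gives the following list of tuples, where the first element of each tuple is a list of words (the tokens of a sentence). For example: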

[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', 
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with', 
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'], 
'cancer'), 
([u'A', u'Systematic', u'Review', u'of', u'the', u'Effectiveness', 
u'of', u'Medical', u'Cannabis', u'for', u'Psychiatric', u',', 
u'Movement', u'and', u'Neurodegenerative', u'Disorders', u'.'], 'hd')]

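I want to apply some text-processing techniques, but I want to keep the list-of-tuples format.

I know that if I only had a list of words, this would do it: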

[w.lower() for w in words]

In this case, though, I want to apply .lower() to the first element (a list of strings) of each tuple in the list. I tried the following options:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents], or
[x[0].lower() for x in documents]

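I always get this error:

AttributeError: 'list' object has no attribute 'lower'

I also tried applying what I need before building the list, but .categories() and .fileids() belong to the corpus and give the same error (they return lists as well).

Any help would be greatly appreciated.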

SOLVED:

Both @Adam Smith's answer and @vasia's answer are correct:

[([s.lower() for s in item[0]], item[1]) for item in documents]

@Adam's answer above preserves the tuple structure. @vasia's answer does the lowercasing while the list of tuples is being created:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

Thanks, everyone :)

You were close. You are looking for a construct like this:

[([s.lower() for s in ls], cat) for ls, cat in documents]

This essentially combines these two attempts:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents]

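So your data structure is [([str], str)]: a list of tuples, where each tuple is (list of strings, string). It is important to understand exactly what that means before you try to pull data out of it.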
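To make that shape concrete, here is a minimal sketch that checks the type at each level and reproduces the error from the question. The two short token lists are a hand-built stand-in for the corpus data, not the real corpus.

```python
# Hand-built stand-in for the real corpus data (hypothetical sample).
documents = [
    (['A', 'pilot', 'investigation', '.'], 'cancer'),
    (['A', 'Systematic', 'Review', '.'], 'hd'),
]

first = documents[0]
print(type(first).__name__)        # the outer list holds tuples
print(type(first[0]).__name__)     # the first tuple element is a list
print(type(first[0][0]).__name__)  # that list holds strings

# Calling .lower() on the list (rather than on the strings inside it) fails.
try:
    [(x.lower(), y) for x, y in documents]
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The failing comprehension stops one level too high: x is the whole token list, so .lower() has to be applied to each string inside it instead.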

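This means that "for item in documents" iterates over the list of tuples, with item bound to each tuple in turn. It means that item[0] is the list inside each tuple. And it means that "for item in documents: for s in item[0]:" iterates over every string in that list. Let's try it: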

[s.lower() for item in documents for s in item[0]]

Given your sample data, this should give:

[u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...]
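As a runnable check of the flattened comprehension, here is a small sketch on a hand-built documents list (the sample tokens are made up for illustration):

```python
# Hand-built stand-in for the real corpus data (hypothetical sample).
documents = [
    (['A', 'pilot', 'investigation', '.'], 'cancer'),
    (['A', 'Systematic', 'Review', '.'], 'hd'),
]

# Two loops in one comprehension: the outer walks the tuples,
# the inner walks the tokens of each tuple's first element.
flat = [s.lower() for item in documents for s in item[0]]
print(flat)
# ['a', 'pilot', 'investigation', '.', 'a', 'systematic', 'review', '.']
```

Note that this produces one flat list of tokens, so the category labels and the per-document grouping are lost.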

If you want to keep the tuple format, you can do this instead:

[([s.lower() for s in item[0]], item[1]) for item in documents]

# or perhaps more readably
[([s.lower() for s in lst], val) for lst, val in documents]

Both of these give:

[([u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...], 'cancer'), ... ]
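Here is a small self-contained sketch of the tuple-preserving version, again on a hand-built sample rather than the real corpus. The names tokens and category are just illustrative choices for the unpacked tuple members.

```python
# Hand-built stand-in for the real corpus data (hypothetical sample).
documents = [
    (['A', 'pilot', 'investigation', '.'], 'cancer'),
    (['A', 'Systematic', 'Review', '.'], 'hd'),
]

# Rebuild each tuple: a new lowercased token list plus the unchanged label.
lowered = [([s.lower() for s in tokens], category)
           for tokens, category in documents]

print(lowered[0])
# (['a', 'pilot', 'investigation', '.'], 'cancer')
```

Because the comprehension builds new lists and new tuples, the original documents list is left untouched.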

Try this:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]
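To try this without the real corpus, here is a sketch with a tiny stand-in object. TinyCorpus is hypothetical and only mimics the three calls used above (categories(), fileids(category), words(fileid)); it is not the actual corpus reader.

```python
class TinyCorpus:
    """Hypothetical stand-in exposing only the three calls used here."""

    _files = {
        'cancer': {'c1.txt': ['A', 'pilot', 'investigation', '.']},
        'hd': {'h1.txt': ['A', 'Systematic', 'Review', '.']},
    }

    def categories(self):
        return sorted(self._files)

    def fileids(self, category):
        return sorted(self._files[category])

    def words(self, fileid):
        for files in self._files.values():
            if fileid in files:
                return list(files[fileid])
        raise KeyError(fileid)


corpus = TinyCorpus()

# Lowercase each word while the (tokens, category) tuples are being built.
documents = [([word.lower() for word in corpus.words(fileid)], category)
             for category in corpus.categories()
             for fileid in corpus.fileids(category)]
print(documents)
```

Doing the lowercasing at construction time avoids a second pass over the data.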

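In general, tuples are immutable. But since the first element of each tuple is a list, and lists are mutable, you can modify the list's contents without changing which list the tuple holds: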

documents = [(...what you originally posted...) ... etc. ...]

for d in documents:
    # to lowercase all strings in the list
    # trailing '[:]' is important, need to modify list in place using slice
    d[0][:] = [w.lower() for w in d[0]]

    # or to just lower-case the first element of the list (which is what you asked for)
    d[0][0] = d[0][0].lower()

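You can't just call lower() on a string and expect it to be updated, because lower() returns a new string. To replace a string with its lowercase version, you have to assign the result back. That would be impossible if the string itself were a member of the tuple, but since the strings you want to change live in a list inside the tuple, you can modify the list's contents without changing which list the tuple holds.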
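The three points above can be checked with a short sketch, assuming a hand-built documents list in place of the real data:

```python
# Hand-built stand-in for the real corpus data (hypothetical sample).
documents = [
    (['A', 'pilot', 'investigation', '.'], 'cancer'),
    (['A', 'Systematic', 'Review', '.'], 'hd'),
]

# 1. str.lower() returns a new string; the original is not changed.
title = 'Pilot'
title.lower()
unchanged = title  # still 'Pilot'

# 2. Slice assignment replaces the list's contents in place,
#    so each tuple keeps pointing at the very same list object.
original_lists = [d[0] for d in documents]
for d in documents:
    d[0][:] = [w.lower() for w in d[0]]
same_objects = all(d[0] is lst for d, lst in zip(documents, original_lists))

# 3. Rebinding a tuple slot itself is not allowed.
try:
    documents[0][0] = ['replaced']
except TypeError as exc:
    tuple_error = str(exc)

print(unchanged, same_objects, tuple_error)
```

Unlike the comprehension-based answers, this approach changes the existing lists, so any other reference to those lists sees the lowercased tokens too.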

You need to nest one more…

@yinnonsanders Yes, I already mentioned that, thanks!

This saved me extra lines of code and gets the job done in a single step. :)