Python: extracting the measurable metric from a product title

python arrays numpy

My goal is to extract the measurable metric from a product title.

Example: I have the following products and their titles:

Product title A: "Milk 12KG 1Box"
Product title B: "Apple 10Plus 256GB"
Product title C: "Samsung 4G 3S"

After splitting each product title on whitespace, I have the following:

import numpy as np

arr = [np.array(['Milk', '12KG', '1Box'],dtype=object),np.array(['Apple', '10Plus', '256GB'],dtype=object),np.array(['Samsung', '4G', '3S'],dtype=object)]


for arr1 in arr:
    digit_counts = []
    for a in arr1:
        # count how many digit characters the token contains
        n_digits = 0
        for i in range(10):
            n_digits += a.count(str(i))
        digit_counts.append(n_digits)
    print(arr1, "->", digit_counts)

Output:

['Milk' '12KG' '1Box'] -> [0, 2, 1]
['Apple' '10Plus' '256GB'] -> [0, 2, 3]
['Samsung' '4G' '3S'] -> [0, 1, 1]

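Expected output:

The item that contains more digits than the other items.

If several items contain the same number of digits, take the longest one.

If several items have the same number of digits and the same length, take the one that comes first.

How can I get the desired output?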

You can try:

>>> import re
>>> pattern = r'[^\d]+'
>>> for item in arr:
...     idx = np.argmax(np.array([len(re.sub(pattern, '', x)) for x in item]))
...     print(item[idx])
...
12KG
256GB
4G
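Note that np.argmax returns the first maximum, so on its own it covers the "most digits" and "first in order" rules from the question but not the length tie-break. A minimal plain-Python sketch that applies all three rules follows; the helper name pick_metric is made up for illustration and is not part of the original answer.

```python
import numpy as np

DIGITS = '0123456789'

def pick_metric(tokens):
    """Return the token with the most digits; ties go to the longer token,
    and remaining ties go to the token that appears first."""
    def score(token):
        digit_count = sum(ch in DIGITS for ch in token)
        return (digit_count, len(token))
    # max() returns the first maximal element, which gives the "first in order" rule.
    return max(tokens, key=score)

arr = [
    np.array(['Milk', '12KG', '1Box'], dtype=object),
    np.array(['Apple', '10Plus', '256GB'], dtype=object),
    np.array(['Samsung', '4G', '3S'], dtype=object),
]
results = [pick_metric(row) for row in arr]
print(results)
```

Because the score is a tuple, Python compares the digit count first and only looks at the length when the digit counts are equal.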

Edit

To meet your latest requirement:

>>> arr
[array(['Milk', '12KG', '1Box', '#123'], dtype=object), array(['Apple', '10Plus', '256GB'], dtype=object), array(['Samsung', '4G', '3S'], dtype=object)]

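>>> # remove tokens that start with '#' entirely, then strip every non-digit
>>> pattern = r'^#.*|[^\d]+'
>>> for item in arr:
...     idx = np.argmax(np.array([len(re.sub(pattern, '', x)) for x in item]))
...     print(item[idx])
...
12KG
256GB
4G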
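If noise turns out to be broader than a leading #, one option is to keep only tokens that look like a quantity instead of listing the prefixes to reject. The sketch below assumes that a metric is a number immediately followed by letters (12KG, 256GB); that is an assumption of mine, not a rule stated in the question, and METRIC_RE and pick_metric_filtered are hypothetical names.

```python
import re

# Assumption: a metric token is a number (optionally with a decimal part)
# immediately followed by one or more letters, e.g. "12KG" or "1.5L".
METRIC_RE = re.compile(r'\d+(?:\.\d+)?[A-Za-z]+')

def pick_metric_filtered(tokens):
    """Pick the metric-like token with the most digits, or None if none match."""
    candidates = [t for t in tokens if METRIC_RE.fullmatch(t)]
    if not candidates:
        return None
    # Same ranking as in the question: digit count first, then length, then order.
    return max(candidates, key=lambda t: (sum(c.isdigit() for c in t), len(t)))

rows = [
    ['Milk', '12KG', '1Box', '#123'],
    ['Apple', '10Plus', '256GB'],
    ['Samsung', '4G', '3S'],
]
picked = [pick_metric_filtered(row) for row in rows]
print(picked)
```

Plain lists are used here because the rows have different lengths, which a rectangular numpy array cannot hold directly.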

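I suggest not mixing types in your list of arrays (a below), so that every computation can be done with numpy alone. You can convert it into a tidy form with:

a = np.array(a).astype(str)

Then we compute the result as follows: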

# array of matrices that represents amounts of symbols for each word:
symbol_counts = np.array([np.char.count(a, symbol) for symbol in '0123456789']) 
# element-wise addition of these matrices:
total_counts = np.sum(symbol_counts, axis=0) 
# indices of word that has the most digits in each row:
idx = np.argmax(total_counts, axis=1)
# corresponding words:
result = np.choose(idx, a.T)
print(result)

Output:

['12KG' '256GB' '4G']

Thanks. I get a NameError: name 'pattern' is not defined. Please tell me more. I did import re.

Hi abhilb, just a follow-up. If there is noise in the product title, such as ['Milk', '12KG', '1Box', '#123'], the code above returns '#123' instead of '12KG'. Let me know how to fix it. Thanks.

Isn't "the item that contains more digits than the other items" the requirement?

Maybe not. The end goal is to get the product metric from the product title. If the title contains noise like #123, the code above does not give a sensible result.

How do you define noisy data? Anything that starts with #?
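The numpy-only answer picks the word with the most digits but does not apply the length tie-break from the question. One way to fold it in, sketched here under the assumption that every title splits into the same number of words, is to combine both keys into a single integer score before calling argmax. The variable names are illustrative.

```python
import numpy as np

a = np.array([
    ['Milk', '12KG', '1Box'],
    ['Apple', '10Plus', '256GB'],
    ['Samsung', '4G', '3S'],
]).astype(str)

# Digits per word, summed over the ten digit characters.
digit_counts = np.sum([np.char.count(a, d) for d in '0123456789'], axis=0)
# Length of every word.
lengths = np.char.str_len(a)
# Combine both keys: digits dominate, length only breaks ties.
# Scaling by (max length + 1) keeps the length term from outweighing one digit.
score = digit_counts * (lengths.max() + 1) + lengths
# argmax returns the first maximum, so equal scores resolve to the earliest word.
idx = np.argmax(score, axis=1)
# Integer-array indexing picks one word per row.
result = a[np.arange(a.shape[0]), idx]
print(result)
```

Indexing with a[np.arange(n), idx] selects the same elements as np.choose(idx, a.T) and avoids the cap np.choose places on the number of choice arrays. Rows of unequal length, such as the four-word noisy example, would need padding first or a plain-Python loop.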