Python: extracting measurable metrics from product titles
My goal is to extract the measurable metric from each product title. Example: I have the following products and their titles:
Product title A: "Milk 12KG 1Box"
Product title B: "Apple 10Plus 256GB"
Product title C: "Samsung 4G 3S"
After splitting the product titles on spaces, I have the following:
import numpy as np
arr = [np.array(['Milk', '12KG', '1Box'], dtype=object),
       np.array(['Apple', '10Plus', '256GB'], dtype=object),
       np.array(['Samsung', '4G', '3S'], dtype=object)]
for arr1 in arr:
    sum_list = []
    for a in arr1:
        # count how many digit characters the word contains
        digit_count = 0
        for i in range(10):
            digit_count += a.count(str(i))
        sum_list.append(digit_count)
    print(arr1, "->", sum_list)
Output:
['Milk' '12KG' '1Box'] -> [0, 2, 1]
['Apple' '10Plus' '256GB'] -> [0, 2, 3]
['Samsung' '4G' '3S'] -> [0, 1, 1]
Expected output: the metric token of each title, i.e. 12KG, 256GB and 4G.
>>> import re
>>> pattern = r'[^\d]+'
>>> for item in arr:
...     idx = np.argmax(np.array([len(re.sub(pattern, '', x)) for x in item]))
...     print(item[idx])
...
12KG
256GB
4G
Edit
To meet your latest requirement:
>>> arr
[array(['Milk', '12KG', '1Box', '#123'], dtype=object), array(['Apple', '10Plus', '256GB'], dtype=object), array(['Samsung', '4G', '3S'], dtype=object)]
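>>> # remove '#'-prefixed words entirely, otherwise remove every non-digit
>>> pattern = r'^#.*|[^\d]+'
>>> for item in arr:
...     idx = np.argmax(np.array([len(re.sub(pattern, '', x)) for x in item]))
...     print(item[idx])
...
12KG
256GB
4G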
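For comparison, here is a minimal sketch of the same digit-counting idea without NumPy. It assumes that noise means any word starting with '#', which is only one possible definition. The helper name metric_token and its noise_prefix parameter are made up for illustration.

```python
def metric_token(tokens, noise_prefix="#"):
    """Return the non-noise token with the most ASCII digits, or None."""
    def digit_count(token):
        return sum("0" <= ch <= "9" for ch in token)

    # Assumption: a token is noise when it starts with noise_prefix.
    candidates = [t for t in tokens if not t.startswith(noise_prefix)]
    # max() keeps the first token on ties, the same tie-breaking as np.argmax.
    best = max(candidates, key=digit_count, default=None)
    if best is None or digit_count(best) == 0:
        return None  # nothing in this title looks like a metric
    return best


titles = ["Milk 12KG 1Box #123", "Apple 10Plus 256GB", "Samsung 4G 3S"]
results = [metric_token(title.split()) for title in titles]
print(results)
```

Returning None when no word contains a digit avoids silently picking the first word of a title that has no metric at all.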
I suggest you avoid mixing types in a so that all computations rely on numpy only. You can convert a into a tidy form with the following command:
a=np.array(a).astype(str)
Then we compute the result as follows:
# array of matrices that represents amounts of symbols for each word:
symbol_counts = np.array([np.char.count(a, symbol) for symbol in '0123456789'])
# element-wise addition of these matrices:
total_counts = np.sum(symbol_counts, axis=0)
# indices of word that has the most digits in each row:
idx = np.argmax(total_counts, axis=1)
# corresponding words:
result = np.choose(idx, a.T)
print(result)
Output:
['12KG' '256GB' '4G']
Comments:
Thanks. I got NameError: name 'pattern' is not defined. Please let me know more. I did import re.
Hi abhilb, just a follow-up. If the product title contains noise, e.g. ['Milk', '12KG', '1Box', '#123'], the code above returns '#123' instead of '12KG'. Let me know how to fix it. Thanks.
Isn't the requirement "the item that contains more digits than the other items"?
Maybe not. The end goal is to get the product metric from the product title. If the title contains noise like #123, the code above does not cope with it.
How do you define noisy data? Anything that starts with #?
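One caveat on the NumPy version: np.array(a) only gives a 2-D array when every title splits into the same number of words, and a row such as ['Milk', '12KG', '1Box', '#123'] breaks that. Below is a rough sketch of one way around it, padding short rows with empty strings and masking out noise words. Treating '#'-prefixed words as noise is an assumption, and the variable names rows and width are only illustrative.

```python
import numpy as np

rows = [['Milk', '12KG', '1Box', '#123'],
        ['Apple', '10Plus', '256GB'],
        ['Samsung', '4G', '3S']]

# Pad short rows with '' so the list becomes a rectangular string array.
width = max(len(row) for row in rows)
a = np.array([list(row) + [''] * (width - len(row)) for row in rows])

# Digits per word, summed over the ten digit characters.
total_counts = np.sum([np.char.count(a, d) for d in '0123456789'], axis=0)

# Assumption: words starting with '#' are noise, so they must never win.
total_counts[np.char.startswith(a, '#')] = -1

# Pick, for each row, the word with the highest digit count.
idx = np.argmax(total_counts, axis=1)
result = a[np.arange(len(a)), idx]
print(result)
```

Indexing with a[np.arange(len(a)), idx] is used here in place of np.choose. It selects the same elements and reads a little more directly.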