Python pandas: writing a variable number of new rows from a list

python pandas


I am using Pandas as a way to write data scraped with Selenium. Two example results from the search box ac_results on a web page:

#Search for product_id = "01"
ac_results = "Orange (10)"

#Search for product_id = "02"
ac_results = ["Banana (10)", "Banana (20)", "Banana (30)"]

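Orange returns only one price ($10), while Banana returns a variable number of prices from different suppliers, in this example three prices: ($10), ($20), ($30).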

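The code uses a regex via re.findall to grab each price and put it into a list. As long as re.findall finds only one list item, the code works fine, as with Orange. The problem arises when the number of prices varies, as when searching for Banana. I would like to create a new row for each stated price, and each row should also include product_id and item_name.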
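To make the extraction step described above concrete, here is a minimal sketch of parsing result strings such as "Banana (10)" with re. The helper name parse_results and the pattern RESULT_RE are hypothetical, and the sketch assumes every result has the form "name (digits)"; the real ac_results text may look different.

```python
import re

# Hypothetical pattern and helper, not part of the original code.
# Assumes each result looks like "name (digits)".
RESULT_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<price>[0-9]+)\)\s*$")

def parse_results(results):
    """Return (item_name, prices) from one result string or a list of them."""
    if isinstance(results, str):
        results = [results]
    names, prices = [], []
    for text in results:
        m = RESULT_RE.match(text)
        if m is None:
            continue  # skip anything that does not look like "name (digits)"
        names.append(m.group("name"))
        prices.append(m.group("price"))
    item_name = names[0] if names else None
    return item_name, prices

print(parse_results("Orange (10)"))
print(parse_results(["Banana (10)", "Banana (20)", "Banana (30)"]))
```

Whether there is one price or several, the helper always returns prices as a list, which keeps the later "one row per price" step uniform.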

Current output:

product_id      prices                  item_name
01              10                      Orange
02              [u'10', u'20', u'30']   Banana

Desired output:

product_id      prices                  item_name
01              10                      Orange
02              10                      Banana
02              20                      Banana
02              30                      Banana

Current code:

import re
import pandas as pd

df = pd.read_csv("product_id.csv")

def crawl(product_id):
    # Enter search input here, omitted (driver is the Selenium WebDriver)
    # Getting results:
    search_result = driver.find_element_by_class_name("ac_results")
    # Item name: the text before the opening parenthesis
    item_name = re.match(r"^.*(?=(\())", search_result.text).group().encode("utf-8")
    # Prices: every run of digits that directly follows a "("
    prices = re.findall(r"((?<=\()[0-9]*)", search_result.text)
    return pd.Series([prices, item_name])

df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)

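I can't run your code (the input is probably missing), but you could probably convert the prices list into a list of dicts and then build a DataFrame from that: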

d = [{"price": 10, "product_id": 2, "item_name": "banana"},
     {"price": 20, "product_id": 2, "item_name": "banana"},
     {"price": 10, "product_id": 1, "item_name": "orange"}]
df = pd.DataFrame(d)

Then df is:

  item_name  price  product_id
0    banana     10           2
1    banana     20           2
2    orange     10           1
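The list-of-dicts idea does not require hard-coding any names: the dicts can be built in a loop over whatever was scraped, one dict per price. In the sketch below, the scraped dictionary and the helper rows_from_results are made up for illustration; in practice the names and prices would come from crawl.

```python
import pandas as pd

# Made-up stand-in for the crawler output:
# product_id -> (item_name, list of prices)
scraped = {
    "01": ("Orange", ["10"]),
    "02": ("Banana", ["10", "20", "30"]),
}

def rows_from_results(results):
    """Build one dict per (product, price) pair."""
    rows = []
    for product_id, (item_name, prices) in results.items():
        for price in prices:
            rows.append({"product_id": product_id,
                         "prices": int(price),
                         "item_name": item_name})
    return rows

rows = rows_from_results(scraped)
long_df = pd.DataFrame(rows, columns=["product_id", "prices", "item_name"])
print(long_df)
```

Calling long_df.to_csv("write.csv", index=False) afterwards would write one line per price, with product_id and item_name repeated on each line.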

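Thanks, but Banana and Orange are only examples here. In reality it can be any name and any prices, so I cannot write the dictionaries by hand. Basically all I need is the equivalent of for price in prices: wr_insref.writerow([product_id, price, item_name]) from the csv module. That means repeating the values in the columns product_id and item_name for as long as there are items in the list prices. I retrieve prices as a list from the live search box via re.findall. It should not write the whole list into one cell, as it does now.

The snippet below should be run after your apply(crawl):

# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})

# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])

# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(list(zip(df['item_name'], df['product_id'])),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()

# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]
# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']

Thank you for your suggestion! I will try to implement this, but there may be a few small issues. First, I get AttributeError: 'DataFrame' object has no attribute 'prices'. Note that I do not enter prices into df by hand; I get them with prices = re.findall(((?... Also, is it possible to combine all of this with def crawl(product_id): and df[["prices", "item_name"]] = df["product_id"].apply(crawl), as shown in the question? product_id is read with df = pd.read_csv("product_id.csv").

You would start from the df defined in the second-to-last line of your code, after you have run df[["prices", "item_name"]] = df["product_id"].apply(crawl). My solution should work from there. You could probably modify crawl directly, but I won't attempt that because I do not have access to the same query data as you.

Ah, okay, yes, that makes sense. Implementing it after crawl, I get the error: df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names}) NameError: name 'pids' is not defined. I have looked for a possible explanation but cannot find one at the moment.

You don't need to create df; you have already created it. I only included the first few lines so that others can reproduce my snippet. You should start from line 5 of my example code.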
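For readers on newer pandas versions (0.25 and later), DataFrame.explode performs the list-to-rows step of the stack-based snippet in a single call, so the explicit MultiIndex is not needed. This is a sketch on the same reproducible data, not the answerer's method; converting the prices to int at the end is an extra assumption that the original snippet does not make.

```python
import pandas as pd

# Same reproducible data as in the stack-based snippet
pids = ["01", "02"]
prices = [10, ["10", "20", "30"]]
names = ["Orange", "Banana"]
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})

# explode repeats the other columns for every element of a list cell;
# scalar cells such as 10 are kept as a single row.
df2 = df.explode("prices").reset_index(drop=True)

# Assumption: every price is an integer or a string of digits.
df2["prices"] = df2["prices"].astype(int)
print(df2)
```

As with the stack-based version, this would run on the df produced by apply(crawl), before the call to to_csv.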