Python lxml web scraping: parsing a column that holds multiple entries


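Any help would be greatly appreciated.

I'm trying to use the Python library lxml to extract data from Expedia and move it into a pandas DataFrame.

Some columns, such as the hotel amenities, contain several entries. I'm trying to parse the multiple entries in the hotel amenities (and in other columns) and move each one into a separate column, so that every amenity gets its own column.

Thanks again for your help.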

from lxml import html
import requests
import lxml.html
from lxml.etree import XPath
from lxml import etree
import urllib
import pandas as pd
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent':ua.chrome}

Sumisho_url = requests.get('https://www.expedia.com/Tokyo-Hotels-Sumisho-Hotel.h2221301.Hotel-Information?chkin=6%2F22%2F2017&chkout=6%2F23%2F2017&rm1=a2&regionId=179900&hwrqCacheKey=65e880f7-4254-472b-a76c-a9d652938f8cHWRQ1498148578719&vip=false&c=80642461-a7d7-49bb-856e-df5db3b7cec9&', headers=header)
Sumisho_tree = html.fromstring(Sumisho_url.content)

Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)

Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
Sumisho_df['Hotel_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()'))
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')

Sumisho_df

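You are already scraping the data as a list, so you can loop over that list with a for loop and assign each item to the DataFrame under a different column name: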

Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)

Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
hotel_amenities = Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()')
for i, e in enumerate(hotel_amenities, start=1):
    # one column per amenity: Hotel_Amenities1, Hotel_Amenities2, ...
    Sumisho_df['Hotel_Amenities'+str(i)] = e.strip()
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')
Sumisho_df
The DataFrame will then contain separate columns:

Hotel_Amenities1            Hotel_Amenities2    Hotel_Amenities3    Hotel_Amenities4                    Hotel_Amenities5                Hotel_Amenities6
Total number of rooms - 83  Conference space    Free WiFi           Breakfast available (surcharge)     Free wired high-speed Internet  Laundry facilities
You can parse the other multi-entry columns in the same way.
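As a rough sketch of how the same loop could be reused for any multi-entry column, the snippet below wraps it in a helper. The helper name `spread_entries`, the sample HTML, and the hotel name are all hypothetical stand-ins so the example runs offline; the live Expedia markup may differ.

```python
import pandas as pd
from lxml import html

# Hypothetical markup standing in for the live page (assumption, not Expedia's real HTML).
SAMPLE_HTML = """
<html><body>
<div id="show-more-general">
  <ul>
    <li> Free WiFi </li>
    <li>Conference space</li>
    <li>Laundry facilities</li>
  </ul>
</div>
</body></html>
"""

def spread_entries(df, prefix, entries):
    """Write each entry into its own column named prefix1, prefix2, ..."""
    for i, entry in enumerate(entries, start=1):
        df[prefix + str(i)] = entry.strip()
    return df

tree = html.fromstring(SAMPLE_HTML)
# xpath() with text() returns a plain list of strings, one per <li>
amenities = tree.xpath('//*[@id="show-more-general"]/ul/li/text()')

hotel_df = pd.DataFrame({'Name': ['Sample Hotel']})
hotel_df = spread_entries(hotel_df, 'Hotel_Amenities', amenities)
print(hotel_df)
```

Calling the helper again with a different prefix (for example `'Room_Amenities'`) and the matching XPath result would spread that column as well.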

Update:

You can try:

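# split each comma-separated string into a Series, one element per amenity
foo = lambda x: pd.Series([i.strip() for i in x.split(',')])
df1 = df['Hotel_Amenities'].apply(foo)
# join the new columns back onto the original DataFrame
df = df.join(df1)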
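For a DataFrame that already holds several hotels, a similar split can be done with the vectorized string methods. This is only a sketch: the two hotels and their amenity strings are made-up sample data, and the `Hotel_Amenities1`, `Hotel_Amenities2`, ... naming simply follows the scheme used earlier.

```python
import pandas as pd

# Hypothetical sample data: two hotels with comma-separated amenity strings.
df = pd.DataFrame({
    'Name': ['Hotel A', 'Hotel B'],
    'Hotel_Amenities': [
        'Free WiFi, Conference space, Laundry facilities',
        'Free WiFi, Breakfast available (surcharge)',
    ],
})

# One column per position; hotels with fewer amenities get missing values.
parts = df['Hotel_Amenities'].str.split(',', expand=True)
# Remove the whitespace left around each piece after splitting on the comma.
parts = parts.apply(lambda col: col.str.strip())
# Rename 0, 1, 2, ... to Hotel_Amenities1, Hotel_Amenities2, ...
parts.columns = ['Hotel_Amenities' + str(i + 1) for i in parts.columns]

result = df.drop(columns='Hotel_Amenities').join(parts)
print(result)
```

Because the columns are positional, the same amenity can land in different columns for different hotels.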

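Thank you so much, this is really helpful. One more question: since I have several hotels and concatenate them into a single DataFrame, how do I read the data from the DataFrame? I noticed the hotel amenities are read from the XPath result. What if I want them to come from the pandas DataFrame instead?

Thank you! This is perfect. Apologies for the late reply.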
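If the goal is literally one column per amenity across several hotels, one possible alternative (not taken from the answer above) is to build 0/1 indicator columns with `Series.str.get_dummies`. The sample data below is invented, and the sketch assumes the amenities were joined with `', '` as in the question's code.

```python
import pandas as pd

# Hypothetical sample data: amenities already stored as ', '-joined strings.
df = pd.DataFrame({
    'Name': ['Hotel A', 'Hotel B'],
    'Hotel_Amenities': [
        'Free WiFi, Conference space',
        'Free WiFi, Laundry facilities',
    ],
})

# One indicator column per distinct amenity: 1 if the hotel has it, else 0.
flags = df['Hotel_Amenities'].str.get_dummies(sep=', ')
result = df[['Name']].join(flags)
print(result)
```

Each amenity then always lives in the same column, which makes it easier to filter or compare hotels.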