Warning: file_get_contents(/data/phpspider/zhask/data//catemap/8/python-3.x/17.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 无法从输出Url列表中查找Url_Python_Python 3.x_Web Scraping_Fuzzy Search - Fatal编程技术网

Python 无法从输出Url列表中查找Url

Python 无法从输出Url列表中查找Url,python,python-3.x,web-scraping,fuzzy-search,Python,Python 3.x,Web Scraping,Fuzzy Search,我下面的代码给出了一个url列表。如果我想要任何特定的url,如何解决此问题: 我的代码如下: import bs4, requests index_pages = ('https://www.tripadvisor.in/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 180, 30)) urls = [] with requests.sessio

我下面的代码给出了一个url列表。如果我想要任何特定的url,如何解决此问题: 我的代码如下:

import bs4, requests
index_pages = ('https://www.tripadvisor.in/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 180, 30))
urls = []
with requests.session() as s:
    for index in index_pages:
        r = s.get(index)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        url_list = [i.get('href') for i in soup.select('.property_title')]
        urls.append(url_list)
        print(url_list) 
New_York_City_New_York.html', '/Hotel_Review-g60763-d93543-Reviews-Shelburne_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1485603-Reviews-Comfort_Inn_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d93340-Reviews-Hotel_Elysee_by_Library_Hotel_Collection-New_York_City_New_York.html', '/Hotel_Review-g60763-d1641016-Reviews-The_Chatwal_A_Luxury_Collection_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d93585-Reviews-Lowell_Hotel-New_York_City_New_York.html']
D:\anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
['/Hotel_Review-g60763-d277882-Reviews-Hampton_Inn_Manhattan_Seaport_Financial_District-New_York_City_New_York.html', '/Hotel_Review-g60763-d3529145-Reviews-Holiday_Inn_Express_Manhattan_Times_Square_South-New_York_City_New_York.html', '/Hotel_Review-g60763-d208453-Reviews-Hilton_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d249711-Reviews-The_Hotel_at_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d1158753-Reviews-Kimpton_Ink48_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1186070-Reviews-Marriott_Vacation_Club_Pulse_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d1938661-Reviews-Row_NYC_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93345-Reviews-Skyline_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d217616-Reviews-Kimpton_Muse_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1888977-Reviews-The_Pearl_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d223021-Reviews-Club_Quarters_Hotel_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d611947-Reviews-New_York_Hilton_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d4274398-Reviews-Courtyard_New_York_Manhattan_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456416-Reviews-The_Dominick_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d122014-Reviews-Gild_Hall_a_Thompson_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d2622936-Reviews-Wyndham_Garden_Chinatown-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456560-Reviews-Kimpton_Hotel_Eventi-New_York_City_New_York.html', '/Hotel_Review-g60763-d249710-Reviews-Morningside_Inn-New_York_City_New_York.html', '/Hotel_Review-g60763-d2079052-Reviews-YOTEL_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d224214-Reviews-The_Bryant_Park_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1785018-Reviews-The_James_New_York_SoHo-New_York_City_New_York.html', '/Hotel_Review-g60763-d247814-Reviews-The_Gatsby_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d112039-Reviews-Hotel_Newton-New_York_City_New_York.html', '/Hotel_Review-g60763-d612263-Reviews-Hotel_Mela-New_York_City_New_York.html', '/Hotel_Review-g60763-d99392-Reviews-Hotel_Metro-New_York_City_New_York.html', '/Hotel_Review-g60763-d4446427-Reviews-Hotel_Boutique_At_Grand_Central-New_York_City_New_York.html', '/Hotel_Review-g60763-d1503474-Reviews-Distrikt_Hotel_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d93467-Reviews-Gardens_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93603-Reviews-The_Pierre_A_Taj_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d113311-Reviews-The_Peninsula_New_York-New_York_City_New_York.html']
现在,我得到的输出是URL列表。输出如下所示:

import bs4, requests
index_pages = ('https://www.tripadvisor.in/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 180, 30))
urls = []
with requests.session() as s:
    for index in index_pages:
        r = s.get(index)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        url_list = [i.get('href') for i in soup.select('.property_title')]
        urls.append(url_list)
        print(url_list) 
New_York_City_New_York.html', '/Hotel_Review-g60763-d93543-Reviews-Shelburne_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1485603-Reviews-Comfort_Inn_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d93340-Reviews-Hotel_Elysee_by_Library_Hotel_Collection-New_York_City_New_York.html', '/Hotel_Review-g60763-d1641016-Reviews-The_Chatwal_A_Luxury_Collection_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d93585-Reviews-Lowell_Hotel-New_York_City_New_York.html']
D:\anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
['/Hotel_Review-g60763-d277882-Reviews-Hampton_Inn_Manhattan_Seaport_Financial_District-New_York_City_New_York.html', '/Hotel_Review-g60763-d3529145-Reviews-Holiday_Inn_Express_Manhattan_Times_Square_South-New_York_City_New_York.html', '/Hotel_Review-g60763-d208453-Reviews-Hilton_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d249711-Reviews-The_Hotel_at_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d1158753-Reviews-Kimpton_Ink48_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1186070-Reviews-Marriott_Vacation_Club_Pulse_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d1938661-Reviews-Row_NYC_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93345-Reviews-Skyline_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d217616-Reviews-Kimpton_Muse_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1888977-Reviews-The_Pearl_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d223021-Reviews-Club_Quarters_Hotel_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d611947-Reviews-New_York_Hilton_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d4274398-Reviews-Courtyard_New_York_Manhattan_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456416-Reviews-The_Dominick_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d122014-Reviews-Gild_Hall_a_Thompson_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d2622936-Reviews-Wyndham_Garden_Chinatown-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456560-Reviews-Kimpton_Hotel_Eventi-New_York_City_New_York.html', '/Hotel_Review-g60763-d249710-Reviews-Morningside_Inn-New_York_City_New_York.html', '/Hotel_Review-g60763-d2079052-Reviews-YOTEL_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d224214-Reviews-The_Bryant_Park_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1785018-Reviews-The_James_New_York_SoHo-New_York_City_New_York.html', '/Hotel_Review-g60763-d247814-Reviews-The_Gatsby_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d112039-Reviews-Hotel_Newton-New_York_City_New_York.html', '/Hotel_Review-g60763-d612263-Reviews-Hotel_Mela-New_York_City_New_York.html', '/Hotel_Review-g60763-d99392-Reviews-Hotel_Metro-New_York_City_New_York.html', '/Hotel_Review-g60763-d4446427-Reviews-Hotel_Boutique_At_Grand_Central-New_York_City_New_York.html', '/Hotel_Review-g60763-d1503474-Reviews-Distrikt_Hotel_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d93467-Reviews-Gardens_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93603-Reviews-The_Pierre_A_Taj_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d113311-Reviews-The_Peninsula_New_York-New_York_City_New_York.html']
现在,如果我要从上面的列表中查找任何特定的url,如何做到这一点?
例如:对于希尔顿时代广场,如何从上面的列表中找到url

查找准确的url:

def findExactUrl(urlList, searched):
    for url in urllist:
       if url == searched:
           reurn url
空闲时你可以打电话

>>findExactUrl(url_list, "http://maritonhotel.com/123")
## if such url is in your list
>>"mariton Hotel" 
## or if such url is not there, nothing should show, just:
>>
或者,从.py文件调用:

myUrl = findExactUrl(url_list, "http://maritonhotel.com/123")
print(myUrl)
>>"http://maritonhotel.com/123"
您可以编辑函数以返回
True
i
以查找其索引

更模糊的搜索

def findOccurence(urlList, searched):
    foundUrls = []
    for url in urllist:
       if url.contains(searched):
           foundUrls.append(url)
    return foundUrls

若要从字符串中删除某些子字符串,只需调用.replace()方法



如果您有什么不明白的地方,请告诉我。另外,请认真考虑购买一些Python书给初学者,或者做一些在线课程。 这就是我要做的:

keyword = "Hilton_Times_Square"
target_urls = [ e for e in url_list if keyword in e ]

如何打印该url?示例:假设我的列表中有5家万豪酒店。但我只需要打印一家特定的酒店,那么我该怎么做?>>a=findExactUrl(url\u list,”)<然后你只需调用>>打印(a)@Shaon你还有什么问题吗?非常感谢@Kajbo…我只想在我的输出中添加这个url…这样看起来:…我不明白。你能详细说明一下吗?我该如何打印吗?当我使用print(target\u url)时,它给出了一个输出显示str对象不可调用要打印它,你应该在target\u url:print中为u使用循环:
(u) 
这与Kajbo答案中的“更多模糊搜索”部分相同