Python: extracting a list of file names from a GitHub page with BeautifulSoup

python python-3.x web-scraping

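I am writing a program in Python that scans my friend's and my own GitHub pages and displays the names of all the uploaded files. I have managed to get it to do that: all the file names are under the <a> tag. The problem is that other random text also sits under that tag, such as "Add files via upload". I don't want that text to show up. Any help would be appreciated. Kind regards, Eric.

I tried stripping the string when printing the final result, but it still doesn't work.

Here is my code: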

import bs4
import requests
from bs4 import BeautifulSoup as soup
import lxml
import re
import time
import os
import webbrowser

def webscrape():
    res = requests.get('https://github.com/Dukesan7/jerichson')
    type(res)
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    type(soup)
    # Selects every <a> element on the page, not only the file links
    file = soup.select('a')
    file[1].getText()
    time.sleep(1)
    # Turns the whole result list into one string, then strips the tags with a regex
    files = str(file)
    clean = re.compile('<.*?>')
    files = re.sub(clean, '', files)
    print (files)
    time.sleep(1)
    print ("1. Main Menu: 1")
    print ("2. exit?: 2")
    op = input (":")
    if op == "2":
        exit()
    else:
        # MainMenu is not shown here; presumably it is defined elsewhere in the program
        MainMenu()

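A simplified version of the code:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get('https://github.com/Dukesan7/jerichson')
soup = bs(res.text, 'lxml')
# The file and folder links in the listing carry the js-navigation-open class
file = soup.find_all('a', class_="js-navigation-open")
for i in file:
    # Keep only the entries whose text contains a period
    if '.' in i.text:
        print(i.text)

This gives the following output: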

21s.py
BVVVVV.exe
Calling Casino.py
Game Download Link.txt
Homework.py
Password Username System.py
Puzzle.txt
StopWatch.py
Voting ligitimacy system.py
Vowl counter.py
agenotage.py
coin.py
dice.py
explorer reset.bat
name and age dukesan.py
notification.pyw
reminder.py
win 21 game.py

Is this what you were looking for?
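
The period test above is a heuristic: it drops files without an extension and would keep a folder whose name contains a period. A minimal sketch of an alternative, assuming the listing links point at /blob/ for files and /tree/ for folders, as GitHub URLs do. The SAMPLE_HTML markup and the file_names helper are hypothetical and only illustrate the idea; html.parser is used instead of lxml so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a repository listing; the real page may differ.
SAMPLE_HTML = """
<a class="js-navigation-open" href="/user/repo/tree/master/Maths%20Game">Maths Game</a>
<a class="js-navigation-open" href="/user/repo/tree/master/v1.2">v1.2</a>
<a class="js-navigation-open" href="/user/repo/blob/master/dice.py">dice.py</a>
<a class="js-navigation-open" href="/user/repo/blob/master/LICENSE">LICENSE</a>
<a href="/user/repo/commit/abc123">Add files via upload</a>
"""


def file_names(html):
    """Return the text of listing links whose href points at a file (/blob/)."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="js-navigation-open")
    # Folder links contain /tree/ instead, so they are skipped here.
    return [a.get_text(strip=True) for a in links if "/blob/" in a.get("href", "")]


names = file_names(SAMPLE_HTML)
print(names)
```

Here the extensionless LICENSE is kept and the folder v1.2 is skipped, which a plain period test would get the other way round.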

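If you use the Inspector in your browser, you can look for a class and/or tag shared by all the file and folder names. I found that they all sit in a td element with the class content, whose parent is a tr element with the class js-navigation-item.

So you can use the following selector in BeautifulSoup: tr.js-navigation-item > td.content

Note that you can extract the text of an HTML element simply with the syntax elem.text. Regular expressions are not a good fit for stripping HTML tags.

Working implementation:

import bs4
import requests

res = requests.get('https://github.com/Dukesan7/jerichson')
soup = bs4.BeautifulSoup(res.text, 'lxml')
# Select only the name cell of each row in the file listing
files_list = soup.select('tr.js-navigation-item > td.content')
# .text returns the cell's text without tags; strip() removes the surrounding whitespace
files_list_text = [f.text.strip() for f in files_list]
print(files_list_text)

Output: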
['Google2', 'Maths Game', 'OpenMinecraft', '21s.py', 'BVVVVV.exe', 'Calling Casino.py', 'Game Download Link.txt', 'Homework.py', 'Password Username System.py', 'Puzzle.txt', 'StopWatch.py', 'Voting ligitimacy system.py', 'Vowl counter.py', 'agenotage.py', 'coin.py', 'dice.py', 'explorer reset.bat', 'name and age dukesan.py', 'notification.pyw', 'privilege_escalation', 'reminder.py', 'win 21 game.py']
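
Because the live page can change, the same selector can be tried on a static snippet. This is a minimal sketch: SAMPLE_HTML is a hypothetical stand-in for the file table, with class names taken from the selector above and a made-up message cell, and html.parser replaces lxml so no extra parser is required. It also shows why selecting every a element pulls in the commit messages.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the file table; only the class names used by the
# selector come from the answer, the rest of the markup is assumed.
SAMPLE_HTML = """
<table>
  <tr class="js-navigation-item">
    <td class="content"><span><a href="#">Maths Game</a></span></td>
    <td class="message"><a href="#">Add files via upload</a></td>
  </tr>
  <tr class="js-navigation-item">
    <td class="content"><span><a href="#">dice.py</a></span></td>
    <td class="message"><a href="#">Add files via upload</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting every link also returns the commit-message links.
all_links = [a.get_text(strip=True) for a in soup.select("a")]

# Restricting the match to the name cell of each row leaves only the names.
cells = soup.select("tr.js-navigation-item > td.content")
names = [cell.get_text(strip=True) for cell in cells]

print(all_links)
print(names)
```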

Note that this does not include folder names (not sure whether OP wants them included). It also excludes any file name that does not contain a period (which can happen); since OP did not state the desired output, I figured this was close enough, and he can modify the code to suit his needs if necessary.

Do you want to include or exclude folders? E.g. Google2

Thanks, it worked. Could you tell me exactly what this line means, so I can understand it better? Thanks: files_list = soup.select('tr.js-navigation-item > td.content')

td.content means a td element with the class content. Similarly, tr.js-navigation-item means a tr element with the class js-navigation-item. The element1 > element2 syntax means that element2 is selected only if it is a direct child of element1.

OK, cheers. But when I look at the page, I can't see a tr anywhere near content?

Inspect the file names specifically to find the element. Or view the page source with Ctrl+U and search for tr with Ctrl+F. See the image I added to my answer.
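
The difference between the > (child) combinator and a plain space (descendant) combinator can be checked on a small snippet. This is only an illustrative sketch: the HTML and its class names (outer, item) are made up for the demonstration and have nothing to do with the GitHub page.

```python
from bs4 import BeautifulSoup

# Made-up markup: one p.item directly inside div.outer, one nested deeper.
HTML = """
<div class="outer">
  <p class="item">direct child</p>
  <section>
    <p class="item">nested deeper</p>
  </section>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "A > B" matches B only when its parent is A.
direct = [p.get_text() for p in soup.select("div.outer > p.item")]

# "A B" matches B at any depth below A.
any_depth = [p.get_text() for p in soup.select("div.outer p.item")]

print(direct)     # ['direct child']
print(any_depth)  # ['direct child', 'nested deeper']
```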