Python 3.x: Parsing (poorly designed) HTML to get the relevant lines of text in Python

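I am working on a low-resource language model for Sanskrit. As the name suggests, there are not many resources or corpora on the internet that I can train my model on. However, I did find this website, which lets me download a zip containing 1000 Sanskrit texts. The data is in HTML format, though, and I only need the relevant lines from it so that my model trains properly.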

Here is the beginning of some of these files:

<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
<title> Atharvaveda-Samhita, Saunaka recension, UNACCENTED TEXT </title>
    <style>
    body   { background-color: #FFFFFF; font-family: Arial Unicode Ms, Arial Unicode Ms Standard }
    .large { font-size: 18px; }
    .red   { color: #FF0000; }
    .blue  { color: #0000FF; }
    td     { font-family: 'Arial Unicode MS', 'Arial Unicode MS Standard', 'Gentium', 'Gandhari Unicode', 'Lucida Grande', 'CyberBit', 'Bitstream CyberBase', 'Bitstream CyberCJK', 'Code2000', 'Courier New', 'Doulos SIL', 'Fixedsys Excelsio', 'Free Monospaced', 'Free Serif', 'Everson Mono Unicode', 'Arial', 'CN-Arial', 'CN-Times'; }
    hr     { width: 50%; height: 1px; margin-left: 0; margin-right: auto; }
    </style>
</head>
<body><BR><BR>

Atharvaveda-Samhita, Saunaka recension<BR>
Based on the ed.: Gli inni dell' Atharvaveda (Saunaka),<BR>
trasliterazione a cura di Chatia Orlandi, Pisa 1991,<BR>
collated with the ed. R. Roth and WḌ. Whitney:<BR>
Atharva Veda Sanhita, Berlin 1856.<BR>
<BR>
Input by Vladimir Petr and Petr Vavrousek.<BR>
TITUS redaction by Jost Gippert (31 January 1997).<BR>
Text of Books 11-20 improved by <BR>
Arlo Griffiths, Leiden 18 May 2000 and<BR>
Philipp Kubisch, Bonn 13 March 2007.<BR>
Revised by Arlo Griffiths, August 2009.<BR>
<BR>
<BR>
<BR>
UNACCENTED TEXT<BR>
<BR>
<BR>
NOTE ON REFERENCES IN BOOKS 11-20:<BR>
The basic numbering of Books 11-20 follows the ed. Roth/Whitney.<BR>
Numbering in [...] follows the ed. by Vishva Bandhu: Atharvaveda (Saunaka),<BR>
with the Pada-Patha and Sayanacarya's commentary, Hoshiarpur 1960-1964<BR>
(Vishveshvaranand indological series, 13-17).<BR>
<BR>
<BR>
<BR>


<hr>
<br>
THIS <a href="http://gretil.sub.uni-goettingen.de/gretil.htm" target="_blank">GRETIL</a> TEXT FILE IS FOR REFERENCE PURPOSES ONLY!<br>
COPYRIGHT AND TERMS OF USAGE AS FOR SOURCE FILE.<br>
<br>
Text converted to Unicode (UTF-8).<br>
(This file is to be used with a UTF-8 font and your browser's VIEW configuration<br>
set to UTF-8.)
<br>
<br>
<table style="width: 50%">
<tr><td>description:</td><td style="text-align: center">multibyte sequence:</td></tr>
<tr><td>long a</td><td style="text-align: center">  ā   </td></tr>
<tr><td>long A</td><td style="text-align: center">  Ā   </td></tr>
<tr><td>long i</td><td style="text-align: center">  ī   </td></tr>
<tr><td>long I</td><td style="text-align: center">  Ī   </td></tr>
<tr><td>long u</td><td style="text-align: center">  ū   </td></tr>
<tr><td>long U</td><td style="text-align: center">  Ū   </td></tr>
<tr><td>vocalic r</td><td style="text-align: center">  ṛ  </td></tr>
<tr><td>vocalic R</td><td style="text-align: center">  Ṛ  </td></tr>
<tr><td>long vocalic r</td><td style="text-align: center">  ṝ  </td></tr>
<tr><td>vocalic l</td><td style="text-align: center">  ḷ  </td></tr>
<tr><td>vocalic L</td><td style="text-align: center">  Ḷ  </td></tr>
<tr><td>long vocalic l</td><td style="text-align: center">  ḹ  </td></tr>
<tr><td>velar n</td><td style="text-align: center">  ṅ  </td></tr>
<tr><td>velar N</td><td style="text-align: center">  Ṅ  </td></tr>
<tr><td>palatal n</td><td style="text-align: center">  ñ   </td></tr>
<tr><td>palatal N</td><td style="text-align: center">  Ñ   </td></tr>
<tr><td>retroflex t</td><td style="text-align: center">  ṭ  </td></tr>
<tr><td>retroflex T</td><td style="text-align: center">  Ṭ  </td></tr>
<tr><td>retroflex d</td><td style="text-align: center">  ḍ  </td></tr>
<tr><td>retroflex D</td><td style="text-align: center">  Ḍ  </td></tr>
<tr><td>retroflex n</td><td style="text-align: center">  ṇ  </td></tr>
<tr><td>retroflex N</td><td style="text-align: center">  Ṇ  </td></tr>
<tr><td>palatal s</td><td style="text-align: center">  ś   </td></tr>
<tr><td>palatal S</td><td style="text-align: center">  Ś   </td></tr>
<tr><td>retroflex s</td><td style="text-align: center">  ṣ  </td></tr>
<tr><td>retroflex S</td><td style="text-align: center">  Ṣ  </td></tr>
<tr><td>anusvara</td><td style="text-align: center">  ṃ  </td></tr>
<tr><td>visarga</td><td style="text-align: center">  ḥ  </td></tr>
<tr><td>long e</td><td style="text-align: center">  ē   </td></tr>
<tr><td>long o</td><td style="text-align: center">  ō   </td></tr>
<tr><td>l underbar</td><td style="text-align: center">  ḻ  </td></tr>
<tr><td>r underbar</td><td style="text-align: center">  ṟ  </td></tr>
<tr><td>n underbar</td><td style="text-align: center">  ṉ  </td></tr>
<tr><td>k underbar</td><td style="text-align: center">  ḵ  </td></tr>
<tr><td>t underbar</td><td style="text-align: center">  ṯ  </td></tr>
</table>
<br>
<p>
Unless indicated otherwise, accents have been dropped in order <br>
to facilitate word search.<br>
<br>
For a comprehensive list of GRETIL encodings and formats see:<br>
http://gretil.sub.uni-goettingen.de/gretil/gretdiac.pdf<br>
and<br>
http://gretil.sub.uni-goettingen.de/gretil/gretdias.pdf<br>
<br>
For further information see:<br>
http://gretil.sub.uni-goettingen.de/gretil.htm</p>
<br>
<hr>
<BR>
<BR>
<BR>


<BR>
<BR>
(AVŚ_1,1.1a) ye triṣaptāḥ pariyanti viśvā rūpāṇi bibhrataḥ |<BR>
(AVŚ_1,1.1c) vācas patir balā teṣāṃ tanvo adya dadhātu me ||1||<BR>
<BR>
(AVŚ_1,1.2a) punar ehi vacas pate devena manasā saha |<BR>
(AVŚ_1,1.2c) vasoṣ pate ni ramaya mayy evāstu mayi śrutam ||2||<BR>
<BR>
(AVŚ_1,1.3a) ihaivābhi vi tanūbhe ārtnī iva jyayā |<BR>
(AVŚ_1,1.3c) vācas patir ni yachatu mayy evāstu mayi śrutam ||3||<BR>
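It would be great if someone could tell me how to get the relevant text into a .txt file without any HTML markup, split into lines the same way as in the original HTML. Thanks.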
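
One possible direction, sketched under assumptions drawn only from the samples shown here: in each sample the preamble (credits, GRETIL notice, character table) ends at the last `<hr>`, and every displayed line ends with `<br>`. The sketch below uses the standard-library `html.parser` rather than BeautifulSoup, keeps only what follows the last `<hr>`, and turns each `<br>` into a line break. The names `TextAfterLastRule`, `html_to_lines` and `convert_file` are hypothetical, and the approach breaks if a file uses `<hr>` inside the text itself.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TextAfterLastRule(HTMLParser):
    """Collect visible text, restarting at every <hr> so only the tail survives."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <style>, <script> or <title>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br".
        if tag in ("style", "script", "title"):
            self.skip_depth += 1
        elif tag == "hr":
            # Assumption: everything before the last <hr> is preamble.
            self.parts = []
        elif tag in ("br", "p"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("style", "script", "title") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Source newlines are not display line breaks; only <br> is.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_lines(html):
    """Return the non-empty display lines that follow the last <hr>."""
    parser = TextAfterLastRule()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return [line.strip() for line in text.split("\n") if line.strip()]


def convert_file(src, dst):
    """Read one UTF-8 HTML file and write its text lines to a .txt file."""
    html = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text("\n".join(html_to_lines(html)) + "\n", encoding="utf-8")
```

A call such as `convert_file("avs___u.htm", "avs___u.txt")` would then write one line per `<br>`-terminated line. Blank lines between verses are dropped here; if verse boundaries matter for training, the filter in `html_to_lines` would need to keep them.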

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Haribhadrasuri: Sastravartasamuccaya</title>
    <style>
    body   { background-color: #FFFFFF; font-family: Arial Unicode Ms, Arial Unicode Ms Standard }
    .large { font-size: 18px; }
    .red   { color: #FF0000; }
    .blue  { color: #0000FF; }
    td     { font-family: 'Arial Unicode MS', 'Arial Unicode MS Standard', 'Gentium', 'Gandhari Unicode', 'Lucida Grande', 'CyberBit', 'Bitstream CyberBase', 'Bitstream CyberCJK', 'Code2000', 'Courier New', 'Doulos SIL', 'Fixedsys Excelsio', 'Free Monospaced', 'Free Serif', 'Everson Mono Unicode', 'Arial', 'CN-Arial', 'CN-Times'; }
    hr     { width: 50%; height: 1px; margin-left: 0; margin-right: auto; }
    </style>
</head>
<body><BR><BR>


Haribhadrasuri: Sastravartasamuccaya, <BR>
Based on the ed. by K. K. Dixit,<BR>
Ahmedabad : Lalbhai Dalpatbhai Bharatiya 1969<BR>
(L. D. Series, 22)<BR>
<BR>
<BR>
Input by Yasunori Harada<BR>
<BR>
<BR>
<BR>
PLAIN TEXT VERSION<BR>
<BR>
<BR>
<BR>
The text has a number of metrical irregularities.<BR>
Pada boundaries frequently cut through compounds, and sometimes through words.<BR>
<BR>
<BR>
<BR>
REFERENCE SYSTEM:<BR>
The reference includes the stabaka and section nos. <BR>
of K.K. Dixit's Viṣayasūcī [bracketed]:<BR>
<BR>
HSvs_[n.n]nnn = [stabaka.section]verse<BR>
<BR>
<BR>

<hr>
<br>
THIS <a href="http://gretil.sub.uni-goettingen.de/gretil.htm" target="_blank">GRETIL</a> TEXT FILE IS FOR REFERENCE PURPOSES ONLY!<br>
COPYRIGHT AND TERMS OF USAGE AS FOR SOURCE FILE.<br>
<br>
Text converted to Unicode (UTF-8).<br>
(This file is to be used with a UTF-8 font and your browser's VIEW configuration<br>
set to UTF-8.)
<br>
<br>
<table style="width: 50%">
<tr><td>description:</td><td style="text-align: center">multibyte sequence:</td></tr>
<tr><td>long a</td><td style="text-align: center">  ā   </td></tr>
<tr><td>long A</td><td style="text-align: center">  Ā   </td></tr>
<tr><td>long i</td><td style="text-align: center">  ī   </td></tr>
<tr><td>long I</td><td style="text-align: center">  Ī   </td></tr>
<tr><td>long u</td><td style="text-align: center">  ū   </td></tr>
<tr><td>long U</td><td style="text-align: center">  Ū   </td></tr>
<tr><td>vocalic r</td><td style="text-align: center">  ṛ  </td></tr>
<tr><td>vocalic R</td><td style="text-align: center">  Ṛ  </td></tr>
<tr><td>long vocalic r</td><td style="text-align: center">  ṝ  </td></tr>
<tr><td>vocalic l</td><td style="text-align: center">  ḷ  </td></tr>
<tr><td>vocalic L</td><td style="text-align: center">  Ḷ  </td></tr>
<tr><td>long vocalic l</td><td style="text-align: center">  ḹ  </td></tr>
<tr><td>velar n</td><td style="text-align: center">  ṅ  </td></tr>
<tr><td>velar N</td><td style="text-align: center">  Ṅ  </td></tr>
<tr><td>palatal n</td><td style="text-align: center">  ñ   </td></tr>
<tr><td>palatal N</td><td style="text-align: center">  Ñ   </td></tr>
<tr><td>retroflex t</td><td style="text-align: center">  ṭ  </td></tr>
<tr><td>retroflex T</td><td style="text-align: center">  Ṭ  </td></tr>
<tr><td>retroflex d</td><td style="text-align: center">  ḍ  </td></tr>
<tr><td>retroflex D</td><td style="text-align: center">  Ḍ  </td></tr>
<tr><td>retroflex n</td><td style="text-align: center">  ṇ  </td></tr>
<tr><td>retroflex N</td><td style="text-align: center">  Ṇ  </td></tr>
<tr><td>palatal s</td><td style="text-align: center">  ś   </td></tr>
<tr><td>palatal S</td><td style="text-align: center">  Ś   </td></tr>
<tr><td>retroflex s</td><td style="text-align: center">  ṣ  </td></tr>
<tr><td>retroflex S</td><td style="text-align: center">  Ṣ  </td></tr>
<tr><td>anusvara</td><td style="text-align: center">  ṃ  </td></tr>
<tr><td>visarga</td><td style="text-align: center">  ḥ  </td></tr>
<tr><td>long e</td><td style="text-align: center">  ē   </td></tr>
<tr><td>long o</td><td style="text-align: center">  ō   </td></tr>
<tr><td>l underbar</td><td style="text-align: center">  ḻ  </td></tr>
<tr><td>r underbar</td><td style="text-align: center">  ṟ  </td></tr>
<tr><td>n underbar</td><td style="text-align: center">  ṉ  </td></tr>
<tr><td>k underbar</td><td style="text-align: center">  ḵ  </td></tr>
<tr><td>t underbar</td><td style="text-align: center">  ṯ  </td></tr>
</table>
<br>
<p>
Unless indicated otherwise, accents have been dropped in order <br>
to facilitate word search.<br>
<br>
For a comprehensive list of GRETIL encodings and formats see:<br>
http://gretil.sub.uni-goettingen.de/gretil/gretdiac.pdf<br>
and<br>
http://gretil.sub.uni-goettingen.de/gretil/gretdias.pdf<br>
<br>
For further information see:<br>
http://gretil.sub.uni-goettingen.de/gretil.htm</p>
<br>
<hr>
<BR>
<BR>
<BR>


<BR>
<BR>
Viṣayasūcī by K.K. Dixit (verse numbers bracketed)<BR>
<BR>
===pahalā stabaka===<BR>
graṃtha prastāvanā<BR>
1.1 mokṣasādhanarūpa se dharma kī upādeyatā (1-29)<BR>
1.2 bhūtacaitanyavādakhaṃḍana (30-78)<BR>
1.3 maiṃ viṣayaka pratyakṣa anubhava se ātmā kī siddhi (79-87)<BR>
1.4 ātmā tathā karma ke saṃbaṃdha meṃ matamatāntara (88-109)<BR>
1.5 bhūtacaitanyavādakhaṃḍana kā upasaṃhāra (110-112)<BR>
<BR>
===dūsarā stabaka===<BR>
2.1 puṇya, pāpa tathā mokṣa se saṃbaṃdhita kuccha praśna (113-163)<BR>
2.2 kālavāda, svabhāvavāda, niyativāda, karmavāda, kālādisāmagrīvāda (164-193)<BR>
<BR>
===tīsarā stabaka===<BR>
<BR>
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>CUMULATIVE EXCERPTS of Sanskrit portions from Javano-Balinese texts, plain text</title>
    <style>
      body     { background-color: #FFFFFF; font-family: Arial Unicode Ms, Arial Unicode Ms Standard }
      .large { font-size: 18px; }
      .red { color: #FF0000; }
      .blue { color: #0000FF; }
      td { font-family: 'Arial Unicode MS', 'Arial Unicode MS Standard', 'Gentium', 'Gandhari Unicode', 'Lucida Grande', 'CyberBit', 'Bitstream CyberBase', 'Bitstream CyberCJK',
'Code2000', 'Courier New', 'Doulos SIL', 'Fixedsys Excelsio', 'Free Monospaced',
'Free Serif', 'Everson Mono Unicode', 'Arial', 'CN-Arial', 'CN-Times'; }
      hr { width: 50%; height: 1px; margin-left: 0; margin-right: auto; }
    </style>
  </head>
  <body>
<br><br>CUMULATIVE EXCERPTS of Sanskrit portions from Javano-Balinese texts<br>
<br>
<br>
Input by Andrea Acri, Arlo Griffiths, and Timothy Lubin<br>
(for details see files of the individual texts)<br>
[GRETIL-Version: 2018-09-12]<br>
<br>
<br>
CURRENTLY COMPRISING:<br>
<br>
Śaiva:<br>
GpT_ = Gaṇapatitattva<br>
JñS_ = Jñānasiddhānta<br>
TJ_ = Saṅ Hyaṅ Tattvajñāna<br>
MJ_ = Saṅ Hyaṅ Mahājñāna<br>
VpT = Vṛhaspatitattva<br>
<br>
Didactic:<br>
VS_ = Vratiśāsana<br>
Slo_ = Ślokāntara<br>
<br>
<br>
PLAIN TEXT VERSION<br>
In order to facilitate word search, all brackets and all special characters<br>
have been removed or reduced to conform to GRETIL's character list below.<br>
<br>
<br>
<br>
<br>
 <hr>
 <br>
 THIS <a href="http://gretil.sub.uni-goettingen.de/gretil.htm" target="_blank">GRETIL</a> TEXT FILE IS FOR REFERENCE PURPOSES ONLY!   <br>
 COPYRIGHT AND TERMS OF USAGE AS FOR SOURCE FILE. <br>
 <br>
 Text converted to Unicode (UTF-8).<br>
 (This file is to be used with a UTF-8 font and your browser's VIEW configuration <br>
 set to UTF-8.)
 <br>
 <br>
 <table style="width: 50%">
 <tr><td>description:</td><td style="text-align: center">multibyte sequence:</td></tr>
 <tr><td>long a</td><td style="text-align: center">  ā   </td></tr>
 <tr><td>long A</td><td style="text-align: center">  Ā   </td></tr>
 <tr><td>long i</td><td style="text-align: center">  ī   </td></tr>
 <tr><td>long I</td><td style="text-align: center">  Ī   </td></tr>
 <tr><td>long u</td><td style="text-align: center">  ū   </td></tr>
 <tr><td>long U</td><td style="text-align: center">  Ū   </td></tr>
 <tr><td>vocalic r</td><td style="text-align: center">  ṛ  </td></tr>
 <tr><td>vocalic R</td><td style="text-align: center">  Ṛ  </td></tr>
 <tr><td>long vocalic r</td><td style="text-align: center">  ṝ  </td></tr>
 <tr><td>vocalic l</td><td style="text-align: center">  ḷ  </td></tr>
 <tr><td>vocalic L</td><td style="text-align: center">  Ḷ  </td></tr>
 <tr><td>long vocalic l</td><td style="text-align: center">  ḹ  </td></tr>
 <tr><td>velar n</td><td style="text-align: center">  ṅ  </td></tr>
 <tr><td>velar N</td><td style="text-align: center">  Ṅ  </td></tr>
 <tr><td>palatal n</td><td style="text-align: center">  ñ   </td></tr>
 <tr><td>palatal N</td><td style="text-align: center">  Ñ   </td></tr>
 <tr><td>retroflex t</td><td style="text-align: center">  ṭ  </td></tr>
 <tr><td>retroflex T</td><td style="text-align: center">  Ṭ  </td></tr>
 <tr><td>retroflex d</td><td style="text-align: center">  ḍ  </td></tr>
 <tr><td>retroflex D</td><td style="text-align: center">  Ḍ  </td></tr>
 <tr><td>retroflex n</td><td style="text-align: center">  ṇ  </td></tr>
 <tr><td>retroflex N</td><td style="text-align: center">  Ṇ  </td></tr>
 <tr><td>palatal s</td><td style="text-align: center">  ś   </td></tr>
 <tr><td>palatal S</td><td style="text-align: center">  Ś   </td></tr>
 <tr><td>retroflex s</td><td style="text-align: center">  ṣ  </td></tr>
 <tr><td>retroflex S</td><td style="text-align: center">  Ṣ  </td></tr>
 <tr><td>anusvara</td><td style="text-align: center">  ṃ  </td></tr>
 <tr><td>visarga</td><td style="text-align: center">  ḥ  </td></tr>
 <tr><td>long e</td><td style="text-align: center">  ē   </td></tr>
 <tr><td>long o</td><td style="text-align: center">  ō   </td></tr>
 <tr><td>l underbar</td><td style="text-align: center">  ḻ  </td></tr>
 <tr><td>r underbar</td><td style="text-align: center">  ṟ  </td></tr>
 <tr><td>n underbar</td><td style="text-align: center">  ṉ  </td></tr>
 <tr><td>k underbar</td><td style="text-align: center">  ḵ  </td></tr>
 <tr><td>t underbar</td><td style="text-align: center">  ṯ  </td></tr>
 </table>
 <br>
 <p>
 Unless indicated otherwise, accents have been dropped in order <br>
 to facilitate word search.<br>
 <br>
 For a comprehensive list of GRETIL encodings and formats see:<br>
 http://gretil.sub.uni-goettingen.de/gretil/gretdiac.pdf<br>
 and<br>
 http://gretil.sub.uni-goettingen.de/gretil/gretdias.pdf<br>
 <br>
 For further information see:<br>
 http://gretil.sub.uni-goettingen.de/gretil.htm</p>
 <br>
 <hr>
 <br>
 <br>
 <br>
<br>
<br>
<br>
<b>Śaiva: Gaṇapatitattva</b><br>
<br>
[GpT_01s-ab] gaṇapatiḥ śivam pṛcchad gaṅgomayoḥ siddhārthadaḥ /<br>
[GpT_01s-cd] devagaṇaguruḥ putraḥ śaktivīryālokaśriyai // 1 //<br>
<br>
[GpT_02s-ab] śvāso niḥśvāsaḥ samyoga ātmatrayam iti smṛtam /<br>
[GpT_02s-cd] triśivaṃ tripuruṣatvam aikātmya eva śūnyatā // 2 //<br>
<br>
[GpT_03s-ab] pratyāhāras tathā dhyānaṃ prāṇāyāmo 'tha dhāraṇaṃ /<br>
[GpT_03s-cd] tarkaś caiva samādhis tu ṣaḍaṅgam iti kathyate // 3 //<br>

from bs4 import BeautifulSoup

# Open read-only with an explicit encoding; the files declare charset utf-8.
with open("avs___u.htm", "r", encoding="utf-8") as file2:
    soup = BeautifulSoup(file2, "html.parser")

txt = soup.get_text()
# txt seems to keep the line breaks, but it is still one single string (printing it looks fine).

rows = soup.find_all("p")
for row in rows:  # print every <p> element
    print(row.get_text())
# I tried extracting only the <p> tags, but it turns out I cannot generalise this over all the files.
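
Since the `<p>` tags do not generalise, one alternative is to filter the plain text itself. In two of the samples above, every verse line starts with a bracketed reference such as `(AVŚ_1,1.1a)` or `[GpT_01s-ab]`. The sketch below keeps only such lines from a string like `txt`. The function name `verse_lines` and the regular expression are assumptions based on those two marker shapes only; files with a different reference style (for example the `HSvs_[n.n]nnn` scheme mentioned in the second sample) would need their own pattern.

```python
import re

# Matches a leading reference such as "(AVŚ_1,1.1a)" or "[GpT_01s-ab]".
# Assumption: a verse line starts with a bracketed token that contains "_".
REFERENCE = re.compile(r"^\s*[(\[][^()\[\]\s]*_[^()\[\]]*[)\]]\s*")


def verse_lines(text, strip_reference=True):
    """Keep only the lines that begin with a reference marker."""
    kept = []
    for line in text.splitlines():
        match = REFERENCE.match(line)
        if match is None:
            continue  # header, notice, table row, or blank line
        if strip_reference:
            kept.append(line[match.end():].strip())
        else:
            kept.append(line.strip())
    return kept
```

Calling `verse_lines(txt)` on the string returned by `soup.get_text()` would give a list of verse lines without their markers, and `"\n".join(...)` of that list can be written to a `.txt` file. Headings such as the bold section titles are dropped as well, which may or may not be desirable for training.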