Python: \n in Beautiful Soup text


I'm trying to get the transcripts of YouTube videos for some NLP work. I think I can fetch them fine, but there are a few problems. For example:

from xml.etree import cElementTree as ET
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen

URL = 'http://video.google.com/timedtext?lang=en&v=KDHuWxy53uM'
def make_soup(url):
    html = urlopen(url).read()
    return bs(html, "lxml")

soup = make_soup(URL)
takeaways = soup.findAll('text')

All_text = []
for i in takeaways:
    root = ET.fromstring(str(i))
    reslist = list(root.iter())
    try:
        result = ' '.join([element.text for element in reslist])
    except:
        pass
    All_text.append(result)
Sample result for one of the lines:

'Let&#39;s learn a little bit\nabout the dot product.'
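This seems to get the text, but I also get \n, which is the XML return character, and I get an odd character sequence in place of the apostrophe, which I think is due to encoding.

Does anyone know how I can clean up both of these?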

&#39; is an HTML character reference for the apostrophe that has not been unescaped. In Python 3.4 and later, use:

import html
html.unescape(<your_string>) 

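\n is a newline. If you don't want the newlines, you have to replace them yourself; str.replace works in both Python 2 and Python 3, as described in the other answer.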
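The two clean-up steps can be combined in one small helper. This is a minimal Python 3 sketch; the name clean_caption is made up for illustration and is not part of the original code:

```python
import html


def clean_caption(raw):
    """Unescape HTML character references and flatten newlines to spaces."""
    # html.unescape turns references such as &#39; back into characters.
    text = html.unescape(raw)
    # Captions are wrapped with "\n"; join the pieces with a single space.
    return text.replace("\n", " ")


print(clean_caption("Let&#39;s learn a little bit\nabout the dot product."))
```

Unescaping first and replacing second keeps the helper safe even if a caption ever contained an escaped newline reference.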

Also, since you are parsing XML and have lxml installed, your code can be simplified to:

import lxml.etree as et
from HTMLParser import HTMLParser
unescape = HTMLParser().unescape

URL = 'http://video.google.com/timedtext?lang=en&v=KDHuWxy53uM'
tree = et.parse(URL)
print([unescape(t.replace("\n", " ")) for t in tree.xpath('//text/text()')])
This will give you:

[u"Let's learn a little bit about the dot product.", 'The dot product, frankly, out of the two ways of multiplying', 'vectors, I think is the easier one.', 'So what does the dot product do?', u"Why don't I give you the definition, and then I'll give", 'you an intuition.', u"So if I have two vectors; vector a dot vector b-- that's", 'how I draw my arrows.', 'I can draw my arrows like that.', 'That is equal to the magnitude of vector a times the', 'magnitude of vector b times cosine of the', 'angle between them.', 'Now where does this come from?', 'This might seem a little arbitrary, but I think with a', 'visual explanation, it will make a little bit more sense.', 'So let me draw, arbitrarily, these two vectors.', 'So that is my vector a-- nice big and fat vector.', u"It's good for showing the point.", 'And let me draw vector b like that.', 'Vector b.', 'And then let me draw the cosine, or let me, at least,', 'draw the angle between them.', 'This is theta.', u"So there's two ways of view this.", 'Let me label them.', 'This is vector a.', u"I'm trying to be color consistent.", 'This is vector b.', u"So there's two ways of viewing this product.", 'You could view it as vector a-- because multiplication is', 'associative, you could switch the order.', 'So this could also be written as, the magnitude of vector a', u"times cosine of theta, times-- and I'll do it in color", 'appropriate-- vector b.', 'And this times, this is the dot product.', u"I almost don't have to write it.", 'This is just regular multiplication, because these', 'are all scalar quantities.', u"When you see the dot between vectors, you're talking about", 'the vector dot product.', 'So if we were to just rearrange this expression this', 'way, what does it mean?', 'What is a cosine of theta?', 'Let me ask you a question.', 'If I were to drop a right angle, right here,', u"perpendicular to b-- so let's just drop a right angle", 'there-- cosine of theta soh-coh-toa so, cah cosine--', 'is equal to adjacent of 
a hypotenuse, right?', u"Well, what's the adjacent?", u"It's equal to this.", 'And the hypotenuse is equal to the magnitude of a, right?', 'Let me re-write that.', 'So cosine of theta-- and this applies to the a vector.', 'Cosine of theta of this angle is equal to ajacent, which', u"is-- I don't know what you could call this-- let's call", 'this the projection of a onto b.', u"It's like if you were to shine a light perpendicular to b--", 'if there was a light source here and the light was', 'straight down, it would be the shadow of a onto b.', 'Or you could almost think of it as the part of a that goes', 'in the same direction of b.', 'So this projection, they call it-- at least the way I get', 'the intuition of what a projection is, I kind of view', 'it as a shadow.', 'If you had a light source that came up perpendicular, what', 'would be the shadow of that vector on to this one?', 'So if you think about it, this shadow right here-- you could', 'call that, the projection of a onto b.', u"Or, I don't know.", u"Let's just call it, a sub b.", u"And it's the magnitude of it, right?", u"It's how much of vector a goes on vector b over-- that's the", 'adjacent side-- over the hypotenuse.', 'The hypotenuse is just the magnitude of vector a.', u"It's just our basic calculus.", 'Or another way you could view it, just multiply both sides', 'by the magnitude of vector a.', 'You get the projection of a onto b, which is just a fancy', 'way of saying, this side; the part of a that goes in the', 'same direction as b-- is another way to say it-- is', 'equal to just multiplying both sides times the magnitude of a', 'is equal to the magnitude of a, cosine of theta.', 'Which is exactly what we have up here.', 'And the definition of the dot product.', 'So another way of visualizing the dot product is, you could', 'replace this term with the magnitude of the projection of', 'a onto b-- which is just this-- times the', 'magnitude of b.', u"That's interesting.", u"All the dot product of 
two vectors is-- let's just take", 'one vector.', u"Let's figure out how much of that vector-- what component", u"of it's magnitude-- goes in the same direction as the", u"other vector, and let's just multiply them.", 'And where is that useful?', 'Well, think about it.', 'What about work?', 'When we learned work in physics?', 'Work is force times distance.', u"But it's not just the total force", 'times the total distance.', u"It's the force going in the same", 'direction as the distance.', u"You should review the physics playlist if you're watching", u"this within the calculus playlist. Let's say I have a", '10 newton object.', u"It's sitting on ice, so there's no friction.", u"We don't want to worry about fiction right now.", u"And let's say I pull on it.", u"Let's say my force vector-- This is my force vector.", u"Let's say my force vector is 100 newtons.", u"I'm making the numbers up.", '100 newtons.', u"And Let's say I slide it to the right, so my distance", 'vector is 10 meters parallel to the ground.', 'And the angle between them is equal to 60 degrees, which is', 'the same thing is pi over 3.', u"We'll stick to degrees.", u"It's a little bit more intuitive.", u"It's 60 degrees.", 'This distance right here is 10 meters.', 'So my question is, by pulling on this rope, or whatever, at', 'the 60 degree angle, with a force of 100 newtons, and', 'pulling this block to the right for 10 meters, how much', 'work am I doing?', 'Well, work is force times the distance, but not just the', 'total force.', 'The magnitude of the force in the direction of the distance.', u"So what's the magnitude of the force in the", 'direction of the distance?', 'It would be the horizontal component of this force', 'vector, right?', 'So it would be 100 newtons times the', 'cosine of 60 degrees.', 'It will tell you how much of that 100', 'newtons goes to the right.', 'Or another way you could view it if this', 'is the force vector.', 'And this down here is the distance vector.', 'You could 
say that the total work you performed is equal to', 'the force vector dot the distance vector, using the dot', 'product-- taking the dot product, to the force and the', 'distance factor.', 'And we know that the definition is the magnitude of', 'the force vector, which is 100 newtons, times the magnitude', 'of the distance vector, which is 10 meters, times the cosine', 'of the angle between them.', 'Cosine of the angle is 60 degrees.', u"So that's equal to 1,000 newton meters", 'times cosine of 60.', 'Cosine of 60 is what?', u"It's square root of 3 over 2.", 'Square root of 3 over 2, if I remember correctly.', 'So times the square root of 3 over 2.', 'So the 2 becomes 500.', 'So it becomes 500 square roots of 3 joules, whatever that is.', u"I don't know 700 something, I'm guessing.", u"Maybe it's 800 something.", u"I'm not quite sure.", 'But the important thing to realize is that the dot', 'product is useful.', 'It applies to work.', 'It actually calculates what component of what vector goes', 'in the other direction.', 'Now you could interpret it the other way.', 'You could say this is the magnitude of a', 'times b cosine of theta.', u"And that's completely valid.", u"And what's b cosine of theta?", 'Well, if you took b cosine of theta, and you could work this', u"out as an exercise for yourself, that's the amount of", u"the magnitude of the b vector that's", 'going in the a direction.', u"So it doesn't matter what order you go.", 'So when you take the cross product, it matters whether', 'you do a cross b, or b cross a.', u"But when you're doing the dot product, it doesn't matter", 'what order.', 'So b cosine theta would be the magnitude of vector b that', 'goes in the direction of a.', 'So if you were to draw a perpendicular line here, b', 'cosine theta would be this vector.', 'That would be b cosine theta.', 'The magnitude of b cosine theta.', 'So you could say how much of vector b goes in the same', 'direction as a?', 'Then multiply the two magnitudes.', 'Or you 
could say how much of vector a goes in the same', 'direction is vector b?', 'And then multiply the two magnitudes.', 'And now, this is, I think, a good time to just make sure', 'you understand the difference between the dot product and', 'the cross product.', 'The dot product ends up with just a number.', 'You multiply two vectors and all you have is a number.', 'You end up with just a scalar quantity.', 'And why is that interesting?', 'Well, it tells you how much do these-- you could almost say--', 'these vectors reinforce each other.', u"Because you're taking the parts of their magnitudes that", 'go in the same direction and multiplying them.', 'The cross product is actually almost the opposite.', u"You're taking their orthogonal components, right?", 'The difference was, this was a a sine of theta.', u"I don't want to mess you up this picture too much.", 'But you should review the cross product videos.', u"And I'll do another video where I actually compare and", 'contrast them.', u"But the cross product is, you're saying, let's multiply", 'the magnitudes of the vectors that are perpendicular to each', u"other, that aren't going in the same direction, that are", 'actually orthogonal to each other.', u"And then, you have to pick a direction since you're not", 'saying, well, the same direction that', u"they're both going in.", u"So you're picking the direction that's orthogonal to", 'both vectors.', u"And then, that's why the orientation matters and you", u"have to take the right hand rule, because there's actually", 'two vectors that are perpendicular to any other two', 'vectors in three dimensions.', u"Anyway, I'm all out of time.", u"I'll continue this, hopefully not too confusing, discussion", 'in the next video.', u"I'll compare and contrast the cross", 'product and the dot product.', 'See you in the next video.']
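The snippet above targets Python 2. In Python 3 the HTMLParser module became html.parser, and HTMLParser.unescape was deprecated in 3.4 and removed in 3.9, so html.unescape is the replacement. The following is a rough Python 3 equivalent using only the standard library. It parses an inline sample instead of the live URL; SAMPLE_XML and transcript_lines are illustrative names, and the sample assumes the feed double-escapes apostrophes, as the output in the question suggests:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical two-line sample shaped like the timed-text feed.
SAMPLE_XML = """<transcript>
<text start="0.5" dur="2.1">Let&amp;#39;s learn a little bit
about the dot product.</text>
<text start="2.6" dur="3.0">So what does the dot product do?</text>
</transcript>"""


def transcript_lines(xml_string):
    """Return one cleaned string per <text> element."""
    root = ET.fromstring(xml_string)
    lines = []
    for node in root.iter("text"):
        # The XML parser decodes &amp; once, leaving &#39; in the text,
        # so a second unescape pass is still needed.
        text = html.unescape(node.text or "")
        lines.append(text.replace("\n", " "))
    return lines


print(transcript_lines(SAMPLE_XML))
```

To run it against a real document, fetch the bytes with urllib.request.urlopen(url).read() and pass them to the same function.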

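On another note, your code will raise an error if result is not defined in the first iteration, and if the exception is hit in any later iteration you will append the previous result a second time. You need a continue where you have pass, and you should not use a blanket except; catch what you expect and print/log the error.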
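A sketch of the loop with those changes applied might look like this. It is written for Python 3, runs on plain strings rather than BeautifulSoup tags, and uses the made-up name extract_texts. It also assumes the failure worth catching is a parse error, and it skips elements whose text is None, since those would make the join fail:

```python
import logging
import xml.etree.ElementTree as ET


def extract_texts(fragments):
    """Join the text of every element in each XML fragment."""
    texts = []
    for fragment in fragments:
        try:
            root = ET.fromstring(fragment)
            # Skip elements with no text so join never sees None.
            result = " ".join(el.text for el in root.iter() if el.text)
        except ET.ParseError as exc:
            # Catch only the expected error, log it, and move on
            # instead of appending a stale or undefined result.
            logging.warning("skipping fragment %r: %s", fragment, exc)
            continue
        texts.append(result)
    return texts


print(extract_texts(["<text>first</text>", "<text>broken", "<text>second</text>"]))
```

Because of the continue, a malformed fragment contributes nothing to the list, rather than a duplicate of the previous entry.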

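Try result.encode('utf8') or result.decode('utf8'). Usually one of them works. Please let me know whether it does.

I would post this as a real answer to the question.

@乌尔吉尔丁根 I checked it, and no, that did not help.

You don't need to use BeautifulSoup. The "html" you are downloading is actually XML, so you only need an XML parser such as xml.etree.

Thanks a lot, Padraic and everyone! This is exactly what I needed. Always great to learn from you all!

Interesting that unescape is part of the HTMLParser class and sits in its source code. It seems odd that something so useful would be treated this way.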