web-scraping - Python BeautifulSoup: replace newlines with a period and a space


I am using BeautifulSoup to scrape some links.

Here is the relevant part of the source of the URL I am scraping:


<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>



Here is my BeautifulSoup code (relevant part only), which gets the text inside the description tag:


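import sys
import urllib2  # Python 2 standard library

from bs4 import BeautifulSoup

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

# Find the <div class="description"> element and extract its text
description_box = soup.find('div', {'class': 'description'})
description = description_box.get_text(separator="").strip()
print description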



Running the script with python script.py https://example.com/page/2000 gives the following output:


Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 



How can I replace the line break with a period followed by a space, so that the output looks like this:


Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.



Any idea how I can do this?


Try this:


description = description_box.get_text(separator="").rstrip("\n")
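Note that str.rstrip only removes characters from the end of a string, so on its own it would probably leave the newline between the two sentences in place. A minimal sketch of that behaviour, where `text` is a made-up stand-in for whatever get_text() returns on the real page:

```python
# Hypothetical stand-in for the string returned by description_box.get_text()
text = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected.\n"
)

trimmed = text.rstrip("\n")

# Only the trailing newline is removed; the leading and interior ones remain.
print(repr(trimmed))
```

So the interior newline still has to be replaced separately, for example by splitting the text into lines and joining them again.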



Adapted from here:


from bs4 import BeautifulSoup

html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''

n = 2       # occurrence, i.e. the 2nd newline in this case
sep = '\n'  # separator, i.e. newline
cells = html.split(sep)

# Rejoin the pieces, putting a period and a space in place of the n-th newline
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)



Output:


Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
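The approach above puts the period at one fixed newline, so it depends on knowing where that newline is. If the number of lines can vary, a more general sketch is to work on the extracted text instead: drop empty lines, end each line except the last with a period, and join with a space. The helper name join_lines_as_sentences and the sample string are assumptions made for illustration, not part of the original code:

```python
def join_lines_as_sentences(text):
    """Join the non-empty lines of text into one line, sentence by sentence."""
    # Strip each line and drop the ones that are empty after stripping
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    parts = []
    for line in lines[:-1]:
        # Add a period unless the line already ends a sentence
        parts.append(line if line.endswith((".", "!", "?")) else line + ".")
    # The last line is kept as it is
    parts.extend(lines[-1:])
    return " ".join(parts)


# Hypothetical stand-in for the string returned by get_text()
sample = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected. \n"
)
result = join_lines_as_sentences(sample)
print(result)
```

With BeautifulSoup this would be called as join_lines_as_sentences(description_box.get_text()).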



Split the text into lines, then join them before parsing.


from bs4 import BeautifulSoup

htmldata = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>'''

# Strip each line and join the lines back together without a separator
htmldata = "".join(item.strip() for item in htmldata.split("\n"))
soup = BeautifulSoup(htmldata, 'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text)



Output:


Planet Nine was initially proposed to explain the clustering of orbitsOf Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.



Use split and join together with select_one:


from bs4 import BeautifulSoup as bs

html = '''
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>
'''

soup = bs(html, 'lxml')
# Replace each newline in the extracted text with a single space
text = ' '.join(soup.select_one('.description').text.split('\n'))
print(text)
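Joining with a plain space gives one line but no period between the sentences. One possible variation, sketched here with the standard re module, collapses each run of line breaks (and any whitespace around it) into a period and a space. The function name newlines_to_period_space and the sample string are assumptions for illustration:

```python
import re


def newlines_to_period_space(text):
    # Trim both ends, then replace every run of line breaks,
    # together with any whitespace around it, with ". "
    return re.sub(r"\s*\n\s*", ". ", text.strip())


# Hypothetical stand-in for soup.select_one('.description').text
sample = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n\n\n"
    "Of Planet Nine's other effects, one was unexpected. \n"
)
result = newlines_to_period_space(sample)
print(result)
```

One limitation of this sketch: a line that already ends with a period would end up with two, so it only suits text like the example, where the line before the break has no closing punctuation.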



...