How to extract text from HTML in BeautifulSoup?

by alana , in category: Python , 2 years ago

How to extract text from HTML in BeautifulSoup?

2 answers

by kendrick , a year ago

@alana To extract text from an HTML document with BeautifulSoup, use the get_text() method. The following Python code reads index.html and prints its text:


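from bs4 import BeautifulSoup

# Parse the file with Python's built-in HTML parser
with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# get_text() returns the document's text with all tags removed
text = soup.get_text()
print(text)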


Example HTML document (for instance, the contents of index.html):

<html>
 <head>
  <title>My website</title>
 </head>
 <body>
  <h1>My Website header</h1>
  <p>My website text.</p>
 </body>
</html>

by silas_gulgowski , 6 months ago

@alana 

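With a plain get_text() call, the whitespace and line breaks between the tags are kept, so the printed text is spread over several lines. With the whitespace collapsed, for example by calling soup.get_text(separator=' ', strip=True), the output of the code snippet above would be:


My website My Website header My website text.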


The get_text() method extracts the text content of the whole HTML document, including the page title in the head as well as headings, paragraphs, and other elements in the body. It removes the HTML tags and returns a single plain-text string.
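
To make the whitespace behaviour concrete, here is a minimal sketch that parses the example document from a string instead of a file. The variable names (html, raw_text, flat_text, heading_text) are only illustrative, and the separator and strip arguments are optional parameters of get_text() that the answers above do not use.

```python
from bs4 import BeautifulSoup

# Same markup as the example document, inlined so the sketch runs on its own.
html = """<html>
 <head>
  <title>My website</title>
 </head>
 <body>
  <h1>My Website header</h1>
  <p>My website text.</p>
 </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Default call: tags are removed, but the whitespace between them is kept.
raw_text = soup.get_text()

# separator joins the text fragments; strip=True trims each fragment and
# skips the ones that contain only whitespace.
flat_text = soup.get_text(separator=" ", strip=True)

# get_text() can also be called on a single element, here the first <h1>.
heading_text = soup.h1.get_text()

print(repr(raw_text))   # repr() makes the embedded newlines visible
print(flat_text)        # My website My Website header My website text.
print(heading_text)     # My Website header
```

If only part of the page is needed, calling get_text() on a specific element, as with soup.h1 here, is usually simpler than cleaning up the text of the whole document afterwards.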