BeautifulSoup - a STRONG search tool to extract text from an HTML page

I was stuck writing an HTML parser for a while until I found BeautifulSoup, a library for searching parsed HTML. Oh my goodness, it is awesome! It should be renamed beautifullife since it makes life easier.

I wrote a Python class to show how to use BeautifulSoup to extract the data you need; my blog is used as an example.
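Before the full script, here is a minimal offline sketch of the same kind of lookup, assuming the bs4 package is installed. The HTML snippet and the names `sample_html` and `find_title` are made up for illustration; they only mimic the div/h1 structure the script searches for. Since `find()` returns None when nothing matches, the helper checks each step instead of chaining the calls blindly.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, shaped like the part of a blog page that holds its title.
sample_html = """
<div class="titlewrapper">
  <h1 class="title">My Example Blog</h1>
</div>
"""


def find_title(soup):
    """Return the title text, or None if the expected tags are missing."""
    wrapper = soup.find(name="div", attrs={"class": "titlewrapper"})
    if wrapper is None:
        return None
    title = wrapper.find(name="h1", attrs={"class": "title"})
    if title is None:
        return None
    # get_text(strip=True) returns the tag's text without surrounding whitespace.
    return title.get_text(strip=True)


soup = BeautifulSoup(sample_html, "html.parser")
print(find_title(soup))

# A page without the expected tags gives None instead of raising an error.
empty_soup = BeautifulSoup("<p>no title here</p>", "html.parser")
print(find_title(empty_soup))
```

The first call prints the title text and the second prints None, so a caller can tell "title not found" apart from a crash.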

# This file was created by Danxue (Dannie) Huang as an example of using BeautifulSoup

# The goal of this example is to get specified data from a website

# I use my blog as an example: let's say I want to get the title of my blog

# First of all, we need to import urllib.request to read a URL

import urllib.request

# Second, we need to import BeautifulSoup from the bs4 package

from bs4 import BeautifulSoup

class example_bs():

    # Pass in a URL. I use my blog as an example here

    def parse_page(self, url):
        page_str = self._read_html_page(url)
        return BeautifulSoup(page_str, "html.parser")

    def _read_html_page(self, url, codec="utf-8"):
        url_response = urllib.request.urlopen(url)
        return url_response.read().decode(codec)

# Pass in a URL to test with; I use my blog as an example here

blog_url = "https://danxuehuang.blogspot.com/"
blog_data = example_bs()
blog = blog_data.parse_page(url=blog_url)

# Look at the HTML source of the page and find where the data we need is located

blog_title = blog.find(name="div", attrs={"class": "titlewrapper"}).find(name="h1", attrs={"class": "title"})

# print the data we need

print(blog_title)

# print the plain text of data we need

print(blog_title.string)

Boom! Screenshot of output:

RTEMS Release Notes Generator & RTEMS POSIX User Guide Generator
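
The script decodes every page as UTF-8. Here is a minimal sketch of a more forgiving reader, assuming the server declares its charset in the Content-Type header. The function name `read_html_page`, the `fallback_codec` parameter, and the `data:` URL are illustrative assumptions and not part of the class above; the `data:` URL only stands in for a real page so the sketch runs without a network connection.

```python
import urllib.request
from urllib.parse import quote


def read_html_page(url, fallback_codec="utf-8"):
    """Fetch a URL and decode it with the charset the response declares."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type header
        # and returns None when the header does not declare one.
        codec = response.headers.get_content_charset() or fallback_codec
        # errors="replace" keeps going if a few bytes do not match the codec.
        return response.read().decode(codec, errors="replace")


# Hypothetical page content, percent-encoded into a data: URL for offline use.
sample_url = "data:text/html;charset=utf-8," + quote("<h1 class='title'>Café</h1>")
page_str = read_html_page(sample_url)
print(page_str)
```

The returned string can be handed to `BeautifulSoup(page_str, "html.parser")` exactly as in `parse_page`.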

Wednesday, May 30, 2018
