Wednesday, May 30, 2018

BeautifulSoup - a STRONG search method to extract text from an HTML page

I was stuck on writing an HTML parser for a while until I found BeautifulSoup and its search methods. Oh my goodness, it is awesome! It should be renamed BeautifulLife, since it makes life easier.

I wrote a Python class to show how to use BeautifulSoup to extract the data you need, using my blog as an example.



# This file was created by Danxue (Dannie) Huang as an example of using BeautifulSoup
# The goal of this example is to get specified data from a website
# I use my blog as an example: let's say I want to get the title of my blog
# First of all, we need to import urllib.request to read a URL
import urllib.request
# Second, we need to import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

class example_bs():
    # Pass in a URL. I use my blog as an example here
    def parse_page(self, url):
        page_str = self._read_html_page(url)
        return BeautifulSoup(page_str, "html.parser")

    def _read_html_page(self, url, codec="utf-8"):
        url_response = urllib.request.urlopen(url)
        return url_response.read().decode(codec)

# Pass in a URL for testing; I use my blog as an example here
blog_url = "https://danxuehuang.blogspot.com/"
blog_data = example_bs()
blog = blog_data.parse_page(url=blog_url)

# Look at the HTML source of the page and find where the data we need is located
blog_title = blog.find(name="div", attrs={"class": "titlewrapper"}).find(name="h1", attrs={"class": "title"})

# print the data we need
print(blog_title)

# print the plain text of data we need
print(blog_title.string)
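
One thing to watch out for: find() returns None when nothing matches, and .string returns None when a tag has more than one child, so the chained call above can fail if the page layout changes. Here is a small sketch of a more defensive version that works on an HTML string, so it can be tried without a network connection. The SAMPLE_HTML snippet and the extract_title helper are made up for illustration; they only assume the same titlewrapper/title layout used above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the header layout assumed in this post.
SAMPLE_HTML = """
<div class="titlewrapper">
  <h1 class="title">RTEMS project - <b>GSoC</b> 2018</h1>
</div>
"""


def extract_title(html):
    """Return the title text, or None if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find(name="div", attrs={"class": "titlewrapper"})
    if wrapper is None:
        return None
    heading = wrapper.find(name="h1", attrs={"class": "title"})
    if heading is None:
        return None
    # get_text() joins the text of all nested nodes. Here .string would be
    # None, because this <h1> contains more than one child node.
    return heading.get_text().strip()


print(extract_title(SAMPLE_HTML))
print(extract_title("<p>no title here</p>"))
```

The first call prints the title text and the second prints None instead of raising an AttributeError.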


Boom! Screenshot of output:

RTEMS project - Google Summer of Code 2018

I just created a new blog using Blogger because I think it is more user-friendly than a GitHub page.

This blog is for the RTEMS project as part of GSoC 2018.
RTEMS is an open source Real-Time Executive for Multiprocessor Systems: https://www.rtems.org/
About the RTEMS projects that are part of GSoC 2018: https://devel.rtems.org/wiki/GSoC/2018#GoogleSummerofCode2018
I will work on two tickets:
  1. RTEMS Release Notes Generator (ticket: #3314): https://devel.rtems.org/ticket/3314
  2. RTEMS POSIX User Guide Generator (ticket: #3333): https://devel.rtems.org/ticket/3333