Monday, August 6, 2018

Final Report for GSoC 2018 

This summer, I worked on two sub-projects for the RTEMS Project:

RTEMS Release Notes Generator (ticket: #3314)

Ticket’s link:

https://devel.rtems.org/ticket/3314

All the code for this project is in this GitHub repository:

https://github.com/dh0072/ReleaseNotesGenerator

Introduction: 

This project aims to automatically create the RTEMS release notes for a release from the Trac data, using the XML parser in Python. Every change on a release branch must have a ticket, and each ticket is assigned a Version and a Milestone, so the tickets of a milestone describe the release. Therefore, the ticket pages can be converted into a PDF that serves as the release notes.

Details for Completed tasks:

1. Provided a command “python2 release_notes.py --milestone_id XXX” to generate release notes in Markdown format directly from Trac
2. Generated a Markdown version of the release notes, “tickets.md”
3. Created a Python class “unicode_dict_reader.py” to make the program compatible with both Python 2.7 and Python 3.6
4. Created a Python class “tickets.py” to get all needed data from a ticket’s page and a milestone’s page
5. Created a Python class “markdown.py” to limit the maximum column width to 20 characters and the maximum number of columns to 4, so that the release notes fit on A4 paper
6. Fetched the needed information by parsing the XML page
7. Calculated the tickets’ statistics for the given milestone
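
As a rough illustration of tasks 5 and 7, here is a minimal sketch of counting tickets per status and rendering the result as a Markdown table with the column limits mentioned above. It is not the code from the repository: the function names, the ticket dictionaries and the use of `<br/>` to wrap long cells are all assumptions made for this example.

```python
from collections import Counter
import textwrap

MAX_COL_WIDTH = 20  # maximum characters per line in a cell (limit from the post)
MAX_COLUMNS = 4     # maximum number of table columns (limit from the post)

def wrap_cell(text, width=MAX_COL_WIDTH):
    # Hypothetical helper: break a cell into chunks of at most `width`
    # characters and join them with <br/> so they stay in one table cell.
    return "<br/>".join(textwrap.wrap(str(text), width=width))

def markdown_table(headers, rows):
    # Hypothetical helper: render a Markdown table, refusing too many columns.
    if len(headers) > MAX_COLUMNS:
        raise ValueError("at most %d columns are supported" % MAX_COLUMNS)
    lines = [
        "| " + " | ".join(wrap_cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(wrap_cell(c) for c in row) + " |")
    return "\n".join(lines)

def ticket_statistics(tickets):
    # Count how many tickets of the milestone are in each status.
    return Counter(ticket["status"] for ticket in tickets)

# Made-up tickets, only to show the shape of the data.
tickets = [
    {"id": 1, "status": "closed"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "assigned"},
]
stats = ticket_statistics(tickets)
table = markdown_table(["Status", "Count"], sorted(stats.items()))
print(table)
```

Keeping the limits in two constants makes it easy to adjust the table layout if the page size changes.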


RTEMS POSIX User Guide Generator (ticket: #3333)

Ticket’s link:

https://devel.rtems.org/ticket/3333

All the code for this project is in this GitHub repository:

https://github.com/dh0072/NewlibMarkup2SphinxConverter

Patch’s link: https://github.com/dh0072/GSoC2018GithubIO/blob/master/makedoc2rst_patch.md

Introduction: 

RTEMS uses the Newlib C Library for a significant portion of its POSIX support. Currently, the RTEMS POSIX Users Guide will not provide documentation for a method not based on Newlib's. Therefore, this project aims to automatically convert Newlib markup to Sphinx output and integrate it with the POSIX Users Guide.

Details for Completed tasks:

1. Created a command line interface (Python argument parser) to specify the path of the C source file and the path of the destination rst markup file
2. Provided rst utilities to register commands (FUNCTION, INDEX, SYNOPSIS, etc.) and their corresponding processor methods
3. Created a class (makedoc2rst) to parse C-style block comments in makedoc format in a C source file with regular expression matching
4. Extracted the commands and their corresponding text from the C source file
5. Generated rst markup based on the command type and saved it to the destination rst file
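
The steps above can be sketched roughly as follows. This is a simplified illustration and not the actual makedoc2rst code: the regular expressions, the processor functions and the generated rst layout are assumptions, and the sample comment is shortened.

```python
import re

# A C block comment: everything between /* and */ (non-greedy).
COMMENT_RE = re.compile(r"/\*(.*?)\*/", re.DOTALL)
# Assumption: a makedoc command is an upper-case word alone on a line.
COMMAND_RE = re.compile(r"^([A-Z_]+)[ \t]*$", re.MULTILINE)

def extract_commands(c_source):
    # Return (command, text) pairs found in the block comments.
    pairs = []
    for comment in COMMENT_RE.findall(c_source):
        # split() with one group gives [preamble, CMD1, text1, CMD2, text2, ...]
        parts = COMMAND_RE.split(comment)
        for command, text in zip(parts[1::2], parts[2::2]):
            pairs.append((command, text.strip()))
    return pairs

def function_to_rst(text):
    # Hypothetical processor: turn "<<name>>---summary" into an rst title.
    title = text.replace("<<", "").replace(">>", "").replace("---", " - ")
    return title + "\n" + "=" * len(title) + "\n"

def index_to_rst(text):
    # Hypothetical processor: emit a Sphinx index directive.
    return ".. index:: " + text + "\n"

def synopsis_to_rst(text):
    # Hypothetical processor: emit the synopsis as an indented C code block.
    body = "\n".join("    " + line.strip() for line in text.splitlines())
    return "**Synopsis**\n\n.. code-block:: c\n\n" + body + "\n"

# Registry that maps each command to its processor method.
PROCESSORS = {
    "FUNCTION": function_to_rst,
    "INDEX": index_to_rst,
    "SYNOPSIS": synopsis_to_rst,
}

def convert(c_source):
    # Generate rst for every registered command; unknown commands are skipped.
    chunks = []
    for command, text in extract_commands(c_source):
        processor = PROCESSORS.get(command)
        if processor is not None:
            chunks.append(processor(text))
    return "\n".join(chunks)

SAMPLE_C = """/*
FUNCTION
	<<strlen>>---character string length

INDEX
	strlen

SYNOPSIS
	#include <string.h>
	size_t strlen(const char *str);
*/
size_t strlen(const char *str);
"""
rst = convert(SAMPLE_C)
print(rst)
```

A registry dictionary keeps the parser independent of the individual commands, so supporting a new command only means adding one processor function.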


Others

Though it is almost the end of the GSoC project, I will keep an eye out for any feature requests and bug fixes related to these projects. I believe this is one of the spirits of open source: keep contributing to the project and always maintain the code!

I am grateful to have met my mentors! Thanks to their professionalism, patience and passion, I was able to finish these projects under their supervision and have such a fruitful summer! 😊

Monday, June 11, 2018

Design development of RTEMS release notes generator

Design development of the RTEMS release notes generator:

1. First, we need to figure out what our goal is. Our goal can be divided into a couple of tasks. For example, we need to include all needed data in the release notes, fix formatting issues, etc.

2. After our goal is divided into a couple of tasks, we need to figure out which one is the most essential. At this point, getting the needed data is the most important thing because some of it is missing from the release notes.


3. Third, we need to figure out the best solution for our problem. In general, there are two ways to get the needed data. One is to parse the HTML page, the other is to parse the XML in the RSS feed. In the end, we decided to use an XML parser to parse the XML. For more detailed reasons, please read my other post in this blog:

parse HTML page VS parse XML page

How do I feel about working in an open source project?

It is my first time working on an open source project, so everything is new and exciting for me. It has been almost a month since I started working on the RTEMS project. Definitely 5 stars to my mentors because they are very helpful. Other people are also always willing to help, which is really impressive. I understand now why helping each other is one of the most essential spirits of the open source community.
For a developer working in an open source community, all work is preferably done in public:
1. Email is public. Normally, email is preferred to be public. As a developer, I subscribe to two important mailing lists, user@rtems.org and devel@rtems.org. user@rtems.org is for communication between users and developers. For example, if a user has any questions, suggestions or comments, they can send an email to user@rtems.org, so as a developer I get feedback from users immediately. devel@rtems.org is used by developers to discuss technical questions, such as how to fix a bug.

2. All code is public. Since the code is pushed to GitHub, it is easy for my mentors and other people to review it. Also, if a user with a technical background is interested in how the project works, it is easy for them to get access to its source code. Notice: copyright belongs to a specific developer or an organization.
3. Code is supposed to be consistent. As part of a developer team, my code is supposed to be consistent with the code of other developers. For example, I should pay attention to details like white space and column limits. This not only makes the code more consistent and professional, but also more readable for later developers.

This is just the beginning of my work on an open source project, and I will keep going! 😊

Why do we need a release notes tool in the RTEMS project?

During my participation in Google Summer of Code this year, the first project I am working on is the RTEMS release notes generator. Why do we need a release notes tool in the RTEMS project?

1. Missing data in release notes. Currently, release notes can be generated regularly, but some essential data is missing from them. Therefore, we need to extract all of the needed data from a ticket’s page and put it in the release notes.

2. Formatting issues. Some data is not readable because of formatting issues. Therefore, one of my goals is to provide better formatting of the tickets. The date formatting is also not ideal. For example, the date is shown according to a local setting rather than as a full date.

The release notes tool fits into the RTEMS project quite well, because the release notes can now be generated automatically from the Trac data and include all of the needed data. Therefore, the release notes are not only more readable, but also contain more essential data.

Sunday, June 10, 2018

parse HTML page VS parse XML page


If a webpage also offers an RSS feed, there are two ways for us to get text data from it. One way is to parse the HTML page, the other is to parse the XML in the RSS feed.

At first, I used an external Python package called BeautifulSoup to extract data from the HTML page. Thanks to a reminder from my mentor, I realized that parsing HTML gives fragile results. Therefore, I decided to parse the XML in the RSS feed. It is also easier, because the XML in an RSS feed can be parsed with the XML parser in the Python standard library.


In general, there are two reasons for me to parse the XML in the RSS feed with an XML parser:


1. It is more stable to parse XML than to parse HTML. The webpage we are scraping might change frequently, and if we extract data from HTML, our code might no longer work once the web template is changed. When I used BeautifulSoup to extract data from HTML, I located the data by looking through the tags by eye. For example, I extracted a comment from <comment> blahblahblah </comment>. However, once the web template is changed, this comment "blahblahblah" might be stored in a new tag such as <others> blahblahblah </others>, and then my code does not work anymore. We do not need to worry about this problem when parsing the XML in the RSS feed, since an RSS feed is a stable file format.

2. It is more efficient. Once I had decided to parse XML rather than HTML, I needed a tool for it. The XML parser in the Python standard library is the best choice: it is efficient and user friendly. For example, I need to parse the yellow section of the ticket information for #2988 at:

https://devel.rtems.org/ticket/2988

Code is short:
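
Here is a minimal sketch of what such parsing can look like with `xml.etree.ElementTree` from the standard library. The feed content below is a made-up sample and not the real data of ticket #2988, the element names assume a plain RSS 2.0 layout, and `parse_ticket_feed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up RSS sample; a real ticket feed may contain more elements.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>#2988: Example ticket summary</title>
    <link>https://devel.rtems.org/ticket/2988</link>
    <item>
      <title>status changed</title>
      <pubDate>Mon, 11 Jun 2018 10:00:00 GMT</pubDate>
      <description>First comment</description>
    </item>
  </channel>
</rss>"""

def parse_ticket_feed(rss_text):
    # Parse the XML text and collect the ticket title, link and comments.
    channel = ET.fromstring(rss_text).find("channel")
    return {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
        "comments": [
            {
                "title": item.findtext("title"),
                "date": item.findtext("pubDate"),
                "text": item.findtext("description"),
            }
            for item in channel.findall("item")
        ],
    }

ticket = parse_ticket_feed(SAMPLE_RSS)
print(ticket["title"])
```

The lookups use element names of the feed format instead of the visual layout of the page, which is why a change of the web template does not break them.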




Wednesday, June 6, 2018

Use logging instead of print in python

Quick! Think of a way to output a message in Python! If you thought of print immediately, you are not the only one. It is true that print is the most popular way to output a message in Python, but using logging is actually better.

Advantages of using logging instead of print:


1. User friendly. A user can see when and where a log message comes from.


2. Easier to manage. All log messages can be formatted in one place.


3. Easier to differentiate. Log messages can be differentiated based on severity.
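
A minimal sketch of these three points with the standard `logging` module. The logger name, the format string and the messages are only examples, and `build_logger` is a hypothetical helper.

```python
import logging

def build_logger(name="release_notes", level=logging.INFO):
    # Create (or reuse) a named logger with one formatted stream handler.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.StreamHandler()
        # The format shows when (asctime) and where (name, lineno) a message
        # comes from, together with its severity (levelname).
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s:%(lineno)d %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = build_logger()
logger.debug("not shown, because the level is INFO")
logger.info("fetching milestone data")
logger.warning("ticket has no milestone")
```

Changing the level in one place, for example to `logging.DEBUG`, turns the detailed messages on without touching the rest of the code.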

Wednesday, May 30, 2018

BeautifulSoup - a STRONG search method to extract text from HTML page

I was stuck on writing an HTML parser for a while until I found BeautifulSoup and its search methods. Oh my goodness, it is awesome! It should be renamed beautifullife since it makes life easier.

I wrote a Python class to show how to use BeautifulSoup to extract the needed data, with my blog as an example.



# This file was created by Danxue (Dannie) Huang as an example of using BeautifulSoup
# The goal of this example is to get specified data from a website
# I use my blog as an example: let's say I want to get the title of my blog
# First of all, we need to import urllib to read a URL
import urllib.request
# Second, we need to import BeautifulSoup from the package bs4
from bs4 import BeautifulSoup

class example_bs():
    # Pass in an URL. I use my blog as an example here
    def parse_page(self, url):
        page_str = self._read_html_page(url)
        return BeautifulSoup(page_str, "html.parser")

    def _read_html_page(self, url, codec="utf-8"):
        url_response = urllib.request.urlopen(url)
        return url_response.read().decode(codec)

# Pass in a URL for testing; I use my blog as an example here
blog_url = "https://danxuehuang.blogspot.com/"
blog_data = example_bs()
blog = blog_data.parse_page(url=blog_url)

# Look at the HTML source code and find out where the data we need is located
blog_title = blog.find(name="div", attrs={"class": "titlewrapper"}).find(name="h1", attrs={"class": "title"})

# print the data we need
print(blog_title)

# print the plain text of data we need
print(blog_title.string)


Boom! Screenshot of output: