Target
Get all the titles and URLs of my blog posts.
Design
First, the crawler needs a controller (WebCrawler) to schedule the whole process. The controller sends the root URL to the urlManager, then checks whether the set contains any URLs. If the set is not empty, the crawler pops one URL from it, downloads the web page at that URL, and sends the content to the parser. The parser returns the data we want. As long as the set still has URLs, the crawler repeats these steps.
The urlManager collects all list-page URLs reachable from the root URL and adds them to new_urls (a set). When the controller pops a URL, the urlManager moves it from new_urls to old_urls (another set), so no page is crawled twice. The urlManager also provides the methods has_new_url and get_new_url.
The htmlDownLoader downloads the web page at a given URL.
The htmlParser helps us grab the useful information. We must tell it where that information lives on the page (located by tag name and by attributes such as the tag's id and class). The parser then returns what we want from the given page.
Finally, the crawler needs to present the result. Every time the parser returns any data, the htmlOutputer appends it to its list. When all the URLs in the set have been used, the htmlOutputer writes all the collected data into one HTML file. We can open the file in Firefox and check the result.
Structure
Code
WebCrawler.py
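The controller described in the Design section can be sketched roughly as below. The component class names and the `limit` safety cap are assumptions drawn from the description above, not the exact original code; the four components are passed in, so any objects with the same methods will work:

```python
class WebCrawler:
    """Schedules the whole crawl: manager -> downloader -> parser -> outputer."""

    def __init__(self, manager, downloader, parser, outputer):
        self.manager = manager        # tracks new and used URLs
        self.downloader = downloader  # fetches the page for a URL
        self.parser = parser          # extracts (title, url) data and new links
        self.outputer = outputer      # collects data and writes output.html

    def crawl(self, root_url, limit=100):
        # Seed the manager with the root URL, then loop until the set is empty.
        self.manager.add_new_url(root_url)
        crawled = 0
        while self.manager.has_new_url():
            url = self.manager.get_new_url()
            html_text = self.downloader.download(url)
            if html_text is None:
                continue  # skip pages that failed to download
            new_urls, data = self.parser.parse(url, html_text)
            self.manager.add_new_urls(new_urls)
            self.outputer.collect_data(data)
            crawled += 1
            if crawled >= limit:
                break  # safety cap so a mistake cannot crawl forever
        self.outputer.output_html()
```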
urlManager.py
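A minimal sketch of the URL manager, following the two-set design described above (method names match the design; details of the original file may differ):

```python
class UrlManager:
    """Keeps two sets: URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        # Ignore empty URLs and anything we have already seen, new or old.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from new_urls to old_urls so it is never crawled twice.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```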
htmlDownLoader.py
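A downloader sketch using only the standard library (`urllib.request`); the timeout and UTF-8 assumption are my choices, not necessarily those of the original file:

```python
from urllib.request import urlopen


class HtmlDownloader:
    """Fetches the raw HTML of a page; returns None on any failure."""

    def download(self, url):
        if url is None:
            return None
        try:
            with urlopen(url, timeout=10) as response:
                if response.getcode() != 200:
                    return None
                # Blogs are usually UTF-8; replace undecodable bytes
                # instead of crashing on a bad page.
                return response.read().decode("utf-8", errors="replace")
        except Exception:
            return None  # treat network/HTTP errors as "page unavailable"
```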
htmlParser.py
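A parser sketch showing the extraction half of the job, built on the standard-library `html.parser` rather than a third-party library. The `post_class` value is a placeholder: inspect your own blog's markup and pass the real class attribute of the post-title links:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects (title, url) pairs from <a> tags with a given class."""

    def __init__(self, page_url, post_class="postTitle"):  # placeholder class name
        super().__init__()
        self.page_url = page_url      # used to resolve relative hrefs
        self.post_class = post_class
        self.results = []             # list of (title, url) pairs
        self._in_link = False
        self._href = None
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.post_class in attrs.get("class", "").split():
            self._in_link = True
            self._href = urljoin(self.page_url, attrs.get("href", ""))
            self._title_parts = []

    def handle_data(self, data):
        if self._in_link:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self.results.append(("".join(self._title_parts).strip(), self._href))
```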
htmlOutputer.py
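An outputer sketch matching the description above: it accumulates the parser's data and, at the end, writes everything into one HTML file that can be opened in Firefox. The table layout is an assumption:

```python
import html


class HtmlOutputer:
    """Accumulates (title, url) pairs and writes them out as one HTML table."""

    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data:
            self.datas.extend(data)

    def output_html(self, path="output.html"):
        # Escape titles and URLs so odd characters cannot break the markup.
        rows = "\n".join(
            "<tr><td><a href='{0}'>{1}</a></td></tr>".format(
                html.escape(url, quote=True), html.escape(title))
            for title, url in self.datas)
        page = "<html><body><table>\n{0}\n</table></body></html>".format(rows)
        with open(path, "w", encoding="utf-8") as f:
            f.write(page)
        return page
```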
Output
Run WebCrawler.py in an IDE (here, PyCharm)
Open output.html in Firefox