For note: Processing HTML files with Python

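1. First, fetch the HTML you want to process (you can of course also process an HTML file saved locally)
The urllib module handles URLs. The simplest usage is to open a URL with sock = urlopen(url) and then read its contents with sock.read(). (This is the Python 2 API; in Python 3 the function lives in urllib.request.)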

>>> import urllib
>>> sock = urllib.urlopen("http://www.baidu.com")
>>> sock.read()
'<html><head><meta http-equiv=Content-Type content="text/html;charset=gb2312"><title>\xb0\xd9\xb6\xc8\xd2\xbb\xcf\xc2\xa3\xac\xc4\xe3\xbe\xcd\xd6\xaa\xb5\xc0 ....
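For readers on Python 3, the fetch step can be sketched roughly as follows. The helper name read_html and its default encoding are assumptions made for illustration (gb2312 is taken from the meta tag in the transcript above); the helper also covers the local-file case mentioned in step 1.

```python
import urllib.request
from pathlib import Path


def read_html(source, encoding="gb2312"):
    """Hypothetical helper: return the HTML at a URL or local path as text."""
    if source.startswith(("http://", "https://")):
        # Python 3 spelling of the urlopen/read pair shown above
        with urllib.request.urlopen(source) as sock:
            raw = sock.read()
    else:
        # Anything else is treated as a path to a saved HTML file
        raw = Path(source).read_bytes()
    # read() returns bytes; decode with the charset the page declares
    return raw.decode(encoding, errors="replace")


# The escaped bytes inside <title> in the transcript are gb2312-encoded text
raw_title = b"\xb0\xd9\xb6\xc8\xd2\xbb\xcf\xc2\xa3\xac\xc4\xe3\xbe\xcd\xd6\xaa\xb5\xc0"
print(raw_title.decode("gb2312"))
```

Decoding explicitly matters because the parser classes work on text, and the page's bytes are not UTF-8.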

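2. Then pass the content read from the URL to a subclass of SGMLParser, a class in the sgmllib module
Python processes HTML with a subclass of the SGMLParser class from the sgmllib module. SGMLParser splits the HTML it reads into small pieces. From its point of view, HTML consists of 8 kinds of data: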
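1) Start tags
For example <html>, <body>, and so on. Each time it reads one, it calls a function named start_tagname; if no such function exists, it calls do_tagname; if that is missing too, it calls unknown_starttag. For example, on reading a body tag it looks for start_body, do_body and unknown_starttag, in that order. Each attribute name and its value form a tuple, and the list of all these tuples is passed to the function as its argument. unknown_starttag also receives the tag name as an argument.
2) End tags
For example </html>, </body>, and so on. Each time it reads one, it looks for end_tagname and then unknown_endtag to call. For example, on reading the end tag of body it looks for end_body and then unknown_endtag.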
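3) Character references
For data such as &#160; it calls handle_charref. The number in the reference, 160 in this example, is passed to the function as a string.
4) Entity references
For data such as &lt; it calls handle_entityref. The name in the reference, lt in this example, is passed to the function as the argument.
5) Comments
<!-- ... -->. It calls handle_comment.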
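6) Processing instructions
<? ... >. When one is found, it calls handle_pi.
7) Declarations
<! ... >. It calls handle_decl.
8) Other data
For all other data, it calls handle_data.
For kinds 5 through 8, the text is passed to the function unchanged as its argument.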
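sgmllib was removed in Python 3, so the lookup order described above cannot be run there directly. The sketch below is not SGMLParser itself: it imitates the start_tagname / do_tagname / unknown_starttag lookup on top of the standard html.parser.HTMLParser, whose handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl and handle_data callbacks carry the same names. The class DispatchingParser, its events list and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical class imitating SGMLParser's per-tag method lookup."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref in use
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Lookup order from the text: start_<tag>, do_<tag>, unknown_starttag
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Lookup order from the text: end_<tag>, then unknown_endtag
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def start_body(self, attrs):
        self.events.append(("start_body", attrs))

    def end_body(self):
        self.events.append(("end_body",))

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_starttag", tag, attrs))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_endtag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = DispatchingParser()
parser.feed('<!DOCTYPE html><?php x><body CLASS="main">'
            '<!-- note -->a&#160;&lt;b<p></p></body>')
parser.close()
for event in parser.events:
    print(event)
```

Because the class defines start_body and end_body, the body tag goes to those methods, while the p tag, which has no dedicated method, falls through to unknown_starttag and unknown_endtag. The upper-case CLASS attribute reaches start_body as 'class'.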

Here is a simple example:

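# Python 2 only: sgmllib was removed in Python 3
import sgmllib
import urllib

class test(sgmllib.SGMLParser):
    def reset(self):
        # Per-parser state would be initialised here, after the parent's reset
        sgmllib.SGMLParser.reset(self)

    def start_a(self, attrs):
        # Called for every <a> start tag; attrs is a list of (name, value) tuples
        print attrs

if __name__ == "__main__":
    sock = urllib.urlopen("http://www.baidu.com")
    html = sock.read()
    sock.close()
    t = test()
    t.feed(html)
    t.close()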

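The reset function plays the role of __init__ in an ordinary class; inside it you must call the reset of the parent class (SGMLParser).
The feed function passes the HTML that was read to the parser for analysis.
close works much like close on a file: it forces any data still buffered in the parser to be processed.
Note that a tag's attributes are passed as the argument to functions such as start_tagname. Also note that attribute names are always converted to lowercase automatically.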

Here is the output of this program:

~$ python sgmlparsertest.py
[('href', 'http://passport.baidu.com/?login&tpl=mn')]
[('onclick', 's(this)'), ('href', 'http://news.baidu.com')]
[('onclick', 's(this)'), ('href', 'http://tieba.baidu.com')]
[('onclick', 's(this)'), ('href', 'http://zhidao.baidu.com')]
[('onclick', 's(this)'), ('href', 'http://mp3.baidu.com')]
[('onclick', 's(this)'), ('href', 'http://image.baidu.com')]
[('onclick', 's(this)'), ('href', 'http://video.baidu.com')]
[('href', '/gaoji/preferences.html')]
[('href', '/gaoji/advanced.html')]
[('href', 'http://hi.baidu.com')]
[('href', 'http://www.hao123.com')]
[('href', '/more/')]
[('onclick', "this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')"), ('href', 'http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com')]
[('href', 'http://e.baidu.com')]
[('href', 'http://top.baidu.com')]
[('href', '/home.html')]
[('href', 'http://ir.baidu.com')]
[('href', 'http://www.baidu.com/duty/')]
[('href', 'http://www.miibeian.gov.cn'), ('target', '_blank')]
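A rough Python 3 counterpart of the example program, for comparison. It uses html.parser.HTMLParser rather than sgmllib, so there is no start_a method; the tag name is checked inside handle_starttag instead. The class name LinkCollector and the inline sample markup are assumptions made for illustration, and the sample replaces the live download so the sketch runs offline.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical class: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples with lower-cased names
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


# Made-up markup modelled on the attributes printed above
sample = ('<a HREF="http://news.baidu.com" onclick="s(this)">news</a>'
          '<a href="/more/">more</a><img src="logo.gif">')

collector = LinkCollector()
# feed() can be called repeatedly; a tag split across two chunks is buffered
collector.feed(sample[:20])
collector.feed(sample[20:])
collector.close()
print(collector.links)
```

Splitting the input in the middle of the first tag shows why close is called at the end: the parser holds incomplete markup until more data arrives or parsing is finished.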

Please credit the source when reposting: http://fornote.blogspot.com

2009-04-13
