BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<title>The Dormouses story</title>', "lxml")
tag = soup.title
print(tag)
>> <title>The Dormouses story</title>
A Tag has two important attributes: name and attrs.
soup = BeautifulSoup('<title>The Dormouses story</title>', "lxml")
tag = soup.title
print(tag.name)
>> title
# Changing the name attribute modifies the HTML document
soup = BeautifulSoup('<title>The Dormouses story</title>', "lxml")
tag = soup.title
tag.name = 'aaa'
print(tag)
>> <aaa>The Dormouses story</aaa>
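# A tag can have many attributes. Here the <b> tag has a "class" attribute whose value is "boldest".
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
print(tag.attrs)
>> {'class': ['boldest']}
# Tag attributes are accessed the same way as dictionary entries:
print(tag['class'])
>> ['boldest']
# Attributes can be modified
tag['class'] = 'verybold'
print(tag)
>> <b class="verybold">Extremely bold</b>
# Attributes can be added
tag['grade'] = 'first'
print(tag)
>> <b class="verybold" grade="first">Extremely bold</b>
# Attributes can be deleted
del tag['class']
print(tag)
>> <b grade="first">Extremely bold</b>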
2. NavigableString: BeautifulSoup uses the NavigableString class to wrap the strings contained in a tag.
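soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
print(tag.string)
>> Extremely bold
# The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():
tag.string.replace_with('change string')
print(tag)
>> <b class="boldest">change string</b>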
3. BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
print(soup.attrs)
>> {}
print(soup.name)
>> [document]
4. Comment: a special type of NavigableString object that represents a comment in the document.
soup = BeautifulSoup('<b><!--Hey, buddy. Want to buy a used parser?--></b>', 'lxml')
tag = soup.b
comment = tag.string
print(comment)
>> Hey, buddy. Want to buy a used parser?
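The four object kinds described above can be checked directly with type() and isinstance(). The following is a minimal sketch; the markup is a made-up example, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup for illustration: one tag holding text, one holding a comment.
markup = '<p class="intro">Hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p              # a Tag
text = soup.p.string      # the NavigableString inside <p>
comment = soup.b.string   # the Comment inside <b>

# Collect the class name of each object
kinds = {
    "soup": type(soup).__name__,
    "tag": type(tag).__name__,
    "text": type(text).__name__,
    "comment": type(comment).__name__,
}
print(kinds)

# Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag
print(isinstance(comment, NavigableString))
print(isinstance(soup, Tag))
```

Because a Comment is also a NavigableString, code that collects text with .string may want to test isinstance(node, Comment) first if comments should be skipped.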
Traversing the document tree:
(1) Tag names:
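from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a>Elsie</a>,
<a>Lacie</a> and
<a>Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)           # get a tag by its name
print(soup.title)
print(soup.a)              # when several tags share the name, only the first one is returned
print(soup.find_all('a'))  # find every tag named a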
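(2) .contents and .children
A tag's .contents attribute returns the tag's direct child nodes as a list. .children yields the same child nodes through an iterator instead of a list.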
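As a rough illustration of the difference, the sketch below reads the direct children of a tag both ways. The markup and variable names are made up for this example, and "html.parser" is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(markup, "html.parser")
ul = soup.ul

# .contents is a plain list of the direct children, so it supports len() and indexing
child_names = [child.name for child in ul.contents]
print(len(ul.contents), child_names)

# .children walks the same direct children one at a time
child_texts = [child.string for child in ul.children]
print(child_texts)
```

Both only look one level down: the text inside each li is a child of that li, not of the ul.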
(3) .descendants: iterates over all descendant nodes (children, grandchildren, and so on)
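To show how .descendants differs from .children, here is a small sketch that lists both for the same tag. The markup is invented for the example and parsed with "html.parser".

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, used only for illustration.
markup = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# .children stops at the first level below <div>
direct = [child.name for child in div.children]

# .descendants keeps going, visiting tags and strings in document order
all_nodes = [node.name if isinstance(node, Tag) else str(node)
             for node in div.descendants]

print(direct)
print(all_nodes)
```

Strings count as descendants too, which is why the second list mixes tag names with text.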