`_
describes differences between 3.0 and earlier versions.
Include Beautiful Soup in your application with a line like one of the
following:

.. code-block:: python

   from BeautifulSoup import BeautifulSoup          # For processing HTML
   from BeautifulSoup import BeautifulStoneSoup     # For processing XML
   import BeautifulSoup                             # To get everything

Here's some code demonstrating the basic features of Beautiful
Soup. You can copy and paste this code into a Python session to run it
yourself.
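
.. code-block:: python

   from BeautifulSoup import BeautifulSoup
   import re

   doc = ['<html><head><title>Page title</title></head>',
          '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
          '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
          '</html>']
   soup = BeautifulSoup(''.join(doc))

   print soup.prettify()
   # <html>
   #  <head>
   #   <title>
   #    Page title
   #   </title>
   #  </head>
   #  <body>
   #   <p id="firstpara" align="center">
   #    This is paragraph
   #    <b>
   #     one
   #    </b>
   #    .
   #   </p>
   #   <p id="secondpara" align="blah">
   #    This is paragraph
   #    <b>
   #     two
   #    </b>
   #    .
   #   </p>
   #  </body>
   # </html>
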
Here are some ways to navigate the soup:

.. code-block:: python

   soup.contents[0].name
   # u'html'

   soup.contents[0].contents[0].name
   # u'head'

   head = soup.contents[0].contents[0]
   head.parent.name
   # u'html'

   head.next
   # <title>Page title</title>

   head.nextSibling.name
   # u'body'

   head.nextSibling.contents[0]
   # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

   head.nextSibling.contents[0].nextSibling
   # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

Here are a couple of ways to search the soup for certain tags, or tags with certain properties:

.. code-block:: python

   titleTag = soup.html.head.title
   titleTag
   # <title>Page title</title>

   titleTag.string
   # u'Page title'

   len(soup('p'))
   # 2

   soup.findAll('p', align="center")
   # [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

   soup.find('p', align="center")
   # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

   soup('p', align="center")[0]['id']
   # u'firstpara'

   soup.find('p', align=re.compile('^b.*'))['id']
   # u'secondpara'

   soup.find('p').b.string
   # u'one'

   soup('p')[1].b.string
   # u'two'

It's easy to modify the soup:
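
.. code-block:: python

   titleTag['id'] = 'theTitle'
   titleTag.contents[0].replaceWith("New title")
   soup.html.head
   # <head><title id="theTitle">New title</title></head>

   soup.p.extract()
   soup.prettify()
   # <html>
   #  <head>
   #   <title id="theTitle">
   #    New title
   #   </title>
   #  </head>
   #  <body>
   #   <p id="secondpara" align="blah">
   #    This is paragraph
   #    <b>
   #     two
   #    </b>
   #    .
   #   </p>
   #  </body>
   # </html>

   soup.p.replaceWith(soup.b)
   # <html>
   #  <head>
   #   <title id="theTitle">
   #    New title
   #   </title>
   #  </head>
   #  <body>
   #   <b>
   #    two
   #   </b>
   #  </body>
   # </html>

   soup.body.insert(0, "This page used to have ")
   soup.body.insert(2, " &lt;p&gt; tags!")
   soup.body
   # <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
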
Here's a real-world example. It fetches the `ICC Commercial Crime
Services weekly piracy report <http://www.icc-ccs.org/prc/piracyreport.php>`_,
parses it with Beautiful Soup, and pulls out the piracy incidents:

.. code-block:: python

   import urllib2
   from BeautifulSoup import BeautifulSoup

   page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
   soup = BeautifulSoup(page)
   for incident in soup('td', width="90%"):
       where, linebreak, what = incident.contents[:3]
       print where.strip()
       print what.strip()
       print

Parsing a Document
----------------------------
A Beautiful Soup constructor takes an XML or HTML document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory.
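
Accepting either a string or an open file-like object is a common duck-typing pattern. The following is a minimal sketch of that pattern only, written against the Python 3 standard library; ``read_markup`` is a hypothetical helper introduced for illustration and is not part of Beautiful Soup's API.

```python
import io


def read_markup(markup):
    """Return the document text, given a string or a file-like object.

    Hypothetical helper for illustration; not a Beautiful Soup function.
    """
    # Anything with a read() method is treated as a file-like object.
    if hasattr(markup, "read"):
        return markup.read()
    return markup


# Both calls yield the same text, so a parser built on top of this helper
# would not need to care which form the caller supplied.
from_string = read_markup("<p>Hello</p>")
from_file = read_markup(io.StringIO("<p>Hello</p>"))
```

With a helper like this, the rest of a parser only ever sees a plain string, which is why the two input forms can be used interchangeably.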

If you give Beautiful Soup a well-formed document, the parsed data structure looks just like the original document. But if there's something wrong with the document, Beautiful Soup uses heuristics to work out a reasonable structure for it.

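
To make the idea of a repair heuristic concrete, here is a toy sketch of one such rule: a tag that cannot be nested inside itself implicitly closes the open tag of the same name. It is built on Python 3's standard ``html.parser`` module, not on Beautiful Soup, and ``ToyTreeBuilder`` and ``NON_NESTABLE`` are hypothetical names chosen for this example. It should not be read as Beautiful Soup's actual implementation.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy tree builder (illustration only, not Beautiful Soup's code).

    Nodes are (name, children) tuples; text is stored as plain strings.
    """

    # Assumption for this sketch: only <p> is treated as non-nestable.
    NON_NESTABLE = {"p"}

    def __init__(self):
        super().__init__()
        self.root = ("[document]", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Heuristic: a new <p> implicitly closes a <p> that is still open.
        if tag in self.NON_NESTABLE:
            self._pop_to(tag)
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        self._pop_to(tag)

    def handle_data(self, data):
        self.stack[-1][1].append(data)

    def _pop_to(self, tag):
        # Close open tags up to and including the nearest one named `tag`.
        names = [node[0] for node in self.stack]
        if tag in names[1:]:
            index = len(names) - 1 - names[::-1].index(tag)
            del self.stack[index:]


builder = ToyTreeBuilder()
builder.feed("<p>one<p>two")  # neither <p> is ever closed
builder.close()
print(builder.root)
# ('[document]', [('p', ['one']), ('p', ['two'])])
```

The two paragraphs end up as siblings rather than nested, which is the kind of "reasonable structure" a heuristic parser aims for when closing tags are missing.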

.. _parsing_html:

Parsing HTML
~~~~~~~~~~~~

Use the ``BeautifulSoup`` class to parse an HTML document. Here are some
of the things that ``BeautifulSoup`` knows:

- Some tags can be nested (``<BLOCKQUOTE>``) and some can't (``<P>``).
- Table and list tags have a natural nesting order. For instance,
  ``<TD>`` tags go inside ``<TR>`` tags, not the other way around.
- The contents of a