Beautiful Soup Documentation
==============================

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.

This document illustrates all major features of Beautiful Soup version 3.0, with examples. It shows you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

Quick Start
----------------

Download Beautiful Soup from its home page. The changelog describes differences between 3.0 and earlier versions.

Include Beautiful Soup in your application with a line like one of the following:

.. code-block:: python

    from BeautifulSoup import BeautifulSoup          # For processing HTML
    from BeautifulSoup import BeautifulStoneSoup     # For processing XML
    import BeautifulSoup                             # To get everything

Here's some code demonstrating the basic features of Beautiful Soup. You can copy and paste this code into a Python session to run it yourself.

.. code-block:: python

    from BeautifulSoup import BeautifulSoup
    import re

    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

    print soup.prettify()
    # <html>
    #  <head>
    #   <title>
    #    Page title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="firstpara" align="center">
    #    This is paragraph
    #    <b>
    #     one
    #    </b>
    #    .
    #   </p>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>

Here are some ways to navigate the soup:

.. code-block:: python

    soup.contents[0].name
    # u'html'

    soup.contents[0].contents[0].name
    # u'head'

    head = soup.contents[0].contents[0]
    head.parent.name
    # u'html'

    head.next
    # <title>Page title</title>

    head.nextSibling.name
    # u'body'

    head.nextSibling.contents[0]
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

    head.nextSibling.contents[0].nextSibling
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

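
The navigation attributes used above (``contents``, ``parent``, and ``nextSibling``) are links between nodes of the parse tree. As a rough illustration only, here is a toy model of how those links relate to one another. ``ToyNode`` and its ``append`` method are hypothetical names invented for this sketch; they are not part of Beautiful Soup, and this is not how the library is implemented.

.. code-block:: python

    class ToyNode(object):
        """Hypothetical stand-in for a tag, reusing the same link names."""

        def __init__(self, name):
            self.name = name
            self.contents = []        # child nodes, in document order
            self.parent = None        # enclosing node, or None at the root
            self.nextSibling = None   # next node under the same parent

        def append(self, child):
            """Attach child as the last element of self.contents."""
            child.parent = self
            if self.contents:
                # The previous last child now has a following sibling.
                self.contents[-1].nextSibling = child
            self.contents.append(child)
            return child


    # Build the same html > (head, body) shape as the example document.
    html = ToyNode('html')
    head = html.append(ToyNode('head'))
    body = html.append(ToyNode('body'))

    # The relationships the navigation examples rely on.
    assert html.contents[0] is head
    assert head.parent is html
    assert head.nextSibling is body
    assert body.nextSibling is None

In this model a node's ``parent`` and ``nextSibling`` are set when the node is attached, so walking up or across the tree needs no searching.
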
Here are a couple of ways to search the soup for certain tags, or tags with certain properties:

.. code-block:: python

    titleTag = soup.html.head.title
    titleTag
    # <title>Page title</title>

    titleTag.string
    # u'Page title'

    len(soup('p'))
    # 2

    soup.findAll('p', align="center")
    # [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

    soup.find('p', align="center")
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

    soup('p', align="center")[0]['id']
    # u'firstpara'

    soup.find('p', align=re.compile('^b.*'))['id']
    # u'secondpara'

    soup.find('p').b.string
    # u'one'

    soup('p')[1].b.string
    # u'two'

It's easy to modify the soup:

.. code-block:: python

    titleTag['id'] = 'theTitle'
    titleTag.contents[0].replaceWith("New title")
    soup.html.head
    # <head><title id="theTitle">New title</title></head>

    soup.p.extract()
    soup.prettify()
    # <html>
    #  <head>
    #   <title id="theTitle">
    #    New title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>

    soup.p.replaceWith(soup.b)
    # <html>
    #  <head>
    #   <title id="theTitle">
    #    New title
    #   </title>
    #  </head>
    #  <body>
    #   <b>
    #    two
    #   </b>
    #  </body>
    # </html>

    soup.body.insert(0, "This page used to have ")
    soup.body.insert(2, " &lt;p&gt; tags!")
    soup.body
    # <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>

Here's a real-world example. It fetches the `ICC Commercial Crime Services weekly piracy report <http://www.icc-ccs.org/prc/piracyreport.php>`_, parses it with Beautiful Soup, and pulls out the piracy incidents:

.. code-block:: python

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
    soup = BeautifulSoup(page)
    for incident in soup('td', width="90%"):
        where, linebreak, what = incident.contents[:3]
        print where.strip()
        print what.strip()
        print

Parsing a Document
----------------------------

A Beautiful Soup constructor takes an XML or HTML document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory.

If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if there's something wrong with the document, Beautiful Soup uses heuristics to work out a reasonable structure for the parsed data.

.. _parsing_html:

Parsing HTML
~~~~~~~~~~~~~~~~~~

Use the ``BeautifulSoup`` class to parse an HTML document. Here are some of the things that ``BeautifulSoup`` knows:

- Some tags can be nested (``<BLOCKQUOTE>``) and some can't (``<P>``).
- Table and list tags have a natural nesting order. For instance, ``<TD>`` tags go inside ``<TR>`` tags, not the other way around.
- The contents of a