Introduction
Beautiful Soup is a powerful Python library for extracting data from HTML and XML files. It transforms a confusing HTML/XML structure into an easily traversed Python object, so with only a few lines of code you can extract information from most websites or files. This blog post will barely scratch the surface of what's possible with BeautifulSoup; be sure to visit the reference links at the bottom of this post to learn more.
Installing BeautifulSoup
If you're using a Debian-based distribution of Linux, BeautifulSoup can be installed by executing the following command (on older releases still running Python 2, the package is named python-bs4).
$ apt-get install python3-bs4
If you're unable to use the Debian system package manager, you can install BeautifulSoup using easy_install or pip.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
If you can't install using any of the preceding methods, it's possible to download the source tarball and install with setup.py.
$ python setup.py install
To learn more about installing or any possible errors that could occur, visit the BeautifulSoup site.
Your First Soup Object
The soup object is the most used object in the BeautifulSoup library, as it houses the entire HTML/XML structure that you'll query information from. Creating this object requires two lines of code: one to import the library and one to construct the object.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html.read(), 'html.parser')
Taking this one step further, we'll use the soup object to print out the page's h1 tag.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.infragistics.com")
soup = BeautifulSoup(html.read(), 'html.parser')
print(soup.h1.get_text())
Outputs:
Experience Matters
Querying the Soup Object
BeautifulSoup has multiple ways to navigate or query the document structure.
- find(tag, attributes, recursive, text, keywords)
- findAll(tag, attributes, recursive, text, limit, keywords)
- navigation using tags
find Method
This method looks through the document and retrieves the first single item that matches the provided filters. If the method can't find what you've searched for, None is returned. For example, say you want to retrieve the title of the page.
page_title = soup.find("title")
The page_title variable now contains the page title wrapped in its title tag. Another example would be searching the page for a specific tag id.
element_result = soup.find(id="theid")
The element_result variable now contains the HTML element that matched the query for id, "theid".
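The two filters can also be combined: a tag name plus an attributes dictionary narrows the search to one specific element. Here is a minimal sketch against a small, hypothetical HTML snippet (the markup and id are invented for illustration); it also shows the None return value mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used purely for illustration
html_doc = """
<html><body>
<h1 id="theid">Experience Matters</h1>
<a class="highlighted" href="/products">Products</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Combine a tag name with an attributes dictionary
heading = soup.find("h1", {"id": "theid"})
print(heading.get_text())   # Experience Matters

# find() returns None on no match, so guard before calling
# methods on the result
table = soup.find("table")
print(table is None)        # True
```

Guarding against None matters in practice: calling get_text() on a missing element raises AttributeError rather than failing quietly.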
findAll Method
This method looks through the tag's descendants and retrieves all descendants that match the provided filters. If the method can't find what you've searched for, an empty list is returned. The simplest usage would be searching for all hyperlinks on a page.
results = soup.findAll("a")
The variable results now contains a list of all hyperlinks found on the page. Another example might be you want to find all hyperlinks on a page, but they are using a specific class name.
results = soup.findAll("a", "highlighted")
The variable results now contains a list of all hyperlinks found on the page that reference the class name "highlighted". Searching for tags by their id is very similar and can be done in multiple ways; below I'll demonstrate 2 different ways.
results = soup.findAll(id="theid")
results = soup.findAll(attrs={"id": "theid"})
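The signature listed earlier also includes a limit parameter, which caps how many matches findAll returns. A minimal sketch against a hypothetical snippet (the links and class name are invented for illustration) ties the variations together:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used purely for illustration
html_doc = """
<html><body>
<a class="highlighted" href="/a">One</a>
<a href="/b">Two</a>
<a class="highlighted" href="/c">Three</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# All hyperlinks on the page
all_links = soup.findAll("a")
print(len(all_links))                       # 3

# Only hyperlinks with the class name "highlighted"
highlighted = soup.findAll("a", "highlighted")
print([a.get_text() for a in highlighted])  # ['One', 'Three']

# The limit keyword caps how many matches are returned
first_two = soup.findAll("a", limit=2)
print(len(first_two))                       # 2
```

Note that findAll with limit=1 behaves like find, except it returns a one-element list instead of the element itself.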
Navigation using Tags
To understand how navigation using tags would work, imagine that the HTML structure is mapped like a tree.
- html
- -> head
- -> -> title
- -> -> meta
- -> -> link
- -> -> script
- -> body
- -> -> h1
- -> -> div.content
- and so on...
Using this reference along with a page's source, if we wanted to print the page title the code would look like this.
print(soup.head.title)
Outputs:
<title>Developer Controls and Design Tools - .Net Components & Controls</title>
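Each attribute access in a chain like soup.head.title steps one level down the tree. A minimal sketch against a hypothetical page mirroring the tree above (the markup is invented for illustration, since the live page may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical page structure mirroring the tree above
html_doc = """
<html>
<head><title>Developer Controls and Design Tools</title></head>
<body><h1>Experience Matters</h1><div class="content">Intro</div></body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Each attribute access steps one level down the tree
print(soup.head.title.get_text())   # Developer Controls and Design Tools
print(soup.body.h1.get_text())      # Experience Matters

# Tags support dictionary-style access to their attributes
print(soup.html.body.div["class"])  # ['content']
```

Keep in mind that a missing tag anywhere in the chain yields None, so the next attribute access raises AttributeError; for anything beyond quick lookups, find is the safer choice.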
Scraping a Website
Using what was learned in the previous sections, we're now going to apply that knowledge to scrape the definition from an Urban Dictionary page. The Python script looks for comma-separated command line arguments naming the words to define. When scraping the definition from the page, we use BeautifulSoup to search for a div tag with the class name "meaning".
import sys
import getopt
from urllib.request import urlopen
from bs4 import BeautifulSoup

def main(argv):
    words = []
    rootUrl = 'http://www.urbandictionary.com/define.php?term='
    usageText = sys.argv[0] + ' -w <word1>,<word2>,<word3>.....'
    try:
        if len(argv) == 0:
            print(usageText)
            sys.exit(2)
        opts, args = getopt.getopt(argv, "w:v")
    except getopt.GetoptError:
        print(usageText)
        sys.exit(2)
    for opt, arg in opts:
        if opt == "-w":
            words = set(arg.split(","))
    for word in words:
        wordUrl = rootUrl + word
        html = urlopen(wordUrl)
        soup = BeautifulSoup(html.read(), 'html.parser')
        meaning = soup.findAll("div", "meaning")
        print(word + " -- " + meaning[0].get_text().replace("\n", ""))

if __name__ == "__main__":
    main(sys.argv[1:])
$ python urbandict.py -w programming
Outputs:
programming -- The art of turning caffeine into Error Messages.
References
The reference links below are related to this blog post. If you're interested in more information about using BeautifulSoup a great resource is the Web Scraping with Python book.
BeautifulSoup: Installing BeautifulSoup, Kinds of Objects, find, findAll
easy_install: Installing easy_install
pip: Installing pip