Using BeautifulSoup to Scrape Websites

Torrey Betts / Wednesday, March 30, 2016

Introduction

Beautiful Soup is a powerful Python library for extracting data from XML and HTML files. It turns a confusing XML/HTML structure into an easily traversed Python object, so with only a few lines of code you can extract information from most websites or files. This blog post barely scratches the surface of what's possible with BeautifulSoup; be sure to visit the reference links at the bottom of this post to learn more.

Installing BeautifulSoup

If you're using a Debian-based distribution of Linux, BeautifulSoup can be installed by executing the following command.

$ apt-get install python-bs4

If you're unable to use the Debian system package manager, you can install BeautifulSoup using easy_install or pip.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

If you can't install using any of the preceding methods, it's possible to download the source tarball and install with setup.py.

$ python setup.py install

To learn more about installing or any possible errors that could occur, visit the BeautifulSoup site.
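
Whichever method you use, a quick way to confirm the install worked is to import the package and print its version. This is only a minimal sketch; the variable name installed_version is my own.

```python
import bs4

# If this import fails, the package is not installed for this interpreter.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```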

Your First Soup Object

The soup object is the most used object in the BeautifulSoup library as it will house the entire HTML/XML structure that you'll query information from. Creating this object requires 2 lines of code.

html = urlopen("http://www.infragistics.com")
soup = BeautifulSoup(html.read(), 'html.parser')

Taking this one step further, we'll use the soup object to print out the page's H1 tag.

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.infragistics.com")
soup = BeautifulSoup(html.read(), 'html.parser')
print(soup.h1.get_text())

Outputs:

Experience Matters
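
If you'd like to experiment without making a network request, the soup object can also be built from a plain string. The sketch below assumes a small, made-up HTML document (sample_html is a stand-in of my own, not the real page) and reads its H1 tag the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Experience Matters</h1>
    <p>Hello, soup.</p>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() returns only the text inside the tag.
heading = soup.h1.get_text()
print(heading)
```

Working from a string like this is also handy for trying out queries before pointing the script at a live site.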

Querying the Soup Object

BeautifulSoup has multiple ways to navigate or query the document structure.

  • find(tag, attributes, recursive, text, keywords)
  • findAll(tag, attributes, recursive, text, limit, keywords)
  • navigation using tags

find Method

This method looks through the document and retrieves the first item that matches the provided filters. If the method can't find what you've searched for, None is returned. One example would be searching for the title of the page.

page_title = soup.find("title")

The page_title variable now contains the page title wrapped in its title tag. Another example would be searching the page for an element with a specific id.

element_result = soup.find(id="theid")

The element_result variable now contains the HTML element that matched the query for id, "theid".
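
To see both lookups and the None case together, here is a small self-contained sketch. The markup in sample_html and the variable named missing are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <div id="theid">First match</div>
    <div>Second div</div>
  </body>
</html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.find("title")         # the whole <title> tag
element_result = soup.find(id="theid")  # the first element with this id
missing = soup.find("table")            # there is no <table> in the sample

print(page_title)
print(element_result.get_text())

# find returns None when nothing matches, so check before using the result.
if missing is None:
    print("No table found")
```

Calling a method such as get_text() on a None result raises an AttributeError, which is why the check at the end is worth having in real scripts.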

findAll Method

This method looks through the tag's descendants and retrieves all descendants that match the provided filters. If the method can't find what you've searched for, an empty list is returned. The simplest example would be searching for all hyperlinks on a page.

results = soup.findAll("a")

The variable results now contains a list of all hyperlinks found on the page. Another example might be that you want to find only the hyperlinks on a page that use a specific class name.

results = soup.findAll("a", "highlighted")

The variable results now contains a list of all hyperlinks found on the page that use the class name "highlighted". Searching for tags by their id is very similar and can be done in multiple ways. Below I'll demonstrate 2 different ways.

results = soup.findAll("a", id="theid")
results = soup.findAll(id="theid")
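
The queries above can be tried against a small made-up document. This sketch uses find_all, which is the current Beautiful Soup 4 spelling of findAll. The markup and the variable names are my own assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<ul>
  <li><a href="/one" class="highlighted">One</a></li>
  <li><a href="/two">Two</a></li>
  <li><a href="/three" class="highlighted big" id="theid">Three</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_links = soup.find_all("a")                   # every hyperlink
highlighted = soup.find_all("a", "highlighted")  # hyperlinks with this class
by_id = soup.find_all("a", id="theid")           # hyperlinks with this id
first_two = soup.find_all("a", limit=2)          # stop after two matches
tables = soup.find_all("table")                  # no match gives an empty list

# Each result is a tag, so attributes can be read like dictionary keys.
hrefs = [link["href"] for link in highlighted]
print(hrefs)
print(len(all_links), len(first_two), len(tables))
```

Note that a tag with several classes, like the third link here, still matches a search for just one of them.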

Navigation using Tags

To understand how navigation using tags would work, imagine that the HTML structure is mapped like a tree.

  • html
      • head
          • title
          • meta
          • link
          • script
      • body
          • h1
          • div.content
          • and so on...

Using this reference along with the page's source, if we wanted to print the page title, the code would look like this.

print(soup.head.title)

Outputs:

 <title>Developer Controls and Design Tools - .Net Components & Controls</title> 
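
Tag navigation can be chained as deep as the tree goes. Here is a minimal sketch against a made-up document shaped like the tree above. The markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the tree described above.
sample_html = """
<html>
  <head><title>Sample Page</title><meta charset="utf-8"></head>
  <body>
    <h1>Heading</h1>
    <div class="content"><p>Body text</p></div>
  </body>
</html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.head.title           # walk down: html -> head -> title
parent_name = title_tag.parent.name   # walk back up one level
content_class = soup.body.div["class"]
paragraph = soup.body.div.p.get_text()
no_table = soup.body.table            # a tag that isn't there gives None

print(title_tag.string)
print(parent_name, content_class, paragraph)
```

Each step returns the first matching child tag, so this style works best when you know the page layout and only need one element.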

Scraping a Website

Using what was learned in the previous sections, we're now going to scrape the definition from an Urban Dictionary page. The Python script reads a comma-separated list of words to define from the -w command line argument. When scraping the definition from the page, we use BeautifulSoup to search the page for a div tag that has the class name "meaning".

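import sys, getopt
# Python 3 imports; on Python 2 both names come from urllib
from urllib.parse import quote
from urllib.request import urlopen
from bs4 import BeautifulSoup

def main(argv):
    words = []
    rootUrl = 'http://www.urbandictionary.com/define.php?term='
    usageText = sys.argv[0] + ' -w <word1>,<word2>,<word3>.....'

    # Parse the command line; show usage and exit if it is empty or invalid
    try:
        if len(argv) == 0:
            print(usageText)
            sys.exit(2)
        opts, args = getopt.getopt(argv, "w:v")
    except getopt.GetoptError:
        print(usageText)
        sys.exit(2)

    # -w takes a comma-separated list; a set drops duplicate words
    for opt, arg in opts:
        if opt == "-w":
            words = set(arg.split(","))

    for word in words:
        # Download and parse the definition page for this word
        wordUrl = rootUrl + quote(word)
        html = urlopen(wordUrl)
        soup = BeautifulSoup(html.read(), 'html.parser')
        # The definition text lives in div tags with the class "meaning"
        meaning = soup.findAll("div", "meaning")
        if meaning:
            print(word + " -- " + meaning[0].get_text().replace("\n", ""))
        else:
            print(word + " -- no definition found")

if __name__ == "__main__":
    main(sys.argv[1:])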

Outputs:

$ python urbandict.py -w programming
programming -- The art of turning caffeine into Error Messages.
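
Page layouts change over time, so it can help to keep the parsing step separate from the download and to handle the case where the expected div is missing. The sketch below is one possible way to do that, not part of the script above. The function name extract_meaning, the sample markup, and the "def-panel" class are all made up for illustration.

```python
from bs4 import BeautifulSoup

def extract_meaning(html_text):
    """Return the text of the first div with class "meaning", or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    meaning = soup.find("div", "meaning")
    if meaning is None:
        return None
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join(meaning.get_text().split())

# Hypothetical markup standing in for a downloaded definition page.
sample_page = """
<div class="def-panel">
  <div class="meaning">
    The art of turning caffeine
    into Error Messages.
  </div>
</div>
"""

definition = extract_meaning(sample_page)
not_found = extract_meaning("<p>nothing here</p>")
print(definition)
print(not_found)
```

Joining on whitespace is a design choice here: unlike replace("\n", ""), it keeps words from running together when the definition spans several lines.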

References

The reference links below are related to this blog post. If you're interested in more information about using BeautifulSoup, a great resource is the Web Scraping with Python book.

BeautifulSoup: Installing BeautifulSoup, Kinds of Objects, find, findAll

easy_install: Installing easy_install

pip: Installing pip
