
Web Scraping in Python


Web Scraping Basics

What is web scraping all about? Consider the following scenario:
Imagine that one day, out of the blue, you find yourself thinking “Gee, I wonder who the five most popular mathematicians are?”
You do a bit of thinking, and you get the idea to use Wikipedia’s XTools to measure the popularity of a mathematician by equating popularity with page views. For example, look at the page on Henri Poincaré. There you can see that Poincaré’s pageviews for the last 60 days are, as of December 2017, around 32,000.
Next, you Google for “famous mathematicians” and find this resource which lists 100 names. Now you have both a page listing mathematicians’ names and you have a website that provides information about how “popular” each mathematician is. Now what?
This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.
In this tutorial, you will be writing a Python program that downloads the list of 100 mathematicians and their XTools pages, selects data about their popularity, and finishes by telling us the all-time top 5 most popular mathematicians! Let’s get started.

Setting Up Your Python Web Scraper

You will be using Python 3 and Python virtual environments throughout the tutorial. Feel free to set things up however you like, but here is how I tend to do it:
$ python3 -m venv venv
$ . ./venv/bin/activate
You will only need to install two packages: requests, for making your HTTP requests, and BeautifulSoup4, for handling all of your HTML processing.
Let’s install these dependencies with pip:
$ pip install requests BeautifulSoup4
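If you want to be sure both packages installed correctly, a quick import check will confirm it (the version numbers printed will vary from machine to machine):
$ python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"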
Finally, if you want to follow along, fire up your favorite text editor and create a file called mathematicians.py. Get started by including these import statements at the top:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Making Web Requests

Your first task will be to download web pages. The requests package comes to the rescue. requests aims to be an easy to use tool for doing all things HTTP in Python, and it doesn’t disappoint. In this tutorial, you will only need the requests.get function, but you should definitely check out the full documentation when you want to go further.
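If requests is new to you, here is a minimal sketch of a bare GET request before you wrap it in helper functions (the URL and the exact header value shown are illustrative):
>>> from requests import get
>>> resp = get('https://realpython.com/blog/')
>>> resp.status_code
200
>>> resp.headers['Content-Type']
'text/html; charset=utf-8'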
First your function:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return
    the raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() so a missing Content-Type header cannot raise a KeyError
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200
            and content_type is not None
            and content_type.lower().find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
The simple_get function accepts a single url argument. It then makes a GET request to that url. If nothing goes wrong, you end up with the raw HTML content for the page you requested. If there were any problems with your request (like the url is bad or the remote server is down) then your function returns None.
You may have noticed the use of the closing function in your definition of simple_get. The closing function ensures that any network resources are freed when they go out of scope in that with block. Using closing like that is good practice and helps to prevent fatal errors and network timeouts.
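If you are curious what closing actually does, it is roughly equivalent to the following try/finally pattern (a sketch of the behavior, not the library’s source code):
# Roughly what `with closing(get(url, stream=True)) as resp:` does:
resp = get(url, stream=True)
try:
    # ... work with resp ...
    pass
finally:
    resp.close()  # closing() guarantees this runs, even on error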
You can test simple_get like this:
>>> from mathematicians import simple_get
>>> raw_html = simple_get('https://realpython.com/blog/')
>>> len(raw_html)
33878

>>> no_html = simple_get('https://realpython.com/blog/nope-not-gonna-find-it')
>>> no_html is None
True

Wrangling HTML With BeautifulSoup

Once you have raw HTML in front of you, you can start to select and extract. For this purpose you will be using BeautifulSoup. The BeautifulSoup constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure. The object includes a slew of methods to select, view, and manipulate DOM nodes and text content.
As a quick, contrived, example, consider the following HTML document:
<!DOCTYPE html>
<html>
<head>
  <title>Contrived Example</title>
</head>
<body>
<p id="eggman"> I am the egg man </p>
<p id="walrus"> I am the walrus </p>
</body>
</html>
Suppose that the above HTML is saved in the file contrived.html. Then you can use BeautifulSoup like this:
>>> from bs4 import BeautifulSoup
>>> raw_html = open('contrived.html').read()
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for p in html.select('p'):
...     if p['id'] == 'walrus':
...         print(p.text)

I am the walrus
Breaking down the example, you first parse the raw HTML by passing it to the BeautifulSoup constructor. BeautifulSoup accepts multiple backend parsers, but the standard backend is 'html.parser', which you supply here as the second argument. (If you neglect to supply that 'html.parser', the code will still work, but you will see a warning printed to your screen.)
The select method on your html object lets you use CSS selectors to locate elements in the document. In the above case, html.select('p') returns a list of paragraph elements. Each p has HTML attributes that you can access like a dict. In the line if p['id'] == 'walrus', for example, you check if the id attribute is equal to the string 'walrus', which corresponds to <p id="walrus"> in the HTML.
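The select method understands much more than bare tag names. Here are a few more selectors run against the same contrived document; these are standard CSS selectors, so this is only a small sample of what is possible:
>>> html.select('p#walrus')                      # <p> elements with id="walrus"
[<p id="walrus"> I am the walrus </p>]
>>> html.select_one('#eggman').text.strip()      # select_one returns the first match
'I am the egg man'
>>> [p['id'] for p in html.select('body > p')]   # direct children of <body>
['eggman', 'walrus']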

Using BeautifulSoup to Get Mathematician Names

Now that you have given BeautifulSoup's select method a short test drive, how do you find out what to supply to select? The fastest way is to step out of Python and into your web browser’s developer tools. You can use your browser to examine the document in some detail - I usually look for id or class element attributes or any other information that uniquely identifies the information I want to extract.
To make matters concrete, turn to the list of mathematicians you saw earlier. Spending a minute or two looking at this page’s source, you can see that each mathematician’s name appears inside the text content of an <li> tag. Even better, <li> tags on this page seem to contain nothing but names of mathematicians.
Here’s a quick look using Python:
>>> raw_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for i, li in enumerate(html.select('li')):
...     print(i, li.text)

0  Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

1  Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

2  Carl F. Gauss
 Leonhard Euler 
 Bernhard Riemann

3  Leonhard Euler
 Bernhard Riemann
 Bernhard Riemann

4  Bernhard Riemann

# 5 ... and many more...
The above experiment shows that some of the <li> elements contain multiple names separated by newline characters, and others contain just a single name. With this information in mind, you can write your function to extract a single list of names:
def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                if len(name) > 0:
                    names.add(name.strip())
        return list(names)

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))
The get_names function downloads the page and iterates over the <li> elements, picking out each name that occurs. Next, you add each name to a Python set, which ensures that you don’t end up with duplicate names. Finally, you convert the set to a list and return it.
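You can give get_names a quick sanity check from the interpreter. Assuming the page still lists 100 names, you should see something like the following (the order of your list will differ, since a set does not preserve order):
>>> from mathematicians import get_names
>>> names = get_names()
>>> len(names)     # the source page lists 100 mathematicians
100
>>> all(isinstance(name, str) for name in names)
True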

Getting the Popularity Score

Nice, you’re nearly done! Now that you have a list of names, you need to pick out the pageviews for each one. The function you write is similar to the function you made to get the list of names, only now you supply a name and pick out an integer value from the page.
Again, you should first check out an example page in your browser’s developer tools. It looks like the pageview count appears inside an <a> element, and that the href attribute of that element always contains the string 'latest-60' as a substring. That’s all the information you need to write your function!
def get_hits_on_name(name):
    """
    Accepts a `name` of a mathematician and returns the number
    of hits that mathematician's Wikipedia page received in the
    last 60 days, as an `int`.
    """
    # url_root is a template string that is used to build a URL.
    url_root = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/{}'
    response = simple_get(url_root.format(name))

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')

        # Use .get() so anchors without an href attribute cannot raise a KeyError
        hit_link = [a for a in html.select('a')
                    if a.get('href', '').find('latest-60') > -1]

        if len(hit_link) > 0:
            # Strip commas:
            link_text = hit_link[0].text.replace(',', '')
            try:
                # Convert to integer
                return int(link_text)
            except ValueError:
                log_error("couldn't parse {} as an `int`".format(link_text))

    log_error('No pageviews found for {}'.format(name))
    return None
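One caveat worth flagging: name is substituted into the URL exactly as given, and the names you scraped contain spaces. requests generally percent-encodes spaces in URLs for you, but if you prefer to be explicit you can pre-encode with urllib.parse.quote from the standard library. A quick sketch of the encoding and the call (the actual count returned depends on when you run it; as noted earlier, Poincaré’s count was around 32,000 as of December 2017):
>>> from urllib.parse import quote
>>> from mathematicians import get_hits_on_name

>>> quote('Henri Poincare')        # spaces become percent-escapes
'Henri%20Poincare'
>>> hits = get_hits_on_name('Henri Poincare')
>>> isinstance(hits, int)          # around 32,000 as of December 2017
True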

Putting It All Together

You have reached a point where you can finally find out which mathematician is most beloved by the public! The plan is simple:
  • Get a list of names;
  • Iterate over the list to get a “popularity score” for each name; and
  • Finish by sorting the names by popularity.
Simple, right? Well, there’s one thing that hasn’t been mentioned yet: errors.
Working with real-world data is messy, and trying to force messy data into a uniform shape will invariably result in the occasional error jumping in to mess with your nice, clean vision of how things ought to be. Ideally, you would like to keep track of errors when they occur in order to get a better sense of the quality of your data.
For your present purpose, you will track instances when you could not find a popularity score for a given mathematician’s name. At the end of the script, you will print a message showing the number of mathematicians who were left out of the rankings.
Here’s the code:
if __name__ == '__main__':
    print('Getting the list of names....')
    names = get_names()
    print('... done.\n')

    results = []

    print('Getting stats for each name....')

    for name in names:
        try:
            hits = get_hits_on_name(name)
            if hits is None:
                hits = -1
            results.append((hits, name))
        except Exception:
            results.append((-1, name))
            log_error('error encountered while processing '
                      '{}, skipping'.format(name))

    print('... done.\n')

    results.sort()
    results.reverse()

    if len(results) > 5:
        top_marks = results[:5]
    else:
        top_marks = results

    print('\nThe most popular mathematicians are:\n')
    for (mark, mathematician) in top_marks:
        print('{} with {} page views'.format(mathematician, mark))

    no_results = len([res for res in results if res[0] == -1])
    print('\nBut we did not find results for '
          '{} mathematicians on the list'.format(no_results))
And that’s it!
When you run the script, you should see a report like the following (pageview counts change over time, so your exact numbers will differ):
The most popular mathematicians are:

Albert Einstein with 1089615 page views
Isaac Newton with 581612 page views
Srinivasa Ramanujan with 407141 page views
Aristotle with 399480 page views
Galileo Galilei with 375321 page views

But we did not find results for 19 mathematicians on the list
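One last note on the sorting step in the script: Python compares tuples element by element, so results.sort() orders the (hits, name) pairs by hit count first and only falls back to the name for ties. The sort and reverse calls could equally be collapsed into a single line:
# Equivalent to results.sort() followed by results.reverse():
results.sort(reverse=True)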

Conclusion & Next Steps

Web scraping is a big field, and you have just finished a brief tour of that field using Python as your guide. You can get pretty far using just requests and BeautifulSoup.
