Scraping data from the BBC with Python

The BBC News site has an annoying tendency to list ancient stories in its ‘Popular’ sidebar. That was a good excuse to try out Python and collect a bit more information on when this happens.

The Python script works as follows:

Visit the BBC News homepage and scrape the ‘Most Popular’ sidebar.
Visit the URL of each story.
Collect the published date (from the metadata and the front end).
Calculate the difference between the present date/time and the published date.
Store the data as a CSV file.
Repeat the whole process every n minutes or seconds.
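The date arithmetic in the fourth step is small enough to try on its own. Here is a minimal sketch, written for Python 3 rather than the Python 2 of the full script. The days_old helper and the META_FORMAT name are made up for illustration; the timestamps are taken from the results table further down.

```python
from datetime import datetime

# Format the script expects in the OriginalPublicationDate meta tag.
META_FORMAT = "%Y/%m/%d %H:%M:%S"


def days_old(published, checked_at):
    """Return the whole days between a publication timestamp and a check time.

    `published` is a string in META_FORMAT; `checked_at` is a datetime.
    This mirrors the script's abs(timedelta.days) calculation.
    """
    published_dt = datetime.strptime(published, META_FORMAT)
    return abs((checked_at - published_dt).days)


# Published 15/07/2013 06:35, checked 23/04/2014 00:01
age = days_old("2013/07/15 06:35:00", datetime(2014, 4, 23, 0, 1))
print(age)
```

Only completed days are counted, so a story published 281 days and 17 hours before the check is reported as 281 days old.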

Beautiful Soup makes relatively light work of parsing what we want. The script also uses Requests to fetch the pages, PrettyTable for the printout, and the standard csv and re (regular expression) modules.
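To see how Beautiful Soup and a regular expression work together without hitting the live site, here is a small sketch that runs the same kind of find_all call over a hand-written fragment. The markup in SAMPLE_HTML and the extract_stories helper are assumptions for illustration, and the real sidebar HTML may well look different. It targets Python 3 with the bs4 package installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not copied from the BBC site).
SAMPLE_HTML = """
<ol>
  <li class="ol1"><a href="http://www.bbc.co.uk/news/business-23273448"><span>1: </span>Rent 'unaffordable' in third of UK</a></li>
  <li class="ol2"><a href="http://www.bbc.co.uk/news/uk-16812185"><span>2: </span>Family life on benefits</a></li>
  <li class="other"><a href="http://www.bbc.co.uk/news/">Not a ranked story</a></li>
</ol>
"""


def extract_stories(html):
    """Return (story_number, title, url) for each <li> whose class matches ol[0-9]."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for item in soup.find_all("li", {"class": re.compile("ol[0-9]")}):
        anchor = item.find("a")
        # The span holds the rank, e.g. "1: "
        number = int(re.sub(r":\s*", "", item.span.text))
        # The anchor text includes the rank, so strip it from the front of the title
        title = re.sub(r"^\s*\d+\s*:?\s*", "", anchor.text)
        stories.append((number, title, anchor.attrs["href"]))
    return stories


stories = extract_stories(SAMPLE_HTML)
for story in stories:
    print(story)
```

The list item with class "other" is skipped because its class does not match the pattern.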

# BBC News scraper
# Copyright (C) 2014 Oliver S
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import time
import requests
import re
import csv
from prettytable import PrettyTable
from datetime import datetime
from bs4 import BeautifulSoup

def scrape_bbc_popular_stories():
    """ Scrape all of the shared and popular stories from BBC homepage and record published date """

    try:
        # request the BBC News homepage
        response = requests.get("http://www.bbc.co.uk/news/")

        # parse the HTML using Beautiful Soup, which returns a `soup` object
        # (the parser is named explicitly so the result does not depend on what is installed)
        soup = BeautifulSoup(response.content, "html.parser")

        # find every story in the popular panels: each list element with class ol[number]
        most_popular = soup.find_all('li', {'class': re.compile('ol[0-9]')})

        # article_group: 0 = most shared, 1 = most read, 2 = audio / video
        # story_number is the number assigned to the article (the number next to the title)

        article_group = 0
        
        last_story_number = 0

        # set up prettytable for nice printout
        x = PrettyTable(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])
        x.align["URL"] = "l"            # Left align URL
        x.align["Story Title"] = "l"    # Left align story title
        
        for each_story in most_popular:

            # store the URL of each story
            link = each_story.find('a').attrs['href']
            print link

            # get the story title, stripping only the leading story number
            # (the looser pattern r'^\W*\w+\W*' also ate an opening quote mark)
            story_title = each_story.find('a').text
            story_title = re.sub(r'^\s*\d+\s*:?\s*', '', story_title)

            # get the story number (that's listed next to the article)
            story_number = each_story.span.text
            story_number = re.sub(r':\s*', "", story_number)

            # work out which group the story belongs to: the numbering restarts
            # in each panel, so a jump of 4 or more marks the start of the next group
            if abs(last_story_number - int(story_number)) >= 4:
                article_group += 1

            last_story_number = int(story_number)

            if article_group == 0:
                article_group_name = "Most Shared"
            elif article_group == 1:
                article_group_name = "Most Read"
            else:
                article_group_name = "Audio / Video"

            
            # audio / video stories use JavaScript to generate their dates, so ignore them

            if article_group != 2:
                # retrieve the story page with requests
                response = requests.get(link)

                # parse the HTML of the story page
                soup = BeautifulSoup(response.content, "html.parser")

                # get the original publication date from the meta tags
                orig_pub_date_meta = soup.find(attrs={"name": "OriginalPublicationDate"})['content']

                # get the original publication date from the front end (new BBC site)
                try:
                    orig_pub_date_front = soup.find('span', {'class': 'story-date'}).text
                    orig_pub_date_front = datetime.strptime(orig_pub_date_front, "\n%d %B %Y\nLast updated at %H:%M\n")
                except (AttributeError, ValueError):
                    # AttributeError: no story-date span (old site layout)
                    # ValueError: the span text did not match the expected format
                    print "No usable front-end date - set front date to meta date"
                    orig_pub_date_front = orig_pub_date_meta
                    
                    
                # calculate the difference between original publishing date and time of checking
                orig_pub_date_meta = datetime.strptime(orig_pub_date_meta, "%Y/%m/%d %H:%M:%S")
                days_difference = abs((datetime.now() - orig_pub_date_meta).days)

                # format both timestamps once so the table and the CSV get identical values
                checked_at = datetime.now().strftime('%Y/%m/%d %H:%M:%S')
                published_at = orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S')

                # columns: "Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"
                x.add_row([days_difference, article_group_name, story_number, story_title, link, checked_at, published_at, orig_pub_date_front])
                # same columns for the CSV; text is reduced to ASCII because the Python 2 csv module cannot write unicode
                f.writerow([days_difference, article_group_name.encode('ascii', 'ignore'), story_number.encode('ascii', 'ignore'), story_title.encode('ascii', 'ignore'), link, checked_at, published_at, orig_pub_date_front])

        # print the nicely formatted table that PrettyTable generated, once all stories are collected
        print x

    except requests.exceptions.RequestException as e:
        # a request failed: report it and let the main loop try again on the next cycle
        print e



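if __name__ == '__main__':

    # keep the CSV open for the whole run and close it cleanly when the script stops
    with open("bbc_popular_stories.csv", "wb") as csv_file:
        # `f` stays at module level because scrape_bbc_popular_stories() writes rows through it
        f = csv.writer(csv_file)
        # write column headers as the first line
        f.writerow(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])

        while True:
            scrape_bbc_popular_stories()
            csv_file.flush()    # push the new rows to disk before sleeping
            print "Waiting for 5 minutes...."
            time.sleep(300)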
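The least obvious part of the script is how it decides which panel a story belongs to, so here is that logic pulled out into a standalone sketch (Python 3). The assign_groups helper and GROUP_NAMES list are names invented for this example, and the input assumes three panels of five stories each, which is my guess at the sidebar layout rather than something the script checks.

```python
GROUP_NAMES = ["Most Shared", "Most Read", "Audio / Video"]


def assign_groups(story_numbers):
    """Label each story number with a panel name.

    The numbering restarts in every panel, so a jump of 4 or more between
    consecutive numbers (for example 5 back to 1) starts the next group.
    """
    group = 0
    last_number = 0
    names = []
    for number in story_numbers:
        if abs(last_number - number) >= 4:
            group += 1
        last_number = number
        # Anything past the second restart is treated as audio / video
        names.append(GROUP_NAMES[min(group, 2)])
    return names


# Assumed layout: three panels, each numbered 1 to 5
numbers = [1, 2, 3, 4, 5] * 3
labels = assign_groups(numbers)
for number, label in zip(numbers, labels):
    print(number, label)
```

One consequence of the threshold is that a panel with fewer than five stories would not trigger a group change, so the approach depends on the panels staying a fixed size.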

Running this in the IDLE shell prints the URL of each story as it is fetched, followed by the PrettyTable summary.

Some of the ‘old’ results collected over a 3-hour period can be seen in this spreadsheet. Out of 46 news stories listed under popular stories during this period, 21 were at least 67 days old!

Days Old	Story Title	URL	Present D/T	Original Pub Date
281	Rent ‘unaffordable’ in third of UK	http://www.bbc.co.uk/news/business-23273448	23/04/2014 00:01	15/07/2013 06:35
852	O’Donnell warns of UK challenges	http://www.bbc.co.uk/news/uk-politics-16295421	23/04/2014 00:01	22/12/2011 18:10
67	Pension system ‘is not working’	http://www.bbc.co.uk/news/business-26178113	23/04/2014 00:11	14/02/2014 11:31
593	Union joint strike action warning	http://www.bbc.co.uk/news/business-19514195	23/04/2014 00:26	06/09/2012 23:40
854	Government outlines pension deal	http://www.bbc.co.uk/news/business-16259238	23/04/2014 00:36	20/12/2011 22:20
244	Millions ‘worse off’ on new pension	http://www.bbc.co.uk/news/business-23770327	23/04/2014 00:46	21/08/2013 12:35
508	Plain packs for Australia smokers	http://www.bbc.co.uk/news/world-asia-20559585	23/04/2014 01:12	01/12/2012 01:08
509	Energy Bill for ‘cleaner economy’	http://www.bbc.co.uk/news/business-20539981	23/04/2014 01:17	29/11/2012 15:14
615	Australia court backs tobacco law	http://www.bbc.co.uk/news/business-19264245	23/04/2014 01:32	15/08/2012 01:56
887	Doctors call for car smoking ban	http://www.bbc.co.uk/news/health-15744352	23/04/2014 01:52	17/11/2011 17:10
435	Bedroom tax’s’ impact on the north	http://www.bbc.co.uk/news/uk-england-tyne-21412826	23/04/2014 02:02	11/02/2013 14:01
317	Labour ‘would cap welfare spending’	http://www.bbc.co.uk/news/uk-politics-22785282	23/04/2014 02:07	09/06/2013 16:37
323	Ed Balls seeks to restore Labour’s economic credibility	http://www.bbc.co.uk/news/uk-politics-22753040	23/04/2014 02:17	03/06/2013 13:24
281	Benefit cap ‘leads to more in work’	http://www.bbc.co.uk/news/business-23306092	23/04/2014 02:17	15/07/2013 17:51
818	Britons ‘becoming more dishonest’	http://www.bbc.co.uk/news/uk-16714872	23/04/2014 02:43	25/01/2012 08:54
218	Benefit cheats face 10-year terms	http://www.bbc.co.uk/news/uk-24104743	23/04/2014 02:48	16/09/2013 12:49
811	Family life on benefits	http://www.bbc.co.uk/news/uk-16812185	23/04/2014 02:53	01/02/2012 12:47
135	Most people in poverty are ‘in work’	http://www.bbc.co.uk/news/uk-25287068	23/04/2014 03:23	08/12/2013 21:49
187	Work ‘may be no way out of poverty’	http://www.bbc.co.uk/news/uk-politics-24553611	23/04/2014 03:28	17/10/2013 14:59
692	Professions ‘must be more open’	http://www.bbc.co.uk/news/uk-politics-18254219	23/04/2014 03:39	30/05/2012 15:32
309	Top unis ‘now less socially diverse’	http://www.bbc.co.uk/news/education-22912609	23/04/2014 04:04	17/06/2013 07:38
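Because the script records every story on every pass, the same URL turns up many times in the CSV, so counting old stories means de-duplicating first. Below is a rough Python 3 sketch of how such a count could be made. The count_old_stories helper is hypothetical, and the sample rows are only partly real: the titles, URLs, ages and dates come from the table above, while the group names, story numbers, seconds and the final repeat row are placeholders I added.

```python
import csv
import io

# Sample data in the script's CSV layout (see the note above on which values are placeholders).
SAMPLE_CSV = """Days Old,Group Name,Story Number,Story Title,URL,Present D/T,Original Pub Date,Orig Pub Date (front)
811,Most Read,1,Family life on benefits,http://www.bbc.co.uk/news/uk-16812185,2014/04/23 02:53:00,2012/02/01 12:47:00,2012/02/01 12:47:00
508,Most Read,2,Plain packs for Australia smokers,http://www.bbc.co.uk/news/world-asia-20559585,2014/04/23 01:12:00,2012/12/01 01:08:00,2012/12/01 01:08:00
67,Most Read,3,Pension system 'is not working',http://www.bbc.co.uk/news/business-26178113,2014/04/23 00:11:00,2014/02/14 11:31:00,2014/02/14 11:31:00
811,Most Read,1,Family life on benefits,http://www.bbc.co.uk/news/uk-16812185,2014/04/23 02:58:00,2012/02/01 12:47:00,2012/02/01 12:47:00
"""


def count_old_stories(csv_text, min_days):
    """Return (unique stories, unique stories at least min_days old)."""
    oldest_seen = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row["URL"]
        age = int(row["Days Old"])
        # Keep the largest age recorded for each URL
        oldest_seen[url] = max(oldest_seen.get(url, 0), age)
    old = sum(1 for age in oldest_seen.values() if age >= min_days)
    return len(oldest_seen), old


total, old = count_old_stories(SAMPLE_CSV, 100)
print(total, old)
```

On a real run you would read bbc_popular_stories.csv instead of the inline string and pick whatever age threshold counts as ‘old’ for you.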
