Scraping data from the BBC with Python

The BBC News site has an annoying tendency to list numerous ancient stories in its ‘Popular’ sidebar. That was a good excuse to try out Python and collect a bit more information on when this occurs.

The Python script works as follows:

  • Visit the BBC News homepage and scrape the ‘Most Popular’ sidebar.
  • Visit the URL of each story.
  • Collect the published date (from the metadata and from the front end).
  • Calculate the difference between the present date and time and the published date.
  • Store the data as a CSV file.
  • Repeat the whole process every n minutes or seconds.
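
The date arithmetic in the fourth step can be sketched on its own. This is a minimal Python 3 sketch rather than the script itself: the helper name `days_old` and its `now` parameter are made up for illustration, and it assumes the metadata timestamp uses the `YYYY/MM/DD HH:MM:SS` layout that the script parses.

```python
from datetime import datetime

# layout the script expects in the OriginalPublicationDate meta tag
META_FORMAT = "%Y/%m/%d %H:%M:%S"


def days_old(published, now=None):
    """Return the number of whole days between a published timestamp and `now`."""
    if now is None:
        now = datetime.now()
    published_at = datetime.strptime(published, META_FORMAT)
    # for a past date, timedelta.days is the count of complete days elapsed;
    # abs() mirrors what the script does with the result
    return abs((now - published_at).days)


if __name__ == "__main__":
    checked_at = datetime(2014, 4, 23, 0, 1)
    print(days_old("2013/07/15 06:35:00", now=checked_at))
```

With a check time of 23/04/2014 00:01 and a publication time of 15/07/2013 06:35 this gives 281, which matches the ‘Rent unaffordable’ row in the results further down.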

Beautiful Soup makes relatively light work of the parsing, with help from Requests, PrettyTable and the standard csv and re (regular expression) modules.
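
The listing below is Python 2 (it uses `print` statements and opens the CSV file in `wb` mode). On Python 3 the csv module expects a file opened in text mode with `newline=""` instead. The following is a minimal sketch of the CSV step under that assumption; `append_rows` and `HEADER` are illustrative names, not part of the script.

```python
import csv
import os

# the same column headers the script writes
HEADER = ["Days Old", "Group Name", "Story Number", "Story Title", "URL",
          "Present D/T", "Original Pub Date", "Orig Pub Date (front)"]


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings on Python 3
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerows(rows)
```

Opening the file inside a `with` block on each pass means the rows reach disk after every scrape, which is helpful for a script that runs in an endless loop. Writing UTF-8 also removes the need to strip non-ASCII characters from the titles.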

# BBC News scraper
# Copyright (C) 2014 Oliver S
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import time
import requests
import re
import csv
from prettytable import PrettyTable
from datetime import datetime
from bs4 import BeautifulSoup

def scrape_bbc_popular_stories():
    """ Scrape all of the shared and popular stories from BBC homepage and record published date """

    try:
        # request BBC news homepage    
        response = requests.get("http://www.bbc.co.uk/news/")

        # parse HTML using Beautiful Soup
        # returns a `soup` object
        soup = BeautifulSoup(response.content)

        # find all the posts in the page.

        # find each list element belonging to class ol[number]

        most_popular=soup.find_all('li',{'class': re.compile('ol[0-9]')})

        #article group is 0 = shared 1 = most read 2 = audio / video
        #story number is the number assigned to the article (the number next to the title)

        article_group = 0
        
        last_story_number = 0

        # set up prettytable for nice printout
        x = PrettyTable(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])
        x.align["URL"] = "l"            # Left align URL
        x.align["Story Title"] = "l"    # Left align story title
        
        for each_story in most_popular:

            # store the URL of each story
            link = each_story.find('a').attrs['href']   
            print link

            # get the story title
            story_title = each_story.find('a').text 
            story_title = re.sub(r'^\W*\w+\W*', '', story_title)

            # get the story number (that's listed next to the article)
            story_number = each_story.span.text
            story_number = re.sub(r':\s*', "", story_number)

            # work out which group it belongs to
            if (abs(last_story_number-int(story_number)) >= 4)   :
                article_group += 1
                
            last_story_number = int(story_number)
            
            if (article_group == 0) :
                article_group_name = "Most Shared"
            elif (article_group == 1) :
                article_group_name = "Most Read"
            else :
                article_group_name = "Audio / Video"

            
            # audio / video stories use JavaScript to generate their timestamps, so skip them

            if article_group != 2:
                # retrieve the story page with requests
                response = requests.get(link)

                # parse the HTML of the story page
                soup = BeautifulSoup(response.content)

                # get the original publication date from the meta tags
                orig_pub_date_meta =  soup.find(attrs={"name":"OriginalPublicationDate"})['content']

                # get the original publication date from the front end (new BBC site)
                try:
                    orig_pub_date_front = soup.find('span',{'class':'story-date'}).text
                    orig_pub_date_front = datetime.strptime(orig_pub_date_front, "\n%d %B %Y\nLast updated at %H:%M\n")
                except AttributeError:
                    print "Old site - set front date to meta date"
                    orig_pub_date_front = orig_pub_date_meta
                    
                    
                # calculate the difference between original publishing date and time of checking
                orig_pub_date_meta = datetime.strptime(orig_pub_date_meta, "%Y/%m/%d %H:%M:%S")
                days_difference = abs((datetime.now()-orig_pub_date_meta).days)

                #"Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"
                x.add_row([days_difference,article_group_name,story_number,story_title,link,datetime.now().strftime('%Y/%m/%d %H:%M:%S'),orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'),orig_pub_date_front])
                # write the same columns to the CSV file, dropping non-ASCII characters from the text fields
                f.writerow([days_difference,article_group_name.encode('ascii','ignore'),story_number.encode('ascii','ignore'),story_title.encode('ascii','ignore'),link,datetime.now().strftime('%Y/%m/%d %H:%M:%S'),orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'),orig_pub_date_front])

        # print the nicely formatted table that PrettyTable generated, once all stories collected
        print x
        
    except requests.exceptions.RequestException as e:    # report network errors and wait for the next cycle
        print e



if __name__ == '__main__':
    

    f = csv.writer(open("bbc_popular_stories.csv", "wb"))
    f.writerow(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"]) # Write column headers as the first line

    while True:
        scrape_bbc_popular_stories()
        print "Waiting for 5 minutes...."
        time.sleep(300)
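
One detail of the title clean-up is worth isolating. The pattern `^\W*\w+\W*` removes the leading story number, but its trailing `\W*` also consumes any punctuation that follows, including an opening quotation mark. That would explain a title such as ‘Bedroom tax’s’ impact on the north’ appearing without its first quote in the results. The sketch below (Python 3) compares it with a narrower pattern. The sample link text and both function names are assumptions for illustration, since the exact markup of the sidebar is not shown here.

```python
import re


def strip_number_original(link_text):
    """The listing's pattern: leading non-word chars, one word, then more non-word chars."""
    return re.sub(r'^\W*\w+\W*', '', link_text)


def strip_number_narrow(link_text):
    """A narrower alternative: only leading whitespace, digits, a colon and spaces."""
    return re.sub(r'^\s*\d+:\s*', '', link_text)


if __name__ == "__main__":
    sample = "1: ‘Bedroom tax’ impact"  # hypothetical link text
    print(strip_number_original(sample))  # opening quote is lost
    print(strip_number_narrow(sample))    # opening quote is kept
```

For titles that start with a letter the two patterns give the same result, so the narrower one is a drop-in change only if the link text really does begin with a number and a colon.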

Running this in the IDLE shell looks like this:

Some of the ‘old’ results collected over a 3-hour period can be seen in this spreadsheet. Out of 46 news stories listed under popular stories during this period, 21 were at least 67 days old!

| Days Old | Story Title | URL | Present D/T | Original Pub Date |
|---|---|---|---|---|
| 281 | Rent ‘unaffordable’ in third of UK | http://www.bbc.co.uk/news/business-23273448 | 23/04/2014 00:01 | 15/07/2013 06:35 |
| 852 | O’Donnell warns of UK challenges | http://www.bbc.co.uk/news/uk-politics-16295421 | 23/04/2014 00:01 | 22/12/2011 18:10 |
| 67 | Pension system ‘is not working’ | http://www.bbc.co.uk/news/business-26178113 | 23/04/2014 00:11 | 14/02/2014 11:31 |
| 593 | Union joint strike action warning | http://www.bbc.co.uk/news/business-19514195 | 23/04/2014 00:26 | 06/09/2012 23:40 |
| 854 | Government outlines pension deal | http://www.bbc.co.uk/news/business-16259238 | 23/04/2014 00:36 | 20/12/2011 22:20 |
| 244 | Millions ‘worse off’ on new pension | http://www.bbc.co.uk/news/business-23770327 | 23/04/2014 00:46 | 21/08/2013 12:35 |
| 508 | Plain packs for Australia smokers | http://www.bbc.co.uk/news/world-asia-20559585 | 23/04/2014 01:12 | 01/12/2012 01:08 |
| 509 | Energy Bill for ‘cleaner economy’ | http://www.bbc.co.uk/news/business-20539981 | 23/04/2014 01:17 | 29/11/2012 15:14 |
| 615 | Australia court backs tobacco law | http://www.bbc.co.uk/news/business-19264245 | 23/04/2014 01:32 | 15/08/2012 01:56 |
| 887 | Doctors call for car smoking ban | http://www.bbc.co.uk/news/health-15744352 | 23/04/2014 01:52 | 17/11/2011 17:10 |
| 435 | Bedroom tax’s’ impact on the north | http://www.bbc.co.uk/news/uk-england-tyne-21412826 | 23/04/2014 02:02 | 11/02/2013 14:01 |
| 317 | Labour ‘would cap welfare spending’ | http://www.bbc.co.uk/news/uk-politics-22785282 | 23/04/2014 02:07 | 09/06/2013 16:37 |
| 323 | Ed Balls seeks to restore Labour’s economic credibility | http://www.bbc.co.uk/news/uk-politics-22753040 | 23/04/2014 02:17 | 03/06/2013 13:24 |
| 281 | Benefit cap ‘leads to more in work’ | http://www.bbc.co.uk/news/business-23306092 | 23/04/2014 02:17 | 15/07/2013 17:51 |
| 818 | Britons ‘becoming more dishonest’ | http://www.bbc.co.uk/news/uk-16714872 | 23/04/2014 02:43 | 25/01/2012 08:54 |
| 218 | Benefit cheats face 10-year terms | http://www.bbc.co.uk/news/uk-24104743 | 23/04/2014 02:48 | 16/09/2013 12:49 |
| 811 | Family life on benefits | http://www.bbc.co.uk/news/uk-16812185 | 23/04/2014 02:53 | 01/02/2012 12:47 |
| 135 | Most people in poverty are ‘in work’ | http://www.bbc.co.uk/news/uk-25287068 | 23/04/2014 03:23 | 08/12/2013 21:49 |
| 187 | Work ‘may be no way out of poverty’ | http://www.bbc.co.uk/news/uk-politics-24553611 | 23/04/2014 03:28 | 17/10/2013 14:59 |
| 692 | Professions ‘must be more open’ | http://www.bbc.co.uk/news/uk-politics-18254219 | 23/04/2014 03:39 | 30/05/2012 15:32 |
| 309 | Top unis ‘now less socially diverse’ | http://www.bbc.co.uk/news/education-22912609 | 23/04/2014 04:04 | 17/06/2013 07:38 |
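
A count like the one quoted above (how many distinct stories were old) can be recomputed from the CSV. Here is a rough Python 3 sketch, assuming the column names written by the script and treating each distinct URL as one story. The function name `summarise` and its `min_days` parameter are illustrative choices. In the usage, the first two URLs come from the table above and the third is a made-up placeholder for a fresh story.

```python
import csv
import io


def summarise(csv_text, min_days=67):
    """Return (distinct stories, distinct stories at least `min_days` old)."""
    oldest_by_url = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        days = int(row["Days Old"])
        url = row["URL"]
        # a story can appear in several scrapes; keep its largest recorded age
        oldest_by_url[url] = max(days, oldest_by_url.get(url, 0))
    old = sum(1 for days in oldest_by_url.values() if days >= min_days)
    return len(oldest_by_url), old


if __name__ == "__main__":
    sample = (
        "Days Old,URL\n"
        "281,http://www.bbc.co.uk/news/business-23273448\n"
        "281,http://www.bbc.co.uk/news/business-23273448\n"
        "67,http://www.bbc.co.uk/news/business-26178113\n"
        "0,http://www.bbc.co.uk/news/hypothetical-fresh-story\n"  # placeholder row
    )
    print(summarise(sample))
```

Grouping by URL matters because the script records every story on every pass, so the same article appears many times over a 3-hour run.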