The BBC News site has an annoying tendency to list numerous ancient stories in its ‘Popular’ sidebar. That was a good excuse to try out Python and collect a bit more information on when this occurs.
The script works as follows:
- Visit the BBC News homepage and scrape the ‘Most Popular’ sidebar.
- Visit the URL of each story.
- Collect the published date (from the metadata and the front end).
- Calculate the difference between the present date and time and the published date.
- Store the data as a CSV file.
- Repeat the whole process every n minutes or seconds.
Beautiful Soup makes relatively light work of parsing what we want, with help from PrettyTable, Requests and the standard csv and re (regular expression) modules.
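Before the full script, the date arithmetic at its core can be sketched on its own. This is a minimal illustration rather than part of the original script: the helper name `days_old` and the `now` parameter are made up for the example, and the timestamp format is the one the script uses when it parses the `OriginalPublicationDate` meta tag. It runs under Python 3, whereas the script itself is Python 2.

```python
from datetime import datetime

# Format the script uses for the OriginalPublicationDate meta tag value
META_FORMAT = "%Y/%m/%d %H:%M:%S"


def days_old(published, now=None):
    """Return the whole number of days between a meta-tag timestamp and `now`.

    `days_old` and `now` are hypothetical names for this sketch; passing
    `now` explicitly just makes the result reproducible.
    """
    published_dt = datetime.strptime(published, META_FORMAT)
    if now is None:
        now = datetime.now()
    # timedelta.days truncates to whole days, as in the script
    return abs((now - published_dt).days)


# First row of the results table: published 15/07/2013 06:35, checked 23/04/2014 00:01
print(days_old("2013/07/15 06:35:00", now=datetime(2014, 4, 23, 0, 1)))  # → 281
```

Because only whole days are counted, a story published 281 days and 17 hours before the check is reported as 281 days old.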
# BBC News scraper
# Copyright (C) 2014 Oliver S
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
import time
import requests
import re
import csv
from prettytable import PrettyTable
from datetime import datetime
from bs4 import BeautifulSoup
def scrape_bbc_popular_stories():
    """ Scrape all of the shared and popular stories from BBC homepage and record published date """
    try:
        # request BBC news homepage
        response = requests.get("http://www.bbc.co.uk/news/")
        # parse HTML using Beautiful Soup
        # returns a `soup` object
        soup = BeautifulSoup(response.content)
        # find all the stories in the sidebar:
        # each is a list element belonging to class ol[number]
        most_popular = soup.find_all('li', {'class': re.compile('ol[0-9]')})
        # article group is 0 = shared, 1 = most read, 2 = audio / video
        # story number is the number assigned to the article (the number next to the title)
        article_group = 0
        last_story_number = 0
        # set up prettytable for nice printout
        x = PrettyTable(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])
        x.align["URL"] = "l"  # Left align URL
        x.align["Story Title"] = "l"  # Left align story title
        for each_story in most_popular:
            # store the URL of each story
            link = each_story.find('a').attrs['href']
            print link
            # get the story title, stripping the leading story number
            story_title = each_story.find('a').text
            story_title = re.sub(r'^\W*\w+\W*', '', story_title)
            # get the story number (that's listed next to the article)
            story_number = each_story.span.text
            story_number = re.sub(r':\s*', "", story_number)
            # work out which group it belongs to:
            # a jump of 4 or more in the story number means a new panel has started
            if abs(last_story_number - int(story_number)) >= 4:
                article_group += 1
            last_story_number = int(story_number)
            if article_group == 0:
                article_group_name = "Most Shared"
            elif article_group == 1:
                article_group_name = "Most Read"
            else:
                article_group_name = "Audio / Video"
            # if it's an audio / video story that uses javascript to generate the time relations, ignore
            if article_group != 2:
                # fetch the story page with requests
                response = requests.get(link)
                # parse the HTML of the story page
                soup = BeautifulSoup(response.content)
                # get the original publication date from the meta tags
                orig_pub_date_meta = soup.find(attrs={"name": "OriginalPublicationDate"})['content']
                # get the original publication date from the front end (new BBC site)
                try:
                    orig_pub_date_front = soup.find('span', {'class': 'story-date'}).text
                    orig_pub_date_front = datetime.strptime(orig_pub_date_front, "\n%d %B %Y\nLast updated at %H:%M\n")
                except AttributeError:
                    print "Old site - set front date to meta date"
                    orig_pub_date_front = orig_pub_date_meta
                # calculate the difference between original publishing date and time of checking
                orig_pub_date_meta = datetime.strptime(orig_pub_date_meta, "%Y/%m/%d %H:%M:%S")
                days_difference = abs((datetime.now() - orig_pub_date_meta).days)
                # "Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"
                x.add_row([days_difference, article_group_name, story_number, story_title, link, datetime.now().strftime('%Y/%m/%d %H:%M:%S'), orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'), orig_pub_date_front])
                # write the same columns to the CSV file
                f.writerow([days_difference, article_group_name.encode('ascii', 'ignore'), story_number.encode('ascii', 'ignore'), story_title.encode('ascii', 'ignore'), link, datetime.now().strftime('%Y/%m/%d %H:%M:%S'), orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'), orig_pub_date_front])
        # print the nicely formatted table that PrettyTable generated, once all stories collected
        print x
    except requests.exceptions.RequestException as e:
        # network or HTTP failure: report it and let the main loop try again later
        print e


if __name__ == '__main__':
    f = csv.writer(open("bbc_popular_stories.csv", "wb"))
    f.writerow(["Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])  # Write column headers as the first line
    while True:
        scrape_bbc_popular_stories()
        print "Waiting for 5 minutes...."
        time.sleep(300)
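The way the script above tells the sidebar panels apart is easier to see in isolation. The sketch below reproduces that rule: a new panel starts whenever the story number changes by four or more from the previous one. The function name `assign_panels` and the sample numbering are hypothetical; in particular, three panels of five stories each is an assumption for the example, not something recorded here.

```python
def assign_panels(story_numbers, jump=4):
    """Give each story number a panel index (0, 1, 2, ...).

    A change of `jump` or more from the previous story number is taken to
    mean the numbering has restarted, i.e. a new panel has begun.
    """
    panel = 0
    last = 0
    labels = []
    for number in story_numbers:
        if abs(last - number) >= jump:
            panel += 1
        last = number
        labels.append(panel)
    return labels


# Assumed layout for illustration: three panels, each numbered 1 to 5
numbers = [1, 2, 3, 4, 5] * 3
print(assign_panels(numbers))
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

One consequence of the threshold is that a panel with fewer than five stories would not be detected: going from story 4 back to story 1 is a change of only three.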
Running the script in the IDLE shell prints the PrettyTable output after each pass. Some of the ‘old’ results collected over a 3-hour period are shown in the table below. Out of 46 news stories listed under popular stories during this period, 21 were over 67 days old!
| Days Old | Story Title | URL | Present D/T | Original Pub Date |
|---|---|---|---|---|
| 281 | Rent ‘unaffordable’ in third of UK | http://www.bbc.co.uk/news/business-23273448 | 23/04/2014 00:01 | 15/07/2013 06:35 |
| 852 | O’Donnell warns of UK challenges | http://www.bbc.co.uk/news/uk-politics-16295421 | 23/04/2014 00:01 | 22/12/2011 18:10 |
| 67 | Pension system ‘is not working’ | http://www.bbc.co.uk/news/business-26178113 | 23/04/2014 00:11 | 14/02/2014 11:31 |
| 593 | Union joint strike action warning | http://www.bbc.co.uk/news/business-19514195 | 23/04/2014 00:26 | 06/09/2012 23:40 |
| 854 | Government outlines pension deal | http://www.bbc.co.uk/news/business-16259238 | 23/04/2014 00:36 | 20/12/2011 22:20 |
| 244 | Millions ‘worse off’ on new pension | http://www.bbc.co.uk/news/business-23770327 | 23/04/2014 00:46 | 21/08/2013 12:35 |
| 508 | Plain packs for Australia smokers | http://www.bbc.co.uk/news/world-asia-20559585 | 23/04/2014 01:12 | 01/12/2012 01:08 |
| 509 | Energy Bill for ‘cleaner economy’ | http://www.bbc.co.uk/news/business-20539981 | 23/04/2014 01:17 | 29/11/2012 15:14 |
| 615 | Australia court backs tobacco law | http://www.bbc.co.uk/news/business-19264245 | 23/04/2014 01:32 | 15/08/2012 01:56 |
| 887 | Doctors call for car smoking ban | http://www.bbc.co.uk/news/health-15744352 | 23/04/2014 01:52 | 17/11/2011 17:10 |
| 435 | Bedroom tax’s’ impact on the north | http://www.bbc.co.uk/news/uk-england-tyne-21412826 | 23/04/2014 02:02 | 11/02/2013 14:01 |
| 317 | Labour ‘would cap welfare spending’ | http://www.bbc.co.uk/news/uk-politics-22785282 | 23/04/2014 02:07 | 09/06/2013 16:37 |
| 323 | Ed Balls seeks to restore Labour’s economic credibility | http://www.bbc.co.uk/news/uk-politics-22753040 | 23/04/2014 02:17 | 03/06/2013 13:24 |
| 281 | Benefit cap ‘leads to more in work’ | http://www.bbc.co.uk/news/business-23306092 | 23/04/2014 02:17 | 15/07/2013 17:51 |
| 818 | Britons ‘becoming more dishonest’ | http://www.bbc.co.uk/news/uk-16714872 | 23/04/2014 02:43 | 25/01/2012 08:54 |
| 218 | Benefit cheats face 10-year terms | http://www.bbc.co.uk/news/uk-24104743 | 23/04/2014 02:48 | 16/09/2013 12:49 |
| 811 | Family life on benefits | http://www.bbc.co.uk/news/uk-16812185 | 23/04/2014 02:53 | 01/02/2012 12:47 |
| 135 | Most people in poverty are ‘in work’ | http://www.bbc.co.uk/news/uk-25287068 | 23/04/2014 03:23 | 08/12/2013 21:49 |
| 187 | Work ‘may be no way out of poverty’ | http://www.bbc.co.uk/news/uk-politics-24553611 | 23/04/2014 03:28 | 17/10/2013 14:59 |
| 692 | Professions ‘must be more open’ | http://www.bbc.co.uk/news/uk-politics-18254219 | 23/04/2014 03:39 | 30/05/2012 15:32 |
| 309 | Top unis ‘now less socially diverse’ | http://www.bbc.co.uk/news/education-22912609 | 23/04/2014 04:04 | 17/06/2013 07:38 |
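A tally like the one above can be produced from the CSV with a few lines of Python. This is a rough sketch, not how the original figures were calculated: `count_old_stories` is a made-up helper, the sample uses only three of the CSV's columns, and the last sample row is a hypothetical fresh story added for contrast. Because the script records every story again on each pass, the sketch counts each URL once. It runs under Python 3.

```python
import csv
import io


def count_old_stories(csv_file, threshold_days):
    """Return (old, total): unique URLs at or above the threshold, and all unique URLs."""
    days_by_url = {}
    for row in csv.DictReader(csv_file):
        # Later passes overwrite earlier ones, so each URL is counted once
        days_by_url[row["URL"]] = int(row["Days Old"])
    old = sum(1 for days in days_by_url.values() if days >= threshold_days)
    return old, len(days_by_url)


# Three rows taken from the table above (one repeated, as happens across passes),
# plus one hypothetical fresh story on example.com
sample = """Days Old,Story Title,URL
281,Rent unaffordable in third of UK,http://www.bbc.co.uk/news/business-23273448
67,Pension system is not working,http://www.bbc.co.uk/news/business-26178113
281,Rent unaffordable in third of UK,http://www.bbc.co.uk/news/business-23273448
811,Family life on benefits,http://www.bbc.co.uk/news/uk-16812185
0,Hypothetical fresh story,http://example.com/fresh-story
"""

old, total = count_old_stories(io.StringIO(sample), threshold_days=67)
print("%d of %d stories are at least 67 days old" % (old, total))
# → 3 of 4 stories are at least 67 days old
```

To run it against the real output, the function could be passed `open("bbc_popular_stories.csv", newline="")` instead of the in-memory sample.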