The BBC News site has an annoying tendency to list very old stories in its ‘Popular’ sidebar. It was a good excuse to try out Python and collect a bit more information on when this occurs.
The Python script works as follows:
- Visit the BBC News homepage and scrape the ‘Most Popular’ sidebar.
- Visit the URL of each story.
- Collect the published date (from the metadata and the front end).
- Calculate the difference between the present date/time and the published date.
- Store the data as a CSV file.
- Repeat the whole process every n minutes or seconds.
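The date-difference step can be sketched on its own. This is a minimal Python 3 illustration, not the script itself: the helper name `days_old` is hypothetical, and it assumes the meta tag uses the `YYYY/MM/DD HH:MM:SS` format that the script parses. The timestamps come from the first row of the results table, with the seconds assumed to be `00`.

```python
from datetime import datetime

# Format the script expects in the OriginalPublicationDate meta tag.
META_DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def days_old(published, checked_at):
    """Return the whole days between a published timestamp string and a check time."""
    published_at = datetime.strptime(published, META_DATE_FORMAT)
    # timedelta.days drops the partial day, as the script's calculation does.
    return abs((checked_at - published_at).days)


# Published 15/07/2013 06:35, checked 23/04/2014 00:01 (seconds assumed).
checked_at = datetime(2014, 4, 23, 0, 1)
print(days_old("2013/07/15 06:35:00", checked_at))  # prints 281
```

Passing the check time in as an argument, rather than calling `datetime.now()` inside the function, keeps the calculation repeatable.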
Beautiful Soup makes relatively light work of parsing what we want, with help from PrettyTable, Requests and the standard `csv` and `re` (regular expressions) modules.
    # BBC News scraper
    # Copyright (C) 2014 Oliver S
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program. If not, see <http://www.gnu.org/licenses/>.

    import time
    import requests
    import re
    import csv
    from prettytable import PrettyTable
    from datetime import datetime
    from bs4 import BeautifulSoup

    def scrape_bbc_popular_stories():
        """ Scrape all of the shared and popular stories from BBC homepage and record published date """
        try:
            # request BBC news homepage
            response = requests.get("http://www.bbc.co.uk/news/")

            # parse HTML using Beautiful Soup
            # returns a `soup` object
            soup = BeautifulSoup(response.content)

            # find all the posts in the page.
            # find each list element belonging to class ol[number]
            most_popular = soup.find_all('li', {'class': re.compile('ol[0-9]')})

            # article group is 0 = shared 1 = most read 2 = audio / video
            # story number is the number assigned to the article (the number next to the title)
            article_group = 0
            last_story_number = 0

            # set up prettytable for nice printout
            x = PrettyTable(["Days Old", "Group Name", "Story Number", "Story Title", "URL",
                             "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])
            x.align["URL"] = "l"  # Left align URL
            x.align["Story Title"] = "l"  # Left align story title

            for each_story in most_popular:
                # store the URL of each story
                link = each_story.find('a').attrs['href']
                print link

                # get the story title
                story_title = each_story.find('a').text
                story_title = re.sub(r'^\W*\w+\W*', '', story_title)

                # get the story number (that's listed next to the article)
                story_number = each_story.span.text
                story_number = re.sub(':\s*', "", story_number)

                # work out which group it belongs to
                if (abs(last_story_number - int(story_number)) >= 4):
                    article_group += 1
                last_story_number = int(story_number)

                if (article_group == 0):
                    article_group_name = "Most Shared"
                elif (article_group == 1):
                    article_group_name = "Most Read"
                else:
                    article_group_name = "Audio / Video"

                # if it's an audio / video story that uses javascript to generate the time relations, ignore
                if article_group != 2:
                    # retrieve the story page with requests
                    response = requests.get(link)

                    # parse the HTML of the story page
                    soup = BeautifulSoup(response.content)

                    # get the original publication date from the meta tags
                    orig_pub_date_meta = soup.find(attrs={"name": "OriginalPublicationDate"})['content']

                    # get the original publication date from the front end (new BBC site)
                    try:
                        orig_pub_date_front = soup.find('span', {'class': 'story-date'}).text
                        orig_pub_date_front = datetime.strptime(orig_pub_date_front, "\n%d %B %Y\nLast updated at %H:%M\n")
                    except AttributeError:
                        print "Old site - set front date to meta date"
                        orig_pub_date_front = orig_pub_date_meta

                    # calculate the difference between original publishing date and time of checking
                    orig_pub_date_meta = datetime.strptime(orig_pub_date_meta, "%Y/%m/%d %H:%M:%S")
                    days_difference = abs((datetime.now() - orig_pub_date_meta).days)

                    # "Days Old", "Group Name", "Story Number", "Story Title", "URL", "Present D/T", "Original Pub Date", "Orig Pub Date (front)"
                    x.add_row([days_difference, article_group_name, story_number, story_title, link,
                               datetime.now().strftime('%Y/%m/%d %H:%M:%S'),
                               orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'),
                               orig_pub_date_front])

                    # write the same columns to the CSV file
                    f.writerow([days_difference,
                                article_group_name.encode('ascii', 'ignore'),
                                story_number.encode('ascii', 'ignore'),
                                story_title.encode('ascii', 'ignore'),
                                link,
                                datetime.now().strftime('%Y/%m/%d %H:%M:%S'),
                                orig_pub_date_meta.strftime('%Y/%m/%d %H:%M:%S'),
                                orig_pub_date_front])

            # print the nicely formatted table that PrettyTable generated, once all stories collected
            print x

        except requests.exceptions.RequestException as e:
            # a request failed (connection error, timeout, etc.)
            print e

    if __name__ == '__main__':
        f = csv.writer(open("bbc_popular_stories.csv", "wb"))
        # Write column headers as the first line
        f.writerow(["Days Old", "Group Name", "Story Number", "Story Title", "URL",
                    "Present D/T", "Original Pub Date", "Orig Pub Date (front)"])

        while True:
            scrape_bbc_popular_stories()
            print "Waiting for 5 minutes...."
            time.sleep(300)
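The least obvious part of the script is how it decides which panel a story belongs to: it watches for the story number jumping by 4 or more, for example from 5 back to 1. The sketch below isolates that rule in Python 3. The function name `assign_groups` and the `GROUP_NAMES` list are hypothetical, and the sample input assumes each panel lists five numbered stories, which the script implies but never states.

```python
# Panel names in the order the script assumes they appear on the page.
GROUP_NAMES = ["Most Shared", "Most Read", "Audio / Video"]


def assign_groups(story_numbers):
    """Label each story number with a panel name.

    A new panel starts whenever the number differs from the previous one
    by 4 or more, which is what happens when a list restarts at 1.
    """
    group = 0
    last_number = 0
    labels = []
    for number in story_numbers:
        if abs(last_number - number) >= 4:
            group += 1
        last_number = number
        # Anything past the second panel is treated as audio / video.
        labels.append(GROUP_NAMES[min(group, 2)])
    return labels


# Three panels of five stories each (assumed layout).
labels = assign_groups([1, 2, 3, 4, 5] * 3)
print(labels[0], labels[5], labels[10])
```

One consequence of this rule is that a panel with fewer than five stories would not trigger the jump (4 back to 1 is a difference of only 3), so the approach depends on the page layout staying as assumed.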
Running this in the IDLE shell prints each story URL as it is visited, followed by the PrettyTable output. Some of the ‘old’ results collected over a 3-hour period can be seen in this spreadsheet. Out of 46 news stories listed under popular stories during this period, 21 were at least 67 days old!
Days Old | Story Title | URL | Present D/T (dd/mm/yyyy) | Original Pub Date (dd/mm/yyyy) |
---|---|---|---|---|
281 | Rent ‘unaffordable’ in third of UK | http://www.bbc.co.uk/news/business-23273448 | 23/04/2014 00:01 | 15/07/2013 06:35 |
852 | O’Donnell warns of UK challenges | http://www.bbc.co.uk/news/uk-politics-16295421 | 23/04/2014 00:01 | 22/12/2011 18:10 |
67 | Pension system ‘is not working’ | http://www.bbc.co.uk/news/business-26178113 | 23/04/2014 00:11 | 14/02/2014 11:31 |
593 | Union joint strike action warning | http://www.bbc.co.uk/news/business-19514195 | 23/04/2014 00:26 | 06/09/2012 23:40 |
854 | Government outlines pension deal | http://www.bbc.co.uk/news/business-16259238 | 23/04/2014 00:36 | 20/12/2011 22:20 |
244 | Millions ‘worse off’ on new pension | http://www.bbc.co.uk/news/business-23770327 | 23/04/2014 00:46 | 21/08/2013 12:35 |
508 | Plain packs for Australia smokers | http://www.bbc.co.uk/news/world-asia-20559585 | 23/04/2014 01:12 | 01/12/2012 01:08 |
509 | Energy Bill for ‘cleaner economy’ | http://www.bbc.co.uk/news/business-20539981 | 23/04/2014 01:17 | 29/11/2012 15:14 |
615 | Australia court backs tobacco law | http://www.bbc.co.uk/news/business-19264245 | 23/04/2014 01:32 | 15/08/2012 01:56 |
887 | Doctors call for car smoking ban | http://www.bbc.co.uk/news/health-15744352 | 23/04/2014 01:52 | 17/11/2011 17:10 |
435 | Bedroom tax’s’ impact on the north | http://www.bbc.co.uk/news/uk-england-tyne-21412826 | 23/04/2014 02:02 | 11/02/2013 14:01 |
317 | Labour ‘would cap welfare spending’ | http://www.bbc.co.uk/news/uk-politics-22785282 | 23/04/2014 02:07 | 09/06/2013 16:37 |
323 | Ed Balls seeks to restore Labour’s economic credibility | http://www.bbc.co.uk/news/uk-politics-22753040 | 23/04/2014 02:17 | 03/06/2013 13:24 |
281 | Benefit cap ‘leads to more in work’ | http://www.bbc.co.uk/news/business-23306092 | 23/04/2014 02:17 | 15/07/2013 17:51 |
818 | Britons ‘becoming more dishonest’ | http://www.bbc.co.uk/news/uk-16714872 | 23/04/2014 02:43 | 25/01/2012 08:54 |
218 | Benefit cheats face 10-year terms | http://www.bbc.co.uk/news/uk-24104743 | 23/04/2014 02:48 | 16/09/2013 12:49 |
811 | Family life on benefits | http://www.bbc.co.uk/news/uk-16812185 | 23/04/2014 02:53 | 01/02/2012 12:47 |
135 | Most people in poverty are ‘in work’ | http://www.bbc.co.uk/news/uk-25287068 | 23/04/2014 03:23 | 08/12/2013 21:49 |
187 | Work ‘may be no way out of poverty’ | http://www.bbc.co.uk/news/uk-politics-24553611 | 23/04/2014 03:28 | 17/10/2013 14:59 |
692 | Professions ‘must be more open’ | http://www.bbc.co.uk/news/uk-politics-18254219 | 23/04/2014 03:39 | 30/05/2012 15:32 |
309 | Top unis ‘now less socially diverse’ | http://www.bbc.co.uk/news/education-22912609 | 23/04/2014 04:04 | 17/06/2013 07:38 |
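A count like "21 of 46" can be derived from the CSV rather than by hand. Below is a minimal Python 3 sketch, assuming the file has the `Days Old` and `URL` columns that the script writes. The function name `count_old_stories` is hypothetical, and the sample data is cut down to three columns: two rows are taken from the table above, one is a repeat sighting of the first story, and the last row is invented purely for illustration.

```python
import csv
import io


def count_old_stories(csv_file, min_days):
    """Return (old, total) counts of distinct story URLs.

    A story counts as old if any of its rows has 'Days Old' >= min_days.
    URLs are de-duplicated because the scraper records the same story
    again on every pass while it stays in the sidebar.
    """
    old_urls = set()
    all_urls = set()
    for row in csv.DictReader(csv_file):
        all_urls.add(row["URL"])
        if int(row["Days Old"]) >= min_days:
            old_urls.add(row["URL"])
    return len(old_urls), len(all_urls)


sample = io.StringIO(
    "Days Old,Story Title,URL\n"
    "281,Rent unaffordable in third of UK,http://www.bbc.co.uk/news/business-23273448\n"
    "67,Pension system is not working,http://www.bbc.co.uk/news/business-26178113\n"
    # Same story seen again on a later pass.
    "281,Rent unaffordable in third of UK,http://www.bbc.co.uk/news/business-23273448\n"
    # Invented row, not from the collected data.
    "0,Hypothetical fresh story,http://example.com/fresh-story\n"
)
print(count_old_stories(sample, 67))  # prints (2, 3)
```

To run it against the real output, pass an open handle to `bbc_popular_stories.csv` in place of the `io.StringIO` sample.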