HTML Page Scraping (of the Quick and Dirty Variety)

Category: Python - Miscellaneous

This recipe is really about using previously identified information in a web page (i.e., state information) to decide how to handle newly identified information. To view a page of the kind that this recipe's code can scrape, visit
http://www.archives.ca/02/02012202_e.html
then select "Ontario", enter "Cornwall" in the Geographic Location box, and select "MAX" in the Number of References per page list.

Each page served by the census site offers two kinds of document images: schedule 1 and schedule 2. Only the schedule 1 documents provide information that interests me at present (namely surnames, birthdates, etc.). I would therefore like to extract the information that identifies schedule 1 images and ignore the others.

Put in terms of state: when my script notices that the HTML it has most recently seen indicates schedule 1, I want it to extract the URLs in the "option" tags; when it has found schedule 2, I want it to ignore the URLs. One of the simplest ways of doing this may be to form a regular expression (RE) that alternates one RE that recognises schedule numbers with one RE that recognises the URLs, then pass this combined RE to a "sub" function so that each match can be processed in a purpose-built function.
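
The idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the recipe's original code: the sample markup, the "Schedule N" marker text, the "value" attribute holding the URL, and the names SAMPLE_HTML, PATTERN and extract_schedule_urls are all hypothetical, and the real census pages may be laid out differently.

```python
import re

# Hypothetical markup: the real census pages may label schedules and
# option values differently.
SAMPLE_HTML = """
<b>Schedule 1</b>
<select>
<option value="/images/s1_page1.gif">Page 1</option>
<option value="/images/s1_page2.gif">Page 2</option>
</select>
<b>Schedule 2</b>
<select>
<option value="/images/s2_page1.gif">Page 1</option>
</select>
<b>Schedule 1</b>
<select>
<option value="/images/s1_page3.gif">Page 3</option>
</select>
"""

# One alternative recognises a schedule marker, the other an option URL.
PATTERN = re.compile(
    r'Schedule\s+(?P<schedule>\d+)'
    r'|<option\s+value="(?P<url>[^"]+)"',
    re.IGNORECASE,
)


def extract_schedule_urls(html, wanted="1"):
    """Return option URLs that follow a marker for the wanted schedule."""
    state = {"schedule": None}  # most recently seen schedule number
    urls = []

    def handle(match):
        if match.group("schedule") is not None:
            # A schedule marker: remember it as the current state.
            state["schedule"] = match.group("schedule")
        elif state["schedule"] == wanted:
            # An option URL seen while in the wanted schedule: keep it.
            urls.append(match.group("url"))
        return match.group(0)  # leave the matched text unchanged

    # sub() is used only to drive the callback; its result is discarded.
    PATTERN.sub(handle, html)
    return urls


if __name__ == "__main__":
    for url in extract_schedule_urls(SAMPLE_HTML):
        print(url)
```

Because the two alternatives share one pattern, matches reach the callback in document order, which is what lets a single variable carry the "most recently seen schedule" state. The same loop could be written with finditer; "sub" is used here only because the text above describes it.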

Incidentally, I have found that Phil Schwartz' "Kodos" Python Regex Debugger makes it a lot faster to create and check REs. Many thanks, Phil!

Date: 07 January, 2012



Homepage: http://code.activestate.com/recipes/259143-html-page-scraping-of-the-quick-and-dirty-variety/?in=lang-python

Developer: Bill Bell

License: Python License

Operating System: Windows

