Link parsing (was: Getting error...)

by Gunnar Hjalmarsson, May 1, 2008 at 08:03 PM

hotkitty wrote:
> I ultimately want to go to cnn.com/politics, follow all links under 
> the "Election Coverage" headline and, w/in those links, save all the 
> links under the "Don't Miss" sections that appear in those stories. 
> However, after many hours and trial & error I've yet to complete the 
> task. I know mechanize can do this somehow but I've yet to figure out 
> how to put it all together.

It's not so much about putting it together; it's more like writing Perl 
code step by step...

> Here's the script I have so far, which gets me to only step one:

http://www.mail-archive.com/beginners%40perl.org/msg93769.html

Actually, I'm not sure that the code you have even gets you to step one.

As a parsing exercise, I wrote the code below. I chose to make use of 
LWP::Simple and HTML::TokeParser. Please study the docs for the latter: 
http://search.cpan.org/perldoc?HTML::TokeParser
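
To get a feel for what the parser hands back before reading the full script, here is a minimal sketch that runs HTML::TokeParser over an inline HTML string instead of a fetched page. The markup and the @found array are made up for illustration; the real CNN pages will look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Made-up markup for illustration only; not the real CNN page.
my $html = <<'HTML';
<div>Election coverage</div>
<ul>
  <li><a href="/2008/POLITICS/story1.html">First story</a></li>
  <li><a href="/2008/POLITICS/story2.html">Second story</a></li>
</ul>
<a href="/ignored.html">After the list</a>
HTML

my $p = HTML::TokeParser->new(\$html);

# skip ahead to the div whose text marks the section
while ( $p->get_tag('div') ) {
    last if $p->get_text eq 'Election coverage';
}

# start tag token: [ 'S', $tag, \%attr, \@attrseq, $text ]
# end tag token:   [ 'E', $tag, $text ]
my @found;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
        push @found, $token->[2]{href};
    }
    # stop at the first closing </ul> after the headline
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

print "$_\n" for @found;
```

Running it should print the two hrefs inside the list and skip the link that comes after the closing </ul>.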


#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $domain = 'http://edition.cnn.com';
my $uri = $domain . '/POLITICS/';

my $html = get($uri) or die "Fetching $uri failed";
my $p = HTML::TokeParser->new(\$html);

# go to start position in the document
while ( $p->get_tag('div') ) {
     last if $p->get_text eq 'Election coverage';
}

# extract links
my @links;
while ( my $token = $p->get_token ) {
     if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
         push @links, $token->[2]{href};
     }
     last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

foreach my $uri ( map $domain . $_, @links ) {
     my $html = get($uri) or warn "Fetching $uri failed" and next;
     my $p = HTML::TokeParser->new(\$html);

     # go to start position in the document
     $p->get_tag('h4');
     unless ( $p->get_text eq "Don't Miss" ) {
         warn "Didn't find section \"Don't Miss\"";
         next;
     }

     print "$uri\n";

     # extract links
     while ( my $token = $p->get_token ) {
         if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
             print '  ', $p->get_text, "\n";
             my $uri = substr($token->[2]{href}, 0, 4) eq 'http' ?
               $token->[2]{href} : $domain . $token->[2]{href};
             print "  $uri\n\n";
         }
         last if $token->[0] eq 'E' and $token->[1] eq 'ul';
     }
}
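
The substr() test above treats any href that doesn't start with 'http' as a path to prepend $domain to. A possible alternative, which is not what the script above does, is to let the URI module resolve each href against a base address. A minimal sketch, assuming the URI module is installed (it normally comes along with LWP) and using made-up href values:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI;

my $domain = 'http://edition.cnn.com';

# hypothetical href values, for illustration only
my @hrefs = (
    '/2008/POLITICS/05/01/example/index.html',
    'http://www.example.com/elsewhere.html',
);

# new_abs() resolves a relative reference against the base
# and leaves an already absolute address unchanged
my @resolved = map { URI->new_abs( $_, $domain )->as_string } @hrefs;

print "$_\n" for @resolved;
```

The first value gets $domain in front of it and the second is printed as it is, so the result matches the substr() approach for these two cases while also coping with hrefs such as '../index.html'.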

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl



