Perl link extractor

Below is a simple script I use almost daily and thought I would share with you all. It fetches one or more URLs, extracts the hyperlinks (the a tags) from each fetched page, cleans them up a bit, and prints the results to standard output.

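#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple;

if(!@ARGV){
        print "Usage:\n        $0 URL URL URL ...\n";
        exit;
}

# Fetch and parse each URL given on the command line
for my $link (@ARGV){
        my $page = get($link);
        if(!defined $page){
                # get() returns undef when the fetch fails
                warn "Could not fetch $link\n";
                next;
        }

        my $LX = HTML::LinkExtractor->new;
        $LX->parse(\$page);

        # figure out URL base (scheme and host), used for root-relative links.
        my $base = '';
        if($link =~ /^(https?:\/{2}[^\/]+)\/?/i){
                $base = $1;
        }

        # If the URL does not end in a slash, drop its last path component
        # so relative links resolve against the containing directory.
        if($link !~ /\/$/){
                my @link = split(/\//, $link);
                pop @link;
                $link = join('/', @link);
        }

        for(@{ $LX->links } ){
                # only anchor tags that actually carry an href
                next unless lc $_->{tag} eq 'a' && defined $_->{href};

                my $url;
                if($_->{href} =~ /^https?:\/\//i){
                        # already absolute, print as-is
                        $url = $_->{href};
                } elsif($_->{href} =~ /^\//){
                        # root-relative, prepend scheme and host
                        $url = $base . $_->{href};
                } else {
                        # relative to the current directory; note that a URL
                        # given with a trailing slash yields a doubled slash here
                        $url = $link . '/' . $_->{href};
                }

                print qq{$url\n};
        }
}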

I call it linkext. Here is an example that lists all of the links on the Intel Lustre download page:

$ linkext https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//?C=N;O=D
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//?C=M;O=A
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//?C=S;O=A
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//?C=D;O=A
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-debuginfo-common-x86_64-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-devel-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-firmware-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-headers-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-debuginfo-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-dkms-2.6.0-1.el6.noarch.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-iokit-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-modules-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-osd-ldiskfs-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-osd-zfs-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-source-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-tests-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//perf-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//perf-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//python-perf-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//python-perf-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//sha256sum
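
The doubled slash in the URLs above comes from joining a URL that already ends in / with another /. Servers generally tolerate it, but if you want cleaner output, a small filter can collapse it. What follows is only a minimal sketch and not part of linkext: normalize_slashes is a name made up for illustration, and the example.com URL is a placeholder. Be aware that it also collapses repeated slashes inside a query string, which may not always be what you want.

```shell
#!/bin/sh
# Hypothetical helper (not part of linkext): collapse runs of slashes
# into a single slash. The character before the run must be neither
# ":" nor "/", so the "//" right after the scheme is left alone.
normalize_slashes() {
    sed -E 's#([^:/])/+#\1/#g'
}

# Placeholder URL standing in for one line of linkext output.
printf '%s\n' 'https://example.com/RPMS/x86_64//sha256sum' | normalize_slashes
# prints: https://example.com/RPMS/x86_64/sha256sum
```

In a pipeline it would sit right after linkext, before any filtering.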

To make this a bit more useful, let's filter the output with egrep and pass the matching URLs to xargs, which runs wget to fetch our files:

$ linkext https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/ | egrep "\.rpm$" | xargs wget

This starts downloading the various files. You can of course also have xargs run echo instead, to display the full wget command line without executing it:

$ linkext https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/ | egrep "\.rpm$" | xargs echo wget
wget https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-debuginfo-common-x86_64-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-devel-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-firmware-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//kernel-headers-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-debuginfo-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-dkms-2.6.0-1.el6.noarch.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-iokit-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-modules-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-osd-ldiskfs-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-osd-zfs-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm 
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-source-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//lustre-tests-2.6.0-2.6.32_431.20.3.el6_lustre.x86_64.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//perf-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//perf-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//python-perf-2.6.32-431.20.3.el6_lustre.x86_64.rpm https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64//python-perf-debuginfo-2.6.32-431.20.3.el6_lustre.x86_64.rpm
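
By default xargs packs as many URLs as it can onto one wget command line. If you would rather run one wget per URL, for instance to make a failed download easier to spot, xargs -n 1 does that. The sketch below is a dry run under stated assumptions: the example.com links are placeholders standing in for linkext output, and dry_run is a hypothetical name used only here.

```shell
#!/bin/sh
# Placeholder links standing in for linkext output (not real downloads).
links='https://example.com/RPMS/?C=N;O=D
https://example.com/RPMS/lustre-2.6.0.rpm
https://example.com/RPMS/perf-2.6.32.rpm
https://example.com/RPMS/sha256sum'

# Keep only the .rpm links (grep -E is equivalent to egrep), then run
# one command per URL. The "echo" makes this a dry run; remove it to
# actually invoke wget.
dry_run() {
    printf '%s\n' "$links" | grep -E '\.rpm$' | xargs -n 1 echo wget
}

dry_run
# prints:
# wget https://example.com/RPMS/lustre-2.6.0.rpm
# wget https://example.com/RPMS/perf-2.6.32.rpm
```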

I hope you all find this useful.
