Creating a web site checker using R

Creating a web site checker using R

chrishold
I use R a great deal, but its considerable web crawling power isn't something I've used. I don't want to reinvent a cyberwheel and I suspect someone has already done what I want: a program that would run once a day (easy for me to set up as a cron task), crawl a single root of a web site (mine), and get the file size and a CRC or some similar check value for each page as pulled off the site (and, obviously, I'd want it not to follow off-site links). The other key thing would be for it to store the values and URLs and be capable of being run in "create/update database" mode or in "check pages" mode, and for the check mode run to Email me a warning if a page changes.  The reason I want this is that two of my sites have recently had content "disappear": neither I nor the ISP can see what happened, and we lack the very useful diagnostic of the date when the change happened, which might have mapped it to some component of WordPress, plugins or themes having been updated.

I have failed to find anything like this, and all the services that offer site checking of this sort are prohibitively expensive for me (my sites are zero income and either personal or offering free utilities and information).

If anyone has done this, or something similar, I'd love to hear if you were willing to share it.  Failing that, I think I will have to create this but I know it will take me days as this isn't my area of R expertise and as, to be brutally honest, I'm a pretty poor programmer.  If I go that way, I'm sure people may be able to point me to things I may be (legitimately) able to recycle in parts to help construct this.

Thanks in advance,

Chris

--
Chris Evans <[hidden email]> Skype: chris-psyctc
Visiting Professor, University of Sheffield <[hidden email]>
I do some consultation work for the University of Roehampton <[hidden email]> and other places but this <[hidden email]> remains my main Email address.
I have "semigrated" to France, see: https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to book to talk, I am trying to keep that to Thursdays and my diary is now available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take you to my blog which started with earlier joys in France and Spain!

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Creating a web site checker using R

Enrico Schumann-2
>>>>> "Chris" == Chris Evans <[hidden email]> writes:

    Chris> I use R a great deal but the huge web crawling power of
    Chris> it isn't an area I've used. I don't want to reinvent a
    Chris> cyberwheel and I suspect someone has done what I want.
    Chris> That is a program that would run once a day (easy for
    Chris> me to set up as a cron task) and would crawl a single
    Chris> root of a web site (mine) and get the file size and a
    Chris> CRC or some similar check value for each page as pulled
    Chris> off the site (and, obviously, I'd want it not to follow
    Chris> off site links). The other key thing would be for it to
    Chris> store the values and URLs and be capable of being run
    Chris> in "create/update database" mode or in "check pages"
    Chris> mode and for the check mode run to Email me a warning
    Chris> if a page changes.  The reason I want this is that two
    Chris> of my sites have recently had content "disappear":
    Chris> neither I nor the ISP can see what's happened and we
    Chris> are lacking the very useful diagnostic of the date when
    Chris> the change happened which might have mapped it to some
    Chris> component of WordPress, plugins or themes having
    Chris> updated.

    Chris> I have failed to find anything like this and all the services
    Chris> that offer site checking of this sort are prohibitively
    Chris> expensive for me (my sites are zero income and either
    Chris> personal or offering free utilities and information).

    Chris> If anyone has done this, or something similar, I'd love
    Chris> to hear if you were willing to share it.  Failing that,
    Chris> I think I will have to create this but I know it will
    Chris> take me days as this isn't my area of R expertise and
    Chris> as, to be brutally honest, I'm a pretty poor
    Chris> programmer.  If I go that way, I'm sure people may be
    Chris> able to point me to things I may be (legitimately) able
    Chris> to recycle in parts to help construct this.

    Chris> Thanks in advance,

    Chris> Chris


Not an answer, but perhaps two pointers/ideas:

1) Since you know cron, I suppose you work on a
   Unix-like system, and you likely either have a
   programme called 'wget' installed or can easily
   install it. 'wget' has an option '--mirror', which
   allows you to mirror a website.

2) There is tools::md5sum for computing checksums. You
   could store those in a file and check for changes in
   the files' content (e.g. via 'diff').
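
The two pointers above could be combined roughly as follows. This is
only a sketch under stated assumptions: the function names
(mirror_site, snapshot, compare_snapshots, run_check), the CSV layout
and the default file name are made up for illustration and do not come
from any existing package. It assumes 'wget' is on the PATH; wget's
recursive retrieval stays on the starting host unless told otherwise,
which should cover the "no off-site links" requirement.

```r
# Sketch only: all function names and the CSV layout are hypothetical.

# Mirror a site into `dest` with wget's --mirror option.
# Nothing is downloaded until this function is called.
mirror_site <- function(url, dest) {
  system2("wget", c("--mirror", "--no-verbose",
                    paste0("--directory-prefix=", dest), url))
}

# Record size and MD5 checksum of every file below `dir`.
snapshot <- function(dir) {
  rel  <- list.files(dir, recursive = TRUE)   # paths relative to dir
  full <- file.path(dir, rel)
  data.frame(path = rel,
             size = file.size(full),
             md5  = unname(tools::md5sum(full)),
             stringsAsFactors = FALSE)
}

# Return the paths that were added, removed or changed.
compare_snapshots <- function(old, new) {
  both <- merge(old, new, by = "path", all = TRUE,
                suffixes = c(".old", ".new"))
  both$status <- ifelse(is.na(both$md5.old), "added",
                 ifelse(is.na(both$md5.new), "removed",
                 ifelse(both$md5.old != both$md5.new, "changed", "same")))
  both[both$status != "same", c("path", "status")]
}

# mode = "update": (re)create the stored checksums in `db`.
# mode = "check":  compare the current files against `db`.
run_check <- function(dir, db = "site-checksums.csv",
                      mode = c("check", "update")) {
  mode <- match.arg(mode)
  current <- snapshot(dir)
  if (mode == "update") {
    write.csv(current, db, row.names = FALSE)
    return(invisible(NULL))
  }
  stored <- read.csv(db, colClasses = c("character", "numeric", "character"))
  compare_snapshots(stored, current)
}

# Example with a placeholder URL and directory:
# mirror_site("https://www.example.org/", "mirror")
# run_check("mirror", mode = "update")  # first run: create the database
# print(run_check("mirror"))            # later runs: list differences
```

If the check is run from cron, printing the result may already be
enough for the warning, since cron is usually set up to mail a job's
output to its owner. Note that dynamically generated pages can differ
between two fetches even when the visible content is unchanged, so
some false alarms are to be expected.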


regards
        Enrico
--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
