Developing a web crawler


Developing a web crawler

antujsrv
Hi,

I wish to develop a web crawler in R. I have been using the functionality available in the RCurl package.
I am able to extract the HTML content of a site, but I don't know how to go about analyzing the HTML-formatted document.
I want to know the frequency of a word in the document, but I am only acquainted with analyzing data sets.
So how should I go about analyzing data that is not available in table format?

A few chunks of code that I wrote:

w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
write.table(w, "test.txt")
t <- readLines(w)

readLines also didn't prove to be of any help.

Any help would be highly appreciated. Thanks in advance.
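A note on why the last line above fails: getURL() returns the whole page as a single character string, while readLines() expects a file path or connection, so readLines(w) tries to open the HTML text itself as a file name. A minimal sketch of the fix, with a literal string standing in for the fetched page so it runs offline:

```r
# getURL() returns the page as one long string; readLines() wants a
# file or connection. Wrap the string in textConnection() to read it
# line by line. A literal string stands in for the fetched page here.
w <- "<html>\n<body>hello</body>\n</html>"

con <- textConnection(w)
lines <- readLines(con)   # one element per line of the page
close(con)
length(lines)             # 3
```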
Re: Developing a web crawler

rex.dwyer
Perl seems like a 10x better choice for the task, but try looking at the examples in ?strsplit to get started.
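As a sketch of that ?strsplit route applied to the word-frequency question (a literal HTML snippet stands in for the getURL() output so the example runs offline):

```r
# Strip tags, split on non-letters, and tabulate word frequencies.
html <- "<p>The Kindle is light. I love the Kindle reader.</p>"

text  <- gsub("<[^>]+>", " ", html)                        # drop HTML tags
words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))  # split on non-letters
words <- words[nzchar(words)]                              # drop empty tokens
freq  <- sort(table(words), decreasing = TRUE)             # word frequency table
freq[["kindle"]]                                           # 2
```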

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Developing a web crawler / R "webkit" or something similar?

Mike Marchywka
In reply to this post by antujsrv

> I wish to develop a web crawler in R. I have been using the functionalities
> available under the RCurl package.
> I am able to extract the html content of the site but i don't know how to go
> about analyzing the html formatted document.

In general this can be a big effort, but there may be things in
text-processing packages you could adapt to handle HTML and JavaScript.
However, I guess what I'd be looking for is something like a "webkit"
package, or another open-source browser engine with or without an R
interface. That could actually be an ideal solution for a lot of things,
since you get all the content handlers of at least some browser.


Now that you mention it, I wonder if there are browser plugins to handle
"R" content. (I'd have to give this some thought: put a script up as
a web page with MIME type "text/R" and have the browser execute it in R.)




Re: Developing a web crawler

Alexy Khrabrov
In reply to this post by antujsrv

On Mar 3, 2011, at 4:22 AM, antujsrv wrote:
>
> I wish to develop a web crawler in R.

As Rex said, there are faster languages for this, but R string processing has gotten better thanks to the stringr package (R Journal 2010-2). When Hadley is done with it, it will be like having it all in R!
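For what stringr buys you on the word-frequency question, a small sketch (assuming the stringr package is installed):

```r
library(stringr)  # install.packages("stringr") if needed

# Count case-insensitive occurrences of a word in page text.
text <- "The Kindle is light. I love the Kindle."
str_count(tolower(text), fixed("kindle"))  # 2
```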

-- Alexy

Re: Developing a web crawler

Btibert3
As mentioned, R might not be the best language for this, but it is certainly capable for smaller-scale projects. I have used R to crawl sites for data in the past, mostly when I knew the structure and the pages that I wanted in advance.

I have used the XML package along with stringr to accomplish my goals. Check out these links, which might give you some ideas.

http://stackoverflow.com/questions/3713810/read-html-table-in-r-troubleshooting
http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r
http://www.brocktibert.com/blog/2011/02/08/create-a-web-crawler-in-r/

HTH
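A sketch of the XML-plus-XPath approach those links describe (assuming the XML package is installed; a literal page stands in for RCurl::getURL() output so the example runs offline):

```r
library(XML)  # install.packages("XML") if needed

# Parse HTML and pull out links and link text with XPath.
html <- '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'

doc   <- htmlParse(html, asText = TRUE)
links <- unname(xpathSApply(doc, "//a/@href"))  # "/page1" "/page2"
texts <- xpathSApply(doc, "//a", xmlValue)      # "One" "Two"
free(doc)
links
```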

Re: Developing a web crawler

Stefan Th. Gries
In reply to this post by antujsrv
Hi,

The book whose companion website is here
<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/qclwr.html>
deals with many of the things you need for a web crawler, and
assignment "other 5" on that site
(<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>)
is a web crawler.

Best,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries


Re: Developing a web crawler / R "webkit" or something similar? [off topic]

Matt Shotwell-3
In reply to this post by Mike Marchywka
On 03/03/2011 08:07 AM, Mike Marchywka wrote:

> Now that you mention it, I wonder if there are browser plugins to handle
> "R" content ( I'd have to give this some thought, put a script up as
> a web page with mime type "text/R" and have it execute it in R. )

There are server-side solutions for this sort of thing; see
http://rapache.net/ . Also, there was a string of messages on R-devel
some years ago addressing the MIME type issue, beginning here:
http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html . I don't
know whether there was a resolution; some suggestions were text/x-R,
text/x-Rd, and application/x-RData.

-Matt


--
Matthew S Shotwell, Assistant Professor
Department of Biostatistics, School of Medicine
Vanderbilt University


Re: Rapache ( was Developing a web crawler )

Mike Marchywka
The RApache demo looks like something I could use right away,
but I haven't looked into the handlers yet. I have now installed RApache
on my Debian system (I still have config issues, but I did get apache2 to restart).
Before I plow into this too far, how would this compare or compete with something
like a PHP library for Rserve? That is the approach I had been pursuing.

Thanks.




Re: Rapache ( was Developing a web crawler )

Matt Shotwell-2
On Sun, 2011-03-06 at 08:06 -0500, Mike Marchywka wrote:

> The rapache demo looks like something I could use right away
> but I haven't looked into the handlers yet. I have installed rapache now
> on my debian system ( still have config issues but I did get apach2 to restart LOL)
> Before I plow into this too far, how would this compare/compete with something
> like a PHP library for Rserve? That is the approach I had been pursuing.
>
> Thanks.

Hi Mike,

If you've built and configured RApache, then the difficult "plowing" is
over :). RApache operates at the top (HTTP) layer of the OSI stack,
whereas Rserve works at the lower transport/network layer. Hence, the
scope of Rserve applications is far more general. Extending Rserve to
operate at the HTTP layer (via PHP) will mean more work.

RApache offers high level functionality, for example, to replace PHP
with R in web pages. No interface code is necessary. Here's a simple
"What's The Time?" webpage using RApache and yarr [1] to handle the
code:

<< setContentType("text/html\n\n") >>
<html>
<head><title>What's The Time?</title></head>
<body><pre><</= cat(format(Sys.time(), usetz=TRUE)) >></pre></body>
</html>

Here's a live version: [2]. Interfacing PHP with Rserve in this context
would be useful if installation of R and/or RApache on the web host were
prohibited. A PHP/Rserve framework might also be useful in other
contexts, for example, to extend PHP applications (e.g. WordPress,
MediaWiki).

Best,
Matt

[1] http://biostatmatt.com/archives/1000
[2] http://biostatmatt.com/yarr/time.yarr


Re: Developing a web crawler

Evanescence
In reply to this post by antujsrv
Can I ask a question: do I need good math to develop a web crawler?
(I want to develop a simple web crawler to do something.)

Re: Rapache ( was Developing a web crawler )

Mike Marchywka
In reply to this post by Matt Shotwell-2
> Hi Mike,
>
> If you've built and configured RApache, then the difficult "plowing" is
> over :). RApache operates at the top (HTTP) layer of the OSI stack,
> whereas Rserve works at the lower transport/network layer. Hence, the
> scope of Rserve applications is far more general. Extending Rserve to
> operate at the HTTP layer (via PHP) will mean more work.

I finally got back to this and started from scratch on a clean machine.
It took most of the day, on and off, but downloading and building R, Apache,
and RApache was relatively easy, and the info page worked; I did have to go
get Cairo and various packages to get the graphics demo pages to work.
I'll probably have to play with it a bit to see whether I can use
it for anything useful, but getting it to run was not too difficult
(I think before, I didn't bother to build Apache from source, and the
failure mode wasn't very clear).


Re: Developing a web crawler

antujsrv
In reply to this post by Stefan Th. Gries
Hi Stefan,

Thanks for the links you shared in the post, but I am unable to access the scripts and output; they require a password.
If you could let me know the password for the .rar file of "scripts_other 5", that would be really helpful.
Thanks in advance.