Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?


Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?

Moser, Gary
Greetings,

 

I am trying to get all of the text from a web page as if I "selected
all" on the page, pasted into a text file, and then read in the text
file with read.csv().

 

# this is the actual page I'm trying to acquire text from:

web.pg <- readLines("http://www.airweb.org/?page=574")

 

# then parsed in hopes of an easier structure to work with:

web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE)

 

Now I have a lovely html tree, but don't know the best way to get just
the text components (job descriptions, job titles, etc...) as they
appear on the web site. I'd like to do a little text mining and make a
wordcloud using the text. Can anybody suggest a method to achieve this
result?

 

Thank you,

 

Gary R. Moser

Institutional Research Analyst

Heald College

p <- 415.808.1533

f <- 415.808.1598

[hidden email] <mailto:[hidden email]>

 



Disclaimer: This communication may contain Heald College confidential and proprietary data. This message is intended only for the personal and confidential use of the designated recipients named above. If you are not the intended recipient of this message you are hereby notified that any review, dissemination, distribution or copying of this message is strictly prohibited. In addition, if you have received this message in error, please advise the sender by reply email and delete the message.


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?

Henrique Dallazuanna
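Use an XPath query. Parse with useInternalNodes = TRUE so that the tree can be queried with XPath. Note that web.pg passed in here is the character vector returned by readLines(), not the tree you already parsed:

library(XML)

# Re-parse the raw lines into an internal document that XPath can search
web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE, useInternalNodes = TRUE)

# Job title: the text of every <b> inside a <span class="normal">
xpathApply(web.pg, "//span[@class='normal']//b", xmlValue)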
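
If the goal is all of the visible text rather than just the titles, the same approach can be widened to select every text node under the body. The following is only a rough sketch: the html string is a made-up stand-in for the downloaded page (the real page's markup will differ), and page.text is a name chosen here for illustration. It skips the contents of script and style elements, which a copy and paste from the browser would not include either.

```r
library(XML)

# Hypothetical stand-in for the downloaded page; the real markup will differ.
html <- paste0(
  "<html><head><title>Jobs</title><style>b {color: red;}</style></head>",
  "<body><span class='normal'><b>Research Analyst</b></span>",
  "<p>Analyzes enrollment data.</p>",
  "<script>var x = 1;</script></body></html>"
)

# asText = TRUE tells the parser that 'html' is content, not a file name or URL.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Select every text node under <body>, skipping script and style contents.
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop strings that end up empty.
txt <- trimws(txt)
page.text <- txt[nzchar(txt)]
page.text
```

The resulting character vector can be collapsed with paste(page.text, collapse = " ") before it is handed to whatever text-mining tools are used for the word cloud.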

On Wed, Oct 26, 2011 at 9:36 PM, Moser, Gary <[hidden email]> wrote:

> Greetings,
>
>
>
> I am trying to get all of the text from a web page as if I "selected
> all" on the page, pasted into a text file, and then read in the text
> file with read.csv().
>
>
>
> # this is the actual page I'm trying to acquire text from:
>
> web.pg <- readLines("http://www.airweb.org/?page=574")
>
>
>
> # then parsed in hopes of an easier structure to work with:
>
> web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE)
>
>
>
> Now I have a lovely html tree, but don't know the best way to get just
> the text components (job descriptions, job titles, etc...) as they
> appear on the web site. I'd like to do a little text mining and make a
> wordcloud using the text. Can anybody suggest a method to achieve this
> result?



--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
