Using R htmlParse() for manipulating URLs to access multiple pages

Ilio Fornasero
I am trying to scrape a manual from the web. For privacy reasons I cannot post the exact URLs here, but their structure is as follows:

https://home.lala.com/bibi/blabla/chapter_i_organization/101_contracts/whatever/,DanaInfo=intranet.lala.com+
https://home.lala.com/bibi/blabla/chapter_i_organization/125_bills/,DanaInfo=intranet.lala.com+
https://home.lala.com/bibi/blabla/chapter_vii_operational_modalities/701_wonderwall_18_oasis/701_wonderwall_18_oasis/

and so forth. Of course, I do not want to scrape each URL one by one. Instead, I would like to parse the base URL and work onward from there:
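To make the "start from the base URL" idea concrete, here is a minimal sketch of collecting the chapter links from an index page and turning them into absolute URLs. The inline HTML, the `toc` div id, and the names `index_html`, `hrefs`, and `page_urls` are all hypothetical stand-ins, since the real markup of the page is not shown here.

```r
library(XML)

# Hypothetical stand-in for the index page at the base URL; the real
# markup is unknown, so this structure is an assumption.
index_html <- '
<html><body>
  <div id="toc">
    <a href="chapter_i_organization/101_contracts/">101 Contracts</a>
    <a href="chapter_i_organization/125_bills/">125 Bills</a>
    <a href="https://example.org/elsewhere/">External</a>
  </div>
</body></html>'

base_url <- "https://home.lala.com/bibi/blabla/"

# asText = TRUE marks the first argument as HTML content rather than
# a file name or URL.
doc <- htmlParse(index_html, asText = TRUE, encoding = "UTF-8")

# Collect every href attribute below the (assumed) table-of-contents div.
hrefs <- as.character(xpathSApply(doc, "//div[@id='toc']//a/@href"))

# Keep relative links only and prepend the base URL to each of them.
relative_links <- hrefs[!grepl("^https?://", hrefs)]
page_urls <- paste0(base_url, relative_links)

print(page_urls)
```

Each element of `page_urls` could then be parsed in a loop or with `lapply()`. Whether the real index page exposes its links this way is something to check in the page source first.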

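library(XML)

# Parse the index page at the base URL
baseurl <- htmlParse("https://home.lala.com/bibi/blabla/", encoding = "UTF-8")
# The second <strong> inside <div id="Page"> is expected to hold the number of pages
xpath <- "//div[@id='Page']/strong[2]"
GetAllPages <- as.numeric(xpathSApply(baseurl, xpath, xmlValue))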

However, this does not work; the result is empty:

> GetAllPages
numeric(0)

Any hint?
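For reference, `numeric(0)` is what `as.numeric()` returns when the XPath query matches no nodes at all. The sketch below reproduces that on two made-up page fragments: one that contains the `div` the XPath expects and one that does not (for example, a login page served in place of the manual). Both fragments and the helper `count_pages()` are hypothetical, not the real page or an existing function.

```r
library(XML)

# Hypothetical page fragments; the real page structure is unknown.
with_pager <- '<html><body><div id="Page"><strong>1</strong><strong>12</strong></div></body></html>'
without_pager <- '<html><body><div id="Login"><p>Please sign in</p></div></body></html>'

xpath <- "//div[@id='Page']/strong[2]"

# Hypothetical helper: parse HTML text and apply the XPath from the post.
count_pages <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
  hits <- xpathSApply(doc, xpath, xmlValue)
  # With no matching nodes, hits is an empty list and this yields numeric(0).
  as.numeric(hits)
}

found <- count_pages(with_pager)         # 12
not_found <- count_pages(without_pager)  # numeric(0)

print(found)
print(not_found)
```

A quick check along the same lines is to save the parsed document with `saveXML(baseurl)` and look for `id="Page"` in it, which shows whether the downloaded content is the page expected.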



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.