Scraping info from a web site?

Spencer Graves-4
Hi, All:


       What would you suggest for reading the data on members of the
US Congress and their positions on net neutrality from
"https://www.battleforthenet.com/scoreboard" into R?


       I found recommendations for the "rvest" package to "Easily
Harvest (Scrape) Web Pages".  I tried the following:


URL <- 'https://www.battleforthenet.com/scoreboard/'
library(rvest)
Bftn <- read_html(URL)
str(Bftn)


List of 2
  $ node:<externalptr>
  $ doc :<externalptr>
  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"


        However, I don't know what to do with <externalptr>.
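
       For reference, the <externalptr> entries are handles to the parsed
document that xml2 keeps in compiled code, so str() cannot display their
contents; such objects are normally queried through accessor functions
instead. A minimal sketch, assuming rvest and xml2 are installed. The HTML
string below is a made-up stand-in for the real page (its tag names, class,
and member name are hypothetical) so that the example runs offline:

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the real page, so no network access is needed
page <- read_html('<html><body>
  <div id="senate"><p class="psb-unknown">Jane Doe</p></div>
</body></html>')

class(page)                        # same classes as Bftn above

# Query the document with accessor functions rather than str()
body <- html_node(page, "body")    # first node matching a CSS selector
xml_name(body)                     # tag name of that node

first_p    <- html_node(page, "p")
node_text  <- html_text(first_p)           # text inside the node
node_class <- html_attr(first_p, "class")  # value of one attribute
```

Printing the object (just typing `page` at the prompt) also shows a readable
summary of the document, which str() does not.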


       The "Selectorgadget" vignette with rvest suggested selecting what
I wanted on the web page and pasting the resulting CSS selector as an
argument into "html_nodes".  This led me to try the following:


Bftn_nodes <- html_nodes(Bftn,
     '.psb-unknown , #house, #senate, #senate p')


str(Bftn_nodes)
List of 4
  $ :List of 2
   ..$ node:<externalptr>
   ..$ doc :<externalptr>
   ..- attr(*, "class")= chr "xml_node"
  $ :List of 2
   ..$ node:<externalptr>
   ..$ doc :<externalptr>
   ..- attr(*, "class")= chr "xml_node"
  $ :List of 2
   ..$ node:<externalptr>
   ..$ doc :<externalptr>
   ..- attr(*, "class")= chr "xml_node"
  $ :List of 2
   ..$ node:<externalptr>
   ..$ doc :<externalptr>
   ..- attr(*, "class")= chr "xml_node"
  - attr(*, "class")= chr "xml_nodeset"


       This seems like it may be progress, but I'm still confused about
what to do next.  Or maybe I should be using a different package?  Or
should I post this question somewhere else, such as StackOverflow.com?
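
       One possible next step, sketched under assumptions: a nodeset like
the one above can be turned into ordinary R vectors with html_text() and
html_attr(), and those can be combined into a data frame. The markup below
is hypothetical (the names, the "data-state" attribute, and the layout are
invented for illustration), and the real scoreboard may be structured
differently or may fill in its content with JavaScript, in which case
read_html() would not see it:

```r
library(rvest)

# Hypothetical markup; the real scoreboard's structure may differ
page <- read_html('<html><body>
  <div id="senate">
    <p class="psb-unknown" data-state="CA">Jane Doe</p>
    <p class="psb-unknown" data-state="TX">John Roe</p>
  </div>
</body></html>')

# A nodeset of all matching nodes, analogous to Bftn_nodes
nodes <- html_nodes(page, "#senate p")

# Pull plain R vectors out of the nodeset, one element per node
members <- data.frame(
  name  = html_text(nodes, trim = TRUE),
  state = html_attr(nodes, "data-state"),
  stringsAsFactors = FALSE
)
print(members)
```

If html_text() on the real nodes returns empty strings, that would suggest
the data are loaded after the page is fetched and are not in the static HTML.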


       Thanks,
       Spencer Graves

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.