R web-scraping a multiple-level page


R web-scraping a multiple-level page

Ilio Fornasero
Hello.

I am trying to scrape a FAO web page that contains multiple links; from each of them I would like to collect the "News" section.

So far, I have done this:

library(rvest)   # also provides the %>% pipe

fao_base <- 'http://www.fao.org'
fao_second_level <- paste0(fao_base, '/countryprofiles/en/')  # was paste0(stem, ...); 'stem' is undefined

# collect the href of every country-profile link and make it an absolute URL
all_children <- read_html(fao_second_level) %>%
  html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
  html_text() %>%
  paste0(fao_base, .)

Any suggestion on how to go on? I guess a loop is needed, but I haven't had any success with one yet.
Thanks



Re: R web-scraping a multiple-level page

Boris Steipe
For similar tasks I usually write a while loop operating on a queue. Conceptually:

initialize queue with first page
add first url to harvested urls

while queue not empty (2)
  shift the next url off the queue
  collect valid child pages that are not already in harvested list (1)
  add to harvested list
  add to queue

process all harvested pages



(1) - grep for the base url so you don't leave the site
    - use %in% to ensure you are not caught in a cycle

(2) Restrict the condition with a maximum number of cycles. More often than not assumptions about the world turn out to be overly rational.
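
A rough, untested sketch of that loop in R (extract_news() below is only a
placeholder for however you pull the "News" part out of a page):

library(rvest)   # also provides the %>% pipe

fao_base   <- "http://www.fao.org"
queue      <- paste0(fao_base, "/countryprofiles/en/")  # initialize queue with first page
harvested  <- queue                                     # add first url to harvested urls
max_cycles <- 500                                       # (2) hard cap on iterations
cycles     <- 0

while (length(queue) > 0 && cycles < max_cycles) {
  cycles <- cycles + 1
  url    <- queue[1]                                    # shift the next url off the queue
  queue  <- queue[-1]

  children <- read_html(url) %>%
    html_nodes(xpath = '//a[contains(@href, "?iso3=")]') %>%
    html_attr("href") %>%
    paste0(fao_base, .)

  children <- children[grepl(fao_base, children, fixed = TRUE)]  # (1) don't leave the site
  children <- children[!(children %in% c(harvested, queue))]     # (1) %in% avoids cycles

  harvested <- c(harvested, children)                   # add to harvested list
  queue     <- c(queue, children)                       # add to queue
}

# process all harvested pages, e.g.
# news <- lapply(harvested, extract_news)               # extract_news() is up to you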

Hope this helps,
B.




> On 2019-04-10, at 04:35, Ilio Fornasero <[hidden email]> wrote:
>
> I am trying to scrape a FAO web page that contains multiple links; from each of them I would like to collect the "News" section.
>
> [snip]


Re: R web-scraping a multiple-level page

chrishold


----- Original Message -----
> From: "Boris Steipe" <[hidden email]>
> To: "Ilio Fornasero" <[hidden email]>
> Cc: [hidden email]
> Sent: Wednesday, 10 April, 2019 12:34:15
> Subject: Re: [R] R web-scraping a multiple-level page

[snip]
 
> (2) Restrict the condition with a maximum number of cycles. More often than not
> assumptions about the world turn out to be overly rational.

Brilliant!! Fortune nomination?

And the advice was useful to me too, though I'm not the OP.

Thanks,

Chris

--
Chris Evans <[hidden email]> Skype: chris-psyctc
Visiting Professor, University of Sheffield <[hidden email]>
I do some consultation work for the University of Roehampton <[hidden email]> and other places but this <[hidden email]> remains my main Email address.
I have "semigrated" to France, see: https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to book to talk, I am trying to keep that to Thursdays and my diary is now available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take you to my blog which started with earlier joys in France and Spain!
