Scraping from different level URLs website


Ilio Fornasero
I am doing research on World Bank (WB) projects in developing countries. To do so, I am scraping their website to collect the data I am interested in.

The structure of the webpage I want to scrape is the following:

  1.  List of countries: all countries in which the WB has developed projects <http://projects.worldbank.org/country?lang=en&page=>

1.1. Clicking on a single country in 1. gives that country's project list, which spans many pages and includes all projects in that country <http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=3A>. I have linked just one page for one country here, but every country has several such pages.

1.1.1. Clicking on a single project in 1.1. gives, among other things, the project's overview tab <http://projects.worldbank.org/P155642/?lang=en&tab=overview>, which is what I am interested in.

In other words, my problem is to find a way to create a data frame that includes all the countries, a complete list of all projects for each country, and an overview of each project.
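The three levels described above (countries, projects per country, overview per project) can be sketched as a nested crawl. This is only an illustration under stated assumptions: it runs against a small made-up in-memory "site" rather than the World Bank pages, the CSS classes (`a.country`, `a.project`, `.overview`) and the helpers `fetch_page()` and `links_from()` are hypothetical, and pagination of the per-country lists is left out.

```r
library(rvest)

# Hypothetical in-memory "site": URL -> HTML. In a real run, fetch_page()
# would call read_html(url) on the live site, with a Sys.sleep() between hits.
fake_site <- list(
  "/countries"  = '<a class="country" href="/country/AA">Aland</a>
                   <a class="country" href="/country/BB">Borduria</a>',
  "/country/AA" = '<a class="project" href="/project/P1">Roads</a>
                   <a class="project" href="/project/P2">Water</a>',
  "/country/BB" = '<a class="project" href="/project/P3">Schools</a>',
  "/project/P1" = '<div class="overview">Build rural roads</div>',
  "/project/P2" = '<div class="overview">Improve water supply</div>',
  "/project/P3" = '<div class="overview">Train teachers</div>'
)
fetch_page <- function(url) read_html(fake_site[[url]])

# Return the text and href of every link matching a CSS selector.
links_from <- function(url, css) {
  nodes <- html_nodes(fetch_page(url), css)
  data.frame(name = html_text(nodes, trim = TRUE),
             url  = html_attr(nodes, "href"),
             stringsAsFactors = FALSE)
}

# Level 1: countries.
countries <- links_from("/countries", "a.country")

# Levels 2 and 3: projects of one country, then each project's overview.
crawl_country <- function(i) {
  projects <- links_from(countries$url[i], "a.project")
  overview <- vapply(projects$url, function(u) {
    html_text(html_node(fetch_page(u), ".overview"), trim = TRUE)
  }, character(1), USE.NAMES = FALSE)
  data.frame(country     = rep(countries$name[i], nrow(projects)),
             project     = projects$name,
             project_url = projects$url,
             overview    = overview,
             stringsAsFactors = FALSE)
}

# One row per project, with its country and overview text.
wb <- do.call(rbind, lapply(seq_len(nrow(countries)), crawl_country))
```

The point of the design is that each level only needs the URLs produced by the level above it, so the same `links_from()` helper serves levels 1 and 2, and only the last level extracts free text.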


Here is the code that I have (unsuccessfully) written:

library(rvest)
library(purrr)
library(tibble)
library(dplyr)

WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

# Fetch one search page and collect project titles and links
WB_proj <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)

  tibble(title = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

# Run over several pages, then try to fetch a description for each project
WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~ read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                 html_node() %>%
                 html_text()))
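
Two things in the code above may be worth checking, sketched here offline on a made-up HTML fragment. The real page's markup is not shown in this thread, so the structure below (a `.grid_20` wrapper around each link) and the project ID `P000001` are assumptions for illustration only. First, `html_attr("href")` only returns a value for nodes that actually carry an `href` attribute, normally `<a>` elements. Second, `html_node()` needs a `css` or `xpath` argument and fails when called with neither.

```r
library(rvest)

# Made-up fragment: the class sits on a wrapper, the href on the <a> inside.
page <- read_html('
  <div class="grid_20"><a href="/P155642/?lang=en&amp;tab=overview">Project A</a></div>
  <div class="grid_20"><a href="/P000001/?lang=en&amp;tab=overview">Project B</a></div>')

# Asking the wrapper for href gives NA; asking the nested <a> gives the link.
on_div  <- html_attr(html_nodes(page, ".grid_20"), "href")
on_link <- html_attr(html_nodes(page, ".grid_20 a"), "href")

# Relative links need the site root prepended before they can be fetched.
project_url <- paste0("http://projects.worldbank.org", on_link)
```

If the hrefs on the real page turn out to be project links like these, they would be fetched directly, rather than being substituted into the `countrycode_exact=%s` search URL.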


Any suggestions?

Note: I am sorry if this question seems trivial, but I am quite new to R and I haven't found any help on this by looking around (though I could have missed something, of course).



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Scraping from different level URLs website

Jeff Newmiller
They seem to release their data in XML and CSV formats also... why are you scraping?
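
If such an export is available, the per-country project table could be built from a file instead of from web pages. A minimal sketch, using a small inline CSV as a stand-in for a downloaded file: the column names and rows below are invented for illustration and are not the World Bank's actual schema.

```r
# Stand-in for a downloaded export; column names and rows are hypothetical.
csv_text <- "country,project_id,project_name
Aland,P1,Roads
Aland,P2,Water
Borduria,P3,Schools"

projects <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The table already has one row per project; split or count by country as needed.
per_country <- split(projects, projects$country)
n_projects  <- vapply(per_country, nrow, integer(1))
```

With a real file, `read.csv(text = csv_text)` would become `read.csv()` on the downloaded file's path.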
--
Sent from my phone. Please excuse my brevity.

On January 23, 2018 9:31:01 AM PST, Ilio Fornasero <[hidden email]> wrote:

>I am doing research on World Bank (WB) projects in developing
>countries. To do so, I am scraping their website in order to collect
>the data I am interested in.
