Web scraping different levels of a website

Web scraping different levels of a website

David Jankoski
Hey Ilio,

On the main website (the first link that you provided) if you
right-click on the title of any entry and select Inspect Element from
the menu, you will notice in the Developer Tools view that opens up
that the corresponding html looks like this

(example for the same link that you provided)

<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View study">
    <div class="data-access-icon data-access-remote" title="Data available from external repository"></div>
    <h2 class="title">
        <a href="http://catalog.ihsn.org/index.php/catalog/7118" title="Demographic and Health Survey 2015">
            Demographic and Health Survey 2015
        </a>
    </h2>

Notice how the number you are after is contained within the
"survey-row" div element, in the data-url attribute, or alternatively
within the <a> element, in the href attribute. It's up to you which
one you grab, but the idea is the same, i.e.

1. read in the HTML
2. select all list elements by CSS / XPath
3. grab the forward link

Here is an example using the first option.

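# load libs
library("httr")   # GET(), content()
library("rvest")  # html_nodes(), html_attr(), %>%

url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# grab the data-url attribute of every "survey-row" div
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")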
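For comparison, the second option (the href of the <a> element) can be sketched on the markup quoted above without touching the network. This is only an illustration: the HTML string is a hand-made copy of the excerpt, with a closing </div> added so it parses cleanly, and the names links and study_id are made up here.

```r
library("rvest")  # re-exports read_html() and %>%

# hand-made copy of the excerpt above (closing </div> added for parsing)
html <- '
<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View study">
  <div class="data-access-icon data-access-remote" title="Data available from external repository"></div>
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/7118" title="Demographic and Health Survey 2015">
      Demographic and Health Survey 2015
    </a>
  </h2>
</div>'

page <- read_html(html)

# second option: the href attribute of the <a> inside the <h2>
links <-
  page %>%
  html_nodes("h2.title a") %>%
  html_attr("href")

# the study number is the last path segment of the link
study_id <- as.integer(basename(links))
```

Either attribute carries the same link here, so the choice mostly depends on whether you also want the title that sits on the <a> element.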

hth.
david

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Web scraping different levels of a website

David Jankoski
Hey Ilio,

I revisited the code I posted earlier and fixed a few things.
This should let you collect as many studies as you like, controlled by
the num_studies variable.

If you open the URL below in your browser, you can see that it returns
a "simpler" version of the page you linked to. To find it, hit F12 to
open Developer Tools --> go to the Network tab and click on the first
entry in the list --> in the right pane, under the Headers tab, you
should see the Request URL.
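
A possible reason the address-bar link and the Request URL differ, offered as a sketch rather than a diagnosis of this particular site: everything after "#" in a URL is a fragment, which the browser keeps to itself and never sends to the server. httr's parse_url() makes the difference visible. Both URLs are shortened to a few parameters for readability, and the names browser_url, request_url, b and r are made up for this example.

```r
library("httr")

# address-bar style link, as in the first message (shortened)
browser_url <- "http://catalog.ihsn.org/index.php/catalog#_r=&from=1890&ps=100"
# Request URL style link, as found in the Network tab (shortened)
request_url <- "http://catalog.ihsn.org/index.php/catalog/search?view=s&ps=100&from=1890"

b <- parse_url(browser_url)
r <- parse_url(request_url)

is.null(b$query)  # TRUE: no query string, so the server never sees ps or from
b$fragment        # the part after "#", which stays in the browser
r$query$ps        # "100": here the parameters are part of the request
```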

I'm not very knowledgeable about sessions, cookies and whatnot, so you
may run into further problems. In that case, repeat the steps above on
your side and paste the Request URL you find there into the code
below. I broke the URL into smaller chunks for readability and because
it's easier to substitute some of the query parameters.

# load libs
library("rvest")
library("httr")
library("glue")
library("magrittr")

# number of studies to pull from catalogue
num_studies <- 42
year_from <- 1890
year_to <- 2017

# build up the url
url <-
  glue(
    "http://catalog.ihsn.org/index.php/catalog/",
    "search?view=s&",
    "ps={num_studies}&",
    "page=1&repo=&repo_ref=&sid=&_r=&sk=&vk=&",
    "from={year_from}&",
    "to={year_to}&",
    "sort_order=&sort_by=nation&_=1516371984886")

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# option 1 (div with class "survey-row" --> data-url attribute)
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

# option 2 (study titles are <a> elements within <h2> elements)
# note that this gives you some more information, like the title ...
x %>%
  html_nodes("h2 a")
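
If one request does not return everything you need, the same URL pieces could be reused with the page number exposed. This is a sketch under one untested assumption: that the page parameter really selects the next batch of results. The helper name build_page_url is made up for illustration.

```r
library("glue")

# hypothetical helper: same URL pieces as above, with the page number exposed
build_page_url <- function(page, num_studies = 42, year_from = 1890, year_to = 2017) {
  glue(
    "http://catalog.ihsn.org/index.php/catalog/",
    "search?view=s&",
    "ps={num_studies}&",
    "page={page}&repo=&repo_ref=&sid=&_r=&sk=&vk=&",
    "from={year_from}&",
    "to={year_to}&",
    "sort_order=&sort_by=nation&_=1516371984886")
}

# glue() is vectorised, so a vector of page numbers gives one URL per page
urls <- as.character(build_page_url(1:3))

# each URL could then be fetched and parsed like the single page above, e.g.
# pages <- lapply(urls, function(u) content(GET(u)))
```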


greetings,
david

On 18 January 2018 at 12:58, David Jankoski <[hidden email]> wrote:


--

David Jankoski

Teerketelsteeg 1
1012TB Amsterdam
www.hellotrip.com


Re: Web scraping different levels of a website

Ilio Fornasero
Thanks again, David.

I am trying to figure out a way to convert the lists into a data.frame.

Any hint?

The usual ways (do.call, etc) do not seem to work...

Thanks

Ilio
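
One way this could be approached, as a sketch rather than a tested answer for the live site: html_nodes() returns an xml_nodeset (a list of node pointers, not a list of vectors), which is probably why do.call() and friends do not help. Pulling plain character vectors out with html_attr() and html_text() first and then passing them to data.frame() sidesteps that. The HTML string below stands in for the parsed page; its second row (id 1, "Placeholder study") is made up purely for illustration, as are the names rows and studies.

```r
library("rvest")  # re-exports read_html() and %>%

# stand-in for the search results; the second row is an invented placeholder
html <- '
<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/7118">
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/7118">
      Demographic and Health Survey 2015
    </a>
  </h2>
</div>
<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/1">
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/1">
      Placeholder study
    </a>
  </h2>
</div>'

x <- read_html(html)

# one node per study
rows <- x %>% html_nodes(".survey-row")

# extract atomic vectors first, then assemble the data.frame
studies <- data.frame(
  title = rows %>% html_node("h2 a") %>% html_text(trim = TRUE),
  url   = rows %>% html_attr("data-url"),
  stringsAsFactors = FALSE
)

# the study number is the last path segment of the link
studies$id <- as.integer(basename(studies$url))
```

Using html_node() (singular) on the row set keeps one title per row, so the columns stay aligned even if a row happens to lack a title; that row would get NA instead.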

________________________________
From: David Jankoski <[hidden email]>
Sent: Friday, 19 January 2018 15:58
To: [hidden email]; [hidden email]
Subject: Re: [R] Web scraping different levels of a website
