How to choose a button and scrape the website data

How to choose a button and scrape the website data

Guang Dai
Hi all,
I'm working on scraping some website data to build a database.
In most cases, I can use the XML package to get the dataset.
However, some websites don't give an explicit address for the downloaded tables.

To be more specific, I'm interested in the website http://ets.aeso.ca/
The data I want to scrape is the "Pool Weekly Summary" under the "Historical" category.
However, after clicking "Historical" and choosing the "Pool Weekly Summary" item on the website,
the address is always http://ets.aeso.ca/ and doesn't change.

In this case, I guess I need to tell R to first click the "Historical" button and then choose the item before
scraping the data. But the question is: how?

Any suggestions are welcome.
Guang
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: How to choose a button and scrape the website data

Tyler Ritchie
That website uses JavaScript to submit the form (and doesn't work in
Chrome). You could build a JavaScript interpreter in R, have it parse the
page, and then use the page's JavaScript to submit the form, but R just
isn't the right tool for that type of interaction.

Performing the task you want, as described, is possible, just not
reasonable with R. There are better tools for automating web pages, such
as Automato [1] or Sikuli [2].

But better would be to query the site directly. Checking the source of the
page, each of the different report types comes from a different URL. Passing
that URL arguments in the form of:

beginDate=03012012&endDate=03032012&SelectFormat=CSV

results in values from March 1st to 3rd of this year as a CSV. To find the
URLs of interest, view the page source and search for "Select a Report".
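
A minimal sketch of assembling such a query string, written in Python with only the standard library. The base URL is a hypothetical placeholder (the real report URLs have to be read from the page source), the function name build_report_url is made up for illustration, and the MMDDYYYY date layout is an assumption inferred from 03012012 standing for March 1st, 2012.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical placeholder: the real report URL must be taken from the page source.
REPORT_URL = "http://example.invalid/report"


def build_report_url(base_url, begin, end, fmt="CSV"):
    """Append beginDate, endDate and SelectFormat to a report URL."""
    params = {
        "beginDate": begin.strftime("%m%d%Y"),  # assumed MMDDYYYY, e.g. 03012012
        "endDate": end.strftime("%m%d%Y"),
        "SelectFormat": fmt,
    }
    return base_url + "?" + urlencode(params)


url = build_report_url(REPORT_URL, date(2012, 3, 1), date(2012, 3, 3))
print(url)
```

Once the real report URL is known, the resulting address could be handed to whatever CSV reader or HTTP client is convenient.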

Easier still might be to contact AESO and ask them for the data.

[1] http://automa.to/
[2] http://sikuli.org/

-Tyler

On Mon, Mar 5, 2012 at 10:38 AM, Guang Dai <[hidden email]> wrote:

> [...]

Re: How to choose a button and scrape the website data

Guang Dai
Thank you, Tyler.
I just had a quick read of automa.to and sikuli.org; they seem very promising.
Since I anticipate many other cases where a similar issue can arise, I don't mind spending some time learning something that is very efficient for the purpose. Any suggestions?
________________________________
From: Tyler Ritchie [mailto:[hidden email]]
Sent: Monday, March 05, 2012 1:40 PM
To: Guang Dai
Cc: [hidden email]
Subject: Re: [R] How to choose a button and scrape the website data

[...]