help getting a research project started on regulations.gov


help getting a research project started on regulations.gov

Drake Gossi
Hello everyone,

I will be using R to work with this data set
<https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064>.
Specifically, it covers proposed changes to Title IX, with over 11,000
publicly available comments. The end goal is to get each of those 11,000
comments into a csv file so that I can begin to manipulate and visualize
the data.

But I'm not there yet. I just applied for an API key and, while I have
one, I'm waiting for it to be activated. After that, though, I'm a little
lost. Do I need to scrape the comments from the site, or does having the
API render that unnecessary? There is an interactive console
<https://regulationsgov.github.io/developers/console/> that works with the
API, but I don't know whether I can get the data I need through it. I'm
still trying to figure out what JSON is.
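
For what it's worth, here is the kind of request I imagine building in R
once the key is activated. This is untested, and the endpoint, the
parameter names (dktid, dct, rpp, po), and MY_KEY are all my guesses from
the API console, so treat them as placeholders:

```r
library(httr)      # building and sending HTTP requests from R
library(jsonlite)  # turning the JSON the API returns into R objects

# Assemble the full request URL; GET(url) would then actually fetch it.
url <- modify_url(
  "https://api.data.gov/regulations/v3/documents.json",  # guessed endpoint
  query = list(
    api_key = "MY_KEY",            # the key I'm waiting on
    dktid   = "ED-2018-OCR-0064",  # the Title IX docket
    dct     = "PS",                # public submissions, i.e. comments
    rpp     = 25,                  # results per page
    po      = 0                    # page offset
  )
)

# JSON itself is nothing scary: fromJSON() turns it into ordinary lists
# and data frames. A made-up sample of what the API might return:
sample <- '{"documents":[{"documentId":"ED-2018-OCR-0064-0001",
                          "title":"Comment on Title IX"}]}'
docs <- fromJSON(sample)
# docs$documents is now a plain data frame, which write.csv() can dump.
```

If the real response has roughly that shape, there may be no scraping
needed at all.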

Or, if I do have to scrape the comments, can I do that with R? I can't
get a straight answer from the Python people: I can't tell whether I need
Beautiful Soup or Scrapy (or, as I said, whether I need to scrape at
all). The trouble with the comments is that each one sits at its own URL,
so, again assuming I have to scrape them, I don't know how to write code
that grabs all of the comments from all of the URLs.
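
In case it helps to show what I mean, here is the shape of the loop I
think I need, in R. The URLs and the ".comment-text" selector are
placeholders I invented; I don't yet know the real addresses or which
element holds the comment body:

```r
library(rvest)  # html downloading and parsing for R (wraps xml2)

# Placeholder: this vector of comment pages would have to be built first,
# e.g. from the docket listing or from the API.
urls <- c("https://www.regulations.gov/comment/PLACEHOLDER-0001",
          "https://www.regulations.gov/comment/PLACEHOLDER-0002")

scrape_comment <- function(page_or_url) {
  page <- read_html(page_or_url)
  # ".comment-text" is an invented selector; the real one has to be
  # found by inspecting the page source.
  html_text(html_node(page, ".comment-text"), trim = TRUE)
}

# Sanity check on a toy page (read_html also parses literal html):
toy <- '<div class="comment-text"> Example comment </div>'
scrape_comment(toy)

# On the real site it would be one scrape per URL, then one csv:
# comments <- vapply(urls, scrape_comment, character(1))
# write.csv(data.frame(url = urls, text = comments),
#           "comments.csv", row.names = FALSE)
```

So the loop itself seems easy in R; my problem is finding the right
selector and building the list of URLs.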

I am also trying to figure out how to isolate the text of the comments
in the html. From the Python people, I've heard the following:

scrapy fetch 'url'
will download the raw page you are interested in, and you can look at
the raw source code. It is important to appreciate that what you see in
the browser has often been processed by the browser before you see it.

Of course, a scraper can do the same processing, but it's complicated.
So start by looking at the raw source code. Maybe you can grab what you
need with simple parsing like Beautiful Soup does. Maybe you need to do
more. Scrapy is your friend.

Beautiful Soup is your friend here. It can analyze the data within the
html tags on your scraped page. But javascript is often used on 'modern'
web pages, so the page is actually not just html but javascript that
changes the html. For this you need another tool -- I think one is
called Scrapy. Others here probably have experience with that.
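
If I understand that advice, the R analogue of `scrapy fetch` is just
reading the raw, unrendered source, e.g. with readLines(). The toy page
below stands in for a downloaded comment page, since I can't show the
real one here:

```r
# On the real site this would be:
# raw <- readLines("https://www.regulations.gov/document?D=PLACEHOLDER",
#                  warn = FALSE)

# Toy stand-in: a page where javascript, not html, carries the content.
toy <- c("<html><body>",
         "<script>/* javascript that builds the comment div */</script>",
         "</body></html>")
tf <- tempfile(fileext = ".html")
writeLines(toy, tf)
raw <- readLines(tf, warn = FALSE)

# If the comment text is missing from the raw source (as here) but
# visible in the browser, javascript built it after the page loaded,
# and plain html parsing will never see it.
any(grepl("comment text", raw, fixed = TRUE))  # FALSE for this toy page
```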

I think part of my problem relates to that second piece of advice (the
part about javascript changing the html). I was saying things like:

I think what I might be looking for is a div with class "GIY1LSJIXD",
since that's where the hierarchy seems to taper off in the html for the
comment I'm looking to scrape.


What I'm trying to do here is locate the comment in the html so I can
tell the request function to extract it.
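
Concretely, what I was attempting looks something like this in rvest
(the toy html stands in for a downloaded page; I gather a class name
like GIY1LSJIXD is auto-generated, so it may change under me):

```r
library(rvest)

# Toy stand-in for one downloaded comment page.
toy_page <- read_html('<div class="GIY1LSJIXD">the comment body</div>')

# Select every div with that class and pull out its text. If this came
# back empty on the real page, the div is probably being injected by
# javascript after the page loads.
comment <- html_text(html_nodes(toy_page, "div.GIY1LSJIXD"), trim = TRUE)
comment  # "the comment body"
```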

Any help anyone could offer here would be much appreciated. I'm very lost.

Drake


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: help getting a research project started on regulations.gov

Bert Gunter-2
Please search yourself first!

Searching "scrape JSON from web" at the rseek.org site produced several
apparently relevant hits, especially this CRAN task view:
https://cran.r-project.org/web/views/WebTechnologies.html


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Feb 19, 2019 at 3:07 PM Drake Gossi <[hidden email]> wrote:

