Dataverse

Ilio Fornasero
Hello.

I am trying to find a way to retrieve data from the Harvard Dataverse website.
I usually don't have problems web-scraping data, but the problem here is that there are a bunch of data formats such as .tab, .7z and so on, and I just can't find a way to retrieve the data I am interested in with a single solution.
Any hints?




______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Dataverse (reading files with .tab and .7z suffixes)

Thomas Levine-4
In reply to this post by Ilio Fornasero
Ilio Fornasero writes:
> I am trying to find a way to retrieve data from the Harvard Dataverse website.
> I usually don't have problems web-scraping data, but the problem here is
> that there are a bunch of data formats such as .tab, .7z and so on, and
> I just can't find a way to retrieve the data I am interested in with a
> single solution.
> Any hints?

.tab does not identify a file format. That file might be delimited text
(read.csv or read.delim) or fixed-width text (read.fwf).
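
To illustrate the difference, here is a minimal sketch with made-up sample
data (the file contents and the column widths are assumptions for
illustration, not taken from Dataverse). The same two records are read once
as tab-delimited text and once as fixed-width text:

```r
# Hypothetical sample data; neither file comes from Dataverse.
tab_file <- tempfile(fileext = ".tab")
fwf_file <- tempfile(fileext = ".tab")

# Tab-delimited: fields separated by "\t", first line is a header.
writeLines(c("Country\tScore", "Albania\t7.7", "Algeria\t9.1"), tab_file)

# Fixed-width: 7 characters of name followed by 4 characters of number.
writeLines(c("Albania 7.7", "Algeria 9.1"), fwf_file)

delimited <- read.delim(tab_file, stringsAsFactors = FALSE)
fixed <- read.fwf(fwf_file, widths = c(7, 4),
                  col.names = c("Country", "Score"),
                  stringsAsFactors = FALSE)
```

Both calls should give a two-row data frame with the same columns; the point
is only that the suffix alone does not tell you which reader to use.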

No 7z decompressor seems to exist on CRAN (I checked `findFn('7z')`),
so you could use system/system2: `system2('7z', c('e', ...))`, or I think
7z.exe on Windows. You would need to install p7zip and read the manual
(`man 7z` on a Unix-like system).
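
A rough sketch of such a wrapper, assuming a `7z` binary is on the PATH. The
function names `sevenzip_args` and `extract_7z` are hypothetical, and the
sketch uses the `x` command (extract keeping directory structure) rather
than `e`:

```r
# Build the argument vector for `7z x` (extract with full paths).
# -o<dir> sets the output directory; -y answers "yes" to all prompts.
sevenzip_args <- function(archive, exdir) {
  c("x", shQuote(archive), paste0("-o", shQuote(exdir)), "-y")
}

# Hypothetical wrapper: fails early if no 7z binary is on the PATH.
extract_7z <- function(archive, exdir = tempfile("unpacked_")) {
  exe <- unname(Sys.which("7z"))
  if (!nzchar(exe)) stop("7z not found on PATH; install p7zip or 7-Zip")
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  status <- system2(exe, sevenzip_args(archive, exdir))
  if (status != 0) stop("7z exited with status ", status)
  # Return the paths of the extracted files.
  list.files(exdir, recursive = TRUE, full.names = TRUE)
}
```

Something like `extract_7z("data.7z")` would then return the extracted file
paths, which can be passed on to whichever reader fits each file.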

Please send an example.


Re: Dataverse (reading files with .tab and .7z suffixes)

Thomas Levine-4
Ilio Fornasero writes:

> Yet, I am at this point.
>
>
>
>
> ## 01. Finding the dataverse server and making a search
> Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
> dataverse_search(".Hunger")
>
>
> ## 02. Loading the dataset (in this example, I have chosen the word ".Hunger" to get
>    # one list and then picked one out of hundreds of results.
>    # The get_dataset() function has to be given the dynamic web address)
> (dataset_ifpri <- get_dataset("https://doi.org/10.7910/DVN/ZTCWYQ"))
>
> ## 03. Grabbing the (1st) file we are interested in
> AppendixC <- get_file("001_AppendixC.tab",
>                       "https://doi.org/10.7910/DVN/ZTCWYQ")
> writeBin(AppendixC, "001_AppendixC.tab")
>
> read.table("001_AppendixC.tab")

I imagine you are using the dataverse package.

7z is more straightforward because the file format is clear.

You need to figure out the 001_AppendixC.tab file format.
At first glance it looks to me like a spreadsheet.

  $ file /tmp/001_AppendixC.tab
  /tmp/001_AppendixC.tab: Zip archive data, at least v2.0 to extract
  $ cd /tmp && unzip 001_AppendixC.tab
  $ head -n2 /tmp/xl/workbook.xml | cut -c 1-75
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"

Once you figure out the format manually, write an R function that
figures out the format, and ask again here to find an R function that
reads the format.
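
As a starting point, such a function could look at the first bytes of the
file, much as `file` does above. This is only a sketch: `sniff_format` is a
hypothetical name and it knows just three signatures (zip, 7z, gzip).

```r
# Guess a file's container format from its first bytes, ignoring the suffix.
sniff_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 6)
  starts_with <- function(sig) {
    length(magic) >= length(sig) && all(magic[seq_along(sig)] == sig)
  }
  if (starts_with(as.raw(c(0x50, 0x4B, 0x03, 0x04)))) {
    "zip"   # "PK\003\004"; also the container used by .xlsx workbooks
  } else if (starts_with(as.raw(c(0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C)))) {
    "7z"
  } else if (starts_with(as.raw(c(0x1F, 0x8B)))) {
    "gzip"
  } else {
    "unknown (possibly plain text)"
  }
}
```

For a "zip" result, `unzip(path, list = TRUE)` lists the member names
without extracting; an `xl/workbook.xml` entry, as in the session above,
suggests an Excel workbook.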


Re: Dataverse (reading files with .tab and .7z suffixes)

Jorge Cimentada
Or just use the dataverse package, already on CRAN:
https://cran.r-project.org/web/packages/dataverse/index.html

-----------------------------------


Jorge Cimentada
https://cimentadaj.github.io/


On Sun, May 13, 2018 at 2:04 PM, Thomas Levine <[hidden email]> wrote:

> Ilio Fornasero writes:
> > Yet, I am at this point.
> >
> >
> >
> >
> > ## 01. Finding the dataverse server and making a search
> > Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
> > dataverse_search(".Hunger")
> >
> >
> > ## 02. Loading the dataset (in this example, I have chosen the word ".Hunger" to get
> >    # one list and then picked one out of hundreds of results.
> >    # The get_dataset() function has to be given the dynamic web address)
> > (dataset_ifpri <- get_dataset("https://doi.org/10.7910/DVN/ZTCWYQ"))
> >
> > ## 03. Grabbing the (1st) file we are interested in
> > AppendixC <- get_file("001_AppendixC.tab",
> >                       "https://doi.org/10.7910/DVN/ZTCWYQ")
> > writeBin(AppendixC, "001_AppendixC.tab")
> >
> > read.table("001_AppendixC.tab")
>
> I imagine you are using the dataverse package.
>
> 7z is more straightforward because the file format is clear.
>
> You need to figure out the 001_AppendixC.tab file format.
> At first glance it looks to me like a spreadsheet.
>
>   $ file /tmp/001_AppendixC.tab
>   /tmp/001_AppendixC.tab: Zip archive data, at least v2.0 to extract
>   $ cd /tmp && unzip 001_AppendixC.tab
>   $ head -n2 /tmp/xl/workbook.xml | cut -c 1-75
>   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>   <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
>
> Once you figure out the format manually, write an R function that
> figures out the format, and ask again here to find an R function that
> reads the format.
>

Re: Dataverse (reading files with .tab and .7z suffixes)

David Winsemius
In reply to this post by Thomas Levine-4

> On May 13, 2018, at 5:04 AM, Thomas Levine <[hidden email]> wrote:
>
> Ilio Fornasero writes:
>> Yet, I am at this point.
>>
>>
>>
>>
>> ## 01. Finding the dataverse server and making a search
>> Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
>> dataverse_search(".Hunger")
>>
>>
>> ## 02. Loading the dataset (in this example, I have chosen the word ".Hunger" to get
>>   # one list and then picked one out of hundreds of results.
>>   # The get_dataset() function has to be given the dynamic web address)
>> (dataset_ifpri <- get_dataset("https://doi.org/10.7910/DVN/ZTCWYQ"))
>>
>> ## 03. Grabbing the (1st) file we are interested in
>> AppendixC <- get_file("001_AppendixC.tab",
>>                      "https://doi.org/10.7910/DVN/ZTCWYQ")
>> writeBin(AppendixC, "001_AppendixC.tab")
>>
>> read.table("001_AppendixC.tab")
>
> I imagine you are using the dataverse package.
>
> 7z is more straightforward because the file format is clear.
>
> You need to figure out the 001_AppendixC.tab file format.
> At first glance it looks to me like a spreadsheet.

That website says it's tab-delimited. The read.delim function (in base R) is designed for that case. However, the download pull-down menu that appears seems to offer delivery in a variety of formats.

When I choose the Rdata option I get:

 fil <- load("/Users/davidwinsemius/001_AppendixC.RData")
 fil
#[1] "x"

str(x)
#-------------------
'data.frame': 132 obs. of  17 variables:
 $ Country :Class 'AsIs'  atomic [1:132] Afghanistan Albania Algeria Angola ...
  .. ..- attr(*, "comment")= chr "Country"
 $ UN9193  :Class 'AsIs'  atomic [1:132] 37.4 7.7 9.1 65.400000000000006 ...
  .. ..- attr(*, "comment")= chr "UN9193"
 $ UN9901  :Class 'AsIs'  atomic [1:132] 46.1 7.2 10.7 50 ...
------ snipped --------
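
Note that load() returns the names of the restored objects rather than the
data itself, which is why `fil` prints "x" above. A small sketch of the same
pattern, loading into a separate environment so that an existing `x` in the
workspace is not overwritten (the data frame here is made up for
illustration):

```r
# Made-up stand-in for the downloaded .RData file.
rdata_file <- tempfile(fileext = ".RData")
x <- data.frame(Country = c("Afghanistan", "Albania"),
                UN9193 = c(37.4, 7.7))
save(x, file = rdata_file)
rm(x)

# load() returns the object names; the objects themselves land in `envir`.
env <- new.env()
loaded_names <- load(rdata_file, envir = env)
appendix <- get(loaded_names[1], envir = env)
```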


--
David.


>
>  $ file /tmp/001_AppendixC.tab
>  /tmp/001_AppendixC.tab: Zip archive data, at least v2.0 to extract
>  $ cd /tmp && unzip 001_AppendixC.tab
>  $ head -n2 /tmp/xl/workbook.xml | cut -c 1-75
>  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>  <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
>
> Once you figure out the format manually, write an R function that
> figures out the format, and ask again here to find an R function that
> reads the format.
>
David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law







Untitled.pdf (28K) Download Attachment

Re: Dataverse

Ista Zahn
In reply to this post by Ilio Fornasero
Use https://cran.rstudio.com/web/packages/dataverse/

--Ista

On Sun, May 13, 2018 at 5:21 AM, Ilio Fornasero
<[hidden email]> wrote:

> Hello.
>
> I am trying to find a way to retrieve data from the Harvard Dataverse website.
> I usually don't have problems web-scraping data, but the problem here is that there are a bunch of data formats such as .tab, .7z and so on, and I just can't find a way to retrieve the data I am interested in with a single solution.
> Any hints?
>
>
>
