How to read a file containing two types of rows - (for the Netflix challenge data format)

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

How to read a file containing two types of rows - (for the Netflix challenge data format)

Emmanuel Levy
Hi,

I'd like to use the Netflix challenge data and just can't figure out how to
efficiently "scan" the files.
https://www.kaggle.com/netflix-inc/netflix-prize-data

The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
3 values associated to each ID:

The format is as follows:
*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*
etc ...

And I want to create a matrix where each line is of the form:

ID value1, value2, value3

Si "ID" needs to be duplicated - I could write a Perl script to convert
this format to CSV, but I'm sure there's a simple R trick.

Thanks for suggestions!

Emmanuel

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to read a file containing two types of rows - (for the Netflix challenge data format)

Rainer M Krug-3
I did something similar yesterday…

Use readLine() to read at in and identify the “*1:*, … with a regex. Than you have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …)  to read the enclosed blocks, and last, combine them accordingly.

This is written without an R installation, so the argument names are likely wrong.

Rainer


> On 31 Jan 2020, at 10:04, Emmanuel Levy <[hidden email]> wrote:
>
> Hi,
>
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
>
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
>
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
>
> And I want to create a matrix where each line is of the form:
>
> ID value1, value2, value3
>
> Si "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
>
> Thanks for suggestions!
>
> Emmanuel
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Orcid ID: 0000-0002-7490-0066

Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y34-J-74
Winterthurerstrasse 190
8075 Zürich
Switzerland

Office: +41 (0)44 635 47 64
Cell:       +41 (0)78 630 66 57
email:      [hidden email]
                [hidden email]
Skype:     RMkrug

PGP: 0x0F52F982




        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to read a file containing two types of rows - (for the Netflix challenge data format)

Chris Evans
I am sure Rainer's approach is good and I know my R programming is truly terrible but here's a crude script in base R that does what you want

# rawDat <- readLines(con = "netflix.dat")
fil <- tempfile(fileext = ".dat")
cat("*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*",
    file = fil,
    sep = "\n")
rawDat <- readLines(fil, n = -1)
unlink(fil) # tidy up data input

### create a data frame for output
### this first line will be overwritten by the actual data
outDF <- as.data.frame(list(id = 1,
                            value1 = "",
                            value2 = "",
                            value3 = ""),
                       stringsAsFactors = FALSE) # necessary to avoid mess with character to factor conversion

j <- 0 # counter for entries
for (i in 1:length(rawDat)) {
  rawDat[i] <- trimws(rawDat[i])
  if (nchar(rawDat[i]) == 0) next # skip empty lines
  if (grepl(":*", rawDat[i], fixed = TRUE)) {
    ### got an ID line
    id <- sub("\\*([0123456789]*):\\*", "\\1", rawDat[i])
  } else {
    ### not an ID line so one of the one or more following lines of data
    ### I have assumed these are all of the same form
    j <- j + 1
    rawDat[i] <- gsub(" ", "", rawDat[i], fixed = TRUE)
    tmpDat <- unlist(strsplit(rawDat[i], ","))
    outDF[j,1] <- id
    outDF[j,2:4] <- tmpDat
  }
}
outDF

I am slowly adapting to the tidyverse but this is something I still find easier to do in very crude for loop, base R.

Plea: my formal programming training is one week of "Introduction to FORTRAN" on teletypes in 1975, but I confess it's
both lack of formal training _and_ lack of native ability that means my coding is so bad.

If any gurus have a moment, show us really elegant and tidyverse ways to do this!

Very best all,

Chris

----- Original Message -----
> From: "Rainer M Krug" <[hidden email]>
> To: "Emmanuel Levy" <[hidden email]>
> Cc: "R-help Mailing List" <[hidden email]>
> Sent: Friday, 31 January, 2020 10:55:46
> Subject: Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

> I did something similar yesterday…
>
> Use readLine() to read at in and identify the “*1:*, … with a regex. Than you
> have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …)  to
> read the enclosed blocks, and last, combine them accordingly.
>
> This is written without an R installation, so the argument names are likely
> wrong.
>
> Rainer
>
>
>> On 31 Jan 2020, at 10:04, Emmanuel Levy <[hidden email]> wrote:
>>
>> Hi,
>>
>> I'd like to use the Netflix challenge data and just can't figure out how to
>> efficiently "scan" the files.
>> https://www.kaggle.com/netflix-inc/netflix-prize-data
>>
>> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
>> 3 values associated to each ID:
>>
>> The format is as follows:
>> *1:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *2:*
>> value1,value2, value3
>> value1,value2, value3
>> *3:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *4:*
>> etc ...
>>
>> And I want to create a matrix where each line is of the form:
>>
>> ID value1, value2, value3
>>
>> Si "ID" needs to be duplicated - I could write a Perl script to convert
>> this format to CSV, but I'm sure there's a simple R trick.
>>
>> Thanks for suggestions!
>>
>> Emmanuel
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
> UCT), Dipl. Phys. (Germany)
>
> Orcid ID: 0000-0002-7490-0066
>
> Department of Evolutionary Biology and Environmental Studies
> University of Zürich
> Office Y34-J-74
> Winterthurerstrasse 190
> 8075 Zürich
> Switzerland
>
> Office: +41 (0)44 635 47 64
> Cell:       +41 (0)78 630 66 57
> email:      [hidden email]
> [hidden email]
> Skype:     RMkrug
>
> PGP: 0x0F52F982
>
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Chris Evans <[hidden email]> Visiting Professor, University of Sheffield <[hidden email]>
I do some consultation work for the University of Roehampton <[hidden email]> and other places
but <[hidden email]> remains my main Email address.  I have a work web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see:
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to read a file containing two types of rows - (for the Netflix challenge data format)

Jan Galkowski-2
In reply to this post by Emmanuel Levy
With the *data.table* package, *R* can use *fread* as follows:

> grab<- function(file)
> {
>  fin<- fread(file=file,
>  sep=NULL,
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("record"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Read '%s'.\n", file))
>  #
>  substance<- apply(X=fin, MARGIN=1, FUN=function(r) chartr(",", "\t", r[1]))
>  cat(sprintf("Translated '%s'.\n", file))
>  D<- fread(text=substance,
>  sep="\t",
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("ip", "valid.hits", "err.hits", "megabytes"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Parsed '%s'.\n", file))
>  ip<- D$ip
>  withinBlock<- sapply(X=ip, FUN=function(s) as.integer((strsplit(x=s, split=".", fixed=TRUE)[[1]])[4]))
>  D$within.block<- withinBlock
>  return(D)
> }
>

In short, one pass pulls in all the records into an internal structure, which can be edited or manipulated at will, and then a second call to *fread* parses it properly.

*fread *is fast, even for big datasets.


--
Jan Galkowski

https://www.linkedin.com/in/deepdevelopment

member,

... American Statistical Association
... International Society for Bayesian Analysis
... Ecological Society of America
... International Association of Survey Statisticians
... American Association for the Advancement of Science
... TeX Users Group

(pronouns: *he, him, his*)

*Keep your energy local*. --John Farrell, *ILSR <http://ilsr.org/>*


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to read a file containing two types of rows - (for the Netflix challenge data format)

Berry, Charles-2
In reply to this post by Emmanuel Levy


> On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <[hidden email]> wrote:
>
> Hi,
>
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
>
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
>
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
>
> And I want to create a matrix where each line is of the form:
>
> ID value1, value2, value3
>
> Si "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
>

I'd be tempted to use pipe() to separately read the ID lines and the value lines, but in R you can do this:

fc <- count.fields( "yourfile.txt", sep = ",")
rlines <- split( readLines( "yourfile.txt" ), fc)
mat <-
  cbind( rlines[[1]],
        do.call( rbind, strsplit( rlines[[2]], ",")))


This assumes that there are exactly 1 or 3 fields in each row of "yourfile.txt", if not, some incantation of grepl() applied to the text of readLines() should suffice.

HTH,

Chuck

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to read a file containing two types of rows - (for the Netflix challenge data format)

Emmanuel Levy
Hi All,

Thanks so much for your inputs, it's so nice to have such a helpful
community -- I wrote some kind of mix between the different replies, I copy
my final code below.

All the best,

Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)

first.col.idx = grep(":",mat[,1])

first.col.val = grep(":",mat[,1],value=TRUE)

first.col.val = gsub(pattern=":", replacement="", first.col.val)

mat.clean = mat[-first.col.idx,]

reps = diff(c(first.col.idx,length(mat1[,1])))

reps[1] = reps[1]+1

mat.final = cbind( rep(first.col.val, reps-1), mat.clean)








On Fri, 31 Jan 2020 at 20:31, Berry, Charles <[hidden email]>
wrote:

>
>
> > On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <[hidden email]>
> wrote:
> >
> > Hi,
> >
> > I'd like to use the Netflix challenge data and just can't figure out how
> to
> > efficiently "scan" the files.
> > https://www.kaggle.com/netflix-inc/netflix-prize-data
> >
> > The files have two types of row, either an *ID* e.g., "1:" , "2:", etc.
> or
> > 3 values associated to each ID:
> >
> > The format is as follows:
> > *1:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *2:*
> > value1,value2, value3
> > value1,value2, value3
> > *3:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *4:*
> > etc ...
> >
> > And I want to create a matrix where each line is of the form:
> >
> > ID value1, value2, value3
> >
> > Si "ID" needs to be duplicated - I could write a Perl script to convert
> > this format to CSV, but I'm sure there's a simple R trick.
> >
>
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields( "yourfile.txt", sep = ",")
> rlines <- split( readLines( "yourfile.txt" ), fc)
> mat <-
>   cbind( rlines[[1]],
>         do.call( rbind, strsplit( rlines[[2]], ",")))
>
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt", if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
>
> Chuck
>
>
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.