reading in csv files, some of which have column names and some of which don't


Christopher W. Ryan
Alas, we spend so much time and energy on data wrangling . . . .

I'm given a collection of csv files to work with---"found data". They arose
via saving Excel files to csv format. They all have the same column
structure, except that some were saved with column names and some were not.

I have a code snippet that I've used before to traverse a directory and
read into R all the csv files of a certain filename pattern within it, and
combine them all into a single dataframe:

library(dplyr)
## specify the csv files that I will want to access
files.to.read <- list.files(path = "H:/EH", pattern =
"WICLeadLabOrdersDone.+", all.files = FALSE, full.names = TRUE, recursive =
FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

## function to read csv files back in
read.csv.files <- function(filename) {
    bb <- read.csv(filename, colClasses = "character", header = TRUE)
    bb
}

## now read the csv files, as all character
b <- lapply(files.to.read, read.csv.files)

ddd <- bind_rows(b)

But this assumes that all files have column names in their first row. In
this case, some don't. Any advice on how to handle this so that the files
with column names and those without are read in and combined properly? The
only thing I've come up with so far is:

## function to read csv files back in
## Unfortunately, some of the csv files are saved with column headers,
## and some are saved without them. This presents a problem when defining
## the function to read them: header = TRUE or header = FALSE?
## The best solution I can think of as of 13 August 2019 is to use
## header = FALSE and skip the first row of every file. This will
## sacrifice one record from each csv of about 80 files.
read.csv.files <- function(filename) {
    bb <- read.csv(filename, colClasses = "character", header = FALSE, skip = 1)
    bb
}

This sacrifices about 80 out of about 1600 records. For my purposes in this
instance, this may be acceptable, but of course I'd rather not.

Thanks.

--Chris Ryan

Re: reading in csv files, some of which have column names and some of which don't

Bert Gunter
Are these files all numeric? In other words, how would one know whether the
first line of a file of character data is a header or not? read.table's help
file also contains some information that may or may not be relevant to your files.

Assuming a criterion for telling them apart, one could simply read the first
line of a file, check the criterion, and then read the file with or without a
header as appropriate; R can create default column names. One could also use
readLines with connections, but I don't think that is necessary, though it may
be more elegant or faster.

Without knowing how to tell whether the first line is a header or not, I
have no clue. Maybe the filename or suffix might tell you something.
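
In skeleton form, with looks.like.header() standing in for whatever
criterion applies (the numeric test below is only a placeholder):

looks.like.header <- function(line) {
    ## placeholder criterion: the first comma-separated field is not a number
    first.field <- strsplit(line, ",")[[1]][1]
    is.na(suppressWarnings(as.numeric(first.field)))
}

read.one.file <- function(filename) {
    first.line <- readLines(filename, n = 1)
    read.csv(filename, colClasses = "character",
             header = looks.like.header(first.line))
}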

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


Re: reading in csv files, some of which have column names and some of which don't

plangfelder
In reply to this post by Christopher W. Ryan
If the data are numeric (or at least some columns are numeric), a
brute force solution is to read a file once with header = FALSE, check
the relevant column(s) for being numeric, and if they are not numeric,
re-read with header = TRUE. Alternatively, if you know the column
names (headers) beforehand, read with header = FALSE and check the
first row for being equal to the known column names; if it contains
the column names, re-read with header = TRUE.

With a total of 1600 records, reading each file (at most) twice should
not be a problem.
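
For the second variant, something along these lines (the column names here
are placeholders; substitute the names you actually know):

known.names <- c("ID", "Name", "Result")   ## placeholders for the real names

read.known <- function(filename) {
    bb <- read.csv(filename, colClasses = "character", header = FALSE)
    if (identical(unname(unlist(bb[1, ])), known.names)) {
        ## the first row was the header, so re-read with header = TRUE
        bb <- read.csv(filename, colClasses = "character", header = TRUE)
    } else {
        names(bb) <- known.names
    }
    bb
}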

Peter

Re: reading in csv files, some of which have column names and some of which don't

Sarah Goslee
In reply to this post by Christopher W. Ryan
Like Bert, I can't see an easy approach for datasets that have
character rather than numeric data. But here's a simple approach for
distinguishing files that have possible character headers but numeric
data.



readheader <- function(filename) {

    possibleheader <- read.table(filename, nrows=1, sep=",", header=FALSE)

    if(all(is.numeric(possibleheader[,1]))) {
        # no header
        infile <- read.table(filename, sep=",", header=FALSE)
    } else {
        # has header
        infile <- read.table(filename, sep=",", header=TRUE)
    }

    infile
}



#### file noheader.csv ####

1,1,1
2,2,2
3,3,3


#### file hasheader.csv ####

a,b,c
1,1,1
2,2,2
3,3,3

########################

> readheader("noheader.csv")
  V1 V2 V3
1  1  1  1
2  2  2  2
3  3  3  3
> readheader("hasheader.csv")
  a b c
1 1 1 1
2 2 2 2
3 3 3 3

Sarah

--
Sarah Goslee (she/her)
http://www.numberwright.com


Re: reading in csv files, some of which have column names and some of which don't

Peter Dalgaard
Yes. Also, the original poster said that the files had the same column structure, so there may be stronger heuristics to see whether the first line is a header line. E.g., assuming that the first column is called "ID" (and doesn't have ID as a possible value) use

first <- readLines(file, 1)
if (grepl("^ID", first))
...
else
...
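
Filled in as a rough sketch (the read.csv() call mirrors the original post
and is only an illustration):

read.one <- function(filename) {
    first <- readLines(filename, 1)
    has.header <- grepl("^ID", first)
    ## read with or without a header depending on the first line
    read.csv(filename, colClasses = "character", header = has.header)
}

For the header = FALSE files one would still want to assign the known
column names afterwards so that bind_rows() can line the columns up.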

-pd
 


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: reading in csv files, some of which have column names and some of which don't

Benoit Vaillant
In reply to this post by Christopher W. Ryan
Hello,

On Tue, Aug 13, 2019 at 01:59:56PM -0400, Christopher W Ryan wrote:
> But this assumes that all files have column names in their first row. In
> this case, some don't. Any advice how to handle it so that those with
> column names and those without are read in and combined properly?

It obviously depends on the data, but here is another approach (which
I hope has not been suggested yet):

1. For each file, read only the first row (and keep track of the
   file => first row mapping),
2. Count the distinct first rows. If the data rows are sufficiently
   varied, the first row with the highest count is the header line,
   so mark it as the header,
3. Reread the files; the mapping from step 1 tells you whether header
   should be TRUE or FALSE for each one, and the column names can be
   fixed after reading the files that need header = FALSE.

This can of course fail miserably if the set of saved files does not
include enough files with headers. A rough sketch of the idea follows.
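
Something like this (files.to.read and the read.csv() options are taken
from the earlier posts; unquoted, comma-separated header fields are
assumed):

## read only the first line of every file
first.rows <- sapply(files.to.read, function(f) readLines(f, n = 1))

## the most frequent first row is taken to be the header line
header.line <- names(which.max(table(first.rows)))
col.names <- strsplit(header.line, ",")[[1]]

read.by.majority <- function(filename) {
    if (readLines(filename, n = 1) == header.line) {
        ## check.names = FALSE keeps the names exactly as in the file,
        ## matching the names assigned in the other branch
        read.csv(filename, colClasses = "character", header = TRUE,
                 check.names = FALSE)
    } else {
        bb <- read.csv(filename, colClasses = "character", header = FALSE)
        names(bb) <- col.names
        bb
    }
}

b <- lapply(files.to.read, read.by.majority)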

HTH in your case, but it's definitely not generic.

--
Benoît
