Why Numeric Values Become Factors in Data Frame

Rich Shepard
   I have a data frame with 1 factor, one date, and 37 numeric values:
str(waterchem)
'data.frame': 3525 obs. of  39 variables:
  $ site      : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 ...
  $ sampdate  : Date, format: "2007-12-12" "2008-03-15" ...
  $ CO3       : num  1 1 6.7 1 1 1 1 1 1 1 ...
  $ HCO3      : num  231 228 118 246 157 208 338 285 260 240 ...
  $ Ca        : num  100 88.4 63.4 123 78.2 103 265 213 178 166 ...
  $ DO        : num  4.96 9.91 4.32 2.58 1.81 5.09 3.98 5.46 1.9 2.52 ...
  ...
  $ SC        : Factor w/ 841 levels "1.090","10.000",..: 635 638 363

   All the numeric columns are read in as numbers except for column 'SC'.
I have been looking in the source file for a couple of hours trying to learn
why values such as 1.090 and 10.000 are seen as characters rather than
numbers. I've not seen the reason.

   The source file is 860K and looks like this:

site|sampdate|'Ag'|'Al'|'CO3'|'HCO3'|'Alk-Tot'|'As'|'Ba'|'Be'|'Bi'|'Ca'|'Cd'|'Cl'|'Co'|'Cr'|'Cu'|'DO'|'Fe'|'Hg'|'K'|'Mg'|'Mn'|'Mo'|'Na'|'NH4'|'NO3-NO2'|'Oil-grease'|'Pb'|'pH'|'Sb'|'SC'|'Se'|'SO4'|'Sr'|'TDS'|'Tl'|'V'|'Zn'
'D-1'|'2007-12-12'|0.000|0.106|1.000|231.000|231.000|0.011|0.000|0.002|0.000|100.000|0.000|1.430|0.000|0.006|0.024|4.960|4.110|NA|0.000|9.560|0.035|0.000|0.970|0.010|0.293|NA|0.025|7.800|0.001|630.000|0.001|65.800|0.000|320.000|0.001|0.000|11.400
'D-1'|'2008-03-15'|0.000|0.080|1.000|228.000|228.000|0.001|0.000|0.002|0.000|88.400|0.000|1.340|0.000|0.006|0.014|9.910|0.309|0.000|0.000|9.150|0.047|0.000|0.820|0.224|0.020|NA|0.025|7.940|0.001|633.000|0.001|75.400|0.000|300.000|0.001|0.000|12.400

   The R command used to create the data frame is:
         waterchem <- read.table('wqR.txt', header = TRUE, sep = '|')

   I need pointers on how to determine why this one variable has some values
read as characters rather than as numerics.

Rich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Why Numeric Values Become Factors in Data Frame

Joshua Wiley-2
Hi Rich,

Try looking at:

levels(waterchem$SC)

There must be something in that column that is triggering R to read it
as character.  Potential examples include using "." to indicate
missing values or anything else that is not itself directly numeric.
You might also get some mileage out of attempting to coerce the
factor labels to numeric and seeing what errors/warnings arise and if
any new values are missing.  For instance:

x <- factor(c("1", "2", "NA", "3e5", "."))

> levels(x)
[1] "."   "1"   "2"   "3e5" "NA"
> as.numeric(levels(x))
[1]    NA 1e+00 2e+00 3e+05    NA
Warning message:
NAs introduced by coercion

Nothing else comes to mind off the top of my head to try.  Once you
determine what is doing it, you can force the class in read.table
using the colClasses argument.
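
A minimal sketch of what that could look like once the cause is known. The three-row extract and the "." missing-value marker are made up for illustration and are not taken from Rich's file. It declares the marker through na.strings and names only the SC column in colClasses, leaving the other columns to be guessed as usual:

```r
# Hypothetical extract in the same pipe-separated layout as wqR.txt;
# "." stands in for a missing-value marker (an assumption, not Rich's data).
txt <- "site|sampdate|SC
'D-1'|'2007-12-12'|630.000
'D-1'|'2008-03-15'|.
'D-2'|'2008-03-15'|633.000"

wq <- read.table(text = txt, header = TRUE, sep = "|",
                 na.strings = c("NA", "."),     # treat "." as missing
                 colClasses = c(SC = "numeric")) # force only the SC column

str(wq$SC)  # numeric, with NA where the "." was
```

Because the colClasses vector is named and shorter than the number of columns, read.table matches it by column name, so the other columns do not have to be listed.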

Cheers,

Josh

On Tue, Nov 29, 2011 at 11:18 AM, Rich Shepard <[hidden email]> wrote:

>  Pointers on how to determine why this one variable has some values and
> characters rather than as numerics are needed.

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

Re: Why Numeric Values Become Factors in Data Frame

David Winsemius
In reply to this post by Rich Shepard

On Nov 29, 2011, at 2:18 PM, Rich Shepard wrote:

>  Pointers on how to determine why this one variable has some values and
> characters rather than as numerics are needed.

So what does this show?

grep("[^0-9.]", waterchem$SC)
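
A small sketch of what that call returns, using a made-up factor in place of waterchem$SC (the values, including the URL, are hypothetical). grep() gives the positions of the entries containing anything other than digits or a period, and indexing with those positions shows the strings themselves:

```r
# Hypothetical stand-in for waterchem$SC
SC <- factor(c("630.000", "633.000", "http://example.com", "1.090", "-"))

# Positions of entries with a character that is not a digit or "."
idx <- grep("[^0-9.]", as.character(SC))
idx

# The offending strings, each listed once
unique(as.character(SC)[idx])
```

On the real data the same two lines, with waterchem$SC in place of SC, should point at the rows to inspect in the source file.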



David Winsemius, MD
West Hartford, CT

Re: Why Numeric Values Become Factors in Data Frame

Marc Schwartz-3
In reply to this post by Rich Shepard
On Nov 29, 2011, at 1:18 PM, Rich Shepard wrote:

>  Pointers on how to determine why this one variable has some values and
> characters rather than as numerics are needed.

Rich,

Somewhere in that column are non-numeric characters (other than 0 through 9 and a decimal point), resulting in the column being coerced to a factor.

Not fully tested, but using grepl() along the lines of:

Vec <- c(1.09, 1.23, "1,23", "A", 2.067)

> which(grepl("[^0-9\\.]", Vec))
[1] 3 4

Will give you the indices of the entries in the column that contain non-numeric characters.

> Vec[which(grepl("[^0-9\\.]", Vec))]
[1] "1,23" "A"  

Will give you the entries themselves.
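
One caveat, sketched with a made-up vector: the character-class test and an actual conversion attempt do not always agree. The regular expression flags signed values and scientific notation even though R reads them as numbers, and it passes strings such as "1.2.3" that R cannot convert. Comparing the two side by side may help decide which check fits the data:

```r
# Hypothetical values chosen to show where the two checks differ
Vec <- c("1.09", "-5", "1e3", "1.2.3", "1,23", "A", NA)

# Flagged by the regular expression (anything other than digits or ".")
by_regex <- grepl("[^0-9.]", Vec)

# Flagged because as.numeric() fails on a non-missing entry
by_convert <- is.na(suppressWarnings(as.numeric(Vec))) & !is.na(Vec)

data.frame(Vec, by_regex, by_convert)
```

For a column like SC, where negative values and exponents are not expected, either check should find the stray entries.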

The read.table() family of functions use type.convert() internally to do the data type coercions:

> type.convert(Vec)
[1] 1.09  1.23  1,23  A     2.067
Levels: 1,23 1.09 1.23 2.067 A

So 'Vec' is coerced to a factor due to the non-numeric characters contained in the entries.
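
A note for readers on newer versions: since R 4.0.0, read.table() defaults to stringsAsFactors = FALSE, so the same problem shows up as a chr column in str() rather than as a Factor. A minimal sketch with made-up vectors, passing as.is explicitly so the result does not depend on the version's default:

```r
clean <- c("1.09", "1.23", "2.067")
dirty <- c("1.09", "1.23", "1,23", "A", "2.067")

# All entries parse as numbers, so the result is numeric
class(type.convert(clean, as.is = TRUE))

# One bad entry is enough to keep the whole vector as character ...
class(type.convert(dirty, as.is = TRUE))

# ... or to turn it into a factor when as.is = FALSE
class(type.convert(dirty, as.is = FALSE))
```

Either way the diagnosis is the same: at least one entry in the column could not be read as a number.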

HTH,

Marc Schwartz

Re: Why Numeric Values Become Factors in Data Frame

William Dunlap
In reply to this post by Rich Shepard
You can see what the offending strings are with
  > with(waterchem, levels(SC)[is.na(as.numeric(levels(SC)))])
  [1] "-" "+"
  Warning message:
  In eval(expr, envir, enclos) : NAs introduced by coercion
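but it may be easiest to use the colClasses argument to read.table
to read that column as character and then convert it with as.numeric()
(giving NA's for strings that could not be interpreted as numbers).
Forcing the column to "numeric" directly makes read.table stop with an
error at the first such string.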
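
A minimal sketch of the colClasses route on a made-up three-row extract (the URL cell is hypothetical). As far as I know, current R stops with a scan() error when a column forced to "numeric" contains a string that is not a number, so this sketch reads SC as character and converts it afterwards. That yields NA for the bad cell and also lets you list what was dropped:

```r
# Hypothetical extract in the same layout as wqR.txt, with one bad SC cell
txt <- "site|sampdate|SC
'D-1'|'2007-12-12'|630.000
'D-1'|'2008-03-15'|http://example.com
'D-2'|'2008-03-15'|633.000"

# Read SC as plain text; the other columns are guessed as usual
wq <- read.table(text = txt, header = TRUE, sep = "|",
                 colClasses = c(SC = "character"))

# Convert, silencing the "NAs introduced by coercion" warning
sc_num <- suppressWarnings(as.numeric(wq$SC))

# Strings that were present but could not be converted
bad <- wq$SC[is.na(sc_num) & !is.na(wq$SC)]
bad

wq$SC <- sc_num
str(wq$SC)
```

Keeping the list in 'bad' around is a cheap way to confirm that only junk, and no real measurement, was turned into NA.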

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Rich Shepard
> Sent: Tuesday, November 29, 2011 11:19 AM
> To: [hidden email]
> Subject: [R] Why Numeric Values Become Factors in Data Frame
>
>    Pointers on how to determine why this one variable has some values and
> characters rather than as numerics are needed.

Re: Why Numeric Values Become Factors in Data Frame

Rich Shepard
In reply to this post by Rich Shepard
On Tue, 29 Nov 2011, Rich Shepard wrote:

>  Pointers on how to determine why this one variable has some values and
> characters rather than as numerics are needed.

Joshua, Marc, David, Bill, Sarah, Bert, et al.:

   Thank you all for the insights and ideas. It was a valuable lesson and it
helped me fix the problem.

   Somehow my client had URLs in two data cells of the original Excel
spreadsheet. I removed them in my LibreOffice copy and exported the file as
a .csv. But I was still using a prior version, with the cruft in it, when
I read the data into R.

   Now that I corrected the problem (and fixed mis-entered conductivity
values < 100) the R data frame is correct:

str(waterchem)
'data.frame': 3524 obs. of  39 variables:
  $ site      : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 1 ...
  $ sampdate  : Date, format: "2007-12-12" "2008-03-15" ...
  $ Ag        : num  0 0 0 0 0 0 0 0 0 0 ...
  $ Al        : num  0.106 0.08 0.116 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ...
  $ CO3       : num  1 1 6.7 1 1 1 1 1 1 1 ...
  ...
  $ SC        : num  630 633 386 503 83.2 538 1450 1130 1040 940 ...

   I knew there was a non-number in there but didn't see it. Your guidance
not only taught me how to find it, but made me aware that while I was
searching in the cleaned-up text file, R was being fed the old version.
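
The stale-file trap can be checked from within R. A hedged sketch, using a temporary file as a stand-in for 'wqR.txt' and assuming the leftover cells contain the text "http": it prints the path R actually resolves and when that file was last modified, then searches the raw lines for the leftover text.

```r
# Temporary stand-in for 'wqR.txt' (contents are made up)
path <- tempfile(fileext = ".txt")
writeLines(c("site|SC", "'D-1'|630.000", "'D-1'|http://example.com"), path)

normalizePath(path)     # the file R will actually open
file.info(path)$mtime   # when that file was last modified

# Line numbers in the raw file that still contain a URL
raw_lines <- readLines(path)
hits <- grep("http", raw_lines, fixed = TRUE)
raw_lines[hits]
```

If the modification time is older than the last cleanup, R is reading the wrong copy.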

Very much appreciated,

Rich
