remove

Val-17
Hi all,
I have a big data set and want to remove rows conditionally.
In my data file each person was recorded for several weeks. Somehow,
during the recording period, some last names were misreported. For
each person, the last name should be the same in every row; otherwise
that person should be removed from the data. For example, in the
following data set, Alex was found to have two last names:

Alex   West
Alex   Joseph

So Alex should be removed from the data. When this happens, I want to
remove all rows with Alex. Here is my data set:

df <- read.table(header=TRUE, text='first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West ')

Desired output

  first week last
1   Bob    1 John
2   Bob    2 John
3   Bob    3 John
4  Cory    1 Jack
5  Cory    2 Jack

Thank you in advance

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: remove

Bert Gunter-2
Basic stuff!

Either subscripting or ?subset.

There are many good R tutorials on the web. You should spend some
(more?) time with some.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Feb 11, 2017 at 9:02 PM, Val <[hidden email]> wrote:

> Hi all,
> I have a big data set and want to  remove rows conditionally.

Re: remove

P Tennant
In reply to this post by Val-17
Hi Val,

The by() function could be used here. With the dataframe dfr:

# split the data by first name and check for more than one last name
# for each first name
res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
# make the result more easily manipulated
res <- as.table(res)
res
# first
  # Alex   Bob  Cory
  # TRUE FALSE FALSE

# then use this result to subset the data
nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
# sort if needed
nw.dfr[order(nw.dfr$first) , ]

   first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 Jack


Philip


Re: [FORGED] Re: remove

Rolf Turner
In reply to this post by Bert Gunter-2

On 12/02/17 18:36, Bert Gunter wrote:
> Basic stuff!
>
> Either subscripting or ?subset.
>
> There are many good R tutorials on the web. You should spend some
> (more?) time with some.

Uh, Bert, perhaps I'm being obtuse (a common occurrence) but it doesn't
seem basic to me.  The only way that I can see how to go at it is via
a for loop:

rdln <- function(X) {
# Remove discordant last names.
     ok <- logical(nrow(X))
     for(nm in unique(X$first)) {
         xxx <- unique(X$last[X$first==nm])
         if(length(xxx)==1) ok[X$first==nm] <- TRUE
     }
     Y <- X[ok,]
     Y <- Y[order(Y$first),]
     rownames(Y) <- seq_len(nrow(Y))  # unlike 1:nrow(Y), safe when no rows remain
     Y
}

Calling the toy data frame "melvin" rather than "df" (since "df" is the
name of the built-in F density function, it is bad form to use it as the
name of another object) I get:

 > rdln(melvin)
   first week last
1   Bob    1 John
2   Bob    2 John
3   Bob    3 John
4  Cory    1 Jack
5  Cory    2 Jack

which is the desired output.  If there is a "basic stuff" way to do this
I'd like to see it.  Perhaps I will then be toadally embarrassed, but
they say that this is good for one.

cheers,

Rolf

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276


Re: remove

Jeff Newmiller
In reply to this post by P Tennant
The "by" function aggregates and returns a result with generally fewer
rows than the original data. Since you are looking to index the rows in
the original data set, the "ave" function is better suited because it
always returns a vector that is just as long as the input vector:

# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

err <- ave( DF$last
           , DF[ , "first", drop = FALSE]
           , FUN = function( lst ) {
               length( unique( lst ) )
             }
           )
result <- DF[ "1" == err, ]
result

Notice that the ave function returns a vector of the same type as was
given to it, so even though the function returns a numeric the err
vector is character.

If you wanted to be able to examine more than one other column in
determining the keep/reject decision, you could do:

err2 <- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
              }
            )
result2 <- DF[ 1 == err2, ]
result2

and then you would have the option to re-use the "n" index to look at
other columns as well.
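
As a rough sketch of what re-using the "n" index for more than one column might look like: the "city" column below is hypothetical (it is not in the original data), and err3 and result4 are names made up for this illustration. Here a person is kept only if the combination of last name and city is the same in every one of their rows.

```r
# Toy data with a hypothetical extra column "city" (not in the original post)
DF <- read.table( text=
'first  week last    city
Alex    1  West    Ames
Bob     1  John    Reno
Cory    1  Jack    Waco
Cory    2  Jack    Hilo
Bob     2  John    Reno
Alex    2  Joseph  Ames
', header = TRUE, as.is = TRUE )

# n holds the row numbers of one first-name group, so any set of columns
# can be inspected; here we count distinct (last, city) combinations
err3 <- ave( seq_along( DF$first )
           , DF$first
           , FUN = function( n ) {
               nrow( unique( DF[ n, c( "last", "city" ) ] ) )
             }
           )

# keep only the people with exactly one (last, city) combination
result4 <- DF[ 1 == err3, ]
result4
```

In this toy data only Bob survives: Alex has two last names and Cory has two cities.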

Finally, here is a dplyr solution:

library(dplyr)
result3 <- (   DF
            %>% group_by( first ) # like a prep for ave or by
            %>% mutate( err = length( unique( last ) ) ) # similar to ave
            %>% filter( 1 == err ) # drop the rows with too many last names
            %>% select( -err ) # drop the temporary column
            %>% as.data.frame # convert back to a plain-jane data frame
            )
result3

which uses a small set of verbs in a pipeline of functions to go from
input to result in one pass.

If your data set is really big (running out of memory big) then you might
want to investigate the data.table package or an SQLite database (via the
RSQLite package), either of which can be combined with dplyr to get a
standardized syntax for managing larger amounts of data. However, most
people actually aren't running out of memory, so in most cases the extra
horsepower isn't needed.
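
For reference, a minimal data.table sketch of the same filter, assuming the data.table package is installed. This is not code from the thread, the names DT, keep and result_dt are made up here, and it has only been reasoned through on the toy data, not on anything close to a large data set.

```r
library(data.table)

DF <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

DT <- as.data.table( DF )

# one row per first name, flagging whether it has a single last name;
# then pull out the first names that pass the check
keep <- DT[ , .( ok = uniqueN( last ) == 1L ), by = "first" ][ ok == TRUE, first ]

# subset the original table; the original row order is preserved
result_dt <- DT[ first %in% keep ]
result_dt
```

The grouping step produces only one row per person, so the intermediate table stays small even when the input is large.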

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


Re: remove

P Tennant
Hi Jeff,

Why do you say ave() is better suited *because* it always returns a
vector that is just as long as the input vector? Is it because that
feature (of equal length), allows match() to be avoided, and as a
result, the subsequent subsetting is faster with very large datasets?

Thanks, Philip


On 12/02/2017 5:42 PM, Jeff Newmiller wrote:

> The "by" function aggregates and returns a result with generally fewer
> rows than the original data. Since you are looking to index the rows
> in the original data set, the "ave" function is better suited because
> it always returns a vector that is just as long as the input vector:

Re: remove

Val-17
In reply to this post by Jeff Newmiller
Jeff, Rolf and Philip,
Thank you very much for your suggestions.

Jeff, you suggested considering data.table if the data is big.
My data is "big": it has more than 200M records, so I will see if
data.table works.

Thank you again.



Re: remove

Jeff Newmiller
In reply to this post by P Tennant
Exactly. Sort of like the optimisation of using which.max instead of max followed by which, though ideally the only intermediate vector would be the logical vector that says keep or don't keep.
--
Sent from my phone. Please excuse my brevity.
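
One possible way to arrive at a logical keep/don't-keep vector without calling a function once per group is sketched below. It is not from the thread: the names combos, bad and keep are made up here, it still builds two small group-level intermediates, and the final step uses %in% (that is, match()), so it only partly meets the ideal described above.

```r
DF <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

# one row per distinct (first, last) combination
combos <- unique( DF[ , c( "first", "last" ) ] )

# a first name that appears more than once here has more than one last name
bad <- unique( combos$first[ duplicated( combos$first ) ] )

# logical vector, one element per row of DF
keep <- !( DF$first %in% bad )

DF[ keep, ]
```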


Re: [FORGED] Re: remove

Bert Gunter-2
In reply to this post by Rolf Turner
My understanding was that the discordant names had already been identified.
In that case, for the example the OP gave, removing the rows with
first = "Alex" is done by:

df[df$first !="Alex",]

If that is not the case, as others have pointed out, various forms of
tapply() (by, ave, etc.) can be used. I agree that that is not so
"basic," so I apologize if my understanding was incorrect.

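For the case where the discordant names are not known in advance, here is a
minimal sketch of the tapply() route mentioned above. It assumes the toy data
from the original post; the names melvin (borrowed from Rolf's reply), n_last,
bad and clean are illustrative only and do not come from the thread.

```r
# Toy data from the original post, read as character rather than factor.
melvin <- read.table(header = TRUE, as.is = TRUE, text = "first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
")

# Number of distinct last names recorded for each first name.
n_last <- tapply(melvin$last, melvin$first, function(x) length(unique(x)))

# First names that were recorded with more than one last name.
bad <- names(n_last)[n_last > 1]

# Drop every row that belongs to one of those first names.
clean <- melvin[!melvin$first %in% bad, ]
clean
```

On the toy data this should leave only the Bob and Cory rows, in their
original order; sort with order() afterwards if needed.
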
Cheers,
Bert





On Sat, Feb 11, 2017 at 10:04 PM, Rolf Turner <[hidden email]> wrote:

>
> On 12/02/17 18:36, Bert Gunter wrote:
>>
>> Basic stuff!
>>
>> Either subscripting or ?subset.
>>
>> There are many good R tutorials on the web. You should spend some
>> (more?) time with some.
>
>
> Uh, Bert, perhaps I'm being obtuse (a common occurrence) but it doesn't seem
> basic to me.  The only way that I can see how to go at it is via
> a for loop:
>
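> rdln <- function(X) {
>     # Remove every person (first name) who has discordant last names.
>     ok <- logical(nrow(X))              # all FALSE to start with
>     for(nm in unique(X$first)) {
>         xxx <- unique(X$last[X$first==nm])
>         # keep this person's rows only if exactly one last name was seen
>         if(length(xxx)==1) ok[X$first==nm] <- TRUE
>     }
>     Y <- X[ok,]
>     Y <- Y[order(Y$first),]
>     rownames(Y) <- NULL                 # renumber rows 1..n (safe for zero rows)
>     Y
> }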
>
> Calling the toy data frame "melvin" rather than "df" (since "df" is the name
> of the built in F density function, it is bad form to use it as the name of
> another object) I get:
>
>> rdln(melvin)
>   first week last
> 1   Bob    1 John
> 2   Bob    2 John
> 3   Bob    3 John
> 4  Cory    1 Jack
> 5  Cory    2 Jack
>
> which is the desired output.  If there is a "basic stuff" way to do this
> I'd like to see it.  Perhaps I will then be toadally embarrassed, but they
> say that this is good for one.
>
> cheers,
>
> Rolf
>
> --
> Technical Editor ANZJS
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>


Re: [FORGED] Re: remove

Rainer Schuermann
In reply to this post by Rolf Turner
I may not be understanding the question well enough but for me

df[ df[ , "first"]  != "Alex", ]

seems to do the job:

  first week last

Rainer





Re: [FORGED] Re: remove

Val-17
Thank you Rainer,

The question was:
1. Identify the first names that are recorded with more than one
last name.
2. Once identified (like Alex), exclude all of their rows, because
those records are not reliable.
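
The two steps above can be sketched in base R without a loop. This is only
one possible approach, assuming the toy data from the original post; the
names DF, pairs, bad and keep are illustrative.

```r
# Toy data from the original post, read as character rather than factor.
DF <- read.table(header = TRUE, as.is = TRUE, text = "first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
")

# Step 1: identify first names that occur with more than one last name.
# One row per distinct (first, last) combination ...
pairs <- unique(DF[, c("first", "last")])
# ... so a first name that shows up twice here has two different last names.
bad <- unique(pairs$first[duplicated(pairs$first)])

# Step 2: exclude every row belonging to those first names.
keep <- DF[!DF$first %in% bad, ]
keep
```

Because the check runs on the de-duplicated name pairs rather than on every
row, it should also stay cheap on a big data set.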


Re: remove

Val-17
In reply to this post by Jeff Newmiller
Hi Jeff and all,
How do I get the number of unique first names in the two data sets?

For the first one:
result2 <- DF[ 1 == err2, ]
length(unique(result2$first))
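
A small sketch of the counts, assuming that "the two data sets" means the
original DF and the filtered result2 (the post does not say). The names
n_before, n_after and dropped are illustrative.

```r
# Toy data and filter as in Jeff's reply quoted below.
DF <- read.table(header = TRUE, as.is = TRUE, text = "first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
")
# Per-row count of distinct last names within each first name.
err2 <- ave(seq_along(DF$first), DF$first,
            FUN = function(n) length(unique(DF$last[n])))
result2 <- DF[err2 == 1, ]

n_before <- length(unique(DF$first))        # distinct first names before filtering
n_after  <- length(unique(result2$first))   # distinct first names after filtering
dropped  <- setdiff(DF$first, result2$first) # first names removed by the filter

n_before
n_after
dropped
```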




On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
<[hidden email]> wrote:

> The "by" function aggregates and returns a result with generally fewer rows
> than the original data. Since you are looking to index the rows in the
> original data set, the "ave" function is better suited because it always
> returns a vector that is just as long as the input vector:
>
> # I usually work with character data rather than factors if I plan
> # to modify the data (e.g. removing rows)
> DF <- read.table( text=
> 'first  week last
> Alex    1  West
> Bob     1  John
> Cory    1  Jack
> Cory    2  Jack
> Bob     2  John
> Bob     3  John
> Alex    2  Joseph
> Alex    3  West
> Alex    4  West
> ', header = TRUE, as.is = TRUE )
>
> err <- ave( DF$last
>           , DF[ , "first", drop = FALSE]
>           , FUN = function( lst ) {
>               length( unique( lst ) )
>             }
>           )
> result <- DF[ "1" == err, ]
> result
>
> Notice that the ave function returns a vector of the same type as was given
> to it, so even though the function returns a numeric the err
> vector is character.
>
> If you wanted to be able to examine more than one other column in
> determining the keep/reject decision, you could do:
>
> err2 <- ave( seq_along( DF$first )
>            , DF[ , "first", drop = FALSE]
>            , FUN = function( n ) {
>               length( unique( DF[ n, "last" ] ) )
>              }
>            )
> result2 <- DF[ 1 == err2, ]
> result2
>
> and then you would have the option to re-use the "n" index to look at other
> columns as well.
>
> Finally, here is a dplyr solution:
>
> library(dplyr)
> result3 <- (   DF
>            %>% group_by( first ) # like a prep for ave or by
>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>            %>% filter( 1 == err ) # drop the rows with too many last names
>            %>% select( -err ) # drop the temporary column
>            %>% as.data.frame # convert back to a plain-jane data frame
>            )
> result3
>
> which uses a small set of verbs in a pipeline of functions to go from input
> to result in one pass.
>
> If your data set is really big (running out of memory big) then you might
> want to investigate the data.table or sqlite packages, either of which can
> be combined with dplyr to get a standardized syntax for managing larger
> amounts of data. However, most people actually aren't running out of memory
> so in most cases the extra horsepower isn't actually needed.
>
>


Re: remove

Jeff Newmiller
Your question mystifies me, since it looks to me like you already know the answer.
--
Sent from my phone. Please excuse my brevity.


Re: remove

Val-17
Sorry Jeff, I did not finish my email; I accidentally touched the send button.
My question was about the results I got when I used
length(unique(result2$first))
     vs
dim(result2[!duplicated(result2[, 'first']), ])[1]

I did get different results, but I have now found the problem.

Thank you!
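
For reference, the two expressions should agree when they are applied to the
same data frame. The post does not say what the problem was, so the sketch
below only checks the equivalence on a hand-built copy of the filtered toy
data; the names n_a, n_b and n_c are illustrative.

```r
# Filtered toy data (the Bob and Cory rows), built by hand for the check.
result2 <- data.frame(
  first = c("Bob", "Cory", "Cory", "Bob", "Bob"),
  week  = c(1, 1, 2, 2, 3),
  last  = c("John", "Jack", "Jack", "John", "John"),
  stringsAsFactors = FALSE
)

# Count distinct first names three equivalent ways.
n_a <- length(unique(result2$first))
n_b <- dim(result2[!duplicated(result2[, "first"]), ])[1]
n_c <- sum(!duplicated(result2$first))   # same idea without subsetting the data frame

c(n_a, n_b, n_c)
```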









Re: remove

Val-17
Hi Jeff and All,

When I examined the excluded data, i.e., first names with different
last names, I noticed that some last names were not recorded.
For instance, I modified the data as follows:
DF <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2     -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )


err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
              length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John

However, I want to keep Cory's records. A last name that was not recorded
is assumed to be the same as that person's recorded last name.

The final output should be

  first week last
    Bob    1 John
    Bob    2 John
    Bob    3 John
   Cory    1 Jack
   Cory    2    -

Thank you again!
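
One way this could be handled is to ignore the unrecorded entries when
counting last names. The sketch below is only an illustration under two
assumptions: a missing last name is always coded as "-", and a person whose
last name is never recorded at all should also be kept. The names last_known
and n_last are not from the thread.

```r
# Modified toy data from the post above, with one unrecorded last name.
DF <- read.table(text = "first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2     -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
", header = TRUE, as.is = TRUE)

# Treat "-" as missing, but leave DF$last itself untouched.
last_known <- ifelse(DF$last == "-", NA, DF$last)

# Per-row count of distinct recorded (non-missing) last names
# within each first name.
n_last <- ave(seq_along(DF$first), DF$first,
              FUN = function(i) {
                x <- last_known[i]
                length(unique(x[!is.na(x)]))
              })

# Keep people with at most one recorded last name, sorted for display.
result <- DF[n_last <= 1, ]
result <- result[order(result$first, result$week), ]
result
```

Change the condition to n_last == 1 if people with no recorded last name at
all should be dropped instead.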



Re: remove

P Tennant
Val,

Working with R's special missing value indicator (NA) would be useful
here. You could use the na.strings arg in read.table() to recognise "-"
as a missing value:

dfr <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE, na.strings = c("NA", "-"))

and then modify the function used by ave() or by() to exclude missing
values from the count of unique last names. Here's one approach adapting
code from earlier in this thread:

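# count the distinct non-missing last names within each first name;
# ave() returns a vector of the same type as dfr$last, so err is character
err <- ave(dfr$last, dfr$first,
           FUN = function(x) length(unique(x[!is.na(x)])))
# keep first names with exactly one distinct non-missing last name
res <- dfr[err == 1 , ]
res <- res[order(res$first) , ]
res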

   first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 <NA>
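
The aggregate-then-subset route (the by() approach from earlier in this
thread) can be adapted in the same way. What follows is only a sketch: it
swaps in tapply() for by() to get a plain named vector of counts, it
repeats the example data so it runs on its own, and the names n.last and
bad are made up for illustration.

```r
# Same example data as above, with "-" read in as NA
dfr <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE, na.strings = c("NA", "-"))

# number of distinct non-missing last names for each first name
n.last <- tapply(dfr$last, dfr$first,
                 function(x) length(unique(x[!is.na(x)])))

# first names with more than one distinct last name (illustrative name)
bad <- names(n.last)[n.last > 1]

# drop every row belonging to those first names, then sort by first name
nw.dfr <- dfr[!dfr$first %in% bad , ]
nw.dfr <- nw.dfr[order(nw.dfr$first) , ]
nw.dfr
```

One difference from the err == 1 test: a first name whose last names are
all missing has a count of 0, so it is kept here but dropped by err == 1.
Which behaviour is wanted depends on the data.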


Alternatively, if not using na.strings, change "-" to NA after first
reading the data in: identify last names recorded as "-" using an index,
and assign NA to these elements, before proceeding as above.
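
A minimal sketch of that alternative, assuming the data were read in
without na.strings so that "-" arrives as an ordinary string. The data are
repeated so the example runs on its own, and the name not.recorded is made
up for illustration.

```r
# Read the data without na.strings: "-" is just another last name here
dfr <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

# logical index of the last names recorded as "-";
# %in% is used rather than == so existing NAs give FALSE, not NA
not.recorded <- dfr$last %in% "-"
dfr$last[not.recorded] <- NA

# then proceed as above
err <- ave(dfr$last, dfr$first,
           FUN = function(x) length(unique(x[!is.na(x)])))
res <- dfr[err == 1 , ]
res <- res[order(res$first) , ]
res
```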

Philip

