Getting an unexpected extra row when merging two dataframes

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Getting an unexpected extra row when merging two dataframes

PaulJr
Hello everyone,

Hope you are all doing great. So I have two datasets:

-dataset1Frame: which contains the historical number of transits from
october 1st, 1985 up to march 1, 2017. It has two columns, one called
TransitDate and the other called Transits. dataset1Frame is a table comming
from an SQL Server Database.

-TransitDateFrame: a made up dataframe that goes from october 1st, 1985 up
to the last date available in dataset1Frame.

Note: The reason why I made up TransitDataFrame is because, since sometimes
dataset1Frame has missing observations (some dates do not exist), and I
just want to make sure I have all the dates available from october 1, 1985
up to the last available observation.
The idea is to leave the transits that do exist as they come, and add the
dates missing as aditional rows (observations) with a value of NA for the
transits.

That being said, here is the code:

>install.packages("src/lubridate_1.6.0.zip", lib=".", repos=NULL,
verbose=TRUE)
>library(lubridate, lib.loc=".", verbose=TRUE)
>library(forecast)
>library(tseries)
>library(stats)
>library(stats4)

>dataset1 <-read.table("CONTAINERTESTDATA.txt")


>dataset1Frame<-data.frame(dataset1)

>dataset1Frame$TransitDate<-as.Date(dataset1Frame$TransitDate, "%Y-%m-%d")

>TransitDate<-seq(as.Date("1985-10-01"),
as.Date(dataset1Frame[nrow(dataset1Frame),1]), "months")

>TransitDate["Transits"]<-NA

>TransitDateFrame<-data.frame(TransitDate)

>NewTransitsFrame<-merge(dataset1Frame,TransitDateFrame, all.y=TRUE)

#Output from resulting dataframes

>TransitDateFrame

>NewTransitsFrame

Why is there an additional row(observation) with a value of NA if I
specified that the dataframe should only go to the last observation? There
should be 378 observations at the end and I get 379 observations instead.

The reason I am doing it this way is because this is how I got to fill in
the gaps in dates (whenever there are nonexistent observations/missing
data).

Any guidance will be greatly appreciated.

I am attaching a .txt file as a reference,

Best regards,

Paul

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

CONTAINERTESTDATA.txt (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Getting an unexpected extra row when merging two dataframes

jholtman
first of all when you read the data in you get 379 rows of data since
you did not say 'header = TRUE' in the read.table.  Here is what the
first 6 lines of you data are:

> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt')
>
> str(dataset1)
'data.frame':   379 obs. of  2 variables:
 $ V1: Factor w/ 379 levels "1-Apr-00","1-Apr-01",..: 379 333 301 80
145 113 239 18 270 207 ...
 $ V2: Factor w/ 66 levels "10","11","12",..: 66 46 57 5 39 48 40 61 10 18 ...
> View(dataset1)
> head(dataset1)
           V1       V2
1 TransitDate Transits
2    1-Oct-85       55
3    1-Nov-85       66
4    1-Dec-85       14
5    1-Jan-86       48
6    1-Feb-86       57
>

You need to learn to use 'str' to look at the structure.  So when you
are converting the dates, you will get an NA because the first row has
"TransitDate".  Now if you had used 'header = TRUE', you data would
look like this:

> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt',
+                     header = TRUE,
+                     as.is = TRUE  # prevent conversion to factors
+                     )
>
> str(dataset1)
'data.frame':   378 obs. of  2 variables:
 $ TransitDate: chr  "1-Oct-85" "1-Nov-85" "1-Dec-85" "1-Jan-86" ...
 $ Transits   : int  55 66 14 48 57 49 70 19 27 28 ...
> head(dataset1)
  TransitDate Transits
1    1-Oct-85       55
2    1-Nov-85       66
3    1-Dec-85       14
4    1-Jan-86       48
5    1-Feb-86       57
6    1-Mar-86       49
>

So try again.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Mar 29, 2017 at 11:02 AM, Paul Bernal <[hidden email]> wrote:

> Hello everyone,
>
> Hope you are all doing great. So I have two datasets:
>
> -dataset1Frame: which contains the historical number of transits from
> october 1st, 1985 up to march 1, 2017. It has two columns, one called
> TransitDate and the other called Transits. dataset1Frame is a table comming
> from an SQL Server Database.
>
> -TransitDateFrame: a made up dataframe that goes from october 1st, 1985 up
> to the last date available in dataset1Frame.
>
> Note: The reason why I made up TransitDataFrame is because, since sometimes
> dataset1Frame has missing observations (some dates do not exist), and I
> just want to make sure I have all the dates available from october 1, 1985
> up to the last available observation.
> The idea is to leave the transits that do exist as they come, and add the
> dates missing as aditional rows (observations) with a value of NA for the
> transits.
>
> That being said, here is the code:
>
>>install.packages("src/lubridate_1.6.0.zip", lib=".", repos=NULL,
> verbose=TRUE)
>>library(lubridate, lib.loc=".", verbose=TRUE)
>>library(forecast)
>>library(tseries)
>>library(stats)
>>library(stats4)
>
>>dataset1 <-read.table("CONTAINERTESTDATA.txt")
>
>
>>dataset1Frame<-data.frame(dataset1)
>
>>dataset1Frame$TransitDate<-as.Date(dataset1Frame$TransitDate, "%Y-%m-%d")
>
>>TransitDate<-seq(as.Date("1985-10-01"),
> as.Date(dataset1Frame[nrow(dataset1Frame),1]), "months")
>
>>TransitDate["Transits"]<-NA
>
>>TransitDateFrame<-data.frame(TransitDate)
>
>>NewTransitsFrame<-merge(dataset1Frame,TransitDateFrame, all.y=TRUE)
>
> #Output from resulting dataframes
>
>>TransitDateFrame
>
>>NewTransitsFrame
>
> Why is there an additional row(observation) with a value of NA if I
> specified that the dataframe should only go to the last observation? There
> should be 378 observations at the end and I get 379 observations instead.
>
> The reason I am doing it this way is because this is how I got to fill in
> the gaps in dates (whenever there are nonexistent observations/missing
> data).
>
> Any guidance will be greatly appreciated.
>
> I am attaching a .txt file as a reference,
>
> Best regards,
>
> Paul
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Getting an unexpected extra row when merging two dataframes

jholtman
In reply to this post by PaulJr
you need to show what 'str' shows for the data structure

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Mar 30, 2017 at 12:08 AM, [hidden email]
<[hidden email]> wrote:

> Dear Jim,
>
> Thank you for your kind reply. However I forgot to tell you that the data
> was actually read from a Microsoft SQL Server database, so I used a select
> statement to read (import) it.
>
> I am working with the R script module of Microsoft Azure Machine Learning
> Studio, adn I used an sql connection to read in the table.
>
> That being said, how can I do to fix tve issue?
>
> Best regards,
>
> Paul
>
>
>
> -------- Mensaje original --------
> Asunto: Re: [R] Getting an unexpected extra row when merging two dataframes
> De: jim holtman
> Para: Paul Bernal
> CC: [hidden email]
>
>
> first of all when you read the data in you get 379 rows of data since
> you did not say 'header = TRUE' in the read.table. Here is what the
> first 6 lines of you data are:
>
>> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt')
>>
>> str(dataset1)
> 'data.frame': 379 obs. of 2 variables:
> $ V1: Factor w/ 379 levels "1-Apr-00","1-Apr-01",..: 379 333 301 80
> 145 113 239 18 270 207 ...
> $ V2: Factor w/ 66 levels "10","11","12",..: 66 46 57 5 39 48 40 61 10 18
> ...
>> View(dataset1)
>> head(dataset1)
> V1 V2
> 1 TransitDate Transits
> 2 1-Oct-85 55
> 3 1-Nov-85 66
> 4 1-Dec-85 14
> 5 1-Jan-86 48
> 6 1-Feb-86 57
>>
>
> You need to learn to use 'str' to look at the structure. So when you
> are converting the dates, you will get an NA because the first row has
> "TransitDate". Now if you had used 'header = TRUE', you data would
> look like this:
>
>> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt',
> + header = TRUE,
> + as.is = TRUE # prevent conversion to factors
> + )
>>
>> str(dataset1)
> 'data.frame': 378 obs. of 2 variables:
> $ TransitDate: chr "1-Oct-85" "1-Nov-85" "1-Dec-85" "1-Jan-86" ...
> $ Transits : int 55 66 14 48 57 49 70 19 27 28 ...
>> head(dataset1)
> TransitDate Transits
> 1 1-Oct-85 55
> 2 1-Nov-85 66
> 3 1-Dec-85 14
> 4 1-Jan-86 48
> 5 1-Feb-86 57
> 6 1-Mar-86 49
>>
>
> So try again.
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Wed, Mar 29, 2017 at 11:02 AM, Paul Bernal wrote:
>> Hello everyone,
>>
>> Hope you are all doing great. So I have two datasets:
>>
>> -dataset1Frame: which contains the historical number of transits from
>> october 1st, 1985 up to march 1, 2017. It has two columns, one called
>> TransitDate and the other called Transits. dataset1Frame is a table
>> comming
>> from an SQL Server Database.
>>
>> -TransitDateFrame: a made up dataframe that goes from october 1st, 1985 up
>> to the last date available in dataset1Frame.
>>
>> Note: The reason why I made up TransitDataFrame is because, since
>> sometimes
>> dataset1Frame has missing observations (some dates do not exist), and I
>> just want to make sure I have all the dates available from october 1, 1985
>> up to the last available observation.
>> The idea is to leave the transits that do exist as they come, and add the
>> dates missing as aditional rows (observations) with a value of NA for the
>> transits.
>>
>> That being said, here is the code:
>>
>>>install.packages("src/lubridate_1.6.0.zip", lib=".", repos=NULL,
>> verbose=TRUE)
>>>library(lubridate, lib.loc=".", verbose=TRUE)
>>>library(forecast)
>>>library(tseries)
>>>library(stats)
>>>library(stats4)
>>
>>>dataset1 <-read.table("CONTAINERTESTDATA.txt")
>>
>>
>>>dataset1Frame<-data.frame(dataset1)
>>
>>>dataset1Frame$TransitDate<-as.Date(dataset1Frame$TransitDate, "%Y-%m-%d")
>>
>>>TransitDate<-seq(as.Date("1985-10-01"),
>> as.Date(dataset1Frame[nrow(dataset1Frame),1]), "months")
>>
>>>TransitDate["Transits"]<-NA
>>
>>>TransitDateFrame<-data.frame(TransitDate)
>>
>>>NewTransitsFrame<-merge(dataset1Frame,TransitDateFrame, all.y=TRUE)
>>
>> #Output from resulting dataframes
>>
>>>TransitDateFrame
>>
>>>NewTransitsFrame
>>
>> Why is there an additional row(observation) with a value of NA if I
>> specified that the dataframe should only go to the last observation? There
>> should be 378 observations at the end and I get 379 observations instead.
>>
>> The reason I am doing it this way is because this is how I got to fill in
>> the gaps in dates (whenever there are nonexistent observations/missing
>> data).
>>
>> Any guidance will be greatly appreciated.
>>
>> I am attaching a .txt file as a reference,
>>
>> Best regards,
>>
>> Paul
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.