Problems with time formats when importing data using readHTMLTable

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Problems with time formats when importing data using readHTMLTable

Cristina Silva-2
Hi,

I am extracting positions data from the marine traffic website. The
table has a "Timestamp" column which, in the browser, appears with the
format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).

When I import the table, the same date "2017-07-10 14:04 (UTC)" appears
as "1499696500149969650021 minutes ago", This is the more recent date
and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear as e.g.
"1499619726149961972621 hours, 59 minutes ago".

I don't know how to convert these data to the time formats used in R
(POSIXct).

The script is very simple and worked before:

library(XML)
x <- readHTMLTable('url')

where the 'url' is the link to the website with the specification of the
vessel.

I appreciate any help.

Cristina

--
Cristina Silva
Divisão de Modelação e Gestão de Recursos Pesqueiros
Av. Dr. Alfredo Magalhães Ramalho
1495-165 Lisboa
@: [hidden email] <mailto:[hidden email]>
#: +351 213027096 <phoneto:+351213027096>


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with time formats when importing data using readHTMLTable

Jeff Newmiller
Not reproducible. [1][2][3] If our answers don't seem to apply to your situation, it will likely be because you did not explain your question clearly.

Not plain text. This is a plain text mailing list,  and the best-case scenario when you let your email program send HTML is that what you saw is not what we see (worst case is your email is scrambled on our end).

Have you read the documentation for the function you are using? In particular, what about the colClasses argument? If you don't let readHTMLTable guess what the format is (have it read in as character data) then you have a fighting chance to get it right yourself, e.g.

as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M (UTC)", tz="UTC" )

-----

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

[2] http://adv-r.had.co.nz/Reproducibility.html

[3] https://cran.r-project.org/web/packages/reprex/index.html
--
Sent from my phone. Please excuse my brevity.

On July 10, 2017 8:31:30 AM PDT, Cristina Silva <[hidden email]> wrote:

>Hi,
>
>I am extracting positions data from the marine traffic website. The
>table has a "Timestamp" column which, in the browser, appears with the
>format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>
>When I import the table, the same date "2017-07-10 14:04 (UTC)" appears
>
>as "1499696500149969650021 minutes ago", This is the more recent date
>and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear as
>e.g.
>"1499619726149961972621 hours, 59 minutes ago".
>
>I don't know how to convert these data to the time formats used in R
>(POSIXct).
>
>The script is very simple and worked before:
>
>library(XML)
>x <- readHTMLTable('url')
>
>where the 'url' is the link to the website with the specification of
>the
>vessel.
>
>I appreciate any help.
>
>Cristina
>
>--
>Cristina Silva
>Divisão de Modelação e Gestão de Recursos Pesqueiros
>Av. Dr. Alfredo Magalhães Ramalho
>1495-165 Lisboa
>@: [hidden email] <mailto:[hidden email]>
>#: +351 213027096 <phoneto:+351213027096>
>
>
> [[alternative HTML version deleted]]
>
>______________________________________________
>[hidden email] mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with time formats when importing data using readHTMLTable

Cristina Silva-2
Dear Jeff,

I am sorry, I didn't notice that it was not plain text. I hope that it
is now in the correct format. I explain the problem again, now with more
detais.
I am collecting the track positions of our research vessel from
www.marinetraffic.com.  In the page, the data appear in a table:

Timestamp     Source     Speed (kn)     Latitude (°) Longitude (°)    
Course (°)     Show on Map
2017-07-11 10:57 (UTC)    Terr-AIS     8.6    37.36228 -9.176811    200
2017-07-11 10:54 (UTC)    Terr-AIS     8.4    37.36875 -9.174048    200
2017-07-11 09:50 (UTC)    Terr-AIS     8    37.51499 -9.184502    149
2017-07-11 09:08 (UTC)    Terr-AIS     7.7    37.60513 -9.228263    169
2017-07-11 09:05 (UTC)    Terr-AIS     8    37.61168 -9.229627    167
2017-07-11 09:03 (UTC)    Terr-AIS     6.9    37.61626 -9.230456    132
2017-07-11 08:59 (UTC)    Terr-AIS     3.4    37.61758 -9.237928    101
2017-07-11 08:57 (UTC)    Terr-AIS     3.6    37.61808 -9.240235    116
2017-07-11 08:47 (UTC)    Terr-AIS     3.3    37.62267 -9.249762    112
2017-07-11 08:44 (UTC)    Terr-AIS     3.2    37.6241 -9.25348    115
2017-07-11 08:40 (UTC)    Terr-AIS     3.3    37.62557 -9.257295    112
2017-07-11 08:32 (UTC)    Terr-AIS     3.4    37.62903 -9.266028    111
2017-07-11 08:28 (UTC)    Terr-AIS     3.3    37.63051 -9.269725    114
2017-07-11 08:25 (UTC)    Terr-AIS     3.3    37.63207 -9.273642    120
2017-07-11 08:20 (UTC)    Terr-AIS     3.6    37.63395 -9.278254    118
2017-07-11 08:16 (UTC)    Terr-AIS     3.3    37.63586 -9.282853    116
2017-07-11 08:08 (UTC)    Terr-AIS     3.2    37.6392 -9.291157    118
2017-07-11 07:38 (UTC)    Sat-AIS
2017-07-11 06:58 (UTC)    Terr-AIS     3    37.68459 -9.321658    159
2017-07-11 06:53 (UTC)    Terr-AIS     3.3    37.68881 -9.324172    150

To get this table I had to copy manually from the webpage, paste in a
text file, delete information on figures, links, etc. and then import to
R. This works well, I read the csv file and have no problems with date
formats. But instead of 20 records, I have to import e.g. 500 records,
i.e. 10 pages of 50 records each, or more. I want to import these data
in a more automated and quick procedure. I have done this before with no
problems, but probably the webpage changed some permissions.

Here is the script used to get the data directly from the webpage into
R, for a sample of 20 records (10 per page):

library(XML)
x <- list()
for (i in 1:2)
{
     x[i]<-
readHTMLTable(paste('http://www.marinetraffic.com/en/ais/index/positions/all/shipid:318358/mmsi:263601000/shipname:NORUEGA/per_page:10/page:',
i, sep=''))
}

ais <- do.call('rbind', x)
ais <- ais[,-7]

and I got the following table:

> ais
                                      Timestamp Source                  
       Speed (kn) Latitude (°) Longitude (°) Course (°)
1            149977066014997706604 minutes ago Terr-AIS                
                8.6      37.36228 -9.176811         200
2            149977048714997704876 minutes ago Terr-AIS                
                8.4      37.36875 -9.174048         200
3   149976661414997666141 hour, 11 minutes ago Terr-AIS                
                8.0      37.51499 -9.184502         149
4   149976410714997641071 hour, 53 minutes ago Terr-AIS                
                7.7      37.60513 -9.228263         169
5   149976392714997639271 hour, 56 minutes ago Terr-AIS                
                8.0      37.61168 -9.229627         167
6   149976378014997637801 hour, 58 minutes ago Terr-AIS                
                6.9      37.61626 -9.230456         132
7   149976354014997635402 hours, 2 minutes ago Terr-AIS                
                3.4      37.61758 -9.237928         101
8   149976342014997634202 hours, 4 minutes ago Terr-AIS                
                3.6      37.61808 -9.240235         116
9  149976286114997628612 hours, 14 minutes ago Terr-AIS                
                3.3      37.62267 -9.249762         112
10 149976264714997626472 hours, 17 minutes ago Terr-AIS                
                3.2       37.6241 -9.25348         115
11 149976243014997624302 hours, 21 minutes ago Terr-AIS                
                3.3      37.62557 -9.257295         112
12 149976193714997619372 hours, 29 minutes ago Terr-AIS                
                3.4      37.62903 -9.266028         111
13 149976172814997617282 hours, 32 minutes ago Terr-AIS                
                3.3      37.63051 -9.269725         114
14 149976150714997615072 hours, 36 minutes ago Terr-AIS                
                3.3      37.63207 -9.273642         120
15 149976124714997612472 hours, 40 minutes ago Terr-AIS                
                3.6      37.63395 -9.278254         118
16 149976098714997609872 hours, 45 minutes ago Terr-AIS                
                3.3      37.63586 -9.282853         116
17 149976051014997605102 hours, 53 minutes ago Terr-AIS                
                3.2       37.6392 -9.291157         118
18 149975870414997587043 hours, 23 minutes ago  Sat-AIS Add to SAT
Fleet for undelayed data          <NA> <NA>        <NA>
19  149975631414997563144 hours, 3 minutes ago Terr-AIS                
                3.0      37.68459 -9.321658         159
20  149975598814997559884 hours, 8 minutes ago Terr-AIS                
                3.3      37.68881 -9.324172         150

The records correspond exactly to the first table. I have tried to
change the colClasses to character, but the results are always the same.
My question is: how to change the Timestamp to a POSIXlt POSIXct format,
either in the importing phase or within R?

I hope that the question is clear now.

Cristina


On Mon, 10 Jul 2017 09:18:05 -0700, Jeff Newmiller wrote:

> Not reproducible. [1][2][3] If our answers don't seem to apply to
> your situation, it will likely be because you did not explain your
> question clearly.
>
> Not plain text. This is a plain text mailing list,  and the best-case
> scenario when you let your email program send HTML is that what you
> saw is not what we see (worst case is your email is scrambled on our
> end).
>
> Have you read the documentation for the function you are using? In
> particular, what about the colClasses argument? If you don't let
> readHTMLTable guess what the format is (have it read in as character
> data) then you have a fighting chance to get it right yourself, e.g.
>
> as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M (UTC)",
> tz="UTC" )
>
> -----
>
> [1]
>
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>
> [2] http://adv-r.had.co.nz/Reproducibility.html
>
> [3] https://cran.r-project.org/web/packages/reprex/index.html
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 10, 2017 8:31:30 AM PDT, Cristina Silva <[hidden email]>
> wrote:
>>Hi,
>>
>>I am extracting positions data from the marine traffic website. The
>>table has a "Timestamp" column which, in the browser, appears with
>> the
>>format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>>
>>When I import the table, the same date "2017-07-10 14:04 (UTC)"
>> appears
>>
>>as "1499696500149969650021 minutes ago", This is the more recent date
>>and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear as
>>e.g.
>>"1499619726149961972621 hours, 59 minutes ago".
>>
>>I don't know how to convert these data to the time formats used in R
>>(POSIXct).
>>
>>The script is very simple and worked before:
>>
>>library(XML)
>>x <- readHTMLTable('url')
>>
>>where the 'url' is the link to the website with the specification of
>>the
>>vessel.
>>
>>I appreciate any help.
>>
>>Cristina
>>
>>--
>>Cristina Silva
>>Divisão de Modelação e Gestão de Recursos Pesqueiros
>>Av. Dr. Alfredo Magalhães Ramalho
>>1495-165 Lisboa
>>@: [hidden email] <mailto:[hidden email]>
>>#: +351 213027096 <phoneto:+351213027096>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>______________________________________________
>>[hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide
>>http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.

--
Cristina Silva
Divisão de Modelação e Gestão de Recursos Pesqueiros
Avenida de Brasília
1449-006 Lisboa
@: [hidden email]
#: +351 213027096

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with time formats when importing data using readHTMLTable

David Winsemius

> On Jul 11, 2017, at 6:25 AM, Cristina Silva <[hidden email]> wrote:
>
> Dear Jeff,
>
> I am sorry, I didn't notice that it was not plain text. I hope that it is now in the correct format. I explain the problem again, now with more detais.
> I am collecting the track positions of our research vessel from www.marinetraffic.com.  In the page, the data appear in a table:
>
> Timestamp     Source     Speed (kn)     Latitude (°) Longitude (°)     Course (°)     Show on Map
> 2017-07-11 10:57 (UTC)    Terr-AIS     8.6    37.36228 -9.176811    200
> 2017-07-11 10:54 (UTC)    Terr-AIS     8.4    37.36875 -9.174048    200
> 2017-07-11 09:50 (UTC)    Terr-AIS     8    37.51499 -9.184502    149
> 2017-07-11 09:08 (UTC)    Terr-AIS     7.7    37.60513 -9.228263    169
> 2017-07-11 09:05 (UTC)    Terr-AIS     8    37.61168 -9.229627    167
> 2017-07-11 09:03 (UTC)    Terr-AIS     6.9    37.61626 -9.230456    132
> 2017-07-11 08:59 (UTC)    Terr-AIS     3.4    37.61758 -9.237928    101
> 2017-07-11 08:57 (UTC)    Terr-AIS     3.6    37.61808 -9.240235    116
> 2017-07-11 08:47 (UTC)    Terr-AIS     3.3    37.62267 -9.249762    112
> 2017-07-11 08:44 (UTC)    Terr-AIS     3.2    37.6241 -9.25348    115
> 2017-07-11 08:40 (UTC)    Terr-AIS     3.3    37.62557 -9.257295    112
> 2017-07-11 08:32 (UTC)    Terr-AIS     3.4    37.62903 -9.266028    111
> 2017-07-11 08:28 (UTC)    Terr-AIS     3.3    37.63051 -9.269725    114
> 2017-07-11 08:25 (UTC)    Terr-AIS     3.3    37.63207 -9.273642    120
> 2017-07-11 08:20 (UTC)    Terr-AIS     3.6    37.63395 -9.278254    118
> 2017-07-11 08:16 (UTC)    Terr-AIS     3.3    37.63586 -9.282853    116
> 2017-07-11 08:08 (UTC)    Terr-AIS     3.2    37.6392 -9.291157    118
> 2017-07-11 07:38 (UTC)    Sat-AIS
> 2017-07-11 06:58 (UTC)    Terr-AIS     3    37.68459 -9.321658    159
> 2017-07-11 06:53 (UTC)    Terr-AIS     3.3    37.68881 -9.324172    150
>
> To get this table I had to copy manually from the webpage, paste in a text file, delete information on figures, links, etc. and then import to R. This works well, I read the csv file and have no problems with date formats. But instead of 20 records, I have to import e.g. 500 records, i.e. 10 pages of 50 records each, or more. I want to import these data in a more automated and quick procedure. I have done this before with no problems, but probably the webpage changed some permissions.
>
> Here is the script used to get the data directly from the webpage into R, for a sample of 20 records (10 per page):
>
> library(XML)
> x <- list()
> for (i in 1:2)
> {
>    x[i]<- readHTMLTable(paste('http://www.marinetraffic.com/en/ais/index/positions/all/shipid:318358/mmsi:263601000/shipname:NORUEGA/per_page:10/page:', i, sep=''))
> }
>
> ais <- do.call('rbind', x)
> ais <- ais[,-7]
>
> and I got the following table:
>
>> ais
>                                     Timestamp Source                         Speed (kn) Latitude (°) Longitude (°) Course (°)
> 1            149977066014997706604 minutes ago Terr-AIS                                8.6      37.36228 -9.176811         200

There's undoubtedly a way to extract the time info from this HTML code (but I'm not the one to provide that solution.) This is the first entry in the webpage source when viewed right now:

<td>
<time class="toggle_time utc_time hide-me formatTime" data-overrideAbbreviation="empty" data-time="1499780737" data-timezone="0"><span>1499780737</span></time><time class="toggle_time my_time hide-me formatTime" data-overrideAbbreviation="MyT" data-time="1499780737" data-timezone="MyT"><span>1499780737</span></time><time class="toggle_time dif_time hide"><span>6 minutes ago</span></time> </td>

So the POSIX-style time is givne twice (with no separator and then an offset in minutes is calculated. If I were given the records as you have them now I would extract the leading 10 digits and convert to POSIXct

as.POSIXct(as.numeric(gsub("(^.{10}).+","", Timestamp)), origin="1970-01-01")

as.POSIXct(as.numeric(gsub("(^.{10}).+","\\1", ais$Timestamp)), origin="1970-01-01")
 [1] "2017-07-11 07:09:57 PDT" "2017-07-11 06:45:37 PDT" "2017-07-11 06:40:34 PDT" "2017-07-11 06:34:46 PDT" "2017-07-11 06:32:26 PDT"
 [6] "2017-07-11 06:29:34 PDT" "2017-07-11 06:26:37 PDT" "2017-07-11 06:20:24 PDT" "2017-07-11 06:17:18 PDT" "2017-07-11 06:12:37 PDT"
[11] "2017-07-11 06:08:24 PDT" "2017-07-11 06:05:04 PDT" "2017-07-11 06:03:27 PDT" "2017-07-11 05:59:37 PDT" "2017-07-11 05:54:37 PDT"
[16] "2017-07-11 05:50:48 PDT" "2017-07-11 05:44:53 PDT" "2017-07-11 05:29:17 PDT" "2017-07-11 05:26:44 PDT" "2017-07-11 04:30:08 PDT"

-- David.



> 2            149977048714997704876 minutes ago Terr-AIS                                8.4      37.36875 -9.174048         200
> 3   149976661414997666141 hour, 11 minutes ago Terr-AIS                                8.0      37.51499 -9.184502         149
> 4   149976410714997641071 hour, 53 minutes ago Terr-AIS                                7.7      37.60513 -9.228263         169
> 5   149976392714997639271 hour, 56 minutes ago Terr-AIS                                8.0      37.61168 -9.229627         167
> 6   149976378014997637801 hour, 58 minutes ago Terr-AIS                                6.9      37.61626 -9.230456         132
> 7   149976354014997635402 hours, 2 minutes ago Terr-AIS                                3.4      37.61758 -9.237928         101
> 8   149976342014997634202 hours, 4 minutes ago Terr-AIS                                3.6      37.61808 -9.240235         116
> 9  149976286114997628612 hours, 14 minutes ago Terr-AIS                                3.3      37.62267 -9.249762         112
> 10 149976264714997626472 hours, 17 minutes ago Terr-AIS                                3.2       37.6241 -9.25348         115
> 11 149976243014997624302 hours, 21 minutes ago Terr-AIS                                3.3      37.62557 -9.257295         112
> 12 149976193714997619372 hours, 29 minutes ago Terr-AIS                                3.4      37.62903 -9.266028         111
> 13 149976172814997617282 hours, 32 minutes ago Terr-AIS                                3.3      37.63051 -9.269725         114
> 14 149976150714997615072 hours, 36 minutes ago Terr-AIS                                3.3      37.63207 -9.273642         120
> 15 149976124714997612472 hours, 40 minutes ago Terr-AIS                                3.6      37.63395 -9.278254         118
> 16 149976098714997609872 hours, 45 minutes ago Terr-AIS                                3.3      37.63586 -9.282853         116
> 17 149976051014997605102 hours, 53 minutes ago Terr-AIS                                3.2       37.6392 -9.291157         118
> 18 149975870414997587043 hours, 23 minutes ago  Sat-AIS Add to SAT Fleet for undelayed data          <NA> <NA>        <NA>
> 19  149975631414997563144 hours, 3 minutes ago Terr-AIS                                3.0      37.68459 -9.321658         159
> 20  149975598814997559884 hours, 8 minutes ago Terr-AIS                                3.3      37.68881 -9.324172         150
>
> The records correspond exactly to the first table. I have tried to change the colClasses to character, but the results are always the same. My question is: how to change the Timestamp to a POSIXlt POSIXct format, either in the importing phase or within R?
>
> I hope that the question is clear now.
>
> Cristina
>
>
> On Mon, 10 Jul 2017 09:18:05 -0700, Jeff Newmiller wrote:
>> Not reproducible. [1][2][3] If our answers don't seem to apply to
>> your situation, it will likely be because you did not explain your
>> question clearly.
>>
>> Not plain text. This is a plain text mailing list,  and the best-case
>> scenario when you let your email program send HTML is that what you
>> saw is not what we see (worst case is your email is scrambled on our
>> end).
>>
>> Have you read the documentation for the function you are using? In
>> particular, what about the colClasses argument? If you don't let
>> readHTMLTable guess what the format is (have it read in as character
>> data) then you have a fighting chance to get it right yourself, e.g.
>>
>> as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M (UTC)",
>> tz="UTC" )
>>
>> -----
>>
>> [1]
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>
>> [2] http://adv-r.had.co.nz/Reproducibility.html
>>
>> [3] https://cran.r-project.org/web/packages/reprex/index.html
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 10, 2017 8:31:30 AM PDT, Cristina Silva <[hidden email]> wrote:
>>> Hi,
>>>
>>> I am extracting positions data from the marine traffic website. The
>>> table has a "Timestamp" column which, in the browser, appears with the
>>> format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>>>
>>> When I import the table, the same date "2017-07-10 14:04 (UTC)" appears
>>>
>>> as "1499696500149969650021 minutes ago", This is the more recent date
>>> and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear as
>>> e.g.
>>> "1499619726149961972621 hours, 59 minutes ago".
>>>
>>> I don't know how to convert these data to the time formats used in R
>>> (POSIXct).
>>>
>>> The script is very simple and worked before:
>>>
>>> library(XML)
>>> x <- readHTMLTable('url')
>>>
>>> where the 'url' is the link to the website with the specification of
>>> the
>>> vessel.
>>>
>>> I appreciate any help.
>>>
>>> Cristina
>>>
>>> --
>>> Cristina Silva
>>> Divisão de Modelação e Gestão de Recursos Pesqueiros
>>> Av. Dr. Alfredo Magalhães Ramalho
>>> 1495-165 Lisboa
>>> @: [hidden email] <mailto:[hidden email]>
>>> #: +351 213027096 <phoneto:+351213027096>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Cristina Silva
> Divisão de Modelação e Gestão de Recursos Pesqueiros
> Avenida de Brasília
> 1449-006 Lisboa
> @: [hidden email]
> #: +351 213027096
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with time formats when importing data using readHTMLTable

Cristina Silva-2
Dear David,

Thank you very much. You have solved my problem, not of importing in
the right format but the date conversion. It works for me, and if I
specify the timezone tz='GMT', I get exactly the same time of the
webpage.

Sincerely yours,

Cristina

On Tue, 11 Jul 2017 07:15:43 -0700, David Winsemius wrote:

>> On Jul 11, 2017, at 6:25 AM, Cristina Silva <[hidden email]> wrote:
>>
>> Dear Jeff,
>>
>> I am sorry, I didn't notice that it was not plain text. I hope that
>> it is now in the correct format. I explain the problem again, now with
>> more detais.
>> I am collecting the track positions of our research vessel from
>> www.marinetraffic.com.  In the page, the data appear in a table:
>>
>> Timestamp     Source     Speed (kn)     Latitude (°) Longitude (°)  
>>  Course (°)     Show on Map
>> 2017-07-11 10:57 (UTC)    Terr-AIS     8.6    37.36228 -9.176811    
>> 200
>> 2017-07-11 10:54 (UTC)    Terr-AIS     8.4    37.36875 -9.174048    
>> 200
>> 2017-07-11 09:50 (UTC)    Terr-AIS     8    37.51499 -9.184502    
>> 149
>> 2017-07-11 09:08 (UTC)    Terr-AIS     7.7    37.60513 -9.228263    
>> 169
>> 2017-07-11 09:05 (UTC)    Terr-AIS     8    37.61168 -9.229627    
>> 167
>> 2017-07-11 09:03 (UTC)    Terr-AIS     6.9    37.61626 -9.230456    
>> 132
>> 2017-07-11 08:59 (UTC)    Terr-AIS     3.4    37.61758 -9.237928    
>> 101
>> 2017-07-11 08:57 (UTC)    Terr-AIS     3.6    37.61808 -9.240235    
>> 116
>> 2017-07-11 08:47 (UTC)    Terr-AIS     3.3    37.62267 -9.249762    
>> 112
>> 2017-07-11 08:44 (UTC)    Terr-AIS     3.2    37.6241 -9.25348    
>> 115
>> 2017-07-11 08:40 (UTC)    Terr-AIS     3.3    37.62557 -9.257295    
>> 112
>> 2017-07-11 08:32 (UTC)    Terr-AIS     3.4    37.62903 -9.266028    
>> 111
>> 2017-07-11 08:28 (UTC)    Terr-AIS     3.3    37.63051 -9.269725    
>> 114
>> 2017-07-11 08:25 (UTC)    Terr-AIS     3.3    37.63207 -9.273642    
>> 120
>> 2017-07-11 08:20 (UTC)    Terr-AIS     3.6    37.63395 -9.278254    
>> 118
>> 2017-07-11 08:16 (UTC)    Terr-AIS     3.3    37.63586 -9.282853    
>> 116
>> 2017-07-11 08:08 (UTC)    Terr-AIS     3.2    37.6392 -9.291157    
>> 118
>> 2017-07-11 07:38 (UTC)    Sat-AIS
>> 2017-07-11 06:58 (UTC)    Terr-AIS     3    37.68459 -9.321658    
>> 159
>> 2017-07-11 06:53 (UTC)    Terr-AIS     3.3    37.68881 -9.324172    
>> 150
>>
>> To get this table I had to copy manually from the webpage, paste in
>> a text file, delete information on figures, links, etc. and then
>> import to R. This works well, I read the csv file and have no problems
>> with date formats. But instead of 20 records, I have to import e.g.
>> 500 records, i.e. 10 pages of 50 records each, or more. I want to
>> import these data in a more automated and quick procedure. I have done
>> this before with no problems, but probably the webpage changed some
>> permissions.
>>
>> Here is the script used to get the data directly from the webpage
>> into R, for a sample of 20 records (10 per page):
>>
>> library(XML)
>> x <- list()
>> for (i in 1:2)
>> {
>>    x[i]<-
>> readHTMLTable(paste('http://www.marinetraffic.com/en/ais/index/positions/all/shipid:318358/mmsi:263601000/shipname:NORUEGA/per_page:10/page:',
>> i, sep=''))
>> }
>>
>> ais <- do.call('rbind', x)
>> ais <- ais[,-7]
>>
>> and I got the following table:
>>
>>> ais
>>                                     Timestamp Source                
>>        Speed (kn) Latitude (°) Longitude (°) Course (°)
>> 1            149977066014997706604 minutes ago Terr-AIS              
>>                  8.6      37.36228 -9.176811         200
>
> There's undoubtedly a way to extract the time info from this HTML
> code (but I'm not the one to provide that solution.) This is the
> first
> entry in the webpage source when viewed right now:
>
> <td>
> <time class="toggle_time utc_time hide-me formatTime"
> data-overrideAbbreviation="empty" data-time="1499780737"
> data-timezone="0"><span>1499780737</span></time><time
> class="toggle_time my_time hide-me formatTime"
> data-overrideAbbreviation="MyT" data-time="1499780737"
> data-timezone="MyT"><span>1499780737</span></time><time
> class="toggle_time dif_time hide"><span>6 minutes
> ago</span></time> </td>
>
> So the POSIX-style time is givne twice (with no separator and then an
> offset in minutes is calculated. If I were given the records as you
> have them now I would extract the leading 10 digits and convert to
> POSIXct
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","", Timestamp)),
> origin="1970-01-01")
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","\\1", ais$Timestamp)),
> origin="1970-01-01")
>  [1] "2017-07-11 07:09:57 PDT" "2017-07-11 06:45:37 PDT" "2017-07-11
> 06:40:34 PDT" "2017-07-11 06:34:46 PDT" "2017-07-11 06:32:26 PDT"
>  [6] "2017-07-11 06:29:34 PDT" "2017-07-11 06:26:37 PDT" "2017-07-11
> 06:20:24 PDT" "2017-07-11 06:17:18 PDT" "2017-07-11 06:12:37 PDT"
> [11] "2017-07-11 06:08:24 PDT" "2017-07-11 06:05:04 PDT" "2017-07-11
> 06:03:27 PDT" "2017-07-11 05:59:37 PDT" "2017-07-11 05:54:37 PDT"
> [16] "2017-07-11 05:50:48 PDT" "2017-07-11 05:44:53 PDT" "2017-07-11
> 05:29:17 PDT" "2017-07-11 05:26:44 PDT" "2017-07-11 04:30:08 PDT"
>
> -- David.
>
>
>
>> 2            149977048714997704876 minutes ago Terr-AIS              
>>                  8.4      37.36875 -9.174048         200
>> 3   149976661414997666141 hour, 11 minutes ago Terr-AIS              
>>                  8.0      37.51499 -9.184502         149
>> 4   149976410714997641071 hour, 53 minutes ago Terr-AIS              
>>                  7.7      37.60513 -9.228263         169
>> 5   149976392714997639271 hour, 56 minutes ago Terr-AIS              
>>                  8.0      37.61168 -9.229627         167
>> 6   149976378014997637801 hour, 58 minutes ago Terr-AIS              
>>                  6.9      37.61626 -9.230456         132
>> 7   149976354014997635402 hours, 2 minutes ago Terr-AIS              
>>                  3.4      37.61758 -9.237928         101
>> 8   149976342014997634202 hours, 4 minutes ago Terr-AIS              
>>                  3.6      37.61808 -9.240235         116
>> 9  149976286114997628612 hours, 14 minutes ago Terr-AIS              
>>                  3.3      37.62267 -9.249762         112
>> 10 149976264714997626472 hours, 17 minutes ago Terr-AIS              
>>                  3.2       37.6241 -9.25348         115
>> 11 149976243014997624302 hours, 21 minutes ago Terr-AIS              
>>                  3.3      37.62557 -9.257295         112
>> 12 149976193714997619372 hours, 29 minutes ago Terr-AIS              
>>                  3.4      37.62903 -9.266028         111
>> 13 149976172814997617282 hours, 32 minutes ago Terr-AIS              
>>                  3.3      37.63051 -9.269725         114
>> 14 149976150714997615072 hours, 36 minutes ago Terr-AIS              
>>                  3.3      37.63207 -9.273642         120
>> 15 149976124714997612472 hours, 40 minutes ago Terr-AIS              
>>                  3.6      37.63395 -9.278254         118
>> 16 149976098714997609872 hours, 45 minutes ago Terr-AIS              
>>                  3.3      37.63586 -9.282853         116
>> 17 149976051014997605102 hours, 53 minutes ago Terr-AIS              
>>                  3.2       37.6392 -9.291157         118
>> 18 149975870414997587043 hours, 23 minutes ago  Sat-AIS Add to SAT
>> Fleet for undelayed data          <NA> <NA>        <NA>
>> 19  149975631414997563144 hours, 3 minutes ago Terr-AIS              
>>                  3.0      37.68459 -9.321658         159
>> 20  149975598814997559884 hours, 8 minutes ago Terr-AIS              
>>                  3.3      37.68881 -9.324172         150
>>
>> The records correspond exactly to the first table. I have tried to
>> change the colClasses to character, but the results are always the
>> same. My question is: how to change the Timestamp to a POSIXlt POSIXct
>> format, either in the importing phase or within R?
>>
>> I hope that the question is clear now.
>>
>> Cristina
>>
>>
>> On Mon, 10 Jul 2017 09:18:05 -0700, Jeff Newmiller wrote:
>>> Not reproducible. [1][2][3] If our answers don't seem to apply to
>>> your situation, it will likely be because you did not explain your
>>> question clearly.
>>>
>>> Not plain text. This is a plain text mailing list,  and the
>>> best-case
>>> scenario when you let your email program send HTML is that what you
>>> saw is not what we see (worst case is your email is scrambled on
>>> our
>>> end).
>>>
>>> Have you read the documentation for the function you are using? In
>>> particular, what about the colClasses argument? If you don't let
>>> readHTMLTable guess what the format is (have it read in as
>>> character
>>> data) then you have a fighting chance to get it right yourself,
>>> e.g.
>>>
>>> as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M
>>> (UTC)",
>>> tz="UTC" )
>>>
>>> -----
>>>
>>> [1]
>>>
>>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>>
>>> [2] http://adv-r.had.co.nz/Reproducibility.html
>>>
>>> [3] https://cran.r-project.org/web/packages/reprex/index.html
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 10, 2017 8:31:30 AM PDT, Cristina Silva <[hidden email]>
>>> wrote:
>>>> Hi,
>>>>
>>>> I am extracting positions data from the marine traffic website.
>>>> The
>>>> table has a "Timestamp" column which, in the browser, appears with
>>>> the
>>>> format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>>>>
>>>> When I import the table, the same date "2017-07-10 14:04 (UTC)"
>>>> appears
>>>>
>>>> as "1499696500149969650021 minutes ago", This is the more recent
>>>> date
>>>> and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear
>>>> as
>>>> e.g.
>>>> "1499619726149961972621 hours, 59 minutes ago".
>>>>
>>>> I don't know how to convert these data to the time formats used in
>>>> R
>>>> (POSIXct).
>>>>
>>>> The script is very simple and worked before:
>>>>
>>>> library(XML)
>>>> x <- readHTMLTable('url')
>>>>
>>>> where the 'url' is the link to the website with the specification
>>>> of
>>>> the
>>>> vessel.
>>>>
>>>> I appreciate any help.
>>>>
>>>> Cristina
>>>>

>
> David Winsemius
> Alameda, CA, USA

--
Cristina Silva
Divisão de Modelação e Gestão de Recursos Pesqueiros
Avenida de Brasília
1449-006 Lisboa
@: [hidden email]
#: +351 213027096

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with time formats when importing data using readHTMLTable

David Winsemius


Sent from my iPhone

> On Jul 11, 2017, at 10:00 AM, Cristina Silva <[hidden email]> wrote:
>
> Dear David,
>
> Thank you very much. You have solved my problem, not of importing in the right format but the date conversion. It works for me, and if I specify the timezone tz='GMT', I get exactly the same time of the webpage.
>
R date times are a bit weird. To me , anyway. Input is in UTC, but default output is in your locale. So I usually wrap format around a result if I want UTC.

--
David

> Sincerely yours,
>
> Cristina
>
> On Tue, 11 Jul 2017 07:15:43 -0700, David Winsemius wrote:
>>> On Jul 11, 2017, at 6:25 AM, Cristina Silva <[hidden email]> wrote:
>>>
>>> Dear Jeff,
>>>
>>> I am sorry, I didn't notice that it was not plain text. I hope that it is now in the correct format. I explain the problem again, now with more detais.
>>> I am collecting the track positions of our research vessel from www.marinetraffic.com.  In the page, the data appear in a table:
>>>
>>> Timestamp     Source     Speed (kn)     Latitude (°) Longitude (°)    Course (°)     Show on Map
>>> 2017-07-11 10:57 (UTC)    Terr-AIS     8.6    37.36228 -9.176811    200
>>> 2017-07-11 10:54 (UTC)    Terr-AIS     8.4    37.36875 -9.174048    200
>>> 2017-07-11 09:50 (UTC)    Terr-AIS     8    37.51499 -9.184502    149
>>> 2017-07-11 09:08 (UTC)    Terr-AIS     7.7    37.60513 -9.228263    169
>>> 2017-07-11 09:05 (UTC)    Terr-AIS     8    37.61168 -9.229627    167
>>> 2017-07-11 09:03 (UTC)    Terr-AIS     6.9    37.61626 -9.230456    132
>>> 2017-07-11 08:59 (UTC)    Terr-AIS     3.4    37.61758 -9.237928    101
>>> 2017-07-11 08:57 (UTC)    Terr-AIS     3.6    37.61808 -9.240235    116

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...