readLines() truncates text file with Codepage 437 encoding

Adam Obeng
Hello r-devel,


The attached Code page 437-encoded file contains 245 characters
(including the final newline), but readLines only reads 242 of them:

> test_text <- readLines(file('437__characters.txt', encoding='437'))
Warning message:
In readLines(file("437__characters.txt",  :
  incomplete final line found on '437__characters.txt'
> test_text
[1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177 ¡¢£¥ª«¬°±²µ·º»¼½¿ÄÅÆÇÉÑÖÜßàáâäåæçèéêëìíîïñòóôö÷ùúûüÿƒΓΘΣΦΩαδεπστφⁿ₧∙√∞∩≈≡≤≥⌐⌠⌡─│┌┐└┘├┤┬┴┼═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬▀▄█▌▐░▒"
> nchar(test_text)
[1] 242

You'll note that readLines hasn't read the final characters "▓■\n".

# Diagnostics

My best guess is that this is something to do with how readLines()
determines when it has reached EOF, because of the following:

- The file is terminated with an ASCII LF (0x0a), but R gives an
'incomplete final line found' warning. Note that in some implementations
of Code page 437, 0x0a is interpreted as a graphical character rather
than a control character, but this does not seem to be the problem here.
The same problem occurs if the file ends with 0x0d or 0x0d 0x0a.
- Adding seven or more characters to the end of the file makes it read
correctly.
- Similarly, the file is read correctly if you remove three characters
from anywhere in the file.
- The same issue seems to occur when reading files encoded in other DOS
code pages.
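
The length dependence described in the bullets above could be probed with a sweep over file sizes. The following is only a sketch under assumptions: `read_back_nchar` is a hypothetical helper, the payload is a single repeated CP437 byte (0xB0) rather than the original file, and the encoding name "CP437" is assumed to be known to the local iconv. Whether this particular byte pattern triggers the truncation is not established in this thread.

```r
# Hypothetical helper: write `n` copies of the CP437 byte 0xB0 plus a final LF,
# read the file back through a re-encoding connection, and count characters.
read_back_nchar <- function(n, encoding = "CP437") {
  path <- tempfile("cp437_")
  writeBin(as.raw(c(rep(0xB0, n), 0x0A)), path)
  con <- file(path, encoding = encoding)
  on.exit({ close(con); unlink(path) })
  tryCatch(
    # warn = FALSE silences the "incomplete final line" warning.
    sum(nchar(readLines(con, warn = FALSE), type = "chars")),
    # Covers, for example, an encoding name the local iconv does not support.
    error = function(e) NA_integer_
  )
}

lengths <- 236:252
result <- data.frame(
  n_written = lengths,
  n_read    = vapply(lengths, read_back_nchar, integer(1))
)

# Rows where fewer (or more) characters came back than were written.
result[which(result$n_read != result$n_written), ]
```

On a build that does not show the problem, the last expression should print zero rows; an `NA` in `n_read` means the read failed outright.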

# Additional information

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

The same behaviour occurs under R 2.15.1 on a Linux server.

In case the attached file is somehow corrupted, here is a hexdump:

00000000: 0b0c 0e0f 1011 1213 1415 1617 1819 1a1b  ................
00000010: 1c1d 1e1f 2021 2223 2425 2627 2829 2a2b  .... !"#$%&'()*+
00000020: 2c2d 2e2f 3031 3233 3435 3637 3839 3a3b  ,-./0123456789:;
00000030: 3c3d 3e3f 4041 4243 4445 4647 4849 4a4b  <=>?@ABCDEFGHIJK
00000040: 4c4d 4e4f 5051 5253 5455 5657 5859 5a5b  LMNOPQRSTUVWXYZ[
00000050: 5c5d 5e5f 6061 6263 6465 6667 6869 6a6b  \]^_`abcdefghijk
00000060: 6c6d 6e6f 7071 7273 7475 7677 7879 7a7b  lmnopqrstuvwxyz{
00000070: 7c7d 7e7f ffad 9b9c 9da6 aeaa f8f1 fde6  |}~.............
00000080: faa7 afac aba8 8e8f 9280 90a5 999a e185  ................
00000090: a083 8486 9187 8a82 8889 8da1 8c8b a495  ................
000000a0: a293 94f6 97a3 9681 989f e2e9 e4e8 eae0  ................
000000b0: ebee e3e5 e7ed fc9e f9fb ecef f7f0 f3f2  ................
000000c0: a9f4 f5c4 b3da bfc0 d9c3 b4c2 c1c5 cdba  ................
000000d0: d5d6 c9b8 b7bb d4d3 c8be bdbc c6c7 ccb5  ................
000000e0: b6b9 d1d2 cbcf d0ca d8d7 cedf dcdb ddde  ................
000000f0: b0b1 b2fe 0a                             .....
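
To rebuild the file from a hexdump in this layout, something along the following lines may work. It is a sketch, not part of the original report: `xxd_to_raw` is a hypothetical helper, and it assumes each line has the form "offset: hex bytes", then two spaces, then the ASCII column. Only the last two lines of the dump are used here to keep the example short.

```r
# Hypothetical helper: turn xxd-style hexdump lines into a raw vector.
xxd_to_raw <- function(lines) {
  hex <- sub("^[0-9a-f]+: ", "", lines)   # drop the "offset: " prefix
  hex <- sub("  .*$", "", hex)            # drop the ASCII column (after two spaces)
  hex <- gsub(" ", "", paste(hex, collapse = ""))
  starts <- seq(1, nchar(hex), by = 2)
  as.raw(strtoi(substring(hex, starts, starts + 1L), base = 16L))
}

tail_lines <- c(
  "000000e0: b6b9 d1d2 cbcf d0ca d8d7 cedf dcdb ddde  ................",
  "000000f0: b0b1 b2fe 0a                             ....."
)
bytes_tail <- xxd_to_raw(tail_lines)
bytes_tail

# Applied to all lines of the dump, the result could be written out with
# writeBin(bytes, "437__characters.txt").
```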


Has anyone encountered something similar?


Kind regards,

Adam Obeng
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: readLines() truncates text file with Codepage 437 encoding

Martin Maechler
Appended is the file. You need to tell your e-mail software to use
one of the MIME types that R-devel accepts; I chose text/plain.

(Yes, as the R mailing list server "operator", with a bit of detective work,
 I was able to find the "uncleaned" e-mail and extract the
 attachment from it.)

Martin Maechler
ETH Zurich



 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¥ª«¬°±²µ·º»¼½¿ÄÅÆÇÉÑÖÜßàáâäåæçèéêëìíîïñòóôö÷ùúûüÿƒΓΘΣΦΩαδεπστφⁿ₧∙√∞∩≈≡≤≥⌐⌠⌡─│┌┐└┘├┤┬┴┼═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬▀▄█▌▐░▒▓■


>>>>> Adam Obeng <[hidden email]>
>>>>>     on Mon, 6 Jun 2016 11:11:21 +0100 writes:

    > [...]
Re: readLines() truncates text file with Codepage 437 encoding

Martin Maechler
I can reproduce the issue on Linux (Fedora F22) with
R 3.3.0 patched as of today.

Here is code for experimenting that reproduces the issue without
the attached file (the function below creates a temporary file
and removes it again):

##---------------------------------------------------------------------------

##' @title write-binary-readLines testing
##' @param i  vector of integers in 0:255 to be used as character codes
##' @param file.name optional
##' @param encoding "437" is the one where the problem has been reported
##' @return the readLines() resulting character string with attributes
##' @author Martin Maechler
wb.readL <- function(i, file.name = tempfile("bin"), encoding = "437") {
    stopifnot(is.integer(i), 0 <= i, i <= 255,
              is.character(file.name))
    ff <- file(file.name, "wb")
    writeBin(as.raw(i), ff)
    close(ff) ; on.exit(unlink(file.name))
    ## Now read "as codepage" :
    ch <- readLines(file(file.name, encoding = encoding))
    ##    ---------                 -------------------  typically gives warning
    structure(ch,
              fSize = file.size(file.name),
              nchars = c(b = nchar(ch, "b"),
                         c = nchar(ch, "c"),
                         w = nchar(ch, "w")))
}

ii <- c(11:12, 14:255, 10L)
(cc <- wb.readL(ii))

##---------------------------------------------------------------------------


gives

> (cc <- wb.readL(ii))
[1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ"
attr(,"fSize")
[1] 245
attr(,"nchars")
  b   c   w
427 241 241
Warning message:
In readLines(file(file.name, encoding = encoding)) :
  incomplete final line found on '/tmp/RtmpaPyDyp/bin65842896d5f1'
>
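
For comparison, the character count one would expect can be computed without going through a re-encoding connection, by converting the raw bytes with iconv() in one call. This is a sketch under the assumption that the local iconv knows the encoding name "CP437"; if it does not, `expected` is NA. The input has 245 bytes including the final LF, so 244 characters are expected, against the 241 reported above. The output above ends in "ⁿ" (byte 0xFC), so the characters for bytes 0xFD, 0xFE and 0xFF are the ones that went missing.

```r
ii <- c(11:12, 14:255, 10L)          # same byte values as in the example above
payload <- as.raw(ii[-length(ii)])   # drop the terminating LF

# Convert all bytes at once; no connection-level re-encoding is involved.
expected <- iconv(rawToChar(payload), from = "CP437", to = "UTF-8")

nchar(expected, type = "chars")

# The last three characters, i.e. those for bytes 0xFD, 0xFE and 0xFF.
substring(expected, nchar(expected) - 2L)
```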
