Eliminating 'Unprintable ASCII' characters

Steven Kang
Hi all,

I have a csv file containing words with *UNPRINTABLE ASCII* characters
(described in the following table).

Is there a viable method for eliminating these characters?

I realise that *EXTENDED ASCII* characters (e.g. ¡, ¢, £, ¤) can be
removed or replaced via the *"gsub"* or *"gregexpr"* functions, but I am not
certain how to handle the *UNPRINTABLE ASCII* characters.

Your help in resolving this problem would be highly appreciated.

Thanks,
Steven

ASCII control characters (character code 0-31)

The first 32 characters in the ASCII-table are unprintable control codes
and are used to control peripherals such as printers.

*DEC* *OCT* *HEX* *BIN*     *Symbol* *HTML Number* *HTML Name* *Description*
0     000   00    00000000  NUL      &#000;                    Null char
1     001   01    00000001  SOH      &#001;                    Start of Heading
2     002   02    00000010  STX      &#002;                    Start of Text
3     003   03    00000011  ETX      &#003;                    End of Text
4     004   04    00000100  EOT      &#004;                    End of Transmission
5     005   05    00000101  ENQ      &#005;                    Enquiry
6     006   06    00000110  ACK      &#006;                    Acknowledgment
7     007   07    00000111  BEL      &#007;                    Bell
8     010   08    00001000  BS       &#008;                    Back Space
9     011   09    00001001  HT       &#009;                    Horizontal Tab
10    012   0A    00001010  LF       &#010;                    Line Feed
11    013   0B    00001011  VT       &#011;                    Vertical Tab
12    014   0C    00001100  FF       &#012;                    Form Feed
13    015   0D    00001101  CR       &#013;                    Carriage Return
14    016   0E    00001110  SO       &#014;                    Shift Out / X-On
15    017   0F    00001111  SI       &#015;                    Shift In / X-Off
16    020   10    00010000  DLE      &#016;                    Data Line Escape
17    021   11    00010001  DC1      &#017;                    Device Control 1 (oft. XON)
18    022   12    00010010  DC2      &#018;                    Device Control 2
19    023   13    00010011  DC3      &#019;                    Device Control 3 (oft. XOFF)
20    024   14    00010100  DC4      &#020;                    Device Control 4
21    025   15    00010101  NAK      &#021;                    Negative Acknowledgement
22    026   16    00010110  SYN      &#022;                    Synchronous Idle
23    027   17    00010111  ETB      &#023;                    End of Transmit Block
24    030   18    00011000  CAN      &#024;                    Cancel
25    031   19    00011001  EM       &#025;                    End of Medium
26    032   1A    00011010  SUB      &#026;                    Substitute
27    033   1B    00011011  ESC      &#027;                    Escape
28    034   1C    00011100  FS       &#028;                    File Separator
29    035   1D    00011101  GS       &#029;                    Group Separator
30    036   1E    00011110  RS       &#030;                    Record Separator
31    037   1F    00011111  US       &#031;                    Unit Separator

        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Eliminating 'Unprintable ASCII' characters

Prof Brian Ripley
I think you mean the control characters: there are other unprintable
characters (del, for example).  The control characters are the character
range [\001-\037], written here as octal escapes.  E.g.

> test <- intToUtf8(1:40, multiple=TRUE)
> grepl("[\001-\037]", test)
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE

If you want to include del, use "[\001-\037\177]".  I have omitted nul
(\000) which cannot occur in R character strings.
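
The same character ranges can be passed to gsub() to delete the matches rather than just detect them. What follows is only a minimal sketch, not code from the original reply: the sample strings, the temporary file and its column names are made up for illustration, and it assumes an ordinary (non-stateful) locale.

```r
# Hypothetical sample values with BEL (\007), ESC (\033) and DEL (\177) embedded
x <- c("wo\007rd", "te\033xt\177", "clean")

# Strip control characters 1-31 only; DEL survives
no_ctrl <- gsub("[\001-\037]", "", x)

# Strip control characters 1-31 and DEL
no_ctrl_del <- gsub("[\001-\037\177]", "", x)

# Same idea applied to a csv file: clean the raw lines, then parse them.
# A temporary file stands in for the real csv here (assumed layout).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,word", "1,wo\007rd", "2,te\033xt"), tmp)
lines <- readLines(tmp)
cleaned <- gsub("[\001-\037\177]", "", lines)
dat <- read.csv(text = cleaned, stringsAsFactors = FALSE)
```

Note that the range includes horizontal tab (\011), so for a tab-delimited file the tab would have to be left out of the character class, or the field separators would be deleted as well.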

You didn't give us the sessionInfo() output the posting guide asked
you for, so I am presuming you are not doing this in an unusual
locale: I wouldn't trust the regexp code in one of the stateful
locales used for Japanese.
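
For anyone unsure which locale a session is running in, the following sketch shows one way to check before relying on the regexp above. It uses only base R functions; the values printed depend on the platform and are not shown here.

```r
# Character-type locale of the current session (value is platform-dependent)
ctype <- Sys.getlocale("LC_CTYPE")

# Flags describing the locale, including whether it is multibyte (MBCS)
info <- l10n_info()

print(ctype)
print(info$MBCS)

# Full session report, as requested by the posting guide
print(sessionInfo())
```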

On Wed, 25 Nov 2009, Steven Kang wrote:

> Hi all,
>
> I have a csv file containing words with *UNPRINTABLE ASCII* characters
> (described in the following table).
>
> Is there a viable method for eliminating these characters?
>
> I realise that *EXTENDED ASCII* characters (e.g. ¡, ¢, £, ¤) can be
> removed or replaced via the *"gsub"* or *"gregexpr"* functions, but I am not
> certain how to handle the *UNPRINTABLE ASCII* characters.
>
> Your help in resolving this problem would be highly appreciated.
>
> Thanks
>
>
>
>
> Steven

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
