EOF within quoted string


EOF within quoted string

Mohan.Radhakrishnan
Hi,

Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading it using

data <- read.csv("20_newsgroups.csv",header=TRUE)

throws this.

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

For example, the text field of the first record in the file is shown below. That column contains only text like this. Is there a way to read it?

From: [hidden email] () Subject: Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a Organization: University of California, Berkeley Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu   [hidden email] writes:  morgan and guzman will have era's 1 run higher than last year, and  the cubs will be idiots and not pitch harkey as much as hibbard.  castillo won't be good (i think he's a stud pitcher)         This season so far, Morgan and Guzman helped to lead the Cubs        at top in ERA, even better than THE rotation at Atlanta.        Cubs ERA at 0.056 while Braves at 0.059. We know it is early        in the season, we Cubs fans have learned how to enjoy the        short triumph while it is still there.

Thanks,
Mohan
This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: EOF within quoted string

Adams, Jean
You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean

On Thu, Aug 10, 2017 at 7:59 AM, <[hidden email]> wrote:

> Hi,
>
> Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading
> it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   EOF within quoted string
>

Re: EOF within quoted string

Mohan.Radhakrishnan
Yes. I tried that already. Not straightforward.

data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F, quote="", sep=",", encoding="UTF-8")

This line does read the file, but haphazardly: the emails in the text column are split across multiple columns, and several columns contain nothing but NA. There are 202 columns in total.

I then removed the columns that were entirely NA, concatenated the text pieces back together, and finally got it.

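# Keep only the columns that are not entirely NA.
munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
# Drop the first row, which holds the file's header line (the file was read with header=F).
munged <- munged[-1,]
# Paste the text fragments (column 3 onwards) back into one string per row.
munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 , collapse = " ")

munged <- munged[,c("V1","V2","text")]

print(head(munged$text))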

Mohan
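
For readers following along, here is a minimal sketch of why quote="" spreads the text over many columns. It uses a small synthetic sample, not the real 20_newsgroups.csv: the addresses and subjects are made up, and only the three-column layout (an unnamed index, target, text) is taken from the thread. With quoting disabled every comma inside the text is treated as a separator; with the double quote recognised each text stays in one field.

```r
# Toy stand-in for the real file (synthetic data, assumed layout: index, target, text).
csv_lines <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: Cubs, Marlins, and ERA\"",
  "1,4,\"From: b@example.org Subject: Free Ethernet\""
)

# Quoting disabled: the commas inside the first text become field separators,
# so the widest line decides the number of columns.
no_quote <- read.csv(text = csv_lines, header = FALSE, quote = "",
                     fill = TRUE, as.is = TRUE)
print(ncol(no_quote))    # 5 columns for this toy input

# Double quote recognised: each text is kept as a single field.
with_quote <- read.csv(text = csv_lines, header = TRUE, quote = "\"",
                       as.is = TRUE)
print(ncol(with_quote))  # 3 columns
print(with_quote$text[1])
```

On the real file the same effect would explain the large number of mostly empty columns, although that is an inference from the toy case rather than something checked against the data.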

From: Adams, Jean [mailto:[hidden email]]
Sent: Thursday, August 10, 2017 8:03 PM
To: Radhakrishnan, Mohan (Cognizant) <[hidden email]>
Cc: R help <[hidden email]>
Subject: Re: [R] EOF within quoted string

You might want to try some of the suggestions mentioned in this post: https://stackoverflow.com/q/17414776/2140956

Jean


Re: EOF within quoted string

R help mailing list-2
I tried reading it with read.table, as below, and didn't see any obvious
problems.

> txt <- readLines("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv")
> str(txt)
 chr [1:11315] ",target,text" ...
> writeLines(substring(txt[1:5],1,40))
,target,text
0,9,"From: [hidden email] (
1,4,"From: [hidden email] (Gre
2,11,"From: [hidden email]
3,4,"From:  () Subject: Re: Quadra SCSI
> z <- read.table(text=txt, sep=",", quote="\"", header=TRUE, stringsAsFactors=FALSE)
> str(z)
'data.frame':   11314 obs. of  3 variables:
 $ X     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ target: int  9 4 11 4 0 4 5 5 13 12 ...
 $ text  : chr  "From: [hidden email] (      ) Subject: Re: Cubs behind Marlins? How? Artic"| __truncated__ "From: [hidden email] (Gregory Nelson) Subject: Thanks Apple: Free Ethernet on my C610! Article-I.D.: "| __truncated__ "From: [hidden email] Subject: Cryptography FAQ 10/10 - References Organization: The Crypt Cabal L"| __truncated__ "From:  () Subject: Re: Quadra SCSI Problems??? Organization: Apple Computer Inc. Lines: 28  > ATTENTION: Mac Qu"| __truncated__ ...
> summary(z)
       X             target           text
 Min.   :    0   Min.   : 0.000   Length:11314
 1st Qu.: 2828   1st Qu.: 5.000   Class :character
 Median : 5656   Median : 9.000   Mode  :character
 Mean   : 5656   Mean   : 9.293
 3rd Qu.: 8485   3rd Qu.:14.000
 Max.   :11313   Max.   :19.000
> sum(is.na(z$text))
[1] 0
> table(substring(z$text,1,5))

 cs.u  egsn  howl  sgib  uune  wisc  wupo  zaph Distr From: Nntp- Organ Orgin
    4     1     2     1     1     1     2     4    14 10795    13   125     1
Reply Subje To: g X-Mai
    3   343     1     3




Bill Dunlap
TIBCO Software
wdunlap tibco.com
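
The checks in the transcript above can be tried on a small scale with the sketch below. It is only an illustration under stated assumptions: the lines in txt are synthetic stand-ins for the result of readLines(), and the count.fields() step is an extra diagnostic that is not part of the original session. Passing quote="\"" matters because read.table() by default also treats the apostrophe as a quote character, which text such as "it's" could upset.

```r
# Synthetic lines standing in for readLines() on the real file (made-up content).
txt <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: Re: Cubs, Marlins\"",
  "1,4,\"From: b@example.org Subject: it's a test\"",
  "2,11,\"Organization: Example Org, Inc.\""
)

# Count fields per line, treating only the double quote as a quote character.
# A well-formed file gives the same count on every line.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",", quote = "\"")
close(con)
print(table(n_fields))

# Parse the same lines the way the transcript does.
z <- read.table(text = txt, sep = ",", quote = "\"", header = TRUE,
                stringsAsFactors = FALSE)

# Sanity checks in the spirit of the transcript.
print(nrow(z) == length(txt) - 1)      # one row per data line
print(sum(is.na(z$text)))              # number of missing text fields
print(table(substring(z$text, 1, 5)))  # how each text begins
```

If count.fields() reports lines with differing counts, or NA for some lines, those lines are the ones worth inspecting for stray or unbalanced quotes.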

On Fri, Aug 11, 2017 at 1:58 AM, <[hidden email]> wrote:

> Yes. I tried that already. Not straightforward.
>
> data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F,
> quote="", sep=",", encoding="UTF-8")
>
> This line does read it haphazardly. The emails in the column are split
> into multiple columns and there are several columns with just ‘NA’. Totally
> 202 columns.
>
> And then I removed columns with NA’s and concatenated all the text and
> finally got it.
>
> munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
> munged <- munged[-1,]
> munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 ,
> collapse = " ")
>
> munged <- munged[,c("V1","V2","text")]
>
> print(head(munged$text))
>
> Mohan
>