How long to wait for process?


john polo-2
UseRs,

I have a dataframe with 2547 rows and several hundred columns in R
3.1.3. I am trying to run a small logistic regression with a subset of
the data.

know_fin ~ comp_grp2+age+gender+education+employment+income+ideol+home_lot+home+county

     > str(knowf3)
     'data.frame':   2033 obs. of  18 variables:
     $ userid    : Factor w/ 2542 levels "FNCNM1639","FNCNM1642",..: 1857 157 965 1967 164 315 849 1017 699 189 ...
     $ round_id  : Factor w/ 1 level "Round 11": 1 1 1 1 1 1 1 1 1 1 ...
     $ age       : int  67 66 44 27 32 67 36 76 70 66 ...
     $ county    : Factor w/ 80 levels "Adair","Alfalfa",..: 75 75 75 75 75 75 64 64 64 64 ...
     $ gender    : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 2 ...
     $ education : Factor w/ 8 levels "1","2","3","4",..: 6 7 6 8 2 4 2 4 2 6 ...
     $ employment: Factor w/ 9 levels "1","2","3","4",..: 8 4 4 4 3 8 5 8 4 4 ...
     $ income    : num  550000 80000 90000 19000 42000 30000 18000 50000 800000 10000 ...
     $ home      : num  0 0 0 0 0 0 0 0 0 0 ...
     $ ideol     : Factor w/ 7 levels "1","2","3","4",..: 2 7 4 3 2 4 2 3 2 6 ...
     $ home_lot  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 3 3 1 2 ...
     $ hispanic  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
     $ comp_grp2 : Factor w/ 16 levels "Cr_Gr","Cr_Ot",..: 13 13 13 13 13 13 10 10 10 10 ...
     $ know_fin  : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...


With the regular glm() function, I get a warning about "perfect or
quasi-perfect separation"[1]. I looked for a method to deal with this
and a penalized GLM is an accepted method[2]. This is implemented in
logistf(). I used the default settings for the function.

Just before I run the model, memory.size() for my session is ~4500 (MB).
memory.limit() is ~25500. When I start the model, R immediately becomes
non-responsive. This is in a Windows environment and in Task Manager,
the instance of R is, and has been, using ~13% of CPU and ~4997 MB of
RAM. It's been ~24 hours now in that state and I don't have any idea of
how long this should take. If I run the same model in the same setting
with the base glm(), the model runs in about 60 seconds. Is there a way
to know if the process is going to produce something useful after all
this time or if it's hanging on some kind of problem?
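
One hedged way to get a feel for how long a fit should take is to time it on growing random subsets of the data, with a cap on each run, before committing to the full data set. The sketch below is only an illustration: it uses simulated data and base glm() as a stand-in for the slow fit, and the names `sim`, `time_fit`, `sizes`, and `timings` are hypothetical. Note that setTimeLimit() is only checked where R checks for interrupts, so long-running compiled code may not stop promptly.

```r
# Simulated stand-in for the real data (all names here are hypothetical).
set.seed(1)
n <- 2000
sim <- data.frame(
  y = rbinom(n, 1, 0.5),
  x = rnorm(n),
  g = factor(sample(letters[1:10], n, replace = TRUE))
)

# Time one fit on a random subset of `size` rows; give up after `limit` seconds.
time_fit <- function(size, limit = 60) {
  rows <- sample(nrow(sim), size)
  setTimeLimit(elapsed = limit, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf), add = TRUE)
  tryCatch(
    # Swap in the slow model call here; glm() is only a placeholder.
    system.time(glm(y ~ x + g, family = binomial, data = sim[rows, ]))[["elapsed"]],
    error = function(e) NA_real_  # NA marks a run that failed or hit the time limit
  )
}

sizes <- c(250, 500, 1000, 2000)
timings <- data.frame(rows = sizes, seconds = sapply(sizes, time_fit))
print(timings)
```

If the timings grow much faster than the number of rows, or a small subset already hits the limit, that suggests the full run is unlikely to finish in a reasonable time.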


   [1]: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression#68917
   [2]: https://academic.oup.com/biomet/article-abstract/80/1/27/228364/Bias-reduction-of-maximum-likelihood-estimates


--
Men occasionally stumble
over the truth, but most of them
pick themselves up and hurry off
as if nothing had happened.
-- Winston Churchill

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: How long to wait for process?

Bert Gunter-2
Dunno. You might wish to email the maintainer (see ?maintainer), who
may not monitor this list, if you do not get a satisfactory reply
here.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Wed, Jul 26, 2017 at 7:14 AM, john polo <[hidden email]> wrote:

> [...] Is there a way to know if the process is going to produce something
> useful after all this time or if it's hanging on some kind of problem?


Re: How long to wait for process?

Michael Friendly
In reply to this post by john polo-2
Rather than go to a penalized GLM, you might be better off investigating
the sources of quasi-perfect separation and simplifying the model to
avoid or reduce it.  In your data set you have several factors with a
large number of levels, making the data sparse for all their combinations.
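
As a rough illustration of that sparsity, the number of coefficients implied by the formula in the original post can be counted from the level counts in the str() output. This is only a sketch: it assumes R's default treatment contrasts and no empty levels being dropped, and `factor_levels`, `numeric_terms`, and `n_coef` are hypothetical names.

```r
# Number of levels per factor, copied from the str() output in the original post.
factor_levels <- c(comp_grp2 = 16, gender = 2, education = 8, employment = 9,
                   ideol = 7, home_lot = 3, county = 80)
# Numeric predictors in the formula contribute one coefficient each.
numeric_terms <- c("age", "income", "home")

# Under default treatment contrasts a k-level factor adds k - 1 coefficients;
# the leading 1 is the intercept.
n_coef <- 1 + length(numeric_terms) + sum(factor_levels - 1)
n_coef
```

That works out to 122 coefficients for 2033 observations, with county alone accounting for 79 of them.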

Like multicollinearity, near-perfect separation is a data problem, and is
often better solved by careful thought about the model, rather than
wrapping the data in a computationally intensive band-aid.
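
A minimal sketch of how one might start looking for the sources of separation: cross-tabulate each factor predictor against the outcome and list the levels in which one outcome never occurs. The data here are simulated, and `d`, `y`, `g`, `h`, and `separating_levels` are hypothetical names. This only detects separation by a single factor; combinations of predictors can also separate the outcome.

```r
# Simulated stand-in data; `d`, `y`, `g`, and `h` are hypothetical names.
set.seed(42)
d <- data.frame(
  y = factor(rbinom(300, 1, 0.5)),
  g = factor(sample(c("a", "b", "c"), 300, replace = TRUE)),
  h = factor(sample(c("u", "v"), 300, replace = TRUE))
)
# Force level "c" of g to have only y == 1, so it separates the outcome.
d$y[d$g == "c"] <- "1"

# For each factor predictor, list the observed levels in which
# at least one outcome category has a zero count.
separating_levels <- function(data, response) {
  predictors <- setdiff(names(data)[sapply(data, is.factor)], response)
  out <- lapply(predictors, function(p) {
    tab <- table(data[[p]], data[[response]])
    rownames(tab)[rowSums(tab) > 0 & apply(tab == 0, 1, any)]
  })
  setNames(out, predictors)
}

separating_levels(d, "y")
```

In this simulated data, level "c" of `g` is flagged by construction. Levels flagged this way are candidates for merging with neighbouring levels or for dropping the factor altogether.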

-Michael

On 7/26/2017 10:14 AM, john polo wrote:

> [...]
> With the regular glm() function, I get a warning about "perfect or
> quasi-perfect separation"[1]. I looked for a method to deal with this
> and a penalized GLM is an accepted method[2]. This is implemented in
> logistf(). I used the default settings for the function.
> [...]


Re: How long to wait for process?

john polo-2
Michael,

Thank you for the suggestion. I will take your advice and look more
critically at the covariates.

John

On 7/27/2017 8:08 AM, Michael Friendly wrote:

> Rather than go to a penalized GLM, you might be better off
> investigating the sources of quasi-perfect separation and simplifying
> the model to avoid or reduce it. [...]



Re: How long to wait for process?

Marc Schwartz-3
Hi,

Late to the thread here, but I noted that your dependent variable 'know_fin' has 3 levels in the str() output below.

Since you did not provide a full copy and paste of your glm() call, we can only presume that you did specify 'family = binomial' in the call.

Is the dataset 'knowf3' the result of a subsetting operation, such that there are only two of the three levels of 'know_fin' retained in the records used in the glm() call, or are there actually 3 levels in the dataset used in the glm() call?

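If the latter, that will of course be problematic: for a factor response, 'family = binomial' treats the first level as failure and all other levels as success, and from a quick check here, glm(..., family = binomial) does not issue a warning or error in the case where the dependent variable has >2 levels.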
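
A small sketch of that check, using simulated data (the names `toy`, `y3`, `fit3`, `fit2`, and `warned` are hypothetical). It fits a binomial glm() to a 3-level factor response, records any warnings, and compares the result with an explicit "first level versus the rest" recoding.

```r
set.seed(123)
# Hypothetical toy data: a 3-level factor response and one numeric predictor.
toy <- data.frame(
  y3 = factor(sample(c("0", "1", "2"), 200, replace = TRUE)),
  x  = rnorm(200)
)

# Binomial fit on the 3-level factor; record any warnings that are raised.
warned <- character(0)
fit3 <- withCallingHandlers(
  glm(y3 ~ x, family = binomial, data = toy),
  warning = function(w) {
    warned <<- c(warned, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Explicit recoding: first level ("0") versus everything else.
fit2 <- glm(I(y3 != "0") ~ x, family = binomial, data = toy)

length(warned)                                     # warnings from the 3-level fit
all.equal(unname(coef(fit3)), unname(coef(fit2)))  # do the two fits agree?
```

If the two sets of coefficients agree and no warning was recorded, the 3-level response was silently collapsed to two categories.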

Regards,

Marc Schwartz


> On Jul 27, 2017, at 8:26 AM, john polo <[hidden email]> wrote:
>
> Michael,
>
> Thank you for the suggestion. I will take your advice and look more critically at the covariates.
>
> John
>
>> [...]
>> On 7/26/2017 10:14 AM, john polo wrote:
>>> [...]
>>>     > str(knowf3)
>>>     'data.frame':   2033 obs. of  18 variables:
>>>     [...]
>>>     $ know_fin : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
>>> [...]


Re: How long to wait for process?

john polo-2
Marc,

Sorry for the lack of info on my part. Yes, I did use 'family =
binomial' and I did drop the 3rd level before running the model. I think
the str(<subset>) that I wrote into my original email might not have
been my final step before using glm. Thank you for reminding me of the
potential problem.

I think Michael Friendly's idea is probably the solution I need to
consider. I am simplifying my factors a little bit and revising which I
will keep.


best,
John

On 7/27/2017 8:54 AM, Marc Schwartz wrote:

> [...]
> Is the dataset 'knowf3' the result of a subsetting operation, such that there are only two of the three levels of 'know_fin' retained in the records used in the glm() call, or are there actually 3 levels in the dataset used in the glm() call?
> [...]



Re: How long to wait for process?

Marc Schwartz-3
Hi John,

Not a problem, just wanted to be sure that there was not additional confounding due to these issues.

You may be aware that a subsetting operation to remove records in a data frame does not by default remove the unwanted levels from the factor that was filtered:

iris.new <- subset(iris, Species == "setosa")

> str(iris.new)
'data.frame': 50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(iris.new$Species)
[1] "setosa"     "versicolor" "virginica"

> table(iris.new$Species)

    setosa versicolor  virginica
        50          0          0


You can see that Species retains all 3 original levels, even though only one is actually present in the records in the new data frame.
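
If the unused levels are not wanted, one common remedy is droplevels(), which removes factor levels that no longer occur. A minimal sketch continuing the iris example (`iris.clean` is a hypothetical name):

```r
iris.new <- subset(iris, Species == "setosa")

# droplevels() removes factor levels that no longer occur in the data.
iris.clean <- droplevels(iris.new)

levels(iris.new$Species)    # still lists all three original levels
levels(iris.clean$Species)  # only the level actually present: "setosa"
```

Calling str() on the cleaned data frame then reports only the levels that are really present, which avoids the ambiguity about what was passed to glm().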

Thus, your str() output below may very well have been produced after 'know_fin' was filtered to 2 levels.

Regards,

Marc


> On Jul 27, 2017, at 9:14 AM, john polo <[hidden email]> wrote:
>
> Marc,
>
> Sorry for the lack of info on my part. Yes, I did use 'family = binomial' and I did drop the 3rd level before running the model. I think the str(<subset>) that I wrote into my original email might not have been my final step before using glm. [...]
>
>>>>> [...]
>>>>>     > str(knowf3)
>>>>>     'data.frame':   2033 obs. of  18 variables:
>>>>>     [...]
>>>>>     $ know_fin : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
>>>>> [...]
