Re: Help translating Python statsmodel.api.OLS.sum... - Qlik Community

evan_kurowski · 2020-03-11

Hello Qlik Community,

First off, let me point out that Data Science is just obnoxious. I mean.. geez, there are so many calculations my head is swimming, and I'm just not getting what all these fancy math hieroglyphics are gonna be useful for.. this better have a big payoff.

(hope you guys don't mind the candor, maybe one day I'll look back on this and be like "Oh man, can't believe I wasn't infatuated with data science sooner", it's so awesome! But for now I'm not seeing the point, and meanwhile data science is stealing a lot of thunder when we coulda been crankin out plenty of the plain ole "regular" analytics)

BUT..

I'm still dutifully trying to wrap my head around this stuff (even though on a 1-to-1 basis I would take killer data mungers on my team over polymaths. I control zero staffers, so ... this point is moot).

Maybe they're making actual rockets with this stuff at other companies? (But where I'm at we still often don't even arrive at consensus on Count(*), so this seems like a real overreach in terms of technical ambition)

ANYWAY.. So my problem is this...

My data science course has taught us how to absorb a pair of X,Y values and build a linear regression.

It uses the statsmodels.api package in Python to achieve this:

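import statsmodels.api as sm

# data_Qlik is assumed to be a pandas DataFrame with Observation and Comparison columns
y_qlik = data_Qlik['Comparison']     # dependent variable: y = b0 + b1*x1
x_qlik1 = data_Qlik['Observation']   # independent variable x1
x_qlik = sm.add_constant(x_qlik1)    # adds the constant column for the intercept b0
results = sm.OLS(y_qlik, x_qlik).fit()
print(results.summary())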

To do an A-B comparison between the technologies, I ran the sample data set from the Qlik help manual through both Python & Qlik. That data set is found here: Qlik ttest sample data set

[Table1]:
crosstable LOAD recno() as ID, * inline [
Observation|Comparison
35|2
40|27
12|38
15|31
21|1
14|19
46|1
10|34
28|3
48|1
16|2
30|3
32|2
48|1
31|2
22|1
12|3
39|29
19|37
25|2 ] (delimiter is '|');

The Python statsmodels.api OLS summary() puts out a neat table full of impressive math-babble. I'm trying to follow along with the narration of the data science course and recreate the functions in Qlik.

I can connect 8 pieces of calculation between the Python approach & Qlik. My problem is I've gotten stuck on the 't' & 'P>|t|' section.

The Python course describes these numbers as 't-statistic' and 'p value', with a hypothesis value of 0.

Can anyone help me understand what Python uses to produce this section of numbers? I've tried several "t something or other" formulas and nothing seems to nail it.

**sidenote to administrators of Qlik Community**

Yet again the post editing process was HORRENDOUS. And it has been this way ever since we migrated the community site.

I imagine from Qlik's perspective, when testing the community portal everything appears to be working fine. (otherwise there would be no release of buggy code, right?)

But from the end-user perspective, the formatting, the cut & paste, the rendering is just absolutely terrible, and it takes many incomplete attempts and retries to format an article. No exaggeration when I say attempts to cut & paste paragraphs into the 'Body' pane worked maybe 3 out of every 10 attempts. I had to chop the article up a sentence at a time, and each time I pasted a sentence, the font formatting made it the wrong size. Like this portal is trying to exhaust my patience.

If this is the same experience the full population of end-users is experiencing, you should really look into this, it's not a good look and I'm sure it would discourage visits.

I'm open to the idea there's some kind of "bug-in-the-middle" attack specific to browser functionality at my company (or just specific to me, and if so does anyone know any good lawyers?), but would really suggest you ensure the quality of experience for the overall community isn't just a Qlik-facing facade giving out false-positives.

**p.p.s to Community administrators**

I was at work and hit the 'Post' button trying to submit this post, and a red error appeared that said "Invalid HTML characters". I tried to make the correction and figure out what characters were invalid, but the Body pane was empty and frozen. Hitting the PREVIEW button flashed the HTML for just a split second, and no matter what I tried, I could not recover my post.

Since these "aborted" postings happen often, I now prepare postings in an external editor before attempting the risky transfer into the community portal. I emailed the body of the post to my personal email and completed this from a browser at home. Editing features seemed to work better at home and also the auto-post recovery had preserved my earlier post, but none of this was functional when I attempted it from work.

I had my own website once, and when I worked on development from home, the website previews worked fine. Weirdly, when I would go out in public and ask people to visit my website, from their phones or computers at other places, almost every time an attempt to load the "public" website would fail. Maybe there's something like that going on here? If a website fails to load in the woods, does it make a sound?

You can see I started the post cheerful, but after such a struggle, humor is wearing thin. It took me a decent amount of time to formulate, and I hope the participation is appreciated as much as I appreciate the solutions. This feedback is meant to help, as I am definitely a Qlik proponent. Thank you community.

evan_kurowski · 2020-03-13

Answered a few of these items with further research on Python & statistics (though it did take wading through several articles just to get the answer, which turned out to be really simple, but you'd be surprised how many discussions on the topic did not explain it in a clear way).

The t-stats are the coefficient / standard error. I don't know how I didn't see this in original looks, but you just divide the first column in the summary() table ('coef') by the second column ('std err'), and voilà.. "t-stat".

In this case for the Qlik syntax that meant:
Linest_M / Linest_SEM for the slope (the x field, 'Observation'), and Linest_B / Linest_SEB for the intercept (the 'const' row). Keep the '_M' & '_SEM' paired together and the '_B' & '_SEB' together.
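To spell out the coefficient / standard error arithmetic, here is a minimal plain-Python sketch using the textbook simple-regression formulas on the same 20 pairs. The variable names are mine, and the comments mapping them to the Linest_ functions reflect my reading of what those functions return, so treat that mapping as an assumption:

```python
import math

# Same 20 Observation/Comparison pairs as the Qlik inline table
observation = [35, 40, 12, 15, 21, 14, 46, 10, 28, 48,
               16, 30, 32, 48, 31, 22, 12, 39, 19, 25]
comparison = [2, 27, 38, 31, 1, 19, 1, 34, 3, 1,
              2, 3, 2, 1, 2, 1, 3, 29, 37, 2]

n = len(observation)
mean_x = sum(observation) / n
mean_y = sum(comparison) / n

# Sum of squares of x and cross-products of x and y around their means
sxx = sum((x - mean_x) ** 2 for x in observation)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(observation, comparison))

slope = sxy / sxx                     # role of Linest_M
intercept = mean_y - slope * mean_x   # role of Linest_B

df = n - 2                            # role of Linest_DF (two fitted parameters)
sse = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(observation, comparison))
resid_var = sse / df                  # residual variance

se_slope = math.sqrt(resid_var / sxx)                              # role of Linest_SEM
se_intercept = math.sqrt(resid_var * (1 / n + mean_x ** 2 / sxx))  # role of Linest_SEB

# t-stat = coefficient / standard error, one per row of the summary table
t_slope = slope / se_slope
t_intercept = intercept / se_intercept
print(t_slope, t_intercept)
```

Note the slope comes out negative for this data set, so its t-stat is negative as well.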

Then to get the 'P > |t|' p-value, you plug in the t-stat as the first argument of the TDIST() function.

The syntax is TDIST(score, degrees of freedom, # of tails), so here: TDIST( Linest_M/Linest_SEM, Linest_DF, 2) (*note: you can do the same for the _B functions, but I'm not sure the results follow the same ".05 or below represents significance" rule)
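For comparison on the Python side, a two-tailed p-value can be sketched with scipy's Student's t distribution. This is not necessarily how statsmodels computes it internally; it is just the standard two-tailed definition, and the function name two_tailed_p is made up for illustration:

```python
from scipy import stats

def two_tailed_p(t_stat, df):
    """Two-tailed p-value for a t-statistic with df degrees of freedom."""
    # sf() is the upper-tail probability; abs() makes the sign of t irrelevant
    return 2 * stats.t.sf(abs(t_stat), df)

# Illustrative inputs only: a negative t-stat with 18 degrees of freedom
print(two_tailed_p(-1.83, 18))
```

The abs() matters when the coefficient is negative. On the Qlik side, my understanding is that TDIST expects a non-negative value, so a negative t-stat may need fabs() around it; that is an assumption worth checking against the Qlik help.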

Continuing work on getting those probability ranges to line up. I see Qlik producing ranges with the TTest upper & lower functions, but they're not quite lining up with the Python yet.
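Assuming the "probability ranges" are the [0.025 0.975] columns of the summary table, those are confidence intervals for each coefficient, built as coefficient ± critical t × standard error. A small sketch with scipy follows; the function name and the example numbers are made up for illustration and are not results from the sample data:

```python
from scipy import stats

def coef_conf_int(coef, std_err, df, alpha=0.05):
    """Confidence interval for one regression coefficient."""
    # Critical t-value that leaves alpha/2 in each tail
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    margin = t_crit * std_err
    return coef - margin, coef + margin

# Hypothetical coefficient and standard error, 18 degrees of freedom
lower, upper = coef_conf_int(coef=2.0, std_err=0.5, df=18)
print(lower, upper)
```

If I read the Qlik help right, the TTest upper & lower functions describe an interval for the difference between two sample means, which would explain why they don't match the regression output. Something like Linest_M ± TINV(0.05, Linest_DF) * Linest_SEM might be the closer equivalent, but I have not verified that.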

I thought it was cool, though, that I could select multiple data samples and pull up their regression profiles pretty easily.

The next challenge is building this out as a multi-factor regression (which I hope is possible to do in Qlik). Thanks for your assistance, all.

Help translating Python statsmodel.api.OLS.summary() into Qlik