Replace COUNT(DISTINCT) with a SUM - and further..... - Qlik Community

datanibbler · ‎2013-12-17

Hi,

for reasons of performance (yes, I know that some say that this is not so) I want to replace a COUNT(DISTINCT ) in a diagram which every one of my apps has (employee presence) with a SUM() expression
=> This also has another aspect: I would then not need to have the Personell_ID in the final table (which is the only one I keep, all temp_tables are deleted in the process) => data_protection

This works fine in principle:

=> In the script, my table is sorted by

- Pers_ID (primary)

- Date (secondary)

=> I create a binary field in the script, comparing the Pers_ID to the one in the line before - when it changes, I put 1, otherwise 0.

=> That field can be summed up and I get the exact same nr. as before - every Pers_ID is counted only once.

<=> The problem is in the dimension "month" (the default one where it works is "day")

=> There I currently have an aggr() function

<=> That gives me way too large values.

=> Can someone give me a hint as to what I have to do to get this sum_field right in the "month" dimension?

Thanks a lot!

Best regards,

DataNibbler

swuehl · ‎2013-12-17

Please read HIC's recent blog post:

http://community.qlik.com/blogs/qlikviewdesignblog/2013/10/22/a-myth-about-countdistinct

You don't need to use a Personaell_ID, just any appropriate key, so data protection should not be an issue.

I don't really get what you are doing in your script, are you trying to create a secondary ID counter in a fact table?

I think that is probably not the way to go.

It's always lot easier to understand if you try to describe your setting not only with words, but tables, scripts, applications.

datanibbler · ‎2013-12-17

Hi swuehl,

no, I'm not creating a secondary counter - the keyfield of the table remains the Pers_ID.

I'm merely adding more info to the table, so I add

- a 1 on days where a specific employee was reported as sick

- a 1 on a day when a specific employee was on holiday

- a 1 on a day when a specific employee was off_duty.

Best regards,

DataNibbler

datanibbler · ‎2013-12-17

Hi,

I have another, albeit related, issue now:

=> In another chart, I have the info of "emp being off_duty" with a value of 1 in the record for the resp. day.

=> I can sum this up problem-free on the "day" dimension and subtract it from my total to get the nr. of emps who were supposed to be there on a given day

<=> When I put this inside an aggr() fct., so it looks something like this

>>> aggr(sum(off_duty), Monat) <<<

=> I get skyhigh values, more than double the total of employees assigned to an area...

Can somebody tell me what I have to do to get this info right on the dimension "month"?

Thanks a lot!

Best regards,

DataNibbler

swuehl · ‎2013-12-17

If you have the information for off_duty on a per day basis, what do you expect to see then if you aggregate on a per month basis? A count of employess that are at least one day off-duty on the given month? A count of days off-duty across all employees for that month?

To get something right, you need to define what is right and what is wrong.

Report Inappropriate Content · ‎2013-12-17

In regards to your question about adding 1 based on a certain criteria, it sounds like you will want separate fields for each (sick, holiday, off_duty).

I'm not sure what field tells you that a person was sick/holiday/off_duty, but in your load statement I would add:

If( [SICK_FIELD] = [condition to make true], 1, 0) as SICK_DAY_FLAG

Then repeat this for holiday and off_duty. You can then use Sum(SICK_DAY_FLAG) to get the total sick days for each person.

datanibbler · ‎2013-12-18

Hi swuehl, Matthew,

quite easy:

- I have a table with one record per emp per day

- There is a field with the value 1 on a row (for an employee and a day) when that emp was out sick.

- By summing up that field (on a "day" dimension), I get the nr. of employees out sick on that one day.
=> @ Matthew - I don't want the nr. of days a specific person was out sick, but the other way round, I want
the nr. of persons out sick on a given day - that one works fine already.

=> As in most diagrams, when the user switches to the dimension "month", I would like to see the average of how many employees were out sick.

Can you help me there? I have that in some other scenarios, too, where I have a 0/1 field that I simply sum up.

=> There is a way: By NOT summing up those fields, but instead counting (with a DISTINCT) the Pers_IDs where that field has the value 1.

I think I'll go that way to get this done.

Best regards,

DataNibbler

Report Inappropriate Content · ‎2013-12-18

I think I understand you better now. The first thing that comes to mind is Set Analysis.

Count(DISTINCT {<SICK={1}>} Pers_ID)

This will count the number of unique IDs where your flag field is equal to 1. You can change SICK to your specific field and 1 would be the value within that field.

If your chart has a month dimension, then you will be given the unique IDs that were out sick. If you're want an average per week for a month, then I believe you will need the aggr() function as well. Try something like this:

Avg( aggr( Count(DISTINCT {<SICK={1}>} Pers_ID), MONTH, WEEK) )

This basically first counts the number of unique IDs for each WEEK in a MONTH than averages them together for the month - assuming your chart dimension is MONTH.

datanibbler · ‎2013-12-18

Hi Matthew,

COUNT(DISTINCT) is a good one - that's just what I'm doing currently 😉

However, there is another aspect beside performance and data protection:

=> This COUNT(DISTINCT) is used in quite a number of places, so it would be a performance_win to have it easier - I might try putting the COUNT(DISTINCT) in the script

<=> Then I have to remove the Pers_ID from that aggregated LOAD, else I'll have to put it in the GROUP BY, too, and I'll only get a COUNT of 1 for every Pers_ID&day, right?

Best regards,

DataNibbler

datanibbler · ‎2013-12-18

Hi,

I think this is finally getting me somewhere: By placing the COUNT(DISTINCT) in the script, it is calculated only once and by using a SUM() command in the chart, I get it aggregated to whatever dimension or user_selection there is.

That is going to prove a big saving performance-wise because that formula is all over the place in most of my apps.

Best regards,

DataNibbler