13 Replies Latest reply: Jun 5, 2012 11:46 AM by user5674 RSS

    Replace count distinct by sum : bad results

    user5674

      Hello,

       

      I have attached an example.

       

      I want to test to replace count distinct big expression with many if conditions by sum with if conditions to compare calculation speed.

       

      I used a table with my data (TMP) and an column containing Id

      I have added a table with only the distinct Id and a column flag.

       

      But with replacing my count distinct by a sum of the flag column, I have bad results.

       

      Could you explain me what is the problem?

       

      Cause to the filter conditions I can't use the peek style load script because the first line with the Id which obtain the flag may have other fields values which validate only certain condition and not others. But i want my count distinct evaluate expressions dynamically on selected data.

      That why I have attempted to use a table (TMPDistinct)

      to associate one single Id value to one single flag 1.

       

      Regards,

        • Re: Replace count distinct by sum : bad results
          user5674

          I have done some other test with formulas in attached application.

           

          I don't know if the thing I want is possible:

          use the current line value of the "fake" dimension CategoryTtl as value condition on the true aggregated sum  on field Category in the "Set analysis" formula.

           

          The "best" I have done is the yellow column with an explicit if switch. I don't think it's really "better" concerning performance...

           

          And with the If conditions I don't understand why all my Sum result are wrong.

            • Replace count distinct by sum : bad results
              swuehl

              Hi user5674,

               

              as soon as you use something like

              =sum(If(Date=vDate,flag))

               

              in your chart with dimension Category, you tell QV to go through your table containing Date and return flag for every line where Date=vDate.

               

              For example, for your chosen date and Cat A, there are three lines. But you are only interested in the distinct two lines.

               

              If you use the set analysis version,

              =sum( {$<Date={'$(=vDate)'}>} flag)


              you limit the possible values for ID for Cat A to 1 and 4 by your set expression, so if you do the sum over flag, only two values of 1 are summed up (again, QV will look at each line of your flag table, but here there are only two lines to be regarded).

               

              This sounds a bit complicated and I think it is. I hope I made my point a bit clear.

               

              Regards,

              Stefan


                • Re: Replace count distinct by sum : bad results
                  user5674

                  Hi,

                   

                  Thanks for your answer.

                   

                  I understand what is the problem on the simple Sum(if(...)) now. It take all lines of the TMP table validating the condition instead of all lines of the flag table (TMPDistinct ).

                   

                  The first "set analysis" on the left pivot table (Sum set analysis) seems give the correct results.

                   

                  But I don't know why the "set analysis" don't take account of the selection. if I select the 20/05/2012 the column :

                  -first table: Sum set analysis

                  -second table: Sum set analysis with explicit if-switch

                  remain the same but I haven't the condition Date=vDate (=25/05/2012) validated as Date=20/05/2012.

                  So why I haven't "0" ?I have written the "$" though in order to use current selection.

                   

                   

                  I haven't solution to have the good result on my second table with the "fake" dimension "CategoryTtl".

                   

                  For this one I'm not sure if  it's possible to heve the correct results without the horrible If/Switch conditions: I think a count distinct with the condition if(Category=CategoryTtl,...) is less heavy in calculation time, so it's meaningless to use set analysis with this method.

                    • Replace count distinct by sum : bad results
                      swuehl

                      >So why I haven't "0" ?I have written the "$" though in order to use current selection.

                       

                      Sure, you are using the current selection set, but you override your selection in Date by using set modifier

                      <Date={'$(=vDate)'}>}


                      and vDate equals 25/05/2012, right?

                       

                      Looking at your second question using the data island, may I ask why do you want to use a data island here?

                       

                      I think you can make your expression work like:

                       

                      =If(

                          CategoryTtl<>'Total'

                          ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id))  

                           ,sum({$<Date={'$(=vDate)'}>} flag)

                      )

                       

                      Also, maybe this is of use to understand how QV handles set analysis in dimension context:

                      http://community.qlik.com/docs/DOC-1335

                       

                      Regards,

                      Stefan

                        • Re: Replace count distinct by sum : bad results
                          user5674

                          About the <Date={'$(=vDate)'}>} I misunderstood set analysis here. I thought that was just a sort of filter on current selection, not a forced selection which override current selection.

                           

                          But for this type of formula this isn't a big problem here because the variable is automatically updated by macro in the real application when I change the date field.

                           

                           

                          Concerning the goal of the data island this is related to this subject: http://community.qlik.com/message/221295

                           

                          I have a total aggregated over a "fake" dimension label in order to have two total rows with distinct formula.

                           

                          I have deleted the two total rows in this dicussion because it isn't directly related to the subject.

                           

                          Regards,

                          Sylvain

                            • Replace count distinct by sum : bad results
                              user5674

                              Hello Stefan,

                               

                              Thanks for the formula

                              =If(

                                  CategoryTtl<>'Total'

                                  ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id))  

                                   ,sum({$<Date={'$(=vDate)'}>} flag)

                              )

                               

                              It return good dynamic results on "normal" rows, but the total row isn't dynamic (always "4" and not "0" when the 2012-05-20 date is selected). I have tried to change as sum({$<Date={'$(=vDate)'}>} aggr(flag,Date) ) but it don't give me the good result.

                               

                              Have you a solution for this ?

                               

                              Regards,

                              Sylvain

                                • Replace count distinct by sum : bad results
                                  swuehl

                                  You could try something like this (if you are limiting your evaluation to a certain date):

                                   

                                  =If(

                                       CategoryTtl<>'Total'

                                       ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id)) 

                                        ,sum({$<Date={'$(=vDate)'}>} aggr(flag,CategoryTtl, Date, Id))

                                  )

                                    • Replace count distinct by sum : bad results
                                      user5674

                                      Thanks!

                                       

                                      It solves my problem.

                                        • Re: Replace count distinct by sum : bad results
                                          user5674

                                          Hi,

                                           

                                          I have a question on the method sum( {SET} aggr(, Dimensions) ):

                                          I must add each field used in the set analysis formula as a dimensions of the aggr?

                                           

                                          I have many "IF" condition with text field range exclusion and date field range inclusion to replace by a set analysis formula in order to boost performances in my true application.

                                           

                                          If I must add them, I think I don't optimize but decrease performance as I aggregate over the ID I want count, and other fields.

                                           

                                          My full expression is like this:

                                          Count(DISTINCT
                                          IF(
                                            [Category]=[CategoryTtl]
                                            AND [Period]=vLastPeriod
                                            AND Class='B'
                                            AND ((vActive=0 AND Action='Y') OR vActive=1)
                                            AND CStatus<>'SOLVED'
                                            AND CStatus<>'CANCELED'
                                            AND (not isnull([CStatus))
                                            AND CStatus<>''
                                            AND Record.Date<=vLastDate
                                            AND Record.Date>vLastDate_1
                                            ,FlagDistinctID
                                          )
                                          )


                                          I have set this expression but it doesn't work (only 0 on each row) :

                                           

                                          (variable)

                                           

                                          Sum(

                                              {$<
                                                Period={'$(=vLastPeriod)'}
                                                ,Class={'B'}
                                                ,CStatus - = {'SOLVED','CANCELED',''}
                                                ,"Record.Date"={'>$(=vLastDate_1)<=$(=vLastDate)'}
                                               >}

                                               aggr(
                                                    IF(
                                                         [Category]=[CategoryTtl]
                                                         AND ((vActive=0 AND Action='Y') OR vActive=1)
                                                         ,FlagDistinctID
                                                    )
                                                    ,[CategoryTtl],[Category]
                                                )
                                          )

                                           

                                          My goal is to decrease expressions calculation time which is too long.

                                           

                                          I can't do something as sub aggregates/fields on the load script because I have many expressions like this one and my application have many selectable field for users, so my expressions must remain dependant of users selections.

                                          And It's to difficult to estimate all possible sub aggregates to flag distinct ID in the load script.

                                          ( expression A->distinct ID, expression B->distinct ID,...)

                                           

                                          Could you help me ?

                                            • Replace count distinct by sum : bad results
                                              swuehl

                                              No, I believe you don't need to add each field in your set expression as dimension to the aggr() function. Not necessarily.

                                               

                                              Not sure why your expression returns all zero. Are you still using the same 1 dimension CategoryTtl?

                                               

                                              If you comment out some set modifiers in your set expression or the additional condition in the if() statement, does this change anything?

                                               

                                              It's quite hard to debug this just looking at the expression (I don't see anything really wrong at the moment) and without knowing your data (model). Is there a chance that you post an updated sample file?

                                                • Re: Replace count distinct by sum : bad results
                                                  user5674

                                                  Thanks.

                                                   

                                                  I will provide an example.

                                                   

                                                  Regards.

                                                    • Re: Replace count distinct by sum : bad results
                                                      user5674

                                                      Hello,

                                                       

                                                      Here an example with a sample dataset CSV.

                                                       

                                                      The issue is to sum on flag tables only, and not on AllRecord table, as long as I understand the bad results when not "0".

                                                       

                                                      I don't know if it will really decrease the calculation time, but it's my goal to decrease it.

                                                       

                                                      Regards

                                                        • Re: Replace count distinct by sum : bad results
                                                          user5674

                                                          Hi,

                                                           

                                                          I have success to have the good results in my application using something like this

                                                           

                                                          aggr(

                                                               Count(

                                                                    {SET CLAUSE}

                                                                    DISTINCT

                                                                    IF(

                                                                         TrueDimension=FakeDimension1

                                                                         AND conditions dependant of variables

                                                                        , ID

                                                                    )

                                                                    ,FakeDimension1,FakeDimension2

                                                          )

                                                           

                                                          Same thing for my Mt formula:

                                                          SUM(

                                                               aggr(

                                                                    Min(

                                                                         {SET CLAUSE}

                                                                         IF(

                                                                              TrueDimension=FakeDimension1

                                                                              AND conditions dependant of variables

                                                                             , Mt

                                                                          )

                                                                    )

                                                                    ,FakeDimension1,FakeDimension2,ID

                                                               )

                                                          )

                                                           

                                                           

                                                          Edit: The statistics give me a good result, calculation time / 2 for  ten expression like this instead of use the "IF" ! Set analysis is really interesting for me here.