johnw
Champion III

Should We Stop Worrying and Love the Synthetic Key?

Synthetic keys have a bad reputation.  The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure.  If you load tables like this:

TableA:  FieldA, FieldB, FieldC

TableB:  FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA:             SyntheticKey, FieldC

TableB:             SyntheticKey, FieldD

SyntheticKeyTable:  SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
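In script terms, a minimal example that produces exactly this structure might look like the following (table and field names are purely illustrative):

TableA:
LOAD * INLINE [
FieldA, FieldB, FieldC
1, x, c1
2, y, c2
];

TableB:
LOAD * INLINE [
FieldA, FieldB, FieldD
1, x, d1
2, y, d2
];

// After the reload, the table viewer shows TableA and TableB linked
// through an automatically generated key field ($Syn 1) and key table
// ($Syn 1 Table), holding FieldA and FieldB.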

Well, that doesn’t sound like a bad thing.  If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually.  As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field.  That shows up as a connection.  Two tables can be connected by two or more fields.  That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem.  Either can cause problems when they result from a bad data model, when there are too many or the wrong connections.  But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables.  If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

- Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.

- Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."

Yikes!  Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”.  No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written.  I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys.  Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys.  But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem.  I’ve never seen a good data model that resulted in a lot of synthetic keys.  Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include:  Forming your own non-composite keys”.  While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA.  And THAT is what I’m particularly interested in discussing.

My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage.  I’m suggesting that if you have a synthetic key, and you do a direct one-to-one replacement with your own composite key table, you will not improve performance or memory usage.  In fact, I believe performance and memory usage will typically get marginally worse.
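To make sure we're all talking about the same replacement, here's a sketch of what I mean (table and field names are purely illustrative, and the INLINE data is only there to keep it self-contained):

// Manual composite key via AutoNumber, replacing the synthetic key.
// AutoNumber maps each distinct string to the same integer everywhere
// in the script, so the keys match across tables.
TableA:
LOAD AutoNumber(FieldA & '|' & FieldB) AS CompositeKey, FieldC;
LOAD * INLINE [
FieldA, FieldB, FieldC
1, x, c1
2, y, c2
];

TableB:
LOAD AutoNumber(FieldA & '|' & FieldB) AS CompositeKey, FieldD;
LOAD * INLINE [
FieldA, FieldB, FieldD
1, x, d1
2, y, d2
];

// The hand-built equivalent of the synthetic key table.  In real data
// you would load the DISTINCT combinations from all source tables:
CompositeKeyTable:
LOAD DISTINCT AutoNumber(FieldA & '|' & FieldB) AS CompositeKey, FieldA, FieldB;
LOAD * INLINE [
FieldA, FieldB
1, x
2, y
];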

I only have one test of my own to provide, from this thread:

http://community.qlik.com/forums/t/23510.aspx

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved.  I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key.  I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

                            load time   file KB   RAM KB   CalcTime
synthetic key removed           14:01    49,507   77,248   46,000 ms
synthetic key left in place      3:27    49,401   77,160   44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting.  I’m happy to learn something here.  I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking.  The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves.  Maybe I just have a bad memory, and failed to find good examples when I searched.  But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage?  Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

96 Replies
johnw
Champion III
Author

If you load this data:

TableA:
FieldA, FieldB, FieldC
A, B, C
D, E, F

TableB:
FieldA, FieldB, FieldD
A, B, G
D, E, H

You will get this result:

TableA:
SyntheticKey, FieldC
1, C
2, F

TableB:
SyntheticKey, FieldD
1, G
2, H

SyntheticKeyTable:
SyntheticKey, FieldA, FieldB
1, A, B
2, D, E

FieldA has two values.  FieldB has two values.  You claim the synthetic key table will have four rows.  It doesn't.  It has two rows.

I believe you're confusing synthetic keys with Cartesian joins.  That isn't how synthetic keys work.  A synthetic key is not giving you all POSSIBLE combinations of field values.  It is giving you all EXISTING combinations of field values.  You can only have 1,728,000 records in the synthetic key if you have 1,728,000+ records in your data set.  You can only have 36,000,000,000,000,000 records in your synthetic key if you have 36,000,000,000,000,000+ records in your data set.
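You can verify this yourself with a two-minute test script (the $Syn names below are what QlikView generates automatically; everything else is illustrative):

TableA:
LOAD * INLINE [
FieldA, FieldB, FieldC
A, B, C
D, E, F
];

TableB:
LOAD * INLINE [
FieldA, FieldB, FieldD
A, B, G
D, E, H
];

// Reload, then open the table viewer: the generated "$Syn 1 Table"
// contains only the two EXISTING combinations (A,B) and (D,E),
// not all four possible ones.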

Edit: Attached is your worst-case example - a 5-field synthetic key with 800 possible values for each field.  The load works in a flash, the file is 156 KB, and there is no problem rendering a tablebox that depends on taking advantage of the synthetic key.  I believe it is safe to conclude that the synthetic key table does not, in fact, have 36,000,000,000,000,000 records.

Michael_Reese
Employee

John, I can appreciate the thought and effort you put into understanding synthetic keys. 

My take on synthetic keys lies along these lines: 

Can it work?  Sure.  Just the same, we don't have to create preceding loads.  We don't have to explicitly list all fields in select statements or preceding loads.  We don't have to explicitly name the table we are creating, joining, or concatenating to.  Does that mean we shouldn't?  Definitely not. 

Being explicit in what we do is a much cleaner and more consistent approach to development.  It makes it easier on ourselves when we have to revisit the app down the road.  It especially makes it easier on anyone else who has to support your app. 

Not having a standardized way of handling keys opens the door for many problems and rework when changes are required. 

Mike

Oleg_Troyansky
Partner Ambassador/MVP

Mike,

My take on it is exactly the same, but I couldn't possibly word it as clearly as you did - it's all about "being explicit in what we do".

John - I agree with your distinction between synthetic keys and Cartesian joins. However, I think the sentiment that Klaus was trying to express is that a single key (a synthetic key in this case) built for all possible permutations of the component keys will create the biggest field in your database. For example:

1,000 Customers x 1,000 Products x 1,000 Dates - that's a composite key with up to 1B distinct values (reached, of course, only if every customer buys every product every day). In any case, the synthetic key will generate a list of distinct values that's substantially longer than any of the initial 3 keys. And that, as you know, is what determines memory consumption in QlikView - not so much the number of rows in the table.

So, I get your argument that a Synthetic Key can perform slightly better than an identical Composite Key generated manually. In this case, I'll repeat after Mike - I'd rather be explicit in what I do. However, I'd like to continue the exercise and take it to the next level. Let's design the optimal data model that perhaps won't need the Composite Key at all - that's the ultimate solution to the problem of Synthetic Keys.

One technical correction - the variable Time that you calculate in your script does not include the time needed for calculating the synthetic keys. The linking is done after the script is finished. The only safe way to measure it is to generate a log file and see the time needed for generating the synthetic key.

In the case of your script, applied to 2 million rows, the load itself took 32 sec., and generating the synthetic key took an additional 13 sec. - more than a third of the script load time!

cheers,

Oleg

johnw
Champion III
Author

I agree that the synthetic or composite key can take a large portion of the memory if it has a large number of distinct values.  That's a disadvantage to both of those approaches.

As far as being explicit, I've changed my mind on that issue.  I used to believe that if two tables were associated by two fields, then obviously I intended a synthetic key.  I saw this as no different than associating two tables by a single field.  But given that 90% of synthetic keys are probably created in error, what is my assumption going to be when looking at someone else's program?  Unless the programmer's comments explain that he or she had intended it to join that way, that's one of the first things I'll think could have been a mistake.  So yes, composite keys are self-documenting, and this is an advantage over synthetic keys.

I further agree with you that probably the BEST solution in many cases is to have neither a synthetic key nor a composite key.  This thread has been primarily focused on the comparison of the two, based on what I think is a common suggestion that you replace synthetic keys with composite keys.  But a better solution still can be to modify the data model to perform without either.  And that is certainly relevant in a general discussion about whether or not synthetic keys should be removed.  Even if they are arguably a better choice than composite keys in some particular application, they may not be the best choice.  And if another choice is best, it should generally be used.

Yes, you're right that how I was calculating the load time for synthetic keys was explicitly ignoring the load time for synthetic keys.  I hadn't realized that at the time, though I should have.  Synthetic keys are built after the script finishes executing, so no duration set in the script can include them.  This then misrepresented them in the comparisons.  I know I've done it correctly more recently, as I remember reading log files for the synthetic key creation times, but I can't remember if that was in more recent threads or just something I was doing myself when testing my own applications.  In any case, you're correct - my load times were wrong in the comparisons.

And on top of this, I'll confess that I went back and checked all the places I've used synthetic keys.  "All" turned out to only be a few applications with one each, and in ONE of those applications, the synthetic key was a design flaw.  My data should not have been connected by the two fields involved.  I redesigned the data model to fix the error, and of course the synthetic key went away.  Perhaps if I'd had to build the connection manually, the additional thought required would have been enough to make me realize my error, and I'd have avoided the mistake.  If with seven years of QlikView experience (and much more data modeling experience) I'm still screwing up synthetic keys, should I really be recommending them to less-experienced developers?

You and others have done much to convince me, Oleg, since this was originally written.  There are definite advantages to being explicit.  There are often definite advantages to simply restructuring the data model, such as by joining or concatenating tables.  Even composite keys can be significantly faster in common real world scenarios, such as when combined with incremental QVD loads.  I've written no actual applications where the synthetic key gave me any definite advantage other than slightly faster initial coding time, which is very low on the scale of what is truly important.

I wouldn't say that I've come completely around, but my current opinion is much different than it once was.  So thank you (and others) for continuing to debate me on the subject.  I appreciate your patience and experience.

johnw
Champion III
Author

m1lreese wrote:

...we don't have to create preceding loads.  We don't have to explicitly list all fields in select statements or preceding loads.  We don't have to explicitly name the table we are creating, joining, or concatenating to.  Does that mean we shouldn't?  Definitely not. 

Being explicit in what we do is a much cleaner and more consistent approach to development.  It makes it easier on ourselves when we have to revisit the app down the road.  It especially makes it easier on anyone else who has to support your app.

Well, there's often reason to do this sort of thing beyond merely making it easier on ourselves and others when maintaining the application.  Say I LOAD * from a QVD (or database).  My application is now exposed to outside corruption.  All anyone has to do to break my data model is add a field to that QVD that happens to have the same name as a field in another QVD that I'm using.  No thanks.  I'd rather my applications be safe.  And on top of that, LOAD * or SELECT * wastes time and memory on fields I don't need.
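For example (a sketch - the QVD and field names are made up, and the two statements are alternatives for contrast, not meant to run together):

// Fragile: silently picks up any field someone later adds to the QVD,
// which can create unintended associations or synthetic keys:
Orders:
LOAD * FROM Orders.qvd (qvd);

// Safer and explicit: the data model only contains the fields
// the application was designed for:
Orders:
LOAD OrderID,
     CustomerID,
     OrderAmount
FROM Orders.qvd (qvd);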

But yes, these are good examples of why it isn't always a good idea to just let QlikView do something for us.  It is sometimes important to do it ourselves, manually, explicitly, even when that means more code.

Michael_Reese
Employee

Right, I totally agree that is another big reason to be explicit. 

I first identified * as an issue in a pre-existing app for my first client 5 years ago.

farolito20
Contributor III

If I remove the synthetic key table, can I keep my tables connected?

Michael_Reese
Employee

You want to create a link table manually by concatenating the key values into one LinkKey column.  Keep separate columns for each associated key field so they link out to the dimension tables.
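In script terms that usually looks something like this (a sketch - the table and field names are illustrative, and the inline Facts table is only there to keep it self-contained):

// Hypothetical fact table:
Facts:
LOAD * INLINE [
CustomerID, ProductID, Amount
C1, P1, 10
C2, P2, 20
];

// Build the link table from the key combinations that actually occur:
LinkTable:
LOAD DISTINCT
     CustomerID & '|' & ProductID AS LinkKey,
     CustomerID,
     ProductID
RESIDENT Facts;

// Replace the separate keys in the fact table with the single LinkKey:
NewFacts:
LOAD CustomerID & '|' & ProductID AS LinkKey,
     Amount
RESIDENT Facts;
DROP TABLE Facts;
RENAME TABLE NewFacts TO Facts;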

dmohanty
Partner - Specialist

Hi John, I badly need your help. Could you please help with the scenario below?

I have to concatenate the QVDs from different folders. The problem is that when the QVDs from the 2nd folder are processed, they replace the QVDs generated from the 1st folder. My aim is to combine the QVDs from both folders. Please suggest. Do I need to mention CONCATENATE somewhere?

Below is my code:

set ErrorMode = 0;

let qvdpath = 'C:\Users\DICKY\Desktop';

for each vPlant in '$(qvdpath)\MAP\', '$(qvdpath)\HPG\'
  for each qvd_file in FileList('$(vPlant)*')
    let QVDName = subfield('$(qvd_file)', '\', 6);

    $(QVDName):
    LOAD *
    FROM $(vPlant)$(QVDName) (qvd);

    store $(QVDName) into [C:\Users\DICKY\Desktop\Test QVDs\$(QVDName)];
    drop table $(QVDName);
  next
next

EXIT Script;

Anonymous
Not applicable

Hi,

I have one doubt about circular loops: what are the advantages and disadvantages of a circular loop?

If you have any document on this topic, please share it.

Regards