IAMDV
Luminary Alumni

*** 6 Weeks into QV Development, 30 Million Records QV Document and Help Needed!!! ***

Dear All,

Next week I'll be working on a massive dataset (at least for my experience with QV) of around 35 million records with 12 columns, and I wanted to know the best practices to speed up the document. Could someone please share good practices for efficient development?

I know this is a vague question, but this is really important to me...

Many thanks in advance.

Cheers!

1 Solution

Accepted Solutions
Miguel_Angel_Baeyens

Hello,

There are some guidelines that can generally be applied, although you obviously know your data model best, and sometimes some of these guidelines are not possible to follow.

But in short, a few steps I always take into consideration are:

1.- Always LOAD from QVD files: keep a separate file that pulls from the database, does all the needed transformations and stores the result into the QVD that will be loaded later (there is a small sketch after this list). And LOAD only those fields you really need (pretty obvious, but "LOAD *" and "SELECT *" are common). You don't need the Customer Name in both the Invoices and the Customers table; linking "Invoices" to "Customers" through "Customer Code" will do.

2.- Use a fact table with all the required dimensions, trying to follow a star schema rather than a snowflake schema (this is quite theoretical, but say your data source has one table with Customers, another with Customer Groups and another with Customer Addresses: join them so you have one big "Customers" table with all the required data). When concatenating tables to create this fact table, rename fields so that the final table has the same number and names of fields.

3.- Avoid IF() conditionals in any object expression; use set analysis instead.

4.- When linking one table to another by renaming a field so that it matches in both (the key fields, we could say), use a numeric value instead of a string (usually the AutoNumber() function will do).

5.- Note that QlikView is case sensitive, so where possible, keep all your values in the same case.

6.- Use built-in actions instead of macros.

7.- Create flag fields for the analyses or filters you want to use in your expressions; the simpler the expression, the faster the chart is rendered.

8.- Move all formatting, conversions, mapping... all the complexity you can, into the script.

9.- Avoid RESIDENT loads. If needed, load twice from the QVD instead.
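To give a rough idea of points 1, 4 and 7, here is a minimal sketch of what the two layers could look like. All table, field and file names are just examples, not your actual model, and an ODBC/OLE DB connection is assumed earlier in the script:

// Extract layer: pull from the database once and store to a QVD
Invoices_Extract:
SQL SELECT InvoiceID, CustomerCode, InvoiceDate, Amount
FROM dbo.Invoices;

STORE Invoices_Extract INTO Invoices.qvd (qvd);
DROP TABLE Invoices_Extract;

// Presentation layer: load only the fields you need, from the QVD
Invoices:
LOAD
    AutoNumber(CustomerCode, 1)  AS %CustomerKey,     // numeric key instead of a string (point 4)
    InvoiceDate,
    Amount,
    If(Amount > 1000, 1, 0)      AS LargeInvoiceFlag  // flag field for the front end (point 7)
FROM Invoices.qvd (qvd);

On the front end, point 3 then turns an expression like Sum(If(LargeInvoiceFlag = 1, Amount)) into Sum({<LargeInvoiceFlag = {1}>} Amount), so the filtering is handled by set analysis instead of being evaluated row by row.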

Again, most of the above is obvious and very generic, since your application and requirements will make life more complex. These are not MUSTs, rather STRONGLY RECOMMENDED.

In addition, note that QlikView's rendering engine uses all the CPU, GPU and memory available on your computer, while the QlikView Server may be a shared machine, and version 9 doesn't use multithreaded extraction either.

NOTE: Quoting QlikTech's Brad Peterman (from the "Large Data Volumes Roundtable" at Qonnections a few weeks ago): make sure you actually need all those records, because you might not, depending on the analyses you want to perform. The fact that your database has 35M records doesn't mean you need to load them all into memory if they are not going to be displayed or used at the level of detail the customer wants.
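For example, if the users only ever analyse recent history, something along these lines (the field names and the two-year cut-off are invented) reduces the volume in the script before it ever reaches memory; the same idea applies to pre-aggregating to the level of detail the customer actually needs:

RecentInvoices:
LOAD
    InvoiceID,
    CustomerCode,
    InvoiceDate,
    Amount
FROM Invoices.qvd (qvd)
WHERE InvoiceDate >= AddYears(Today(), -2);  // keep only the detail the front end will actually use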

Hope that helps.


38 Replies
chris_johnson
Creator III

Hi again!

From what I've learnt, the best way to speed up the document itself is to keep the number of tables to a minimum. Join tables and apply mapping loads where you can. Joining tables together will increase the load time but has a positive effect on the performance of the document. It's a bit of a balancing act, but I'm sure you'll find something that works for you.
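To make the mapping-load part concrete, here is a small sketch; the table and field names are made up:

// A two-field lookup table becomes a mapping table instead of yet another join
GroupMap:
MAPPING LOAD
    CustomerCode,
    CustomerGroup
FROM CustomerGroups.qvd (qvd);

Invoices:
LOAD
    InvoiceID,
    CustomerCode,
    ApplyMap('GroupMap', CustomerCode, 'Unknown') AS CustomerGroup,  // third argument is the default when no match is found
    Amount
FROM Invoices.qvd (qvd);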

Hopefully others can chip in and give their advice too, I'll probably learn a few things myself!

Regards,

Chris

chris_johnson
Creator III

Really good post, Miguel. Definitely some stuff I'm personally going to take into consideration; I have marked this thread as a favourite for future reference.

Thanks

Chris

IAMDV
Luminary Alumni
Author

Chris - That's true. Some really good points to consider.

Thanks a ton, Miguel, for taking the time to write all of this down. This is definitely one of the best posts I have read so far.

It would be brilliant to get expert suggestions from the top users of the forum.

Cheers!

Not applicable

Hi Miguel,

Good Post.

All your points are correct and match what I have experienced in my work.

The point that RESIDENT loads should be avoided is new to me; I learned that from your post.

Thanks

sravan

Not applicable

Hi Deepak,

Sorry, I was giving training at a client site for the last few days.

At 35 million records and 12 columns/fields, you are still within what QlikView can handle without getting too far into large-dataset configuration issues. So don't worry: this is a larger dataset than what you might have seen, but not really a large dataset (from colleagues, the largest I know of personally is 200 million rows and just shy of 50 columns).

1. Although you are gunning for a star schema in your data model, you will most probably end up with a snowflake schema. However, I agree with Chris Johnson. I suggest to all trainees on the course that they keep the data model to a maximum of 4 hops from one extremity of the data model to the other. Do whatever you have to do to try and limit it to this, especially with large data models. Like Chris says, use ApplyMap to collapse tables with only 2 fields, and LEFT JOIN where the business logic is not contradicted.

2. (Most important of all suggestions) As Miguel Angel Baeyens says, use a QVD structure and try to get incremental loads for change data capture so that reload time is minimised (there is a short sketch after this list). Don't do stupid joins in point 1 above just because they make technical sense; try to follow some kind of business logic for your QVDs, so that you can re-use them over multiple apps (QVWs). For example, don't mix marketing data with HR data just because it seems convenient for some technical reason. If you then create an HR app and a Marketing app, you will spend more time untangling the data to get to what is relevant when you load from the QVD. That is not an optimal solution.

3. If you get synthetic keys, iron them out if needed (even create your own link tables), as this might also minimise reload time. Although QV can easily deal with synthetic keys, by taking this process into your own hands and incorporating it in the script you will prevent longer reload times.

4. Miguel's point 4 is a very useful piece of advice, especially on large data models. QV resolves queries across tables fastest when the value linking the 2 tables is a small numeric value, hence you can use the AutoNumber() function on all your linking fields. Don't forget, when using AutoNumber(), to pass a series ID as the 2nd parameter, e.g. AutoNumber(LinkField1, 1) and AutoNumber(LinkField2, 2); otherwise the autonumber values will overwrite each other when you use the function in multiple different places.

5. If you need to use a link field in a count computation, then add "1 as LinkFieldCounter" to the relevant table. On the front end, Sum(LinkFieldCounter) gives the same result as counting the key field, and it is less expensive to sum over 1s than to do the count.

6. With regard to Miguel's point 9: in v9 I would say yes, avoid resident loads as much as possible. In v10, resident loads are so fast that it makes no big difference.
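To make point 2 a bit more tangible, a bare-bones incremental load could look something like the sketch below. The ModifiedDate field, the vLastReloadTime variable and all table names are placeholders; the details depend on how your source tracks changes, and an ODBC/OLE DB connection is assumed earlier in the script.

// vLastReloadTime would normally be read from the previous run (e.g. stored in a small QVD or text file)
LET vLastReloadTime = '2011-06-01 00:00:00';

// 1. Extract only new or changed rows from the database
Invoices:
SQL SELECT InvoiceID, CustomerCode, InvoiceDate, Amount, ModifiedDate
FROM dbo.Invoices
WHERE ModifiedDate >= '$(vLastReloadTime)';

// 2. Append the historical rows already stored in the QVD,
//    skipping any InvoiceID that was just re-extracted
Concatenate (Invoices)
LOAD *
FROM Invoices.qvd (qvd)
WHERE NOT Exists(InvoiceID);

// 3. Store the refreshed full history back to the QVD for the next run
STORE Invoices INTO Invoices.qvd (qvd);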

I will add more info as I go along; I have other office issues to deal with at the same time. I hope the confirmation of what others have said in this thread gives you a best-practice path for development.

Have fun,

C ;o)

chris_johnson
Creator III

Yes sravan, I heard that point too on a Developer 3 course I sat in on. I also seem to remember that straight tables are quicker to calculate than table boxes. To give a table-box-like appearance we just put '1' in the expression dialog and hid the column. Of course, with a straight table you lose the horizontal scroll bar.

Chris

Not applicable

Miguel is referring to the v10 multi-threaded reload of scripts, which is why I said a resident load in v10 is more acceptable than it was in v9, and much faster.

QV has tried to relieve the bottleneck where the QV Server is concerned, i.e. the multithreaded reload of the script; naturally, it does not help with network/connection issues or reading from disk.

What is meant by the multi-threaded reload is not that QV reads multiple script tabs simultaneously; rather, in v9 the extract had to complete before anything else could run, even if you had multiple cores on the server. Now, as soon as QV has some data from the extract, it passes it over for transformation: it does not need to complete the E before doing the T of the ETL.

Hope it helps.

vgutkovsky
Master II

Miguel, great post!