From what I've learnt, the best way to speed up the document itself is to keep to a minimum number of tables. Join tables and apply mapping loads where you can. Joining tables together will increase the loading time but will have a positive effect on the performance of the document. It's a bit of a balance but I'm sure you'll find something that works for you.
Hopefully others can chip in and give their advice too, I'll probably learn a few things myself!
There are some guidelines that can be generally applied, although the best knowledge of the datamodel is obviously yours and sometimes some of those guidelines are not possible to follow.
But in short, a few steps I always take into consideration are:
1.- LOAD always from QVD files, and keep a file to pull from the database, do all needed transformations and store into the QVD will be loaded later. And LOAD only those fields you really need (pretty obvious, but "LOAD *" or "SELECT *" are usual). You don't need to have the Customer Name in both Invoices and Customer table, just linking "Invoices" to "Customers" through "Customer Code" will do.
2.- Use a fact table with all dimensions required trying to follow a star schema, instead of a snowflake schema (this is quite theoretical, but say that in your data souce you have a table with Customers, another one with Group of Customers, another one with Addresses of Customers, well, join them so you have one big "Customers" table with all required data). When concatenating to create this fact table, rename names so the final table has the same name and number of fields.
3.- Avoid IF() conditionals in any object expression, use instead set analysis.
4.- When linking one table to another by means of renaming a field in both alike (we could say these are the key fields), use a numeric value instead of a string (usually the AutoNumber() function will do).
5.- Note that QlikView is case sensitive, so when possible, set all your values to be in the same case.
6.- Use builtin actions instead of macros
7.- Create flag fields for those analysis or filters you want to use in the expressions, the simpler the expression the faster the chart is rendered.
8.- Take to the script all formatting, conversions, mapping... all complexity you can.
9.- Avoid RESIDENT loads. If needed, load twice from a QVD.
Again, most of the above are obvious and very vague, since your application and requirements will make life more complex. These are not MUSTs, rather than STRONGLY RECOMMENDED.
In addition, note that the rendering engine of QlikView uses all your CPU and GPU and memory available in your computer, while the QlikView Server may be a shared computer, using version 9 doesn't use multithreaded extraction, and even so...
NOTE: Quoting QlikTech's Brad Peterman (from the "Large Data Volumes Roundtable" in Qonnections few weeks ago), make sure you do need all those records, because you might not need them depending on the analyses you want to perform, meaning that that your database has 35M records doesn't mean you need to load them all into memory if they are not going to be displayed or used according to the level of detail the customer wants...
Hope that helps.
Sorry I was giving training at a clientsite for the last few days.
At 35 million and 12 columns/fields deep, you are still within what QlikView can handle without going too much into large dataset configuration issues. So don't worry, this is a "larger" dataset that what you might have seen, but not really a large dataset (from colleagues, largest I know of personally is 200 million rows and just shy of 50 columns).
1. Although you are gunning for a star schema of your data model, you most probably are going to end with a snowflake schema. However, I agree with Chris Johnson. I suggest to all trainees in the course to keep the data model to a maximum of 4 hops from one extremity of the data model to the other. Do whatever you have to do to try and limit it to this, especially with large data models. Like Chris says, use applymap to collapse tables with 2 fields only and left join where business logic is not contradicted.
2. (Most important of all suggestions) As Miguel Angel Baeyens says, use a QVD structure and try and get incremental loads for change data capture so that the amount of time for reload is minimised. Don't do stupid joins in point 1 above, just because it makes technical sense. Try and follow some kind of business logic for your QVDs, so that you can re-use them over multiple apps (QVWs). For e.g., don't mix marketing data with HR data, just because for some technical reason it is good. If you then create an HR app and a Marketing app, you will spend more time untangling the data to what is relevant when you load from the QVD. This is not an optimal solution.
3. If needed, iron out synthetic keys (even create your own link tables) if you get them, as this might also minimise reload time. Although, QV can easily deal with synkeys, by taking this process in your hands and incorporating it in the script, you will prevent longer reload times.
4. Point 4 of Miguel seems to be a very useful piece of info, especially on large data models. QV resolves queries across table the fastest, when the value linking the 2 tables is a small numeric value. Hence, you can use autonumber function on all your linking fields. Don't forget, when using the autonumber function to call for series ID as the 2nd parameter, otherwise, problems..., e.g. autonumber(LinkField1, 1) and then autonumber(LinkField2, 2), otherwise autonumber values will overwrite themselves when you use this function in multiple different places.
5. If need to use link field in count computation, then enter a "1 as Linkfieldcounter" in the relevant table. On front-end, go sum(Linkfieldcounter), you will get the same result as count(counterID) and it is less expensive to do a sum over 1 values.
6. With regard to point 9 of Miguel, in v9, I would say yes, avoid resident loads most possible. In v10, resident loads are so fast, it makes no "big" difference.
Will put more info as go along, I have other office issues to deal with at the same time. I hope the confirmation of what others have said in the post gives you a BP path for development.
Miguel refers to the v10 multi-threaded reloads of scripts, reason for which I said a resident reload in v10 is more OK than it was in v9 and mich faster.
QV has tried to release the bottleneck where the QV server is concerned, i.e. the multithreaded reload of the script, does not help naturally with network/connection issues or reading from disk.
What is meant by the multi-threaded reload, as opposed to QV reading multiple tabs of script simultaneously, is that in v9, it would first complete extract, prior to doing it simultaneously, even if you had multiple cores on a server. Now, as soon as QV has some data from extract, it passes it over for transformation. It does not need to complete the E, before doing the T of the ETL.
Hope it helps.
(Edit: Oh, and TEST. I'm putting this at the top because it's very important. What you THINK is fastest may well not be fastest. What we tell you is fastest most of the time may not be fastest in your case. Try different approaches and SEE which one is fastest. I knocked my load time in half on a recent application by just continually playing with different ways of loading the data, even the ways I thought would be slower. Some of them were faster for my specific case, not slower. Everyone's situation is a little bit different.)
When loading from your QVD, make sure you're getting an optimized load. Doing so often requires some strange gyrations, because the only condition you can put on an optimized load is a single where exists(). So if you want something like this:
LOAD a bunch of fields
FROM MyData.qvd (QVD)
WHERE "Status" = 'Active'
AND "Date" >= yearstart(today(),-1)
It will often be faster to break this up to get an optimized load. First, what's your most restrictive condition? That's the one we want to focus on first. Let's say it's the date condition. We can generate all the dates that we care about, and do an exists() on that. The other conditions can be added as an inner join after the fact. So something like this:
LOAD date(today()+1-recno()) as "Date"
AUTOGENERATE today() - yearstart(today(),-1) + 1
INNER JOIN ([My Table])
LOAD a bunch of fields
FROM MyData.qvd (QVD)
INNER JOIN ([MyTable])
LOAD * INLINE [
In my experience, the second version, while significantly more complicated, is likely to execute faster. On the other hand, it may also use more memory because it is temporarily loading records that you will later throw out. So which version you use may depend on whether your CPU time or RAM is more plentiful. Experiment to see what works best for you.
Miguel Angel Baeyens wrote:3.- Avoid IF() conditionals in any object expression, use instead set analysis.
And where reasonable, consider object model changes instead of even set analysis. In my experience, object model solutions are fastest of all, then set analysis solutions, and then everything else. And from a maintenance standpoint, I prefer seeing the complexity in my load script instead of in my tables. I'm not sure why it matters to me, since in our shop the same people are responsible for both script and charts. But in some shops, this might actually be important.
Miguel Angel Baeyens wrote:9.- Avoid RESIDENT loads. If needed, load twice from a QVD.
This may seem counterintuitive, but yes, in many or maybe most case, it seems that loading again from the QVD is faster than loading from a resident table. I've seen exceptions, but those are the exceptions, not the rule. It's a strange rule, but something to keep in mind for certain.
That's not to say "avoid resident loads" - I use them all the time. More like "if you have a choice between a resident load and an optimized QVD load, in my experience the optimized QVD load is usually faster". The exceptions tend to be when the resident table has already been significantly reduced in size and number of columns, I believe.
Ah, didn't realize this had been improved in version 10. We haven't upgraded yet. I'll have to keep an eye out for that.
CheenuJanakiram wrote:3. If needed, iron out synthetic keys (even create your own link tables) if you get them, as this might also minimise reload time. Although, QV can easily deal with synkeys, by taking this process in your hands and incorporating it in the script, you will prevent longer reload times.
I'll strongly emphasize the "if needed" part of this. IF it is appropriate for your databases to be linked by two fields, it is appropriate for that link to be modeled using a synthetic key. In my experience, as long as your data model is correct, converting synthetic keys to their link table equivalents does nothing but waste script time and use a TINY (basically negligible, but I thought I'd mention it) bit more memory and CPU time in the application itself.
To me, the main reason to remove synthetic keys in a good data model is simply practice. You should understand what the synthetic key IS, and a good way to get that understanding is to model the connection yourself rather than relying on QlikView to do it for you automatically. Also, people less experienced with data modeling and/or with QlikView may often find that synthetic keys are a symptom of an underlying data model problem that SHOULD be fixed. But in a case like that, merely replacing the syntheic key with a link table equivalent is not going to fix the underlying data problem. You have to think about it a bit more deeply than that.
More information on synthetic keys here:
CheenuJanakiram wrote:5. If need to use link field in count computation, then enter a "1 as Linkfieldcounter" in the relevant table. On front-end, go sum(Linkfieldcounter), you will get the same result as count(counterID) and it is less expensive to do a sum over 1 values.
I've seen tests go both ways in this regard. It seems to depend on the QlikView version. I haven't tested version 10, and don't remember which is faster in version 9. I typically do a count(distinct LinkField), but strongly suspect that's the slowest option of all. Basically, if you need a count of the keys, and the expression is slow, try both a sum() and a count(), and use the faster one for your version.
John Witherspoon wrote:Ah, didn't realize this had been improved in version 10. We haven't upgraded yet. I'll have to keep an eye out for that.
I may have got that wrong (I'm not English native), but in one of the sessions of last Qonnections I took note that Brad Peterman from QlikTech agreed to one of the attendees when he said that RESIDENT loads in QlikView 10 perform as fast as (or equivalent to) the graphic engine rendering a chart. It's quite a lot.
But yes, basically, I mentioned the avoidance of RESIDENT loads when they have a considerable amount of records. Of course, what is a considerable amount will depend on the hardware, the analyses that will be done based on that data, the charts to be rendered...
John Witherspoon wrote:When loading from your QVD, make sure you're getting an optimized load.
The optimized LOAD is exactly where the power of loading from QVD lies. That's exactly what I meant and I forgot to mention that, and that's a very good point and the most important, since loading from a QVD file with filters, concatenations, formats... may be as slow as any other source, thus losing the advantage of the QVD idea itself.
Thanks John for your expert contribution, as usual!
Miguel Angel Baeyens wrote:The optimized LOAD is exactly where the power of loading from QVD lies. That's exactly what I meant and I forgot to mention that, and that's a very good point and the most important, since loading from a QVD file with filters, concatenations, formats... may be as slow as any other source, thus losing the advantage of the QVD idea itself.
Filters I don't see as a huge problem as I've had good success handling them as per the example above. But data transformation during the load would be a problem, and of course we often find ourselves needing to do that.
What try to do instead is put as much of that as practical into the QVD itself. Let's say I'm storing a QVD of employee information. Different people may want to see employee names in different ways in different applications. While I think it would be best to standardize, that isn't always practical. So as required by user applications, I might store ALL of the following values as separate fields in the QVD:
John E Witherspoon
John Witherspoon (12345)
Witherspoon, John E
Now, with any luck, people can load the employees in whatever way is appropriate for their application, and avoid deoptimizing the QVD load or needing to do some after-the-load adjustments. Naming all those similar fields in an understandable way, of course, can be a bit of a chore. "Employee Name Last First With Initial", for instance. Fortunately, renaming fields during a QVD load doesn't cause problems.
Conveniently, for us the same people build the user applications and the QVDs, so there isn't really any guesswork going on by the QVD developers. When I find myself transforming QVD fields during the load in a user application, I can just decide on the fly whether to leave it there or push it back into the QVD.
So I think with a little work, you should be able to get an optimized load a good portion of the time.
Yes sravan, I heard that point too on a Developer 3 course I sat on. Also seem to remember that straight tables are quicker to calculate than table boxes. To give a table box-like appearance we just put '1' in the expression dialog and hid the column. Of course, with a staright table you lose the horizontal scroll-bar.
Great tips in this thread!
Another thing you may consider:
- don't include unnecessary details. A typical example is to read a date column which includes timestamp information from a sql database as it is, when you really are only interested in the date part of the column. Make sure to get rid of the time-part by using floor() or similar functionality before storing it in a qvd (if you are not interested in the time of course...)
This is on the same topic so I'll go ahead and ask it in here rather than making a new thread. I've got a 2.35 gb .csv with ~10 columns and something like 22 million rows that I load this from every day:
//already tried removing amount_due but it still ran out of memory without even trying
@2 as 'date',
sum(@3) as 'amount_billed',
sum(@4) as 'amount_due'
(txt, codepage is 1252, no labels, delimiter is ',', msq)
GROUP BY @2;
My laptop isn't very powerful (like 3 gb of RAM and 2ghz duo) so this usually takes about 15 minutes (the result only loads about... 1000 rows. If I tried to load all 22 million rows it would run out of memory). I figured my laptop could barely handle this and one day that file would get too big and I think that day has finally come.
Now when I try to reload (the EXACT same document I've been reloading for 6 months) it thinks for like 10 seconds and I get the out of memory error (for some reason it says allocating 512 mb when it usually says something like 2 or 4mb when it ran out of memory after working for a long time).
Does anyone have any tips at all on how to get this to reload (without getting a new computer)? If I created a .qvd on my home computer, I'm pretty sure my laptop could handle this easily but as is all I have is a .csv. This makes me think there could possible be some solution but I'm just not sure what that is (and I've thought about this for quite awhile...).
Any help would be beyond awesome.
Edit: I tried a FIRST 5 and a FIRST 50 LOAD with all 10 columns and they both worked fine. Then I tried a FIRST 500 LOAD and it instantly gave me the out of memory error (but allocating 256mb this time). I guarantee I could load a .csv with 5 million rows just fine right now, but for some reason since this one has 22 million it won't even try to load the first 500? I don't get it.
So, almost a year after this post was written and I'm adding another comment, only because I'm now dealing with what I consider to be high volumes of data.
I have a raw trade fact table of 250 million rows of data and the fact table is 210 columns wide and Yes, we've been through all of the "is this necessary" questions on the 210 columns and not a single one can be removed.
The reason for adding the comment is that I didn't see Aggregation in the above threads as a constructive method to improve performance, this can obviously only be applied in the right environment but in our case we are not interested in our management dashboard looking at trade level data so our aggregation in the ETL stage reduces those 250 million rows of data to approx 60m.
This was our golden step which produced a qvw document that could actually be used.
A number of other things have also been done, and bear in mind that I have a rather complex document of some 30+ sheets with 2500+ objects.
1. I didn't know this, but a variable that is defined with an = at the start of the definition is re-evaluated every time anything in the document changes, use a dollar expansion rather than an =.
2. I also didn't know that if you put a field event trigger on an "On Change" event, this is also re-evaluated every time anything in the document changes, so avoid this.
3. In this size of document, moving away from a star schema and using applymap functions to produce (as far as possible) a single very large fact table brought the document down to such a poor performance that it was unworkable, we reverted to a star schema but removed the snowflaking.
4. In this size of document, avoid any use of set analysis at all, and any use of IF(... in expressions, part of the reason for such a wide fact table is that we have approx 80 1/0 flags held in the fact to speed up expressions.
Hope this proves of some use to somebody...
Thank you very much for sharing. Very useful and it doesn't matter how long ago was this posted, because QlikView is going enterprise more and more every day. Actually, some of the steps in the product features, such as the Direct Discovery, are going this way. I'm sure yours will help a lot of people as well.
Just one question about your scenario: what client are your deploying? Are users going to use IE Plugin, Ajax, and if Ajax, mobile, desktop...? Take into account that event triggers do not work (as designed) in an Ajax environment. Actions will trigger (i. e.: when you click on a button), but not events.
Great suggestions Nigel.
But sometimes you need to use the "=" in variables to evaluate a conditions at the document level. I'm curious to know why this would have performance impact? Because QlikView takes in to account - cost of calculation while caching the data. And if this is expression has lot of overhead on the processor then it will be cached according to their proprietary caching algorthim. If the expression is simple then it will be calculated, and QlikView uses multi threading for processing the calcuations. But I still buy your points, as thumb rule we should use only when it's really required.
I'm really not sure about the issues relating to caching, but none of
the variable expressions I had with an equal sign at the start was in
any way complex, they were all very simple.
In the document size I'm dealing with, the impact of removing and
replacing with dollar expansion cannot be over-exaggerated, it was
massive, we had delays of many seconds just clicking on a button that
changed a variable and then produced a list box, the variable in
question was not one with an equal sign at the start but this is the
point. Every single click we made in the document went and re-evaluated
every variable with an equal sign at the start.
In lower size documents with much lower volumes of data I seriously
doubt that anybody would be able to tell that this was going on but when
you get to larger volumes you're forced to look into the smallest issues
to improve performance.
There are also a couple of points on this thread that I haven't yet
looked at, so I might be spending some time looking at those soon.......
disclaim.txt 2.1 K
Deepak, I work with large datasets all the time (100m upto 1b rows with 60/70 columns) and generally have no issue, understanding the end goal is the key you should easily be able to aggregate 35m rows down to achieve summary results.
Normal response time for queries on a 100m row document should be ~1 second, ensure your data is aggregated and standardised using set analysis where possible.
The key as Nigel said in part is finding what servers give you the optimal performance: -
Order By - Let the SQL server do it, this is extremely bad for QV on large datasets from my experience.
Aggregation and Group By see which performs better, generally I use QV because aggregation happens in a 2 tier extract.
Thanks Felim. My original question was posted 1.5 years back and I had not aggregated the data in that App. It works fine with 35m and no aggregation at all.
Thanks for your inputs.