Not applicable

Big data load from a relational database

Hi,

I have a data model to load:

one central big table, T1 (30 million rows);

around this big table, several medium tables, T2, T3, ... T10 (1 million rows each), which are all linked to the central table by two key fields (let's call them Key1 and Key2).

I can load all these tables into QlikView without problems (except that the QlikView file is huge 🙂).


Now suppose there is a filter on the main big table T1 (I want to load only the T1 rows where field F1 = 2010; that's a simple WHERE clause).
I'd like to load from the T2, T3, ... T10 tables only the rows whose key (Key1, Key2) exists in the main big table T1.
What is the fastest way (in terms of reload time)?

1) Load the big table with the WHERE clause and then load the other tables with a LEFT KEEP (T1) instruction (sketched below)?
(But doesn't that mean I would load far too much data* from the relational database and only then make the selection in QlikView?)


2) Put the filter in the SELECT part, that is to say:
- load the big table T1: SQL SELECT * FROM T1 WHERE F1=2010;
- then load each of the other tables (T2, T3 ... T10) with, in its SQL SELECT part, a JOIN between that table and T1 (I fear this will take a long time in the relational database, and it's not easy to maintain);

3) Some other approach using QlikView's own possibilities (the Exists function?)?
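
For reference, option 1 would look roughly like this (a sketch, assuming the CONNECT statement has already been issued):

T1:
LOAD *;
SQL SELECT * FROM T1 WHERE F1 = 2010;

T2:
LEFT KEEP (T1)
LOAD *;
SQL SELECT * FROM T2;

// ... and the same LEFT KEEP (T1) pattern for T3 up to T10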

Any help is welcome.


* (Recall that with the LOAD ...; SQL SELECT ...; syntax, you can act on the data either in the relational database (the SQL SELECT part) or in QlikView (the LOAD part).)

* I'm working with SQL Server 2005 and QlikView 8.5 SR3.


6 Replies
johnw
Champion III

For highest performance, make sure everything is indexed properly and join everything in your SQL. Do nothing in QlikView except rename fields if you want new names. I'm thinking of something like this:

LOAD *;   // or list fields explicitly, renaming as needed: SourceField AS NewName
SQL SELECT
    ...   -- only the fields you want
FROM T1 T1
LEFT JOIN T2 T2 ON T2.KEY1 = T1.KEY1 AND T2.KEY2 = T1.KEY2
LEFT JOIN T3 T3 ON T3.KEY1 = T1.KEY1 AND T3.KEY2 = T1.KEY2
LEFT JOIN T4 T4 ON T4.KEY1 = T1.KEY1 AND T4.KEY2 = T1.KEY2
...
LEFT JOIN T10 T10 ON T10.KEY1 = T1.KEY1 AND T10.KEY2 = T1.KEY2
WHERE T1.F1 = 2010
;

QlikView's compression should take care of any inefficiency from repeating the data of T2-T10 across all the rows with the same keys in T1. It may use a lot of memory in your DBMS, however, unless your DBMS realizes that it can return your data one row at a time instead of needing to build a huge temporary table of its own. Mind you, if a large portion of the data is for 2010, it may well be more efficient for the DBMS to build a huge temporary table using table scans rather than indexed reads, but that's probably up to your DBMS to decide.

Note: If you're likely to be reusing the T1-T10 data across multiple applications, you may be better off just dumping each table in turn to a QVD (while performing any basic data cleanup functions), then doing optimized loads from the QVDs into your user applications. Straight scans of tables should be pretty efficient for your DBMS. Not likely as efficient as the above approach for a single application loading only a small portion of the overall data, but more efficient overall if you have multiple applications using the same data.
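
A minimal sketch of that extract-then-reload pattern (the QVD paths, and using F1 as the filter field, are assumptions; the single-field Exists() keeps the QVD load optimized):

// Extract layer: dump each source table straight to a QVD
// (repeat this block for T2 ... T10)
T1:
LOAD *;
SQL SELECT * FROM T1;
STORE T1 INTO T1.qvd (qvd);
DROP TABLE T1;

// User application: load only the 2010 rows while staying optimized
YearFilter:
LOAD * INLINE [
F1
2010
];

T1:
LOAD * FROM T1.qvd (qvd)
WHERE Exists(F1);        // a single-field Exists() preserves the optimized load

DROP TABLE YearFilter;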

Not applicable
Author

Hello Bertrand, the most efficient way I see is to load T1 first with the SQL WHERE clause and to build a new column in the QlikView LOAD, probably something like "hash128(Key1, Key2) as Filter". Then you can load the other tables with a statement like "LOAD A, B, C, hash128(Key1, Key2) as Filter FROM T2 ... Tn WHERE Exists(Filter, hash128(Key1, Key2));"; see the sketch below.
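
A minimal sketch of this approach (A, B, C are placeholder field names; the second argument of Exists() computes the hash of the row currently being read):

T1:
LOAD *, hash128(Key1, Key2) AS Filter;
SQL SELECT * FROM T1 WHERE F1 = 2010;

T2:
LOAD A, B, C,
     hash128(Key1, Key2) AS Filter
WHERE Exists(Filter, hash128(Key1, Key2));  // keep only keys already seen in T1
SQL SELECT * FROM T2;

// ... same pattern for T3 up to T10; the shared Filter field also
// becomes the link key between the tables in the QlikView model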

Best Regards, hope this helps.

Not applicable
Author

Hi,

thanks for your suggestions,

I have a few questions for John:

Are you suggesting loading all the data (from T1, T2, ... T10) into a single big QlikView table?
I'm just surprised, because that would suggest the best way to load data into QlikView is to put everything in one table?
Are there conditions for this (for example, does it only work when the relation between the big table and the others is 0..1, with no 1..n tables)?
Or maybe I've misunderstood, and it's possible to load several QlikView tables from a single SQL SELECT statement?

Best regards

johnw
Champion III

Yes, I'm suggesting that you load it all into one big table in QlikView, and use SQL to do all the hard work for the highest performance.

But I'm NOT suggesting that the best way to load data into QlikView IN GENERAL is to put everything into one table. Generally speaking, assuming your helper tables (T2-T10) have more than a few fields, I would load everything in with a star schema approach, keeping the helper tables separate, linked only by their keys. But I'm not seeing a way to do that in your case without compromising reload time, and I'm confident in QlikView's ability to compress your data down to avoid the memory problem normally associated with denormalized data.

I DO often combine tables together when my helper tables are small, such as only having a key, a description, and a couple of additional fields. In that case, I'll usually just left join the helper tables onto the big table.

More specifically, if I have something like this in my source system:

Orders: Order ID, Customer ID, Product ID
Customers: Customer ID, Customer Name
Products: Product ID, Product Description, Packaging ID
Packaging: Packaging ID, Packaging Description

I will often join everything into one big table in QlikView, typically dropping the linking IDs since they often won't mean anything to a user, or will at least mean less than the descriptions:

Orders: Order ID, Customer Name, Product Description, Packaging Description
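
In script form, that pattern would look roughly like this (a sketch; field names are adapted from the example above, and the SELECTs are assumptions):

Orders:
LOAD OrderID, CustomerID, ProductID;
SQL SELECT * FROM Orders;

LEFT JOIN (Orders)
LOAD CustomerID, CustomerName;
SQL SELECT * FROM Customers;

LEFT JOIN (Orders)
LOAD ProductID, ProductDescription, PackagingID;
SQL SELECT * FROM Products;

LEFT JOIN (Orders)
LOAD PackagingID, PackagingDescription;
SQL SELECT * FROM Packaging;

// The linking IDs rarely mean anything to users, so drop them once all joins are done
DROP FIELDS CustomerID, ProductID, PackagingID FROM Orders;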

Not applicable
Author

OK,

it's clearer now.

For your information, I've tried the four methods on a 10 million row table (T1), just to get an idea:

1) LEFT KEEP
reload time: 1 min 12 s, qvw size: 24 MB

2) One SQL SELECT instruction per QlikView table
reload time: 48 s, qvw size: 24 MB

3) Filter with the Exists function (Ivan's suggestion)
reload time: 1 min 26 s, qvw size: 24 MB

4) Big merge (John's suggestion)
reload time: 27 s, qvw size: 13 MB

So:

The big merge is the fastest (in this case). To be precise, my complete model is more complex, but I used this solution on a sub-part where the cardinalities between the helper tables (T2 ... T10) and T1 are 0..1 (i.e. one row of T1 relates to at most one row of each helper table), and it works perfectly. I suppose (if I understand John's posts correctly) that the reload time is better because I issue only one SQL request instead of 10, and I think the qvw is smaller because there are fewer QlikView tables: no more link tables, no need to keep IDs between tables, and QlikView's data compression does the rest.

Using one SQL SELECT instruction per QlikView table may be useful in other cases (when a star schema is the better choice). It's slower here; I think that's because the DBMS has to re-read the big table T1 for each helper table it loads (to execute the WHERE clause).


The LEFT KEEP and Exists-filter solutions are the slowest. I suppose this is because the data selection happens on the QlikView side, which means far too much data is first transferred from the DBMS to QlikView, and only then is the selection made.

Remarks: my conclusions are guesses and suppositions; I'm not a QlikTech engine R&D expert 🙂. The values (reload time and .qvw size) are the results for my particular case and may differ in more general cases. They come from just a few runs, not from a statistically significant number of tests 🙂.

Thanks for your help

Best regards

johnw
Champion III

I would not have expected my approach to yield a smaller size. It's presumably the same amount of information, so a perfect compression algorithm should squeeze the information to exactly the same size. No compression algorithm is perfect, but I would expect QlikView's to be good enough that we wouldn't see one approach taking only half the memory of another approach. This worries me, as it may be that my approach is somehow dropping data that you should have, or the other approaches are somehow reading data that you shouldn't have. But maybe everything's fine. I've seen this sort of difference before without it indicating any problems.

As far as why the big merge takes less time, it isn't really so much that it's one SQL request instead of 10. It's how that one SQL request is structured to read the minimum number of rows from your source tables. Let's say that your filter corresponds to 20% of the rows from T1, and the resulting keys correspond to 20% of the rows for T2-T10.

With the big merge, the DBMS only has to return 20% of all of the rows, so if T1 has 10 million, and T2-T10 have 1 million each, it's returning 20%*(10m + 9*1m) = 3.8 million rows.

With the left keep and the filter, the DBMS only has to return 20% of T1, but ALL of T2-T10, and then QlikView discards the 80% of rows from them that it doesn't need. So the DBMS has to retrieve 20%*10m + 9*1m = 11 million rows.

If I understand what you did in the "One SQL SELECT instruction per QlikView table" method, you issued one SQL statement to load T1, and then each SQL statement for T2-T10 was a merge of T1 with the relevant helper table to make sure you got only the right rows? In one sense, that's the worst approach, because you retrieve 20%*10m + 9*(20%*10m + 20%*1m) = 21.8 million rows. But that's a vast oversimplification of what's actually happening in the DBMS. Your DBMS is almost certainly caching the T1 rows, and then pulling directly from this cache when it recognizes that you've issued the exact same "where" conditions in the subsequent selects. So it might be 21.8m rows, but 9*(20%*10m) = 18m of those rows are being pulled from a cache, and only the remaining 3.8 million rows are being read from the table. So you would get performance much closer to the big merge, and the extra 21 seconds (48 s versus 27 s) are probably mostly the time to pull those 18m rows from the cache.
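
Putting the three cases side by side under the same 20% assumption (T1 = 10m rows, T2-T10 = 1m rows each):

Big merge:               20%*(10m + 9*1m) = 3.8m rows read
Left keep / Exists:      20%*10m + 9*1m = 11m rows read
One SELECT per table:    20%*10m + 9*(20%*10m + 20%*1m) = 21.8m rows returned
                         (of which 9*(20%*10m) = 18m probably come from the cache)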

That's my understanding of it, anyway.