Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I'm looking for some advice on a project I'm working on. Essentially I have multiple fact tables, and multiple dimensional/reference tables. Here's the problem: The reference tables are sometimes incomplete. They need information from the fact tables appended to them to have a complete list of the items.
Good example: I have two sets of sales data that have the Seller ID and Seller Name, and Customer ID and Customer Name. Because people are always coming and going, one sales data set has some seller IDs or Customer IDs, and the other set may have the other. The reference/dimensional table that is supposed to have a list of all seller IDs and Names may be missing the name or ID from one or both fact sets.
I can't build a reference/dimensional table from just the facts sets because the fact sets may be missing some of the information (IE may have a customer ID without a name). Any thoughts would be appreciated.
Hi Jordan,
If you attach sample data then it would be easier to understand and provide the solution.
Regards,
Jagan.
Hi,
I would suggest that instead of having facts, have dimensions. This way, the dimensions will have data associated to that dataset. EG: a person dimension will have; name, customerID, etc. The fact than will just have references to the dimensions as well as the measures that are associated. Basically, create a star-schema model.
Here is a design pattern for loading a dimension into a separate dimension table, including possible multiple sources, and handling non referenced id values. This option is good if the dimension contains many attributes:
// Load Fact table
Fact:
LOAD ...,
CustomerID,
...;
// Load first dimension source
// Not exists ensures that only one instance of each ID is loaded
DimCustomer:
LOAD CustomerID,
CustomerID As _CID,
CustomerName,
...
From Source1
WHERE Not Exists(_CID);
// If you have multiple sources for the dimension, load them like this
// Again, the not exists ensures that you load each ID only once, and
// if it was loaded for the frst source, it is not loaded again.
LOAD CustomerID,
CustomerID As _CID,
CustomerName,
...
FROM Source2
WHERE Not Exists(_CID);
// Now load the unknowns. These IDs were not found in the reference data
LOAD Distinct CustomerID,
'Unknown' As CustomerName
Resident Fact
Where Not Exists(_CID, CustomerID);
DROP Field _CID; // not needed any more
If you want to integrate the CustomerName (in the previous example) into the act table, then use this pattern instead. This option is good if you only loading one or two fields in the dimension.
// Load first dimension source
// Not exists ensures that only one instance of each ID is loaded
DimCustomer:
LOAD CustomerID,
CustomerID As _CID,
CustomerName
From Source1
WHERE Not Exists(_CID);
// If you have multiple sources for the dimension, load them like this
// Again, the not exists ensures that you load each ID only once, and
// if it was loaded for the frst source, it is not loaded again.
LOAD CustomerID,
CustomerID As _CID,
CustomerName
FROM Source2
WHERE Not Exists(_CID);
MapCustomers:
Mapping LOAD * Resident DimCustomer;
// Load Fact table
Fact:
LOAD ...,
CustomerID,
ApplyMap('MapCustomers', CustomerID, 'Unknown') As CustomerName,
...;
DROP Table DimCustomer;
This actually seems like exactly what I'm looking for. I will definitely try this out today and see if it works.