Measure Percentage of Duplicated Fields

DrB1 · ‎2024-03-20

I have these 10 data fields and would like to create a formula to identify potential duplicates with an index score of how strong the match is between any 2 records. For example if 10 of 10 match the two records would be flagged as 100% match. Here are the fields.

IndividualId (UniqueID),
FirstName,
MiddleName,
LastName,
PreferredName,
BirthDate,
PrimaryEmail,
Home_PhoneNumber,
Mobile_PhoneNumber,
Address1_PrimaryAddress

I would like to do this in a Qlik table during the load script if possible. I've tried, and tried, and tried, this concept of creating the Individuals_Data table, a second table with same fields as in IndividualIdDup, FirstNameDup, ...etc. and then trying to join the tables with a where clause (where IndividualId<>IndividualIdDup). I've tried that in several different ways, but the where clause always fails - unable to find IndividualId or whichever field used to keep it from potential duplicate matching of a record to itself.

Tried it with record numbers or row numbers using similar where clause, but it fails. If anyone has an idea, I would love it!

// Load your data
Individuals_Data:
LOAD
RowNo() as RecNo,
IndividualId,
FirstName,
MiddleName,
LastName,
PreferredName,
BirthDate,
PrimaryEmail,
Home_PhoneNumber,
Mobile_PhoneNumber,
Address1_PrimaryAddress
FROM [myfile]
(txt, utf8, embedded labels, delimiter is ',', msq);

// Use Resident load to compare each record with all other records
DuplicateCheck:
LOAD
RowNo() as RecNo_Dup,
IndividualId as IndividualIdDup,
FirstName as FirstNameDup,
MiddleName as MiddleNameDup,
LastName as LastNameDup,
PreferredName as PreferredNameDup,
BirthDate as BirthDateDup,
PrimaryEmail as PrimaryEmailDup,
Home_PhoneNumber as Home_PhoneNumberDup,
Mobile_PhoneNumber as Mobile_PhoneNumberDup,
Address1_PrimaryAddress as Address1_PrimaryAddressDup
Resident Individuals_Data;

// Join the original data with the DuplicateCheck table to compare each record with its duplicates
Join (Individuals_Data)
LOAD
RecNo,
IndividualIdDup,
FirstNameDup,
MiddleNameDup,
LastNameDup,
PreferredNameDup,
BirthDateDup,
PrimaryEmailDup,
Home_PhoneNumberDup,
Mobile_PhoneNumberDup,
Address1_PrimaryAddressDup,
RecNo_Dup
Resident DuplicateCheck
Where RecNo <> RecNo_Dup and (
IndividualId <> IndividualIdDup or
FirstName <> FirstNameDup or
MiddleName <> MiddleNameDup or
LastName <> LastNameDup or
PreferredName <> PreferredNameDup or
BirthDate <> BirthDateDup or
PrimaryEmail <> PrimaryEmailDup or
Home_PhoneNumber <> Home_PhoneNumberDup or
Mobile_PhoneNumber <> Mobile_PhoneNumberDup or
Address1_PrimaryAddress <> Address1_PrimaryAddressDup
);

// Calculate the percentage of matching fields
Individuals_Data:
LOAD
*,
(
(If(IndividualId = IndividualIdDup, 1, 0) +
If(FirstName = FirstNameDup, 1, 0) +
If(MiddleName = MiddleNameDup, 1, 0) +
If(LastName = LastNameDup, 1, 0) +
If(PreferredName = PreferredNameDup, 1, 0) +
If(BirthDate = BirthDateDup, 1, 0) +
If(PrimaryEmail = PrimaryEmailDup, 1, 0) +
If(Home_PhoneNumber = Home_PhoneNumberDup, 1, 0) +
If(Mobile_PhoneNumber = Mobile_PhoneNumberDup, 1, 0) +
If(Address1_PrimaryAddress = Address1_PrimaryAddressDup, 1, 0)
) / 10
) * 100 as MatchPercentage
Resident Individuals_Data;

// Drop intermediate tables
Drop Table DuplicateCheck;

// Output the final data
Data_Final:
LOAD *
Resident Individuals_Data;

// Drop the original Data table
Drop Table Individuals_Data;

marcus_sommer · ‎2024-03-21

I think I would go another way with checking the field-values with interrecord-functions like peek() or previous() within a sorted resident load, like:

load *,
rangesum(-(F1=previous(F1)), -(F2=previous(F2)), -(F3=previous(F3)), ...) / 10 as Flag
resident X order by F1, F2, F3, ...;

It needs a pre-load within Qlik to be able to define an order by statement which needs to be suitable.

Simpler than this may be just to compare the PrimaryMail because mail-addresses are unique unless you have really a duplicated record.

DrB1 · ‎2024-03-26

These records contain families so mail or email or phone is not unique. I have not worked with interrecord-functions before, so I will need to look into that. What happens here is that persons can register for various events as a "guest" in our system, which immediately assigns a new IndividualId to someone who may already be in the system records. I'm looking for a way to prioritize the review based on percent of matching fields.

Data Load Editor

Qlik Sense

Script