JuMacedo
Partner - Contributor III

Delete Duplicates with distinct IDs

Hi!

I thought I'd ask if anyone can think of a solution to my problem here. Since I'm new to Qlik, I'm not sure whether there is a function that can do this job.

I have a dataset with many duplicated records, but every record has a different Id. I'd like to delete the duplicates and keep just the first record of each group of duplicates.

In this example, rows 2-6 are duplicated.

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 3 | 658a5c90 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 4 | e12f4755 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 5 | f0770ed2 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 6 | a107b4e0 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

 What I'd like to see instead is:

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

Is there a way to achieve this? I'm attaching a small sample of the data as well, in case someone would like to try a solution.

I tried Load Distinct, but it doesn't work, since every row has a distinct Id.
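
Roughly, what I tried was something like this (only the source placeholder differs):

// Distinct compares whole rows, so rows that differ only in Id
// (and Row_number) still count as distinct and are all kept.
Tablename:
Load Distinct *
From [YOUR SOURCE HERE];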

 

 


12 Replies
igoralcantara
Partner - Creator II

Hello and welcome to Qlik. This is how I would do it:

// Step 1: keep one copy of each distinct combination (without Row_number and Id)
Tablename:
Load Distinct
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE];


// Step 2: attach the smallest Row_number of each combination
Inner Join(Tablename)
Load Distinct
Min(Row_number) as Row_number,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;

// Step 3: join back on Row_number to recover the matching Id
Inner Join(Tablename)
Load
Row_number,
Id
From [YOUR SOURCE HERE];

----
datavoyagers.net
rubenmarin

Hi, after loading the first table you can do an inner join to keep only the lowest Row_number of each combination, like:

Inner Join ([DataTableName])
LOAD Min(Row_number) as Row_number
Resident [DataTableName]
// "all fields except Id" (and the aggregated Row_number) spelled out:
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time],
         DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

 

Clement15
Creator

Hello, could this work?

 

Load
max(Row_number) as Row_number,
max(Id) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From ...
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time], DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

igoralcantara
Partner - Creator II

This would not work, because the max Row_number does not always belong to the same record as the max Id. Also, Id is not a number, so Max() does not work on it; it would have to be MaxString() instead.
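
As a quick illustration, a hypothetical check table, run against any table that holds the sample Ids:

// Max() ignores non-numeric values, so it returns NULL for these Ids;
// MaxString() compares them as text instead.
Check:
Load
Max(Id) as MaxId,          // NULL: values like 'bb945103' are not numeric
MaxString(Id) as MaxStrId  // 'fc6fea72' for the sample Ids
Resident Tablename;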

----
datavoyagers.net
marcus_sommer

For the usual view requirements you won't need Row_number and Id, and therefore the simplest approach would be just to remove them and load the data distinct.

Another approach might be to combine the relevant fields within a first load, like:

F1 & '|' & F2 & '|' & F3 ... as Key

and within a second one apply something like:

exists(Key) as Flag

and a third load may filter on the flag with: where Flag = 0;
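
Put together, a minimal sketch of that idea (field names taken from the sample; the key field is illustrative). In practice the check is usually folded into a single pass, since exists() is evaluated row by row while loading: the first occurrence of each key passes and every later one is filtered out.

// Keep only the first occurrence of each combined key.
// Drop the Key field afterwards if it isn't needed.
Deduped:
Load
*,
[Start Station] & '|' & [Departure Time] & '|' & [Final Station] & '|' &
[Arrival Time] & '|' & DepartureTime_Retur & '|' & ArrivalTime_Retur & '|' &
[Count Changes] & '|' & score as Key
From [YOUR SOURCE HERE]
Where not Exists(
Key,
[Start Station] & '|' & [Departure Time] & '|' & [Final Station] & '|' &
[Arrival Time] & '|' & DepartureTime_Retur & '|' & ArrivalTime_Retur & '|' &
[Count Changes] & '|' & score
);

Drop Field Key;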

stefanmereuta
Partner - Contributor
Accepted Solution

This can be done in one load (adapting from @igoralcantara's answer):

 

Tablename:
Load
// keep the smallest Row_number of each duplicate group...
Min(Row_number) as Row_number,
// ...and the Id that belongs to that smallest Row_number
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
// group by everything except Row_number and Id, so exact duplicates collapse
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;
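
To see it work end to end, here is a self-contained sketch against inline data reproducing the sample rows from the question (table names are illustrative):

Sample:
Load * Inline [
Row_number, Id, Start Station, Departure Time, Final Station, Arrival Time, DepartureTime_Retur, ArrivalTime_Retur, Count Changes, score
1, bb945103, City1, 07:09:00, City2, 12:38:00, 16:07:00, 21:46:00, 1, -16860
2, 3d63b7b4, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
3, 658a5c90, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
4, e12f4755, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
5, f0770ed2, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
6, a107b4e0, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
7, fc6fea72, City1, 07:09:00, City2, 12:38:00, 21:07:00, 05:42:00, 2, -44460
];

// NoConcatenate stops Qlik from auto-appending this load to Sample,
// which it would otherwise do because the field names are identical.
Deduped:
NoConcatenate
Load
Min(Row_number) as Row_number,
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
Resident Sample
Group By
[Start Station], [Departure Time], [Final Station], [Arrival Time],
DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

Drop Table Sample;

// Deduped now holds three rows: Row_number 1, 2 and 7, with Ids
// bb945103, 3d63b7b4 and fc6fea72.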

 

 

JuMacedo
Partner - Contributor III
Author

Hi Marcus, thanks for the answer. I'll need the Id, as it will be used to filter a table that comes next; the Row_number, though, won't be needed. So I guess this solution might not work.

marcus_sommer

You may consider changing the order of script statements within your ETL chain and/or extracting the Ids into a separate table to apply them in any mapping/exists approaches.
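
A minimal sketch of that extraction (table and source names are illustrative; Tablename is the deduplicated table from the accepted solution):

// Collect the surviving Ids in a one-field table...
KeptIds:
Load Distinct Id
Resident Tablename;

// ...then restrict the next load to those Ids. Exists(Id) keeps a row
// only if its Id value has already been loaded into the field Id.
NextTable:
Load *
From [YOUR NEXT SOURCE HERE]
Where Exists(Id);

Drop Table KeptIds;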

Besides this, the Ids are all different - how will you ensure that the needed ones are removed/kept? Or do all the Ids exist within the other source too - and if so, which ones are taken from there?

Therefore I suggest reviewing the sources and the data quality first, before looking for technical solutions.

JuMacedo
Partner - Contributor III
Author

Thank you for the contributions, both @igoralcantara and you. I decided to go with this solution as it seemed to be the simplest, with just one load. But I guess @igoralcantara's would have worked as well, since you adapted the script from his idea.
This worked, so thanks!