JuMacedo
Partner - Contributor III

Delete Duplicates with distinct IDs

Hi!

I thought I'd ask if anyone can think of a solution to my problem here. Since I'm new to Qlik, I'm not sure whether there is a function that can do this job.

I have a dataset with many duplicated records, but every record has a different Id. I'd like to delete the duplicates and keep just the first record of each group of duplicates.

In this example, rows 2-6 are duplicated.

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 3 | 658a5c90 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 4 | e12f4755 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 5 | f0770ed2 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 6 | a107b4e0 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

 What I'd like to see instead is:

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

Is there a way to achieve this? I'm attaching a small sample of the data as well, in case someone would like to try a solution.

I tried Load Distinct, but it doesn't work, since every row has a distinct Id.
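
Roughly, what I tried was something like this (only the source placeholder differs):

// Distinct compares whole rows, so rows that differ only in Id
// (and Row_number) still count as distinct and are all kept.
Tablename:
Load Distinct *
From [YOUR SOURCE HERE];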

 

 


12 Replies
igoralcantara
Partner - Creator II

Hello and welcome to Qlik. This is how I would do it:

// Step 1: keep one copy of each distinct combination (without Row_number and Id)
Tablename:
Load Distinct
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE];


// Step 2: attach the smallest Row_number of each combination
Inner Join(Tablename)
Load Distinct
Min(Row_number) as Row_number,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;

// Step 3: join back on Row_number to recover the matching Id
Inner Join(Tablename)
Load
Row_number,
Id
From [YOUR SOURCE HERE];

----
datavoyagers.net
rubenmarin

Hi, after loading the first table you can do an inner join to keep only the lowest Row_number of each combination, like:

Inner Join ([DataTableName])
LOAD Min(Row_number) as Row_number
Resident [DataTableName]
// "all fields except Id" (and the aggregated Row_number) spelled out:
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time],
         DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

 

Clement15
Creator

Hello, could this work?

 

Load
max(Row_number) as Row_number,
max(Id) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From ...
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time], DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

igoralcantara
Partner - Creator II

This would not work, because the max Row_number does not always belong to the same record as the max Id. Also, Id is not a number, so Max() does not work on it; it would have to be MaxString() instead.
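
As a quick illustration, a hypothetical check table, run against any table that holds the sample Ids:

// Max() ignores non-numeric values, so it returns NULL for these Ids;
// MaxString() compares them as text instead.
Check:
Load
Max(Id) as MaxId,          // NULL: values like 'bb945103' are not numeric
MaxString(Id) as MaxStrId  // 'fc6fea72' for the sample Ids
Resident Tablename;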

----
datavoyagers.net
marcus_sommer

For the usual view requirements you won't need Row_number and Id, and therefore the simplest approach would be just to remove them and load the data distinct.

Another approach might be to combine the relevant fields within a first load, like:

F1 & '|' & F2 & '|' & F3 ... as Key

and within a second one apply something like:

exists(Key) as Flag

and a third load may filter on the flag with: where Flag = 0;
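
Put together, a minimal sketch of that idea (field names taken from the sample; the key field is illustrative). In practice the check is usually folded into a single pass, since exists() is evaluated row by row while loading: the first occurrence of each key passes and every later one is filtered out.

// Keep only the first occurrence of each combined key.
// Drop the Key field afterwards if it isn't needed.
Deduped:
Load
*,
[Start Station] & '|' & [Departure Time] & '|' & [Final Station] & '|' &
[Arrival Time] & '|' & DepartureTime_Retur & '|' & ArrivalTime_Retur & '|' &
[Count Changes] & '|' & score as Key
From [YOUR SOURCE HERE]
Where not Exists(
Key,
[Start Station] & '|' & [Departure Time] & '|' & [Final Station] & '|' &
[Arrival Time] & '|' & DepartureTime_Retur & '|' & ArrivalTime_Retur & '|' &
[Count Changes] & '|' & score
);

Drop Field Key;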

stefanmereuta
Partner - Contributor
Accepted Solution

This can be done in one load (adapting from @igoralcantara's answer):

 

Tablename:
Load
// keep the smallest Row_number of each duplicate group...
Min(Row_number) as Row_number,
// ...and the Id that belongs to that smallest Row_number
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
// group by everything except Row_number and Id, so exact duplicates collapse
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;
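
To see it work end to end, here is a self-contained sketch against inline data reproducing the sample rows from the question (table names are illustrative):

Sample:
Load * Inline [
Row_number, Id, Start Station, Departure Time, Final Station, Arrival Time, DepartureTime_Retur, ArrivalTime_Retur, Count Changes, score
1, bb945103, City1, 07:09:00, City2, 12:38:00, 16:07:00, 21:46:00, 1, -16860
2, 3d63b7b4, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
3, 658a5c90, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
4, e12f4755, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
5, f0770ed2, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
6, a107b4e0, City1, 07:09:00, City2, 12:38:00, 21:07:00, 06:45:00, 2, -48240
7, fc6fea72, City1, 07:09:00, City2, 12:38:00, 21:07:00, 05:42:00, 2, -44460
];

// NoConcatenate stops Qlik from auto-appending this load to Sample,
// which it would otherwise do because the field names are identical.
Deduped:
NoConcatenate
Load
Min(Row_number) as Row_number,
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
Resident Sample
Group By
[Start Station], [Departure Time], [Final Station], [Arrival Time],
DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

Drop Table Sample;

// Deduped now holds three rows: Row_number 1, 2 and 7, with Ids
// bb945103, 3d63b7b4 and fc6fea72.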

 

 

JuMacedo
Partner - Contributor III
Author

Hi Marcus, thanks for the answer. I'll need the Id, as it will be used to filter a table that comes next; the Row_number, though, won't be needed. So I guess this solution might not work.

marcus_sommer

You may consider changing the order of script statements within your ETL chain and/or extracting the Ids into a separate table to apply them in any mapping/exists approaches.
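
A minimal sketch of that extraction (table and source names are illustrative; Tablename is the deduplicated table from the accepted solution):

// Collect the surviving Ids in a one-field table...
KeptIds:
Load Distinct Id
Resident Tablename;

// ...then restrict the next load to those Ids. Exists(Id) keeps a row
// only if its Id value has already been loaded into the field Id.
NextTable:
Load *
From [YOUR NEXT SOURCE HERE]
Where Exists(Id);

Drop Table KeptIds;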

Besides this, the Ids are all different - how will you ensure that the needed ones are removed/kept? Or do all the Ids exist within the other source too - and if so, which ones are taken from there?

Therefore I suggest reviewing the sources and the data quality first, before looking for technical solutions.

JuMacedo
Partner - Contributor III
Author

Thank you for the contributions, both @igoralcantara and you. I decided to go with this solution as it seemed to be the simplest, with just one load. But I guess @igoralcantara's would have worked as well, since you adapted the script from his idea.
This worked, so thanks!