iti-attunity-sup
Partner - Creator

Performance issue with Qlik Replicate

Hello,

I have a question regarding a customer's problem.

Our customer has been encountering a performance issue for several months.

The environment:

- 3 tasks are running on the Qlik Replicate server
- Source DB is Oracle, target is Azure Synapse
- Each task connects to a *different* source Oracle database

None of the three tasks seems to capture anything from the redo logs for a certain period of time (about 1 hour) almost every day,
even though many archive logs are generated during this time.

According to the task log with the PERFORMANCE logging component set to Trace, the source latency just keeps increasing.

00006620: 2024-04-04T05:05:20 [PERFORMANCE ]T: Source latency 5098.19 seconds, Target latency 5215.19 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:05:50 [PERFORMANCE ]T: Source latency 5128.31 seconds, Target latency 5245.31 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:06:21 [PERFORMANCE ]T: Source latency 5159.56 seconds, Target latency 5159.56 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:06:51 [PERFORMANCE ]T: Source latency 5189.67 seconds, Target latency 5189.67 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:07:21 [PERFORMANCE ]T: Source latency 5219.84 seconds, Target latency 5219.84 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:07:57 [PERFORMANCE ]T: Source latency 5255.15 seconds, Target latency 5255.15 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
:
00006620: 2024-04-04T06:01:12 [PERFORMANCE ]T: Source latency 8447.67 seconds, Target latency 8448.67 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:01:42 [PERFORMANCE ]T: Source latency 8477.82 seconds, Target latency 8478.82 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:02:12 [PERFORMANCE ]T: Source latency 8507.92 seconds, Target latency 8508.92 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:02:43 [PERFORMANCE ]T: Source latency 8538.53 seconds, Target latency 8539.53 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:03:13 [PERFORMANCE ]T: Source latency 8568.64 seconds, Target latency 8569.64 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:03:43 [PERFORMANCE ]T: Source latency 8598.76 seconds, Target latency 8599.76 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:04:13 [PERFORMANCE ]T: Source latency 8628.86 seconds, Target latency 8629.86 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:04:43 [PERFORMANCE ]T: Source latency 8650.97 seconds, Target latency 8658.97 seconds, Handling latency 8.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:05:14 [PERFORMANCE ]T: Source latency 8676.09 seconds, Target latency 8689.09 seconds, Handling latency 13.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:05:44 [PERFORMANCE ]T: Source latency 8692.21 seconds, Target latency 8716.21 seconds, Handling latency 24.00 seconds (replicationtask.c:3403)


The CPU usage of the Qlik Replicate server is very low during this time, and the
CPU usage of the Oracle server is not that high either.

I have no idea why Qlik Replicate does not make any progress at all during this period.

Has anyone been in a similar situation?

What bottleneck could cause such a situation?
What would be helpful to check?
Any ideas would be appreciated.

Regards,

Kyoko Tajima

8 Replies
john_wang
Support

Hello Tajima-san @iti-attunity-sup ,

Thanks for reaching out to Qlik Community!

If the behavior happens periodically or at a specific time, then an uncommon job was probably run on the Oracle side. For example, a big transaction (whether it is finally committed or rolled back) may have generated a huge amount of transaction log, and Qlik Replicate was impacted by the large redo log entries. You might check Monitor --> Change Processing --> Incoming Changes --> Accumulating (n transactions) (until source commit). Please also check with the DBA whether any housekeeping jobs ran during the period.
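
A minimal sketch of such a check, assuming the DBA can query the dynamic performance views: it lists the transactions that are still open on the source, together with the sessions that own them. A very old START_TIME or large undo usage during the slow window would point at this kind of big transaction.

-- open transactions and the sessions that own them (requires access to the v$ views)
select s.sid, s.serial#, s.username, s.program,
       t.start_time, t.used_ublk as undo_blocks, t.used_urec as undo_records
from v$transaction t
join v$session s on s.taddr = t.addr
order by t.start_time;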

Also, please read through the article Latency / Performance Troubleshooting and Tuning for Replicate for troubleshooting guidance.

If you still need further investigation, the possibilities include:

  • Latency between the primary and standby nodes, if a standby node acts as the Replicate source endpoint (see the query sketch after this list)
  • The 3 databases running on the same VM or hardware, so that they impact each other
  • Network traffic, especially if the databases are in the cloud
  • Data-related causes, e.g. the changes in that period are updates of large columns or involve LOB columns
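
If a standby is used as the Replicate source endpoint, a quick check for the first point is to query the Data Guard statistics on that standby. This is only a sketch and assumes a Data Guard standby configuration:

-- run on the standby that Replicate reads from
select name, value, time_computed
from v$dataguard_stats
where name in ('transport lag', 'apply lag');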

Hope this helps.

John.

Heinvandenheuvel
Specialist II

Interesting. Pretty bad latency to start with, and then it looks like it is losing 30 seconds every 30 seconds during the bad periods. Yikes.

Now, how much volume / how many changes are coming in? Towards the end of the provided sample we see source latency no longer getting worse quite as quickly ('only' 25 seconds every 30 🙂), and the increase in handling time perhaps suggests that there was a lot of work to be done.

>>> Each task connects to *different* source oracle database

Can you provide more detail? Different source DB servers? Different PDBs in a single container? I ask because with pluggable databases the redo/archive belongs to the container and is shared by all, so if one slows down, all of them would, even though they are believed to be independent. Log Stream is not smart enough (yet??) to deal with that - I think that theoretically it could.
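
If it helps, a quick sketch (assuming you can query each source database) to verify whether the sources are PDBs sharing one container:

-- is this database a container database, and which PDBs does it host?
select name, cdb from v$database;
select con_id, name, open_mode from v$pdbs;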

What versions of Oracle and Replicate are in play? I thought the PERFORMANCE trace would also give you Oracle redo/archive read details - my Replicate 2023.05 does.

12:19:20 [PERFORMANCE     ]T:  Completed to read from online Redo log 32256 bytes...
12:19:20 [PERFORMANCE     ]T:  Start read from online Redo log 32256 bytes
12:19:20 [PERFORMANCE     ]T:  Completed to read from online Redo log 32256 bytes
12:19:25 [PERFORMANCE     ]T:  Source latency 5.80 seconds, Target latency 5.80 sec...

That might give you further clues.

And... work with your DBA (is that you? :-). Have them report on what they think Replicate is doing / asking to be done.

Hein.

iti-attunity-sup
Partner - Creator
Author

Hello John

Thank you for your reply.

I will check the documentation and what you have suggested above.

Regards,
Kyoko Tajima

iti-attunity-sup
Partner - Creator
Author

@Heinvandenheuvel 

Hello Hein

Thank you for your reply.

The environment is Exadata and each database is an 8-node RAC.

I will check the disk layout, but I am still wondering about the lack of progress only during certain hours.

Each database generated a large amount of archive logs during batch processing, and latency was gradually getting higher at that time.

For example, on Apr 4th at 03:00 the source latency was about 200 seconds:

00006620: 2024-04-04T03:00:28 [PERFORMANCE ]T: Source latency 209.59 seconds, Target latency 209.59 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T03:00:58 [PERFORMANCE ]T: Source latency 239.69 seconds, Target latency 239.69 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)

Then it reached about 5,000 seconds at 05:05, as I mentioned before:

00006620: 2024-04-04T05:05:20 [PERFORMANCE ]T: Source latency 5098.19 seconds, Target latency 5215.19 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:05:50 [PERFORMANCE ]T: Source latency 5128.31 seconds, Target latency 5245.31 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)

Although I can see messages like 'Completed to read from archived Redo log ...' until 05:03:57,
these messages suddenly stop appearing after that.
I am wondering why SOURCE_CAPTURE was not progressing at all during only this period.

00006832: 2024-04-04T05:03:55 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 000000000f423a00 for requested offset 000000000f423c00, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T05:03:56 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 000000000f423a00 with rc 1, read time is 604 ms, thread '5' (oradcdc_redo.c:1042)
00006832: 2024-04-04T05:03:56 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 00000000124f7800 for requested offset 00000000124f7a00, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T05:03:56 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 00000000124f7800 with rc 1, read time is 580 ms, thread '5' (oradcdc_redo.c:1042)
00006832: 2024-04-04T05:03:56 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 00000000155cb600 for requested offset 00000000155cb800, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T05:03:57 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 00000000155cb600 with rc 1, read time is 567 ms, thread '5' (oradcdc_redo.c:1042)
00006620: 2024-04-04T05:04:20 [PERFORMANCE ]T: Source latency 5038.01 seconds, Target latency 5155.01 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:04:50 [PERFORMANCE ]T: Source latency 5068.10 seconds, Target latency 5185.10 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:05:20 [PERFORMANCE ]T: Source latency 5098.19 seconds, Target latency 5215.19 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:05:50 [PERFORMANCE ]T: Source latency 5128.31 seconds, Target latency 5245.31 seconds, Handling latency 117.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:06:21 [PERFORMANCE ]T: Source latency 5159.56 seconds, Target latency 5159.56 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:06:51 [PERFORMANCE ]T: Source latency 5189.67 seconds, Target latency 5189.67 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:07:21 [PERFORMANCE ]T: Source latency 5219.84 seconds, Target latency 5219.84 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006832: 2024-04-04T05:07:50 [SOURCE_CAPTURE ]I: Change captured data object id of table 'ACCOUNT.ALGO_STOCK_ORD' (object id '338583') from '12538548' to '12553778' (oracle_endpoint_capture.c:1938)
00006620: 2024-04-04T05:07:57 [PERFORMANCE ]T: Source latency 5255.15 seconds, Target latency 5255.15 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T05:08:27 [PERFORMANCE ]T: Source latency 5285.26 seconds, Target latency 5285.26 seconds, Handling latency 0.00 seconds (replicationtask.c:3403)
:
00006620: 2024-04-04T06:03:13 [PERFORMANCE ]T: Source latency 8568.64 seconds, Target latency 8569.64 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:03:43 [PERFORMANCE ]T: Source latency 8598.76 seconds, Target latency 8599.76 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T06:04:13 [PERFORMANCE ]T: Source latency 8628.86 seconds, Target latency 8629.86 seconds, Handling latency 1.00 seconds (replicationtask.c:3403)
00006836: 2024-04-04T06:04:17 [TARGET_APPLY ]I: ODBC driver version: '11.00.6020' (ar_odbc_conn.c:612)
00006832: 2024-04-04T06:04:17 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 0000000027ac2a00 for requested offset 0000000027ac2c00, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T06:04:18 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 0000000027ac2a00 with rc 1, read time is 671 ms, thread '5' (oradcdc_redo.c:1042)
00006832: 2024-04-04T06:04:19 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 000000002ab96800 for requested offset 000000002ab96a00, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T06:04:20 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 000000002ab96800 with rc 1, read time is 725 ms, thread '5' (oradcdc_redo.c:1042)
00006832: 2024-04-04T06:04:20 [PERFORMANCE ]T: Start read from archived Redo log 51200000 bytes at offset 000000002dc6a600 for requested offset 000000002dc6a800, thread '5' (oradcdc_redo.c:1025)
00006832: 2024-04-04T06:04:21 [PERFORMANCE ]T: Completed to read from archived Redo log 51200000 bytes at offset 000000002dc6a600 with rc 1, read time is 708 ms, thread '5' (oradcdc_redo.c:1042)


Is there any situation or setting which limits source capture?

Regards,

Kyoko Tajima

DesmondWOO
Support

Hi @iti-attunity-sup /Tajima san,

The PERFORMANCE message is printed every 30 seconds, and your source latency is increasing by 30 seconds as well. It appears that Replicate is still awaiting a response from Oracle, and this delay could be due to a dead connection.

Have you defined ENABLE=BROKEN?
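
For reference, ENABLE=BROKEN is set in the Oracle Net connect descriptor used by the Replicate server, for example in its tnsnames.ora entry. This is only a sketch; the alias, host, and service name below are placeholders:

ORCL_SRC =
  (DESCRIPTION =
    (ENABLE = BROKEN)
    (ADDRESS = (PROTOCOL = TCP)(HOST = oracle-host)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = orcl_service))
  )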

If the source latency keeps increasing by 30 seconds, please stop and resume the task.

Regards,
Desmond

Heinvandenheuvel
Specialist II

Thanks for the extra information. Those are hefty source servers. Exadata is always configured with ASM, I seem to recall? Are those servers busy with update/insert activity around 5 am?

I suppose they are, based on the remark: "even though many archive logs are generated during this time."

I see no 'redo' (online log) reads - which can be normal with high latency - or it could be the cause (read on).

When the 'arch' reads are taking place, the speed appears decent (but not great) at 70 - 100 MB/sec.

Applying Occam's razor would suggest that your Oracle source endpoint is simply set to "Use archived redo logs only" - is that perhaps the case? That could explain everything if no new archive logs are generated during those slow windows; in that case the latency is expected to climb and climb until a new archive log with a relevant transaction is available. Is the Oracle DB itself perhaps configured to only generate archive logs when a redo log is full (how big?), or is there a periodic "alter system switch logfile"? Even though you indicated that many archive logs are generated during this time, did you query V$ARCHIVED_LOG around the slow times to verify the archived logs? Trust but verify. Something along the lines of:

select SEQUENCE#, thread#, to_char(FIRST_TIME, 'YYYY-MM-DD HH24:MI:SS') "Date&Time",
TO_CHAR((blocks*block_size)/(1024*1024), '999,999,999') "Megabytes", name
from v$archived_log
where first_time > sysdate - 2 and status = 'A' order by first_time;
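
To answer the "how big" and log-switch questions, a companion sketch that shows the online redo log sizes and status per thread:

-- online redo log groups, size and current status per RAC thread
select thread#, group#, round(bytes/1024/1024) as mb, status, archived
from v$log
order by thread#, group#;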

 

I'll attach a silly Perl script (as .txt) which can help with interpreting the reptask log PERFORMANCE entries when there are a lot of them.

Also attached is a silly script which you will probably have to modify to show a matrix of archived logs by hour of the day - just in case, but I get the feeling that the issue is not in that area. Maybe something with the thread ID?

Besides the latency concerns, is the replication itself believed to be working as expected? Is "redo logs destination ID" set up correctly? Since you indicate 'daily', one must assume the latency becomes low again at some point. When? Does it stay (reasonably) low for several hours after catching up, so to speak?

Hein.

iti-attunity-sup
Partner - Creator
Author

Hello Desmond
Thank you for your reply.

I am not sure whether this is caused by a dead connection, because the task automatically recovers after about an hour or so without any reconnection messages.
But we will consider this as a possibility.

Regards,
Kyoko Tajima

iti-attunity-sup
Partner - Creator
Author

Hello Hein

Thank you for the update and for providing the scripts.

About "Use archived redo logs only", they don't set to " "Use archived redo logs only".

During daytime hours when the load is not high, replicate reads data from online redo logs..

00006824: 2024-03-06T16:50:52 [SOURCE_CAPTURE ]I: Start processing online Redo log sequence 224153 thread 3 name +DATA/RSDB03_B/ONLINELOG/group_7.633.999919903 (oradcdc_redo.c:898)
00006824: 2024-03-06T16:50:54 [SOURCE_CAPTURE ]I: Start processing online Redo log sequence 97904 thread 5 name +DATA/RSDB03_B/ONLINELOG/group_15.623.999919905 (oradcdc_redo.c:898)


I assumed that "many archive logs are generated during this time" by checking the alert logs of each node,
but I will ask the customer to check v$archived_log to confirm.
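
In case it is useful, a sketch of that check (assuming access to v$archived_log on each source), grouping archived log count and volume by hour and redo thread so it can be correlated with the slow window:

-- archived log count and size per hour and per RAC thread, last 24 hours
select thread#,
       to_char(first_time, 'YYYY-MM-DD HH24') as log_hour,
       count(*) as logs,
       round(sum(blocks*block_size)/1024/1024) as mb
from v$archived_log
where first_time > sysdate - 1
group by thread#, to_char(first_time, 'YYYY-MM-DD HH24')
order by 2, 1;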

Regarding whether the replication itself is working as expected: the customer does not report any problems other than this one, so I think it works fine.
A few hours after the issue occurred, the source latency got relatively low again and it seems to work fine in the daytime.

00006620: 2024-04-04T08:28:29 [PERFORMANCE ]T: Source latency 218.55 seconds, Target latency 10552.55 seconds, Handling latency 10334.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T08:28:59 [PERFORMANCE ]T: Source latency 225.70 seconds, Target latency 10582.70 seconds, Handling latency 10357.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T08:29:29 [PERFORMANCE ]T: Source latency 255.83 seconds, Target latency 10612.83 seconds, Handling latency 10357.00 seconds (replicationtask.c:3403)
00006620: 2024-04-04T08:29:59 [PERFORMANCE ]T: Source latency 285.96 seconds, Target latency 10642.96 seconds, Handling latency 10357.00 seconds (replicationtask.c:3403)

 

So I believe the large amount of archive logs triggered the issue, but I am not sure why source capture makes no progress only at that specific time.

Regards,

Kyoko Tajima