shane_spencer
Specialist

How to correct umlauts that have been incorrectly encoded

I've got some data from an external source where the umlaut characters have been encoded as A-like characters. It's a real chore to identify the incorrect character in the data and then find the correct umlaut, let alone doing the replace. I cannot be sure I'm even identifying the correct match, as it's a matter of deduction, working out what it should be. I'm wondering if there's a way to get Qlik to translate these, or failing that, a way of looking up individual characters online so I can build a table for a map.

i.e.

Character    Replacement
Ãœ           Ü
Ã–           Ö
Åƒ           Ń
ÃŸ           ß
Ã“           Ó
ÄŒ           Č
Ã‰           É
Å‡           Ň
Å˜           Ř
Äš           Ě
Åž           Ș
Å½           Ž
Å¹           Ź
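
For illustration, a table like this can be loaded straight into a Qlik mapping table for later use with MapSubString() - a minimal sketch, with a placeholder name (MojibakeMap) and only a few of the rows shown:

// Hypothetical mapping table; the full list of pairs from the table above would go here.
MojibakeMap:
mapping load * inline [
BadChar, GoodChar
Ãœ, Ü
Ã–, Ö
ÄŒ, Č
Å˜, Ř
];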
6 Replies
marcus_sommer

If these chars are correct within the source, it would mean the data is loaded into Qlik with the wrong charset, and you could try another one - within the file format or as a function: ApplyCodepage - script and chart function | Qlik Sense on Windows Help.
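
For example, a minimal sketch of the function approach (field, table, and file names are placeholders; whether it repairs the data depends on how the source was actually written):

// Sketch: reinterpret the field's text with a specific codepage (65001 = UTF-8).
// Assumption: the source bytes are really UTF-8 that was mis-decoded.
Data:
load ApplyCodepage(Company, 65001) as Company
from [lib://Data/companies.csv] (txt, embedded labels, delimiter is ',');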

If the chars are already wrong in the source, you will need to apply a replacement - but not necessarily with n replace() statements; instead you could use a mapping, like demonstrated here:

Passing parameter strings that contain special cha... - Qlik Community - 1247166

Identifying all possible chars manually is a rather ugly job, but it might be supported with a loop that returns the index behind each char, which may then be grouped and clustered in some way. Here is the main idea:

// Split every distinct value of MyField into single chars, one record per char,
// with each char's code point (ord) for spotting suspicious values.
Chars:
load *, rowno() as RowNo;
load *, ord(mid(MyField, iterno(), 1)) as Index, mid(MyField, iterno(), 1) as Char, iterno() as IterNo
while iterno() <= len(MyField);
load text(fieldvalue('MyField', recno())) as MyField, recno() as RecNo
autogenerate fieldvaluecount('MyField');
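
A possible follow-up step (my addition, not part of the original reply): keep only the non-ASCII chars, which are the candidates to inspect.

// Assumption: the table label Chars from the snippet above.
Suspects:
noconcatenate load Char, Index, MyField
resident Chars
where Index > 127;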

shane_spencer
Specialist
Author

I suspect ApplyCodepage() is getting me close, but I cannot identify the correct codepage number (if there is indeed one that will do the job). The data I am getting is from VIES, so I have no control over the format. For example, what I receive is: HEMSBÃœNDE, but what it should be is HEMSBÜNDE. The "Ü" (and many other umlauts) are being replaced with "Ãœ" and I want to translate it back. What I have had to settle on is manually building a lookup table and using MapSubString() to replace.
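
A minimal sketch of that workaround (assuming the MojibakeMap mapping table sketched earlier and a previously loaded table Data with a field Company):

// Replace every mapped substring; text with no match passes through unchanged.
Fixed:
load MapSubString('MojibakeMap', Company) as Company
resident Data;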

marcus_sommer

I mostly use ANSI (I think it's 1252) or UTF-8 and currently have no issues with any of the German special chars. But I think you may need some different ones, because your example above contains various kinds of extra accent marks on top of the chars, like those common in French, Scandinavian, and East European languages.

Besides the above, you may also check the font you use - whether it can display the wanted chars - and your region settings (OS + Qlik + collation variables), which may also impact the results.

But the most practical approach is probably the mapping, because the investigation into the charset stuff may take much more time without any guarantee of finding a working solution.

shane_spencer
Specialist
Author

The data I am getting is from numerous EU countries such as Germany, France, Slovakia and Czechia. I don't have any issues with the umlaut characters, but the source data has been incorrectly encoded. I was hoping for a quick way to re-encode it. I found this info on Stack Overflow which was enlightening but didn't give me a solution on how to correct them: 

https://stackoverflow.com/questions/5127744/how-to-convert-these-strange-characters-%C3%83-%C3%83-%C...

"If you see those characters you probably just didn’t specify the character encoding properly. Because those characters are the result when an UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252." 

...my concern is that I'm having to investigate and deduce what each character should be, so I could get it wrong.

... I tried ApplyCodepage() with 1252 and a few others such as 65001 but could not get the right results.
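
To make the quoted mechanism concrete, here is a small self-made illustration (chr() takes Unicode code points): Ü is U+00DC, its UTF-8 encoding is the two bytes 0xC3 0x9C, and read as Windows-1252 those bytes display as Ã (U+00C3) and œ (U+0153), i.e. "Ãœ".

// Illustration only: rebuild the mojibake for Ü from its Unicode code points.
// 220 = Ü; 195 = 0xC3 (Ã); 339 = U+0153 (œ), which is what byte 0x9C shows as in cp1252.
Demo:
load chr(220) as Correct, chr(195) & chr(339) as Mojibake
autogenerate 1;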

marcus_sommer

Yes, manual correcting is surely an ugly job and will need some time and iteration.

Just a thought - you may reproduce the errors on your own and use that to build a mapping. This means going through the alphabet of each language - maybe via copy & paste from Wikipedia or similar sites - loading the chars correctly as UTF-8, and afterwards applying a wrong charset within an extra field, or maybe even several fields with different charsets. Afterwards everything is loaded together into a mapping.
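
A minimal sketch of that idea (the file path and names are placeholders; it assumes alphabet.txt is saved as UTF-8 with one char per line):

// Load the same UTF-8 file twice: once decoded wrongly (cp1252), once correctly.
// Lines pair up via recno(), giving broken char -> correct char.
Pairs:
load @1 as BadChar, recno() as Pos
from [lib://Data/alphabet.txt] (txt, codepage is 1252, no labels);

left join (Pairs)
load @1 as GoodChar, recno() as Pos
from [lib://Data/alphabet.txt] (txt, utf8, no labels);

ReproMap:
mapping load BadChar, GoodChar resident Pairs;
drop table Pairs;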

Besides this, I'm not sure if the wrong encoding produces a definitely unique replacement char. I could imagine that for several languages this is not always the case, which would mean you may also need to include some context in the mapping - maybe by considering country information or similar (this would also be true with a manual approach).

shane_spencer
Specialist
Author

That's a good idea, recreating the issue in a controlled manner. I did think it was a one-to-one relationship between characters, but I've been getting data from different VIES countries and I have noticed that one incorrectly encoded character in the source (I think) translates to different umlauts. I think what I've done so far is "good enough", but I can foresee having similar issues in the future, so it would be nice to have a simple, reusable solution.