Topic: Matching UTF8 / ANSI accented text
hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #5 on: Mar 6th, 2015, 6:38pm »
Hi,
Currently I have implemented the following:
Code:
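CP_UTF8 = 65001
DIM wchar% 257
ansi$ = ""
REM Convert the NUL-terminated UTF-8 string at UTF8% to UTF-16; n% counts the terminator
SYS "MultiByteToWideChar", CP_UTF8, 0, UTF8%, -1, wchar%, 128 TO n%
REM Keep only the low byte of each UTF-16 unit (only meaningful for code points below 256)
FOR i% = wchar% TO wchar%+n%*2-3 STEP 2
  ansi$ += CHR$(?i%)
NEXT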
Here UTF8% is a pointer to a NUL-terminated UTF-8 string.
I then compare the two ANSI strings using:
Code:
REM NORM_IGNORECASE = 0x00000001;
REM NORM_IGNORENONSPACE = 0x00000002;
REM NORM_IGNORESYMBOLS = 0x00000004;
REM LINGUISTIC_IGNORECASE = 0x00000010;
REM LINGUISTIC_IGNOREDIACRITIC = 0x00000020;
REM NORM_IGNOREKANATYPE = 0x00010000;
REM NORM_IGNOREWIDTH = 0x00020000;
REM NORM_LINGUISTIC_CASING = 0x08000000;
REM SORT_STRINGSORT = 0x00001000;
REM SORT_DIGITSASNUMBERS = 0x00000008;
SYS "CompareString",0,&1021,str1$,LENstr1$,str2$,LENstr1$ TO result%
Using LENstr1$ for both lengths is intentional, since the string from the text file is always the same length as, or shorter than, the string from the SQLite DB.
This works as I intended; however, I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.
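For reference, &1021 is three of the flags listed above ORed together. A minimal sketch of the arithmetic, using the values from that list (flags% is only an illustrative name):

```bbcbasic
REM Values copied from the list of CompareString flags above
NORM_IGNORECASE = &1
LINGUISTIC_IGNOREDIACRITIC = &20
SORT_STRINGSORT = &1000
REM flags% is a hypothetical name used only for this illustration
flags% = SORT_STRINGSORT OR LINGUISTIC_IGNOREDIACRITIC OR NORM_IGNORECASE
PRINT ~flags%
```

On their own, &1 and &10 only request case-insensitive matching; the &20 bit is the one that asks for diacritics to be ignored, which is presumably why neither was enough for accented text.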
Thanks for mentioning @vdu%?74; I will check whether I can benefit from it. The new Wiki article is a great help in understanding this better. Indeed, I don't always understand why, for a parameter to an API call, '!^str$' is sometimes used and other times just 'str$'. I thought the interpreter would make sure a pointer is passed either way.
Regards,
Mike
rtr2
« Reply #6 on: Mar 6th, 2015, 9:30pm »
On Mar 6th, 2015, 6:38pm, hellomike wrote: "I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10."
Your method of converting UTF-16 to ANSI is non-standard and I'm not even sure it is guaranteed to work. Why did you use that method rather than the 'official' WideCharToMultiByte API? Indeed, why did you choose not to take my advice and convert both source strings to UTF-16 and compare those (e.g. using the CompareStringW API)?
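The UTF-16 route suggested here could look roughly like the following. This is a sketch under assumptions rather than tested code from this thread: utf8$ and ansi$ are hypothetical sample strings, the buffers are sized for at most 255 characters, and the locale (0) and flags (&1021) are simply carried over from the earlier post.

```bbcbasic
REM Sketch only: utf8$ and ansi$ are hypothetical sample inputs
CP_ACP = 0
CP_UTF8 = 65001
CSTR_EQUAL = 2
utf8$ = "caf" + CHR$(&C3) + CHR$(&A9) : REM "café" encoded as UTF-8
ansi$ = "CAFE"
DIM w1% 511, w2% 511 : REM room for 256 UTF-16 units in each buffer
REM Passing the strings by name lets BBC BASIC append the NUL terminator
SYS "MultiByteToWideChar", CP_UTF8, 0, utf8$, -1, w1%, 256 TO n1%
SYS "MultiByteToWideChar", CP_ACP, 0, ansi$, -1, w2%, 256 TO n2%
IF n1% = 0 OR n2% = 0 THEN ERROR 100, "Conversion to UTF-16 failed"
REM A length of -1 tells CompareStringW that the strings are NUL-terminated
SYS "CompareStringW", 0, &1021, w1%, -1, w2%, -1 TO result%
IF result% = CSTR_EQUAL THEN PRINT "Match" ELSE PRINT "No match"
```

This compares the whole strings; the prefix-style comparison from the earlier post would need explicit lengths in place of -1.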
Another question raised by your code is what encoding your 'ANSI' string is actually using. The Windows constant CP_ACP does not necessarily mean the ANSI code page; instead, it refers to the (default) code page for which the PC is configured. If it's set up for use in the UK or US it may well be the ANSI code page, but in almost any other country it probably won't be, and that could cause your code to fail.
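As an illustration of the 'official' conversion with an explicit code page, here is a minimal sketch. The sample string utf8$ and the choice of code page 1252 (Western European) are assumptions made for the example; GetACP is used only to show which code page CP_ACP would select on the PC running it.

```bbcbasic
REM Sketch only: utf8$ is a hypothetical sample input
CP_UTF8 = 65001
DIM wchar% 511, ansi% 255
utf8$ = "caf" + CHR$(&C3) + CHR$(&A9) : REM "café" encoded as UTF-8
SYS "MultiByteToWideChar", CP_UTF8, 0, utf8$, -1, wchar%, 256 TO n%
IF n% = 0 THEN ERROR 100, "Conversion to UTF-16 failed"
REM 1252 is an assumed target code page; CP_ACP (0) would follow the PC's settings
SYS "WideCharToMultiByte", 1252, 0, wchar%, -1, ansi%, 256, 0, 0 TO n%
IF n% = 0 THEN ERROR 100, "Conversion from UTF-16 failed"
ansi$ = $$ansi% : REM read back the NUL-terminated result
REM GetACP reports which code page CP_ACP refers to on this PC
SYS "GetACP" TO acp%
PRINT "CP_ACP currently means code page "; acp%
```

Naming the code page explicitly keeps the result independent of how the PC is configured, at the cost of hard-coding an assumption about the text file's encoding.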
Quote: "I don't always understand why, for a parameter to an API call, '!^str$' is sometimes used and other times just 'str$'."
If (and only if) the last character of string str$ is a NUL there is not really any difference between them; they both pass the same address to the API function. However if str$ does not end with a NUL they have quite different effects (passing str$ causes the string to be copied to a temporary place and a NUL terminator added).
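A small sketch of the two forms, assuming BBC BASIC for Windows and a hypothetical string str$ that already ends with a NUL, so that both calls see the same text:

```bbcbasic
REM Sketch only: str$ is a hypothetical string that already ends with a NUL
str$ = "hello" + CHR$(0)
REM !^str$ passes the address of the string's own data
SYS "lstrlenA", !^str$ TO n1%
REM Passing str$ makes BBC BASIC copy the string and append a NUL terminator
SYS "lstrlenA", str$ TO n2%
PRINT n1%, n2%
```

Without the trailing CHR$(0), the first call would hand the API a pointer to data with no guaranteed terminator, which is the case where the two forms behave differently.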
Richard.
hellomike
« Reply #7 on: Mar 10th, 2015, 8:25pm »
Richard,
I'm still optimizing my code with the advice received. All the information is really useful, don't worry.
Quote: "(passing str$ causes the string to be copied to a temporary place and a NUL terminator added)"
Clear now.
Regards,
Mike