BBC BASIC for Windows - Matching UTF8 / ANSI accented text

Topic: Matching UTF8 / ANSI accented text (Read 1432 times)

hellomike

Re: Matching UTF8 / ANSI accented text
« Reply #5 on: Mar 6th, 2015, 6:38pm »

Hi,

I have currently implemented the following:
Code:

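        REM Convert the UTF-8 string to UTF-16 (at most 128 wide characters)
        DIM wchar% 257
        ansi$=""
        SYS "MultiByteToWideChar",CP_UTF8,0,UTF8%,-1,wchar%,128 TO n%
        REM Keep only the low byte of each UTF-16 unit, skipping the final NUL
        FOR i%=wchar% TO wchar%+n%*2-3 STEP 2
          ansi$+=CHR$(?i%)
        NEXT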

Where UTF8% is a pointer to a NUL-terminated UTF-8 string.

I then compare the two ANSI strings using:
Code:

      REM NORM_IGNORECASE = 0x00000001;
      REM NORM_IGNORENONSPACE = 0x00000002;
      REM NORM_IGNORESYMBOLS = 0x00000004;
      REM LINGUISTIC_IGNORECASE = 0x00000010;
      REM LINGUISTIC_IGNOREDIACRITIC = 0x00000020;
      REM NORM_IGNOREKANATYPE = 0x00010000;
      REM NORM_IGNOREWIDTH = 0x00020000;
      REM NORM_LINGUISTIC_CASING = 0x08000000;
      REM SORT_STRINGSORT = 0x00001000;
      REM SORT_DIGITSASNUMBERS = 0x00000008;

      SYS "CompareString",0,&1021,str1$,LENstr1$,str2$,LENstr1$ TO result%

Using LENstr1$ for both lengths is intentional, since the string from the text file is always the same length as, or shorter than, the string from the SQLite DB.

This works as I intended; however, I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.
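
For reference, &1021 is the bitwise OR of three of the flags listed above: SORT_STRINGSORT (&1000), LINGUISTIC_IGNOREDIACRITIC (&20) and NORM_IGNORECASE (&1). A small sketch using the values from the REM lines (flags% is just an illustrative name):
Code:

      NORM_IGNORECASE = &1
      LINGUISTIC_IGNOREDIACRITIC = &20
      SORT_STRINGSORT = &1000
      REM Combine the three flags with a bitwise OR
      flags% = SORT_STRINGSORT OR LINGUISTIC_IGNOREDIACRITIC OR NORM_IGNORECASE
      REM Print the combined value in hexadecimal
      PRINT ~flags%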

Thanks for mentioning @vdu%?74. I will check it out to see if I can benefit from it.
The new Wiki article is a great help in understanding this better. Indeed, I don't always understand why, for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'. I thought the interpreter would make sure the pointer is passed anyhow.

Regards,

Mike

rtr2

Re: Matching UTF8 / ANSI accented text
« Reply #6 on: Mar 6th, 2015, 9:30pm »

on Mar 6th, 2015, 6:38pm, hellomike wrote:

I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.

Your method of converting UTF-16 to ANSI is non-standard and I'm not even sure it is guaranteed to work. Why did you use that method rather than the 'official' WideCharToMultiByte API? Indeed, why did you choose not to take my advice and convert both source strings to UTF-16 and compare those (e.g. using the CompareStringW API)?

Another question raised by your code is what encoding your 'ANSI' string is actually using. The Windows constant CP_ACP does not necessarily mean the ANSI code page; instead, it refers to the (default) code page for which the PC is configured. If it's set up for use in the UK or US it may well be the ANSI code page, but in almost any other country it probably won't be, and that could cause your code to fail.
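
The UTF-16 comparison suggested above might look something like the following. This is only a sketch under stated assumptions: UTF8% points to a NUL-terminated UTF-8 string as in the earlier code, ansi$ is a hypothetical name for the string held in the system default code page, both strings fit in 255 characters, and the locale (0) and flags (&1021) are simply carried over from the original CompareString call.
Code:

      CP_ACP = 0
      CP_UTF8 = 65001
      REM Each buffer holds 256 UTF-16 units (512 bytes)
      DIM wide1% 511, wide2% 511
      REM Convert both source strings to NUL-terminated UTF-16
      SYS "MultiByteToWideChar", CP_UTF8, 0, UTF8%, -1, wide1%, 256 TO n1%
      SYS "MultiByteToWideChar", CP_ACP, 0, ansi$, -1, wide2%, 256 TO n2%
      IF n1% = 0 OR n2% = 0 THEN ERROR 100, "Conversion to UTF-16 failed"
      REM A length of -1 tells CompareStringW the strings are NUL-terminated
      SYS "CompareStringW", 0, &1021, wide1%, -1, wide2%, -1 TO result%
      REM result% is 1 (less than), 2 (equal), 3 (greater than) or 0 on failure
      IF result% = 2 THEN PRINT "Strings match"

Note that this compares the whole of both strings, whereas the original call compared only the first LENstr1$ characters. If the text file is known to use a particular code page, that identifier (for example 1252) could be passed in place of CP_ACP.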

Quote:

I don't always understand why, for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'.

If (and only if) the last character of string str$ is a NUL there is not really any difference between them; they both pass the same address to the API function. However if str$ does not end with a NUL they have quite different effects (passing str$ causes the string to be copied to a temporary place and a NUL terminator added).
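
A small sketch of the difference, using the lstrlen API purely as an illustration (it is not part of the code under discussion, and a% and b% are arbitrary names). Because str$ here ends with an explicit NUL, both calls should report the same length:
Code:

      str$ = "hello" + CHR$(0)
      REM Pass the string itself; BASIC makes sure the API sees a NUL terminator
      SYS "lstrlen", str$ TO a%
      REM Pass the address of the string data, read from the string descriptor
      SYS "lstrlen", !^str$ TO b%
      PRINT a%, b%

Without the CHR$(0), only the first form would be guaranteed to reach the API as a NUL-terminated string.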

Richard.

hellomike

Re: Matching UTF8 / ANSI accented text
« Reply #7 on: Mar 10th, 2015, 8:25pm »

Richard,

I'm still optimizing my code with the advice received. All the information is really useful, don't worry.

Quote:

(passing str$ causes the string to be copied to a temporary place and a NUL terminator added)

Clear now.

Regards,

Mike
