BBC BASIC for Windows
« Matching UTF8 / ANSI accented text »

hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #5 on: Mar 6th, 2015, 6:38pm »

Hi,

Currently I have implemented the following:
Code:
        DIM wchar% 257
        ansi$=""
        SYS "MultiByteToWideChar",CP_UTF8,0,UTF8%,-1,wchar%,128 TO n%
        FOR i%=wchar% TO wchar%+n%*2-3 STEP 2
          ansi$+=CHR$(?i%)
        NEXT
 

Where UTF8% is a pointer to a NUL-terminated UTF-8 string.

I then compare the two ANSI strings using:
Code:
      REM NORM_IGNORECASE = 0x00000001;
      REM NORM_IGNORENONSPACE = 0x00000002;
      REM NORM_IGNORESYMBOLS = 0x00000004;
      REM LINGUISTIC_IGNORECASE = 0x00000010;
      REM LINGUISTIC_IGNOREDIACRITIC = 0x00000020;
      REM NORM_IGNOREKANATYPE = 0x00010000;
      REM NORM_IGNOREWIDTH = 0x00020000;
      REM NORM_LINGUISTIC_CASING = 0x08000000;
      REM SORT_STRINGSORT = 0x00001000;
      REM SORT_DIGITSASNUMBERS = 0x00000008;

      SYS "CompareString",0,&1021,str1$,LENstr1$,str2$,LENstr1$ TO result%
 

Passing LENstr1$ twice is intentional, since the string from the text file is always the same length as, or shorter than, the string from the SQLite DB.

This works as I intended; however, I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.
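
For reference, the value &1021 appears to be the bitwise OR of three of the constants listed in the REM lines above (written there in 0x notation): SORT_STRINGSORT, LINGUISTIC_IGNOREDIACRITIC and NORM_IGNORECASE. A minimal sketch of that arithmetic follows; the variable names are only illustrative, and flags% is a hypothetical name:

```bbcbasic
REM Values copied from the REM list above (0x... written as &... in BBC BASIC)
NORM_IGNORECASE% = &1
LINGUISTIC_IGNOREDIACRITIC% = &20
SORT_STRINGSORT% = &1000

REM Combine the three flags with a bitwise OR
flags% = SORT_STRINGSORT% OR LINGUISTIC_IGNOREDIACRITIC% OR NORM_IGNORECASE%
IF flags% = &1021 PRINT "flags% equals &1021"
```

By the same arithmetic, &1 alone would request only case-insensitivity, and &10 is LINGUISTIC_IGNORECASE rather than the diacritic flag (&20).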

Thanks for mentioning @vdu%?74. I will check it out to see if I can benefit from it.
The new Wiki article is great for gaining a better understanding. Indeed, I do not always understand why, for a parameter to an API call, '!^str$' is sometimes used and at other times just 'str$'. I thought the interpreter would make sure the pointer is passed anyway.

Regards,

Mike

rtr2
Guest
Re: Matching UTF8 / ANSI accented text
« Reply #6 on: Mar 6th, 2015, 9:30pm »

on Mar 6th, 2015, 6:38pm, hellomike wrote:
I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.

Your method of converting UTF-16 to ANSI is non-standard and I'm not even sure it is guaranteed to work. Why did you use that method rather than the 'official' WideCharToMultiByte API? Indeed, why did you choose not to take my advice and convert both source strings to UTF-16 and compare those (e.g. using the CompareStringW API)?

Another question raised by your code is what encoding your 'ANSI' string is actually using. The Windows constant CP_ACP does not necessarily mean the ANSI code page; instead, it refers to the (default) code page for which the PC is configured. If it's set up for use in the UK or US it may well be the ANSI code page, but in almost any other country it probably won't be, and that could cause your code to fail.
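
The approach suggested above (convert both source strings to UTF-16, then compare with CompareStringW) might be sketched as follows. This is an illustration under stated assumptions, not tested code from the thread: utf8$ and ansi$ are hypothetical input strings, the buffer size of 256 UTF-16 code units is arbitrary, and the locale (0) and flags (&1021) are simply carried over from the earlier CompareString call.

```bbcbasic
REM Sketch only: utf8$ and ansi$ are hypothetical input strings
CP_ACP% = 0
CP_UTF8% = 65001
CSTR_EQUAL% = 2
DIM wide1% 511, wide2% 511 : REM room for 256 UTF-16 code units each

REM Convert each source string to NUL-terminated UTF-16
REM (-1 asks the API to process up to and including the NUL)
SYS "MultiByteToWideChar", CP_UTF8%, 0, utf8$, -1, wide1%, 256 TO n1%
SYS "MultiByteToWideChar", CP_ACP%, 0, ansi$, -1, wide2%, 256 TO n2%
IF n1% = 0 OR n2% = 0 ERROR 100, "Conversion failed"

REM Compare the wide strings; a length of -1 means NUL-terminated
SYS "CompareStringW", 0, &1021, wide1%, -1, wide2%, -1 TO result%
IF result% = CSTR_EQUAL% PRINT "Strings match" ELSE PRINT "Strings differ"
```

Unlike the earlier code, this compares the two strings in full rather than only the first LENstr1$ characters; a prefix comparison would need explicit character counts in place of the two -1 arguments.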

Quote:
I do not always understand why, for a parameter to an API call, '!^str$' is sometimes used and at other times just 'str$'.

If (and only if) the last character of string str$ is a NUL there is not really any difference between them; they both pass the same address to the API function. However if str$ does not end with a NUL they have quite different effects (passing str$ causes the string to be copied to a temporary place and a NUL terminator added).
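
As a small illustration of the two forms, assuming the 32-bit BBC BASIC for Windows discussed in this thread (where !^str$ yields the address of the string's data), both calls below should report the same length. The variable names are hypothetical; lstrlenA is the Windows function that counts bytes up to the first NUL.

```bbcbasic
REM Sketch only: lstrlenA counts bytes up to the first NUL
a$ = "accent"
SYS "lstrlenA", a$ TO n1% : REM BASIC copies a$ and appends a NUL for us

b$ = a$ + CHR$(0) : REM supply the NUL terminator by hand
SYS "lstrlenA", !^b$ TO n2% : REM the string's own address is passed directly

PRINT n1%, n2%
```

Passing !^a$ without the added CHR$(0) would hand the API a string with no guaranteed terminator, which is the case where the two forms behave differently.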

Richard.


hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #7 on: Mar 10th, 2015, 8:25pm »

Richard,

I'm still optimizing my code with the advice I have received. All the information is really useful, don't worry.

Quote:
(passing str$ causes the string to be copied to a temporary place and a NUL terminator added)

Clear now.

Regards,

Mike