BBC BASIC for Windows
« Matching UTF8 / ANSI accented text »

hellomike
Matching UTF8 / ANSI accented text
« Thread started on: Mar 4th, 2015, 7:21pm »

Hi fellow programmers,

For a project I need to match strings from 2 different sources.

The first source is a simple ANSI text file where strings can have accented characters. The second source is a (SQLite3) database where strings come out as UTF8.

So for example a string from source 1 is

"COMÉDIE À LA FRANÇAISE, LA"

and from the database it is

"Comédie Ã_€ La Française, La" (_ actually is unprintable block character)

This should be a match but I find no way of making this happen.
The perfect solution would be if both strings can be converted to upper case without accented characters.
Then "COMEDIE A LA FRANCAISE, LA" would match with "COMEDIE A LA FRANCAISE, LA".

Alternatively, converting the UTF8 text to "Comédie À La Française, La" would be sufficient, because that can easily be made uppercase (accented characters included).

To make this conversion, I experimented with
- SYS "MultiByteToWideChar"
- SYS "CompareString"
- SYS "NormalizeString"

All with minimal, undesirable or no effect.

What code/technique/calls would I need to normalize both source inputs to comparable versions?

Thanks for all the help in advance.

Regards,

Mike

rtr2
Re: Matching UTF8 / ANSI accented text
« Reply #1 on: Mar 4th, 2015, 8:36pm »

on Mar 4th, 2015, 7:21pm, hellomike wrote:
What code/technique/calls would I need to normalize both source inputs to comparable versions?

As you have discovered, Windows does not provide an API to perform a direct conversion between UTF-8 and ANSI encodings. However it does provide a routine MultiByteToWideChar to convert either of those to UTF-16 encoding. So the most straightforward approach is probably to convert both your strings to UTF-16 and then compare the two outputs (a regular string comparison will do that job perfectly well).

The MultiByteToWideChar API is not particularly difficult to use, but as always you must abide by the requirements of the MSDN description in every detail. A useful hint to be aware of is that when an API function returns a string you must add an explicit NUL termination to the supplied parameter string (which must be long enough to receive the output, i.e. 2*N+1, including the NUL, where N is the number of UTF-16 characters expected).
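The approach described above might be sketched as follows. This is only an illustration: it assumes the ANSI text is in Windows code page 1252 (the thread does not say which code page the file uses), and the function name FNwide and all variable names are hypothetical.

```bbcbasic
      REM Documented Windows code page identifiers
      CP_1252 = 1252  : REM Western European "ANSI" (an assumption about the text file)
      CP_UTF8 = 65001

      ansi$ = "COM" + CHR$(&C9) + "DIE"              : REM COMEDIE with E-acute, code page 1252
      utf8$ = "COM" + CHR$(&C3) + CHR$(&89) + "DIE"  : REM the same text encoded as UTF-8

      IF FNwide(ansi$, CP_1252) = FNwide(utf8$, CP_UTF8) THEN
        PRINT "Match"
      ELSE
        PRINT "No match"
      ENDIF
      END

      REM Convert s$ from code page cp% to a UTF-16 (little-endian) string
      DEF FNwide(s$, cp%)
      LOCAL n%, w$
      IF s$ = "" THEN = ""
      REM At most one UTF-16 code unit per input byte: 2*N bytes plus the NUL
      w$ = STRING$(2 * LEN(s$), CHR$0) + CHR$0
      SYS "MultiByteToWideChar", cp%, 0, s$, LEN(s$), !^w$, LEN(s$) TO n%
      REM n% is the number of UTF-16 code units written
      = LEFT$(w$, 2 * n%)
```

With the sample bytes shown, both conversions should produce the same UTF-16 string. Note that a plain string comparison is exact, so strings that differ in case or accents (such as "COMÉDIE" and "Comédie") would still compare as different.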

Richard.

hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #2 on: Mar 5th, 2015, 7:45pm »

on Mar 4th, 2015, 8:36pm, g4bau wrote:
So the most straightforward approach is probably to convert both your strings to UTF-16 and then compare the two outputs (a regular string comparison will do that job perfectly well).


Thanks for the answer Richard. At least I now know that I didn't miss something.
Non-matching strings might have to be displayed or printed, so at some point I must convert them to something readable. I'm thinking of either:
- using "MultiByteToWideChar" and then glue every other character of the 2-byte-per-character output into an ANSI string or
- scan the UTF8 string searching for the 'Ã' byte, removing it and replacing the next byte with the appropriate accented character (via a table)
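The second idea can also be done arithmetically rather than with a table, because a 2-byte UTF-8 sequence carries its code point in the low bits of the two bytes. A minimal sketch, assuming only characters up to U+00FF need to be handled; FNlatin1 and its variable names are hypothetical, not from the thread:

```bbcbasic
      utf8$ = "Com" + CHR$(&C3) + CHR$(&A9) + "die"  : REM "Comedie" with e-acute, in UTF-8
      PRINT FNlatin1(utf8$)
      END

      REM Decode ASCII and the 2-byte sequences led by &C2 or &C3 (U+0080 to U+00FF).
      REM Any other byte is replaced by "?".
      DEF FNlatin1(u$)
      LOCAL i%, b%, c%, o$
      i% = 1
      WHILE i% <= LEN(u$)
        b% = ASC(MID$(u$, i%))
        CASE TRUE OF
          WHEN b% < &80: o$ += CHR$(b%)
          WHEN ((b% AND &FE) = &C2) AND (i% < LEN(u$)):
            REM 5 payload bits from the lead byte, 6 from the continuation byte
            c% = (b% AND &1F) * 64 + (ASC(MID$(u$, i% + 1)) AND &3F)
            o$ += CHR$(c%)
            i% += 1
          OTHERWISE: o$ += "?"
        ENDCASE
        i% += 1
      ENDWHILE
      = o$
```

The result is Latin-1, which coincides with code page 1252 for U+00A0 to U+00FF (all the accented letters) but not for U+0080 to U+009F.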

The sources might hold up to say around 50,000 strings to compare and BBC BASIC for Windows is fast so the process is fast enough for such a small amount (no need for machine-code).

Regards,

Mike

rtr2
Re: Matching UTF8 / ANSI accented text
« Reply #3 on: Mar 5th, 2015, 8:58pm »

on Mar 5th, 2015, 7:45pm, hellomike wrote:
Non-matching string might have to be displayed/printed so at some point I must convert it to something readable.

I don't quite understand why. BB4W can display and print both ANSI and UTF-8 strings so no conversion ought to be necessary. It's true that ordinarily one wouldn't be switching dynamically between the two encodings, but I can't see why it wouldn't work correctly if you do (bit 7 of the VDU flags byte @vdu%?74 if memory serves me correctly).

So what I would envisage is converting both the ANSI and UTF-8 strings to UTF-16, comparing those UTF-16 strings, and if they are different displaying/printing the source ANSI (@vdu%?74 AND= &7F) and UTF-8 (@vdu%?74 OR= &80) strings directly using PRINT.

Richard.
« Last Edit: Mar 5th, 2015, 8:59pm by rtr2 »

rtr2
Re: Matching UTF8 / ANSI accented text
« Reply #4 on: Mar 6th, 2015, 10:35am »

on Mar 4th, 2015, 8:36pm, g4bau wrote:
A useful hint to be aware of is that when an API function returns a string you must add an explicit NUL termination to the supplied parameter string

I have written a short Wiki article with a couple of examples:

http://bb4w.wikispaces.com/Calling+DLL+functions+that+return+strings

Richard.

hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #5 on: Mar 6th, 2015, 6:38pm »

Hi,

Currently I implemented the following:
Code:
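        CP_UTF8 = 65001 : REM Windows code page identifier for UTF-8
        DIM wchar% 257 : REM 258 bytes, enough for 128 UTF-16 characters
        ansi$=""
        REM n% = number of UTF-16 characters written, including the terminating NUL
        SYS "MultiByteToWideChar",CP_UTF8,0,UTF8%,-1,wchar%,128 TO n%
        REM Keep only the low byte of each character (meaningful only below U+0100)
        FOR i%=wchar% TO wchar%+n%*2-3 STEP 2
          ansi$+=CHR$(?i%)
        NEXT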
 

Where UTF8% is a pointer to a null terminated UTF8 string.

I then compare the 2 ansi strings using:
Code:
      REM NORM_IGNORECASE = 0x00000001;
      REM NORM_IGNORENONSPACE = 0x00000002;
      REM NORM_IGNORESYMBOLS = 0x00000004;
      REM LINGUISTIC_IGNORECASE = 0x00000010;
      REM LINGUISTIC_IGNOREDIACRITIC = 0x00000020;
      REM NORM_IGNOREKANATYPE = 0x00010000;
      REM NORM_IGNOREWIDTH = 0x00020000;
      REM NORM_LINGUISTIC_CASING = 0x08000000;
      REM SORT_STRINGSORT = 0x00001000;
      REM SORT_DIGITSASNUMBERS = 0x00000008;

      SYS "CompareString",0,&1021,str1$,LENstr1$,str2$,LENstr1$ TO result%
 

The 2 times LENstr1$ is intentional since the string from the text file is always the same or shorter in length than the string from the SQLite DB.

This works as I intended; however, I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.

Thanks for mentioning @vdu%?74. I will check it out to see if I can benefit from it.
The new Wiki article is awesome and gives me a better understanding. Indeed, I do not always understand why, for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'. I thought the interpreter would make sure the pointer is passed anyhow.

Regards,

Mike

rtr2
Re: Matching UTF8 / ANSI accented text
« Reply #6 on: Mar 6th, 2015, 9:30pm »

on Mar 6th, 2015, 6:38pm, hellomike wrote:
I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.

Your method of converting UTF-16 to ANSI is non-standard and I'm not even sure it is guaranteed to work. Why did you use that method rather than the 'official' WideCharToMultiByte API? Indeed, why did you choose not to take my advice and convert both source strings to UTF-16 and compare those (e.g. using the CompareStringW API)?
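A sketch of what that might look like: both strings are converted to UTF-16, compared with CompareStringW, and one of them is converted back to ANSI with WideCharToMultiByte for display. It assumes the text file uses code page 1252 (not stated in the thread); FNwide, FNansi and the variable names are hypothetical, while the numeric constants are the documented Windows values.

```bbcbasic
      CP_1252 = 1252 : REM assumed ANSI code page of the text file
      CP_UTF8 = 65001
      lcid% = &400   : REM LOCALE_USER_DEFAULT
      flags% = 1 OR 2 : REM NORM_IGNORECASE OR NORM_IGNORENONSPACE

      a$ = "COM" + CHR$(&C9) + "DIE"              : REM ANSI, upper case
      u$ = "Com" + CHR$(&C3) + CHR$(&A9) + "die"  : REM UTF-8, mixed case

      wa$ = FNwide(a$, CP_1252)
      wu$ = FNwide(u$, CP_UTF8)
      REM Lengths are passed in UTF-16 characters, so no terminator is needed
      SYS "CompareStringW", lcid%, flags%, !^wa$, LEN(wa$) DIV 2, !^wu$, LEN(wu$) DIV 2 TO result%
      IF result% = 2 THEN PRINT "Match" ELSE PRINT "No match" : REM 2 = CSTR_EQUAL

      PRINT FNansi(wu$, CP_1252) : REM UTF-16 back to ANSI for display
      END

      REM Convert s$ from code page cp% to UTF-16
      DEF FNwide(s$, cp%)
      LOCAL n%, w$
      IF s$ = "" THEN = ""
      w$ = STRING$(2 * LEN(s$), CHR$0) + CHR$0
      SYS "MultiByteToWideChar", cp%, 0, s$, LEN(s$), !^w$, LEN(s$) TO n%
      = LEFT$(w$, 2 * n%)

      REM Convert UTF-16 string w$ to code page cp%
      DEF FNansi(w$, cp%)
      LOCAL n%, a$
      IF w$ = "" THEN = ""
      REM Allow two output bytes per UTF-16 character, plus a NUL
      a$ = STRING$(LEN(w$), CHR$0) + CHR$0
      SYS "WideCharToMultiByte", cp%, 0, !^w$, LEN(w$) DIV 2, !^a$, LEN(w$), 0, 0 TO n%
      = LEFT$(a$, n%)
```

The two sample strings differ in case but should compare as equal with these flags, which also ignore diacritics, so "COMEDIE" would be expected to match as well.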

Another question raised by your code is what encoding your 'ANSI' string is actually using. The Windows constant CP_ACP does not necessarily mean the ANSI code page, instead it refers to the (default) code page for which the PC is configured. If it's set up for use in the UK or US it may well be the ANSI code page, but in almost any other country it probably won't be, and that could cause your code to fail.
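One way to see what CP_ACP resolves to on a given PC is to ask Windows for the default code page. A minimal sketch, assuming the text file was written in code page 1252 (an assumption; the thread does not say); acp% is an illustrative variable name:

```bbcbasic
      REM GetACP returns the identifier of the system default ANSI code page
      SYS "GetACP" TO acp%
      PRINT "Default code page: "; acp%
      IF acp% <> 1252 THEN PRINT "CP_ACP is not 1252 here; pass 1252 explicitly"
```

Passing the explicit code page number to the conversion routines, instead of CP_ACP, keeps the result independent of how the PC is configured.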

Quote:
I do not always understand why, for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'.

If (and only if) the last character of string str$ is a NUL there is not really any difference between them; they both pass the same address to the API function. However if str$ does not end with a NUL they have quite different effects (passing str$ causes the string to be copied to a temporary place and a NUL terminator added).
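The difference can be illustrated with a function that reads a NUL-terminated string, such as lstrlenA. This is a small sketch with illustrative variable names:

```bbcbasic
      a$ = "Hello"
      REM Passing the string: BB4W hands the API a temporary NUL-terminated copy
      SYS "lstrlenA", a$ TO n%
      PRINT n%

      REM Passing the address stored in the string descriptor:
      REM the NUL must already be part of the string itself
      b$ = "Hello" + CHR$0
      SYS "lstrlenA", !^b$ TO m%
      PRINT m%
```

Both calls should report a length of 5. Passing !^a$ for the unterminated a$ would let the API read past the end of the string.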

Richard.


hellomike
Re: Matching UTF8 / ANSI accented text
« Reply #7 on: Mar 10th, 2015, 8:25pm »

Richard,

I'm still optimizing my code with the advice received. All the information is really useful, don't worry.

Quote:
(passing str$ causes the string to be copied to a temporary place and a NUL terminator added)

Clear now.

Regards,

Mike