I started work on a new project and need to address search testing with a wide assortment of languages. And geez, this is a puzzle I’ve worked with before so I thought I would share some thoughts around the topic of search testing with multiple languages.
At the start, I look into how many languages and what languages I’ll be working with. Based on past (and current) experiences, I have certain reactions – from a testing perspective – to some languages.
Latin-based languages fall into one group. I used to think of English, Spanish and French as “different” languages but now I see these languages are rather similar from a test perspective. The same is true for any languages that use the Latin-based character set.
Swedish, Slovenian, Romanian and other languages, meanwhile, seem to me to represent languages with a heavy use of diacriticals. Yeah, I know the word diacriticals has a daunting sound to it but it’s the term to describe all those little character marks that some languages seem to use with more frequency. If you’re going to test with different languages, it’s worthwhile to read up a bit on diacritic marks to understand what they are.
I’m highlighting diacriticals in the same way I think of certain characters in Latin-based languages – like names and words that use an apostrophe or an ampersand – two characters that still encounter issues frequently. It’s amazing how often ampersands and apostrophes are not handled well by software. Case in point, importing names and emails with apostrophes in Gmail from CSV files. If these examples seem far-fetched – consider the last name O’Brien and a company name like Smith & Bros.
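Here is a rough Python sketch of the kind of check I have in mind for diacritics and apostrophes. The helper name `fold_for_search` and the sample words are my own assumptions for illustration, not anything from a particular search engine; it only shows one way a search might treat "Malmö" and "Malmo", or a curly and a straight apostrophe, as the same thing.

```python
import unicodedata

def fold_for_search(text: str) -> str:
    """Hypothetical helper: simplify text for a loose, accent-insensitive comparison."""
    # Treat the typographic apostrophe (U+2019) like the plain ASCII one.
    text = text.replace("\u2019", "'")
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep everything else (including & and ').
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lowercase meant for comparisons.
    return stripped.casefold()

# Sample values are illustrative only.
samples = ["O\u2019Brien", "O'Brien", "Smith & Bros", "Malm\u00f6", "Timi\u0219oara"]
for sample in samples:
    print(sample, "->", fold_for_search(sample))
```

Whether the application should fold accents at all is a product decision, so I would use something like this to generate pairs of search terms to try rather than to decide what the right answer is. Also note that some letters, such as ø, do not decompose this way and come through unchanged.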
Then I think about Right to Left languages like Hebrew and Arabic. Even after years (yes, years) of exposure to testing with right to left languages, I get disoriented having scrollbars appear on the left side of a screen as my eye is so trained to expect scrollbars on the right. Both entering characters and having the characters appear in the opposite direction on the screen as I type take a mental adjustment. I adjust, and add UI checkpoints for things like scrollbars, bulleted lists and text alignment on search results.
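For the text alignment checkpoint, it can help to know which direction a given search result should run before eyeballing the screen. This is a minimal sketch that uses Python's standard `unicodedata` module; the function name `first_strong_direction` is my own, and it is a simplification that looks only at the first strongly directional character rather than implementing the full Unicode bidirectional algorithm.

```python
import unicodedata

def first_strong_direction(text: str) -> str:
    """Hypothetical helper: guess text direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        # "R" covers Hebrew letters, "AL" covers Arabic letters.
        if bidi_class in ("R", "AL"):
            return "rtl"
        # "L" covers left-to-right letters such as Latin.
        if bidi_class == "L":
            return "ltr"
    # Digits, spaces and punctuation alone do not set a direction.
    return "neutral"

# Sample strings are illustrative only.
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd"))        # Hebrew word
print(first_strong_direction("\u0645\u0631\u062d\u0628\u0627"))  # Arabic word
print(first_strong_direction("search results"))
print(first_strong_direction("12345"))
```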
Still there are more languages. There is another set of languages that I think of as more “symbolic” than character-based. This interpretation may be my own or widely-shared, I don’t know. But I think of Chinese, Singhalese and Tamil this way – there is no remnant of a Latin character. Instead these languages are more symbolic looking; in the case of Singhalese and Tamil, the languages seem lyrical and flowing while Chinese appears as singularly posed symbols, each with a story and meaning of its own.
From a database perspective, it would be more accurate to discuss the use of Unicode characters and using UTF-8 vs UTF-16. But I test with a mix of technical and logical insights as well as instinctual reactions based on experiences.
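To give a feel for the UTF-8 vs UTF-16 difference, here is a small Python sketch that counts how many bytes the same kind of short word takes in each encoding. The sample words and the `byte_sizes` helper are my own assumptions; the point is only that field lengths measured in bytes behave differently across the language groups.

```python
def byte_sizes(word: str) -> tuple[int, int]:
    """Hypothetical helper: return (UTF-8 byte count, UTF-16 byte count) for a word."""
    # "utf-16-le" is used so the count does not include a byte order mark.
    return len(word.encode("utf-8")), len(word.encode("utf-16-le"))

# Sample words are illustrative only.
samples = {
    "English": "search",
    "Swedish": "s\u00f6k",
    "Hebrew": "\u05d7\u05d9\u05e4\u05d5\u05e9",
    "Chinese": "\u641c\u7d22",
}

for language, word in samples.items():
    utf8_bytes, utf16_bytes = byte_sizes(word)
    print(f"{language}: {len(word)} characters, "
          f"{utf8_bytes} bytes in UTF-8, {utf16_bytes} bytes in UTF-16")
```

A search box or database column that limits input by bytes rather than characters is worth probing with the longer encodings in mind.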
So when it was time to choose a handful of languages to test with, my reaction was to choose:
- one or more Latin-based languages
- one or more languages with a heavy use of diacriticals
- a RTL language
- a language that is more symbolic than character-based
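If it helps to confirm that a pile of test strings really covers all four groups, something like the following Python sketch could sort them. The function `group_of` and the group labels are my own invention and the rules are deliberately crude: it relies on Unicode character names and properties from the standard `unicodedata` module, and it lumps every non-Latin, left-to-right script into one bucket.

```python
import unicodedata

def group_of(text: str) -> str:
    """Hypothetical helper: place a sample string into one of four rough groups."""
    saw_letter = False
    saw_diacritic = False
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation and combining signs
        saw_letter = True
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            return "rtl"
        if not unicodedata.name(ch, "").startswith("LATIN"):
            return "non-latin"
        # Accented Latin letters have a decomposition; plain ones do not.
        if unicodedata.decomposition(ch):
            saw_diacritic = True
    if not saw_letter:
        return "unknown"
    return "latin with diacritics" if saw_diacritic else "latin"

# Sample strings are illustrative only.
for sample in ["search", "s\u00f6k", "\u05e9\u05dc\u05d5\u05dd", "\u641c\u7d22"]:
    print(sample, "->", group_of(sample))
```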
A common problem in testing with these languages is the lack of a keyboard or another means of entering characters from different languages. Cut and paste can work if you’re careful.
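One reason to be careful with cut and paste is that the same visible word can arrive as different underlying characters. The Python sketch below builds a word from code points instead of a keyboard and shows a composed and a decomposed version that look identical but do not compare equal. The variable names and the sample word are my own; this is only an illustration of the pitfall.

```python
import unicodedata

# Characters can be written by name or by code point when no keyboard layout is handy.
o_umlaut = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"
hebrew_shin = chr(0x05E9)

# The same visible word in two forms: one precomposed letter vs. letter + combining mark.
composed = "Malm" + o_umlaut
decomposed = unicodedata.normalize("NFD", composed)

looks_equal_but_is_not = composed != decomposed
lengths = (len(composed), len(decomposed))

# Normalizing pasted text to NFC before using it gives a predictable starting point.
round_trip_matches = unicodedata.normalize("NFC", decomposed) == composed

print(looks_equal_but_is_not, lengths, round_trip_matches)
```

If a search finds one form but not the other, that is worth writing up, since users have no way to see which form they typed or pasted.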
As for where I get text from in all these languages – these three sources seem to have kept me supplied well.
- content from the application I’m testing, which is both handy and means that I’m using words that are used within the application and likely to be found in search results
- user manuals – when I buy something like a new external drive, it often comes with a user manual in a mix of languages – I keep the manuals and use chunks of text
- Wikipedia. The overview of a language generally includes a few phrases I can use and points out diacritics and other insights on a particular language.
It would be great to hear how other people address this type of testing.