Linguistic heuristics

Searching. I’m currently testing a product that has a search feature. I’ve tested search functionality before but not a multilingual search engine that utilizes two different search engines based on the language selected. Nor have I previously worked with a search engine that uses stemming and stop words.

I almost didn’t want to write about this – I figured I would wait until I had resolved my challenges – but then I asked myself, why wait? If software testing has taught me anything, it’s that first understanding the complexities of a problem is an essential starting point.

Some basic questions about testing search functionality.

How do I know a search result set is accurate?

How do I know whether the search results returned should have more matches, or fewer?

If I sort a search result set according to the displayed column headings and the data displayed meets the criteria specified, is that good enough? (My answer, no.)

Being a data geek, my past process has been to know the data set being tested. I’ve done this before – troll about the database, execute SQL queries, and become aware of some of the data well enough to confirm search results. I’ve coupled this tactic with building my own data. I create small sets of data meeting certain criteria, insert the data, and then execute searches. I specifically plant data for testing. Are there other ways to prove a search engine is working?
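The planted-data idea can be sketched in a few lines. This is only an illustration under my own assumptions: the `articles` table, its rows, and the `search` function are all hypothetical, and the `LIKE` query is a stand-in for whatever search feature is really under test. The point is that because I planted the rows, I know which IDs should come back, and I can report both missing and unexpected matches.

```python
import sqlite3

# Hypothetical rows, planted only for this test.
PLANTED = [
    (1, "Testing search engines"),
    (2, "A tester's notebook"),
    (3, "Gardening for beginners"),
]

def build_db():
    """Create an in-memory database holding only the planted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?, ?)", PLANTED)
    return conn

def search(conn, term):
    """Stand-in for the search under test: a simple substring match."""
    rows = conn.execute(
        "SELECT id FROM articles WHERE title LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]

def compare(expected, actual):
    """Return (missing, unexpected) IDs so both kinds of error are visible."""
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    return missing, unexpected

conn = build_db()
expected_ids = [1, 2]  # known in advance because the data was planted
missing, unexpected = compare(expected_ids, search(conn, "test"))
print("missing:", missing, "unexpected:", unexpected)
```

A result set with fewer matches than expected shows up in `missing`, and one with too many shows up in `unexpected`, which speaks to the "more matches, or fewer" question above.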

I’ve attacked the challenge in a couple of ways.

One step has been to learn about stemming. Stemming is the process of reducing a word to its base form and using that base, or core word, as the search criterion. For example, the words testing, tester, and testy would be reduced to the core word: test.
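To make the idea concrete for myself, here is a toy suffix stripper. It is not the algorithm used by the product I’m testing or by any real search engine; the suffix list and the minimum stem length are assumptions I picked so that the example words above come out as "test". Real stemmers are far more careful, especially for a language like Arabic.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "er", "y", "s")
MIN_STEM_LENGTH = 3  # assumption: never strip a word down to fewer than 3 letters

def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ("testing", "tester", "testy"):
    print(w, "->", toy_stem(w))
```

If the search engine stems this way, a search for any one of those words should return documents containing the others, which gives me a concrete expectation to check.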

I found the following article helpful – in part because it explains stemming and because it discusses Arabic stemming specifically and Arabic is one language I’m working with. See: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf.

Another step has been learning about stop words. Stop words are words that a search engine discards instead of matching on them, typically very common words such as “the” or “and”.
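A minimal sketch of stop-word filtering, assuming a made-up English stop list (the real list depends on the search engine and the language, and I don’t yet know what the product uses):

```python
# Hypothetical stop list; a real engine ships its own, per language.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}

def strip_stop_words(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(strip_stop_words("The Cat in the Hat"))
```

One test idea that falls out of this: two queries that differ only by stop words should produce the same result set.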

As for working with a multilingual search, one thing I’ve done is to head to the library. I thought it would be helpful to know a few basic words in different languages. So I dug into two sections in opposite spots of the library. My first stop was the travel section, where I picked up travel books that featured common phrases – I especially like Barron’s travel books because the focus is on the business traveler, so the words listed are more in keeping with the content I’m working with. My next stop was the children’s section. I found a Cat in the Hat book in Polish, complete with a CD that’s been interesting to listen to – although not truly helpful with my challenge. I also picked up a book that lists 12 core phrases in 12 different languages, a kids’ book designed as a multilingual introduction.

So far, I’ve found that getting content in other languages has been more helpful than trying to write clever phrases on my own. I’ve been able to pull back search results and pick up words to use. In many cases I don’t know the word or its meaning, so looking through results in foreign languages feels more like pattern matching. In fact, I’ve tried not to be distracted by the content or the language but instead to focus on matching entry criteria to results.

Another plan I have is to build a small set of words in different languages. What’s challenging is working with a language that I don’t know and that I don’t have a keyboard to create text in – such as Arabic.
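One possible way around the missing keyboard, offered as a sketch and not something I’ve settled on: build the test words from Unicode code points. The example below spells the Arabic word for "book" from four code points and prints the official character names, so I can confirm the letters without being able to read them. The choice of word is just mine for illustration.

```python
import unicodedata

# Arabic word for "book", built from code points instead of a keyboard:
# U+0643 KAF, U+062A TEH, U+0627 ALEF, U+0628 BEH
word = "\u0643\u062a\u0627\u0628"

# Look up the Unicode name of each character to double-check the spelling.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")
```

Words built this way can go straight into planted test data or be pasted into the search box.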

I’m not done testing and I’m not done with the questions. So more to come …
