A Practical Framework for Search Evaluation



A Data-Driven Approach to Elevating User Experience and Business Performance with Search


Search functionality underpins the user experience of almost every digital product today. Whether it is an e-commerce platform, a content-heavy website, or an internal knowledge base, the quality of your search results can make the difference between user frustration and satisfaction.

But how do you really know whether your search algorithm is returning relevant results? How can you determine that it is fulfilling user needs and driving business objectives? Despite how important this function is, many teams lack a structured approach to evaluating their search algorithms.

That is what this framework for search algorithm evaluation provides. By instituting a systematic procedure for assessing search quality, a business can derive meaningful insights into how its algorithm is performing, decide where to focus improvement efforts, and measure progress over time.

In this post, we will look at an end-to-end framework for evaluating search algorithms: defining relevance using user behavior, quantitative metrics for measuring performance, and how these methods can be adapted to specific business needs.

Search evaluation is not a purely technical exercise; it is a strategic business decision with wide-ranging ramifications. To understand why, consider the place that search holds in today's digital landscape.

For many businesses, search is the primary way users engage with their digital offerings. Whether customers are seeking out products on an e-commerce site, employees are searching an internal knowledge base, or readers are exploring a content platform, the search very often happens first. When this key function underperforms, the consequences can be serious.

Poor search performance drives poor user satisfaction and engagement. Users get frustrated quickly when they can't find what they are looking for. That frustration pushes bounce rates up, reduces time on site, and ultimately results in missed opportunities.

On the other hand, a fine-tuned search function can become one of the biggest drivers of business success. It can increase conversion rates and improve user engagement, sometimes opening completely new streams of revenue. For content sites, improved search may drive advertisement impressions and subscriptions, and for internal systems it may significantly reduce the hours employees lose looking for information.

In an era of personalization, good search functionality lies at the heart of every personalized experience. Evaluating search performance helps you understand users' preferences and behaviors, informing not only search improvements but broader strategic decisions as well.

By investing comprehensively in search evaluation, you are not merely improving a technical function. You are investing in your business's ability to thrive in the digital age.

The basic problem in measuring the performance of a search function is not technical in nature. It is defining what constitutes relevant results for any given search by any user. To put it simply, the question being asked is: for any particular search, what are good search results?

This is highly subjective, since different users may have different intentions and expectations for the same query. The definition of quality also varies by business segment; each type of business needs to answer this question in its own way, according to its objectives and user demographics.

Though the problem is complex and subjective, it has driven the search community to develop several widely adopted metrics and methods for assessing search algorithms. These methods operationalize, and thus attempt to quantify, relevance and user satisfaction, providing a way to assess and improve search performance. No single method captures the whole complexity of search relevance, but in combination they give valuable insights into how well a search algorithm serves its users. In the remaining sections, we will look at some common evaluation methods, including clickstream analytics and human-centered approaches.

Clickstream Analytics

Some of the most common metrics are those derived from users' actions as they interact with the website. The first is the clickthrough rate (CTR), which is the proportion of users who click on a result after seeing it.

The clickthrough rate doesn’t necessarily measure the relevance of a search result, as much as it does attractiveness. However, most businesses still tend to prioritize attractive results over those that users tend to ignore.

Secondly, there is dwell time, which is the amount of time a user spends on a page after clicking on it. A relatively low dwell time indicates that the user is not engaging with the content, which could mean that the search result in question is irrelevant to them.

We also have the bounce rate (BR). The bounce rate is the proportion of users who leave the search without clicking on any results.

Generally, a high bounce rate indicates that none of the search results were relevant to the user, so a good search engine tends to minimize the bounce rate.

Finally, another metric to analyze (where applicable) is the task completion rate (TCR). The task completion rate is the proportion of users who performed a desirable task (e.g., bought a product) out of all those who viewed the results.

This metric is highly industry- and use-case-specific. For example, an e-commerce business would prioritize it greatly, whereas an academic journal generally wouldn't. A high task completion rate indicates that the product or service is desirable to customers, and therefore worth prioritizing in the search algorithm.
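
As a minimal sketch of how these clickstream metrics might be computed, the snippet below assumes a hypothetical list of per-search session records with clicked, dwell_seconds, and completed_task fields; the field names and example values are illustrative rather than taken from any particular analytics tool.

```python
# Minimal sketch: clickstream metrics from hypothetical per-search session records.
# Each record describes one search session; field names and values are illustrative.
sessions = [
    {"clicked": True,  "dwell_seconds": 95.0,  "completed_task": True},
    {"clicked": True,  "dwell_seconds": 12.0,  "completed_task": False},
    {"clicked": False, "dwell_seconds": 0.0,   "completed_task": False},
    {"clicked": True,  "dwell_seconds": 240.0, "completed_task": True},
]

total = len(sessions)
clicked_sessions = [s for s in sessions if s["clicked"]]

ctr = len(clicked_sessions) / total                                   # clickthrough rate
bounce_rate = sum(not s["clicked"] for s in sessions) / total         # left without clicking
avg_dwell = sum(s["dwell_seconds"] for s in clicked_sessions) / len(clicked_sessions)
tcr = sum(s["completed_task"] for s in sessions) / total              # task completion rate

print(f"CTR={ctr:.2f}, BR={bounce_rate:.2f}, dwell={avg_dwell:.1f}s, TCR={tcr:.2f}")
```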

Human-Centered Evaluation Methods

While clickstream analytics provide useful quantitative data, human-centered evaluation methods contribute critical qualitative insights into search relevance. These approaches rely on direct human judgment to gather feedback on both the quality and the relevance of search results.

Probably the most straightforward way to measure search effectiveness is simply to ask users. This can be something as basic as a thumbs-up/thumbs-down button beside every search result, allowing users to indicate whether a result is useful. More detailed questionnaires, ranging from very basic to quite elaborate, can probe user satisfaction and specific aspects of the search experience, giving first-hand, valuable data about user perceptions and needs.

More formally, many organizations use panels of reviewers, search analysts, or engineers. A variety of test queries are generated, and the results are rated against predefined criteria or scales (e.g., relevance grades from 1 to 10). Although this process is potentially time-consuming and costly, it provides a nuanced assessment that an automated system cannot match. Reviewers can appraise contextual relevance, content quality, and, most importantly, relevance to business objectives.

Task-based user testing provides information about what happens when users try to accomplish particular tasks using the search. It gives insight not only into result relevance but also into how search contributes to the overall experience, including ease of use and satisfaction. These methods bring to light usability issues and user behaviors that are sometimes obscured by quantitative data alone.

These human-centered methods, though much more resource-intensive than automated analytics, offer deep insights into search relevance. Used in conjunction with quantitative methods, they allow an organization to develop an understanding of its search performance and identify areas for targeted improvement.

With a system in place to define what constitutes good search results, it’s time to measure how well our search algorithm retrieves such results. In the world of machine learning, these reference evaluations are known as the ground truth. The following metrics apply to the evaluation of information retrieval systems, most of which have their counterpart in recommender systems. In the following sections, we will present some of the relevant quantitative metrics, from very simple ones, such as precision and recall, to more complex measures, like Normalized Discounted Cumulative Gain.

Confusion Matrix

While it is normally a tool in the machine learning arsenal for classification problems, a confusion matrix can be effectively adapted to the evaluation of search algorithms. It provides an intuitive way to measure search performance, because results are simply classified as relevant or irrelevant. Furthermore, several important metrics can be computed from it, making it more useful while remaining simple to apply. The confusion matrix as applied to information retrieval can be seen below.

Confusion Matrix for Retrieval Systems

Here, for a given search query, each result falls into one of four buckets: it was relevant and correctly retrieved (RR), it was irrelevant but retrieved anyway (IR), it was relevant but ignored (RI), or it was irrelevant and correctly ignored (II).

What we mostly need to consider is the first page, because most users rarely go beyond it. We therefore introduce a cutoff point, usually equal to the number of results per page.

Let's run through an example. Say we have an e-commerce site listing 10 products per page, and 8 of the 50 products in the catalog are actually relevant to a query. The search algorithm managed to place 7 of them on the first page. In this case:

  • RR = 7 (relevant products correctly returned)
  • IR = 3 (10 total on page — 7 relevant = 3 irrelevant results shown)
  • RI = 1 (8 total relevant — 7 shown = 1 relevant product missed)
  • II = 39 (50 total products — 10 shown — 1 missed relevant = 39 correctly ignored)

The key metrics that can be derived from the confusion matrix include precision and recall. Precision is the proportion of retrieved items that are relevant. In the given example that would be 7/10. This is also known as Precision @ K, where K is the cutoff point for the top-ranked items.

Recall is the proportion of relevant items that are retrieved. In the given example that would be 7/8.

These are both important metrics to keep track of: a low precision indicates that the user is seeing many irrelevant results, and a low recall indicates that many relevant results never show up for users. The two are combined and balanced in a single metric, the F1-score, which is their harmonic mean. In the above example, the F1-score is 7/9.
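
To make the arithmetic concrete, here is a small sketch that derives the four confusion matrix counts, Precision@K, recall, and F1 from a retrieved first page and a ground truth set of relevant items; the product IDs are made up for illustration.

```python
# Sketch: confusion-matrix counts and derived metrics for one query.
retrieved = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]  # first page (K=10)
relevant = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p42"}               # ground truth (8 items)
catalog_size = 50

rr = len([doc for doc in retrieved if doc in relevant])  # relevant and retrieved
ir = len(retrieved) - rr                                 # irrelevant but retrieved
ri = len(relevant) - rr                                  # relevant but ignored
ii = catalog_size - len(retrieved) - ri                  # irrelevant and correctly ignored

precision_at_k = rr / len(retrieved)                           # 7/10
recall = rr / len(relevant)                                    # 7/8
f1 = 2 * precision_at_k * recall / (precision_at_k + recall)   # 7/9

print(rr, ir, ri, ii, round(precision_at_k, 3), round(recall, 3), round(f1, 3))
```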

We can attribute two main limitations to this simple measure of search performance. The first is that it does not take into account the position of a result within the ranking, only whether it was retrieved. This can be mitigated by expanding the metrics derived from the confusion matrix into more advanced ones such as Mean Average Precision (MAP). The second limitation, apparent from our example, is that if there are fewer relevant results (according to the ground truth) than results per page, the algorithm can never achieve a perfect score even if it retrieves all of them.

Overall, the confusion matrix provides a simple way to examine the performance of a search algorithm by classifying search results as either relevant or irrelevant. It is a fairly simplistic measure, but it pairs easily with most evaluation methods, particularly those where the user provides thumbs-up/thumbs-down feedback on specific results.

Classical Error Metrics

Most databases that store search indices, such as OpenSearch, assign scores to search results and retrieve the documents with the highest scores. If these scores are available, further key metrics can be derived by comparing them against ground truth scores.

One very common metric is the mean absolute error (MAE), which compares the scores deemed correct or ideal with the scores the algorithm assigns to the search results. The mean of the absolute deviations is then taken:

MAE = (1/n) Σ |ŷ_i - y_i|

Here, the hat denotes the estimated score and y is the actual (ground truth) score for a given search result.

A higher MAE indicates that the search algorithm is performing poorly, while an MAE of zero means it performs ideally according to the ground truth.

A similar but even more common metric is the mean squared error (MSE), which is akin to the mean absolute error except that each deviation is squared:

MSE = (1/n) Σ (ŷ_i - y_i)²

The main advantage of using MSE over MAE is that MSE penalizes extreme deviations more heavily, so a few very poorly performing queries will raise the MSE much more than they would the MAE.
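
As a small sketch, assuming we have ground truth relevance scores alongside the scores the engine assigned to the same results, MAE and MSE can be computed directly; the score values below are invented for illustration.

```python
# Sketch: MAE and MSE between engine-assigned scores and ground truth scores.
ground_truth = [3.0, 2.0, 0.0, 1.0, 2.0]   # ideal relevance scores (y)
predicted = [2.5, 2.0, 1.0, 0.0, 2.0]      # scores assigned by the search engine (y-hat)

errors = [p - t for p, t in zip(predicted, ground_truth)]
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)

print(f"MAE={mae:.3f}, MSE={mse:.3f}")
```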

Overall, when scores are assigned to results, these classical methods let us quantify the difference between the relevance perceived by the search algorithm and the relevance we observe in empirical data.

Advanced Information Retrieval Metrics

Many organizations turn to advanced metrics such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) to gain deeper insight into their search systems' performance. These metrics capture aspects of search quality that simple precision and recall miss.

Normalized Discounted Cumulative Gain (NDCG) is a metric for the quality of ranking in search results. It is particularly useful when graded relevance scores are available, because it considers both how relevant the results are and where they appear in the search output. The central idea of NDCG is that highly relevant results should be displayed at the top of the list. To calculate NDCG, one first computes the DCG: the sum of the relevance scores of the returned results, each discounted by the logarithm of its position. This is then normalized against the DCG of an ideal ranking to produce a score between 0 and 1. The DCG calculation is:

DCG@p = Σ (i = 1 to p) rel_i / log₂(i + 1)

Here, p is the cutoff position in the ranking and rel_i is the relevance score of the result at position i. This calculation is done for both the algorithm's ranking and the ground truth ranking, and the quotient of the two is the NDCG: NDCG@p = DCG@p / IDCG@p.

In the above equation, IDCG refers to the DCG computed on the ideal, ground truth ordering of relevance scores. What makes NDCG especially useful is that it accommodates multi-level relevance judgments: it can differentiate results that are somewhat relevant from those that are highly relevant. Moreover, the logarithmic discount reflects the fact that users rarely look at results further down the list. A perfect NDCG of 1 means the algorithm is returning results in the optimal order of relevance.
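
Below is a minimal sketch of the NDCG calculation in Python, assuming we have graded relevance scores for the returned results in the order the engine ranked them; the relevance grades are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of (position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the results in the order the engine returned them (illustrative).
returned_relevance = [3, 2, 3, 0, 1, 2]
print(f"NDCG={ndcg(returned_relevance):.3f}")
```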

In contrast, Mean Reciprocal Rank (MRR) focuses on the rank of the first relevant result. The MRR is the average, over a collection of queries, of the reciprocal of the rank at which the first relevant document was found:

MRR = (1 / |Q|) Σ (i = 1 to |Q|) 1 / rank_i

Here, Q denotes the set of queries and rank_i denotes the position of the first relevant result for query i. MRR values lie between 0 and 1, where higher is better. An MRR of 1 means that for every query, a relevant result was always returned in the top position. This metric is especially useful for assessing search in applications where users typically look for a single piece of information, such as question-answering systems or searching for a specific product on an e-commerce platform.
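
And a small sketch of MRR, assuming that for each query we already know the rank of the first relevant result (or that none was returned); the example ranks are invented.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank of the first relevant result; None means no relevant result was returned."""
    reciprocals = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)

# Rank of the first relevant result for each query (None = nothing relevant returned).
ranks = [1, 3, None, 2]
print(f"MRR={mean_reciprocal_rank(ranks):.3f}")
```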

Taken together, these metrics build a well-rounded picture of how your search algorithm performs.

Every search algorithm needs a comprehensive evaluation system, one that merges the qualitative methods outlined above with the quantitative metrics.

While automated metrics play a powerful role in providing quantitative data, one should not forget the role of human judgment in truly assessing search relevance. Add context to the evaluation process through regular expert reviews and reviews of user feedback. Qualitative feedback from experts and users can give meaning to sometimes ambiguous quantitative results and shed light on issues that automated metrics fail to pick up on. The human element puts your measurements into context, ensuring you optimize not just for numbers but for real user satisfaction.

Finally, the metrics need to be tuned to business requirements. A measure that fits an e-commerce site may not apply at all to a content platform or an internal knowledge base. The evaluation framework should be tailored to its context, based on relevance to business aims and the expectations placed on the algorithm being measured. Regularly reviewing and adjusting the evaluation criteria keeps them consistent with changing business objectives and the needs of end users.
