Abstract
To improve the relevance of search results on Wikipedia and other Wikimedia Foundation projects, the Search Platform team has focused its efforts on ranking results with machine learning. The first iteration used a click-based model to create relevance labels. Here we present a way to augment that training data with a deep neural network (although other models were also considered) that can predict the relevance of a wiki page to a specific search query based on users’ responses to the question “If someone searched for ‘…’, would they want to read this article?”. The model has 80% overall accuracy, over 90% accuracy when there are at least 40 responses, and 100% accuracy when there are 70 or more responses. Once deployed, we can combine users’ responses to search relevance surveys with this model to create relevance labels for training the ranker.

Phabricator ticket | Open source analysis | Open data
Our current efforts to improve the relevance of search results on Wikipedia and other Wikimedia Foundation projects are focused on information retrieval using machine-learned ranking (MLR). In MLR, a model is trained to predict a document’s relevance to a query from various document-level and query-level features. The first iteration used a click-based Dynamic Bayesian Network (implemented via the ClickModels Python library) to create relevance labels for the training data fed into XGBoost. For further details, please refer to First assessment of learning-to-rank: testing machine-learned ranking of search results on English Wikipedia.
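As a rough illustration of the MLR setup (not the production pipeline), a ranking model can be trained with XGBoost along these lines; the features, labels, and query grouping below are hypothetical placeholders:

```python
import numpy as np
import xgboost as xgb

# Hypothetical training data: one row per (query, document) pair.
X = np.random.rand(100, 5)          # document- and query-level features
y = np.random.randint(0, 11, 100)   # relevance labels on a 0-10 scale
groups = [10] * 10                  # documents per query, in row order

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)            # rows sharing a query form one group

# LambdaMART-style ranking objective that optimizes NDCG.
params = {"objective": "rank:ndcg", "eta": 0.1, "max_depth": 6}
ranker = xgb.train(params, dtrain, num_boost_round=100)
```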
To augment the click-based training data that the ranking model uses, we decided to try crowd-sourcing relevance opinions. Our initial prototype showed promise, so we decided to scale it up. It was not feasible for us to manually grade thousands of pages to craft a gold standard, so instead we used a previously collected dataset of relevance scores from an earlier project called Discernatron.
Discernatron is a search relevance tool developed by the Search Platform team (formerly the Discovery department). Its goal was to help improve search relevance - showing the articles that are most relevant to search queries - with human assistance. We asked people to use Discernatron to review search results gathered from four tools - CirrusSearch (wiki search), Bing, Google, and DuckDuckGo. For each query, users were presented with the combined set of titles found and asked to rate each title from 1 to 4, or leave it unrated. For each query-result pair, we calculated an average score. After examining the scores, we decided that a score of 0-1 indicated the result was “irrelevant” and a score greater than 1 (up to a maximum of 3) indicated the result was “relevant”. These are the labels we trained our models to predict.
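A minimal sketch of that label construction, assuming a hypothetical table of graders’ ratings with one row per (query, title, grader):

```python
import numpy as np
import pandas as pd

# Hypothetical grader ratings: one row per (query, title, grader).
ratings = pd.DataFrame({
    "query": ["q1", "q1", "q1", "q2"],
    "title": ["Title A", "Title A", "Title B", "Title C"],
    "score": [3, 2, 1, 1],
})

# Average the graders' scores for each query-result pair...
avg = ratings.groupby(["query", "title"], as_index=False)["score"].mean()

# ...then binarize: 0-1 is "irrelevant", anything above 1 is "relevant".
avg["label"] = np.where(avg["score"] > 1, "relevant", "irrelevant")
```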
However, since graders have different opinions and some scores are calculated from more data than others, we separated the scores into those that are “reliable” and those that are “unreliable”, based on whether Krippendorff’s alpha exceeded a threshold of 0.45. (Refer to RelevanceScoring/Reliability.php and RelevanceScoring/KrippendorffAlpha.php in Discernatron’s source code for implementation details.)
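Discernatron implements the alpha computation in PHP; in Python the same reliability check could be sketched with the krippendorff package, using a hypothetical placeholder ratings matrix:

```python
import numpy as np
import krippendorff

# Hypothetical ratings: one row per grader, one column per query-result
# pair; np.nan marks pairs a grader did not rate.
ratings = np.array([
    [3.0,    2.0, 1.0, np.nan],
    [3.0,    3.0, 1.0, 2.0],
    [np.nan, 2.0, 1.0, 2.0],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")

# Scores with this much inter-grader agreement count as "reliable".
is_reliable = alpha > 0.45
```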
The (binary) classifiers trained are:

- Logistic Regression (multinom)
- Shallow Neural Network (nnet)
- Deep Neural Network (dnn)
- Naive Bayes (nb)
- Random Forest (rf)
- Gradient-boosted Trees (xgbTree)
- C5.0 trees
In addition to the 7 classifiers listed above (which we will refer to as base learners), we also trained a super learner using a technique called stacking. The base learners are trained in the first stage and are then asked to make predictions. Those predictions are then used as features (one feature per base learner) to train the super learner in the second stage. Specifically, we chose a Bayesian network (via the bnclassify package) as the super learner, coded as “meta” in the Results tables.
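Our super learner was a Bayesian network fit with the R package bnclassify; the following is only a conceptual sketch of the two-stage stacking scheme, using scikit-learn stand-ins for both stages:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical features and binary relevance labels.
X, y = np.random.rand(500, 6), np.random.randint(0, 2, 500)

# Use separate slices for the two stages to avoid leaking the base
# learners' training data into the super learner.
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5)

# Stage 1: fit the base learners.
base_learners = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100),
    GaussianNB(),
]
for clf in base_learners:
    clf.fit(X_base, y_base)

# Stage 2: the base learners' predictions become the features
# (one per base learner) used to train the super learner.
meta_features = np.column_stack(
    [clf.predict_proba(X_meta)[:, 1] for clf in base_learners]
)
super_learner = LogisticRegression().fit(meta_features, y_meta)
```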
Furthermore, we trained a pair of deeper neural networks for binary classification. We utilized Keras with a TensorFlow backend and performed the training separately from the others, which is why these two are not included as base learners in the stacked training of the super learner. The results from these two models are coded as “keras-1” and “keras-2” in the Results tables; their exact architectures are given in the open source analysis.
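For illustration only (the real “keras-1” and “keras-2” layer specifications are in the open source analysis, and the sizes below are hypothetical), a Keras binary classifier of this kind looks roughly like:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical survey-derived features and binary relevance labels.
X = np.random.rand(1000, 6)
y = np.random.randint(0, 2, 1000)

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "relevant"
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
```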
We utilized the caret package to perform hyperparameter tuning (via 5-fold cross-validation) and training (using 80% of the available data) of each base learner for each combination of the following: labels derived from the reliable vs. unreliable Discernatron scores, and predictors that were either standardized or normalized. Standardization means the predictor was centered around the mean and scaled by the standard deviation; normalization was achieved by dividing by the maximum observed value and then subtracting 0.5 to center it around 0.
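The tuning itself was done in R with caret; a conceptually analogous setup in Python with scikit-learn (with a hypothetical model and tuning grid) would be:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features and binary relevance labels.
X, y = np.random.rand(500, 6), np.random.randint(0, 2, 500)

# Train on 80% of the data, mirroring the split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y)

# Standardize predictors, then tune a shallow network via 5-fold CV.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("nnet", MLPClassifier(max_iter=2000)),
])
grid = GridSearchCV(
    pipeline,
    param_grid={"nnet__hidden_layer_sizes": [(5,), (10,), (20,)]},
    cv=5,
)
grid.fit(X_train, y_train)
test_accuracy = grid.score(X_test, y_test)
```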
To correct for class imbalance – there were more irrelevant articles than relevant ones – instances were upsampled. In the case of Keras-constructed DNNs, we specified class weights such that the model paid slightly more attention to the less-represented “relevant” articles.
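Reusing model, X, and y from the Keras sketch above, the two corrections amount to the following (the upsampling target and class weights are hypothetical):

```python
from sklearn.utils import resample

# Upsample the minority ("relevant") class to match the majority count.
X_rel, X_irr = X[y == 1], X[y == 0]
X_rel_up = resample(X_rel, replace=True, n_samples=len(X_irr))
X_balanced = np.vstack([X_irr, X_rel_up])
y_balanced = np.concatenate([np.zeros(len(X_irr)), np.ones(len(X_irr))])

# For the Keras DNNs, class weights were specified instead, nudging the
# model to pay slightly more attention to the rarer "relevant" class.
model.fit(X, y, epochs=10, batch_size=32,
          class_weight={0: 1.0, 1: 1.5})
```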
Classifier | Accuracy (reliable scores) | Accuracy (unreliable scores)
---|---|---
Logistic Regression (multinom) | 65.1% | 65.9% |
Shallow NN (nnet) | 64.5% | 64.0% |
Deep NN (dnn) | 67.5% | 66.0% |
Deeper NN (keras-1) | 73.7% | 62.5% |
Much Deeper NN (keras-2) | 63.0% | 58.4% |
Naive Bayes (nb) | 53.7% | 40.0% |
Random Forest (rf) | 66.5% | 65.1% |
Gradient-boosted Trees (xgbTree) | 66.3% | 64.1% |
C5.0 trees | 66.2% | 64.6% |
Bayesian Network super learner (meta) | 62.0% | 61.4% |
In this section, we fit a beta regression model relating classification performance to the different components, to help us see the impact of each component. We chose beta regression because accuracy is a value between 0 and 1, and this family of models enables us to work with it without transforming the response variable.
Component | Estimate
---|---
(Intercept) | -0.0564
reliability | 0.1602
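The regression was fit as part of the R analysis; a comparable fit in Python could use the BetaModel available in recent versions of statsmodels (0.13+). The data frame below pairs a few of the accuracies from the Results table with a reliability indicator, purely for illustration:

```python
import pandas as pd
from statsmodels.othermod.betareg import BetaModel

# Accuracies from the Results table, with 1 = reliable Discernatron
# scores and 0 = unreliable (multinom, keras-1, and nb shown here).
results = pd.DataFrame({
    "accuracy":    [0.651, 0.659, 0.737, 0.625, 0.537, 0.400],
    "reliability": [1,     0,     1,     0,     1,     0],
})

# Beta regression keeps the response in (0, 1) without transformation;
# coefficients are on the log-odds scale under the default logit link.
fit = BetaModel.from_formula("accuracy ~ reliability", results).fit()
print(fit.summary())
```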
After looking at the results of the beta regression in the table and figure, we came to the following conclusions:
The model’s performance increases as we collect more responses to calculate the score from – the model is very accurate with at least 40 yes/no/unsure/dismiss responses and the most accurate with at least 70 responses. That is, we do not necessarily need to wait until we obtain at least 70 responses from users before we request a relevance prediction from the model, but we should wait until we have at least 40.
By deploying “keras-1”, trained just on survey responses to “If someone searched for…”, we will be able to classify a wiki page as relevant or irrelevant to a specific search query. In order to use the model to augment the training data for the ranking model, we should wait until we have at least 40 responses, but preferably until we have at least 70. Once we have a sufficient number of responses, we can request a probability of relevance and then map it to the 0-10 scale that the ranking model expects.
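The exact mapping is not specified here; one simple possibility (purely illustrative) is a linear rescaling of the predicted probability:

```python
def probability_to_grade(p: float) -> int:
    """Map a predicted probability of relevance (0-1) onto the 0-10
    relevance scale the ranker expects. Illustrative only; the
    production mapping may differ."""
    return round(p * 10)

probability_to_grade(0.87)  # -> 9
```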
The model can also be easily included in the pipeline for MjoLniR – our Python- and Spark-based library for handling the backend data processing for machine-learned ranking at Wikimedia – because the final model was created with the Python-based Keras. Furthermore, since the final model uses only the survey response data and does not include additional signals such as traffic or information about the page, the pipeline simply needs to calculate a score from users’ yes/no responses, the proportion of users who were unsure, and users’ engagement (responses/impressions) with the survey.
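A minimal sketch of that feature computation, assuming hypothetical raw counts per (query, page) pair and hypothetical feature definitions:

```python
def survey_features(yes: int, no: int, unsure: int, dismiss: int,
                    impressions: int) -> dict:
    """Derive model inputs from raw survey counts for one (query, page)
    pair. The feature names and formulas here are hypothetical."""
    responses = yes + no + unsure + dismiss
    return {
        # Net sentiment among users who answered yes or no.
        "score": (yes - no) / max(yes + no, 1),
        # Proportion of respondents who were unsure.
        "prop_unsure": unsure / max(responses, 1),
        # Engagement: responses per survey impression.
        "engagement": responses / max(impressions, 1),
    }

survey_features(yes=32, no=6, unsure=4, dismiss=8, impressions=90)
```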
Finally, it would be interesting to evaluate how robust the models are when predicting relevance from response data for questions other than the one they were trained on. For example, how well does the model trained on responses to “If someone searched for…” perform when it is given responses to “If you searched for…”?