Abstract
The Wikimedia Web team ran an A/B test from 20 October 2020 to 4 November 2020 to assess the impact of moving the search bar to a more prominent location on the top of the desktop page. The test included all logged-in users on the following early adopter wikis: Basque Wikipedia, French Wikipedia, French Wiktionary, Hebrew Wikipedia, Persian Wikipedia, and Portuguese Wikiversity. Results varied overall and by Wiki project. Estimates indicated that the new search location resulted in a higher search session completion rate overall and on Persian Wikipedia; however, there was not sufficient evidence to definitively say that the new search location increased search sessions initiated.
The Wikimedia Foundation’s Web team is researching and building improvements to the desktop experience to make Wikimedia wikis more welcoming and more useful for readers, while maintaining their utility for existing editors.
As part of this effort, the Web team deployed a new location of the search bar, moving the search bar to a more prominent location on the top of the page. The team ran an A/B test of the new location from 20 October 2020 through 4 November 2020 to assess the efficacy of this feature. The test included all logged-in users on the early adopter wikis (Basque Wikipedia, French Wikipedia, French Wiktionary, Hebrew Wikipedia, Persian Wikipedia, and Portuguese Wikiversity). In the test, 50% of logged-in users saw the search bar in the new location, while the other 50% continued to see the search bar in the previous location.
The new location (shown in Figure 1) was also deployed as the default for anonymous users on our early adopter wikis, and was available through user preferences for all other users.
Figure 1: Example of the new location of the search bar on desktop. The search bar was moved to a more prominent location on the top of the desktop page.
The primary goal of the AB test was to test the hypothesis that the group with the search bar in the new location would initiate more search sessions, with a target of a 2.5% overall increase in search sessions initiated. The other primary questions we wanted to answer are:
Upon conclusion of the test on 4 November 2020, a total of 61,602 search sessions had been initiated across both groups. You can find more information on this change and other feature deployments on the desktop improvement project page.
The AB test was run on a per-wiki basis. Users included in the test were randomly assigned, using their user ID, to either the control group (search bar in the old location) or the treatment group (search bar in the new location), and received the same treatment for the duration of the test.
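The report states only that assignment was based on the user ID and stayed fixed for the whole test. One common way to get that property is to derive the bucket deterministically from the ID, so a user lands in the same group every time. The sketch below assumes a hash-then-modulo rule; the salt string, the SHA-256 hashing, and the `assign_bucket` name are hypothetical and are not taken from the actual Vector implementation.

```python
import hashlib


def assign_bucket(user_id: int, wiki: str, salt: str = "search-location-test") -> str:
    """Hypothetical deterministic 50/50 assignment for a logged-in user.

    The same (salt, wiki, user_id) always maps to the same bucket, so a user
    keeps their treatment for the duration of the test. The wiki is part of
    the key because the test ran per wiki and user IDs are local to a wiki.
    """
    key = f"{salt}:{wiki}:{user_id}".encode("utf-8")
    # Interpret the first 8 bytes of the SHA-256 digest as an unsigned integer.
    value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return "treatment" if value % 2 == 0 else "control"
```

For example, `assign_bucket(101, "frwiki")` returns the same bucket on every call, and across many user IDs the two buckets come out close to an even split.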
Data was collected in the SearchSatisfaction event logging table.
See the following Phabricator tickets for further details regarding the instrumentation and implementation of the AB test:
A search session begins when a person starts typing in the search widget. We measured the number of unique search sessions started for each search location type and wiki included in the AB Test.
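Counting unique search sessions amounts to de-duplicating events by session identifier within each wiki and search location group. The sketch below assumes each event is a dict with `wiki`, `group`, and `search_session_id` keys; these field names are illustrative and not necessarily the SearchSatisfaction schema's actual field names.

```python
from collections import defaultdict


def count_unique_sessions(events):
    """Count distinct search sessions per (wiki, group).

    `events` is an iterable of dicts with hypothetical keys 'wiki', 'group'
    and 'search_session_id'. Several events can share one session ID
    (for example, one event per search or per result click), so each
    session is counted only once.
    """
    sessions = defaultdict(set)
    for event in events:
        sessions[(event["wiki"], event["group"])].add(event["search_session_id"])
    return {key: len(ids) for key, ids in sessions.items()}
```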
We initially considered using a Welch two-sample t-test to determine whether there is a statistically significant difference between the two group means; however, since there are only 6 observations per group (each wiki serves as a single observation in this case), the test did not provide enough evidence to confirm a difference in the number of search sessions initiated between the two groups.
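To illustrate why six observations per group say so little, the Welch statistic can be computed directly from the per-wiki totals in the table of search sessions initiated by search location group. This is a sketch, not the analysis code. The standard library has no t distribution, so it returns only the statistic and the Welch-Satterthwaite degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(treatment, control, equal_var=False)` also returns a p-value.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Search sessions initiated per test wiki (one observation per wiki),
# in the same row order as the table of sessions by search location group.
treatment = [872, 22750, 1943, 1047, 2527, 12]  # new search location
control = [649, 26235, 1709, 1394, 2457, 7]     # old search location

t, df = welch_t(treatment, control)
```

On these counts the statistic is close to zero (t ≈ -0.10 on roughly 9.8 degrees of freedom): the spread between wikis, driven by one very large wiki, dwarfs the difference between the group means.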
As an alternative, we conducted a pre- and post-deployment analysis comparing changes in search sessions initiated on the 6 test wikis with changes on a set of wikis not included in the test, from 6 October 2020 through 3 November 2020 (two weeks before and after deployment of the test).
For each test wiki, we selected one or more similar wiki projects for comparison, based on size (number of articles) and desktop pageview traffic during the reviewed time period.
Test Wiki | Similar Wikis |
---|---|
Basque Wikipedia | Slovenian Wikipedia, Belarusian Wikipedia |
French Wikipedia | German Wikipedia |
French Wiktionary | Russian Wiktionary, Spanish Wiktionary |
Hebrew Wikipedia | Catalan Wikipedia, Danish Wikipedia, Romanian Wikipedia |
Persian Wikipedia | Arabic Wikipedia, Indonesian Wikipedia |
Portuguese Wikiversity | Japanese Wikiversity |
Note: For wikis not included in the AB test, the new search location was displayed for all users who had enabled the new Vector skin in their user preferences. In this analysis, we excluded these users by reviewing only search sessions initiated on the old (‘legacy’) Vector skin.
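The comparison wikis above were chosen by size and desktop traffic, but the report does not give an exact selection rule. One way to make that kind of choice reproducible is a nearest-neighbour ranking on log-scaled article counts and pageviews, sketched below. The distance measure, the tuple layout, and the `closest_wikis` name are assumptions, and the usage values are toy numbers rather than real wiki statistics.

```python
import math


def closest_wikis(target, candidates, k=2):
    """Return the names of the k candidate wikis most similar to `target`.

    Each wiki is a hypothetical (name, article_count, desktop_pageviews)
    tuple. Distance is Euclidean on a log10 scale, so wikis are compared by
    order of magnitude rather than by raw counts.
    """
    _, t_articles, t_views = target

    def distance(wiki):
        _, articles, views = wiki
        return math.hypot(math.log10(articles) - math.log10(t_articles),
                          math.log10(views) - math.log10(t_views))

    return [name for name, *_ in sorted(candidates, key=distance)[:k]]
```

With a toy target of `("test", 1000, 100000)` and candidates `("a", 1100, 120000)`, `("b", 1000000, 100000000)` and `("c", 900, 90000)`, the two closest wikis are `c` and `a`.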
Wiki | New Search Location (Treatment) | Old Search Location (Control) | Both Groups |
---|---|---|---|
Basque Wikipedia | 872 | 649 | 1521 |
French Wikipedia | 22750 | 26235 | 48985 |
French Wiktionary | 1943 | 1709 | 3652 |
Hebrew Wikipedia | 1047 | 1394 | 2441 |
Persian Wikipedia | 2527 | 2457 | 4984 |
Portuguese Wikiversity | 12 | 7 | 19 |
All 6 wikis | 29151 | 32451 | 61602 |
Figure 2: Total number of search sessions initiated by search location group and Wiki included in the AB test.
The total number of distinct search sessions initiated ranges from 48,985 on French Wikipedia to only 19 on Portuguese Wikiversity.
The search location with more search sessions initiated varies by test wiki. Basque Wikipedia, Persian Wikipedia, French Wiktionary, and Portuguese Wikiversity had more search sessions initiated when the search widget was displayed in the new location, while French Wikipedia and Hebrew Wikipedia had more search sessions initiated when it was displayed in the old location.
Figure 3: The boxplot shows the distribution of the search sessions initiated across all of the test wikis. A logarithmic scale was applied to the number of search sessions initiated (y-axis) to address the large discrepancies between data for each wiki.
The overall distributions of the two test groups are similar. The middle half of the data for each group shows little variability in search sessions initiated, and each group has two outliers: French Wikipedia and Portuguese Wikiversity. The box for the new search location sits slightly higher than the box for the old search location, although the old search location has more search sessions initiated in total (32,451 vs. 29,151).
To determine whether the search location move had an impact, we calculated the total number of unique search sessions initiated for each wiki and search location group in the AB test over the two-week periods before and after the start of the AB test. We compared these results to search sessions initiated on similarly sized wikis over the same time period.
Unfortunately, we are unable to identify logged-in or anonymous users prior to 20 October 2020, when the isAnon field was added to the SearchSatisfaction schema and the AB test was started. As a result, we reviewed both logged-in and logged-out users when comparing search sessions initiated pre- and post-deployment of the AB test, even though only logged-in users were included in the AB test. Logged-out users on the test wikis received the new search location by default.
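The pre- and post-deployment totals used for this comparison can be built by splitting daily session counts at the start of the AB test. The sketch below assumes the counts are already restricted to legacy Vector sessions on the comparison wikis (as described in the note on comparison wikis), treats 20 October 2020 itself as the first post-deployment day, and uses a hypothetical `(wiki, day, sessions)` row layout.

```python
from collections import defaultdict
from datetime import date

AB_TEST_START = date(2020, 10, 20)  # assumed first post-deployment day


def pre_post_totals(daily_counts):
    """Sum daily search sessions into pre- and post-deployment totals per wiki.

    `daily_counts` is an iterable of (wiki, day, sessions) tuples, where
    `day` is a datetime.date. Days before AB_TEST_START count as "pre";
    the start date and later days count as "post".
    """
    totals = defaultdict(lambda: {"pre": 0, "post": 0})
    for wiki, day, sessions in daily_counts:
        period = "pre" if day < AB_TEST_START else "post"
        totals[wiki][period] += sessions
    return dict(totals)
```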
Figure 4: Total daily search sessions initiated before and following deployment of the AB test (5 October 2020 through 4 November 2020). The figure compares overall daily search sessions for wikis included in the AB test with those for a similar set of wikis not included in the AB test. Post-deployment search sessions are highlighted in yellow.
Overall, there appears to be no sudden change in daily search sessions initiated following deployment of the AB test on the test wikis.
We fit a linear regression model to estimate the impact of the search location move on the number of search sessions initiated and to test whether there is a statistically significant difference between pre- and post-deployment data.
Figure 5: Search Sessions Initiated Pre AB Test vs Post AB Test by Search Location Group. We applied the log transformation to both the pre and post deployment data to address skew in the data caused by outlier Wikis. The lines represent fitted regression lines to the data for the Wiki Projects Not in the AB Test (red line) and Wiki Projects in the AB test (blue line).
As shown in Figure 5, the fitted lines for the two groups are nearly identical because the estimated effect of the test group is very small. If the AB test had a larger positive effect on search sessions initiated, we would expect the blue line (Wiki Project in Test) to sit noticeably above the red line (Wiki Project Not in Test).
We obtained the following estimates by fitting a linear regression of log post-deployment sessions on log pre-deployment sessions and test group, using R's lm() function. We applied the log transformation to stabilize variance, since the data are highly skewed by the outliers French Wikipedia and Portuguese Wikiversity. Because the response variable is on the log scale, we exponentiated the test group coefficient to obtain the multiplicative effect of the treatment and estimated its 95% confidence interval using the delta method.
Response variable: log(post)

Predictors | Estimates | CI | p |
---|---|---|---|
(Intercept) | -0.06 | -0.37 – 0.25 | 0.699 |
pre [log] | 1.01 | 0.98 – 1.04 | <0.001 |
test_group [WikiProjectInTest] | 0.02 | -0.15 – 0.20 | 0.771 |
Observations | 16 | | |
R² / R² adjusted | 0.998 / 0.997 | | |
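Written out, the regression and the delta-method step take the following form. Here i indexes wikis and test_i is 1 for wikis in the AB test and 0 otherwise; this notation is ours rather than the original report's.

```latex
\log(\mathrm{post}_i) = \beta_0 + \beta_1 \log(\mathrm{pre}_i) + \beta_2\,\mathrm{test}_i + \varepsilon_i

\text{multiplicative effect} = e^{\hat\beta_2}, \qquad
\mathrm{SE}\big(e^{\hat\beta_2}\big) \approx e^{\hat\beta_2}\,\mathrm{SE}(\hat\beta_2)

95\%\ \mathrm{CI} = e^{\hat\beta_2} \pm z_{0.975}\,\mathrm{SE}\big(e^{\hat\beta_2}\big), \qquad z_{0.975} \approx 1.96
```

Exponentiating turns the additive effect on the log scale into an approximate ratio of post-deployment search sessions between test and comparison wikis, holding pre-deployment sessions fixed.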
Term | Estimate | SE | 2.5 % | 97.5 % |
---|---|---|---|---|
exp(test_groupWikiProjectInTest) | 1.024328 | 0.0828304 | 0.8619834 | 1.186673 |
While the estimated effect of the new search location is a 2.4% increase in search sessions initiated, the 95% confidence interval of [-13.8%, 18.7%] includes zero, so we do not have sufficient evidence to draw definitive conclusions. Note that the effect of the search move is dampened because it was only deployed to 50% of logged-in users as part of the AB test.
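The interval in the table above can be reproduced from the exponentiated estimate and its standard error. The function below is a sketch of the delta-method step, not the report's own code. Because the table reports only exponentiated values, the check backs out the log-scale coefficient and its standard error from them.

```python
import math
from statistics import NormalDist


def exp_coef_delta_ci(beta, se_beta, level=0.95):
    """Delta-method CI for exp(beta), the multiplicative effect of a log-scale coefficient.

    First-order approximation: SE(exp(beta)) ~= exp(beta) * SE(beta).
    Returns (estimate, standard error, lower bound, upper bound).
    """
    estimate = math.exp(beta)
    se = estimate * se_beta
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate, se, estimate - z * se, estimate + z * se


# Back out the log-scale coefficient and its SE from the reported exp() row.
beta_hat = math.log(1.024328)
se_beta_hat = 0.0828304 / 1.024328
estimate, se, lower, upper = exp_coef_delta_ci(beta_hat, se_beta_hat)
```

The resulting bounds match the table's 0.8619834 and 1.186673, that is, the [-13.8%, 18.7%] interval quoted in the text.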
We also calculated, by test wiki and search location, the percentage of search sessions in the AB test that included at least one click on a returned search result. Data were restricted to sessions in which at least one result was returned.
We used the internally developed Bayesian Categorical Data Analysis (BCDA) package (Popov, n.d.) for the Bayesian statistical analysis and credible intervals.
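A minimal sketch of the session completion rate described above, assuming per-session records with hypothetical field names (`wiki`, `group`, `n_results`, `n_clicks`); the report's actual data preparation may differ.

```python
from collections import defaultdict


def completion_rates(sessions):
    """Share of sessions with at least one result click, per (wiki, group).

    `sessions` is an iterable of dicts with hypothetical keys 'wiki', 'group',
    'n_results' (results returned) and 'n_clicks' (result clicks). Sessions
    with no results returned are excluded before computing the rate.
    """
    totals = defaultdict(lambda: [0, 0])  # [completed sessions, eligible sessions]
    for s in sessions:
        if s["n_results"] <= 0:
            continue
        key = (s["wiki"], s["group"])
        totals[key][1] += 1
        if s["n_clicks"] > 0:
            totals[key][0] += 1
    return {key: done / n for key, (done, n) in totals.items()}
```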
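The BCDA package's own functions are not reproduced here. As a rough, hedged illustration of the general technique behind summaries like Table 3, the sketch compares two Beta-Binomial posteriors by Monte Carlo: it draws each group's success probability from its posterior and summarises the difference, relative risk, and odds ratio across draws. The uniform Beta(1, 1) priors, the number of draws, and the function names are assumptions, and BCDA's defaults may differ.

```python
import random


def _quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list."""
    idx = min(len(sorted_values) - 1, max(0, round(q * (len(sorted_values) - 1))))
    return sorted_values[idx]


def beta_binomial_compare(s1, n1, s2, n2, draws=20000, seed=0):
    """Monte Carlo posterior summaries comparing two success proportions.

    Each proportion gets an independent Beta(1, 1) prior, so its posterior is
    Beta(1 + successes, 1 + failures). Returns, for group 1 versus group 2,
    the posterior median and central 95% interval of the difference,
    relative risk and odds ratio.
    """
    rng = random.Random(seed)
    diff, rr, odds = [], [], []
    for _ in range(draws):
        p1 = rng.betavariate(1 + s1, 1 + n1 - s1)
        p2 = rng.betavariate(1 + s2, 1 + n2 - s2)
        diff.append(p1 - p2)
        rr.append(p1 / p2)
        odds.append((p1 / (1 - p1)) / (p2 / (1 - p2)))
    summary = {}
    for name, values in (("difference", diff), ("relative_risk", rr), ("odds_ratio", odds)):
        values.sort()
        summary[name] = tuple(_quantile(values, q) for q in (0.5, 0.025, 0.975))
    return summary
```

With toy counts of 300 successes out of 1,000 against 200 out of 1,000, the median relative risk comes out near 1.5, and its 95% interval lies above 1.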
We first reviewed the number of sessions with at least one clickthrough to a provided search result by wiki in the AB test and overall.
Sessions in Group 1 (new search location) | Sessions in Group 2 (old search location) | Pr(Success) in Group 1 | Pr(Success) in Group 2 | Difference | Relative Risk | Odds Ratio |
---|---|---|---|---|---|---|
27875 | 30959 | 3.169% (2.967%, 3.380%) | 2.890% (2.708%, 3.078%) | 0.279% (0.005%, 0.564%) | 1.098 (1.002, 1.204) | 1.101 (1.002, 1.211) |

Table 3: Parameter estimates obtained by fitting the model to the search sessions completed data using the BCDA R package.
Figure 6: How likely new search location users were to click on a search result at least once in a session across all Wikis in the AB test. A relative risk greater than 1 indicates the new search location group was more likely to complete a session, while a relative risk less than 1 indicates the new search location group was less likely to complete a session.
Table 3 and Figure 6 show the relative risk, that is, how much more likely the new search location group is to click on a search result at least once in a session than the old search location group. Overall, users who saw the new search location were 1.098 times as likely to click on at least one search result in a session as users who saw the search bar in the old location, with a 95% credible interval of (1.002, 1.204).
Figure 7: Search sessions completed by search location by each Wiki included in the AB Test, where a completed search session is defined as a session with at least 1 click to a provided search result.
Wiki | Relative Risk | 95% CI |
---|---|---|
Basque Wikipedia | 0.953 | (0.927, 0.978) |
French Wikipedia | 0.996 | (0.993, 0.999) |
French Wiktionary | 1.010 | (0.998, 1.022) |
Hebrew Wikipedia | 0.980 | (0.964, 0.995) |
Persian Wikipedia | 1.029 | (1.017, 1.043) |
Portuguese Wikiversity | 1.037 | (0.759, 1.362) |
Figure 8: How likely new search location users were to click on a search result at least once in a session by wiki. A relative risk greater than 1 indicates the new search location group was more likely to complete a session, while a relative risk less than 1 indicates the new search location group was less likely to complete a session.
On a per-wiki basis, the proportion of completed search sessions (sessions with at least 1 click on a search result) was higher for the new search location group than for the old search location group on Persian Wikipedia, French Wiktionary, and Portuguese Wikiversity, and lower on Hebrew Wikipedia, Basque Wikipedia, and French Wikipedia.
Table 4 and Figure 8 show, for each wiki, the relative risk, that is, how much more likely users in the new search location group (test group) are to click on a search result at least once in a session than users in the old search location group (control group). On Persian Wikipedia, users who saw the new search location were about 1.029 times as likely to click on at least one result during a session as users who saw the old search location, with a 95% credible interval of 1.017-1.043. On French Wiktionary and Portuguese Wikiversity, the 95% credible intervals contain 1, so we do not have sufficient evidence to draw definitive conclusions for these wikis. On Basque Wikipedia, French Wikipedia, and Hebrew Wikipedia, the 95% credible intervals lie entirely below 1, suggesting a slightly lower completion rate for the new search location group on those wikis.
As an alternative measure of user engagement with the provided results, we also reviewed the clickthrough rate, defined as the number of clicks on search results divided by the number of searches, by wiki and overall.
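Unlike session completion, this metric counts clicks per search rather than completed sessions. A minimal sketch follows, assuming per-session records with hypothetical field names and reading the definition as total clicks over total searches within each wiki and group (our interpretation).

```python
from collections import defaultdict


def clickthrough_rates(sessions):
    """Pooled clickthrough rate per (wiki, group): total clicks / total searches.

    `sessions` is an iterable of dicts with hypothetical keys 'wiki', 'group',
    'n_searches' and 'n_clicks'. Because sums are taken before dividing,
    sessions with many searches weigh more than they would in an average of
    per-session rates.
    """
    clicks = defaultdict(int)
    searches = defaultdict(int)
    for s in sessions:
        key = (s["wiki"], s["group"])
        clicks[key] += s["n_clicks"]
        searches[key] += s["n_searches"]
    return {key: clicks[key] / searches[key] for key in searches if searches[key] > 0}
```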
Figure 9: Average number of searches and average clicks per session by search location and Wiki included in the AB Test
Users in both the new and the old search location groups have a similar average number of clicks per session, ranging from 0.94 to 1. The average number of searches per session varies more across test wikis and search location groups: Hebrew Wikipedia averages only 1 to 2 searches per session in both groups, while the other wikis average 3 to 7 searches per session.
Comparison | Relative Risk | 95% CI |
---|---|---|
Test vs Control | 0.905 | (0.892, 0.919) |
Figure 10: How likely new search location users were to click on a search result during a search across all Wikis in the AB test. A relative risk greater than 1 indicates the new search location group was more likely to click on a search result during a search, while a relative risk less than 1 indicates the new search location group was less likely to click on a search result during a search.
Table 5 and Figure 10 show the relative risk, that is, how much more likely users in the new search location group were to click on a search result during a search. Overall, users in the new search location group were less likely to click on a result during a search (relative risk 0.905), and the 95% credible interval of (0.892, 0.919) lies entirely below 1.
Figure 11: Clickthrough rates for each search location group, split by wiki.
On a per-wiki basis, the clickthrough rate (total number of clicks divided by total number of searches) was higher for the new search location than for the old search location on Hebrew Wikipedia, Persian Wikipedia, and Portuguese Wikiversity, and lower on French Wikipedia, Basque Wikipedia, and French Wiktionary.
Wiki | Relative Risk | 95% CI |
---|---|---|
Basque Wikipedia | 0.958 | (0.868, 1.049) |
French Wikipedia | 0.910 | (0.895, 0.925) |
French Wiktionary | 0.804 | (0.754, 0.852) |
Hebrew Wikipedia | 1.082 | (1.021, 1.146) |
Persian Wikipedia | 1.009 | (0.956, 1.064) |
Portuguese Wikiversity | 1.206 | (0.370, 2.351) |
Figure 12: How likely new search location users were to click on a search result during a search by wiki. A relative risk greater than 1 indicates the new search location group was more likely to click on a search result during a search, while a relative risk less than 1 indicates the new search location group was less likely to click on a search result during a search.
Table 6 and Figure 12 show, for each wiki, the relative risk, that is, how much more likely users in the new search location group (test group) are to click on results during a search than users in the old search location group (control group). On Hebrew Wikipedia, users who saw the new search location were about 1.082 times as likely to click through during a search as users who saw the old search location, with a 95% credible interval of 1.021-1.146. On French Wikipedia and French Wiktionary, the 95% credible intervals lie entirely below 1, suggesting a lower clickthrough rate for the new search location group. For Basque Wikipedia, Persian Wikipedia, and Portuguese Wikiversity, the intervals contain 1, so we do not have sufficient evidence to infer any impact from the search location move.
Popov, Mikhail, and Os Keyes. 2021. wmfdata: R Tools For Wikimedia Foundation’s Analysts And Data Scientists. R package version 0.9.1. https://github.com/wikimedia/wmfdata-r.
Popov, Mikhail. n.d. BCDA: Tools for Bayesian Categorical Data Analysis. https://github.com/bearloga/BCDA.
Screenshot by Alex Hollender available on Wikimedia Commons, licensed under CC BY-SA 3.0.