{ RMarkdown Source | Analysis Codebase }
Revision Notes: In the original version of this report, some of the engagement and PaulScore calculations did not filter out searches with zero results. In this revised version, we added a filter that keeps only searches that returned some results (since those are the only searches where users actually have something to click on). The clickthrough rates and PaulScores are slightly higher across the board, but the overall conclusions are unchanged; we simply wanted this report to be as accurate as possible.
To improve the relevance of search results, Discovery’s Search team decided to test a new document ranking function called BM25, which would replace the Lucene classic similarity that is currently used. We saw the following results from our analysis:
We recommend switching to BM25 ranking with incoming links (and possibly pageviews) as query-independent factors, as this configuration appears to give users results that are more relevant and that they engage with more (especially the first search result).
The Discovery department’s mission is to help users discover and access knowledge on Wikipedia and other Wikimedia projects. One of the goals of Discovery’s Search team is to improve the relevance of results when users search Wikipedia and its sister projects. Currently, our search engine uses Lucene classic similarity to rank articles using term frequency–inverse document frequency (tf–idf). To improve the search results, we decided to try a new document-ranking function: Okapi BM25 (BM stands for Best Matching). To assess the efficacy of the proposed switch, we ran an A/B test for 10 days and anonymously tracked randomly sampled search sessions. Some users received search results acquired through the Lucene similarity, while others received search results acquired through BM25 with various configurations. We are primarily interested in:
Users who searched English Wikipedia (“enwiki”) had a 1 in 66 chance of being selected for search satisfaction tracking according to our TestSearchSatisfaction2 #15700292 schema. See change 307536 on Gerrit for more details. Those users who were on enwiki and were randomly selected to have their sessions anonymously tracked then had a 2 in 3 chance of being selected for the BM25 test.
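Taken together, this means a given enwiki search session had roughly a \(\frac{1}{66} \times \frac{2}{3} = \frac{1}{99}\) (about 1%) chance of ending up in the BM25 test.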
Users who were randomly selected for the A/B test were then randomly assigned to one of five groups. For all the groups listed below, we include the normalized discounted cumulative gain (nDCG) score – a measure of ranking quality – computed from ranking data collected through Discernatron, a tool that allows participants to judge the relevance of search results so that the Search team can test changes before making them available on-wiki.
The “all field” approach combines raw term frequency and field weights at index time: it creates an artificial field into which the content of each field is copied n times. At query time, we query this single, all-weighted field. The per-field builder approach instead combines the scores of the individual fields at query time.
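To make the distinction concrete, here is a purely illustrative R sketch; the field names, copy counts, and the weighted-sum combination are hypothetical stand-ins, not the actual CirrusSearch configuration:

```r
# Hypothetical fields and weights, for illustration only (not the real config):
copies <- c(title = 3, opening_text = 2, body = 1)

# "all field": the weighting is baked in at index time by copying each field's
# text `copies` times into one artificial field, so query time sees one field
# (a term occurring once in the title effectively counts 3 times there).
all_field_tf <- function(term_freqs) sum(copies * term_freqs)

# "per field": each field is scored separately at query time and the per-field
# builder combines those scores (shown here as a weighted sum, for illustration).
per_field_score <- function(field_scores, weights = copies) sum(weights * field_scores)

all_field_tf(c(title = 1, opening_text = 0, body = 4))        # combined term frequency
per_field_score(c(title = 7.1, opening_text = 0, body = 2.0)) # combined field scores
```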
The full-text (as opposed to auto-complete) searching event logging data was extracted from the database using the following SQL query:
SELECT
LEFT(`timestamp`, 8) AS date,
event_subTest AS test_group,
`timestamp` AS ts,
event_uniqueId AS event_id,
event_mwSessionId AS session_id,
event_searchSessionId AS search_id,
event_pageViewId AS page_id,
event_searchToken AS cirrus_id,
CAST(event_query AS CHAR CHARACTER SET utf8) AS query,
event_hitsReturned AS n_results_returned,
event_msToDisplayResults AS load_time,
CASE WHEN event_action = 'searchResultPage' THEN 'SERP' ELSE 'click' END AS action,
event_position AS position_clicked
FROM TestSearchSatisfaction2_15700292
WHERE
LEFT(`timestamp`, 8) >= '20160830' AND LEFT(`timestamp`, 8) <= '20160910'
AND event_source = 'fulltext'
AND LEFT(event_subTest, 4) = 'bm25'
AND (
(event_action = 'searchResultPage' AND event_hitsReturned IS NOT NULL AND event_msToDisplayResults IS NOT NULL)
OR
(event_action = 'click' AND event_position IS NOT NULL AND event_position > -1)
)
ORDER BY date, wiki, session_id, search_id, page_id, action DESC, timestamp;
We used the statistical analysis software and programming language R[2] to perform this analysis. The R packages we relied on most in this report include ggplot2,[4] dplyr,[5] and binom.[8]
# Install packages used in the report:
install.packages(c("devtools", "tidyverse", "binom"))
devtools::install_github("hadley/ggplot2")
# ^ development version of ggplot2 includes subtitles
devtools::install_github("wikimedia/wikimedia-discovery-polloi")
# ^ for converting 100000 into 100K via polloi::compress()
# Load packages that we will be using in this report:
library(tidyverse) # for ggplot2, dplyr, tidyr, broom, etc.
library(binom) # for Bayesian confidence intervals of proportions
When a user searches Wikipedia with a query \(Q\) and does not get the results they were looking for, they might give up, or they might reformulate their query and search for \(Q', Q'', Q'''\), and so on, each time fixing a typo, editing a word, or adding new information. For example, if the user searches for the restaurant chain “Buffalo Wild Wings” using the query “buffalo”, the result they want will not appear on the first search results page. They might go to the second page of results, where they would see the result they want, or they might reformulate their query to “buffalo wings”.
First, we computed the Levenshtein (edit) distance between queries, normalized by dividing by the length of the longer of the two queries. Then we adjusted those normalized distances by the amount their search results overlapped. Specifically, if \(d_{Q, Q'}\) is the edit distance between queries \(Q\) and \(Q'\) and \(\rho_{Q, Q'} \in [0, 1]\) is the number of results they have in common, normalized by dividing by the size of the smaller of the two result sets, then the adjusted distance is: \[{d'}_{Q, Q'} = \frac{d_{Q, Q'}}{10^{\rho_{Q, Q'}}}.\] So if \(Q\) and \(Q'\) are only moderately similar as strings but have all of their results in common, the adjusted distance is one tenth of the original, bringing them much closer together. If two searches have 1 result out of 20 in common, then \(d' = 10^{-0.05} d \approx 0.89\, d\); if they have no results in common, their edit distance is unaffected. (A small numeric sketch of this adjustment follows Table 1.) After calculating the distances between the queries in a search session, we used hierarchical agglomerative clustering to group the queries by their adjusted closeness (similarity). We present results for each of the three linkage criteria (complete, single, and average), since the choice of linkage affects the clustering (see Table 1 for an example).
For example:
Search query | Single-linkage | Complete-linkage |
---|---|---|
brtisth gas | Cluster A | Cluster A |
brtisth gaz | Cluster A | Cluster A |
brtisth gazcomapny | Cluster B | Cluster A |
fusion shell bg group | Cluster C | Cluster B |
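As a small numeric sketch of the distance adjustment described above (the queries and the 1-in-20 overlap are made up for illustration):

```r
# Sketch of the adjusted edit distance for two hypothetical queries
q1 <- "buffalo"
q2 <- "buffalo wings"
d <- adist(q1, q2)[1, 1] / max(nchar(q1), nchar(q2)) # normalized edit distance
rho <- 1 / 20            # assume 1 of the smaller result set's 20 results is shared
d_adjusted <- d / 10^rho # = 10^(-0.05) * d, i.e. ~0.89 * d, as in the text above
```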
overlapping_results <- function(x) {
if (all(is.na(x))) {
return(diag(length(x)))
}
input <- strsplit(stringr::str_replace_all(x, "[\\[\\]]", ""), ",")
output <- vapply(input, function(y) {
temp <- vapply(input, function(z) { length(intersect(z, y)) }, 0L)
# Normalize by dividing by the number of possible matches
# e.g. if two queries have two results each that are
# exactly the same, that's worth more than if
# two queries have 20 results each but have
# only three in common
temp <- temp/pmin(rep(length(y), length(input)), vapply(input, length, 0L))
temp[is.na(x)] <- 0L
return(temp)
}, rep(0.0, length(input)))
diag(output) <- 1L
return(output)
}
cluster_queries <- function(queries, results, linkage = c("complete", "single", "average"), threshold = NULL, debug = FALSE) {
if (length(queries) < 2) {
return(1)
}
input <- data.frame(query = queries, stringsAsFactors = FALSE)
x <- do.call(rbind, lapply(input$query, function(x) {
# Compute for each x in input$query the normalized edit distance from x to input$query:
normalized_distances <- adist(tolower(x), tolower(input$query), fixed = TRUE)/pmax(nchar(x), nchar(input$query))
# Return:
return(normalized_distances)
}))
# Decrease distance of queries that share results:
overlaps <- overlapping_results(results)
x <- x * (10^(-overlaps))
# ^ if two queries have the exact same results, we make the new
# edit distance 0.1 of what their original edit distance is
# Create distance object:
y <- x[lower.tri(x, diag = FALSE)]
d <- structure(
y, Size = length(queries), Labels = queries, Diag = FALSE, Upper = FALSE,
method = "levenshtein", class = "dist", call = match.call()
)
clustering_tree <- hclust(d, method = linkage[1])
# When using average linkage, we may end up with funky trees
# that cannot be properly cut. So this logic helps against
# errors and yields NAs instead.
clusters <- tryCatch(
cutree(clustering_tree, h = threshold),
error = function(e) { return(NA) })
if (all(is.na(clusters))) {
clusters <- rep(clusters, nrow(input))
names(clusters) <- input$query
}
output <- left_join(
input,
data.frame(query = names(clusters),
cluster = as.numeric(clusters),
stringsAsFactors = FALSE),
by = "query")
if (debug) {
return(
list(
original_distances = x / (10^(-overlaps)),
overlaps = overlaps,
modified_distances = d,
output = output,
hc = clustering_tree
)
)
}
return(output$cluster)
}
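As a usage sketch (the queries and result-page IDs below are made up purely to illustrate the interface; the real pipeline calls this per search session via the safe_clust() wrapper defined later):

```r
# Made-up queries and result page IDs; result sets are passed as
# bracketed, comma-separated ID strings.
queries <- c("brtisth gas", "brtisth gaz", "fusion shell bg group")
results <- c("[101,102,103]", "[101,102,104]", "[201,202]")
cluster_queries(queries, results, linkage = "complete", threshold = 0.45)
# e.g. returns cluster labels such as c(1, 1, 2): the first two queries are
# close in edit distance and share results, so they fall into one cluster
```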
PaulScore[1] is a measure of search result relevance that takes into account the positions of the clicked results, and is computed via the following steps:
We can calculate a confidence interval for PaulScore\((F)\) by approximating its distribution via bootstrapping.
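In terms of the query_score() implementation below, the per-query score is the sum of \(F^{\text{position}}\) over the clicked result positions (positions appear to be zero-indexed, judging by the \(+1\) conversion used when labeling first-clicked positions later in the report, so a click on the first result contributes \(F^0 = 1\)), and PaulScore\((F)\) is the mean of these per-query scores over all searches that returned results:

\[\text{PaulScore}(F) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \sum_{p \in \text{clicks}(s)} F^{\,p},\]

where \(\mathcal{S}\) is the set of searches with results and \(\text{clicks}(s)\) is the (possibly empty) set of clicked positions for search \(s\).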
# PaulScore Calculation
query_score <- function(positions, F) {
if (length(positions) == 1 || all(is.na(positions))) {
# no clicks were made
return(0)
} else {
positions <- positions[!is.na(positions)] # when operating on 'events' dataset, SERP events won't have positions
return(sum(F^positions))
}
}
# Bootstrapping
bootstrap_mean <- function(x, m, seed = NULL) {
if (!is.null(seed)) {
set.seed(seed)
}
n <- length(x)
return(replicate(m, mean(x[sample.int(n, n, replace = TRUE)])))
}
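For example, a 95% interval around a group's mean PaulScore can be obtained from the 2.5th and 97.5th percentiles of the bootstrapped means (a sketch with placeholder data; `query_scores` stands in for one test group's per-query scores):

```r
# Sketch: 95% bootstrap interval for the mean of a vector of per-query scores
query_scores <- c(0, 0, 1, 0.5, 0, 0.9, 0, 1.5) # placeholder values for illustration
quantile(bootstrap_mean(query_scores, m = 1000, seed = 0), c(0.025, 0.975))
```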
# Import events fetched from MySQL
load(path("data/ab-test_bm25.RData"))
events$test_group <- factor(
events$test_group,
levels = c("bm25:control", "bm25:allfield", "bm25:inclinks", "bm25:inclinks_pv", "bm25:inclinks_pv_rev"),
labels = c("Control Group (tf–idf)", "Same query builder as control group but using BM25 as similarity function", "Using per-field query building with incoming links as QIF", "Using per-field query builder with incoming links and pageviews as QIFs", "Track typos in first 2 characters"))
cirrus <- readr::read_tsv(path("data/ab-test_bm25_cirrus-results.tsv.gz"), col_types = "ccc")
events <- left_join(events, cirrus, by = c("event_id", "page_id"))
rm(cirrus)
The test was deployed on September 1st and ran for 10 days, collecting a total of 119.6K events from 36.2K unique sessions. See Table 2 for counts broken down by test group.
events_summary <- events %>%
group_by(`Test group` = test_group) %>%
summarize(`Search sessions` = length(unique(search_id)), `Events recorded` = n()) %>%
{
rbind(., tibble(
`Test group` = "Total",
`Search sessions` = sum(.$`Search sessions`),
`Events recorded` = sum(.$`Events recorded`)
))
} %>%
mutate(`Search sessions` = prettyNum(`Search sessions`, big.mark = ","),
`Events recorded` = prettyNum(`Events recorded`, big.mark = ","))
events_summary$`Test group` <- paste0("<span class='test-group-", 1:nrow(events_summary), "'>", events_summary$`Test group`, "</span>")
knitr::kable(events_summary, format = "markdown", align = c("l", "r", "r"))
Test group | Search sessions | Events recorded |
---|---|---|
Control Group (tf–idf) | 7,297 | 23,484 |
Same query builder as control group but using BM25 as similarity function | 7,133 | 22,908 |
Using per-field query building with incoming links as QIF | 7,330 | 24,799 |
Using per-field query builder with incoming links and pageviews as QIFs | 7,235 | 23,557 |
Track typos in first 2 characters | 7,161 | 24,868 |
Total | 36,156 | 119,616 |
An issue we noticed with the event logging is that when the user goes to the next page of search results or clicks the Back button after visiting a search result, a new page ID is generated for the search results page. The page ID is how we connect click events to search result page events. There is currently a Phabricator ticket (T146337) for addressing these issues. For this analysis, we de-duplicated by connecting search engine results page (SERP) events that have the exact same search query, and then connected click events together based on the SERP connectivity.
# Correct for when user uses pagination or uses back button to go back to SERP after visiting a result.
# Start by assigning the same page_id to different SERPs that have exactly the same query:
temp <- events %>%
filter(action == "SERP") %>%
group_by(session_id, search_id, query) %>%
mutate(new_page_id = min(page_id)) %>%
ungroup %>%
select(c(page_id, new_page_id)) %>%
distinct
# We also need to do the same for associated click events:
events <- left_join(events, temp, by = "page_id"); rm(temp)
# Find out which SERPs are duplicated:
temp <- events %>%
filter(action == "SERP") %>%
arrange(new_page_id, ts) %>%
mutate(dupe = duplicated(new_page_id, fromLast = FALSE)) %>%
select(c(event_id, dupe))
events <- left_join(events, temp, by = "event_id"); rm(temp)
events$dupe[events$action == "click"] <- FALSE
# Remove duplicate SERPs and re-sort:
events <- events[!events$dupe & !is.na(events$new_page_id), ] %>%
select(-c(page_id, dupe)) %>%
rename(page_id = new_page_id) %>%
arrange(date, session_id, search_id, page_id, desc(action), ts)
# Summarize on a page-by-page basis:
searches <- events %>%
group_by(`test group` = test_group, session_id, search_id, page_id) %>%
filter("SERP" %in% action) %>% # filter out searches where we have clicks but not SERP events
summarize(ts = ts[1], query = query[1],
results = ifelse(n_results_returned[1] > 0, "some", "zero"),
clickthrough = "click" %in% action,
`first clicked result's position` = ifelse(clickthrough, position_clicked[2], NA),
`result page IDs` = result_pids[1],
`Query score (F=0.1)` = query_score(position_clicked, 0.1),
`Query score (F=0.5)` = query_score(position_clicked, 0.5),
`Query score (F=0.9)` = query_score(position_clicked, 0.9)) %>%
arrange(ts)
# Cluster queries
safe_clust <- function(search_id, page_ids, queries, results, threshold, linkage) {
clusters <- cluster_queries(queries, results, linkage, threshold)
if (length(clusters) != length(page_ids)) {
stop("Number of cluster labels does not match number of searches for search session ", unlist(search_id)[1])
} else {
return(clusters)
}
}
After de-duplicating, we collapsed 101.1K (SERP and click) events into 70K searches.
searches <- searches %>%
group_by(`test group`, session_id, search_id) %>%
mutate(
cluster_single = safe_clust(search_id, page_id, query, `result page IDs`, 0.301, "single"),
cluster_average = safe_clust(search_id, page_id, query, `result page IDs`, 0.433, "average"),
cluster_complete = safe_clust(search_id, page_id, query, `result page IDs`, 0.45, "complete")
)
most_common <- function(x) {
if (all(is.na(x))) {
return(as.character(NA))
} else {
return(names(sort(table(x), decreasing = TRUE))[1])
}
}
summarize_reformulations <- function(grouped_data) {
return({
grouped_data %>%
# Count number of similar searches made in a single search session
# (multiple search sessions per MW session allowed)
summarize(
reformulations = n() - 1,
clickthrough = any(clickthrough),
results = ifelse("some" %in% results, "some", "zero"),
`most popular position clicked first` = most_common(`first clicked result's position`),
`Cluster score (F=0.1)` = mean(`Query score (F=0.1)`, na.rm = TRUE),
`Cluster score (F=0.5)` = mean(`Query score (F=0.5)`, na.rm = TRUE),
`Cluster score (F=0.9)` = mean(`Query score (F=0.9)`, na.rm = TRUE)
) %>%
ungroup
})
}
set.seed(0) # for reproducibility
query_reformulations_single <- searches %>%
group_by(`test group`, session_id, search_id, cluster = cluster_single) %>%
summarize_reformulations
query_reformulations_complete <- searches %>%
group_by(`test group`, session_id, search_id, cluster = cluster_complete) %>%
summarize_reformulations
query_reformulations_average <- searches %>%
ungroup %>%
filter(!is.na(cluster_average)) %>%
group_by(`test group`, session_id, search_id, cluster = cluster_average) %>%
summarize_reformulations
query_reformulations <- bind_rows(
"Queries grouped via average linkage" = query_reformulations_average,
"Queries grouped via complete linkage" = query_reformulations_complete,
"Queries grouped via single linkage" = query_reformulations_single,
.id = "linkage")
rm(query_reformulations_average, query_reformulations_complete, query_reformulations_single)
As mentioned in the Methods section, we used hierarchical clustering to group similar searches together: if the edit distance between two search queries is small enough and their result sets overlap, those searches are probably the user reformulating their query. In Figures 1 and 2, we see the proportions of searches that were reformulations, according to the three different ways of grouping searches. The second group (“same query builder as control group but using BM25”) had the highest proportion of searches in which the user reformulated their query at least once, and this difference is statistically significant. According to clustering via single linkage, the last group (BM25 with incoming links and typo tracking) had the lowest proportion of query reformulations.
# Calculate proportions of searches with 0, 1, 2, 3+ query reformulations
query_reformulations %>%
mutate(`query reformulations` = forcats::fct_lump(factor(reformulations), 3, other_level = "3+")) %>%
group_by(linkage, `test group`, `query reformulations`) %>%
tally %>%
mutate(proportion = n/sum(n)) %>%
ggplot(aes(x = `query reformulations`, y = proportion, fill = `test group`)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
facet_wrap(~ linkage, ncol = 3) +
labs(y = "Proportion of searches", x = "Approximate number of query reformulations per search session",
title = "Number of query reformulations by test group and linkage",
subtitle = "Queries were grouped via hierarchical clustering using average/complete/single linkage and edit distance adjusted by search results in common") +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom",
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
# Calculate proportion of searches where user reformulated their query
reformulation_counts <- query_reformulations %>%
group_by(linkage, `test group`) %>%
summarize(`searches with query reformulations` = sum(reformulations > 0),
searches = n(),
proportion = `searches with query reformulations`/searches) %>%
ungroup
reformulation_counts <- cbind(
reformulation_counts,
as.data.frame(
binom:::binom.bayes(
reformulation_counts$`searches with query reformulations`,
n = reformulation_counts$searches)[, c("mean", "lower", "upper")]
)
)
reformulation_counts %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_hline(aes(yintercept = mean), linetype = "dashed", color = RColorBrewer::brewer.pal(3, "Set1")[1],
data = filter(reformulation_counts, `test group` == "Control Group (tf–idf)")) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(),
breaks = seq(.10, .24, 0.01),
minor_breaks = seq(.10, .24, 0.005),
expand = c(0.01, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
facet_wrap(~ linkage, ncol = 3) +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
labs(y = "% of searches with query reformulations", x = NULL,
title = "Searches with reformulated queries by test group (and linkage)",
subtitle = "Queries were grouped via hierarchical clustering using average/complete/single linkage and edit distance adjusted by search results in common") +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
In Figure 3, we see that the zero results rate (ZRR) did not differ much between the groups. The test group where we tracked typos in the first two characters had a significantly lower ZRR, but perhaps at the cost of relevance and engagement with the results (see Figure 5).
zrr_pages <- searches %>%
group_by(`test group`, results) %>%
tally %>%
spread(results, n) %>%
mutate(`zero results rate` = zero/(some + zero)) %>%
ungroup
zrr_pages <- cbind(zrr_pages, as.data.frame(binom:::binom.bayes(zrr_pages$zero, n = zrr_pages$some + zrr_pages$zero)[, c("mean", "lower", "upper")]))
zrr_pages %>%
ggplot(aes(x = `test group`, y = `mean`, color = `test group`)) +
geom_hline(
yintercept = zrr_pages$`mean`[zrr_pages$`test group` == "Control Group (tf–idf)"],
linetype = "dashed", color = "gray50") +
geom_pointrange(aes(ymin = lower, ymax = upper)) +
scale_x_discrete(limits = rev(levels(events$test_group))) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_brewer("Test Group", palette = "Set1", guide = FALSE) +
labs(x = NULL, y = "Zero Results Rate",
title = "Proportion of searches that did not yield any results, by test group",
subtitle = "With 95% credible intervals. Dashed line represents the baseline (ZRR of the control group).") +
geom_text(aes(label = sprintf("%.2f%%", 100 * `zero results rate`),
vjust = "bottom", hjust = "center"), nudge_x = 0.1) +
coord_flip() +
theme_minimal(base_family = "Lato")
In Figure 4, we see that the test groups that used BM25 with incoming links and pageviews as query-independent factors had higher PaulScores, which indicates that the results were more relevant.
set.seed(0) # for reproducibility
paulscores <- searches %>%
ungroup %>%
filter(results == "some") %>%
select(c(`test group`, `Query score (F=0.1)`, `Query score (F=0.5)`, `Query score (F=0.9)`)) %>%
gather(`F value`, `Query score`, -`test group`) %>%
mutate(`F value` = sub("^Query score \\(F=(0\\.[159])\\)$", "F = \\1", `F value`)) %>%
group_by(`test group`, `F value`) %>%
summarize(
PaulScore = mean(`Query score`),
Interval = paste0(quantile(bootstrap_mean(`Query score`, 1000), c(0.025, 0.975)), collapse = ",")
) %>%
extract(Interval, into = c("Lower", "Upper"), regex = "(.*),(.*)", convert = TRUE)
paulscores %>%
ggplot(aes(x = `F value`, y = PaulScore, color = `test group`)) +
geom_pointrange(aes(ymin = Lower, ymax = Upper), position = position_dodge(width = 0.7)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
labs(x = NULL, y = "PaulScore(F)",
title = "PaulScore(F) by test group and value of F",
subtitle = "With bootstrapped 95% confidence intervals. Dashed line indicates baseline (control group) for comparing test groups.") +
geom_text(aes(label = sprintf("%.3f", PaulScore), y = Upper + 0.01, vjust = "bottom"),
position = position_dodge(width = 0.7)) +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom") +
annotate("segment",
x = (0:2) + 0.6, xend = (1:3) + 0.4,
y = paulscores$PaulScore[paulscores$`test group` == "Control Group (tf–idf)"],
yend = paulscores$PaulScore[paulscores$`test group` == "Control Group (tf–idf)"],
color = RColorBrewer::brewer.pal(3, "Set1")[1],
linetype = "dashed")
In Figures 5 and 6, we see that the test groups that used BM25 with incoming links and pageviews as query-independent factors engaged with their search results more. The group whose search configuration included pageviews had a much higher clickthrough rate than the control group when they did not reformulate their search query.
engagement_overall <- searches %>%
filter(results == "some") %>%
group_by(`test group`) %>%
summarize(clickthroughs = sum(clickthrough > 0),
searches = n(), ctr = clickthroughs/searches) %>%
ungroup
engagement_overall <- cbind(
engagement_overall,
as.data.frame(
binom:::binom.bayes(
engagement_overall$clickthroughs,
n = engagement_overall$searches)[, c("mean", "lower", "upper")]
)
)
engagement_overall %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_hline(aes(yintercept = mean), linetype = "dashed", color = RColorBrewer::brewer.pal(3, "Set1")[1],
data = filter(engagement_overall, `test group` == "Control Group (tf–idf)")) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Clickthrough rate",
title = "Engagement with search results by test group") +
geom_text(aes(label = sprintf("%.1f%%", 100 * ctr), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom")
# Figure out engagement (clickthroughs) on a per-search ("query cluster") basis:
per_search_engagement <- query_reformulations %>%
filter(results == "some") %>%
group_by(`test group`, linkage,
search = paste(search_id, cluster),
reformulated = reformulations > 0) %>%
summarize(clickthroughs = sum(clickthrough), `non-ZR searches` = n())
engagement <- per_search_engagement %>%
group_by(`test group`, linkage,
reformulated = ifelse(reformulated, "reformulated query", "did not reformulate query")) %>%
summarize(clickthroughs = sum(clickthroughs > 0),
searches = n(), ctr = clickthroughs/searches) %>%
ungroup
engagement <- cbind(
engagement,
as.data.frame(
binom:::binom.bayes(
engagement$clickthroughs,
n = engagement$searches)[, c("mean", "lower", "upper")]
)
)
engagement %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_hline(aes(yintercept = mean), linetype = "dashed", color = RColorBrewer::brewer.pal(3, "Set1")[1],
data = filter(engagement, `test group` == "Control Group (tf–idf)")) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
facet_grid(reformulated ~ linkage, scales = "free_y") +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Clickthrough rate",
title = "Engagement with search results by test group and presence of query reformulation",
subtitle = "Queries were grouped via hierarchical clustering using average/complete/single linkage and edit distance adjusted by search results in common") +
geom_text(aes(label = sprintf("%.1f%%", 100 * ctr), y = upper + 0.005, vjust = "bottom"),
position = position_dodge(width = 1)) +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
In Figure 7, we see that users who got the “all-field” query builder with BM25 were much less likely than users in the other groups to make their first click on the first search result. The other test groups (whose search configurations used incoming links) clicked the first result first at a much higher rate than users in the control group.
safe_ordinals <- function(x) {
return(vapply(x, toOrdinal::toOrdinal, ""))
}
first_clicked <- searches %>%
filter(results == "some" & clickthrough & !is.na(`first clicked result's position`)) %>%
mutate(`first clicked result's position` = ifelse(`first clicked result's position` < 4, safe_ordinals(`first clicked result's position` + 1), "5th or higher")) %>%
group_by(`test group`, `first clicked result's position`) %>%
tally %>%
mutate(total = sum(n), prop = n/total) %>%
ungroup
set.seed(0)
temp <- as.data.frame(binom:::binom.bayes(first_clicked$n, n = first_clicked$total, tol = .Machine$double.eps^0.1)[, c("mean", "lower", "upper")])
first_clicked <- cbind(first_clicked, temp); rm(temp)
first_clicked %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_hline(
aes(yintercept = mean),
linetype = "dashed", color = RColorBrewer::brewer.pal(3, "Set1")[1],
data = filter(first_clicked, `test group` == "Control Group (tf–idf)")
) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
geom_text(aes(label = sprintf("%.1f", 100 * prop), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(),
expand = c(0, 0.005), breaks = seq(0, 1, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
facet_wrap(~ `first clicked result's position`, scale = "free_y", nrow = 1) +
labs(x = NULL, y = "Proportion of searches",
title = "Position of the first clicked result",
subtitle = "With 95% credible intervals") +
theme_minimal(base_family = "Lato") +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
We recommend switching to BM25 ranking with incoming links (and possibly pageviews) as query-independent factors, as this configuration appears to give users results that are more relevant and that they engage with more (especially the first search result).
tidyr: Easily Tidy Data with spread() and gather() Functions. R package version 0.6.0, https://CRAN.R-project.org/package=tidyr