{ RMarkdown Source | Analysis Codebase | PDF Version }
In order to assess the efficacy of BM25 in space-less language, Discovery’s Search team has decided to conduct a second A/B test in Chinese, Japanese and Thai Wikipedias. We observed that the test group that used per-field query builder with incoming links and pageviews as query-independent factors had a much better Zero Result Rate but slightly worse PaulScores, large decrease in clickthrough rate, and fewer users clicked on the first result first, which indicates that we are showing test group users worse results. However, longer dwell-time and fewer query reformulations show that test group users might actually like the results they are getting better than the control group in that respect. We recommend deploying BM25 for all wikis but not reindexing projects in space-less languages for now.
To improve the relevancy of search results, Discovery’s Search team decided to try a new document-ranking function called Okapi BM25 (BM stands for Best Matching), and ran an A/B test from August 30 to September 10 to assess the efficacy of the proposed switch. The analysis showed that BM25 ranking with incoming links and pageviews as query-independent factors appears to give users results that are more relevant and that they engage with more.
However, we then realized that our analysis chain is sub-optimal for space-less language queries, which will break words on every characters for the plain field. Therefore, we ran a second A/B test for Chinese, Japanese and Thai Wikipedias to test whether the new per-field BM25 builder is sufficiently worse with those languages. We are primarily interested in:
# Install packages used in the report:
install.packages(c("devtools", "tidyverse", "binom"))
devtools::install_github("hadley/ggplot2")
# ^ development version of ggplot2 includes subtitles
devtools::install_github("wikimedia/wikimedia-discovery-polloi")
# ^ for converting 100000 into 100K via polloi::compress()
# Load packages that we will be using in this report:
library(tidyverse) # for ggplot2, dplyr, tidyr, broom, etc.
library(binom) # for Bayesian confidence intervals of proportions
# PaulScore Calculation
query_score <- function(positions, F) {
if (length(positions) == 1 || all(is.na(positions))) {
# no clicks were made
return(0)
} else {
positions <- positions[!is.na(positions)] # when operating on 'events' dataset, searchResultPage events won't have positions
return(sum(F^positions))
}
}
# Bootstrapping
bootstrap_mean <- function(x, m, seed = NULL) {
if (!is.null(seed)) {
set.seed(seed)
}
n <- length(x)
return(replicate(m, mean(x[sample.int(n, n, replace = TRUE)])))
}
# Import events fetched from MySQL
load(path("data/ab-test_bm25.RData"))
events <- events[!duplicated(events$event_id),]
events <- events %>%
group_by(date, wiki, test_group, session_id, search_id) %>%
filter("searchResultPage" %in% action) %>%
ungroup %>% as.data.frame() # remove search_id without SERP asscociated
events$test_group <- factor(
events$test_group,
levels = c("bm25:control", "bm25:inclinks_pv"),
labels = c("Control Group (tf–idf)", "Using per-field query builder with incoming links and pageviews as QIFs"))
cirrus <- readr::read_tsv(path("data/ab-test_bm25_cirrus-results.tsv.gz"), col_types = "cccc")
cirrus <- cirrus[!duplicated(cirrus),]
events <- left_join(events, cirrus, by = c("event_id", "page_id", "cirrus_id"))
rm(cirrus)
For Chinese Wikipedia (zhwiki) and Japanese Wikipedia (jawiki), users had a 1 in 16 chance of being selected for anonymous tracking according to our TestSearchSatisfaction2 #15922352 event logging schema. Those users who were randomly selected to have their sessions anonymously tracked then had a 12 in 13 chance of being selected for the BM25 test. For Thai Wikipedia (thwiki), users had a 1 in 5 chance of being selected for search satisfaction tracking and then had a 38 in 39 chance of being selected for the BM25 test due to fewer visitors to that particular project. The sampled sessions were evenly put into a control group (tf-idf) and a test group (using per-field query builder with incoming links and pageviews as QIFs); see the first BM25 test report for more details. The test was deployed on October 27th and ran for a week for zhwiki and jawiki, but until November 15th for thwiki specifically.
The full-text (as opposed to auto-complete) searching event logging data was extracted from the database using this script. We collected a total of 230.2K events from 36.1K unique sessions. See Table 1 for counts broken down by wiki and test group.
events_summary <- events %>%
group_by(wiki, `Test group` = test_group) %>%
summarize(`Search sessions` = length(unique(search_id)), `Events recorded` = n()) %>% ungroup %>%
{
rbind(., tibble(
`wiki` = "All Wikis",
`Test group` = "Total",
`Search sessions` = sum(.$`Search sessions`),
`Events recorded` = sum(.$`Events recorded`)
))
} %>%
mutate(`Search sessions` = prettyNum(`Search sessions`, big.mark = ","),
`Events recorded` = prettyNum(`Events recorded`, big.mark = ","))
knitr::kable(events_summary, format = "markdown", align = c("l", "l", "r", "r"))
wiki | Test group | Search sessions | Events recorded |
---|---|---|---|
jawiki | Control Group (tf–idf) | 7,579 | 54,261 |
jawiki | Using per-field query builder with incoming links and pageviews as QIFs | 7,610 | 59,808 |
thwiki | Control Group (tf–idf) | 3,889 | 18,954 |
thwiki | Using per-field query builder with incoming links and pageviews as QIFs | 4,055 | 21,217 |
zhwiki | Control Group (tf–idf) | 6,510 | 37,561 |
zhwiki | Using per-field query builder with incoming links and pageviews as QIFs | 6,478 | 38,382 |
All Wikis | Total | 36,121 | 230,183 |
An issue we noticed with the event logging is that when the user goes to the next page of search results or clicks the Back button after visiting a search result, a new page ID is generated for the search results page. The page ID is how we connect click events to search result page events. There is currently a Phabricator ticket (T146337) for addressing these issues. For this analysis, we de-duplicated by connecting search engine results page (searchResultPage) events that have the exact same search query, and then connected click events together based on the searchResultPage connectivity.
temp <- events %>%
filter(action == "searchResultPage") %>%
group_by(session_id, search_id, query) %>%
mutate(new_page_id = min(page_id)) %>%
ungroup %>%
select(c(page_id, new_page_id)) %>%
distinct
events <- left_join(events, temp, by = "page_id"); rm(temp)
events$new_page_id[is.na(events$new_page_id)] <- events$page_id[is.na(events$new_page_id)]
temp <- events %>%
filter(action == "searchResultPage") %>%
arrange(new_page_id, ts) %>%
mutate(dupe = duplicated(new_page_id, fromLast = FALSE)) %>%
select(c(event_id, dupe))
events <- left_join(events, temp, by = "event_id"); rm(temp)
events$dupe[events$action != "searchResultPage"] <- FALSE
events <- events[!events$dupe & !is.na(events$new_page_id), ] %>%
select(-c(page_id, dupe)) %>%
rename(page_id = new_page_id) %>%
arrange(date, session_id, search_id, page_id, desc(action), ts)
# Summarize on a page-by-page basis for each SERP:
searches <- events %>%
group_by(wiki, `test group` = test_group, session_id, search_id, page_id) %>%
filter("searchResultPage" %in% action) %>% # filter out searches where we have clicks but not searchResultPage events
summarize(ts = ts[1], query = query[1],
results = ifelse(n_results_returned[1] > 0, "some", "zero"),
clickthrough = "click" %in% action,
`no. results clicked` = length(unique(position_clicked))-1,
`first clicked result's position` = ifelse(clickthrough, position_clicked[2], NA),
`result page IDs` = paste(unique(result_pids[!is.na(result_pids)]), collapse=','),
`Query score (F=0.1)` = query_score(position_clicked, 0.1),
`Query score (F=0.5)` = query_score(position_clicked, 0.5),
`Query score (F=0.9)` = query_score(position_clicked, 0.9)) %>%
arrange(ts)
searches$`result page IDs`[searches$`result page IDs`==""] <- NA
After de-duplicating, we collapsed 213.4K (searchResultPage and click) events into 64.5K searches. See Table 2 for counts broken down by wiki and test group.
events_summary2 <- events %>%
group_by(wiki, `Test group` = test_group) %>%
summarize(`Search sessions` = length(unique(search_id)), `Events recorded` = n()) %>% ungroup %>%
{
rbind(., tibble(
`wiki` = "All Wikis",
`Test group` = "Total",
`Search sessions` = sum(.$`Search sessions`),
`Events recorded` = sum(.$`Events recorded`)
))
} %>%
mutate(`Search sessions` = prettyNum(`Search sessions`, big.mark = ","),
`Events recorded` = prettyNum(`Events recorded`, big.mark = ","))
searches_summary <- searches %>%
group_by(wiki, `Test group` = `test group`) %>%
summarize(`Search sessions` = length(unique(search_id)), `Searches recorded` = n()) %>% ungroup %>%
{
rbind(., tibble(
`wiki` = "All Wikis",
`Test group` = "Total",
`Search sessions` = sum(.$`Search sessions`),
`Searches recorded` = sum(.$`Searches recorded`)
))
} %>%
mutate(`Search sessions` = prettyNum(`Search sessions`, big.mark = ","),
`Searches recorded` = prettyNum(`Searches recorded`, big.mark = ","))
knitr::kable(inner_join(searches_summary, events_summary2, by=c("wiki", "Test group", "Search sessions")),
format = "markdown", align = c("l", "l", "r", "r", "r"))
wiki | Test group | Search sessions | Searches recorded | Events recorded |
---|---|---|---|---|
jawiki | Control Group (tf–idf) | 7,579 | 13,839 | 50,845 |
jawiki | Using per-field query builder with incoming links and pageviews as QIFs | 7,610 | 13,982 | 56,097 |
thwiki | Control Group (tf–idf) | 3,889 | 7,005 | 16,379 |
thwiki | Using per-field query builder with incoming links and pageviews as QIFs | 4,055 | 7,086 | 18,481 |
zhwiki | Control Group (tf–idf) | 6,510 | 11,450 | 35,262 |
zhwiki | Using per-field query builder with incoming links and pageviews as QIFs | 6,478 | 11,173 | 36,373 |
All Wikis | Total | 36,121 | 64,535 | 213,437 |
There are 22.1K visitPage events (When the user clicks a link in the results a visitPage event is created). See Table 3 for counts broken down by wiki and test group.
# Summarize on a page-by-page basis for each visitPage:
clickedResults <- events %>%
group_by(wiki, `test group` = test_group, session_id, search_id, page_id) %>%
filter("visitPage" %in% action) %>% #only checkin and visitPage action
summarize(ts = ts[1],
dwell_time=ifelse("checkin"%in% action, max(checkin, na.rm=T), 0),
scroll=sum(scroll)>0) %>%
arrange(ts)
clickedResults$dwell_time[is.na(clickedResults$dwell_time)] <- 0
clickedResults_summary <- clickedResults %>%
group_by(wiki, `Test group` = `test group`) %>%
summarize(`Visited pages` = length(unique(page_id))) %>% ungroup %>%
{
rbind(., tibble(
`wiki` = "All Wikis",
`Test group` = "Total",
`Visited pages` = sum(.$`Visited pages`)
))
} %>%
mutate(`Visited pages` = prettyNum(`Visited pages`, big.mark = ","))
knitr::kable(clickedResults_summary, format = "markdown", align = c("l", "l", "r", "r"))
wiki | Test group | Visited pages |
---|---|---|
jawiki | Control Group (tf–idf) | 5,431 |
jawiki | Using per-field query builder with incoming links and pageviews as QIFs | 5,954 |
thwiki | Control Group (tf–idf) | 1,554 |
thwiki | Using per-field query builder with incoming links and pageviews as QIFs | 1,776 |
zhwiki | Control Group (tf–idf) | 3,591 |
zhwiki | Using per-field query builder with incoming links and pageviews as QIFs | 3,807 |
All Wikis | Total | 22,113 |
In Figure 1, we see that the test group that used BM25 with incoming links and pageviews as query-independent factors had a significantly lower zero results rate (ZRR). Smaller ZRR is usually better, but looking at the PaulScore, engagement and First Clicked Result’s Position, we doubt that it came at the cost of relevance and engagement with the results. Figure 2 shows that zhwiki had the largest ZRR difference between control and test group.
zrr_pages <- searches %>%
group_by(`test group`, results) %>%
tally %>%
spread(results, n) %>%
mutate(`zero results rate` = zero/(some + zero)) %>%
ungroup
zrr_pages <- cbind(zrr_pages, as.data.frame(binom:::binom.bayes(zrr_pages$zero, n = zrr_pages$some + zrr_pages$zero)[, c("mean", "lower", "upper")]))
zrr_pages %>%
ggplot(aes(x = `test group`, y = `mean`, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper)) +
scale_x_discrete(limits = levels(events$test_group)) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_brewer("Test Group", palette = "Set1", guide = FALSE) +
labs(x = NULL, y = "Zero Results Rate",
title = "Proportion of searches that did not yield any results, by test group",
subtitle = "With 95% credible intervals.") +
geom_text(aes(label = sprintf("%.2f%%", 100 * `zero results rate`),
vjust = "bottom", hjust = "center"), nudge_x = 0.1)
zrr_pages <- searches %>%
group_by(wiki, `test group`, results) %>%
tally %>%
spread(results, n) %>%
mutate(`zero results rate` = zero/(some + zero)) %>%
ungroup
zrr_pages <- cbind(zrr_pages, as.data.frame(binom:::binom.bayes(zrr_pages$zero, n = zrr_pages$some + zrr_pages$zero)[, c("mean", "lower", "upper")]))
zrr_pages %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
labs(x = NULL, y = "Zero Results Rate",
title = "Proportion of searches that did not yield any results, by test group and wiki",
subtitle = "With 95% credible intervals.") +
geom_text(aes(label = sprintf("%.2f%%", 100 * `zero results rate`),
y = upper + 0.0025, vjust = "bottom"), position = position_dodge(width = 1)) +
facet_wrap(~ wiki, ncol = 3) +
theme(legend.position = "bottom")
In Figure 3, we see that the test group had slightly lower PaulScores, which indicates that the results were less relevant. The difference is not significant when F = 0.9. This make sense because the smaller the value of scoring factor, the more weight put on the first few results, with lower ranking results counting for almost nothing. F=0.9 takes into account the broadest range of result rankings, and thus is less likely to change as dramatically. Figure 4 shows that zhwiki had the largest PaulScore differences between control and test group.
set.seed(777)
paulscores <- searches %>%
ungroup %>%
filter(clickthrough==TRUE) %>% #???
select(c(`test group`, `Query score (F=0.1)`, `Query score (F=0.5)`, `Query score (F=0.9)`)) %>%
gather(`F value`, `Query score`, -`test group`) %>%
mutate(`F value` = sub("^Query score \\(F=(0\\.[159])\\)$", "F = \\1", `F value`)) %>%
group_by(`test group`, `F value`) %>%
summarize(
PaulScore = mean(`Query score`),
Interval = paste0(quantile(bootstrap_mean(`Query score`, 1000), c(0.025, 0.975)), collapse = ",")
) %>%
extract(Interval, into = c("Lower", "Upper"), regex = "(.*),(.*)", convert = TRUE)
paulscores %>%
ggplot(aes(x = `F value`, y = PaulScore, color = `test group`)) +
geom_pointrange(aes(ymin = Lower, ymax = Upper), position = position_dodge(width = 0.7)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
#scale_y_continuous(limits = c(0.2, 0.35)) +
labs(x = NULL, y = "PaulScore(F)",
title = "PaulScore(F) by test group and value of F",
subtitle = "With bootstrapped 95% confidence intervals.") +
geom_text(aes(label = sprintf("%.3f", PaulScore), y = Upper + 0.01, vjust = "bottom"),
position = position_dodge(width = 0.7)) +
theme(legend.position = "bottom")
set.seed(777)
paulscores <- searches %>%
ungroup %>%
filter(clickthrough==TRUE) %>%
select(c(wiki, `test group`, `Query score (F=0.1)`, `Query score (F=0.5)`, `Query score (F=0.9)`)) %>%
gather(`F value`, `Query score`, -c(`test group`, wiki)) %>%
mutate(`F value` = sub("^Query score \\(F=(0\\.[159])\\)$", "F = \\1", `F value`)) %>%
group_by(wiki, `test group`, `F value`) %>%
summarize(
PaulScore = mean(`Query score`),
Interval = paste0(quantile(bootstrap_mean(`Query score`, 1000), c(0.025, 0.975)), collapse = ",")
) %>%
extract(Interval, into = c("Lower", "Upper"), regex = "(.*),(.*)", convert = TRUE)
paulscores %>%
ggplot(aes(x = `F value`, y = PaulScore, color = `test group`)) +
geom_pointrange(aes(ymin = Lower, ymax = Upper), position = position_dodge(width = 0.7)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
#scale_y_continuous(limits = c(0.2, 0.35)) +
labs(x = NULL, y = "PaulScore(F)",
title = "PaulScore(F) by wiki, test group and value of F",
subtitle = "With bootstrapped 95% confidence intervals.") +
geom_text(aes(label = sprintf("%.3f", PaulScore), y = Upper + 0.01, vjust = "bottom"),
position = position_dodge(width = 0.7)) +
facet_wrap(~ wiki, ncol = 3) +
theme(legend.position = "bottom")
In Figure 5, we see that the test group had a significantly lower clickthrough rate, which means users are less engaged with their search results. Again, zhwiki shows the largest discrepancy between control and test group in Figure 6.
engagement_overall <- searches %>%
filter(results=="some") %>% #???
group_by(`test group`) %>%
summarize(clickthroughs = sum(clickthrough > 0),
searches = n(), ctr = clickthroughs/searches) %>%
ungroup
engagement_overall <- cbind(
engagement_overall,
as.data.frame(
binom:::binom.bayes(
engagement_overall$clickthroughs,
n = engagement_overall$searches)[, c("mean", "lower", "upper")]
)
)
engagement_overall %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Clickthrough rate",
title = "Engagement with search results by test group") +
geom_text(aes(label = sprintf("%.1f%%", 100 * ctr), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
theme(legend.position = "bottom")
engagement_overall <- searches %>%
filter(results=="some") %>%
group_by(wiki, `test group`) %>%
summarize(clickthroughs = sum(clickthrough > 0),
searches = n(), ctr = clickthroughs/searches) %>%
ungroup
engagement_overall <- cbind(
engagement_overall,
as.data.frame(
binom:::binom.bayes(
engagement_overall$clickthroughs,
n = engagement_overall$searches)[, c("mean", "lower", "upper")]
)
)
engagement_overall %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Clickthrough rate",
title = "Engagement with search results by test group and wiki") +
geom_text(aes(label = sprintf("%.1f%%", 100 * ctr), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
facet_wrap(~ wiki, ncol = 3) +
theme(legend.position = "bottom")
In Figure 7, we see that test group users were less likely to click on the first search result first than the control group. Figure 8 shows that only zhwiki users first clicked on the first result at a significantly lower rate, which indicates that the results were less relevant.
safe_ordinals <- function(x) {
return(vapply(x, toOrdinal::toOrdinal, ""))
}
first_clicked <- searches %>%
filter(results == "some" & clickthrough & !is.na(`first clicked result's position`)) %>%
mutate(`first clicked result's position` = ifelse(`first clicked result's position` < 4, safe_ordinals(`first clicked result's position` + 1), "5th or higher")) %>%
group_by(`test group`, `first clicked result's position`) %>%
tally %>%
mutate(total = sum(n), prop = n/total) %>%
ungroup
set.seed(0)
temp <- as.data.frame(binom:::binom.bayes(first_clicked$n, n = first_clicked$total, tol = .Machine$double.eps^0.1)[, c("mean", "lower", "upper")])
first_clicked <- cbind(first_clicked, temp); rm(temp)
first_clicked %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
geom_text(aes(label = sprintf("%.1f", 100 * prop), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(),
expand = c(0, 0.005), breaks = seq(0, 1, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
facet_wrap(~ `first clicked result's position`, scale = "free_y", nrow = 1) +
labs(x = NULL, y = "Proportion of searches",
title = "Position of the first clicked result",
subtitle = "With 95% credible intervals") +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
first_clicked <- searches %>%
filter(results == "some" & clickthrough & !is.na(`first clicked result's position`)) %>%
mutate(`first clicked result's position` = ifelse(`first clicked result's position` < 4, safe_ordinals(`first clicked result's position` + 1), "5th or higher")) %>%
group_by(wiki, `test group`, `first clicked result's position`) %>%
tally %>%
mutate(total = sum(n), prop = n/total) %>%
ungroup
set.seed(0)
temp <- as.data.frame(binom:::binom.bayes(first_clicked$n, n = first_clicked$total, tol = .Machine$double.eps^0.1)[, c("mean", "lower", "upper")])
first_clicked <- cbind(first_clicked, temp); rm(temp)
first_clicked %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
geom_text(aes(label = sprintf("%.1f", 100 * prop), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(),
expand = c(0, 0.005), breaks = seq(0, 1, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
facet_wrap(~ wiki+`first clicked result's position`, scale = "free_y", ncol=5, nrow = 3) +
labs(x = NULL, y = "Proportion of searches",
title = "Position of the first clicked result",
subtitle = "With 95% credible intervals") +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "gray90"),
panel.border = element_rect(color = "gray30", fill = NA))
Figures 9 and 10 show the survival curve for each test group and wiki. Except zhwiki, users are more likely to stay longer on visited pages, which implies the results in test group are more relevant for jawiki and thwiki.
clickedResults %>%
split(.$`test group`) %>%
map_df(function(df) {
seconds <- c(0, 10, 20, 30, 40, 50, 60, 90, 120, 150, 180, 210, 240, 300, 360, 420)
seshs <- as.data.frame(do.call(cbind, lapply(seconds, function(second) {
return(sum(df$dwell_time >= second))
})))
names(seshs) <- seconds
return(cbind(`test group` = head(df$`test group`, 1), n = seshs$`0`, seshs, `450`=seshs$`420`))
}) %>%
gather(seconds, visits, -c(`test group`, n)) %>%
mutate(seconds = as.numeric(seconds)) %>%
group_by(`test group`, seconds) %>%
mutate(proportion = visits/n) %>%
ungroup() %>%
ggplot(aes(group=`test group`, color=`test group`)) +
geom_step(aes(x = seconds, y = proportion), direction = "hv") +
scale_x_continuous(name = "T (Dwell Time in seconds)", breaks=c(0, 10, 20, 30, 40, 50, 60, 90, 120, 150, 180, 210, 240, 300, 360, 420))+
scale_y_continuous("Proportion of visits longer than T (P%)", labels = scales::percent_format(),
breaks = seq(0, 1, 0.1)) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
clickedResults %>%
split(list(.$wiki, .$`test group`)) %>%
map_df(function(df) {
seconds <- c(0, 10, 20, 30, 40, 50, 60, 90, 120, 150, 180, 210, 240, 300, 360, 420)
seshs <- as.data.frame(do.call(cbind, lapply(seconds, function(second) {
return(sum(df$dwell_time >= second))
})))
names(seshs) <- seconds
return(cbind(wiki=head(df$wiki, 1), `test group` = head(df$`test group`, 1), n = seshs$`0`, seshs, `450`=seshs$`420`))
}) %>%
gather(seconds, visits, -c(wiki, `test group`, n)) %>%
mutate(seconds = as.numeric(seconds)) %>%
group_by(wiki, `test group`, seconds) %>%
mutate(proportion = visits/n) %>%
ungroup() %>%
ggplot(aes(group=`test group`, color=`test group`)) +
geom_step(aes(x = seconds, y = proportion), direction = "hv") +
scale_x_continuous(name = "T (Dwell Time in seconds)", breaks=c(0, 10, 20, 30, 40, 50, 60, 90, 120, 150, 180, 210, 240, 300, 360, 420))+
scale_y_continuous("Proportion of visits longer than T (P%)", labels = scales::percent_format(),
breaks = seq(0, 1, 0.1)) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom") +
facet_wrap(~ wiki, ncol=1, nrow =3)
In Figures 11 and 12, we can see that users in the test group are more likely to scroll on the visited pages, but the differences are not statistically significant.
scroll_overall <- clickedResults %>%
group_by(`test group`) %>%
summarize(scrolls=sum(scroll), visits=n(), proportion = sum(scroll)/n()) %>%
ungroup
scroll_overall <- cbind(
scroll_overall,
as.data.frame(
binom:::binom.bayes(
scroll_overall$scrolls,
n = scroll_overall$visits)[, c("mean", "lower", "upper")]
)
)
scroll_overall %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Proportion of visits",
title = "Proportion of visits with scroll by test group") +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
theme(legend.position = "bottom")
scroll_overall <- clickedResults %>%
group_by(wiki, `test group`) %>%
summarize(scrolls=sum(scroll), visits=n(), proportion = sum(scroll)/n()) %>%
ungroup
scroll_overall <- cbind(
scroll_overall,
as.data.frame(
binom:::binom.bayes(
scroll_overall$scrolls,
n = scroll_overall$visits)[, c("mean", "lower", "upper")]
)
)
scroll_overall %>%
ggplot(aes(x = 1, y = mean, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
labs(x = NULL, y = "Proportion of visits",
title = "Proportion of visits with scroll by test group and wiki") +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
facet_wrap(~ wiki, ncol = 3) +
theme(legend.position = "bottom")
First, we tokenized queries from zhwiki, jawiki and thwiki with jieba, tinysegmenter and elasticsearch termvectors api separately. Then we filter out stop words.
# See tokenize.R for more details
library(jiebaR)
library(ropencc)
#thwiki
load("../data/tokens_th.RData")
tokens_th <- output; rm(output)
stopword_th <- jsonlite::fromJSON("../resource/th.json")
tokens_th <- filter_segment(tokens_th,stopword_th)
#zhwiki
queries_zh <- searches[searches$wiki=="zhwiki", c("page_id","query")]
queries_zh$query <- gsub("[[:punct:]]", " ", queries_zh$query)
stopword_zh <- jsonlite::fromJSON("../resource/zh.json")
ccst = converter(S2T)
stopword_zh <- union(ccst[stopword_zh], stopword_zh)
mixseg = worker(bylines = TRUE)
tokens_zh <- mixseg[queries_zh$query]
tokens_zh <- filter_segment(tokens_zh,stopword_zh)
names(tokens_zh) <- queries_zh$page_id
#jawiki
tokens_ja <- jsonlite::fromJSON("../data/tokens_ja.json")
stopword_ja <- jsonlite::fromJSON("../resource/ja.json")
tokens_ja <- filter_segment(tokens_ja,stopword_ja)
tokens_ja <- lapply(tokens_ja, function(x) gsub(" +","",x)) # strip space before/after string
We consider two queries as a reformulation if 1) they are from the same search session and share at least one result, or 2) they are from the same search session and share at least one word.
library(igraph)
all_tokens <- c(tokens_ja, tokens_zh, tokens_th)
overlapping_results <- function(x) {
if (all(is.na(x))) {
return(diag(length(x)))
}
input <- strsplit(stringr::str_replace_all(x, "[\\[\\]]", ""), ",")
output <- vapply(input, function(y) {
temp <- vapply(input, function(z) { length(intersect(z, y)) }, 0L)
temp[is.na(x)] <- 0L
return(temp)
}, rep(0L, length(input)))
diag(output) <- 1L
return(output)
}
reformulation <- function(result_pids, page_ids, all_tokens) {
if(length(page_ids)==1){
return(data.frame(n_search=1, n_reformulate="0"))
}
# result overlap
overlaps_res <- overlapping_results(result_pids)
# tokens overlap
this_tokens <- all_tokens[page_ids]
overlaps_token <- vapply(this_tokens, function(y) {
temp <- vapply(this_tokens, function(z) { length(intersect(z, y)) }, 0L)
temp[is.na(this_tokens)] <- 0L
return(temp)
}, rep(0L, length(this_tokens)))
diag(overlaps_token) <- 1L
overlaps_all <- (overlaps_res + overlaps_token) > 0
diag(overlaps_all) <- 0L
graph_object <- graph_from_adjacency_matrix(overlaps_all, mode = c("undirected"), diag = F)
csize <- clusters(graph_object)$csize
return(data.frame(n_search=length(csize), n_reformulate=ifelse(length(csize)>1,paste(csize-1, collapse=',' ), as.character(csize-1))))
}
reform_times <- searches %>%
group_by(wiki, `test group`, session_id, search_id) %>%
do(reformulation(.$`result page IDs`, .$page_id, all_tokens))
We grouped connected searches together using the rules above, then we have 51354 total search groups.
reform_times %>% ungroup %>%
mutate(n_search=ifelse(n_search>=10,"10+",n_search)) %>%
group_by(n_search) %>%
summarise(n_session=n()) %>%
mutate(prop=n_session/sum(n_session)) %>%
ggplot(aes(x = n_search, y = prop)) +
geom_bar(stat = "identity") +
scale_y_continuous("Proportion of search sessions", labels = scales::percent_format()) +
scale_x_discrete("Number of Search Groups", limits = c(1:9, "10+")) +
geom_text(aes(label = sprintf("%.1f%%", 100*prop)), nudge_y = 0.025) +
ggthemes::theme_tufte(base_family = "Gill Sans", base_size = 14) +
ggtitle("Proportion of search sessions making N search groups")
In Figure 14 and Figure 16, we can see that test group users are less likely to reformulate their queries. Figure 15 and Figure 17 show that zhwiki had the largest discrepancy between control and test group.
reformulation_counts <- reform_times %>%
group_by(`test group`) %>%
summarize(`searches with query reformulations` = sum(as.numeric(unlist(lapply(n_reformulate, function(x) strsplit(x, ",")))) > 0),
searches = sum(n_search),
proportion = `searches with query reformulations`/searches) %>%
ungroup
reformulation_counts <- cbind(
reformulation_counts,
as.data.frame(
binom:::binom.bayes(
reformulation_counts$`searches with query reformulations`,
n = reformulation_counts$searches,
tol = .Machine$double.eps^0.1)[, c("mean", "lower", "upper")]
)
)
reformulation_counts %>%
ggplot(aes(x = `test group`, y = `mean`, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper)) +
scale_x_discrete(limits = levels(events$test_group)) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_brewer("Test Group", palette = "Set1", guide = FALSE) +
labs(x = NULL, y = "% of searches with query reformulations",
title = "Searches with reformulated queries by test group",
subtitle = "Queries were grouped when they shared common results or common key words") +
geom_text(aes(label = sprintf("%.2f%%", 100 * proportion),
vjust = "bottom", hjust = "center"), nudge_x = 0.1)
reformulation_counts <- reform_times %>%
group_by(wiki, `test group`) %>%
summarize(`searches with query reformulations` = sum(as.numeric(unlist(lapply(n_reformulate, function(x) strsplit(x, ",")))) > 0),
searches = sum(n_search),
proportion = `searches with query reformulations`/searches) %>%
ungroup
reformulation_counts <- cbind(
reformulation_counts,
as.data.frame(
binom:::binom.bayes(
reformulation_counts$`searches with query reformulations`,
n = reformulation_counts$searches)[, c("mean", "lower", "upper")]
)
)
reformulation_counts %>%
ggplot(aes(x = 1, y = `mean`, color = `test group`)) +
geom_pointrange(aes(ymin = lower, ymax = upper), position = position_dodge(width = 1)) +
scale_y_continuous(labels = scales::percent_format(), expand = c(0.01, 0.01)) +
scale_color_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
labs(x = NULL, y = "% of searches with query reformulations",
title = "Searches with reformulated queries by test group and wiki",
subtitle = "Queries were grouped when they shared common results or common key words") +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), y = upper + 0.0025, vjust = "bottom"),
position = position_dodge(width = 1)) +
theme(legend.position = "bottom")+
facet_wrap(~ wiki, ncol = 3)
count_reformulation <- function(n_reformulate){
temp <- as.numeric(unlist(lapply(n_reformulate, function(x) strsplit(x, ","))))
temp <- ifelse(temp>=3, "3+", temp)
output <- as.data.frame(table(temp))
output$proportion <- output$Freq/sum(output$Freq)
colnames(output)[1] <- "query reformulations"
return(output)
}
reform_times %>%
group_by(`test group`) %>%
do(count_reformulation(.$n_reformulate)) %>%
ggplot(aes(x = `query reformulations`, y = proportion, fill = `test group`)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), vjust = -0.5),
position = position_dodge(width = 1)) +
labs(y = "Proportion of searches", x = "Approximate number of query reformulations per grouped search",
title = "Number of query reformulations by test group",
subtitle = "Queries were grouped when they shared common results or common key words") +
theme(legend.position = "bottom")
reform_times %>%
group_by(wiki, `test group`) %>%
do(count_reformulation(.$n_reformulate)) %>%
ggplot(aes(x = `query reformulations`, y = proportion, fill = `test group`)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer("Test Group", palette = "Set1", guide = guide_legend(ncol = 2)) +
geom_text(aes(label = sprintf("%.1f%%", 100 * proportion), vjust = -0.5),
position = position_dodge(width = 1)) +
labs(y = "Proportion of searches", x = "Approximate number of query reformulations per grouped search",
title = "Number of query reformulations by test group and wiki",
subtitle = "Queries were grouped when they shared common results or common key words") +
theme(legend.position = "bottom")+
facet_wrap(~ wiki, ncol = 3)
For the test group, we observed a much better ZRR but slightly worse PaulScores, large decrease in clickthrough rate, and fewer users clicked on the first result first, indicating that we are showing users worse results. However, dwell-time and query reformulation analysis show that users may like the results they are getting in some aspect. We recommend deploying BM25 for all wikis but not reindexing projects in space-less languages for now.
This test revealed the performance of BM25 is unsatisfactory on space-less languages: some CirrusSearch components are dependent on the presence of spaces. We agree that the current behavior (forcing a proximity match on every query) is far from ideal but given the outcome of the test we decided not to move forward without prior work on space-less languages. We need better tokenization, and we need to track and fix all the components that make the bold assumption of the presence of spaces to activate/deactivate features.
The relatively large decrease in engagement and relevancy for zhwiki may be the result of tokenizer behavior. Chinese is the sole language in this test where we do not have a custom analysis chain. We emit only unigrams, so any page that randomly has all the same characters as the query in it will be returned. This can greatly decreases ZRR without any increase in search result quality or relevance. In English this would be roughly similar to returning any page that had all the same letters as the query.
The query reformulation analysis in this report is not ideal. Firstly, finding out shared tokens could not detect reformulated queries when users fix a typo in space-less languages. For example, when users modify their queries from “灯龙” to “灯笼” (lantern) they are fixing a typo, but tokenizers would take them as two different words. Secondly, we found that many users like to try their queries in different languages. Without enabling search across wikis in different languages, we are unable to detect this kind of reformulation.