Overall Acceptance Criterion in Online A/B Testing: Classical and Promising Approaches

2 Overall Acceptance Criterion in Online A/B Testing:
Classical and Promising Approaches. My name is Alexey Drutsa, and I am presenting my talk on "Overall Acceptance Criterion in Online A/B Testing: Classical and Promising Approaches". This talk is a brief overview of classical and novel approaches to obtaining an OAC. Alexey Drutsa

3 A/B Testing Methodology
I will start by recalling the key points of A/B testing.

4 A/B testing overview
Consider the users of a web service. First, we split them randomly into two groups, A and B. Second, we expose each group to one of two variants of the service: for example, group A sees the current production version and group B sees the evaluated update. Then we calculate a key metric X for each experimental unit; for instance, X(u) is the number of sessions of the user u. Finally, we calculate the Overall Evaluation Criterion (OEC) for each group as the mean value of the key metric, ξA = avg ω in A X(ω) and ξB = avg ω in B X(ω), obtaining, for instance, the sessions-per-user metric.

5 A/B testing overview
Having the OEC value for each group, ξA = avg ω in A X(ω) and ξB = avg ω in B X(ω), we calculate the difference between them, Δ = ξB – ξA, and compare it with zero to decide whether the evaluated update of the service is positive or negative. Finally, a statistical significance test is applied to determine whether the difference is caused by noise or by the treatment effect. Usually, the state-of-the-art Student's t-test is used.
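Here is a minimal sketch of this classical OAC, assuming hypothetical per-user session counts: the mean per group as the OEC and a two-sample t-test (Welch's variant) as the significance test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sessions_A = rng.poisson(lam=10.0, size=5000)   # control (production version)
sessions_B = rng.poisson(lam=10.2, size=5000)   # treatment (evaluated update)

xi_A = sessions_A.mean()        # OEC for group A
xi_B = sessions_B.mean()        # OEC for group B
delta = xi_B - xi_A             # difference in the OEC

# Two-sample t-test (Welch's variant, no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(sessions_B, sessions_A, equal_var=False)
print(f"delta = {delta:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```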

6 Sensitivity
The ability of a metric to detect a statistically significant difference when the treatment effect exists is referred to as its sensitivity. The higher the sensitivity, the smaller the changes of the service that can be detected by an A/B experiment. Improving the sensitivity is a great challenge.
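As a rough illustration (not from the talk), sensitivity can be estimated by simulation as statistical power: the fraction of repeated experiments with a fixed true effect in which the t-test reports p < 0.05. The metric, effect size, and sample sizes below are hypothetical.

```python
import numpy as np
from scipy import stats

def power(n_users, effect, n_runs=500, alpha=0.05, seed=0):
    """Fraction of simulated A/B experiments in which the effect is detected."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_runs):
        a = rng.poisson(10.0, n_users)
        b = rng.poisson(10.0 * (1 + effect), n_users)
        _, p = stats.ttest_ind(b, a, equal_var=False)
        detected += p < alpha
    return detected / n_runs

# Larger groups make the same 1% change easier to detect
for n in (1000, 5000, 20000):
    print(n, power(n, effect=0.01))
```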

7 A/B testing: key components
To sum up, A/B testing evaluation has the following three components: a key metric X of an experimental unit ω in Ω, an evaluation statistic ξ over the experimental units Ω, and a statistical significance test. The first two are referred to as the Overall Evaluation Criterion (OEC) [Kohavi et al., DMKD, 2009]; all three together are referred to as the Overall Acceptance Criterion (OAC) [Drutsa et al., CIKM, 2015].

8 Classical Overall Acceptance Criteria
Now we are ready to consider classical Overall Acceptance Criteria.

9 Classical OACs: key components
The classical choice of the three components is as follows. The key metric X is an engagement metric of user loyalty or activity, such as the number of sessions, the absence time, or the CTR. The evaluation statistic is always the average value, and the statistical significance test is Student's t-test or the bootstrap.

10 User engagement with a web service
Consider the actions of a user on a web service, for instance, a search engine. Each action is either a query (Q), when the user submits a query to the search engine, or a click (C), when the user clicks on a link on a search engine result page. A session is a sequence of user actions whose dwell times are less than 30 minutes; an absence is the period between two consecutive user sessions.
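A minimal sketch of this sessionization rule, assuming a sorted list of action timestamps in seconds (the function name and example data are hypothetical):

```python
def split_into_sessions(timestamps, gap=30 * 60):
    """Return (sessions, absences): sessions as lists of timestamps,
    absences as gap durations (seconds) between consecutive sessions."""
    sessions, absences = [], []
    current = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev < gap:
            current.append(cur)          # same session: gap below 30 minutes
        else:
            sessions.append(current)     # close the session
            absences.append(cur - prev)  # record the absence
            current = [cur]
    sessions.append(current)
    return sessions, absences

# Example: three bursts of activity separated by long gaps
ts = [0, 60, 300, 7200, 7300, 90000, 90060, 90120]
sessions, absences = split_into_sessions(ts)
print(len(sessions), absences)   # 3 sessions, absences [6900, 82700]
```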

11 Experimental units (Ω)
Users are the standard experimental units, but sessions, absences, queries, and clicks can also be used. Examples of corresponding key metrics (X):
User: the number of sessions, queries, or clicks of a user.
Session: the duration of a session; the number of queries in a session.
Absence: the absence time (the duration of an absence).
Query: the number of clicks after a query.
Click: the dwell time of a click.

12 Key metrics (X) of user engagement
Popular key metrics of user engagement represent two aspects. Metrics that quantify user loyalty: the number of sessions of a user; the absence time (the duration of an absence in seconds); etc. Metrics that quantify user activity: the number of queries of a user; the number of clicks of a user; the presence time of a user (the sum of the durations of a user's sessions); the average number of clicks per query of a user (CTR); etc. 16 user engagement metrics were studied and compared in [Drutsa et al., CIKM'2015].
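For illustration only, here is a hypothetical sketch of how such loyalty and activity metrics could be derived from a per-user log; the record structure and field names are assumptions, not taken from the cited paper.

```python
from dataclasses import dataclass

@dataclass
class UserLog:
    sessions: list   # each session: {"duration": sec, "queries": int, "clicks": int}
    absences: list   # absence durations in seconds

def loyalty_metrics(log):
    """Loyalty aspect: number of sessions and mean absence time."""
    mean_absence = sum(log.absences) / len(log.absences) if log.absences else 0.0
    return {"n_sessions": len(log.sessions), "mean_absence_time": mean_absence}

def activity_metrics(log):
    """Activity aspect: queries, clicks, presence time, and clicks per query (CTR)."""
    n_queries = sum(s["queries"] for s in log.sessions)
    n_clicks = sum(s["clicks"] for s in log.sessions)
    presence = sum(s["duration"] for s in log.sessions)
    ctr = n_clicks / n_queries if n_queries else 0.0
    return {"n_queries": n_queries, "n_clicks": n_clicks,
            "presence_time": presence, "ctr": ctr}

log = UserLog(sessions=[{"duration": 300, "queries": 3, "clicks": 2},
                        {"duration": 120, "queries": 1, "clicks": 1}],
              absences=[6900.0])
print(loyalty_metrics(log), activity_metrics(log))
```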

13 Novel OACs: Overview Now we are ready to consider novel overall acceptance criteria.

14 Key components of an OAC:
Let us recall the key components of an Overall Acceptance Criterion: a key metric X of an experimental unit ω in Ω, an evaluation statistic ξ over the experimental units Ω, and a statistical significance test. Classical OACs use engagement metrics (for instance, the number of sessions per day) as the key metric, the average value as the evaluation statistic, and Student's t-test or the bootstrap as the test. However, each of these components has alternatives:
(a) instead of the average or total user behavior during an experiment (for instance, the total number of sessions or the number of sessions per day), consider other aspects of behavior, e.g., its periodicity;
(b) instead of one metric, consider a machine-learned combination of metrics, e.g., the predicted future user behavior;
(c) instead of the average value over each user group, consider other evaluation statistics, e.g., the median, quantiles, the entropy, etc.;
(d) instead of the difference Δ = ξB – ξA of statistics over each user group, consider one joint statistic over both groups.
Other statistical significance tests could also be considered, e.g., the Mann-Whitney U test, the Tarone-Ware test, the logrank test, etc. Further, I will discuss the first four alternatives, which were considered in recent studies at Yandex.

15 Novel OACs: User Engagement Periodicity
Let us consider metrics of user engagement periodicity.

16 Measuring periodicity via DFT
Let x = (x_0, x_1, …, x_{N-1}) be, for instance, the time series of the number of sessions of a user on each day of a considered N-day period. We apply the Discrete Fourier Transform (DFT) and obtain the Fourier coefficients c_k = Σ_{n=0..N-1} x_n · e^(−2πi·kn/N), k = 0, …, N−1. This is the decomposition of the source series x into sine waves: the k-th wave oscillates with frequency k/N, i.e., it makes k full cycles over the N-day period.

17 Amplitudes and phases
The k-th Fourier coefficient c_k encodes how the k-th sine wave is present in the source series x. It can be decomposed into two components, c_k = A_k · e^(i·φ_k): the k-th amplitude A_k = |c_k|, which encodes the magnitude of this wave, and the k-th phase φ_k, which encodes how this wave is shifted.

18 Amplitude series and periodicity pattern
The periodicity of user behavior is encoded by the amplitude series (A_0, A_1, …, A_{N−1}): it carries the proportions between the magnitudes of the sine waves of different periodicity and disregards their shifts, which are encoded by the phases. Since A_0 = x_0 + … + x_{N−1} is the total value over the period (and A_0/N the average daily value), the normalized amplitude series (A_k / A_0), called the periodicity pattern, separates periodicity from the total amount of user engagement: it is independent of the total number of sessions over the whole time period. Learn more on periodicity patterns in [Drutsa et al., WSDM'2015].
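A minimal sketch of the amplitude series and the periodicity pattern, computed with NumPy's FFT for a hypothetical daily sessions series (the example data are an assumption):

```python
import numpy as np

# Hypothetical daily numbers of sessions over N = 28 days:
# active on weekdays, almost inactive on weekends.
x = np.array([3, 4, 3, 5, 4, 1, 0,
              3, 5, 4, 4, 3, 1, 0,
              4, 4, 3, 5, 4, 0, 1,
              3, 4, 5, 4, 3, 1, 0], dtype=float)

c = np.fft.fft(x)                       # Fourier coefficients c_k
amplitudes = np.abs(c)                  # amplitude series A_k = |c_k|
phases = np.angle(c)                    # phase series phi_k
pattern = amplitudes / amplitudes[0]    # periodicity pattern A_k / A_0

# A weekly (7-day) cycle over 28 days shows up at k = 4 and its harmonics 8, 12
for k in (1, 2, 3, 4, 8, 12):
    print(k, round(pattern[k], 3))
```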

19 Examples #1 (permanent usage)
(Figures: a source time series and the corresponding amplitude series.) In the case of permanent usage, the 0-th amplitude dominates all others.

20 Examples #2 (single usage)
(Figures: a source time series and the corresponding amplitude series.) In the case of single usage, all amplitudes are of the same order.

21 Examples #3 (half-time absence)
(Figures: a source time series and the corresponding amplitude series.) In the case of half-time absence, the 1-st amplitude dominates all others except the zeroth one.

22 Examples #4 (weekly periodicity)
(Figures: a source time series and the corresponding amplitude series.) In the case of weekly periodicity, the 4-th, 8-th, and 12-th amplitudes dominate all others except the zeroth one.

23 Sign-aware periodicity metrics
The amplitude metrics of user engagement are known to be highly sensitive to service changes [Drutsa et al., WSDM'2015], but they cannot be used to determine whether the treatment effect is positive or negative. This sign-agnostic issue can be overcome by paying attention to the phase of the corresponding DFT sine wave [Drutsa, SIGIR'2015]. Note how, in the case of the sine wave of the 1-st frequency, the phase determines the direction of the trend in this wave.
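As a rough illustration of this idea (not the exact metric from [Drutsa, SIGIR'2015]), the phase of the first-frequency Fourier coefficient flips its sign between a rising and a falling daily engagement series:

```python
import numpy as np

rising = np.arange(28, dtype=float)   # hypothetical series: engagement grows over the period
falling = rising[::-1]                # the same series reversed: engagement declines

for name, x in (("rising", rising), ("falling", falling)):
    c1 = np.fft.fft(x)[1]             # first-frequency Fourier coefficient
    print(name, "phase of c_1 =", round(float(np.angle(c1)), 2))
# The phase is positive for the rising series and negative for the falling one,
# so it reveals the direction of the trend that the amplitude alone hides.
```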

24 Novel OACs: Predicted Future Engagement
Let us consider how prediction of future user behavior can improve a key metric.

25 Sensitivity improvement
Any evaluation metric is calculated based on the A/B experiment period [0, T): ξA = ξA(T) = avg u in A X(u,T) and ξB = ξB(T) = avg u in B X(u,T). For user engagement metrics, one way to improve the sensitivity of an A/B experiment is to conduct it longer, i.e., to increase the length T of the period. This leads to the main idea: virtually increase the duration of the A/B experiment by peeking into the future user behavior. This approach is referred to as the Future User Behavior Prediction Approach (FUBPA) [Drutsa et al., WWW'2015].

26 Future User Behavior Prediction Approach
In standard A/B testing, we calculate the key metric X(u,T) for each user over the experiment period [0, T). Calculating the metric over a longer period [0, T + TF), i.e., up to the (T+TF)-th day, may increase the sensitivity of the experiment. Instead of conducting the experiment longer, we predict the metric for this extended period: we extract a set of features F(u) of user behavior from the experiment period and use them to predict the future value, X(u,T+TF) = PX(F(u),T+TF). Thus, the predicted value of the metric is used instead of its actual value: classical A/B testing uses ξA(T) = avg u in A X(u,T) and ξB(T) = avg u in B X(u,T), whereas FUBPA uses ξA(T) = avg u in A X(u,T+TF) and ξB(T) = avg u in B X(u,T+TF).
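A minimal sketch of the FUBPA idea under purely hypothetical features, predictor, and data (the actual approach in [Drutsa et al., WWW'2015] uses its own feature set and models): fit a regressor on historical users to map in-experiment features F(u) to the metric over the extended period, then run the t-test on the predicted values.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def features(n, lift=0.0):
    """Hypothetical in-experiment features F(u): sessions, queries, clicks in [0, T)."""
    s = rng.poisson(10 * (1 + lift), n)
    q = rng.poisson(30 * (1 + lift), n)
    c = rng.poisson(20 * (1 + lift), n)
    return np.column_stack([s, q, c]).astype(float)

# Historical users (outside the experiment): features and the metric over [0, T + T_F)
F_hist = features(20000)
x_future_hist = 2.0 * F_hist[:, 0] + rng.normal(0, 3, len(F_hist))   # hypothetical X(u, T + T_F)
predictor = GradientBoostingRegressor().fit(F_hist, x_future_hist)   # the predictor P_X

# In the A/B experiment, use predicted X(u, T + T_F) instead of observed X(u, T)
F_A, F_B = features(5000), features(5000, lift=0.02)
pred_A, pred_B = predictor.predict(F_A), predictor.predict(F_B)
t_stat, p_value = stats.ttest_ind(pred_B, pred_A, equal_var=False)
print(f"delta = {pred_B.mean() - pred_A.mean():.3f}, p = {p_value:.4f}")
```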

27 Novel OACs: Diversity and Extreme Cases
Next, the following Overall Acceptance Criteria are able to detect changes in diversity and extreme cases of user behavior.

28 Example: two Gamma distributions
Let us consider an example of two Gamma distributions: D_A(x) = λ·e^(−λx) and D_B(x) = 4λ²·x·e^(−2λx). Both have the same mean value: E(X | V = A) = E(X | V = B) = 1/λ. Thus, the treatment effect changes the distribution of the key metric X, but it is not observed via the mean values over the user groups.
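A quick numeric check of this example (sample sizes and λ are hypothetical): the t-test on the means sees essentially nothing, while a quantile or a rank test reveals the difference.

```python
import numpy as np
from scipy import stats

lam, n = 1.0, 50000
rng = np.random.default_rng(0)
x_A = rng.gamma(shape=1, scale=1 / lam, size=n)        # D_A(x) = lam * exp(-lam * x)
x_B = rng.gamma(shape=2, scale=1 / (2 * lam), size=n)  # D_B(x) = 4 * lam^2 * x * exp(-2 * lam * x)

print("means:", round(x_A.mean(), 3), round(x_B.mean(), 3))             # both close to 1/lam
print("t-test p:", stats.ttest_ind(x_B, x_A, equal_var=False).pvalue)   # typically large
print("0.25-quantiles:", round(np.quantile(x_A, 0.25), 3),
      round(np.quantile(x_B, 0.25), 3))                                 # clearly differ
print("Mann-Whitney p:", stats.mannwhitneyu(x_B, x_A).pvalue)           # detects the difference
```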

29 Quantiles
The γ-quantile is the value qγ defined by P(X ≤ qγ) ≥ γ and P(X > qγ) ≤ 1 – γ. A quantile quantifies user behaviors that are far from the mean and may help a web service detect the treatment effect of an evaluated update on users with either lower or higher engagement than the "average" user. For instance, the (first) 0.25-quantile of the number of sessions describes users that use the service rarely, while the (third) 0.75-quantile quantifies frequent users. This is important for web companies that fight for new, rare users in a competitive environment or choose to preserve and increase the engagement of loyal, regular users.

30 Entropy
Entropy (the Shannon entropy) is defined as the expectation E( I(X) ) of the information I(X) = – ln P( X ). It is a measure of "chaos" or unpredictability of information content; in our case, it is a measure of the diversity of user engagement with a web service. The entropy statistic may help web services detect whether an update increases or decreases the variety of different types of user behavior. Different evaluation statistics (including entropy and quantiles) were compared in [Drutsa et al., CIKM'2015].
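A minimal sketch of the entropy statistic for a binned key metric (data, bins, and group setup are hypothetical): the two groups below have roughly the same mean, but group B shows more diverse behavior and hence higher entropy.

```python
import numpy as np

def shannon_entropy(values, bins):
    """Shannon entropy E(I(X)) with I(X) = -ln P(X), over the binned empirical distribution."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
x_A = rng.poisson(10.0, 50000)                      # homogeneous behavior
x_B = rng.poisson(rng.uniform(5.0, 15.0, 50000))    # more diverse behavior, same mean
bins = np.arange(0, 41)
print(shannon_entropy(x_A, bins), shannon_entropy(x_B, bins))   # entropy of B exceeds A
```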

31 Novel OACs: Optimal Distribution Decomposition Approach
Finally, we consider the Optimal Distribution Decomposition Approach.

32 Motivation of the ODD approach
Why do we restrict ourselves? The statistic Δ = ξB – ξA is the difference of absolute characteristics (i.e., OECs) of the key metric's distributions {X(ω)} ω in A and {X(ω)} ω in B in the variants A and B: each variant of the system is represented as a single point (its mean value) on a universal scale. Instead, we can develop a relative statistic that directly compares the two distributions: Ξ({X(ω)} ω in A, {X(ω)} ω in B). By not restricting ourselves to individual absolute scores for each variant, we obtain an additional degree of freedom, which may be used to improve the sensitivity of the metric. Moreover, a relative measurement may be more appropriate, because the mean value is not always proportional to money: if X is some additive reward, then the mean makes sense, but if X is, e.g., the dwell time of a click, the reward is not proportional to X. We would rather measure the portion of "successes" encoded in the distribution of X.

33 Latent binary effect variable
Let V be the variant variable (i.e., V = A or B). We assume that, in each individual online experiment, there is a latent binary effect variable Y (a function of the observed events), which takes the values 1 and 0 and is responsible for the change in the distribution of the key metric X under the evaluated system update. Intuitively, the two values may correspond to extreme types of events: absolute successes (Y = 1) and absolute failures (Y = 0). We follow the intuition that, if the key metric is able to detect the treatment effect, then this effect is reflected in a change of Y. Therefore, we assume that X does not depend on the system variant V directly, but only indirectly via Y: P(X | Y, V) = P(X | Y). In other words, the classical causal scheme V → X is replaced by the scheme with the effect variable, V → Y → X, and there is no causal effect of the treatment that is not conditioned by Y. Instead of Δ, we calculate the change of the effect variable Y: α = pB – pA, where pC := E(Y | V = C) = P(Y = 1 | V = C) for C = A or B.

34 How to evaluate α: problem formalization
To apply this general approach, we need to evaluate α. We are given the two distributions of X conditioned on V: DA(x) = P(X = x | V = A) and DB(x) = P(X = x | V = B). We need to find scalars pA, pB in [0,1] and distributions F0(x), F1(x) such that DA(x) = pA F1(x) + (1 – pA) F0(x) and DB(x) = pB F1(x) + (1 – pB) F0(x) for all x, where F1(x) = P(X = x | Y = 1) and F0(x) = P(X = x | Y = 0). By the law of total probability, each density is decomposed into two components, one per value of Y. Note that the solution of this system is usually not unique; next, I describe the set of solutions.

35 Set of solutions
For fixed pA ≠ pB we have the unique solution F0 = (1/α) [pB DA – pA DB] and F1 = (1/α) [(1 – pA) DB – (1 – pB) DA] (if pA = pB, then α in the denominator equals zero). However, F0 and F1 must also be non-negative, so not every pair (pA, pB) is allowed. The set P of all admissible solutions (pA, pB) consists of two convex components: one is defined by the inequalities {pB ≥ M pA and pB ≥ 1 – m (1 – pA)}, where M = sup DB/DA and m = inf DB/DA, and the other is its image under the symmetry with respect to the point (1/2, 1/2). Different bases (F0, F1) correspond to different points of this set, and α is proportional to the signed distance from the point to the diagonal pB = pA. Since the set of possible solutions is infinite, we need to choose one of them.

36 ODD: optimal distribution decomposition
Task: choose the optimal basis F0(x), F1(x) such that |α| is minimal or, equivalently, |F0 – F1| is maximal (since α (F1 – F0) = DB – DA, the two are inversely related). The motivation is to show that the initial distributions differ even in the worst case of the basis, i.e., the one that reveals the minimal possible difference between them. This optimization problem has exactly two solutions, which correspond to the same basis with F0 and F1 interchanged. For α > 0 the solution is: p0A = (1 – m) / (M – m), p0B = M (1 – m) / (M – m), and α0+ = [ (M – 1)^(−1) + (1 – m)^(−1) ]^(−1). Learn more in [Nikolaev et al., KDD'2015].
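A minimal sketch of these formulas for two discrete key-metric distributions on a common support (the input histograms are hypothetical; in practice DA and DB would be the empirical distributions from the two groups):

```python
import numpy as np

def odd_decomposition(D_A, D_B):
    """Return (p_A, p_B, alpha, F0, F1) for the optimal decomposition
    D_A = p_A*F1 + (1-p_A)*F0 and D_B = p_B*F1 + (1-p_B)*F0.
    Assumes D_A > 0 on the support and D_B/D_A is non-constant."""
    ratio = D_B / D_A
    M, m = ratio.max(), ratio.min()
    p_A = (1 - m) / (M - m)
    p_B = M * (1 - m) / (M - m)
    alpha = p_B - p_A                      # equals [(M-1)^-1 + (1-m)^-1]^-1
    F0 = (p_B * D_A - p_A * D_B) / alpha
    F1 = ((1 - p_A) * D_B - (1 - p_B) * D_A) / alpha
    return p_A, p_B, alpha, F0, F1

D_A = np.array([0.4, 0.3, 0.2, 0.1])       # hypothetical distribution for variant A
D_B = np.array([0.3, 0.3, 0.25, 0.15])     # hypothetical distribution for variant B
p_A, p_B, alpha, F0, F1 = odd_decomposition(D_A, D_B)
print("alpha =", alpha)
print(np.allclose(D_A, p_A * F1 + (1 - p_A) * F0),
      np.allclose(D_B, p_B * F1 + (1 - p_B) * F0))   # decomposition holds
```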

37 Conclusions We made an overview of the classical approach to obtaining Overall Acceptance Criteria based on metrics of user engagement with a web service. We also considered novel and promising OACs proposed in recent studies on A/B testing, including:
Fourier amplitudes as key metrics that quantify the periodicity aspect of user behavior;
prediction of future user behavior as a method to combine several observed features into one interpretable key metric;
entropy and quantiles as evaluation statistics that quantify the diversity and extreme cases of user engagement metrics;
Optimal Distribution Decomposition as a method to obtain a relative measure of system changes observed in an A/B experiment.

38 References These are our recent studies.
Drutsa A., Gusev G., and Serdyukov P. "Engagement Periodicity in Search Engine Usage: Analysis and Its Application to Search Quality Evaluation". WSDM'2015.
Drutsa A., Gusev G., and Serdyukov P. "Future User Engagement Prediction and its Application to Improve the Sensitivity of Online Experiments". WWW'2015.
Drutsa A. "Sign-Aware Periodicity Metrics of User Engagement for Online Search Quality Evaluation". SIGIR'2015.
Nikolaev K., Drutsa A., Gladkikh E., Ulianov A., Gusev G., and Serdyukov P. "Extreme States Distribution Decomposition Method for Search Engine Online Evaluation". KDD'2015.
Drutsa A., Ufliand A., and Gusev G. "Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics". CIKM'2015.

39 Thank you! Questions?

