If you can’t trust the result of an experiment, you can’t trust the decisions you make based on it. And you’d be surprised how often you can’t trust the result of an experiment.
We need to track the data integrity of our experiments very closely, so that if there is an issue, we know about it sooner rather than later.
One essential (and simple) check anyone can perform is the Sample Ratio Mismatch (SRM) check. You don’t have to be an analyst or data scientist to do this.
In this article, I’ll give you a practical overview of what SRM is and the various ways to check it. With this knowledge, you can go away and integrate it into your own experiment process.
Before we start, I must stress one point: while SRM catches many problems, it doesn’t highlight every possible problem with your setup. This is just the minimum you should be doing.
What’s a Sample Ratio Mismatch, anyway?
Suppose you have an A/B test where the expected split is 50/50.
But suppose that, in reality, it looks like this:
The sample sizes are not 50/50.
This is what we mean by a “Sample Ratio Mismatch”, or SRM.
The ratio of the samples doesn’t match our expectation (a 50/50 split).
A skew like this can invalidate the test.
Rules when checking for SRM
Before we go further, we need to identify a couple of rules to follow.
The first of these rules is to prioritise SRM checks with “users” rather than “visits”. That’s because it’s “users” who are assigned to experiments. In comparison, “visits” are the number of “sessions” these users have made.
We might actually expect a skew in visits if a variation encourages a user to return more (or less) often.
That said, a clean sample ratio on “visits” is no substitute for resolving an SRM on “users”.
Anyway, onto our second rule for checking Sample Ratio Mismatch. And it’s this: we need to be looking for problems frequently. Checking for SRM is not a one-and-done activity. We need to check our tests pretty much as soon as they launch, and then keep checking regularly.
New experiments should be treated like intensive care patients for at least the first week of launch.
This rule immediately gives us a problem to overcome because the traffic volumes being interrogated could be really low. But there are ways to increase our certainty levels. We’ll cover this a little later in the article. First, let’s look at the easiest ways to identify a problem.
Checking for glaring problems
The first and easiest way to identify a problem is by looking at the test assignment numbers.
Some problems are so glaring that you can raise the alarm straight away without needing to do any math. For instance, if you see 1,000 users in one group and 100 in the other, you know there’s a problem.
It might seem an obvious thing to point out, but it’s important not to assume that checking SRM is some sort of major activity. Just being able to easily see the numbers quickly could save you some major time.
What’s more, once you develop a habit of regularly looking at these sorts of numbers, you develop an eye for spotting mismatches — especially the big ones.
But suppose it’s closer? Suppose you have 10,000 users in the control group and 9,500 users in the variation. What do you do then?
Sample ratio formula
We can run a simple calculation to find the sample ratios.
First, get the total sum of users assigned to the experiment…
total_users_in_test = users_in_control + users_in_variation
…and then work out the percentage of users in each group.
control = users_in_control / total_users_in_test
variation = users_in_variation / total_users_in_test
control = 10,000 / (10,000 + 9,500) = 0.5128
variation = 9,500 / (10,000 + 9,500) = 0.4872
So, 51.28% of users are in the control group, and 48.72% are in the variation. Now, we might expect to see a certain amount of mismatch during the early days of an experiment, but the above looks pretty suspicious.
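As a quick sanity check, here’s the same arithmetic as a minimal Python sketch (the counts are just the example figures above):

users_in_control = 10_000
users_in_variation = 9_500

# Total users assigned to the experiment.
total_users_in_test = users_in_control + users_in_variation

# Share of users in each group.
control = users_in_control / total_users_in_test
variation = users_in_variation / total_users_in_test

print(f"Control: {control:.2%}, Variation: {variation:.2%}")
# Control: 51.28%, Variation: 48.72%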
How can we improve our level of certainty? It’s time to employ some statistics...
Chi-Squared test of independence
Statistics are ways of describing data in useful ways.
We have two samples: the traffic volumes for the control and variation groups. It would be useful to know how likely it is that the difference between these numbers is beyond normal chance.
We can do this using the chi-squared test of independence. Given two samples, this test tells us the probability that the samples are independent.
You actually don’t need to know the formula for calculating chi (though I’ll go through it a little later), as it’s super simple using Python:
from scipy.stats import chisquare

observed = [170471, 171662]
total_traffic = sum(observed)
expected = [total_traffic / 2, total_traffic / 2]

chi = chisquare(observed, f_exp=expected)
print(chi)
The “observed” variable lists two values: number of users in control and number of users in variation.
The “expected” variable is also a list of two values: the number of users we expect for each group.
The volume we’re expecting is half of the total traffic in each. That’s the total traffic, divided by two.
We need the “chisquare” function from scipy.stats. After that, we just feed the numbers in. The output we get is:
Power_divergenceResult(statistic=4.145992932573005, pvalue=0.041733172643879435)
This is a tuple, where the first item is the chi-squared statistic, and the second item is the p-value. Normally, one would look for a p-value of 0.05 or less to determine independence (and, in our case, evidence of SRM).
The problem with 0.05 is that it’s not strict enough for our purposes. Using this might give us a false signal of a problem. Michael Lindon (part of the Optimizely team) goes into detail about this in the following article.
What we need is to be stricter for our test of independence. A value below 0.01 should be enough. With our Python example, we can write up a conditional statement to make it easier to read:
if chi[1] < 0.01:
    print('Warning. SRM may be present.')
else:
    print('Probably no SRM.')
By the way, chi[1] means we access the p-value from the tuple. Putting the snippets together, our full script looks like this:
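# Full chi-squared SRM check, assembled from the snippets above.
from scipy.stats import chisquare

# Observed assignment counts: control and variation.
observed = [170471, 171662]

# Under a 50/50 split, we expect half the total traffic in each group.
total_traffic = sum(observed)
expected = [total_traffic / 2, total_traffic / 2]

chi = chisquare(observed, f_exp=expected)
print(chi)

# chi[1] is the p-value; use the stricter 0.01 threshold.
if chi[1] < 0.01:
    print('Warning. SRM may be present.')
else:
    print('Probably no SRM.')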
Using our example, we don’t have hard evidence of SRM. One more thing to add is that I like to look at cumulative views. The more traffic we have, the closer the two groups should align in terms of the sample split:
Image by author. Screen of abdecisions.com
Above is a pretty typical view of sample ratios. These cumulative views can be useful for determining when SRM began occurring if a defect is introduced partway through the runtime of an experiment.
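If you want to build a similar cumulative view yourself, here’s a minimal sketch using pandas. The daily_counts figures and column names are hypothetical stand-ins for whatever your experiment platform reports:

import pandas as pd

# Hypothetical daily assignment counts; in practice, pull these
# from your experiment platform or analytics warehouse.
daily_counts = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=5),
    "control": [980, 1010, 995, 1020, 990],
    "variation": [1005, 985, 1015, 980, 1010],
})

# Running totals per group, then the cumulative share in control.
daily_counts["cum_control"] = daily_counts["control"].cumsum()
daily_counts["cum_variation"] = daily_counts["variation"].cumsum()
cum_total = daily_counts["cum_control"] + daily_counts["cum_variation"]
daily_counts["control_share"] = daily_counts["cum_control"] / cum_total

# With a healthy 50/50 split, control_share should settle towards 0.50.
print(daily_counts[["date", "control_share"]])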
A spreadsheet view
I realise that many reading this may not be familiar with Python. So, here’s an example using a spreadsheet (works in Google Sheets or Excel):
Using the CHITEST formula, we can pass the two sets of values: observed, and expected.
Image by author. Chi Test using a spreadsheet
=CHITEST(observed_cell_range,expected_cell_range)
The first range of numbers is the “observed” values for control and variation. The second range is for the “expected” (total divided by two).
The output of the formula is the P-value. After that, the rules are the same as our python example: less than 0.01 indicates a possible SRM.
A deeper look at Chi-Squared (optional)
For those who really want to dig into the Chi-Test formula, here it is:
Image by author. Chi statistic = Sum of observed-expected squared, divided by the expected
The chi statistic is the sum, for each group, of the squared difference between the observed and expected values, divided by the expected value.
Huh? Let’s use a spreadsheet to break this down:
Image by author. Spreadsheet with the values of the formula
The columns:
- Observed: the control and variation traffic volumes, respectively.
- Expected: the expected values for each, i.e. the observed total divided by 2.
- Difference: the observed value minus the expected.
- Difference Squared: the difference multiplied by itself.
- Difference Squared / Expected: the difference squared divided by the expected.
Each row is a variation group, e.g. control and variation (A/B).
After this, we’re ready to find the Chi Statistic, which is just a sum of the Diff Squared/Expected (as in the formula above):
Image by author. Sum of difference squared divided by expected
How do we get a p-value from this? Well, there’s one more thing we need first: that’s the degree of freedom, which is calculated as:
Degree of Freedom = (rows − 1) × (columns − 1)
In our case, the degrees of freedom is 1: two rows for control and variation and two columns for observed and expected, so (2 − 1) × (2 − 1) = 1.
We can then use a p-value table to locate our score and find the associated p-value:
Image by author. P-Value table
Or use a spreadsheet function:
Using CHISQ.DIST.RT to find the p-value
=CHISQ.DIST.RT(chi_statistic, degrees_of_freedom)
This function gives us the precise p-value. Our final spreadsheet looks like this:
Image by author. P-Value from CHISQ.DIST.RT
Of course, you can circumvent all of this by using the CHITEST function as shown earlier.
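And if you’d like to check the manual calculation in Python too, here’s a minimal sketch using scipy.stats.chi2. It should reproduce the p-value that chisquare gave us earlier:

from scipy.stats import chi2

observed = [170471, 171662]
expected = [sum(observed) / 2] * 2

# Chi statistic: sum of (observed - expected)^2 / expected.
chi_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: (rows - 1) * (columns - 1) = 1 for an A/B test.
degrees_of_freedom = 1

# Right-tail probability, equivalent to CHISQ.DIST.RT in a spreadsheet.
p_value = chi2.sf(chi_stat, degrees_of_freedom)
print(chi_stat, p_value)  # roughly 4.146 and 0.0417, matching chisquare earlier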
Summary
Checking the validity of the traffic split is super easy using a chi-squared test. There’s really no excuse not to do it. Besides, you could really save yourself a lot of time by doing this simple check.
Once you’re doing this regularly, you can start going further by adding more checks for data validation. This is really the beginning. But it’s an essential beginning.
Having said all of that, be wary of crying wolf. That can be as damaging to your experimentation process as data validity issues themselves. I’m usually wary of declaring a problem during the first day of the experiment launch unless there’s a glaring issue.
Do this often, and you’ll become an expert at determining SRM issues early.
About me
I’m Iqbal Ali, writer of comics, former Head of Optimisation at Trainline, creator of the A/B Decisions tool, and a freelance CRO specialist and consultant.
I help companies with their experimentation programs, designing, developing, training, and setting up experiment processes.
Here’s my LinkedIn if you want to connect. Or follow me here on Medium.