It’s measurement error, and it’s baked into the process. These data aren’t precise indicators; they are general tendencies based on measures of central tendency, which mask variability.
Any time you talk about what “should” happen and start factoring out important things like defense and park factors, you are going to create measurement error. That doesn’t mean the data are bad or useless; it means people should not treat the data as an absolute measure, because defense and park factors always matter. They are real things that affect the outcome of an AB.
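To make the central tendency point concrete, here is a rough Python sketch. The numbers are completely made up for illustration (they are not real player data), and the names `steady`, `streaky`, and `summarize` are just hypothetical labels. It only shows that two sets of values can share the same average while having very different spread.

```python
from statistics import mean, pstdev

# Made-up "expected" hit values for two hypothetical hitters.
# These are illustrative numbers, not real data.
steady = [0.30, 0.30, 0.30, 0.30]
streaky = [0.00, 0.10, 0.50, 0.60]


def summarize(values):
    """Return (average, population standard deviation), rounded to 3 places."""
    return round(mean(values), 3), round(pstdev(values), 3)


steady_summary = summarize(steady)
streaky_summary = summarize(streaky)

# Same average, very different spread around it.
print("steady :", steady_summary)
print("streaky:", streaky_summary)
```

Both lists average out the same, but the second one swings a lot more around that average. That spread is the part a single "expected" number hides.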