Tuesday, May 29, 2007

Lab stats - LNDD 6x more anabolic AAFs than UCLA

Over at DPF, folks have been poking around the per-lab AAF statistics, no doubt provoked by the 300% higher rate claimed for LNDD in the Landis opening statement. We first looked at lab stats back in September, using the 2005 data. These are new observations using the 2006 data.

First, Nomad takes a look:

I finally went back to look at the data more closely. Incidentally, a direct link is:

http://www.wada-ama.org/rtecontent/documen...BSTATS_2006.pdf

The distribution looks vaguely gamma, but the fit wasn't all that great. Anyway, some interesting things:

The mean percent AAFs was 2.16, while the percent total AAFs was 1.96. That's probably mainly because UCLA, which does 3x as many tests as anybody else, has an AAF rate of .99. However, in general the larger labs seem to have fairly mainstream results, with Paris and Ghent, Belgium being exceptions.

The variance of the AAF rates is 2.14, standard error is 1.47. So talking about being 1/2 the average rate doesn't really mean the same as being 2x the average rate. Being as low as .48 is still within 1, while LNDD, at 5.41 is 2.3 or so over. Of course, this distribution isn't normal, having no left tail, so we can't really talk in terms of confidence intervals directly. Somebody with more stats knowledge than me can probably tell us more on this, but it's pretty academic.

Cycling has an A sample AAF rate of 4.17%, over a full percent higher than any Olympic sport. Some of those are TUEs, and some are multiples on the same athlete as in the case of a longitudinal T/E study. But the same can be said of other sports, so the higher rate is definitely a bad thing.

LNDD finds 12% of the total AAFs in the system, more than any other lab. They only do 4% of the total tests. In contrast, the next highest lab in terms of % total AAFs is UCLA at 9.4% of AAFs, but it does 20% of the tests. (OK, that should've been easy math from their AAF rate being half the average.)

LNDD finds 10% of all testosterone AAFs, which is more than one would expect from their share of all tests. Their 12% rate seems to be at least somewhat a consequence of a lot of cannabis use (nearly 20% of the total), though UCLA and Montreal have a lot of those, too.
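The arithmetic behind the "within 1", "2.3 or so over" and share-of-tests remarks can be checked in a few lines. This is only a rough sketch: it assumes the quoted 1.47 is the square root of the 2.14 variance (what is usually called the standard deviation) and that distances are measured from the 1.96 pooled rate. Both are readings of the numbers above, not something the comment states, and the function names are made up for illustration.

```python
import math

# Figures quoted in the comment above (percent units).
pooled_rate = 1.96               # total AAFs / total tests
variance = 2.14                  # variance of the per-lab AAF rates
spread = math.sqrt(variance)     # about 1.46; quoted above as 1.47

def units_from_pooled(lab_rate):
    """Distance of a lab's AAF rate from the pooled rate, in units of `spread`."""
    return (lab_rate - pooled_rate) / spread

low_lab = units_from_pooled(0.48)    # the low-end rate mentioned above
lndd = units_from_pooled(5.41)       # LNDD

def relative_rate(share_of_aafs, share_of_tests):
    """A lab's AAF rate relative to the system-wide rate (1.0 = average)."""
    return share_of_aafs / share_of_tests

lndd_relative = relative_rate(12.0, 4.0)     # 12% of AAFs from 4% of tests
ucla_relative = relative_rate(9.4, 20.0)     # 9.4% of AAFs from 20% of tests

print(f"0.48% lab: {low_lab:+.2f} units, LNDD: {lndd:+.2f} units")
print(f"LNDD: {lndd_relative:.1f}x average, UCLA: {ucla_relative:.2f}x average")
```

Because the shares are rounded, the 3x figure for LNDD is only approximate; dividing the quoted rates directly (5.41 / 1.96) gives a slightly lower multiple.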


Then, N.B.O.L:
Under the category of strange statistics.

Paris reports 2.38% of all the tests that it runs as positive for anabolics.

UCLA reports 0.37% of all the tests that it runs as positive for anabolics.
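For anyone wondering where the "6x" in the headline comes from, it is just the ratio of those two percentages. A quick sketch using only the two figures above (the variable names are illustrative):

```python
paris_anabolic_pct = 2.38   # % of Paris tests reported positive for anabolics
ucla_anabolic_pct = 0.37    # % of UCLA tests reported positive for anabolics

ratio = paris_anabolic_pct / ucla_anabolic_pct
one_in_n = 100.0 / paris_anabolic_pct

print(f"Paris reports anabolic AAFs at {ratio:.1f}x the UCLA rate")
print(f"That is about 1 in {one_in_n:.0f} Paris tests")
```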

28 comments:

Anonymous said...

I wish I had paid more attention in stats now! Can anybody tell me what this means, without the bell-shaped curves?

anne lydsay said...

Can someone please confirm that all of these labs being compared have exactly the same criteria for a negative test? If not the analysis is like comparing apples to elephants.

anne lydsay said...

Hmmm, just looked at the original data, not even comparing the same sports, data includes Archery, Badminton, Handball, Luge, Sailing etc. I wonder if someone could extract the data lab by lab for just cycling, might mean something then.

anne lydsay said...

Looking further into the data, it seems to tell me that cycling has 2.5 times the positive results of your average sport. Us cyclists of the world should be ashamed by the statistic for our sport, no wonder people think we lack credibility....

Cheryl from Maryland said...

Anne, WADA allows each lab to set its own criteria for a positive test. Lawyer friends and family tell me the lack of a standard throughout the labs would have thrown this out of court in the US. Judge Hue - is this your take on this, or am I getting bad legal advice?

bill hue said...

Foundation is important in a US court, so important that a piece of evidence that lacks foundation will be excluded from evidence at trial.

In this arbitration hearing, the very first dispute between the lawyers was Matt Barnett's objection as to "foundation", which drew a very sharp response from Suh, that "We both agreed not to object to foundation".

Consequently, lack of common positivity criteria was not a reason to exclude the evidence or dismiss the case in arbitration, as might have been the case in a US Court (i.e. what rule was violated, there are so many that in essence, there are none) because the parties agreed that the issue would simply be decided by the finder of fact.

Lack of standardization of positivity criteria goes to whether there was a violation of the WADA Code and might be explored on the issue of whether International Laboratory Standards as applied to the criteria occurred.

Anonymous said...

Anne,

In my view the fact that the positivity criteria vary is at least part of the point in looking at the variation in AAF rates between labs. If the criteria were standard, and the labs were all equally competent, then the only variation should be due to tested population. I doubt we would have nearly the variation in rates if that were the case.

It would be great if we could somehow control for 2 of those 3 factors in order to determine the effect of criteria alone, or lab competence alone. WADA probably can, at least to some extent, by looking at the raw results and normalizing to some standard positivity standard. I'm skeptical that they do this.

Nomad

Anonymous said...

I'm not saying that steroids are not a problem, but I find it hard to believe that 1 out of 42 people tested at Paris had used anabolics (the category that testosterone falls under).

-wds

Anonymous said...

I'm not a statistician, but I don't think you can fairly condemn LNDD based on the statistics alone.

1. If you look at total AAF per laboratory for Olympic sports, it's true that the Paris lab (5.16% adverse) has a much higher adverse percentage than average (1.86%). However, the labs on this list are all over the map: Belgium at 3.89%, Rome at 2.65%, Tokyo at 0.48%. Los Angeles, supposedly the "gold standard" for labs, is at 1.24%. It's not like all of the other labs are reporting something close to the 1.86% overall average and Paris is hanging out there by itself with a result far from the average.

2. We can't tell from the numbers whether the population of Olympic athletes being tested by Paris can fairly be compared to the population of athletes being tested by other labs. It's already been noted here that cyclists have a higher AAF percentage than other athletes. Maybe Paris is testing more cyclists than does the average lab. Maybe Paris is testing cyclists at a time of year (or during events) when they're more likely to cheat. Maybe Paris tests a higher percentage of elite cyclists, and maybe elite cyclists are more likely to cheat than cyclists who race in second or third tier events. There's no way to know from these statistics.

3. Let's assume for the moment that there is some meaning to the raw statistics, and that the higher AAF percentage for the French lab means that they will find AAFs in cases where other labs would not. Does this mean that the French lab is doing something wrong? Not necessarily. Maybe the other labs are getting it wrong, and failing to catch the cheaters that the French lab is spotting. Depending on your point of view, a lab might be condemned for having too high a percentage or too low a percentage. It's not necessarily a badge of honor to be "average" in this particular area.

OK, I admit that the statistics are INTERESTING. You can find some really strange stuff in there, if you're willing to look. For example: what's with the 7.36% AAF percentage for Billiards Sports? Or if you want to put the LNDD in a bad light: they did a little over 4% of the Olympic sport testing, but found over 30% of the glucocorticosteroid violations! Anyone want to explain THAT to me?

bill hue said...

This is an interesting debate and exploration of the issue in a civilized discussion format. That is refreshing and a compliment to all posters on either side of the issues. I'm learning from all of you and thank you for the opportunity to do that in a safe and respectful environment.

Sincerely,
Bill Hue

Anonymous said...

I have a question. Would it be possible for a magazine or newspaper to "setup" a blind trial using these various labs, by setting up a competitive event and then sending split samples to all of the WADA labs without letting the labs know that other labs are testing the samples? By putting in samples that will be known positives and blanks that are known to be negative, we could really see if there is a large difference in results from lab to lab. It would make an interesting expose for a magazine or newspaper to report on the results of such a blind trial.

Georgian said...

" I have a question. Would it be possible for a magazine or newspaper to "setup" a blind trial using these various labs, by setting up a competitive event and then sending split samples to all of the WADA labs without letting the labs know that other labs are testing the samples?"

And include cortisone in some of those samples.

Anonymous said...

As to the blind tests, I seem to recall hearing that WADA does send spiked samples through the labs periodically to make sure they catch them. I also vaguely recall thinking that the frequency of such tests wasn't all that often. I think mainly they rely on the process for the quality control, rather than results of known samples. Hence the urine blanks, reference mixtures, etc. To really be confident with results it seems to me you'd have to send clean as well as spiked samples, and find some samples that are dirty in the sense of difficult to separate peaks, etc.

As to how much we can really read into these statistics, you're right, it's not much. Particularly with the small sample size sports, such as billiards, you're almost certainly seeing normal probabilistic variation. That being said, while one shouldn't expect any lab AAF rates to be exactly the 1.96 average, a large majority of them are within one unit of standard error.

I think it's safe to say that something is different at the labs with high rates, but it's impossible to say if it's because of tested population, positivity criteria, or lab practices. Without some detailed information about the first 2, one can't legitimately impugn the third.

What you can say about the contribution of lab practices after seeing the information that came out during the trial is, of course, entirely up to you.

Nomad

Anonymous said...

Nomad: a simple z-test is used to determine the difference between two percentages (i.e., the number of AAFs over the total number of tests).

This is a parametric test so it does require a normal (i.e., bell-curve) population distribution and an assumption of random sample selection.

Statisticians frequently say that the z-test is robust to departures from normality, which is their way of saying the test works even when the population is not normally distributed (and we wonder why people dislike stats folks).

The sample selection criteria are more difficult, but since these are all "elite" athletes in some sport I am not overly concerned about this.

When I run a set of z-tests comparing the following labs: Montreal (I'm Canadian), Paris, Rome, LA and Belgium, the most striking result is the fact that Paris differs from all except Belgium and is higher in all cases.

Likewise, the LA lab is significantly lower than all others.

Moreover, the z-test (two-tailed, at the 95% confidence level) requires a score of 1.96 to be significantly different. The Paris lab's z-scores when compared with Montreal (Where the French open will be tested) and LA are 14.08 and 27.8 respectively.

The Paris lab is statistically different from all others except Belgium.

Stephen
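For readers who want to try the kind of test described in this comment, here is a minimal sketch of a pooled two-proportion z-test. The per-lab counts are not given anywhere in this thread, so the Paris and LA counts below are ASSUMPTIONS rebuilt from the rounded shares and rates quoted in the post (4% and 20% of tests, 5.41% and 0.99% AAF rates) and the 198,143 test total given elsewhere in the thread. The function name is illustrative, and this may not be exactly the calculation that was run.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

TOTAL_TESTS = 198_143                    # 2006 system-wide total from the thread

# ASSUMPTION: counts rebuilt from rounded shares and rates, not read from the report.
paris_n = round(0.04 * TOTAL_TESTS)      # "4% of the total tests"
paris_x = round(0.0541 * paris_n)        # 5.41% AAF rate
la_n = round(0.20 * TOTAL_TESTS)         # "20% of the tests"
la_x = round(0.0099 * la_n)              # 0.99% AAF rate

z = two_proportion_z(paris_x, paris_n, la_x, la_n)
significant = abs(z) > 1.96              # two-tailed, 95% level

print(f"z = {z:.1f}, significant: {significant}")
```

Even with counts this rough, the statistic comes out in the high 20s, the same ballpark as the Paris versus LA figure quoted above and far beyond the 1.96 cutoff.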

Anonymous said...

Stephen,

Thanks for the reply. It all sounds so familiar, but just outside my recollection to actually do the math. Summers always do that to me.

Anyway, is it possible to back into the maximum score at which a lab is still significantly different from Paris at the .95 level? My intuition tells me that it would be around 3, but there are several labs right around there. If the division is at, say, 3.3, then there really aren't very many labs in the same range as LNDD.

Nomad

Anonymous said...

Nomad: it is possible to find the point at which one lab is different from LNDD but it is not trivial.

Two data points go into the calculation: the total number of observations and the number of yes responses (in this case the AAFs).

This means that the value changes with the total number of observations. Which means that each lab, when compared with the LNDD will have a different value since they each did a different number of tests in total.

I will try to find a shorthand for this but it will take a bit of time.

Cheers,

Stephen
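The point that the cutoff moves with the number of tests can be illustrated with a small search. This is a sketch under stated assumptions: LNDD's AAF count (454) and test count (backed out of the 5.41% rate) are reconstructed from figures quoted elsewhere in the thread, the three lab sizes are hypothetical, and `cutoff_rate` is a made-up helper name.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# ASSUMPTION: LNDD counts reconstructed from figures quoted in the thread.
LNDD_X = 454                          # AAFs
LNDD_N = round(LNDD_X / 0.0541)       # tests, backed out of the 5.41% rate

def cutoff_rate(lab_n):
    """Highest AAF rate (in %) at which a lab with lab_n tests is still
    significantly below LNDD at the two-tailed 95% level."""
    best = 0
    for x in range(lab_n + 1):
        # z falls as the comparison lab's count rises, so stop at the first miss.
        if two_proportion_z(LNDD_X, LNDD_N, x, lab_n) <= 1.96:
            break
        best = x
    return 100.0 * best / lab_n

# Hypothetical lab sizes, only to show that the cutoff depends on test volume.
for lab_n in (2_000, 8_000, 40_000):
    print(f"{lab_n:>6} tests: cutoff about {cutoff_rate(lab_n):.2f}%")
```

The larger the comparison lab, the closer its rate can sit to LNDD's and still count as significantly different, which is why no single cutoff applies to every lab.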

Anonymous said...

Nomad: here is the "simple" way to determine how much the LNDD deviates from the "norm."

If we assume that the full year's results are essentially representative of the population, we know that

198,143 tests gave 3887 AAFs for 1.96%

Now if we compare the results of the LNDD to this population, it is significantly different at the 95% limit: z = 21.564

To make the LNDD number of AAFs not significantly different from this "population" it would have to reduce the number of found AAFs to 190, yes a difference of 264.

I alas must stop here as the more detailed calculations will both tax my brain and take me away from my day job.

Cheers,

Stephen
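The numbers in this comment can be roughly reproduced with a pooled two-proportion z-test of LNDD against the full-year totals. A sketch, with two caveats: LNDD's AAF count is taken as 190 + 264 = 454 as the comment implies, and its test count is an ASSUMPTION backed out of the 5.41% rate quoted in the post. Whether this is the exact calculation that was run is a guess.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

SYSTEM_TESTS, SYSTEM_AAFS = 198_143, 3_887   # full-year totals quoted above
LNDD_AAFS = 190 + 264                        # implied by the comment above
# ASSUMPTION: LNDD test count backed out of the 5.41% rate quoted in the post.
LNDD_TESTS = round(LNDD_AAFS / 0.0541)

z = two_proportion_z(LNDD_AAFS, LNDD_TESTS, SYSTEM_AAFS, SYSTEM_TESTS)

# Walk the LNDD count down until the difference stops being significant.
max_ns = LNDD_AAFS
while two_proportion_z(max_ns, LNDD_TESTS, SYSTEM_AAFS, SYSTEM_TESTS) > 1.96:
    max_ns -= 1

print(f"z = {z:.2f}; no longer significant at or below {max_ns} AAFs")
```

Under these inputs the statistic lands close to the quoted 21.564 and the search stops at the same 190 AAFs. One thing to keep in mind is that the system totals include LNDD's own tests, so the two samples are not independent; the sketch follows the comment's framing rather than correcting for that.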

Anonymous said...

Stephen,

If you drop UCLA as an outlier, how does that change the results?

Nomad said that the distribution was somewhat gamma. If that is true, do you need to use a different significance test, and/or how much error/bias does using the normal z-test introduce?

Anonymous said...

Stephen -

Aren't the z-values for many of the other labs ALSO off the scale? For example, Los Angeles (which is held up as a "model" lab).

You assume that the full year's results are representative of the population. But there's no evidence supporting this assumption. I'm not a statistician, but since we have so little data on the populations tested by each lab, can you really conclude anything meaningful from these statistics?

Larry

Anonymous said...

Anonymous 11:05 - The question posed was simply - does the LNDD get more AAFs than we would expect given the behaviour of the overall system. Thus the use of the full sample as an approximation to the population. If we start dumping any sites because we think they are outliers then we are applying a set of judgments on the data that puts us on an awfully slippery slope.


Larry: you are right, I made an assumption and told you what it was. Remember there are facts here - each site did x number of tests and reported y number of AAFs. At this point we make assumptions about the quality of the data and do the analysis or not.

Cheers,

Stephen

Anonymous said...

Stephen,

Drop UCLA anyway, just to see how powerfully it influences the results.

If the results are the same then you can have more confidence in them. Isn't that what you do for all statistical analysis, determine the influence of powerful outliers?

Remember you may already have a non-normal (gamma) distribution.

Anonymous said...

Late comment to this. High level statistics make my head hurt; when I need to understand a statistical analysis I call a math prof who advises researchers on their project designs and results analysis. So I won't try to take that on.

It seems to me that the most relevant data set would be created by the regular submittal of sets of uniform samples to the accredited labs and a tracking of their results. The sample sets should be made up of "random" selections from a standard large set of samples - some having no banned substances, others having one or more. Labs could not compare notes before turning in their results. This would enable the accrediting organization to evaluate the accuracy of a lab's results and put their status as an approved lab on the line.

The information on the testing should be readily available to the public. This would eliminate the problem inherent in statistical analysis: an outlier may attract attention, and be unlikely, but that does not mean that it is false. All it does is raise questions. As far as LNDD goes, the testimony on its operations and, especially, the failure to remove the lifting rings from the magnet help explain why its performance varies so much.
pcrosby

Anonymous said...

Anonymous 12:09

I will go on record and say that I don't like to remove outliers for the simple reason that I rarely have enough evidence to say "observation x is an outlier because ...". By this I mean I don't usually have the knowledge to say why it is an outlier and thus would not sleep well at night having removed an unwanted/unliked data point. Call me a chicken if you like, it takes a braver person than me to remove data.

That said, when I remove the LA data from the sample, the LNDD results are still significantly different (z = 18.859), and now we would have to bring the number of LNDD AAFs down to 211 to make the difference NS.

There will be no more playing with the data on my end.

Cheers,

Stephen

Anonymous said...

The comments regarding independent verification or accreditation procedures to establish effective objective comparisons and inter-lab controls are all based upon the faulty assumption that WADA or the labs in question are interested in such a thing.

Believe me, the knowledge exists within the organizations, but there is no collective will to make it happen. If there were, the USADA and LNDD would have presented those findings as proof that the lab produces reliable results and Landis would have had much less of a case.

Independently verified control testing or accreditation will not happen until those funding WADA, the respective ADA's of the nations in question and the labs involved demand that it happens, or until WADA gets new leadership.

Anonymous said...

Stephen,

Thanks for trying.

I hope you don't feel too unclean!

Anonymous said...

Anon 1:11,

Your observation is absolutely right. If WADA had a system like I described, then it would have trotted it out to show that the LNDD procedures did not have an effect on the reliability of its testing.

Where you are wrong, I think, is in implying a passive role for those reading this site. Although it could be tilting at windmills, I think that an effort needs to be made to apply pressure as effectively as we can to see that reasonable changes are made. WADA might complain about the cost, but it could not argue that such a testing regime would not benefit its goal, its results and, in the long run, reduce its costs by undercutting challenges to its testing. Assuming that its tests and criteria for a positive are valid, which may be a reach.

I'm a bit of a blind squirrel, but I am fumbling around trying to find an entry/pressure point in the federal process that could be used. I will report on any nuts I find.
pcrosby

Anonymous said...

PCrosby, I think WADA is not interested in having all of their labs operate in the same way. I think that they're interested in having all of their labs meet certain minimum standards, which means (I think) that all labs be able to detect performance-enhancing drugs present in urine above certain specified thresholds. If any given lab has the expertise and equipment to exceed those minimum standards, then WADA would encourage this also.

I come to this conclusion from reading the WADA technical document at http://www.wada-ama.org/rtecontent/document/perf_limits_2.pdf.

I think this is a reasonable position for WADA to take. WADA should want to encourage world labs to advance the science and the state of the art. It seems right to recognize that some labs are going to have superior resources and will be able to outperform other labs -- so long as the minimum standards are set in a way that's fair to the athlete and reasonably protects the sport from dopers.

This is another reason why I'm reluctant to condemn LNDD based solely on the statistics. Even assuming that each lab tests a random sampling of athletes, we can expect that some labs are going to do a superior job of catching cheats than other labs, given their superior resources and technology.

(Yes, from the testimony in the Landis case, we have reason to doubt that LNDD is a "superior" lab. But for the moment, we're just trying to figure out what we can learn from the statistics.)

So ... in your ideal test, I think we'd want to see first if all labs can similarly detect prohibited substances present in urine above WADA's set minimums.

Larry

Ken (EnvironmentalChemistry.com) said...

Anon 3:31 PM,

Since we have beat the dancing monkey metaphor to death, I totally approve of your use of Bill Hue's blind squirrel metaphor. :)

Personally I think our best pressure point would be getting the NY Times to follow the LA Times' lead on the WADA story and finally do some real reporting on the lack of due process within WADA and USADA and inconsistent lab procedures at WADA accredited labs.

The more big reputable newspapers there are following this story, the easier it will be to effect change.