Hi Darin,

I noticed in your recent post that you consider Figure 1 of the Sept 27 JAMA article to be a "mathematical artifact". Take a look at this chart on Dr Bennett’s blog, which he maintains is “proof” of a correlation between CD4 cell count and HIV viral load… he reports a magic R value of .93 calculated from Figure 1.

I am leaning towards thinking you are correct in your analysis, but Bennett seems to have a point (or five -- not to be gratuitously humorous about the matter because there is nothing at all funny about AIDS, really.)

Any comment? Thanks.

A concerned reader.

Dear Concerned Reader,

He has NOT proved a correlation between viral load and CD4 cell loss in individual patients. He has proved a correlation between the viral load and CD4 cell loss for five values that represent the median CD4 cell loss for their respective viral load subgroups. It is statistical trickery of the most transparent kind.

Here is the best way I can put it: ANY data set such as the one under consideration has a "line of best fit"; these appear in Figure 3. ANY data set at all. The coefficient of determination (R^2) tells you how well the data set as a whole "fits" this line of best fit. In Figure 3, clearly the data points don't fit at all.

Now here is a more mathematical explanation of what I meant by "it's the statistical equivalent of squinting your eyes so hard you can't see any details anymore". Let's say you just take a more-or-less random data set (as Figure 3 almost is) and break it up into subgroups by intervals of the predictor variable. All of the data points in any one of these subgroups come from patients who presented roughly similar HIV viral load levels. Within any one of these subgroups, the data points are still more-or-less randomly scattered. But there IS a general pattern, because the line of best fit does have negative slope. In other words, if you look at the total set, the points do (very generally) slope down slightly. The same holds for each subgroup: the data points are scattered, but (very generally) have a slight downward trend. The point is that, as the R^2 values for the total data set AND for each subgroup show, they don't "fit" that trend very well (if at all).

Now, we have decided to choose the "median" response for each subgroup. It does not take a rocket scientist to figure out that if you choose the median response for each subgroup, those medians are going to lie somewhere very close to the line of best fit. The only way the median point could lie FAR from the line of best fit would be if the points had some strange distribution, like 2/3 of them very low and then a big jump and the other 1/3 way up high. Just a quick glance at Figure 3 shows they're not strangely distributed like this.

So, here is the net effect of considering the five points in Figure 1: you have a cloud of almost random data points; you plot the line of best fit through those points (which looks almost as absurd as some of the graphs in Ho/Wei); then you "choose" five data points out of the cloud which all happen to lie very close to the line of best fit. It's no surprise then that they all lie in a straight line and hence give a high correlation to each other. The "correlation" does not reflect any real correlation in the data set itself -- it's a mathematical artifact of the way the medians were chosen. It's the statistical equivalent of squinting your eyes real hard and picking five points with your finger. It's ridiculous.
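To see the effect in isolation, here is a minimal simulation sketch. It does NOT use the JAMA data: the slope, the noise level, the sample size, the predictor range and the five equal-width subgroups are all made-up assumptions, chosen only so that the full cloud has an R^2 in the neighborhood of 0.04. The function name `simulate` and its parameters are hypothetical. It builds a weakly sloped cloud, splits it into subgroups by the predictor, takes the median of each subgroup, and compares R^2 for the whole cloud with R^2 for the five medians.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, slope=-14.0, noise_sd=80.0, n_bins=5, seed=0):
    """Return (R^2 of the full cloud, R^2 of the subgroup medians).

    All parameter values are illustrative assumptions, not study data.
    """
    rng = random.Random(seed)
    lo, hi = 2.0, 6.0  # hypothetical predictor range (think log10 viral load)

    # A weak linear trend buried in a lot of scatter.
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]

    # Split the cloud into equal-width subgroups of the predictor.
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        bins[i].append((x, y))

    # One "median point" per subgroup.
    med_x = [statistics.median(p[0] for p in b) for b in bins]
    med_y = [statistics.median(p[1] for p in b) for b in bins]

    r2_all = pearson_r(xs, ys) ** 2
    r2_medians = pearson_r(med_x, med_y) ** 2
    return r2_all, r2_medians


if __name__ == "__main__":
    r2_all, r2_medians = simulate()
    print(f"R^2 over all points:       {r2_all:.3f}")
    print(f"R^2 over subgroup medians: {r2_medians:.3f}")
```

Under these assumptions the five medians should line up almost perfectly even though the cloud itself barely fits its own line of best fit. Taking medians averages away the scatter inside each subgroup, so the high value says nothing about how well individual data points follow the trend.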

So, all Figure 1 reflects is the slope of the line of best fit from Figure 3, with the lack of correlation obscured, and with some "error bars". The error bars have absolutely no biological meaning; they are confidence intervals for the median points. They are just saying, "look, the median point lies in here somewhere". So what??

Then people like Bennett point out the "simple linear relationship" in Figure 1 and claim it's evidence of some kind of "correlation". It's NOT reflecting the correlation of the total data set, it's just reflecting the small slope of the line of best fit. But every data set has such a slope value. You can do this trick to ANY data set that's more-or-less randomly distributed. It's clear that the authors of the study couldn't just put the 4 clouds of data points up front in the article and report the R^2 values in the abstract. They had to concoct Figure 1 to distract people at the start of reading the article, and then report the median values in the abstract to give the idea there was some "correlation".

Actually, that wasn't good enough, because the median values were still too close to each other. They had to go back over each subgroup individually and run a different model with each one, and I can hear their collective sighs of relief when they finally got numbers spaced out from each other a little more. (Meaning, more than 10-15 cells/mm^3/year difference between the most extreme groups.) Then they tried to "rescue" the R^2 = 0.04 in several ways, but could only get to at best 0.08 or 0.10. I can just see them after they first saw the data and the actual R^2 values -- OMG, we have to put this in JAMA, what do we do??

This all might mean something if there were any reason to look at the subgroups this way. But I can't find any. The only reason I can find is to smooth the data out and have a nice looking graph like Figure 1. In my 9 Oct post, I point out why I think the boundaries chosen are arbitrary and why I don't think there's any good *biological* reason to group them this way. And biological reasons have to be the reason for choices like this. The reasons can't be purely mathematical or just arbitrary. This is all standard stuff. Do a google search on "subgroup analysis" (in quotes). You'll come up with a slew of articles on how to "misuse/abuse" subgroup analysis. This paper should go down in history as Exhibit A.

Darin C. Brown received his Ph.D. in mathematics from the University of California, Santa Barbara in 2004. His dissertation was in algebraic number theory, although he tells us he also has "interests in Fuchsian groups, category theory, and point-set topology". (*Fuchsian groups? Sounds exciting!*) His "mathematical lineage traces to Stark and Chebyshev". Dr. Brown is also the *wikimeister* at the AIDS Wiki, and recently became curator of the Memorial Serge Lang Archive, announced in the Oct. issue of *The Notices* of the American Mathematical Society.
