Are meta-analyses conducted by professional organizations more trustworthy?

A well-done meta-analysis is the new gold standard for evaluating psychotherapies. Meta-analyses can overcome the limitations of any single randomized controlled trial (RCT) by systematically integrating results across studies and identifying and contrasting outliers. Meta-analyses have the potential to resolve inevitable contradictions in findings among trials. But meta-analyses are constrained by the quality and quantity of available studies. The validity of meta-analyses also depends on the level of adherence to established standards in their conduct and reporting, as well as the willingness of those doing a meta-analysis to concede the limits of available evidence and refrain from going beyond it.

Yet meta-analytic malpractice is widespread. Authors with agendas strive to make their point more strongly than the evidence warrants. I have shown how meta-analysis was misused to claim that long-term psychoanalytic psychotherapy (LTPP) is more effective than briefer alternatives. And then there is the claim, made by a radical American antiabortionist via a meta-analysis in the British Journal of Psychiatry, that abortion accounts for much of the psychiatric disturbance among women of childbearing age.

Funnel Plot

Meta-analyses often seem to have intimidating statistical complexity and bewildering graphic displays of results. What are consumers to do when they have neither the time nor the ability to interpret findings for themselves? Is there particular reassurance in a meta-analysis having been commissioned by a professional organization? Does being associated with a professional organization certify a meta-analysis as valid?

That is the question I am going to take up in this blog post. The article I am going to be discussing is available here.

Hart, S. L., Hoyt, M. A., Diefenbach, M., Anderson, D. R., Kilbourn, K. M., Craft, L. L., … & Stanton, A. L. (2012). Meta-analysis of efficacy of interventions for elevated depressive symptoms in adults diagnosed with cancer. Journal of the National Cancer Institute, 104(13), 990-1004.

In the abstract, the authors declare

 Our findings suggest that psychological and pharmacologic approaches can be targeted productively toward cancer patients with elevated depressive symptoms. Research is needed to maximize effectiveness, accessibility, and integration into clinical care of interventions for depressed cancer patients.

Translation: The evidence for the efficacy of psychological interventions for cancer patients with elevated depressive symptoms is impressive enough to justify dissemination of these treatments and integration into routine cancer care. Let’s get on with the rollout.

The authors did a systematic search, identifying

  • 7700 potentially relevant studies, narrowing this down to
  • 350 full-text articles that they reviewed, from which they selected
  • 14 trials from 15 published reports for further analysis.
  • 4 studies lacked the data for calculating effect sizes, even after attempts to contact the authors.
  • 10 studies were at first included, but
  • 1 then had to be excluded as an extreme outlier in its claimed effect size, leaving
  • 9 studies to be entered into the meta-analysis, one of which yielded 2 effect sizes.

The final effect sizes entered into the meta-analysis were 6 for what the authors considered psychotherapy, from 5 different studies, and 4 pharmacologic comparisons. I will concentrate on the 6 psychotherapy effect sizes from the 5 studies. You can find links to their abstracts or the actual studies here.

Why were the authors left with so few studies? They had opened their article claiming over 500 unique trials of psychosocial interventions for cancer patients since 2005, of which 63% involved RCTs. But most evaluations of psychosocial interventions do not recruit patients with sufficient psychological distress or depressive symptoms to register an improvement. Where does that leave claims that psychological interventions are evidence-based and effective? The literature is exceedingly mixed as to whether psychosocial interventions benefit cancer patients, at least those coming to clinical trials. So, the authors were left to make do with the few studies that recruited patients on the basis of heightened depressive symptoms.

Independently evaluating the evidence

Three of the 6 effect sizes classified as psychotherapeutic—including the 2 contributing most of the patients to the meta-analysis—should have been excluded.

The three studies (1, 2, 3) evaluated collaborative care for depression, which involves substantial reorganization of systems of care, not just providing psychotherapy. Patients assigned to the intervention groups of each of these studies received more medication and better monitoring. In the largest study, the low-income patients assigned to the control group had to pay out of pocket for care, whereas care was free for patients assigned to the intervention group. Not surprisingly, patients assigned to the intervention group got more and better care, including medication management. There was also a lot more support and encouragement offered to the patients in the intervention conditions. Improvement specifically due to psychotherapy, rather than something else, cannot be separated out in these three studies.

I have done a number of meta-analyses and systematic reviews of collaborative care for depression. I do not consider such wholesale systemic interventions as psychotherapy, nor am I aware of other articles in which collaborative care has been treated as such.

Eliminating the collaborative care studies leaves effect sizes from only 2 small studies (4, 5).

One (4) contributed 2 effect sizes, based on comparisons of 29 patients receiving cognitive behavior therapy (CBT) and 23 receiving supportive therapy to the same 26-patient no-treatment control group. There were problems in the way this study was handled.

  • The authors of the meta-analysis considered the supportive therapy group as an intervention, but supportive therapy is almost always considered a comparison/control group in psychotherapy studies.
  • The supportive therapy had better outcomes than CBT. If the supportive therapy were re-classified as a control comparison group, the CBT would have had a negative effect size, not the positive one that was entered into the meta-analysis.
  • Including two effect sizes from the same trial violates the standard assumption in meta-analysis that all of the effect sizes being entered are independent.

Basically, the authors of the meta-analysis are counting the wait-list control group twice in what was already a small number of effect sizes. Doing so strengthened the authors’ case that the evidence for psychotherapeutic intervention for depressive symptoms among cancer patients is strong.

The final study (5) involved 45 patients randomly assigned to either problem-solving therapy or a waitlist control, but results for only 37 patients were available for analysis. The study had a high risk of bias because analyses were not intent-to-treat. It was seriously underpowered, with less than a 50% probability of detecting a positive effect even if one were present.

Null findings are likely with such a small study. Had the authors reported null findings, the study would probably not have been published, because being too small to detect anything is a reasonable criticism. So we are more likely to find positive results from such small studies in the published literature, but they probably will not be replicated in larger studies.

Once we eliminate the three interventions misclassified as psychotherapy and deal with the waitlist control group of one study being counted twice as a comparator, we are left with only two small studies. Many authorities suggest this is insufficient for a meta-analysis, and it certainly cannot support the sweeping conclusions these authors wish to draw.

How the authors interpreted their findings

The authors declare that they find psychotherapeutic interventions to be

reliably superior in reducing depressive symptoms relative to control conditions.

They offer reassurance that they have checked for publication bias. They should have noted that tests for publication bias are low powered and not meaningful with such a small number of studies.

They then suddenly offer a startling conclusion, without citation or further explanation.

The fail-safe N (the number of unpublished studies reporting statistically nonsignificant results needed to reduce the observed effect to statistical nonsignificance) of 106 confirms the relative stability of the observed effect size.

What?! Suppose we accept the authors’ claim that they have five psychotherapeutic intervention effect sizes, not the two that I claim. How can they claim that there would have to be 106 null studies hiding in desk drawers to unseat their conclusion? Note that they already had to exclude five studies from consideration, four because they could not obtain basic data from them, and one because the effects claimed for problem-solving therapy were too strong to be credible. So, this is a trimmed-down group of studies.

In another of my blog posts I indicated that clinical epidemiologists, as well as the esteemed Cochrane Collaboration, reject the validity of fail-safe N, and I summarized some good arguments against it. But just think about it: on the face of it, do you think the results are so strong that it would take over 100 negative studies to change our assessment? This is a nonsensical bluff intended to create false confidence in the authors’ conclusion.
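To see why fail-safe N flatters small literatures, here is a minimal sketch of Rosenthal's formula using made-up z-values (these are not the values from the Hart et al. meta-analysis): a handful of modest effects already yields a double-digit count of "missing" null studies, because the formula ignores study size, heterogeneity, and bias.

```python
# Rosenthal's fail-safe N: how many unpublished null studies would be needed to
# drag the combined result below one-tailed p = .05.
# The z-values below are invented for illustration; they are NOT taken from the
# Hart et al. meta-analysis.

def fail_safe_n(z_values, z_crit=1.645):
    k = len(z_values)
    z_sum = sum(z_values)
    # Rosenthal (1979): N_fs = (sum of Z)^2 / z_crit^2 - k
    return (z_sum ** 2) / (z_crit ** 2) - k

zs = [2.1, 1.9, 2.4, 1.7, 2.2]   # five modest, hypothetical study-level z-scores
print(round(fail_safe_n(zs), 1)) # ~34 "missing" null studies implied by only 5 small trials
```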

The authors perform a number of subgroup analyses that they claim show CBT to be superior to problem-solving therapy. But the subgroup analyses are inappropriate. For CBT, they take the effect sizes from two small studies in which the intervention and control groups differed only in whether patients received the therapy. For PST, they take the effect sizes from the very different large collaborative care interventions that involved changing whole systems of care. Patients assigned to the intervention group got a lot more than just psychotherapy.

There is no basis for making such comparisons. The collaborative care studies, as I noted, involve not only providing PST to some of the patients, but also medication management and free treatment, when patients – who were low income – in the control condition had to pay for it and so received little. There are just too many confounds here. Recall from my previous blog posts that effect sizes do not characterize a treatment but rather a treatment in comparison to a control condition. The effect sizes that the authors cite are invalid for PST, and the conditions of the collaborative care studies versus the small CBT studies are just too different.
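The point that an effect size characterizes a comparison, not a treatment, can be made with a line of arithmetic. The numbers below are invented, not taken from any of these trials: the same hypothetical symptom reduction yields a large d against a do-nothing waitlist and a small d against an active comparator that already provides medication management and free care.

```python
# A between-group effect size describes a treatment *relative to a comparator*,
# so the same treatment yields different d's against different controls.
# All numbers are invented for illustration.

def cohens_d(mean_change_tx, mean_change_ctrl, sd_pooled):
    return (mean_change_tx - mean_change_ctrl) / sd_pooled

symptom_drop_tx = 6.0  # hypothetical symptom reduction in the therapy arm
print(cohens_d(symptom_drop_tx, 1.0, 5.0))  # vs. a bare waitlist: d = 1.0
print(cohens_d(symptom_drop_tx, 4.5, 5.0))  # vs. enhanced care with free medication: d = 0.3
```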

https://www.youtube.com/watch?v=R-sLX5UZaxk

Is you is or is you ain’t a meta-analysis organized by the Society of Behavioral Medicine?

 The authors wish to acknowledge the Society of Behavioral Medicine and its Evidence-Based Behavioral Medicine Committee, which organized the authorship group…Society of Behavioral Medicine, however, has not commissioned, supervised, sanctioned, approved, overseen, reviewed, or exercised editorial control over the publication’s content. Accordingly, any views of the authors set forth herein are solely those of the authors and not of the Society of Behavioral Medicine.

Let’s examine this denial in the context of other information. The authors included a recent President of SBM and other members of the leadership of the organization, including one person who would soon be put forward as a presidential candidate.

The Spring/Summer 2012 SBM newsletter states

 The Collaboration between the EBBM SIG and the EBBM Committee (Chair: Paul B. Jacobsen, PhD) provided peer review throughout the planning process. At least two publications in high impact journals have already resulted from the work.

One of the articles to which the newsletter refers is the meta-analysis of interventions for depressive symptoms. I was a member of the EBBM Committee at the time this was being written. This and the earlier meta-analyses were inside jobs done by the SBM leadership. A number of the authors are advocates for screening for distress. Naysayers and skeptics on the EBBM Committee were excluded.

The committee did not openly solicit authors for this meta-analysis in its meetings, nor did it discuss the project’s progress. When I asked David Mohr, one of the eventual authors, why the article was not being discussed in the meetings, he said that the discussions were being held by telephone.

Notably missing from the authors of this meta-analysis is Paul Jacobsen, who was head of the Evidence-Based Behavioral Medicine Committee during its writing. He has published meta-analyses and arguably is more of an expert on psychosocial intervention in cancer care than almost any of the authors. Why was he not among the authors? He is given credit only for offering “suggestions regarding the conceptualization and analysis” and for providing “peer review.”

It would have been exceedingly awkward if Jacobsen had been listed as an author. His CV notes that he was the recipient of $10 million from Pfizer to develop means of assuring the quality of care provided by oncologists. So, he would have had to declare a conflict of interest on a meta-analysis from SBM evaluating psychotherapy and antidepressants for cancer patients. That would not have looked good.

Just before the article was submitted for publication, I received a request from one of the authors asking my permission to mention me in the acknowledgments. I was taken aback because I had never seen the manuscript, and I refused.

I know, as Yogi Berra would say, we’re heading for déjà vu all over again. In earlier blog posts (1, 2) I criticized a flawed meta-analysis done by this group concerning psychosocial interventions for pain. When I described that meta-analysis as “commissioned” by SBM, I immediately got a call from the president asking me for a correction. I responded by posting a link to an email from one of the authors describing that meta-analysis, as well as this one, as “organized” by SBM.

So, we are asked to believe the article does not represent the views of SBM, only those of the authors, but these authors were hand-picked and include some of the leadership of SBM. Did the authors take off their hats as members of the governance of SBM during the writing of the paper?

The authors are not a group of graduate students who downloaded some free meta-analysis software. There were strong political considerations in their selection, but as a group, they have experience with meta-analyses. The Journal of the National Cancer Institute (JNCI) is not some mysterious fly-by-night journal that is not indexed in ISI Web of Science. To the contrary, it is a respected, high-impact journal (JIF = 14.3).

As with the meta-analysis of long-term psychoanalytic psychotherapy with its accompanying editorial in JAMA, followed by publication of a clone in the British Journal of Psychiatry, we have to ask whether the authors had privileged access to publishing in JNCI with minimal peer review. Could just anyone have gotten such a meta-analysis accepted there? After all, there are basic, serious misclassifications of the studies that provided most of the patients included in the meta-analysis of psychotherapeutic intervention. There are patently inappropriate comparisons of different therapies delivered in very different studies, some without the basis of random assignment. I speak for myself, not PLOS One, but if, as an Academic Editor, I had received such a flawed manuscript, I would have recommended immediately sending it back to the authors without it going out for review.

Imagine that this meta-analysis were written/organized/commissioned/supported by pharmaceutical companies

What we have is an exceedingly flawed meta-analysis that reaches a seemingly foregone conclusion promoting the dissemination and implementation of services by the members of the organization from which it came. The authors rely on an exceedingly small number of studies, bolstered by the recruitment of some that are highly inappropriate for addressing the question of whether psychotherapy improves depressive symptoms among cancer patients. Yet the authors’ conclusions are a sweeping endorsement of psychotherapy in this context, unqualified by any restrictions. It is a classic use of meta-analysis for marketing purposes, for branding of services being offered, not scientific evaluation. We will see more of these in future blog posts.

If the pharmaceutical industry had been involved, the risk of bias would have been obvious and skepticism would have been high.

But we are talking about a professional organization, not the pharmaceutical industry. We can see that the meta-analysis was flawed, but we should also consider whether that is because it was written with a conflict of interest.

There are now ample demonstrations that practice guidelines produced by professional organizations often serve their members’ interests at the expense of evidence. Formal standards have been established for evaluating the process by which these organizations produce guidelines. When applied to particular guidelines, the process and outcome often come up short.

So, we need to be just as skeptical about meta-analyses produced by professional organizations as we are about those produced by the pharmaceutical industry. No, Virginia, we cannot relax our guard, just because a meta-analysis has been done by a professional organization.

If this example does not convince you, please check out a critique of another one written/organized/commissioned/supported by the same group (1, 2).


Keeping zombie ideas about personality and health awalkin’: A teaching example

Reverse engineer my criticisms of this article and you will discover a strategy to turn your own null findings into a publishable paper.

Here’s a modest little study with null findings, at least before it got all gussied up for publication. It has no clear-cut clinical or public health implications. Yet, it is valuable as a teaching example showing how such studies get published. That’s why I found it interesting enough to blog about it at length.

 

van de Ven, M. O., Witteman, C. L., & Tiggelman, D. (2013). Effect of Type D personality on medication adherence in early adolescents with asthma. Journal of Psychosomatic Research, 75(6), 572-576. Abstract available here and full text here.

As I often do, I am going to get quite critical in this blog post, maybe even making some readers wince. But if you hang in there, you will see some strategies for publishing negative results as if they were positive that are widely used throughout personality, health, and positive psychology. Your critical skills will be sharpened, but you will also be able to reverse engineer my criticisms to get papers with null findings published.

Read on and you’ll see things that the reviewers at the Journal of Psychosomatic Research apparently did not see, nor did the editors, but they should have. I have emailed the editors inviting them to join in this discussion, and I am expecting them to respond. I have had lots of dealings with them and actually find them to be quite reasonable fellows. But peer review is imperfect, and one of the good things about blogging is that I have the space to call it out when it fails us.

The study examined whether some measures of negative emotion predicted adherence in early adolescents with asthma. A measure of negative affectivity (sample item: “I often make a fuss about unimportant things”) and what was termed social inhibition (sample item “I would rather keep other people at a distance”) were examined separately and when combined in a categorical measure of Type D personality (the D in Type D stands for distress).

Type D personality studies were once flourishing, even getting coverage in Time and Newsweek and discussion by Dr. Oz. The claim was that a Type D personality predicted death among congestive heart failure patients so well that clinicians should begin screening for it. Type D was supposed to be a stable personality trait, so it was not clear what clinicians could do with the information from screening. But I will be discussing in a later blog post why the whole area of research can itself be declared dead because of fundamental, inescapable problems in the conception and measurement of Type D. When I do that, I will draw on an article co-authored with Niels de Voorgd, “Are we witnessing the decline effect in the Type D personality literature?”

John Ioannidis provided an approving commentary on my paper with Niels, with the provocative title “Scientific inbreeding and same-team replication: Type D personality as an example.” Among the ideas attributable to Ioannidis are that most positive findings are false, and that most “discoveries” are subsequently proven to be false or at least exaggerated. He calls for greater value being given to replication, rather than discovery.

Yet in his commentary on our paper, he uses the Type D personality literature as a case example of how the replication process can go awry. A false credibility for a hypothesis is created by false replications. He documented significant inbreeding among investigators of Type D personality: a quite small number of connected investigators are associated with studies with statistically improbable positive findings. And then he introduced some concepts that can be used to understand the processes by which this small group could have undue influence on replication attempts by others:

… Obedient replication, where investigators feel that the prevailing school of thought is so dominant that finding consistent results is perceived as a sign of being a good scientist and there is no room for dissenting results and objections; or obliged replication, where the proponents of the original theory are so strong in shaping the literature and controlling the publication venues that they can largely select and mold the results, wording, and interpretation of studies eventually published.

Ioannidis’ commentary also predicted that regardless of any merits, our arguments would be studiously ignored and even suppressed by proponents of Type D personality. Vested interests use the review process to do that with articles that are inconvenient and embarrassing. Reviewing manuscripts has its advantages in terms of controlling the content of what is ultimately published.

Don’t get me wrong. Niels and I really did not expect everyone to immediately stop doing Type D research just because we published this article. After all, a lot of data have already been collected. In Europe, where most Type D personality data get collected, PhD students are waiting to publish their Type D articles in order to complete their dissertations.

We were very open to having Type D personality researchers point out why we were wrong, very wrong, and even stupidly wrong. But that is not what we are seeing. Instead, it is as if our article never appeared, with little trace of it in terms of citations, even in, ah, the Journal of Psychosomatic Research, where our article and Ioannidis’ commentary appeared. According to ISI Web of Science, our article has been cited a whopping 6 times overall as of April 2014. And there have been lots more Type D studies published since our article first appeared.

Anyway, the authors of the study under discussion adopted what has become known as the “standardized method” (which means that they don’t have to justify it) for identifying “categorical” Type D personality. They took their two continuous measures of negative affectivity and social inhibition and split (dichotomized) them. They then crossed them, creating a four-cell, 2 × 2 matrix.

Chart 1

Next, they selected out the high/high quadrant for comparison to the three other groups combined as one.

Chart 2

So, the authors made the “standardized” assumption that only the difference between a high/high group and everyone else was interesting. That means that persons who are low/low are treated just the same as persons who are high in negative affectivity and low in social inhibition. Those who were low in negative affectivity but high in social inhibition are likewise treated the same as those who are low on both variables. The authors apparently did not even bother to check – no one usually does – whether some of the people who were high in negative affectivity and low in social inhibition actually had higher scores on negative affectivity than those assigned to the high/high group.

I have been doing my own studies and reviews of personality and abnormal behavior for decades. I am not aware of any other example where personality types are created in which the high/high group is compared to everybody else lumped together. As we will see in a later blog, there are lots of reasons not to do this, but for Type D personality, it is the “standardized” method.
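A minimal sketch of the “standardized” classification, with invented scores and an assumed cutoff of 10 on each subscale (the usual DS14 convention, used here only as an illustration), shows how much information gets thrown away: a patient with extreme negative affectivity but low social inhibition is lumped with patients low on everything, while a patient barely over both cutoffs counts as Type D.

```python
# Hypothetical illustration of the "standardized" Type D classification:
# dichotomize two continuous scales at a fixed cutoff, then compare the
# high/high cell against everyone else. Scores are invented; the cutoff of 10
# follows the usual DS14 convention and is an assumption here.

CUTOFF = 10

def is_type_d(negative_affectivity, social_inhibition, cutoff=CUTOFF):
    return negative_affectivity >= cutoff and social_inhibition >= cutoff

patients = [
    {"na": 24, "si": 3},   # extreme negative affectivity, low inhibition -> "not Type D"
    {"na": 10, "si": 10},  # barely over both cutoffs                     -> "Type D"
    {"na": 2,  "si": 2},   # low on both, lumped with the first patient   -> "not Type D"
]
for p in patients:
    print(p, "->", "Type D" if is_type_d(p["na"], p["si"]) else "not Type D")
```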

Adherence was measured twice in this study. At one point we readers are told that negative emotion variables were also assessed twice, but the second assessment never comes up again.

The abstract concludes that

categorical Type D personality predicts medication adherence of adolescents with asthma over time, [but] dimensional analyses suggest this is due to negative affectivity only, and not to the combination of negative affectivity and social inhibition.

Let’s see how Type D personality was made to look like a predictor and what was done wrong to achieve this. To enlarge Table 2 just below, double click on it.

Table 2

Some interesting things about Table 2 that reviewers apparently missed:

  • At time T1, adherence was not related to negative affectivity, social inhibition, or Type D personality. There is not much prediction going on here.
  • At time T2, adherence was related to the earlier measured negative affectivity, but not to social inhibition or Type D personality.

Okay, if the authors were searching for significant associations, we have one, and only one, here. But why should we ignore the failure of the personality variables to predict adherence measured at the same time and concentrate on the prediction of later adherence? Basically, the authors have examined 2 × 3 = 6 associations, and seem to be getting ready to make a fuss about the one that proved significant but was not predicted to stand alone.

Most likely this statistical significance is due to chance – it certainly was not replicated in the same-time assessments of negative affectivity and adherence at T1. But this association seems to be the only basis for claiming that one of these negative emotion variables is actually a predictor.
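A quick, back-of-the-envelope calculation, under the simplifying assumption that the six tests are independent, shows how easily one "significant" correlation turns up by chance alone:

```python
# With 6 independent tests at alpha = .05 and no true effects, the chance of at
# least one "significant" result is about 26%. Illustrative arithmetic only; the
# six correlations in Table 2 are not fully independent, so treat this as a rough guide.
alpha, k = 0.05, 6
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 3))  # 0.265
```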

  • Adherence at time T2 is strongly predicted by adherence at time T1.

The authors apparently don’t consider this particularly interesting, but it is the strongest association in the data set. They want instead to predict change in adherence from T1 to T2 from trait negative emotion. But why should change in relatively stable adherence be predicted by negative emotion when negative emotion does not predict adherence measured at the same time?

We need to keep in mind that these adolescents have been diagnosed with asthma for a while. They are being assessed for adherence at two arbitrary time points. There is no indication that something happened in between those points that might strongly affect their adherence. So, we are trying to predict fluctuations in relatively stable adherence from a trait, not any upward or downward spiral.

Next, some things we are not told that might further change our opinions about what the authors say is going on in their study.

Like pulling a rabbit out of a hat, the authors suddenly tell us that they measured self-reported depressive symptoms. The introduction explains that this article is about negative affectivity, social inhibition, and Type D personality, but it mentions depression only in passing. So, depression was never given the explanatory status that the authors give to these other three variables. Why not?

Readers should have been shown the correlation of depression with the other three negative emotion variables. We could expect from a large literature that the correlation is quite high, probably as high as their respective reliabilities allow—as good, or as bad as it gets.

There is no particular reason why this study could not have focused on depressive symptoms as predictors of later adherence, but maybe that story would not have been so interesting, in terms of results.

Actually, most of the explanations offered in the introduction as to why measures of negative emotion should be related to adherence would seem to apply to depression. Just go back to the explanations and substitute depression for whatever variable is being discussed. See, doesn’t depression work just as well?

One of the problems in using measures of negative emotion to predict other things is that these measures are related so much to each other that we can’t count on them to measure only the variable we are trying to emphasize and not something else.

Proponents of Type D personality like these authors want to assert that their favored variable does something that depression does not do in terms of predictions. But in actual data sets, it may prove tough to draw such distinctions because depressive symptoms are so highly correlated with components of Type D.

Some previous investigators of negative emotion have thrown up their hands in despair, complaining about the “crud factor” or “big mess” of intercorrelated measures of negative emotion ruining their ability to test their seemingly elegant ideas about supposedly distinct negative emotion variables. When one of the first Type D papers was published, an insightful commentary complained that the concept was entering an already crowded field of negative emotion variables and asked whether we really needed another one.

In this study, the authors measured depressive symptoms with the self-report Hospital Anxiety and Depression Scale (HADS). The name of the scale suggests that it separately measures anxiety and depression. Who can argue with the authority of a scale’s name? But using a variety of simple and complicated statistical techniques, like different variants of factor analysis, investigators have not been able to show consistently that the separate subscales for anxiety and depression actually measure something different from each other – or that the two scales should not simply be combined into a general measure of negative emotion/distress.

So talk about measuring “depressive symptoms” with the HADS is wrong, or at least inaccurate. But there are a lot of HADS data sets out there, and so it would be inconvenient to acknowledge what we said in the title of another Journal of Psychosomatic Research article,

The Hospital Anxiety and Depression Scale (HADS) is dead, but like Elvis, there will still be citings.

Back to this article, if readers had gotten to see the basic correlations of depression with the other variables in Table 2, we might have seen how high the correlation of depression was with negative affectivity. This would have sent us off in a very different direction than the authors took.

To put my concerns in simple form, data that are available to the authors but hidden from the readers’ view probably do not allow the clean kind of distinctions that the authors would need to make if they are going to pursue their intended storyline.

Depressive symptoms are like the heel in rigged American wrestling matches, a foil for Type D personality.

Type D personality is the face, intended to win against depressive symptoms.

But, uh, measures of depressive symptoms show up all the time in studies of Type D personality. Think of such studies as if they are rigged American wrestling matches. Depressive symptoms are the heel (or rudo in lucha libre) that always shows up as a mean and threatening contender, but almost always loses to the face, Type D personality. Read on and find out how supposedly head-to-head comparisons are rigged so this dependably happens.

The authors eventually tell us that they assessed (1) asthma duration, (2) asthma control, and (3) asthma severity. But we are not allowed to examine whether any of these variables were related to the other variables in Table 2. So, we cannot see whether it is appropriate to consider them “control variables” or, more accurately, confounds.

There is good reason to doubt that these asthma variables are suitable “control variables” or candidates for a confounding variable in predicting adherence.

First, for asthma control to serve as a “control variable” we must assume that it is not an effect of adherence. If it is, it makes no sense to try to eliminate asthma control’s influence on adherence with statistics. It sure seems logical that if these teenagers adhere well to what they are supposed to do to deal with their asthma, asthma control will be better.

Simply put, if we can reasonably suspect that asthma control is a daughter of adherence, we cannot keep treating it as if it is the mother that needs to be controlled in order to figure out what is going on. So there is first a theoretical or simple logical objection to treating asthma control as a “control” variable.

Second, authors are not free to simply designate whatever variables they would like as control variables and throw them into multiple regression equations to control a confound. This is done all the time in the published literature, but it is WRONG!

Rather, authors are supposed to check first and determine whether two conditions are met. The candidate variable should be significantly related to the predictor variables; in the case of this study, asthma control should be shown to be associated with one or all of the negative emotion variables. The authors would also have to show that it was related to subsequent adherence. If both conditions are not met, the variable should not be included as a control variable.
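Here is a minimal sketch, with invented data and hypothetical variable names, of the check that should precede entering a covariate: correlate the candidate with both the predictor and the outcome. Note that even passing the check does not help if the "control variable" is plausibly a consequence of the outcome.

```python
# A minimal sketch (invented data, hypothetical variable names) of the check the
# authors skipped: a candidate "control variable" should be associated with both
# the predictor and the outcome before it is entered as a covariate -- and even
# then it should not be a plausible consequence of the outcome.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 100
negative_affectivity = rng.normal(size=n)                 # predictor
adherence_t2 = rng.normal(size=n)                         # outcome
asthma_control = 0.5 * adherence_t2 + rng.normal(size=n)  # generated here as a *consequence* of adherence

r_pred, p_pred = pearsonr(asthma_control, negative_affectivity)
r_out, p_out = pearsonr(asthma_control, adherence_t2)
print(f"r with predictor = {r_pred:.2f} (p = {p_pred:.3f})")
print(f"r with outcome   = {r_out:.2f} (p = {p_out:.3f})")
print("meets both statistical conditions for a covariate?", p_pred < 0.05 and p_out < 0.05)
```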

Reviewers should have insisted on seeing these associations among asthma duration, control, severity, and adherence. While they were at it, they should have required that the correlations be made available to other readers if the article was to be published.

We need to move on. I am already taxing readers’ patience with what is becoming a long read. But if I have really got you hooked on thinking about the appropriateness of controlling for particular confounds, you can digress to a wonderful slide show telling more.

So far, we have examined a table of basic correlations, not finding some of the things that we really need in order to decide what is going on here, and the authors’ storyline seems to be getting into trouble. But multivariate analyses will be brought in to save the effort.

The magic of misapplied multivariate regression.

The authors deftly save their storyline and get a publishable paper with “significant” findings in two brief paragraphs:

The decrease in adherence between T1 and T2 was predicted by categorical Type D personality (Table 3), and this remained significant after controlling for demographic and clinical information and for depressive symptoms. Adolescents with a Type D personality showed a larger decrease in adherence rates from T1 to T2 than adolescents without a Type D personality.

And

The results of testing the dimensions NA and SI separately as well as their interaction showed that there was a main effect of NA on changes in adherence over time (Table 4), and this remained significant after controlling for demographic and clinical information and for depressive symptoms. Higher scores on NA at T1 predicted a stronger decrease in adherence over time. Neither SI nor the interaction between NA and SI predicted changes in adherence.

Wow! But before we congratulate the authors and join in the celebration, we should note a few things. From now on in the article, they are going to be discussing their multivariate regressions, not the basically null findings obtained with the simple bivariate correlations. But these regression equations do not undo the basic findings with the bivariate correlations. Type D personality did not predict adherence; it only appears to do so in the context of some arbitrarily and poorly chosen covariates. But now the authors can claim that Type D won the match fair and square, without cheating.

But don’t get down on these authors. They probably even believe in their results. They have merely followed the strong precedent of what almost everybody else seems to do in the published literature. They did not get caught by the reviewers or editors of the Journal of Psychosomatic Research.

Whatever happened to depressive symptoms as a contender for predicting adherence? They were not let into the ring until after Type D personality and its components had secured the match. These other variables got to do all the predicting they could, and only then were depressive symptoms allowed into the ring. That is what happens when you have highly correlated variables and manipulate the match by picking one to go first.

And there is a second trick guaranteeing that Type D will win over depressive symptoms. Recall that to be labeled Type D personality, research subjects had to be high on negative affectivity and high on social inhibition. Scoring high on two (imperfectly reliable) measures of negative emotion usually bests scoring high on only one (imperfectly reliable) measure. But if the authors had used two measures of depressive symptoms, they could have had a more even match.
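A purely hypothetical simulation, not a reanalysis of this study, shows how the order-of-entry trick works with highly correlated predictors: whichever variable enters the regression first soaks up the shared variance, and the late arrival appears to add nothing.

```python
# A purely hypothetical simulation of the order-of-entry trick: with two highly
# correlated predictors, the one entered first claims the shared variance.
# Data and variable names are invented; this is not a reanalysis of the study.
import numpy as np

rng = np.random.default_rng(1)
n = 200
core = rng.normal(size=n)                        # shared "negative emotion" core
type_d_score = core + 0.3 * rng.normal(size=n)   # two highly correlated measures
depression = core + 0.3 * rng.normal(size=n)
adherence_change = -0.4 * core + rng.normal(size=n)

def r_squared(*predictors):
    X = np.column_stack([np.ones(n), *predictors])
    beta, *_ = np.linalg.lstsq(X, adherence_change, rcond=None)
    resid = adherence_change - X @ beta
    return 1 - resid.var() / adherence_change.var()

print("depression's increment after Type D:", round(r_squared(type_d_score, depression) - r_squared(type_d_score), 3))
print("Type D's increment after depression:", round(r_squared(depression, type_d_score) - r_squared(depression), 3))
# Both increments are close to zero: whichever variable goes first "wins" the shared variance.
```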

The big question: so what?

Type D personality is not so much a theory as a tried-and-true method for getting flawed analyses published. Look at what the authors of this paper said about it in their introduction and discussion. They really did not present a theory, but rather cited precedent and made some unsubstantiated speculations about why past results may have been obtained.

Any theory about Type D personality and adherence really does not make predictions with substantial clinical and public health implications. Think about it: if this study had worked out as the authors intended, what difference would it have made? Type D personality is supposedly a stable trait, and so the authors could not have proposed psychological interventions to change it. That has been tried in other contexts and does not work.

What, then, could authors have proposed, other than that more research is needed? Should the mothers of these teenagers be warned that their adolescents had Type D personality and so might have trouble with their adherence? Why not just focus on the adherence problems, if they are actually there, and not get caught up in blaming the teens’ personality?

But Type D has been thung.

Because the authors have been saying in lots of articles that they have been studying Type D, it is tough to get heard saying “No, pal, you have been studying statistical mischief. Type D does not exist except as statistical mischief.” Type D has been thung, and who can undo that?

Thing (v). to thing, thinging.   1. To create an object by defining a boundary around some portion of reality separating it from everything else and then labeling that portion of reality with a name.

One of the greatest human skills is the ability to thing. We are thinging beings. We thing all the time.

And

Yes, yes, you might think, but we are not really “thinging.” After all trees, branches and leaves already existed before we named them. We are not creating things we are just labeling things that already exist. Ahhh…but that is the question. Did the things that we named exist before they were named? Or more precisely, in what sense did they exist before they were named, and how did their existence change after they were named?

…And confused part-whole relationships become science and publishable.

Once we have convincingly thung Type D personality, we can fool ourselves and convince others about there being a sharp distinction with the similarly thung “depressive symptoms.”

Boundaries between concepts are real because we make them so, just like the boundary between Canada and the United States, even if particular items are arbitrarily assigned to one or the other questionnaire. Without our thinging, we would not so easily forget that the various items come from the same “crud factor” or “big mess” and could have been lumped or split in other ways.

Part-whole relationships become entities interacting with entities in the most sciencey and publishable ways. See for instance

Romppel, et al. (2012). Type D personality and persistence of depressive symptoms in a German cohort of cardiac patients. Journal of Affective Disorders, 136(3), 1183-1187.

Which compares the effectiveness of Type D as a screening tool with that of established measures of depressive symptoms – measured with the (ugh) HADS – for predicting subsequent HADS depression.

Lo and behold, Type D personality works and we have a screening measure on our hands! Aside from the other advantages that I noted for Type D as a predictor, the negative affectivity items going into the Type D categorization are phrased as if they refer to enduring characteristics, whereas items on the HADS are phrased to refer to the last week.

Let us get out of the mesmerizing realm of psychological assessment. Suppose we ask a question about whether someone ate meatballs last week or whether they generally eat meatballs. Which question would you guess better predicts meatball consumption over the next year?

And then there is

Michal, et al. (2011). Type D personality is independently associated with major psychosocial stressors and increased health care utilization in the general population. Journal of Affective Disorders, 134(1), 396-403.

Which finds in a sample of 2495 subjects that

Individuals with Type D had an increased risk for clinically significant depression, panic disorder, somatization and alcohol abuse. After adjustment for these mental disorders Type D was still robustly associated with all major psychosocial stressors. The strongest associations emerged for feelings of social isolation and for traumatic events. After comprehensive adjustment Type D still remained associated with increased help seeking behavior and utilization of health care, especially of mental health care.

The main limitation is the reliance on self-report measures and the lack of information about the medical history and clinical diagnosis of the participants.

Yup, the study relied on self-report questionnaires in multivariate analyses, not interview-based diagnoses, and the measure of “depression” or “depressive symptoms” asked about the last 2 weeks.

Keeping zombie ideas awalkin’

How did the study of negative emotion and adherence get published with basically null findings? With chutzpah, and by the authors following the formulaic Type D personality strategy for getting published. This study did not really obtain significant findings, but the precedent of many studies of Type D personality was available to support claims of a conceptual replication, even if not an empirical one. And these claims were very likely evaluated by members of the Type D community making similar claims. In his commentary, Ioannidis pointed to how null Type D findings get gussied up as having “approached significance” or, better, as being “independently related to blah, blah, when x, y, and z are controlled.”

Strong precedents are often confused with validity, and the availability of past claims relaxes the standards for making subsequent claims.

The authors were only doing what authors try to do: their damnedest to get their article published. Maybe the reviewers are from the Type D community and, able to cite the authority of hundreds of studies, were only doing what the community tries to do – keep the cheering going for the power of Type D personality and add another study to the hundreds. But where were the editors of the Journal of Psychosomatic Research?

Just because the journal published our paper, for which we remain grateful, I do not assume that the editors will require authors who submit new papers to agree with us. But you would think, if the editors are committed to the advancement of science, they would request that authors of manuscripts at least relate their findings to the existing conversation, particularly in the Journal of Psychosomatic Research itself. Authors should dispute our paper before going about their business. If it does not happen in this journal, how can we expect it to happen elsewhere?

 


Bambi meets Godzilla: Independent evaluation of the superiority of long-term psychodynamic therapy

“As I was saying before I was so rudely interrupted…”— William Connor

This is the second post in a two-part series about claims made in meta-analyses in JAMA and more recently elsewhere that long-term psychodynamic therapy is superior to shorter psychotherapies. This post was intended to be uploaded a while ago. But it got shelved until now because I felt the need to respond to hyped and distorted media coverage of a study in Lancet of CBT for persons with schizophrenia, with two posts examining what turned out to be a far less impressive study than claimed.

That digression probably helped to change the tide of opinion about that Lancet study. When I first posted about it, media coverage was dominated by wild claims about CBT having been shown to be as effective as antipsychotic medication. Coverage in Science was headlined “Schizophrenia: Time to Flush the Meds?” Alternative perspectives were largely limited to a restrained, soft-toned note of skepticism at Mental Elf and louder complaints at Keith Laws’ Dystopia.

Here and elsewhere, I challenged the media coverage. I showed that this study actually had null findings of no difference between CBT and treatment as usual at the end of the intervention. Despite all the previous enthusiasm shown for the study back then, no one is now responding to my request on Twitter to come forward if they still believe the results that were claimed for the study. A single holdout has been persisting with comments at my other blog site, but he increasingly seems like Hiroo Onoda still wandering in the jungle after the battle is lost.

Stay tuned for what could be Keith Laws’ and my article at the new Lancet Psychiatry and likely a slew of letters at Lancet.

There is certainly more about the JAMA meta-analysis worth writing about. It was accompanied by a gushy editorial and praised by people like Peter Kramer, who argued that we can’t argue with the authority of JAMA. Yet when I took a look at the JAMA paper, it proved to be

a bizarre meta-analysis compar[ing] 1053 patients assigned to LTPP to 257 patients assigned to a control condition, only 36 of whom were receiving an evidence based therapy for their condition.

My Mind the Brain post drew heavily on a critique I co-authored with Aaron T Beck, Brett Thombs, and others.

In this post I will continue describing what happened next:

  • Leichsenring and Rabung responded to critics, dodging basic criticisms and charging that those who reject their claims are bringing in biases of their own.

Yup, I was accused of being part of a plot of advocates of cognitive therapy trying to beat down legitimate claims that long-term psychoanalysis and psychodynamic therapy are superior. Those who thought I was part of a plot against cognitive therapy during my analyses of the Lancet study, please note.

  • Leichsenring and Rabung renewed their claims in another meta-analysis in British Journal of Psychiatry for which 10 of the 11 studies were already included in the JAMA meta-analysis.

When you read my account of this recycling, you will probably wonder why they were allowed to do this. I will give some reasons that do not reflect well on that journal. But then Harvard Review of Psychiatry offered continuing education credits for those who wanted to learn how to interpret a meta-analysis from yet another recycling.

  • The long term psychodynamic/psychoanalytic community responded approvingly, echoing Leichsenring and Rabung’s assessment of skeptics.
  • The important question of whether long-term psychoanalytic psychotherapy is better than shorter term therapies got an independent evaluation by another group, which included the world-class meta-analyst and systematic reviewer, John Ioannidis.

When Ioannidis offers a critical analysis of conventional wisdom, it is generally worth paying attention.

Responses from Leichsenring and Rabung and the LTPP community to criticism

I’ve been a skeptical critic long enough not to expect that authors will agree with criticism or that they will substantially modify their conclusions, no matter how incisive the criticisms are. But when there have been such obvious miscalculations as Leichsenring and Rabung made, inflating results in their favor, I would expect at least some admission of error and adjustment of conclusions, even if not a retraction.

None was forthcoming in their response to the criticisms of the JAMA paper or in an extended follow-up.

They admitted no computational error, not even for the claim that in one analysis the effect size for LTPP was 6.9. They did concede that their effect size estimates were different from and even larger than those “commonly assessed.” But there was apparently nothing wrong with that, even if readers might be expecting something comparable to what is typically done in meta-analyses.

Overall, one would never get a sense from Leichsenring and Rabung’s response that they had published one of the wildest examples of meta-analysis malpractice that can be found in any recent high-impact journal. The results were not in any way comparable to what a conventional, well-done meta-analysis would produce. [Click here if you want a more detailed analysis.]

Leichsenring and Rabung’s response to our extended critique was also strange. For instance, we had offered the criticism that the small group of studies entered into the meta-analysis were so heterogeneous that any effect size could not be generalized back to any one of the studies or treatments going into the meta-analysis. To this they responded

Heterogeneity of control conditions and diagnoses (p. 210) are part of the discussion about effectiveness vs. efficacy, which is nearly as old as psychotherapy research itself. Although theoretically important for internal validity, the attempt to create truly homogeneous yet clinically relevant study populations leads psychotherapy ad absurdum.

Really? I guess the Cochrane Collaboration has been getting it all wrong in being so concerned about heterogeneity.

But basically Leichsenring and Rabung’s message was to ask why we were picking on them and their meta-analysis and not someone else. They implied we had undisclosed conflicts of interest in criticizing them:

It is quite ironic that the paper of Bhar et al. is published in close proximity to an editorial dealing with the unmasking of special interest groups [10], which are obviously not limited to somatic medicine and the pharmaceutical industry.

Really, Falk and Sven? Is it a conflict of interest for anyone but a psychoanalyst to have an opinion about a slick sell job of a meta-analysis?

I joined with other colleagues because I was deeply offended by your flouting of standards, your abuse of statistics, your stubborn refusal to acknowledge that you had made mistakes. All in the prestigious JAMA, and with a fawning editorial that became a further source of offense. I wrote and continue to write about your work because I want to cultivate a gag response in others.

The authors went on to accuse us of being in some sort of a plot, of our joining

ranks with an interesting and surprising movement of others [9], who publish comments with relatively low empirical novelty but quite harsh language towards the Leichsenring and Rabung article in other journals, let alone internet blogs and pamphlets.

As of March 2014, the JAMA article had racked up over 180 citations according to ISI Web of Science, and 454 according to Google Scholar. This unusually large discrepancy reflects in part proponents of psychoanalytic therapy making greater use of chapters in books, rather than peer-reviewed articles. Across the LTPP literature, Leichsenring and Rabung’s blatant miscalculations of effect sizes are being uncritically accepted and praised. Recurrent themes get amplified through repetition. Although skepticism is expressed about LTPP being evaluated in RCTs and meta-analyses, the contradictory argument is made, usually in the same article, that the science is solid and that effects are equal to or larger than those for evidence-based therapies, while critics and doubters get slammed as having ulterior motives and undisclosed conflicts of interest.

A Psychology Today counterpoint from a University of Colorado Medical School psychologist to my complaint about the miscalculated effect sizes took the hype to a whole new level:

Indeed, the within-group effect sizes for long-term psychodynamic therapy were quite large (as a rough example: if psychiatric symptoms were SAT scores and long-term psychodynamic therapy were an SAT training program, the average student would expect to increase their score by somewhere around 90-180 points on each section).

You would think that this analogy was a cause for skepticism, but no.

Once is not enough

Eighteen months after publication of the JAMA article, Leichsenring and Rabung published another meta-analysis in British Journal of Psychiatry. Ten of the 11 studies entered into it were already in the JAMA article. The article’s title identified it as an “update.”

The sole study “updating” the JAMA meta-analysis was a decade old and had been excluded from the JAMA analyses. It was Bateman and Fonagy’s comparison of an 18-month “mentalization-based” therapy to structured clinical management, “a counseling model closest to a supportive approach with case management, advocacy support, and problem-oriented psychotherapeutic interventions (p. 357).” This treatment was not manualized or evidence-based. The study did not add much except further statistical and clinical heterogeneity and confusion.

Leichsenring and Rabung concluded from the redone meta-analysis

Results suggest that LTPP is superior to less intensive forms of psychotherapy in complex mental disorders.

Leichsenring and Rabung have continued to turn out redundant meta-analyses. Leichsenring’s article in Harvard Review of Psychiatry offers continuing education credit.

Learning Objectives: After participating in this educational activity, the reader should be better able to evaluate the empirical evidence for pre/post changes in psychoanalysis patients with complex mental disorders, and assess the limitations of the meta-analysis.

Updates of meta-analyses are justified by the passage of time and the accumulation of new relevant studies. There was neither in the case of the BJP article nor of these other meta-analyses. The BJP editor should have recognized the manuscript as an attempt at duplicate publication, aimed at extending a publicity effort into new venues, not a publication justified by new science.

Critics predictably responded to the re-analysis. Only one of the Rapid Responses left on the journal website made it into the print edition, and it was met a month later by an invited editorial in BJP from Jeremy Holmes, author of Introduction to Psychoanalysis – without the same strict word limits:

Leichsenring & Rabung7 found that long-term psychodynamic psychotherapies (LTPPs) produced large within-group effect sizes (average 0.8–1.2) comparable with those achieved by other psychotherapy modalities; that gains tended to accumulate even after therapy has finished, in contrast to non-psychotherapeutic treatments; and that a dose–effect pattern was present, with longer therapies producing greater and more sustained improvement.

The issue of insurance reimbursement was pushed with the reassurance:

Although expensive, psychodynamic psychiatry is able in some circumstances to ‘pay for itself’,9 thanks to offset costs of other expenses (medication, hospital stays, welfare payments, etc).

This extravagant claim was based on the single study added by Leichsenring and Rabung in their BJP reanalysis, in which LTPP was compared to 18 months of "mentalization-based therapy" with case management, advocacy support, and problem-oriented psychotherapeutic interventions.

Elsewhere, Thombs and colleagues had noted a trial excluded by Leichsenring and Rabung because it was too short. That trial found comparable outcomes for a mean of 232 LTPP sessions, at an estimated cost of $29,000 to $40,600 according to the authors, and for 9.8 sessions of a nurse-delivered solution-focused therapy, at a cost of $735 to $980.
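For readers who want the arithmetic spelled out, here is a back-of-the-envelope sketch in Python. The dollar figures are the ones quoted above; the per-session numbers are simple division and purely illustrative, not estimates taken from the trial reports.

```python
# Rough per-session and total-cost comparison using the figures quoted above.
# Illustrative arithmetic only; the underlying cost estimates come from the
# trial authors as cited by Thombs and colleagues.
ltpp_sessions = 232
ltpp_total_low, ltpp_total_high = 29_000, 40_600

sft_sessions = 9.8  # nurse-delivered solution-focused therapy
sft_total_low, sft_total_high = 735, 980

print(f"LTPP: ${ltpp_total_low / ltpp_sessions:.0f}-${ltpp_total_high / ltpp_sessions:.0f} per session, "
      f"${ltpp_total_low:,}-${ltpp_total_high:,} overall")
print(f"Solution-focused therapy: ${sft_total_low / sft_sessions:.0f}-${sft_total_high / sft_sessions:.0f} per session, "
      f"${sft_total_low}-${sft_total_high} overall")
print(f"Overall cost ratio: roughly {ltpp_total_low / sft_total_low:.0f}x to {ltpp_total_high / sft_total_high:.0f}x")
```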

How did this recycling get into British Journal of Psychiatry? It is a matter of conjecture, but the editor at the time was Peter Tyrer, a practicing psychoanalyst. He is also a devout Catholic who gave space to American antiabortionists claiming that a significant portion of the mental health problems of women of childbearing age was due to abortion. He resisted a storm of criticism of the antiabortionists' meta-analysis, which disproportionately featured their own flawed studies. He even recruited Ben Goldacre to manage the resulting crisis. Goldacre, however, joined with critics in denouncing the meta-analysis.

These are only two examples, but they are extraordinary. Maybe Tyrer had some sort of imperial sense of editorial discretion. But there are rules…

Too many flawed meta-analyses by the same authors

A recent BMJ article has noted the prevalence of multiple meta-analyses of the same literature and expressed concern that such clusters of meta-analyses often come from the same group.

Re-analyses from the same authors need special justification and risk perpetuating the same problems from one meta-analysis to another. If inadequacies, including miscalculation and biased, unconventional calculation of effect sizes, require a reanalysis, it should probably be done by another group.

Moreover, if the inadequacies of the earlier analyses by particular authors are the rationale for conducting another meta-analysis, the problems that led to the decision to do so should be made explicit. Arguably, if the problems are sufficient to require a reanalysis, the earlier analyses should either be retracted or no longer uncritically cited. Instead, we have a pattern of laudatory self-citation, minimization of any difficulties, and repeated meta-analyses lending false authority.

An independent reanalysis

The Dutch Health Care Insurance Board (CVZ) provided partial funding for an independent reanalysis of the evidence concerning the efficacy of LTPP. All the authors were Dutch, except for John Ioannidis, the Greek-American who is the author of numerous well-executed meta-analyses and systematic reviews. Some have proved game changing, like "Why Most Published Research Findings Are False." Richard Smith, former editor of BMJ, endorses Ioannidis as

a brilliant researcher who has done more than anybody to identify serious problems with the publishing of science.

The resulting meta-analysis is superb and a great example for graduate students to evaluate with objective criteria such as PRISMA or AMSTAR.

The authors defined LTPP as having at least 40 sessions and continuing for at least one year. This differs from Leichsenring and Rabung's requirement of at least 50 sessions, but the authors noted that weekly sessions may add up to fewer than 50 sessions in a year, allowing for patient and therapist vacations and missed sessions.

The authors struggled with the poor quality of the studies they were able to identify, but came up with an excellent solution: a sensitivity analysis. The meta-analysis was conducted with each of the poor-quality studies included and then again with each of these studies excluded. As it turned out, whether the studies were included did not influence the results.
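For readers who have never run one, here is a minimal sketch of this kind of sensitivity analysis in Python. The effect sizes, variances, and quality ratings are made-up placeholders, not the Dutch group's actual data, and the pooling is a simple fixed-effect inverse-variance average rather than their full model.

```python
import numpy as np

# Hypothetical study-level effect sizes (Hedges' g) and variances -- placeholders,
# not the data from the Dutch reanalysis.
effects = np.array([0.10, -0.05, 0.20, 0.00, 0.15, -0.10])
variances = np.array([0.04, 0.05, 0.06, 0.03, 0.05, 0.04])
low_quality = [2, 4]  # indices of the studies rated as low quality (hypothetical)

def pooled(es, var):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    w = 1.0 / var
    return np.sum(w * es) / np.sum(w), np.sqrt(1.0 / np.sum(w))

est, se = pooled(effects, variances)
print(f"All studies included: g = {est:.2f} (SE {se:.2f})")

# Sensitivity analysis: re-pool with each low-quality study excluded in turn.
for i in low_quality:
    keep = np.delete(np.arange(len(effects)), i)
    est_i, se_i = pooled(effects[keep], variances[keep])
    print(f"Excluding study {i}:   g = {est_i:.2f} (SE {se_i:.2f})")

# If the pooled estimate barely moves, the low-quality studies are not driving
# the result -- the pattern the authors reported.
```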

They explicitly rejected the validity of effect sizes calculated on the basis of within-group differences:

To reliably assess the effectiveness of any treatment, it is necessary to evaluate its outcomes compared to a control group. The change in severity or intensity of a mental disorder over time cannot be attributed solely to the treatment that took place during that time, unless the treatment is controlled for. This is especially so with long-term treatments where the course of symptoms may change (more or less) spontaneously over time, even in personality disorders that were previously thought to be stable and incurable, such as borderline personality disorder.

Results were unambiguous and negative, in terms of the efficacy of LTPP:

The recovery rate of various mental disorders was equal after LTPP or various control treatments, including treatments without a specialized psychotherapy component. Similarly, no statistically significant differences were found for the domains target problems, general psychiatric problems, personality pathology, social functioning, overall effectiveness or quality of life.

And

Control conditions were heterogeneous and frequently of low quality, e.g. without a specialized psychotherapy component. If anything, this suggests that LTPP is often compared against relatively ineffective “straw man” comparator… LTPP comparisons to specialized non-psychodynamic treatments, like dialectical behavior therapy and schema-focused therapy, suggest that LTPP might not be particularly effective.

The bottom line is that available evidence suggests that LTPP is not worthwhile, at least in terms of the conventional ways of evaluating therapies. The authors noted that many of the studies made comparisons between LTPP and a control condition, which is inappropriate if the critical question is whether LTPP is superior to other psychotherapies.

The authors included a provocative quote from Freud expressing doubt whether LTPP really produces much change:

One has the impression that one ought not to be surprised if it should turn out in the end that the difference between a person who has not been analyzed and the behavior of a person after he has been analyzed is not so thorough-going as we aim at making it and as we expect and maintain it to be. (Freud, 1937/1961)

Bewildered yet? Tips for evaluating other meta-analyses

I have been encouraging a healthy skepticism about the quality and credibility of articles published in even the most prestigious, high-impact journals. The push to secure insurance coverage for LTPP has produced a lot of bad science, both at the level of poorly done clinical trials intended to prove rather than test the efficacy of LTPP, and at the level of horrific meta-analyses that take bizarre steps to ensure LTPP uber alles.

It is troubling to see that bad science repeatedly gets into prestigious, high-impact journals with its flaws brazenly displayed. Not only that, it is accompanied by laudatory editorials in various efforts to block and neutralize criticism. Anyone who has participated in the debate described in this blog has to be aware of the presence of an old boy network of aging psychoanalysts and their patients that seeks to control the evidence that is available concerning LTPP. Few will articulate readers' dilemma as clearly as Peter Kramer, but many readers suffer the discomfort of knowing something is wrong with these clinical trials and meta-analyses while being intimidated by the sheer authority of JAMA, British Journal of Psychiatry, and Harvard Review of Psychiatry. After all, if there is such consensus about the efficacy of LTPP, who can argue?

Reviewing meta-analyses of long-term psychoanalytic and psychodynamic psychotherapies reinforces some points that I have made and will be making in future blog posts.

  • We can’t necessarily decide on the authority and credibility of articles solely on the basis of the journals in which they appear, and an accompanying editorial does not necessarily give added reassurance.
  • We have standards such as CONSORT for guiding the reporting of clinical trials and the Cochrane Collaboration's risk-of-bias criteria for judging whether a reported trial is at risk of bias.
  • We similarly have standards such as PRISMA for guiding the organization, conduct, and reporting of meta-analyses and systematic reviews, and AMSTAR for their evaluation.

It’s a good idea to familiarize yourself with these readily available standards.

  • We should be wary of the use of meta-analysis for propaganda and marketing of particular interventions and services. Conflicts of interest are not confined to the usual consideration of whether there is industry support for a clinical trial or meta-analysis.
  • We should be concerned about multiple meta-analyses coming from the same group of authors. We need to ask what justification there is for multiple publications across journals and be alert to uncritical self-citation.

But what if one does not have the time or inclination to scrutinize bad meta-analyses with formal rating scales? That is certainly true of a lot of consumers, whether they be clinicians, policymakers, or patients trying to make intelligent decisions about whether they really need long-term treatment. I think there are some basic things to look for that can serve as a first screen of meta-analyses, so that a decision can be made whether to accept them, dismiss them, or subject them to further evaluation. The screen can be seen as a good rule-out tool and maybe the first stage of a two-stage process involving a further look, including going back to the original studies and other relevant meta-analyses.

When I pick up a meta-analysis, here are some of the first things I look for, all of which can immediately be seen to be missing in Leichsenring and Rabung:

  1. Is there a reasonable number of reasonably sized trials making head-to-head comparisons between the two types of treatments that are being pitted against each other?
  2. Did the authors rely on conventional between-group calculation of effect sizes, rather than jumping to biased and easily misinterpreted within-group effect sizes? (The sketch after this list illustrates the difference.)
  3. Aside from technical questions of statistical heterogeneity, does the lumping and splitting of both intervention and comparison groups make sense in terms of similarities and differences in interventions, patient characteristics, and clinical context?
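To make point 2 concrete, here is a minimal sketch in Python of the difference between the two kinds of effect size, using invented pre/post scores rather than data from any of the trials discussed above. The within-group calculation is simplified for illustration.

```python
import numpy as np

# Invented pre/post symptom scores for a treated group and a control group.
rng = np.random.default_rng(0)
treat_pre, treat_post = rng.normal(30, 8, 40), rng.normal(22, 8, 40)
control_pre, control_post = rng.normal(30, 8, 40), rng.normal(24, 8, 40)

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Within-group "effect size": pre minus post in the treated arm only. It bundles
# any treatment effect with spontaneous improvement, regression to the mean, and
# everything else that happens over time.
d_within = cohens_d(treat_pre, treat_post)

# Between-group effect size: treated vs control at follow-up. Only this
# comparison isolates what the treatment added over the control condition.
d_between = cohens_d(control_post, treat_post)

print(f"within-group d  = {d_within:.2f}  (flatters the treatment)")
print(f"between-group d = {d_between:.2f}  (the conventional estimate)")
```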

It has taken years of discussion for all of the problems of Leichsenring and Rabung's meta-analysis to become apparent, but I think their failure to meet these basic criteria should have been evident from a close read of their JAMA paper taking less than 30 minutes. An effect size of 6.9? Come on!


Francis “Frank” J. Underwood From Netflix’s House of Cards: A Textbook Case of Antisocial Personality Disorder

I always like to take the opportunity to explain misunderstood psychiatric concepts or diagnoses, and to clarify when a psychiatric term is used incorrectly or prone to misinterpretation.  In today’s blog, I aim to do both of these things.

 

First, I’ll use the character of Frank Underwood as a “case study” to illustrate the misunderstood psychiatric diagnosis of Antisocial Personality Disorder.

 


Kevin Spacey in House of Cards, Image: Netflix

While enjoying the second season of House of Cards, I could not help but notice how Kevin Spacey’s character, Frank Underwood, meets a textbook definition of Antisocial Personality Disorder (ASPD).  Inspired by Spacey’s tremendous performance, I thought I would venture forth and use this example of a central character in a drama to illustrate this misunderstood and, often, underestimated psychiatric disorder.

Individuals with antisocial personality disorder (or sociopaths) are difficult and dangerous; they deny, lie, and contribute to all manner of mayhem in our communities and societies. They know full well what is going on around them and know the difference between right and wrong (and hence are fully responsible for their own behaviors) yet are simply unconcerned about such moral dilemmas.

Below is the “textbook” definition of ASPD interspersed with examples from the life of Frank Underwood, which perfectly illustrate the elements of this disorder.

 

SPOILER ALERT: For those of you who have not been on a streaming binge and watched all of Season 2 yet, consider yourself warned.


Antisocial Personality Disorder 301.7 (from the DSM-5):

A) A pervasive pattern of disregard for and violation of the rights of others, occurring since age 15 years, as indicated by three (or more) of the following:

1) Failure to conform to social norms with respect to lawful behaviors, as indicated by repeatedly performing acts that are grounds for arrest.

 

Image: Netflix

Murder. Not once, but at least twice (that we know of).  He swiftly pushed Zoe Barnes into the path of an oncoming metro train. Let's not forget this was a woman with whom he had had a physical relationship and a (sort of) emotional intimacy.  No doubt, this personal history contributed to Barnes' poor judgment and her letting down her guard; she suspected he was a murderer but still underestimated what he was truly capable of. Frank leveraged her miscalculation in his favor.

In addition to murder, let’s not forget the unlawful behaviors carried out, on his orders, by those who work for him – e.g. vanquishing the remaining reporters who tried to expose him for what he truly is.

 

2)  Deceitfulness, as indicated by repeated lying, use of aliases, or conning others for personal profit or pleasure

 

Image: Netflix

Honestly, I found it hard to keep track of the web of lies Frank wove during Season 2. What was notable was the sincerity with which he told many of these lies, almost as though in the moment he believed them himself. He repeatedly lied so he could drive a wedge into the previously tight relationship between the billionaire Raymond Tusk and the President – a wedge he created on purpose (and at much cost and hassle to the American taxpayer!) to further his own goal of becoming President.

Then there was the web of lies told to hide the fact that the abortion of his wife Claire (played by Robin Wright) had nothing to do with her alleged rape by General McGinnis and more to do with the inconvenient timing of Underwood's political campaigns.

A final example is the strategic drama he created (along with Claire) to cover her affair with Galloway.  Again, there was no inkling of any remorse or feelings that they should be held accountable for their actions.  Instead there was only a rigid entitlement:  How dare anyone get in the way of me becoming president?

 

3)  Impulsivity or failure to plan ahead

 

Underwood has a degree of impulse control.  In fact, his ability to plot, scheme, and plan has served him well with regard to his political posturing and career.  This is not the case for many with ASPD.  Those without means, education, or status can be dangerously impulsive, and this behavior often leaves them in jail, prison, or dead.

 

4)  Irritability and aggressiveness, as indicated by repeated physical fights or assaults

Image: Netflix

 

See point #3.  He is aggressive and violent but has probably learnt, over time, to become more measured in his actions.  Repeated irritable outbursts and acts of physical aggression are not compatible with life in political office.

 

5)  Reckless disregard for safety of self or others

 

 See point #1.

 

6)  Consistent irresponsibility, as indicated by repeated failure to sustain consistent work behavior or honor financial obligations

 

 Did Frank Underwood honor any of his obligations or duties associated with being the Vice President of the United States of America?  Did he use his powers to be of service to the American people or to his country?  No.  His days and nights appeared to be utterly consumed with one goal…to become president of the United States.  At any cost.

 

7)  Lack of remorse, as indicated by being indifferent to or rationalizing having hurt, mistreated or stolen from another.

 

Image: Netflix

This was best illustrated in his reaction to the murder of  Zoe Barnes.  It was business as usual.  Not a hair out of place, no loss of appetite or sleep.  No remorse, no guilt or angst. She was getting in his way as he tried to forge a path to the presidency, so he got rid of her and never thought about it again. Her murder was no more of an incident than flicking lint from his jacket lapel.  In fact, he was so cool after the event that it makes me wonder about his psychopathic tendencies, but that would be a whole other blog for another day.

 

B) Individual is at least 18 years old

 

C) There is evidence of conduct disorder with onset before age 15 years

 

Who knows what skeletons lie in the Frank Underwood closet?

 

D) The occurrence of antisocial behavior is not exclusively during the course of schizophrenia or bipolar disorder.

 

One final point that is not done justice in the brief description above (more details can be found here) – those with ASPD are able to be utterly charismatic, charming, and almost bewitching. This characteristic is one Spacey has down to a tee in his performance.

Image Credit: Melinda Sue Gordon

When Frank wants something or needs to manipulate someone, he is able to “switch on” the charm in an instant.  He conveys to others that he cares deeply about them by flashing an infectious smile and being gracious and attentive.

And, as season 2 showed, there were many who fell prey to his deceit…not least of all the President of the free world.  Perhaps nowhere is his charisma more evident than in the perverse loyalty of those in his inner circle; all turn a blind eye to what he is capable of and appear to be utterly captivated by his personality and presence.

 

My second point: The term "antisocial" is often used incorrectly or is prone to misinterpretation.

 

The seriousness of ASPD leads me to my next point – the confusing usage of the term “antisocial.” Antisocial is often used in lay language to indicate someone who is shy and unwilling or unable to associate in a normal or friendly way with other people. While this is a legitimate definition of the word, I have never been a fan of how this one word can be used in such opposing ways. I would advocate that we reserve this word for individuals with personality disorders associated with the features described above. People who are described as “antisocial” because they are shy are (typically) not dangerous.  This is in sharp contrast to the definition of antisocial widely used in mental health terminology. In this context antisocial goes hand in hand with being “antisociety” and is a disorder associated with much more sinister and outright dangerous and reckless behavior.

 

At this point, many of you might be saying, well who cares about these individuals?  They are just evil, so why bother to make a psychiatric case about them?  Just lock them up and throw away the key!

 

But the situation is vastly more complicated than that.

 

ASPD is common.  For the reasons outlined above (their lies, deception, and charm) sociopaths are not always easy to detect, yet ASPD is associated with huge costs to our society that extend well beyond the individual who has the disorder. We have to stay curious about ASPD – about how the disorder develops, how to detect it, how to manage it – as our societies pay for its consequences on many levels, economically, socially, and emotionally.

And when someone with ASPD ends up in a position of unparalleled power? Well, who knows what the consequences could be.

 


Much ado about a modest and misrepresented study of CBT for schizophrenia: Part 2

We all hope that an alternative can be found to treating persons with schizophrenia with only modestly effective antipsychotic medication that has serious side effects. Persons with schizophrenia and their families deserve an alternative.

That is why reports of this study in the media were greeted with such uncritical enthusiasm. That hope may even be the motive for the results of the study getting hyped and distorted, from exaggerated claims attributed to the authors to further amplification in the media.

But at the present time, CBT has not been shown to provide an effective alternative, nor has it been shown to have effects equivalent to antipsychotic medication. The results of this trial do not at all change this unfortunate situation. And to promote CBT as if it had been shown to be an effective alternative would be premature and inaccurate, if not a cruel hoax.

This is the second of two posts about a quite significant, but not earth-shattering, Lancet study of CBT for persons with unmedicated schizophrenia.

In this continuation of my previous post at Mind the Brain, I will discuss

  • The missed opportunity the investigators had to make a more meaningful comparison between CBT and supportive counseling or befriending.
  • The thorny problem of separating the effects of CBT from the effects of the medication that most patients remaining in follow-up at the end were receiving in both conditions.
  • The investigators’ bad decision to include the final follow-up assessment in calculating effect sizes for CBT, a point at which results would have to be made up for most participants.
  • Lancet’s failure to enforce preregistration and how the investigative group may have exploited this in putting a spin on this trial.
  • The inappropriate voodoo statistics used to put a spin on this trial.
  • How applying the investigators' own standards would lead us to conclude that this trial did not produce usable results.
  • What we can learn from this important but modest exploratory study, and what more we need to be told about it by the investigators in order to learn all we can.

At my secondary blog, Quick Thoughts, I distributed blame for the distorted media coverage of this study. There is a lot to go around. I faulted the authors for exaggerations in their abstract and inaccurate statements to the media. I also criticized media outlets that parroted other inaccurate coverage without journalists bothering to actually read the article. Worse, a competition seemed to develop to see who could generate the most outrageous headline. In the end, both Science and The Guardian owe consumers an apology for their distorted headlines. Click here to access this blog post and the interesting comments that it generated.

Aside from its misleading title, the Guardian story ran a photo that better represented male erectile dysfunction than schizophrenia.

Why didn’t the investigators avoid an unresolvable interpretive mess by providing a more suitable control/comparison condition?

In my previous PLOS Mind the Brain post, I detailed how a messy composite of very different treatment settings was used for treatment as usual (TAU). This ruled out meaningful comparisons with the intervention group.

This investigator group should have provided structured supportive counseling as a conventional comparison/control condition. Or maybe befriending.

The sober-minded accompanying editorial in Lancet said as much.

Such a comparison group would have ensured uniform compensation for any inadequacies in the support available in the background TAU. Such a control group would also address the question whether any effects of CBT are due to nonspecific factors shared with supportive counseling. It would have allowed an opportunity to test whether the added training and expense of CBT is warranted by advantages over the presumably simpler and less expensive supportive counseling.

Why wasn’t this condition included? Tony Morrison’s interview with Lancet is quite revealing of the investigative group’s thinking that led to rejecting a supportive counseling comparison/control condition. You can find the interview here, and if you go to it, listen to what Professor Morrison is saying about halfway into the 12 minute interview.

What follows is a rough, not verbatim, transcription of what was said. You can compare it to the actual audiotape and decide whether I am introducing any inaccuracies.

The interviewer stated that he understood that patients often cannot tell the difference between supportive counseling and CBT, and conceded he struggled with this as well.

Tony Morrison agreed that there were good reasons why people sometimes struggle to distinguish the two because there is a reasonable amount of overlap.

The core components of supportive counseling approach are also required elements of CBT: developing a warm and trusting relationship with someone, being empathic, and nonjudgmental. CBT is also a talking therapy in a collaborative relationship.

The difference is that there are more specific elements to CBT that are not represented within a supportive counseling approach.

One of those elements is that CBT is based on a cognitive behavioral model that assumes it is not the psychotic experiences that people have that are problematic, but people's ways of responding to those experiences. If people have a distressing explanation for hearing voices, that explanation is bound to be associated with considerable distress. CBT helps patients develop more accurate and less distressing appraisals.

The interviewer noted there was no placebo given in this trial and the effects of placebo can be quite large. What were the implications of not using placebo?

Tony Morrison agreed that the interviewer was quite right that the effects of placebo are quite large in trials of treatment of people with schizophrenia.

He stated that it was difficult to do a placebo-controlled randomized trial with respect to psychological therapies because the standard approach to a comparison group other than treatment as usual would be something like supportive counseling or befriending, both of which might be viewed as having active ingredients like a good supportive relationship.

Not difficult to do a placebo-controlled randomized trial in this sense, Tony, but perhaps difficult to show that CBT has any advantage.

So, it sounds like a nonspecific supportive counseling approach was rejected as a comparison/control group because it would reduce the possibility of finding significant effects for CBT. It is a pity that a quite mixed set of TAU conditions was chosen instead.

This decision introduced an uninterpretable mess of complexity, including fundamental differences in the treatment as usual depending on participants' personal characteristics, with the likelihood that participants would simply be thrown out of some of the TAU settings.

Ignoring is not enough: what was to be done with the patients receiving medication?

When the follow-up was over, most remaining participants in both groups (10 out of 17) had received medication. That leaves us uncertain whether any effects of participants receiving CBT can be distinguished from the effects of taking medication.

This confounding of psychotherapy and medication cannot readily be corrected in such a small sample.

We know from other studies that nonadherence is a serious problem with antipsychotic medication. The investigative group emphasized this as the rationale for the trial. But we do not know which participants in each group received antipsychotic medication.

For instance, in the control condition, were patients receiving medication found mainly in the richly resourced early intervention sites? Or were they from the poorer traditional community sites where support for adherence would be lower because contact is lower? And, importantly, how did it come about that patients supposedly receiving only CBT began taking this medication? Was it under different circumstances and with different support for adherence than when medication was given in the TAU?

With a considerably larger sample, multivariate modeling methods would allow a sensitivity analysis, with participants in both groups subdivided into those receiving and those not receiving antipsychotics. That would not undo the fact that the patients who ended up taking medication had not been randomized to doing so, but it would nonetheless have been informative. Yet, with only 17 patients remaining in each group at the last follow-up, such methods would be indecisive and even inappropriate.

It is a plausible hypothesis that patients enrolled in a trial offering CBT without antipsychotic medication would, if they later accepted the medication, be more adherent. But this hypothesis cannot be tested in the study, nor explored with the limited information that was provided to readers.

A decision not to follow some of the patients after the end of the intervention.

The investigators indicated that limited resources prevented following many of the patients beyond the intervention. If so, the investigators should have simply ended the main analyses at the point beyond which follow-up became highly selective. I challenge anyone to find a precedent in the literature where investigators stopped follow-up, but then averaged outcomes across all assessment periods, including one where most participants were not even being followed!

The most reliable and revealing approach to this problem would be to calculate effect sizes for the last point at which an effort was made to follow all participants, the end of the intervention. That would be consistent with the primary analysis for almost all other trials in the psychotherapy and pharmacological literatures. It would also avoid the problem of estimating effect sizes for the last follow-up, when the data for most patients would have to be made up.

But if that were done, it would have been obvious that this was a null trial with no significant effects for CBT.

The failure of Lancet to enforce preregistration for this trial.

Preregistration of the design of trials including  pre-specification of outcomes came about because of the vast evidence that many investigators do not report key aspects of their original design and do not report key outcomes if they are not favorable to the intervention. Ben Goldacre has taught us not to trust pharmaceutical drug trials that are not preregistered, and neither should we trust psychotherapy trials that are not.

Preregistration, if it is uniformly enforced, provides a safeguard of the integrity of results of trials, reducing the possibility of investigators redefining primary outcomes after results are known.

Preregistration is now a requirement for publication in many journals, including Lancet.

Guidelines for Lancet state:

We require the registration of all interventional trials, whether early or late phase, in a primary register that participates in WHO’s International Clinical Trial Registry Platform (see Lancet 2007; 369: 1909-11). We also encourage full public disclosure of the minimum 20-item trial registration dataset at the time of registration and before recruitment of the first participant (see Lancet 2006; 367: 1631-35).

This trial was not properly preregistered. The required registration of this trial (http://www.controlled-trials.com/ISRCTN29607432/) occurred after recruitment had already started. The "preregistration" appeared on the official website on October 21, 2010, yet recruitment started on February 15, 2010. Presumably a lot could be learned in that time period, and adjustments made to the protocol. We are just not told.

Even then, the preregistration failed to commit the investigators to which primary outcome, evaluated at which time point, would serve as the main evaluation of the trial. That too defeats the purpose of preregistration.

You pays yer money and you takes yer choice.

The overall PANSS score is designated in the registration as the primary outcome, but no particular time point is selected, allowing the investigators some wiggle room. How much? The PANSS is assessed six times, and then there is the overall mean, which the authors preferred, bringing the total to seven assessments to choose from.
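A quick calculation shows how much wiggle room seven candidate primary outcomes can buy. This assumes, unrealistically, independent tests at alpha = 0.05; correlated time points would give a smaller, but still inflated, figure.

```python
# Chance of at least one spuriously "significant" result when seven candidate
# primary outcomes are available and all are null (rough upper bound, independence assumed).
alpha, k = 0.05, 7
print(f"{1 - (1 - alpha) ** k:.0%}")  # roughly 30%, versus the nominal 5%
```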

 

The preregistration also indicates that an effect size of 0.8 is expected. That is quite unrealistic and unprecedented in the existing literature, including the meta-analyses. Claiming such a large effect size justifies recruiting a smaller sample. That means the trial was highly underpowered from the get-go in terms of being able to generate reliable effect sizes.
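To see what assuming an effect size of 0.8 buys the investigators, here is a standard normal-approximation sample-size sketch in Python. The 0.4 used for comparison is an illustrative, more typical psychotherapy effect size, not a figure taken from any specific meta-analysis.

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

# The preregistered assumption (0.8) versus a more modest, illustrative effect size (0.4).
for d in (0.8, 0.4):
    print(f"d = {d}: about {n_per_group(d):.0f} participants per arm")
# Roughly 25 per arm under the optimistic assumption, about 100 under the modest one.
```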

Mumbo-jumbo, voodoo statistics used to create the appearance of significant effects.

In order to average outcomes across all assessment points, multivariate statistics were used to invent data for the majority of patients who were no longer being followed at the last assessment. Recall that this was the point at which most patients had already been lost to follow-up. It is also when the largest differences appeared between the intervention and control groups. Those differences seem to have been due to inexplicable deterioration in the minority of control patients still around to be assessed. The between-group differences were thus due to the control group looking bad, not to any impressive results for the intervention group.

Multivariate statistics cannot work magic with a small sample and so few participants remaining at the last follow-up.

The Lancet article reported a seemingly sophisticated plan for data analysis:

Covariates included site, sex, age, and the baseline value of the relevant outcome measure. Use of these models allowed for analysis of all available data, in the assumption that data were missing at random, conditional on adjustment for centre, age, sex, and baseline scores. The missing-at-random assumption seems to be the most realistic, in view of the planned variation in maximum follow-up times and the many other factors likely to affect drop-out; additionally, the assumption is routinely used in analyses of data from longitudinal trials.

"Surely you jest!" was an expression favored by Chatsworth T. Osborne, Jr., millionaire dilettante in The Many Loves of Dobie Gillis.

Surely, you jest.

Anyone smart enough to write this kind of text is smart enough to know that it is an absurd plan for analyzing data in which many patients will not be followed after the end of the intervention, and with such a small sample size to begin with. It preserves the illusion of a required intent-to-treat analysis in the face of most participants having been lost to follow-up.

Sure, multilevel analysis allows compensation for the loss of some participants from follow-up, but it requires a much larger sample. Furthermore, the necessary assumption that the participants who were not available are missing at random is neither plausible nor testable within a small sample. One can test the assumption that data are missing at random, but that would require a starting sample size at least four or five times as large. Surely the statistician for this project knew that.
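Here is a small simulation in Python of why that assumption matters. All numbers are invented for illustration (including the starting group size); the point is only that when dropout depends on how participants are doing, a complete-case comparison can manufacture a group difference where none exists.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 37  # illustrative starting size per arm, not the trial's exact figure

# Simulate a final-assessment PANSS-like score with NO true treatment effect.
cbt = rng.normal(70, 15, n)
tau = rng.normal(70, 15, n)

# Dropout that is NOT missing at random: in the TAU arm, participants doing
# worse (higher scores) are more likely to remain in follow-up.
keep_cbt = rng.random(n) < 0.45                            # roughly half retained, at random
keep_tau = rng.random(n) < np.where(tau > 70, 0.65, 0.25)  # sicker patients over-retained

diff = cbt[keep_cbt].mean() - tau[keep_tau].mean()
print(f"Retained: {keep_cbt.sum()} CBT vs {keep_tau.sum()} TAU")
# A negative value makes CBT look better, even though the true effect is zero.
print(f"Complete-case difference (CBT minus TAU): {diff:.1f} points")
```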

And then there is the issue of including control for the four covariates in analyses for which there are only 17 participants in each of the two groups at the final data point being analyzed.

As I noted in my last blog post, from the start there were such huge differences among participants that summary statistics based on all of them could not meaningfully be applied to individual participants or subgroups.

  • Participants in the control group came from very different settings, with which their personal characteristics were associated. Particular participants came from certain settings and got treated differently.
  • Most participants were no longer around at the final assessment, and we do not know how that is related to personal characteristics.
  • Most participants still around in both the intervention and control groups had accepted medication, and this was not random.
  • There was a strange deterioration going on in the control group.

Yup, the investigators are asking us to believe that being lost to follow-up was random, and so participants that were still around could be randomly replaced with participants who had been lost, without affecting the results.

With only 17 participants per group, we cannot even assess whether the intended effects of randomization had occurred, in terms of equalizing baseline differences between intervention and control groups. We do not even have the statistical power to detect whether baseline differences between the two groups might determine differences in the smaller numbers still available.

We know from lots of other clinical trials that when you start with as few patients as this trial did, uncontrolled baseline differences can still prove more potent than any intervention. That is why many clinical trialists refuse to accept any studies with fewer than 35 to 50 participants per cell.
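A rough power calculation makes the point about 17 per group concrete. The d = 0.5 baseline imbalance is an illustrative value, and the normal approximation is a sketch, not an analysis anyone should run on this trial.

```python
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5
    return (1 - norm.cdf(z_alpha - ncp)) + norm.cdf(-z_alpha - ncp)

# With 17 participants per arm still in follow-up, even a moderate baseline
# imbalance would usually go undetected.
print(f"Power to detect d = 0.5 with n = 17 per arm: {power_two_sample(0.5, 17):.0%}")
# Roughly 30% -- such an imbalance would be missed about two times out of three.
```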

Overall, this is sheer voodoo, statistical malpractice that should have been caught by the reviewers at Lancet. But it does allow an impressive spin to be put on an otherwise null trial.

Classic voodoo rock music, among the best rock ever according to Rolling Stone, available here to accompany your rereading of the results of the Lancet paper.

Judging the trial by the investigators’ own standards

A meta-analysis by the last author, Paul Hutton, and colleagues argued that trials of antipsychotic medication with more than 20% attrition do not produce usable data. Hutton and colleagues cite some authorities:

Medical epidemiologists and CONSORT statement authors Kenneth Schulz and David Grimes, writing in The Lancet in 2002, stated: "a trial would be unlikely to successfully withstand challenges to its validity with losses of more than 20%" [Sackett et al., 2000]

But then, in what would be a damning critique of the present study, Hutton et al. declare:

Although more sophisticated ways of dealing with missing continuous data exist, all require data normally unavailable to review authors (e.g. individual data or summary data for completers only). No approach is likely to produce credible results when more than half the summary outcome data are missing.

Ok, Paul, fair is fair: are the results of your Lancet trial not credible? You seem to have to concede this. What do you make of Tony Morrison's claims to the media?

 

Then there is the YouTube presentation from 2012 from first author Tony Morrison himself.  He similarly argued that if a trial retains only half of the participants who initially enrolled, most of the resulting data are made up and results are not credible.

The 2012 presentation by Morrison also dismisses any mean differences between active medication and a placebo of  less than 15 points as clinically insignificant.

Okay, if we accept these criteria, what do we say about a difference in the CBT trial claimed to be only 6.5 points after the loss of most of the participants enrolled in the study, putting aside for a moment the objection that even this is an exaggerated estimate?

We should have known ahead of time what can and cannot be learned from a small, underpowered exploratory study.

In a pair of now classic methodological papers [1,2], esteemed clinical trialist Helena Kraemer and her colleagues defended the value, indeed the necessity, of small preliminary, exploratory feasibility studies before conducting larger clinical trials.

A pilot study can be used to evaluate the feasibility of recruitment, randomization, retention, assessment procedures, new methods, and implementation of the novel intervention. A pilot study is not a hypothesis testing study. Safety, efficacy and effectiveness are not evaluated in a pilot. Contrary to tradition, a pilot study does not provide a meaningful effect size estimate for planning subsequent studies due to the imprecision inherent in data from small samples. Feasibility results do not necessarily generalize beyond the inclusion and exclusion criteria of the pilot design.

However, they warned against accepting effect sizes from such trials because they are underpowered. It would be unfair to judge the efficacy of an intervention based on negative findings from a grossly underpowered trial. Yet it would be just as unacceptable to judge an intervention favorably on the basis of unexpected positive findings when the sample size was small. Such positive findings typically do not replicate and can easily be the result of chance, unmeasured baseline differences between intervention and control groups, and flexible rules of analysis and interpretation by investigators. And Kraemer and her colleagues did not even deal with small clinical trials in which most participants were no longer available for follow-up.

Even if this trial cannot tell us much about the efficacy of CBT for persons with unmedicated schizophrenia, we can still learn a lot, particularly if the investigators give us the information that they promised in their preregistration.

The preregistration promised information important for evaluating the trial that is not delivered in the Lancet paper. Importantly, we are given no information from the log that recorded all the treatments received by the control group and the intervention group.

This information is essential to evaluating whether group differences are really due to receiving or not receiving the intervention as intended, or are influenced by uncontrolled treatment, including medication, and by selective dropout. This information could shed light on when and how patients accepted antipsychotic medication and whether the circumstances were different between groups.

We cannot torture the data from this study to reveal whether or not CBT was efficacious. But we can learn much about what would need to be done differently in a larger trial if results are to be interpretable. Given what we learned from this trial, what would have to be done about participants deciding to take antipsychotic medication after randomization? Surely we could not refuse them that option.

More information would expose just how difficult it is to find suitable participants and community settings and enroll them in such a study. Certainly the investigators should have learned not to rely on such diverse settings as they did in the present study.

Hopefully, in the next trial, investigators will have the courage to really test whether CBT has clinically significant advantages over supportive therapy or befriending. There would be risk to investigator egos and bragging rights, but that would be compensated by the prospect of producing results that persons with schizophrenia and their families, as well as clinicians and policymakers, could believe.

 
