Consumer Reports Readers Rate Psychotherapy


This file was compiled over the past two weeks, following the release of the Consumer Reports article on a survey it conducted on psychotherapy. Unless noted otherwise, all words were written by Martin E. P. Seligman in reply to readers' questions about the Consumer Reports article. Dr. Seligman is a professor at the University of Pennsylvania.

It is provided here only to enhance public debate about this topic.

Enjoy, -John


From - Martin E. P. Seligman

Subject - Sampling and the CR survey

Bill Follette wonders whether there are sampling problems in the Consumer Reports article. Lest one think that this study was anything other than scrupulously done, I want to present what I could detect in the way of sampling flaws.

Consumer Reports (CR henceforward) included a supplementary survey about psychotherapy and drugs in one version of its 1994 annual questionnaire, along with its customary inquiries about appliances and services. One hundred eighty thousand readers received this version, which included approximately 100 questions about automobiles and about mental health. CR asked readers to fill out the mental health section "if at any time over the past three years you experienced stress or other emotional problems for which you sought help from any of the following: friends, relatives, or a member of the clergy; a mental health professional like a psychologist or a psychiatrist; your family doctor; or a support group." Twenty-two thousand readers responded. Of these, approximately 7000 responded to the mental health questions. Of these 7000, about 3000 had just talked to friends, relatives, or clergy, and 4100 went to some combination of mental health professionals, family doctors, and support groups. Of the 4100, 2900 saw a mental health professional: psychologists (37%) were the most frequently seen, followed by psychiatrists (22%), social workers (14%), and marriage counselors (9%); other mental health professionals made up the remaining 18%. 1300 joined self-help groups and about 1000 saw family physicians. The respondents as a whole were highly educated and predominantly middle class; about half were women, and their median age was 46.
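
A minimal sketch (Python, purely illustrative) tallying the figures quoted above; it adds no new data:

    # Tally of the sampling funnel described above (no new data).
    mailed = 180_000
    returned = 22_000                  # annual questionnaire returns
    answered_mental_health = 7_000
    saw_professional = 2_900

    print(f"return rate: {returned / mailed:.1%}")    # about 12%

    # Shares of the 2900 who saw a mental health professional:
    shares = {"psychologists": 37, "psychiatrists": 22, "social workers": 14,
              "marriage counselors": 9, "other": 18}
    assert sum(shares.values()) == 100                # the categories are exhaustive
    for who, pct in shares.items():
        print(f"{who}: about {saw_professional * pct // 100}")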

Sampling. This survey is, as far as I have been able to determine, the most extensive study of psychotherapy effectiveness on record. The sample is not representative of the United States as a whole, but my guess is that it is roughly representative of the middle class and educated population who make up the bulk of psychotherapy patients. Importantly, the sample represents people who chose to go to treatment for their problems, not people who do not "believe in" psychotherapy or drugs. The CR sample, moreover, is probably weighted toward "problem solvers," people who actively try to do something about what troubles them.

Sampling. Is there a bias such that respondents who succeed in treatment selectively return their questionnaires? CR, not surprisingly, has gone to considerable lengths to find out whether its readers' surveys suffer from sampling bias. The annual questionnaires are lengthy and can run to one hundred questions or more. Moreover, the respondent not only devotes a good deal of her own time to filling them out, but also pays her own postage and is not compensated. So the return rate is rather low, although the roughly 12% return rate for this survey was normal for the annual questionnaire. But it is still possible that respondents differ systematically from the readership as a whole. For the mental health survey (and for its annual questionnaires generally) CR conducted a "validation survey," in which postage was paid and the respondent compensated. This resulted in a return rate of around 40%, as opposed to the 12% uncompensated return rate, and there were no differences between the data from the two samples.

The possibility of two other kinds of sampling bias, however, is notable, particularly with respect to the remarkably good results for AA. First, since AA encourages lifetime membership, a preponderance of successes, rather than dropouts, would be more likely in the three-year time slice ("have you had help in the last three years?"). Second, AA failures are often completely dysfunctional and thus much less likely to be reading Consumer Reports and filling out extensive readers' surveys than, say, psychotherapy failures who were unsuccessfully treated for anxiety.

A similar kind of sampling bias, to a lesser degree, cannot be overlooked for other kinds of treatment failures. At any rate, it is quite possible that there was a large oversampling of successful AA cases and a smaller oversampling of successful treatment for problems other than alcoholism.

Could the benefits of long-term treatment be an artifact of sampling bias? Suppose that people who are doing well in treatment selectively remain in treatment and people who are doing poorly drop out early; in other words, the early drop-outs are mostly people who fail to improve, while the later drop-outs are mostly people whose problems have resolved. CR disconfirmed this possibility empirically: respondents reported not only when they left treatment but why, including leaving because their problem was resolved. The drop-out rates because the problem was resolved were uniform across duration of treatment (less than one month, 60%; 1-2 months, 66%; 3-6 months, 67%; 7-11 months, 67%; 1-2 years, 67%; over two years, 68%).
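
As a quick uniformity check (a minimal Python sketch, using only the rates reported above):

    # The "resolved" drop-out rates quoted above, by duration of treatment.
    resolved_pct = {"<1 month": 60, "1-2 months": 66, "3-6 months": 67,
                    "7-11 months": 67, "1-2 years": 67, ">2 years": 68}
    spread = max(resolved_pct.values()) - min(resolved_pct.values())
    print(f"spread across duration bins: {spread} percentage points")  # 8: essentially flat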

A more sweeping limit on generalizability comes from the fact that the entire sample chose their treatment. To one degree or another, each person acknowledged that he had a problem and believed that psychotherapy and/or drugs, the particular mental health professional he saw, and the particular modality of treatment he chose would help him. One cannot argue compellingly from this survey that treatment by a mental health professional would prove as helpful to troubled people who deny their problems and who do not believe in and do not choose treatment.


Subject - face validity and CR

Neil Jacobson worries about the reliability and face validity of the measures in the CR survey. The reliabilities were pretty good as I recall (I don't have the data set in my possession), but it cannot be denied that the items were primarily "face valid." I think there is something to such a worry, but I don't see how any large-scale effectiveness study can be done without a preponderance of merely face-valid items--and still be short enough to get a reasonable response rate.

It is possible to go entirely overboard, as Neil represents Kazdin as doing, in denying the usefulness of some face-valid items. It would be a fool's errand to bother to get more than face validity on such CR items as "Was your therapist easy to confide in?" or "Did you check out other therapists before selecting this one?" or even "My therapy helped me to be productive at work" (rated from "made things a lot better" to "made things a lot worse"). I suppose I can see a generation of research assistants finding that this last item actually correlates about .55 with ingeniously measured behavioral indices, but I would not be enthusiastic about setting anyone such a task.

There is some "beef" in Neil's worry, but I would put the cavil slightly differently:

Inadequate Outcome Measures. CR's indexes of improvement were molar. Responses like "made things a lot better" to the question "How much did therapy help you with the specific problems that led you to therapy?" tap into gross processes. More molecular assessments of improvement (for example, "how often have you cried in the last two weeks?" or "how many ounces of alcohol did you have yesterday?") would increase the validity of the method. Such detail would, of course, make the survey more cumbersome.

A variant of this objection is that the outcome measures were insensitive. This objection looms large for the failure to find that any modality of therapy did better than any other modality, or than any drug for that matter, for any disorder. Perhaps if more detailed, disorder-specific measures had been used, the Dodo bird hypothesis would have been disconfirmed.

A third variant of this objection is that the outcome measures were poorly normed. Questions like "How satisfied were you with this therapist's treatment of your problem? (completely satisfied, very satisfied, fairly well satisfied, somewhat dissatisfied, very dissatisfied, completely dissatisfied)" and "How would you describe your overall emotional state? (very poor: I barely managed to deal with things; fairly poor: Life was usually pretty tough for me; so-so: I had my ups and downs; quite good: I had no serious concerns; very good: Life was much the way I wanted it to be)" are seat-of-the-pants items which depend almost entirely on face validity, rather than on several generations of norming. So the conclusion that 90% of the people who started off "very poor" or "fairly poor" wound up in the "very good," "quite good," or "so-so" categories does not guarantee that they had returned to normality in any strong psychometric sense. The addition of extensively normed questionnaires like the Beck Depression Inventory would strengthen the survey method (and make it more cumbersome).


Subject - "Satisfaction" and Cr

Jennifer Lish wonders if this was a reader "satisfaction" survey. At the risk of repeating my previous posts: it was not. Satisfaction was one peripheral question; the heart of the survey was how much therapy helped (from "made things a lot better" to "made things a lot worse") and in what areas (the specific problem that led to therapy, relations to others, productivity, coping with stress, enjoying life more, growth and insight, self-esteem and confidence, raising low mood). Here are some details. (If you want the original questionnaire, I suggest phoning Mark Kotkin, the principal CR analyst, at 914 378 2253, since I assume it is proprietary.)

Twenty-six questions were asked about mental health professionals, and parallel but less detailed questions were asked about physicians, medications, and self-help groups:

  • What kind of therapist
  • What presenting problem (e.g., general anxiety, panic, phobia, depression, low mood, alcohol or drugs, grief, weight, eating disorders, marital or sexual problems, children or family, work, stress)
  • Emotional state at outset (from very poor to very good)
  • Emotional state now (from very poor to very good)
  • Group versus individual therapy
  • Duration and frequency of therapy
  • Modality (psychodynamic, behavioral, cognitive, feminist)
  • Cost
  • Health care plan and limitations on coverage
  • Therapist competence
  • How much therapy helped (from "made things a lot better" to "made things a lot worse") and in what areas (specific problem that led to therapy, relations to others, productivity, coping with stress, enjoying life more, growth and insight, self-esteem and confidence, raising low mood)
  • Satisfaction with therapy
  • Reasons for termination (problems resolved or more manageable, felt further treatment wouldn't help, therapist recommended termination, a new therapist, concerns about therapist's competence, cost, and problems with insurance coverage).

CR's analysts decided that no single measure of therapy effectiveness would do and so created a multivariate measure. This composite had three subscales consisting of:

  • a) specific improvement ("how much did treatment help with the specific problem that led you to therapy: made things a lot better; made things somewhat better; made no difference; made things somewhat worse; made things a lot worse; not sure"),
  • b) satisfaction ("Overall, how satisfied were you with this therapist's treatment of your problems: completely satisfied; very satisfied; fairly well satisfied; somewhat dissatisfied; very dissatisfied; completely dissatisfied"), and
  • c) global improvement (how respondents described their "overall emotional state" at the time of the survey compared to the start of treatment: "very poor: I barely managed to deal with things; fairly poor: Life was usually pretty tough for me; so-so: I had my ups and downs; quite good: I had no serious complaints; very good: Life was much the way I liked it to be")

General Functioning. The CR study measured self-reported changes in well-being, insight, and growth, in addition to improvement on the presenting problem. The main findings held for these general functioning measures as well as for symptom relief: for example, long-term treatment produced better quality-of-life scores than short-term treatment, and mental health professionals did better than family doctors on general functioning scores, as well as on symptom reduction, for treatment which lasted longer than six months. Since improvement in general functioning, as well as symptom relief, is almost always a goal of actual treatment, but rarely of efficacy studies, the CR study adds to our knowledge of what treatment accomplishes beyond merely eliminating symptoms.


Subject - retrospection and CR

Robyn Dawes seems to think that the CR sampling method is seriously compromised by being retrospective. He also suggests that the problem of oversampling successful AA cases also applies in some unspecified way (perhaps Robyn would be good enough to make this argument explicit) to the CR conclusions as a whole. While I think retrospection is a mild flaw, and the AA results are highly suspect, I do not think either issue affects the validity of the main CR conclusions:

Retrospective. The CR respondents reported retrospectively on their emotional states. While a one-time survey is highly cost-effective, it is necessarily retrospective. Retrospective reports are less valid than concurrent observation, although an exception is worth noting: waiting for the rosy afterglow of a newly completed therapy to dissipate, as the CR study does, may make for a more sober evaluation. The retrospective method does not allow longitudinal observation of improvement across time in the same individuals. Thus the benefits of long-term psychotherapy are an inference from comparing different individuals' improvements cross-sectionally. A prospective study would allow comparison of the same individuals' improvements over time.

Retrospective observation is a flaw, but it may introduce random rather than systematic noise into the study of psychotherapy effectiveness. The distortions introduced by retrospection could go either in the rosier or the more dire direction, and only further research will tell us whether they are random or systematic.
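
A toy simulation can make the distinction concrete. In the sketch below, all numbers are invented for illustration: purely random recall error leaves group means roughly intact, while a systematic "rosy" bias shifts every score.

    import numpy as np

    # Toy simulation (all numbers invented): random recall error vs. a
    # systematic "rosy" bias, applied to hypothetical 0-300 composites.
    rng = np.random.default_rng(1)
    true_scores = rng.normal(230.0, 30.0, 10_000)
    random_recall = true_scores + rng.normal(0.0, 15.0, true_scores.size)
    rosy_recall = true_scores + 12.0   # everyone remembers 12 points rosier

    # Random error barely moves the mean; systematic bias shifts it outright.
    print(round(true_scores.mean()), round(random_recall.mean()), round(rosy_recall.mean()))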

It is noteworthy that Consumer Reports generally uses two methods. One is the laboratory test, in which, for example, a car is crashed into a wall at five miles per hour and damage to the bumper is measured. The other is the readers' survey. These two methods parallel the efficacy study and the effectiveness study, respectively, in many ways. If retrospection were a fatal flaw, CR would have given up the readers' survey method long ago, since ratings of the reliability of used cars and of satisfaction with airlines, physicians, and insurance companies all depend on retrospection. The survey method could be markedly improved, however, by being longitudinal, in the same way an efficacy study is: both self-report and diagnosis could be done before and after therapy, and a thorough follow-up carried out. But retrospective reports of emotional states will always be with us, since even in a prospective study that begins with a diagnostic interview, the patient retrospectively reports on her (presumably) less troubled emotional state before the diagnosis.

AA sampling bias. The possibility of two other kinds of sampling bias, however, is notable, particularly with respect to the remarkably good results for AA. First, since AA encourages lifetime membership, a preponderance of successes, rather than dropouts, would be more likely in the three-year time slice ("have you had help in the last three years?"). Second, AA failures are often completely dysfunctional and thus much less likely to be reading Consumer Reports and filling out extensive readers' surveys than, say, psychotherapy failures who were unsuccessfully treated for anxiety.

A similar kind of sampling bias, to a lesser degree, cannot be overlooked for other kinds of treatment failures. At any rate, it is quite possible that there was a large oversampling of successful AA cases and a smaller oversampling of successful treatment for problems other than alcoholism.

Could the benefits of long-term treatment be an artifact of sampling bias? Suppose that people who are doing well in treatment selectively remain in treatment and people who are doing poorly drop out early; in other words, the early drop-outs are mostly people who fail to improve, while the later drop-outs are mostly people whose problems have resolved. CR disconfirmed this possibility empirically: respondents reported not only when they left treatment but why, including leaving because their problem was resolved. The drop-out rates because the problem was resolved were uniform across duration of treatment (less than one month, 60%; 1-2 months, 66%; 3-6 months, 67%; 7-11 months, 67%; 1-2 years, 67%; over two years, 68%).


Subject - multivariate CR

Howard Berenbaum suggested that the CR survey was a "reader satisfaction" survey and did not address the question of whether the specific problem--disorder--was ameliorated.

Not so: CR's analysts decided that no single measure of therapy effectiveness would do and so created a multivariate measure. This composite had three subscales consisting of:

  • a) specific improvement ("how much did treatment help with the specific problem that led you to therapy: made things a lot better; made things somewhat better; made no difference; made things somewhat worse; made things a lot worse; not sure"),
  • b) satisfaction ("Overall, how satisfied were you with this therapist's treatment of your problems: completely satisfied; very satisfied; fairly well satisfied; somewhat dissatisfied; very dissatisfied; completely dissatisfied"), and
  • c) global improvement (how respondents described their "overall emotional state" at the time of the survey compared to the start of treatment: "very poor: I barely managed to deal with things; fairly poor: Life was usually pretty tough for me; so-so: I had my ups and downs; quite good: I had no serious complaints; very good: Life was much the way I liked it to be").

Each subscale was transformed to a 0-100 scale and weighted equally, resulting in a 0-300 scale for effectiveness. The statistical analysis was largely multiple regression, with initial severity and duration of treatment (the two biggest effects) partialled out. Stringent levels of statistical significance were used.
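
For concreteness, here is a hypothetical reconstruction of that composite. CR's actual transformations are not public, so the equal-interval numeric mappings below are assumptions, chosen only so that each subscale runs from 0 to 100:

    # Hypothetical reconstruction of the 0-300 composite (mappings assumed).
    SPECIFIC = {"made things a lot worse": 0, "made things somewhat worse": 25,
                "made no difference": 50, "made things somewhat better": 75,
                "made things a lot better": 100}
    SATISFACTION = {"completely dissatisfied": 0, "very dissatisfied": 20,
                    "somewhat dissatisfied": 40, "fairly well satisfied": 60,
                    "very satisfied": 80, "completely satisfied": 100}
    GLOBAL_STATE = {"very poor": 0, "fairly poor": 25, "so-so": 50,
                    "quite good": 75, "very good": 100}

    def composite(specific, satisfaction, state_now):
        """Equal-weight sum of three 0-100 subscales (range 0-300)."""
        return SPECIFIC[specific] + SATISFACTION[satisfaction] + GLOBAL_STATE[state_now]

    # Example: 100 + 80 + 75 = 255 on the 0-300 effectiveness scale.
    print(composite("made things a lot better", "very satisfied", "quite good"))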

All of the major results held for each of these subscales separately and the composite.


Subject - CR and Control Groups

Doris Aaronson raises the important objection to the CR article that there are no control groups. Not surprisingly, CR (and I) have thought quite a lot about this issue:

No Control Groups. The overall improvement rates were strikingly high across the entire spectrum of treatments and disorders in the CR study. The vast majority of people who were feeling very poorly or fairly poorly when they entered therapy made "substantial" gains (now feeling quite good or very good) or "some" gains (now feeling so-so). Perhaps the best news for patients was that those with severe problems got, on average, much better. While this may be a ceiling effect, it is a ceiling effect with teeth: it means that if you have a patient with a severe disorder now, the chances are quite good that he will be much better within three years. But methodologically, such high rates of improvement are a yellow flag, cautioning us that improvement with the mere passage of time, rather than from treatment or medication, may be the underlying mechanism.

More generally, because there are no control groups, the CR study cannot tell us directly whether talking to sympathetic friends or merely letting time pass would have produced just as much improvement as treatment by a mental health professional. The CR survey, unfortunately, did not ask those who just talked to friends and clergy to fill out detailed questions about the results.

This is a serious objection, but there are internal controls which perform many of the functions of control groups.

First, marriage counselors do significantly worse than psychologists, psychiatrists, and social workers, in spite of no significant differences in kind of problem, severity of problem, or duration of treatment. Marriage counselors control for many of the nonspecifics, such as therapeutic alliance, rapport, and attention, as well as for the passage of time.

Second, there is a dose-response curve, with more therapy yielding more improvement. The first point in the dose-response curve approximates no treatment: people who have less than one month of treatment have, on average, an improvement score of 201, whereas people who have over two years of treatment have a score of 241.

Third, psychotherapy does just as well as psychotherapy plus drugs for all disorders, and there is such a long history of placebo controls inferior to these drugs that one can infer that psychotherapy likely would have outperformed such controls had they been run.

Fourth, family doctors do significantly worse than mental health professionals when treatment continues beyond six months. It might be objected that since total length of time in treatment, rather than total amount of contact, is the covariate, comparing family doctors, who don't see their patients weekly, to mental health professionals, who see their patients once a week or more, is not fair. It is, of course, possible that if family doctors saw their patients as frequently as psychologists do, the two groups would do equally well. It was notable, however, that there were a significant number of complaints about family doctors: 22% of respondents said the doctor had not "provided emotional support," 15% said the doctor "seemed uncomfortable discussing emotional issues," and 18% said the doctor was "too busy to spend time talking to me." At any rate, the CR survey shows that long-term family doctoring for emotional problems, as it is actually performed in the field, is inferior to long-term treatment by a mental health professional, as it is actually performed in the field.

It is also relevant that the patients attributed their improvement to treatment and not to the passage of time ("How much do you feel that treatment helped you in the following areas?"), and I conclude that the benefits of treatment are very unlikely to be caused by the mere passage of time. But I also conclude that the CR study could be improved by adding control groups, matched for severity and kind of problem, that are not treated by mental health professionals (bearing in mind that random assignment will not occur). This would allow us to infer more confidently, in a Bayesian fashion, that psychotherapy does better than talking to friends, seeing an astrologer, or going to church.
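
For concreteness, a toy sketch of the covariate-adjusted comparison described in the multivariate post above; the data are synthetic, and only the technique (multiple regression with initial severity and duration partialled out) comes from the survey:

    import numpy as np

    # Synthetic data; only the regression technique comes from the survey.
    rng = np.random.default_rng(0)
    n = 1_000
    severity = rng.integers(1, 6, n)        # emotional state at outset (1-5)
    duration = rng.uniform(0.5, 36.0, n)    # months in treatment
    pro = rng.integers(0, 2, n)             # 1 = saw a mental health professional
    score = 150 + 12*pro + 9*severity + 1.2*duration + rng.normal(0, 20, n)

    # Ordinary least squares with an intercept; beta[1] estimates the
    # treatment effect holding severity and duration constant (about 12 here).
    X = np.column_stack([np.ones(n), pro, severity, duration])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    print(f"adjusted treatment effect: {beta[1]:.1f}")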


Subject - Re: CR and Control Groups

Surely a prospective effectiveness study would be superior to a retrospective one. I see the CR study as a serious beginning that might justifiably inspire our scientific community and funding agencies, as well as APA, to undertake more ideal versions: prospective, with blind diagnosis, extensive validated measures, control groups, and the like.

But as I have suggested in my AP article and so far on the net, these less-than-ideal conditions do not easily explain away the major, and robust, results of the CR study as it stands.


Subject - Re: Consumer Reports article

As you can see from my verbose responses about control groups, face validity, retrospection, etc., I consider the CR piece to be quite rigorously done. In fact, I consider it to be the most extensive and best carried out study on the *effectiveness*--not the *efficacy*--of psychotherapy ever published.

I have a long methodological article coming out in the December American Psychologist about this topic.

Perhaps you can make explicit any thoughts you have on why it is not rigorous and we can all debate them.


Subject - Re: CR and Control Groups

Jim Wood (University of Texas at El Paso) asked:

1.  You mention that you are trying to persuade the CR attorneys to
 release the study data.  Does that mean that _American Psychologist_
 has agreed to publish your methodological analysis of the study
 even though the data are unavailable for examination by
 other researchers?  Isn't this a switch from normal APA policy
 regarding public availability of data?

Seligman replied:

I have no idea what APA policy is on methodological critiques of such studies. I seem to recall that, as an author of a research report, I must make my data public if asked. My AP article is not a research report. You should direct your question to Ray Fowler.

 2. You have said that you consider the CR report the best (or perhaps
 the "most important") study ever conducted on the "effectiveness" of
 psychotherapy.  Does this mean that you consider its findings more
 reliable or important than the findings of effectiveness studies that
 have used (a) control groups, (b) random assignment, (c) standardized
 measures, of known reliability and validity, and (d) contemporaneous,
 rather than retrospective, data collection?  Or would you say that
 any "effectiveness" study with these characteristics is therefore
 really an "efficacy" study?  It's my impression, at least in medicine,
 that "effectiveness" studies can have all these characteristics,
 which I'd consider pretty desirable.

The CR article is, IMHO, a very serious piece of research which merits careful scrutiny by our community. Because I know of no psychotherapy effectiveness study which meets Jim Wood's desiderata, or even comes close, the CR study is about the best we have. In the AP article I outline desiderata for such a future study, very much like Jim's. I hope the CR study will inspire our community to carry out such an ideal study of psychotherapy effectiveness.
