Psychiatry, Psychology and Predictive Criteria

Write a 1-2 page response paper discussing the utility of using psychological measures, as suggested below, in the diagnosis of mental illness: There is a movement in psychology to require the use of psychological measures to aid in the diagnosis of some mental illnesses. While developing the DSM-V, there was some discussion of including the results of specific measures as part of diagnostic criteria. Consider the two articles attached when drafting this short paper.

Diagnostic utility of the NAB List Learning test in Alzheimer's disease and amnestic mild cognitive impairment

Abstract
Measures of episodic memory are often used to identify Alzheimer's disease (AD) and mild cognitive impairment (MCI). The Neuropsychological Assessment Battery (NAB) List Learning test is a promising tool for the memory assessment of older adults due to its simplicity of administration, good psychometric properties, equivalent forms, and extensive normative data. This study examined the diagnostic utility of the NAB List Learning test for differentiating cognitively healthy, MCI, and AD groups. One hundred fifty-three participants (age: range = 57-94 years, M = 74 years, SD = 8 years; sex: 61% women) were diagnosed by a multidisciplinary consensus team as cognitively normal, amnestic MCI (aMCI; single and multiple domain), or AD, independent of NAB List Learning performance. In univariate analyses, receiver operating characteristic (ROC) curve analyses were conducted for four demographically corrected NAB List Learning variables. Additionally, multivariate ordinal logistic regression and fivefold cross-validation were used to create and validate a predictive model based on demographic variables and NAB List Learning test raw scores. At optimal cutoff scores, univariate sensitivity values ranged from .58 to .92 and univariate specificity values ranged from .52 to .97. Multivariate ordinal regression produced a model that classified individuals with 80% accuracy and good predictive power. (JINS, 2009, 15, 121-129.)



Full Text
INTRODUCTION
Alzheimer's disease (AD) compromises episodic memory systems, resulting in the earliest symptoms of the disease (Budson & Price, 2005). Measures of anterograde episodic memory are useful in quantifying memory impairment and identifying performance patterns consistent with AD or its prodromal phase, mild cognitive impairment (MCI; Blacker et al., 2007; Salmon et al., 2002).
List learning tests are commonly used measures of episodic memory that offer a means of evaluating a multitude of variables relevant to learning and memory. Some of the more common verbal list learning tasks are the California Verbal Learning Test (CVLT; Delis et al., 1987), Auditory Verbal Learning Test (AVLT; Rey, 1941, 1964), Hopkins Verbal Learning Test (HVLT; Brandt & Benedict, 2001), and the Word List Recall test from the Consortium to Establish a Registry for Alzheimer's Disease (CERAD; Morris et al., 1989). List learning tests have been shown to possess adequate sensitivity and specificity in differentiating participants with MCI (Mdn: sensitivity = .67, specificity = .86) (Ivanoiu et al., 2005; Karrasch et al., 2005; Schrijnemaekers et al., 2006; Woodard et al., 2005) and AD (Mdn: sensitivity = .80, specificity = .89) from controls (Bertolucci et al., 2001; Derrer et al., 2001; Ivanoiu et al., 2005; Karrasch et al., 2005; Kuslansky et al., 2004; Salmon et al., 2002; Schoenberg et al., 2006), as well as AD from MCI (Mdn: sensitivity = .85, specificity = .83) (de Jager et al., 2003).
The current study was undertaken to evaluate the diagnostic utility of a new list learning test in a sample of older adults seen as part of a prospective study on aging and dementia. The Neuropsychological Assessment Battery (NAB; Stern & White, 2003a, b) is a recently-developed comprehensive neuropsychological battery that has been standardized for use with individuals ages 18 to 97. It contains several measures of episodic memory, including a List Learning test similar to other commonly used verbal list learning tests. The NAB List Learning test was developed to "create a three trial learning test to avoid the potential difficulties that five trial tasks represent for impaired individuals, include three semantic categories to allow for examination of the use of semantic clustering as a learning strategy, avoid sex, education, and other potential biases, and include both free recall and forced-choice recognition paradigms" (White & Stern, 2003, p. 24).
One major benefit of the NAB includes the fact that all of its 33 subtests, together encompassing the major domains of neuropsychological functioning, are co-normed on the same large sample of individuals (n = 1448), with demographic adjustments available for age, sex, and education. This normative group contains a large proportion of individuals ages 60 to 97 (n = 841), making it particularly well suited for use in dementia evaluations. Despite psychometric validation (White & Stern, 2003), its diagnostic utility has yet to be evaluated.
For the last several years, several NAB measures have been included in the standard research battery in the Boston University (BU) Alzheimer's Disease Core Center (ADCC) Research Registry. The BU ADCC recruits both healthy and cognitively impaired older adults for comprehensive yearly neurological and neuropsychological assessments. After each individual is assessed, a multidisciplinary consensus diagnostic conference is held to diagnose each individual based on accepted diagnostic criteria. Importantly, the NAB measures have yet to be included for consideration when the consensus team meets to diagnose study participants. Therefore, the current study setting offers optimal clinical conditions (i.e., without neuropathological confirmation) for evaluating the diagnostic utility of the NAB List Learning test. In other words, NAB performance can be judged against current clinical diagnostic criteria without the tautological error that occurs when the reference standard is based on the test under investigation. Samples of participants from the BU ADCC Registry were used to evaluate the utility of the NAB List Learning test in the diagnosis of amnestic (a)MCI and AD. As the diagnostic utility of the NAB List Learning test has yet to be examined empirically, the present study was considered exploratory.
METHOD
Participants
Participant data were drawn from an existing database--the BU ADCC Research Registry--and retrospectively analyzed. Participants were recruited from the greater Boston area through a variety of methods, including newspaper advertisements, physician referrals, community lectures, and referrals from other studies. Participants diagnosed as cognitively healthy controls consisted of community-dwelling older adults, many of whom have neither expressed concern about nor been evaluated clinically for cognitive difficulties. Data collection and diagnostic procedures have been described in detail elsewhere (see Ashendorf et al., 2008 and Jefferson et al., 2006). Briefly, after undergoing a comprehensive participant and informant interview, clinical history taking (i.e., psychosocial, medical), and assessment (i.e., neurological, neuropsychological), participants were diagnosed by a multidisciplinary consensus group that included at least two board certified neurologists, two neuropsychologists, and a nurse practitioner. Of an initial pool of 490 participants, 18 were excluded from the present study because English was not their primary language. An additional 172 were excluded because they were not diagnosed as control, aMCI, or AD. Of the remaining 300 participants, 153 completed all relevant portions of the NAB List Learning test. These 153 participants comprised the current sample, from which three groups were established: controls (i.e., cognitively normal older adults), participants diagnosed with single or multiple domain aMCI (based on Winblad et al., 2004), and participants diagnosed with possible or probable AD (based on NINCDS-ADRDA criteria; McKhann et al., 1984).
The sample consisted of 93 women (60.8%) and 60 men (39.2%), ranging in age from 57 to 94 (M = 73.9; SD = 8.1). There were 128 (83.7%) non-Hispanic Caucasian participants and 25 (16.3%) African American participants. The data used in the current study were collected between 2005 and 2007 at each participant's most recent assessment, which ranged from the first to ninth visit (Mdn = 4.0) of their longitudinal participation.
Measures

Procedure
The BU ADCC Research Registry data collection procedures were approved by the Boston University Medical Center Institutional Review Board. All participants provided written informed consent to participate in the study. Participants were administered a comprehensive neuropsychological test battery designed for the assessment of individuals with known or suspected dementia, including all tests that make up the Uniform Data Set (UDS) of the National Alzheimer's Coordinating Center (Beekly et al., 2007; Morris et al., 2006). Neuropsychological assessment was carried out by a trained psychometrist in a single session. The identification of cognitive impairment in each of the domains assessed (language, memory, attention, visuospatial functioning, and executive functioning) was based on BU ADCC Research Registry procedures, which defined psychometric impairment a priori as a standardized score (e.g., Z-score, T-score) of greater than or equal to 1.5 standard deviation units below appropriate normative means on one or more "primary" variables. Primary variables in the memory domain include Trial 3 and Delayed Recall from the CERAD Word List, and both Immediate and Delayed portions of the Logical Memory and Visual Reproduction subtests from the Wechsler Memory Scales-Revised (WMS-R; Wechsler, 1987). WMS-R subtests were administered according to UDS procedures (e.g., only Story A from Logical Memory is administered) and no other WMS-R subtests were used.
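The a priori impairment criterion above (a standardized score at least 1.5 SD below the appropriate normative mean) is simple enough to sketch in code. The function names and example values below are invented for illustration and are not part of the BU ADCC protocol:

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw score against a normative mean and SD."""
    return (raw - norm_mean) / norm_sd

def is_impaired(raw, norm_mean, norm_sd, cutoff=-1.5):
    """Apply the 1.5-SD-below-the-mean rule: True when z <= -1.5."""
    return z_score(raw, norm_mean, norm_sd) <= cutoff

# Hypothetical example: a raw score of 5 against norms of M = 10, SD = 3
# yields z = -1.67, which meets the impairment criterion.
```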
In addition to neuropsychological testing, participant information was also obtained via clinical interview with the participant and a close informant, neurological evaluation, review of medical history, and informant questionnaires.
Diagnosis
The results from the "primary" neuropsychological variables were used by the multidisciplinary consensus team, along with social and medical history, neurological examination results, and self/informant report (i.e., interviews and questionnaires), to arrive at a diagnosis for each participant. Diagnoses were made based only on information obtained during the participant's most recent visit. The NAB List Learning test was not a "primary" neuropsychological variable, and thus, was not considered for diagnostic purposes by the multidisciplinary consensus team.
Data Analysis



RESULTS
A breakdown of the participant demographics among the three diagnostic groups is provided in Table 1. Table 1 also depicts the level of global impairment for each group, based on both Clinical Dementia Rating (CDR; Morris, 1993) Global Score and Mini-Mental State Exam (MMSE; Folstein et al., 1975) scores. Significant group differences were found on age (control < aMCI = AD), education (control > aMCI = AD), CDR Global Score, and average MMSE score (control > aMCI > AD).
Table 1.
Participant demographics and test results
Note.
aMCI = amnestic mild cognitive impairment; AD = Alzheimer's disease; AA = African American; CDR = Clinical Dementia Rating; MMSE = Mini-Mental Status Examination; NAB = Neuropsychological Assessment Battery.
a
Possible AD: n = 6; Probable AD: n = 20.
Univariate Analyses
Independent samples t tests demonstrated significant group differences on each of the four NAB List Learning variables (Table 1). ROC curve analyses for the NAB List Learning variables are presented in Table 2. The cutoff scores presented in Table 2 were chosen to maximize sensitivity and specificity, with equal emphasis on both (Youden, 1950). The individual NAB List Learning test variables were able to differentiate aMCI from controls (Mdn: sensitivity = .73; specificity = .71), AD from controls (Mdn: sensitivity = .89; specificity = .94), and AD from aMCI (Mdn: sensitivity = .69; specificity = .78). Additional prevalence-free classification accuracy statistics (i.e., those independent of base rates, such as sensitivity, specificity, PLR, and NLR) for conventional cutoff scores are provided in Table 3.
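The Youden (1950) criterion referenced above selects the cutoff that maximizes J = sensitivity + specificity - 1, weighting both equally. A minimal sketch of that search, using made-up scores (lower scores indicate worse memory) rather than the study's data:

```python
def youden_optimal_cutoff(patient_scores, control_scores):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1,
    classifying any score <= cutoff as impaired."""
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(patient_scores) | set(control_scores)):
        sensitivity = sum(s <= c for s in patient_scores) / len(patient_scores)
        specificity = sum(s > c for s in control_scores) / len(control_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j

# Invented scores for illustration only:
patients = [28, 30, 31, 33, 35, 38]
controls = [36, 40, 42, 44, 45, 47]
cutoff, j = youden_optimal_cutoff(patients, controls)
```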
Table 2.
Prevalence-free classification accuracy statistics for NAB List Learning variables at optimal cutoff scores
Note.
PLR = Positive Likelihood Ratio; NLR = Negative Likelihood Ratio; aMCI = Amnestic mild cognitive impairment; AD = Alzheimer's disease.
Table 3.
Prevalence-free diagnostic accuracy statistics for NAB List Learning variables at conventional cutoff scores
Note.
aMCI = Mild Cognitive Impairment; AD = Alzheimer's disease; Sn = Sensitivity; Sp = Specificity; PLR = Positive Likelihood Ratio; NLR = Negative Likelihood Ratio. Dashes represent a value of positive infinity due to a specificity of 1.00.
a
Includes both aMCI and AD.
Multivariate Analyses
Likelihood ratio and goodness-of-fit tests revealed that the multiple ordinal logistic regression model explained a significant portion of outcome variance and fit the data well, -2 Log Likelihood χ²(4, n = 153) = 127.80; p < .01; Pearson Goodness of Fit χ²(298, n = 153) = 216.86; p = 1.00. Of the four independent variables, List B Immediate Recall (parameter estimate = -0.05; 95% CI = -0.09 to -0.01; Wald (1, n = 153) = 5.30; p = .02) and List A Short Delay Recall (parameter estimate = -0.10; 95% CI = -0.15 to -0.05; Wald (1, n = 153) = 14.5; p < .001) were found to be the two that contributed significantly to the model. List A Immediate Recall (parameter estimate = -0.02; 95% CI = -0.07 to 0.02; Wald (1, n = 153) = 1.33; p = .25) and List A Long Delay Recall (parameter estimate = -0.03; 95% CI = -0.08 to 0.02; Wald (1, n = 153) = 1.11; p = .29) were not significant contributors to the model.
Cross-validation
The estimated classification accuracy of the model using cross-validation was 80% (95% CI = 72-88%). In identifying aMCI, the model yielded a sensitivity of .47 (95% CI = .17-.77) and a specificity of .91 (95% CI = .83-.99; PLR = 4.96; NLR = .59). In identifying AD, the model yielded a sensitivity of .65 (95% CI = .41-.89) and a specificity of .97 (95% CI = .94-.99; PLR = 21.18; NLR = .36). A frequency table of predicted by actual diagnosis is presented in Table 4. Table 5 presents the positive predictive power (PPP) and negative predictive power (NPP) of the ordinal model across a range of clinically relevant base rates.
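The PLR and NLR values above are simple functions of sensitivity and specificity (PLR = sensitivity / (1 - specificity); NLR = (1 - sensitivity) / specificity). The sketch below illustrates the arithmetic; because the inputs are the rounded values reported in the text, its outputs will differ slightly from the paper's figures, which were presumably computed from unrounded estimates:

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios for a dichotomous test."""
    plr = sensitivity / (1.0 - specificity)
    nlr = (1.0 - sensitivity) / specificity
    return plr, nlr

# Rounded AD values from the cross-validation above (sens = .65, spec = .97):
plr, nlr = likelihood_ratios(0.65, 0.97)
```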
Table 4.
Frequency of predicted diagnosis by actual consensus diagnosis
Note.
aMCI = Amnestic mild cognitive impairment; AD = Alzheimer's disease.
Table 5.
Positive and negative predictive power of the ordinal NAB List Learning model at various base rates
Note.
aMCI = Amnestic mild cognitive impairment; PPP = Positive Predictive Power; NPP = Negative Predictive Power; AD = Alzheimer's disease.
DISCUSSION
The results of this study show that the NAB List Learning test can differentiate between cognitively normal older adults and those with aMCI and AD. Univariate analyses showed that each of the four variables was able to make dichotomous classifications with sensitivity values ranging from .58 to .92 and specificity values ranging from .52 to .97. For instance, AD was differentiated from controls with over 90% sensitivity and specificity using a cutoff score of T ≤ 37 on List A Short Delay Recall or T ≤ 40 on List A Long Delay Recall. In addition, AD was differentiated from aMCI with over 70% sensitivity and 80% specificity using a cutoff score of T ≤ 30 on List A Short Delay Recall (see Table 2).
The multivariate ordinal logistic regression model, which incorporated four NAB List Learning variables, yielded an overall accuracy estimate of 80% based on fivefold cross-validation. In particular, the model was able to identify participants diagnosed with aMCI and AD with high specificity (.91 for aMCI and .97 for AD), but lower sensitivity (.47 for aMCI and .65 for AD). Taking prevalence into account, the ordinal logistic regression model was found to perform best when ruling out aMCI or AD (i.e., higher NPP) at lower base rates and when ruling in aMCI or AD (i.e., higher PPP) at higher base rates (Table 5). More specifically, in settings with clinical base rates of aMCI and AD at 20% or below, good performance on the NAB List Learning test can yield high confidence (i.e., NPP ≥ .87) that the patient would not be diagnosed as aMCI or AD by our consensus team. Similarly, in a setting with base rates of aMCI or AD around 50% or greater, as may be seen in a memory disorders clinic, poor performance on the NAB List Learning test can provide a high degree of confidence (i.e., PPP ≥ .72) that the patient would be given a diagnosis of aMCI or AD by our consensus team.
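The base-rate-dependent predictive power values summarized above (and tabulated in Table 5) follow from Bayes' theorem applied to sensitivity, specificity, and prevalence. A sketch of that computation, using illustrative inputs rather than the study's unrounded estimates:

```python
def predictive_power(sensitivity, specificity, base_rate):
    """Return (PPP, NPP) at a given prevalence via Bayes' theorem."""
    tp = sensitivity * base_rate                   # true positives
    fp = (1.0 - specificity) * (1.0 - base_rate)   # false positives
    fn = (1.0 - sensitivity) * base_rate           # false negatives
    tn = specificity * (1.0 - base_rate)           # true negatives
    return tp / (tp + fp), tn / (tn + fn)
```

As the base rate rises, PPP rises and NPP falls, which is why a positive result is most informative in high-prevalence settings (e.g., a memory disorders clinic) and a negative result in low-prevalence settings.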
It should be noted that the current sample excluded individuals who did not complete the entire NAB List Learning test, which, for some participants, was due to excessive cognitive impairment. In addition, the participants with AD in the current study were predominantly in the very mild (CDR = 0.5, n = 5 [19%]) to mild (CDR = 1.0; n = 13 [50%]) stages. Because the current sample is generally free from severe impairment, it may be a valid representation of the types of patients that clinicians are asked to evaluate for early diagnosis.
Of the four variables entered into the ordinal logistic regression model, only two were found to contribute significantly: List B Immediate Recall and List A Short Delay Recall. Despite these findings, the results do not necessarily suggest that the nonsignificant variables lack value in differentiating healthy controls from individuals with aMCI from those with AD; in fact, both List A Immediate Recall and List A Long Delay Recall, in isolation, can differentiate control, aMCI, and AD groups with sensitivity values ranging from .58 to .92 and specificity values ranging from .52 to .97 (see Table 2). However, the results do suggest that these nonsignificant variables do not lead to a significant increase in explanatory power beyond what can be attained after considering List B Immediate Recall and List A Short Delay Recall performance.
Despite the fact that the MCI and AD groups were older and less educated than the control group, these demographic differences are unlikely to be contributing to the current results. Although age and education differ across groups, the use of demographically-corrected normative data protects against their potential confounding influence. In other words, the use of demographically-corrected norms prevents age and education from being associated with the independent variables. In fact, the NAB Psychometric and Technical Manual (White & Stern, 2003) illustrates that age accounts for 0.0% of the variance and education accounts for 0.0% to 0.2% of the variance in scores on the independent variables that were used in the current study.
The classification accuracy of the NAB List Learning test compares favorably to published data on other list learning tests. For instance, the median sensitivity and specificity values of the individual NAB List Learning variables are generally on par with those seen in tests such as the CVLT, AVLT, HVLT, and the CERAD Word List (Table 6). More specifically, for example, a recent study found that long delay free recall on the CVLT differentiated AD from controls with a sensitivity of .98 and a specificity of .88 (Salmon et al., 2002), similar to the reported values of NAB List A Long Delay Recall in the current study (sensitivity = .92, specificity = .97). However, a major strength of the current study is that it validates a single model, developed using multiple ordinal logistic regression, that combines several list learning variables simultaneously to discriminate between three diagnostic groups (i.e., control, aMCI, and AD). One advantage of this ordinal logistic regression model is that it combines the NAB List Learning variables quantitatively, yielding results that can be integrated with applicable base rates to estimate diagnostic likelihood. The use of this model allows for an empirically-validated, quantitative method of combining important variables, as opposed to using clinical judgment for "profile" analysis, which may be susceptible to limitations in human cognitive processing, such as interpreting patterns among multiple neuropsychological test variables (Wedding & Faust, 1989).
Table 6.
Comparison of sensitivity and specificity between individual variables from the NAB and the AVLT, CERAD, CVLT, and HVLT
Note.
MCI = Mild cognitive impairment; CERAD = Consortium to Establish a Registry for Alzheimer's Disease; IR = Immediate Recall; DR = Delayed Recall; %Ret = Percent Retention; HVLT = Hopkins Verbal Learning Test; NAB = Neuropsychological Assessment Battery; SDR = Short Delay Recall; LDR = Long Delay Recall; AD = Alzheimer's disease; CVLT = California Verbal Learning Test; AVLT = Auditory Verbal Learning Test.
a
Karrasch et al. (2005).
b
Ivanoiu et al. (2005).
c
Woodard et al. (2005).
d
Schrijnemaekers et al. (2006).
e
Current study (see Table 2).
f
Bertolucci et al. (2001).
g
Derrer et al. (2001).
h
Salmon et al. (2002).
i
Kuslansky et al. (2004).
j
Schoenberg et al. (2006).
k
de Jager et al. (2003).
Although the current findings support the diagnostic utility of the NAB List Learning test, the generalizability of the current results is limited. For instance, the sample is highly educated; data were collected in a research setting where many individuals volunteered due to self-awareness of memory difficulties; and the specifics of the reference standard, such as the clinicians participating in the consensus team and the assessment protocol used, are unique to our setting. Although the sample contains a fair number of African American participants (16%), representation of other minority groups is lacking. An additional limitation is the fact that the NAB List Learning test was not directly compared with other list learning tests in the same sample, precluding more definitive statements about its diagnostic accuracy in relationship to alternate tests. Finally, the results are limited by the reference standard that was used to establish a diagnosis. Despite the documented advantages of actuarial approaches over subjective approaches to clinical decision making (Dawes et al., 1989; Grove et al., 2000), it is important to emphasize that the reference standard used in the current study is a multidisciplinary consensus diagnosis based on contemporary clinical diagnostic criteria, not neuropathological diagnosis. At the present time, diagnosis of definite AD requires neuropathological confirmation (McKhann et al., 1984). Consequently, the classification accuracy statistics reported herein cannot be interpreted to reflect the likelihood that a patient actually has AD; instead, they indicate the likelihood that this specific consensus diagnostic team would make a particular diagnosis when using the assessment methods described above. It should also be noted that the consensus diagnosis was made, in part, on the basis of other neuropsychological tests, some of which are methodologically and psychometrically similar to the NAB List Learning test. 
This may have introduced an inherent and unavoidable source of bias. However, the diagnoses were based on consensus after consideration of a wide range of information, thus reducing the likelihood that shared method variance between the NAB List Learning test and other episodic memory measures would have caused significant tautological concerns.
From a methodological standpoint, there are other limitations that require future study. The data were analyzed retrospectively and at various points in the longitudinal assessment of participants. An important line of future research would be to longitudinally follow individuals diagnosed with aMCI to prospectively examine whether NAB List Learning test performance is associated with AD progression. Because the current study does not include other dementia subtypes, future studies should also examine non-AD dementias. Finally, to limit the number of predictor variables in the ordinal logistic regression model, the NAB List Learning variables that are considered "secondary" or "descriptive" (White & Stern, 2003) were excluded. However, these additional variables may add additional diagnostic utility to the List Learning test, and future study is warranted.
Despite its limitations, the current study has several strengths. For instance, diagnostic accuracy statistics are provided for a large number of cutoff scores, giving users of the test considerable flexibility in interpreting test results. For example, depending on the desired purpose of the examination, users may wish to choose cutoff scores that place a higher value on sensitivity (e.g., clinical settings, where false positive errors may be preferable to false negative errors) or specificity (e.g., research settings, where false negative errors may be preferable to false positive errors). Users of the test may choose to interpret results using traditional cutoff scores (e.g., Z-scores ≤ -1.5 or -2.0), or to use the empirically derived cutoff scores presented herein to emphasize sensitivity and specificity equally. In addition, test users may choose to examine each test variable individually, or to interpret the overall pattern of test scores using the multiple ordinal logistic regression model, which accounts for performance on the four primary NAB List Learning variables simultaneously. For the latter approach, positive and negative predictive values are provided for a range of base rates, allowing for a more individually tailored approach to test interpretation. An additional strength of the study was the lack of tautological error, as the NAB List Learning test was not used in diagnostic formulations. Instead, NAB List Learning performance was examined independently against the clinical "gold standard," a multidisciplinary consensus diagnostic conference.
The cross-validation of the ordinal logistic regression model allows for examination of the degree of precision in estimates of sensitivity, specificity, and overall accuracy. Based on the reported confidence intervals, there is a good degree of precision in the ordinal model's overall accuracy (accuracy = 80%; 95% CI = 72-88%) and in the model's specificity to the diagnosis of both aMCI (specificity = .91; 95% CI = .83-.99) and AD (specificity = .97; 95% CI = .94-.99). However, in examining the 95% confidence intervals surrounding the sensitivity estimates for both aMCI and AD, it is apparent that the sensitivity of the ordinal model is considerably lower and lacking precision. This may be due in part to the relatively small sizes of the clinical sample and in part due to the negative log-log link function that was used in the multiple ordinal logistic regression model. This link function makes an a priori assumption that the underlying distribution of the data is skewed toward "normality." In other words, the model was chosen based on the assumption that the prevalence of healthy controls is greater than the prevalence of individuals with aMCI and AD. As a result, the ordinal logistic regression model may be more prone to false negative errors (i.e., reduced sensitivity) than to false positive errors (i.e., reduced specificity). This decreased sensitivity to aMCI and AD may also reflect the fact that individuals with aMCI and AD perform similarly on measures of episodic memory, and that functional measures may be necessary to improve diagnostic sensitivity once a certain degree of cognitive decline has occurred in an individual. Although the current results present diagnostic accuracy statistics for the NAB List Learning test, it should be emphasized that a diagnosis of aMCI or AD cannot be made on the basis of a single neuropsychological instrument.
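The negative log-log link mentioned above maps the linear predictor η to a cumulative probability via P(Y ≤ j) = exp(-exp(-η)), concentrating mass toward the lower ordered categories (here, the cognitively normal end of the control < aMCI < AD ordering). The toy sketch below, with invented thresholds and coefficient, shows how category probabilities are recovered from such a model; it is not the fitted model from this study:

```python
import math

def neg_log_log_cumprob(eta):
    """Inverse negative log-log link: P(Y <= j) = exp(-exp(-eta))."""
    return math.exp(-math.exp(-eta))

def category_probs(x, thresholds, beta):
    """Probabilities over ordered categories (control < aMCI < AD)
    for a single predictor value x; thresholds must be increasing."""
    cum = [neg_log_log_cumprob(t - beta * x) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Invented parameters for illustration:
probs = category_probs(0.0, thresholds=[0.5, 1.5], beta=0.3)
```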
The current results demonstrate that the NAB List Learning test was able to classify older adults into cognitively normal, AD, and aMCI groups with accuracy levels similar to other published list learning tests (Bertolucci et al., 2001; de Jager et al., 2003; Derrer et al., 2001; Ivanoiu et al., 2005; Karrasch et al., 2005; Kuslansky et al., 2004; Salmon et al., 2002; Schoenberg et al., 2006; Schrijnemaekers et al., 2006; Woodard et al., 2005). The NAB List Learning test possesses a large and up-to-date set of demographically-corrected normative data (n = 1441) and it was co-normed as part of a comprehensive neuropsychological test battery. In addition, it was developed to include two equivalent forms; in fact, in the NAB standardization sample (n = 1448), test form accounted for less than 1.5% of the total variance seen in List Learning performance (White & Stern, 2003), making it suitable for clinical re-evaluation and longitudinal research applications. The findings from the current study, along with the overall strengths of the NAB, suggest that the NAB List Learning test is an appropriate and clinically useful tool for the evaluation of older adults with known or suspected Alzheimer's disease. Although the current study did not directly compare the diagnostic utility of the NAB List Learning test to other list learning measures, the classification accuracy data presented herein are similar to those reported in the literature investigating the diagnostic utility of other list learning tests in control, MCI, and AD samples (see Table 6). Future research is warranted to make direct comparisons of diagnostic utility to other list learning instruments.
ACKNOWLEDGMENTS
The project described was supported by Grant Number M01 RR000533 and the CTSA Grant Number 1UL1RR025771 from the National Center for Research Resources (NCRR), a component of the National Institute of Health (NIH). Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. This research was also supported by P30-AG13846 (Boston University Alzheimer's Disease Core Center), R03-AG026610 (ALJ), R03-AG027480 (ALJ), K12-HD043444 (ALJ), K23-AG030962 (ALJ), RO1-HG02213 (RCG), RO1-AG09029 (RCG), R01-MH080295 (RAS), and K24-AG027841 (RCG). Robert A. Stern is one of the developers of the NAB and receives royalties from its publisher, Psychological Assessment Resources Inc. Portions of this manuscript were presented at the International Conference on Alzheimer's Disease, Chicago, July 2008. With this exception, the information in this manuscript and the manuscript itself has never been published either electronically or in print. A scoring program based on the multiple ordinal regression model reported in this manuscript is available as an electronic addendum to this manuscript.


Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice

David J. Cooke · Christine Michie
Received: 24 August 2007 / Accepted: 2 February 2009 / Published online: 11 March 2009
© American Psychology-Law Society/Division 41 of the American Psychological Association 2009
Abstract  Knowledge of group tendencies may not assist accurate predictions in the individual case. This has importance for forensic decision making and for the assessment tools routinely applied in forensic evaluations. In this article, we applied Monte Carlo methods to examine diagnostic agreement with different levels of inter-rater agreement given the distributional characteristics of PCL-R scores. Diagnostic agreement and score agreement were substantially less than expected. In addition, we examined the confidence intervals associated with individual predictions of violent recidivism. On the basis of empirical findings, statistical theory, and logic, we conclude that predictions of future offending cannot be achieved in the individual case with any degree of confidence. We discuss the problems identified in relation to the PCL-R in terms of the broader relevance to all instruments used in forensic decision making.
There is an important disjunction between the perspective
of science and the perspective of the law; while science
seeks universal principles that apply across cases, the law
seeks to apply universal principles to the individual case.
Bridging these perspectives is a major challenge for psychology
(Faigman, 2007 ). It is recognized by statisticians
that knowledge of group tendencies—even when precise—
may not assist accurate evaluation of the individual case
(e.g., Colditz, 2001 ; Henderson & Keiding, 2005 ; Rockhill,
2001 ; Tam & Lopman, 2003 ). It is a statistical truism that
the mean of a distribution tells us about everyone, yet no
one. This has serious implications for the use of psychological
tests in forensic decision making. To illustrate these
limitations, we focus on one of the most widely used, and
perhaps the most extensively validated, test in the forensic
arena—the Psychopathy Checklist Revised (PCL-R1 ; Hare,
2003 ). We emphasize, however, that all psychological tests
used in the same way in the forensic arena will suffer from
similar limitations (e.g., VRAG, Quinsey, Harris, Rice, &
Cormier, 1998 ; Static-99, Hanson & Thornton, 1999 ;
COVR, Monahan et al., 2005 ).
Mental health professionals are frequently asked to
opine whether an individual might be violent in the future;
psychopathic personality disorder is an important risk
factor to consider (Hart, 1998 ). The PCL-R is the most
frequently used measure of psychopathic personality disorder;
it has been described as the ‘‘gold standard’’ for that
purpose (Edens, Skeem, Cruise, & Cauffman, 2001; as
cited in Hare, 2003). There can be little doubt that the PCL-R
has made a major contribution to our understanding of
violence (Hart, 1998 ); nonetheless, it is important for the
field to consider both its strengths and its limitations.
Findings for this instrument will have implications for less
well-validated tools. In this introduction, we consider two
issues; first, the use of PCL-R scores in forensic practice
and second, the general problem of the precision of predictions
about an individual case.
D. J. Cooke (✉) · C. Michie
Department of Psychology, Glasgow Caledonian University,
Glasgow G4 0BA, UK
e-mail: djcooke@rgardens.vianw.co.uk
1 The PCL-R is a 20-item rating scale of traits and behaviors intended
for use in a range of forensic settings. Definitions of each item are
provided and evaluators rate the lifetime presence of each item on a
3-point scale (0 = absent, 1 = possibly or partially present, and
2 = definitely present) on the basis of an interview with the
participant and a review of case history information.
Law Hum Behav (2010) 34:259–274
DOI 10.1007/s10979-009-9176-x
PCL-R SCORES AND FORENSIC PRACTICE
 Much of the interest in the construct psychopathy comes
from the relationship between the PCL-R and future
criminal behavior (Lyon & Ogloff, 2000 ). Previous
research suggests that psychopathy—as assessed using the
Psychopathy Checklist-Revised (PCL-R; Hare, 1991 )—is
an important risk marker for criminal and violent behavior
(Douglas, Vincent, & Edens, 2006 ; Hart, 1998 ; Hart &
Hare, 1997 ; Hemphill, Hare, & Wong, 1998 ; Leistico,
Salekin, DeCoster, & Rogers, 2008 ; Salekin, Rogers, &
Sewell, 1996 ). In fact, the PCL-R has been lauded as an
‘‘unparalleled’’ single predictor of violence (Salekin et al.,
1996 ). Hart (1998 ) argued that failure to consider psychopathy
in a violence risk assessment may constitute
professional negligence. This empirical base has resulted in
the PCL-R being used, not merely to measure the trait
strength of psychopathy in an individual, but also to make
predictions about what he or she will do in the future (Hare,
1993 ). As we demonstrate formally below, this additional
step of prediction means that the potential for imprecision
in forensic evidence is greatly increased: It expands the
gulf between inferences about groups and inferences about
individuals.
The PCL-R has been incorporated into statutory or legal
decision making (Hare, 2003 ). Within England and Wales,
a PCL-R score above a cut-off of 25 or 30 can lead to
detention in either a Special Hospital or a prison (Maden &
Tyrer, 2003 ); in certain Canadian provinces parole boards
explicitly consider PCL-R scores (Hare, 2003 ), and in
Texas psychopathy assessments are mandated by statute for
sexual predator evaluation (Edens & Petrila, 2006 ).2  The
PCL-R plays a role in criminal sentencing, including
decisions regarding indefinite commitment and capital
punishment, institutional placement and treatment, conditional
release, juvenile transfer, child custody, witness
credibility, civil torts, and indeterminate civil commitment
(DeMatteo & Edens, 2006 ; Fitch & Ortega, 2000 ; Hart,
2001 ; Hemphill & Hart, 2002 ; Lyon & Ogloff, 2000 ;
Walsh & Walsh, 2006 ; Zinger & Forth, 1998 ). The PCL-R
is regarded by many as the best method for operationalizing
the construct of psychopathy. For example, Lyon and
Ogloff (2000 ) argued that ‘‘… it is critical that the assessment
is made using the PCL-R’’ (p. 166) when evidence
about violence risk, based on psychopathy, is provided.
Because of its central role in forensic decision making it is
vital to assess its strengths and limitations and, by comparison,
the limitations of less well-validated procedures.
PREDICTIONS FOR INDIVIDUALS VERSUS
PREDICTIONS FOR GROUPS
Prediction is the raison d'être of many forensic instruments
(e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson &
Thornton, 1999; COVR, Monahan et al., 2005). While this is
not true of the PCL-R, its frequent use in forensic practice is
underpinned by the assumption—implicit or explicit—that it
can predict future offending (Walsh & Walsh, 2006). How
precise can such predictions be? The precision of any estimate
of a parameter (e.g., mean rate of recidivism of a group)
can be measured by the width of a confidence interval (CI); a
CI gives an estimated range of values, which is likely to
include an unknown population parameter. If independent
samples are taken repeatedly from the same population, and a
CI calculated for each sample, then a certain percentage
(confidence level) of the intervals will include the unknown
population parameter. Typically, 95% of these intervals
should include the unknown population parameter; other
intervals may be used (e.g., 68% and 99%). The width of this
interval provides a measure of the precision—or certainty—
that we can have in the estimate of the population parameter.
The width of a CI of a population parameter is linked, in part,
to the sample size used to estimate the population parameter
(see below for a more technical explanation).
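The repeated-sampling interpretation of a CI described above can be demonstrated with a short simulation (a sketch in Python, not part of the article; the true recidivism rate of .55 and the sample size of 200 are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.55    # hypothetical population recidivism rate
n = 200          # size of each independent sample
reps = 20_000    # number of repeated samples

covered = 0
for _ in range(reps):
    # draw one sample and form a 95% Wald CI for the proportion
    p_hat = rng.binomial(n, p_true) / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - half <= p_true <= p_hat + half:
        covered += 1

coverage = covered / reps
print(f"empirical coverage: {coverage:.3f}")
```

Roughly 95% of the intervals cover the true value, matching the nominal confidence level; no single interval, of course, tells us whether its own parameter lies inside it.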
The prevailing prediction paradigm has two stages.
First, the parameters (mean, slope, and variance) of a
regression model linking an independent variable (e.g.,
PCL-R score) to a dependent variable (e.g., likelihood of
reconviction) are estimated. Each of these parameters has
uncertainty associated with them, which can be expressed
by confidence bands about the regression line. Second, a
new case is selected and the PCL-R score is assessed, the
model is applied and the likelihood of reconviction is
estimated. The best estimate of the likelihood of reconviction
for a new case will be identical to the point on the
regression line for that PCL-R score. This new estimate has
a CI—also known as a prediction interval—that expresses
the precision, or certainty, that should be associated with
the prediction made about the new case. Often the two
steps are conflated, with the unrecognized assumption
being made that the prediction interval for the new case is
comparable to the CIs for the model. It is not (see below).
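The difference between the two kinds of interval can be seen in the simplest setting, ordinary least squares on simulated data (an illustrative sketch, not the article's reconviction model; the slope, noise level, and new-case score are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(5, 35, n)                 # predictor (e.g., a test score)
y = 0.02 * x + rng.normal(0, 0.3, n)      # outcome with noise

# fit y = a + b*x by least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)              # residual variance

x0 = 30.0                                 # a new case
xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()
# standard error of the MEAN response at x0 (confidence band)
se_mean = np.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / sxx))
# standard error of a NEW observation at x0 (prediction interval)
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))

print(f"95% CI half-width for the mean:   {1.96 * se_mean:.3f}")
print(f"95% PI half-width for a new case: {1.96 * se_pred:.3f}")
```

The confidence band reflects only uncertainty in the fitted line and shrinks toward zero as n grows; the prediction interval adds the residual variance of individual outcomes and therefore remains wide no matter how large the sample.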
The problem of making predictions for individuals from
statistical models is now recognized in other disciplines. In
relation to medical risks, Rose (1992 ) expressed the position
clearly: ‘‘Unfortunately the ability to estimate the
average risk for a group, which may be good, is not matched
by any corresponding ability to predict which
individuals are going to fall ill soon’’ (p. 48). In relation to
reoffending, Copas and Marshall (1998 ) made a related
point ‘‘… the score is not a prediction about an individual
[italics added], but an estimate of what rate of conviction
might be expected of a group [italics added] of offenders
who match that individual on the set of covariates used by
the score'' (p. 170) (see also Altman & Royston, 2000;
Bradfield, Huntzickler, & Fruehan, 1970; Colditz, 2001;
Elmore & Fletcher, 2006; Henderson, Jones, & Stare, 2001;
Henderson & Keiding, 2005; Rockhill, 2001; Rockhill,
Kawachi, & Colditz, 2000; Tam & Lopman, 2003; Wald,
Hackshaw, & Frost, 1999).
2 The PCL-R is the most commonly used instrument for assessing
psychopathy in this setting (Mary Alice Conroy, personal communication,
10 April 2007).
It is not generally recognized that a risk factor must have
a very strong relative risk (i.e., > 50) if it is to have utility as
a screening instrument at the individual level (Rockhill
et al., 2000; see also Kennaway, 1998). However, others set
the bar higher:
A risk factor has to be extremely strongly associated
with a disease within a population before it can be
considered to be a potentially useful screening test.
Even a relative odds of 200 between the highest and
lowest fifths will yield a detection rate of no more
than about 56% for a 5% false positive rate…  (Wald
et al., 1999 , p. 1564).
To put this in perspective, the relative risk for the
association between lung cancer and smoking is between
10 and 15 (Rockhill et al., 2000 ), depending on the definition
of exposure. The relative risk for the PCL-R and
recidivism is something of the order of 3 for general
recidivism and 4 for violent recidivism (Hare, 2003 ).
Does the application of current forensic tools provide an
adequate basis for testimony concerning the individual case?
In this article, we attempt to answer this question by considering
three issues pertaining to PCL-R data. How
confident can clinicians and legal decision makers be, first, in
the use of critical diagnostic cut-offs; second, in the
numerical value of PCL-R scores; and third, in individual
predictions of violent recidivism? We describe two studies.
The first study addresses the accuracy of diagnostic decisions
and the potential range of discrepancies between two raters.
The second study addresses the accuracy of prediction of
future violence in the individual case. The results have relevance
beyond the PCL-R to the use of other psychometric
instruments in forensic practice: The same limitations may
apply to many forensic assessment instruments.
STUDY ONE
 In the first study, we examined diagnostic accuracy, specifically
the allocation of individuals around two critical
cut-offs, i.e., around 30 and around 25; the first is the
standard PCL-R cut-off for the diagnosis of psychopathy
and the second, often adopted in the UK, has proven useful
in that context including in decisions regarding treatment
allocation (Hare, 2003 ).
The inter-rater reliability figures presented in the PCL-R
manual can be regarded as good (Nunnally & Bernstein,
1994 ); intraclass correlation coefficient for single ratings
(ICC1 ) are estimated in some research studies as being
above .80 (Male offenders =  .86; Male forensic psychiatric
patients =  .88; Hare, 2003 , Table 5.4).3  Edens and
Petrila (2006 ) indicated that these are probably ‘‘best case’’
estimates and ‘‘real world’’ reliabilities may be substantially
poorer.4  Murrie, Boccaccini, Johnson, and Janke
(2008 ), in one ‘‘real world’’ study, demonstrated poor
agreement (ICC1 =  .39). These views and findings echo
concerns expressed by Hare (1998 ), that while researchers
take great pains to ensure reliability in their studies, the
level of reliability achieved by individual clinicians
remains unknown—and by implication—is likely to be
poorer than published studies. Inter-rater reliability is not
the only relevant consideration: Diagnostic precision is
also influenced by the underlying distribution of test scores.
Diagnostic precision is influenced by the location of the
cut-off and the shape of the distribution of scores—both
skewness and kurtosis. Estimates of the precision of a test
score (e.g., standard errors of measurement, SEM) are
weighted toward the mean of the distribution whereas cutoffs
are generally located substantially above the mean.
Item Response Theory (IRT) studies demonstrate that the
measurement precision of the PCL-R—in terms of measurement
information—falls toward the diagnostic cut-off
(Cooke, Michie, & Hart, 2006 ); thus, the SEM estimated
on the mean will provide an optimistic estimate of diagnostic
precision.
The SEM cannot be directly translated into estimates of
precision of diagnosis because of the impact of the score
distributions. Equally, it is not possible to estimate misclassification
rates directly using ICC1  values; therefore,
simulation approaches are required. Study one describes a
simulation that examines the impact of unreliability on
diagnostic accuracy.
Method
 Monte Carlo studies allow the investigation of the properties
of distributions and estimates of parameters where
results cannot be derived theoretically (Mooney, 1997 ;
Robert, 2004 ). Large numbers of simulated datasets can be
3 The estimates of reliability are frequently obtained by re-rating the
same interview or with an observer simultaneously rating within an
interview. This will tend to inflate reliability, but not validity, as the
same information source is being used.
4 The case of THE PEOPLE, Plaintiff and Respondent, v. KURT
ADRIAN PARKER, a Sexually Violent Predator Act case highlights
the variability that can emerge in some cases; five accredited experts
furnished five PCL-R scores that ranged from 10 to 25. (Edens, John,
Personal Communication, 22 May 2006).
 created based on an explicit and replicable data-generation
process. The effect of known features designed into the
data, such as levels of inter-rater reliability, on outcomes,
such as diagnostic precision, can be assessed. Multiple
trials of procedures are carried out to allow precise estimation
of outcomes. Mooney (1997 ) argued that Monte
Carlo simulations could allow social scientists to test
classical parametric inference methods and provide more
accurate statistical models. In our view, this mainstream
statistical technique is underused in forensic research.
Materials
 We used Monte Carlo techniques based on distribution
information from two datasets of PCL-R total scores: (1)
data for North American Male Offenders (Table 9.1, Hare,
2003 ) and (2) data from UK prisoners (Cooke, Michie,
Hart, & Clark, 2005 ).5
 The first distribution, being the largest, probably provides
the best estimate of the true distribution of scores
underlying the PCL-R and is described as ‘‘approximately
normal’’ (Hare, 2003 , p. 55). Given the potential impact of
a departure from normality, we tested whether this distribution
was in fact normal. The departure from normality
was highly significant (Kolmogorov–Smirnov = .068,
df = 5408, p < .0001; skewness = −.33, kurtosis = −.570).
Examination of Fig. 1 demonstrates that around the
standard cut-off of 30, cases are over-represented, while in
the right tail of the distribution they are under-represented.
In the simulation study, we generated two random variables
per case using MATHCAD 13 (2005). These random variables
were scaled according to one of the two datasets referred
to above, with mean μ and standard deviation σ.6 This gives
two uncorrelated ratings (x1 and x2) from the distribution of
scores: x1 is our first rating on the subject, PCL1. We then
calculated a linear combination of the two ratings to provide a
second rating on the same subject, which has a correlation of
ρ with the first rating. The linear combination is

PCL2 = round[ μ + (x1 − μ)ρ + (x2 − μ)√(1 − ρ²) ],

using rounding to ensure an integer score. This process gives
two random, correlated scores from the distribution. There is
a very small probability of obtaining second ratings less than
0 or greater than 40: These scores have been taken as 0 or 40,
respectively.
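The generation step can be sketched as follows (Python rather than the authors' Mathcad; the mean of 22 and SD of 8 are rough stand-in values for the North American offender distribution, and a normal distribution is substituted for the empirical one purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, rho = 22.0, 8.0, 0.80   # illustrative distribution and ICC1
n = 500_000

# two uncorrelated draws per simulated case
x1 = rng.normal(mu, sigma, n)
x2 = rng.normal(mu, sigma, n)

# linear combination giving a second rating correlated rho with the first
pcl1 = np.clip(np.round(x1), 0, 40)
pcl2 = mu + (x1 - mu) * rho + (x2 - mu) * np.sqrt(1 - rho ** 2)
pcl2 = np.clip(np.round(pcl2), 0, 40)

print(f"empirical correlation: {np.corrcoef(pcl1, pcl2)[0, 1]:.3f}")

# of cases the first rater scores 30-34, how often does the second agree?
band = (pcl1 >= 30) & (pcl1 <= 34)
same = ((pcl2 >= 30) & (pcl2 <= 34))[band].mean()
below = (pcl2 < 30)[band].mean()
print(f"second rating also 30-34: {same:.2f}; below cut-off of 30: {below:.2f}")
```

Because a normal curve stands in for the empirical score distribution, the band percentages will differ somewhat from Tables 2 and 3, but the qualitative picture of large disagreement around the cut-off should be similar.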
Assuming that the ICC1  represents the best estimate of
the correlation between the two scores, we estimated the
distributions for four values of reliability, i.e., ICC1  values
of .75, .80, .85, and .90. The .80 value is a lower bound
estimate for reasonable practice. Hare (1998 ) indicated that
at least this level should be achievable with ‘‘… properly
conducted assessments’’ (p. 107). The .85 level may be
achievable by one rater with good training; the .90 level,
perhaps the best case scenario, is the level achievable where
two independent sets of ratings are averaged. Values above
.90 are rarely if ever achievable—Hare (1998 ) describes .95
and higher as ‘‘unbelievably high’’ (p. 107). The .75 provides
a lower-bound estimate of what may be obtained in
clinical practice. These values probably represent optimistic
estimates for actual clinical practice; we did not assess the
‘‘worst case’’ scenarios implied by Edens and Petrila (2006 )
and by Murrie et al. (2008 ). The estimation procedure was
repeated 10,000,000 times for each of the four levels of
ICC1  to provide stable estimates of the distribution of the
correlated ratings and to ensure at least 10,000 cases within
each of the extreme score bands. We examine discrepancies
in two ways: First, in terms of diagnostic disagreement and
second, in terms of disagreements about total scores.
What is the level of diagnostic agreement? Kappa (κ)
coefficients measure the proportion of diagnostic agreements
corrected for observed base rates (Fleiss, 1981).
Conventionally, κ > .75 represents excellent agreement,
.40 ≤ κ ≤ .75 represents fair to good agreement, and
κ < .40 represents poor agreement (Gail & Benichou,
2000). Kappa values for the distributions and the four
ICC1 values are given in Table 1. We calculated kappa
coefficients for agreement in diagnosis between the two
ratings using both common cut-offs, i.e., 30 and 25. The
vast majority of kappa values are only in the fair to good
range; few values approach the poor range.
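The kappa values in Table 1 can be reproduced from the agreement percentages. For example, for the North American distribution at ρ = .80 and cut-off 30, Table 1 reports 73.3% both below, 11.5% both at or above, and 15.1% discordant; assuming the disagreements split evenly between the two off-diagonal cells (the simulation is symmetric in the two raters), Cohen's kappa works out to the tabled .51:

```python
# cell proportions of the 2x2 diagnosis table (rater 1 x rater 2)
both_below = 0.733   # both < 30
both_above = 0.115   # both >= 30
disagree   = 0.151   # split evenly between the off-diagonal cells

off = disagree / 2
p_o = both_below + both_above            # observed agreement
# marginal probability of each diagnosis (same for both raters by symmetry)
p_below = both_below + off
p_above = both_above + off
p_e = p_below ** 2 + p_above ** 2        # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(f"kappa = {kappa:.2f}")            # kappa = 0.51, matching Table 1
```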
Kappa is an omnibus statistic, which is useful for
summarizing group results; however, it tells us little about
agreement in the individual case. The potential for misclassification
is clearer when distributions of disagreements
[Fig. 1 Distribution of PCL-R total scores for North American male
prisoners with normal curve overlay; x-axis: PCL-R total score (0–40),
y-axis: frequency (0–300)]
5 We carried out a similar analysis of data for Male Forensic
Psychiatric Patients (Table 9.2, Hare, 2003); the results, which
demonstrate the same pattern, can be obtained from the first author.
6 A full description of the simulation study including the Mathcad
code can be obtained from the first author.
 are considered. The distributions based on the North
American Male Offenders are in Table 2 . For ease of
interpretation, we tabulated the distributions in 5-point
ranges. Examination of the sub-table for ICC1 =  .80
indicates that if one rater gives a score between 30 and 34,
i.e., just above the diagnostic cut-off then only in 46% of
occasions—approximately half the time—will the other
rater obtain a score within the same range. In 44% of the
occasions, the second rater would place the individual
below the critical cut-off. Even in the best case scenario,
i.e., ICC1 =  .90, if one rater gives a score between 30 and
34 then only in 60% of occasions will the other rater obtain
a score within the same range. On 29% of occasions, the
second rater would place the participant below the critical
cut-off.
The distributions based on the UK prisoners are in
Table 3 . Examination of the table for ICC1 =  .80 indicates
that if one rater gives a score between 30 and 34, i.e., just
above the diagnostic cut-off then only in 39% of occasions
will the second rater obtain a score within the same range. In
54% of the cases, the second rater would place the individual
below the critical cut-off. As previously, even in the best case
scenario, i.e., ICC1 =  .90, if one rater gives a score between
30 and 34 then only in 53% of cases will the other rater obtain
a score within the same range. In 39% of cases, the second
rater would place the participant below the critical cut-off.
In the UK, the cut-off of 25, as well as 30, is often applied
(DSPD Programme, 2005 ; Hare, 2003 ). Examination of the
table for ICC1 =  .80 indicates that if one rater gives a score
between 25 and 29, i.e., just above the UK diagnostic cut-off,
then only in 29% of occasions will the other rater obtain a
score within the same range. On 49% of occasions, the second
rater would place the individual below the critical cutoff.
Even in the best case scenario, i.e., ICC1 =  .90, if one
rater gives a score between 25 and 29 then only in 37% of
cases will the other rater obtain a score within the same
range. On 37% of occasions, the second rater would place the
participant below the critical cut-off.
Therefore, in broad terms, all of the findings reported
above demonstrate that the allocation of an individual
above or below diagnostic cut-offs is much less precise
than previously thought.
Another way of considering the precision of PCL-R
scores is to examine expected discrepancies in scores based
on variations in ICC1  while taking into account the distributional
characteristics of the PCL-R scores. The PCL-R
manual suggests that in 68% of cases the discrepancies
between two raters should be up to 3 points, and in 95% of
cases it should be up to 6 points (Hare, 2003 ). This assumes
normality of the PCL-R score distribution, an assumption
that is not met (see above). The cumulative distribution of
score discrepancies estimated from the Monte Carlo studies
are tabulated in Table 4 . With the North American prisoner
sample and an ICC1  of .80, a discrepancy of between 8 and
9 points would be expected in 9% of cases, around 10
points in 5% of cases, and between 12 and 13 points in 1%
of cases. With the UK prisoner sample, and an ICC1  of .80,
a discrepancy of between 8 and 9 points would be expected
in 23% of cases, around 10 points in 5% of cases, and
around 12 points in 1% of cases.
An alternative approach to summarize the range of
possible discrepancies is to estimate the distribution of a
2nd PCL-R rating given the 1st PCL-R rating. This conditional
distribution can be summarized by a CI that
contains 95% of the 2nd ratings. This interval is thus
defined by the lower and upper limits LL and UL given by
Prob(LL ≤ 2nd rating ≤ UL | 1st rating) ≥ 0.95.
 Results for both 68% and 95% CIs for ICC1 =  .80, and
for both samples, are presented in Table 5 . For example, in
the North American prisoner sample, if rater one obtains a
total score of 30, then the 95% CI for rater two’s total score
will be between 19 and 36 (i.e., between the 35th and 99th
percentile).
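Under a bivariate-normal approximation, the conditional distribution of the second rating given the first is itself normal, with mean μ + ρ(x1 − μ) and SD σ√(1 − ρ²), which reproduces intervals of roughly the width shown in Table 5 (a sketch; μ = 22 and σ = 8 are stand-in values for the North American sample, and the normal model ignores the skew and the 40-point ceiling that pull the article's upper limit down to 36):

```python
import math

mu, sigma, rho = 22.0, 8.0, 0.80   # illustrative parameters
x1 = 30.0                          # first rater's total score

cond_mean = mu + rho * (x1 - mu)            # 28.4
cond_sd = sigma * math.sqrt(1 - rho ** 2)   # 4.8

lo = cond_mean - 1.96 * cond_sd
hi = cond_mean + 1.96 * cond_sd
print(f"approx. 95% interval for the 2nd rating: {lo:.0f} to {hi:.0f}")
```

This normal-theory approximation gives roughly 19 to 38; the article's Monte Carlo estimate, using the real skewed and bounded distribution, gives 19 to 36.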
All the estimates in this study are conservative; that is,
they assume that the SEM that applies at the mean applies
Table 1 Kappa coefficients and levels of agreement for four levels of correlation (ρ) for two distributions
ρ    Both < 30    Both ≥ 30    Different    κ    Both < 25    Both ≥ 25    Different    κ
North American male offenders
0.75 72.9 10.6 16.4 .46 48.0 29.6 22.4 .54
0.80 73.3 11.5 15.1 .51 48.9 31.0 20.0 .59
0.85 74.4 12.2 13.5 .56 50.5 32.3 17.2 .64
0.90 75.5 13.4 11.1 .64 51.7 34.2 14.1 .71
United Kingdom prisoners
0.75 91.8 2.1 6.1 .38 79.8 7.6 12.6 .47
0.80 92.0 2.4 5.6 .43 83.1 6.8 10.2 .52
0.85 92.4 2.7 5.0 .49 81.2 9.1 9.7 .60
0.90 92.7 3.1 4.2 .57 82.0 10.1 7.9 .67
 around the cut-off. However, this is an unwarranted
assumption. The overall variance of errors of measurement is
a weighted average of the errors that pertain across the range
of true score values. Precision of measurement of the PCL-R
drops as scores approach the diagnostic cut-off (e.g., Cooke
& Michie, 1997 ; Cooke et al., 2006 ). Thus, the degree of
diagnostic misclassification and score discrepancy is likely
to be greater in practice than demonstrated in the simulation
above. The conditional SEM  (CSEM)7  is the square root of
the variance of errors at a particular level of true scores. To
Table 2 Distribution of diagnostic disagreements by four levels of
correlation between raters based on distribution of North American
male offenders
PCL-R score
0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40
q = 0.75
0–4 .209 .086 .033 .006 0 0 0 0
5–9 .505 .305 .200 .079 .008 0 0 0
10–14 .245 .318 .250 .188 .072 .005 0 0
15–19 .041 .235 .266 .270 .239 .078 .002 0
20–24 0 .053 .192 .256 .310 .281 .097 0
25–29 0 .002 .056 .155 .236 .369 .371 .130
30–34 0 0 .003 .044 .121 .230 .445 .551
35–40 0 0 0 .001 .014 .037 .085 .319
q = 0.8
0–4 .245 .096 .032 .002 0 0 0 0
5–9 .523 .372 .211 .065 .003 0 0 0
10–14 .217 .315 .288 .198 .055 .001 0 0
15–19 .014 .198 .280 .306 .240 .055 0 0
20–24 0 .019 .162 .269 .340 .283 .063 0
25–29 0 0 .026 .137 .247 .391 .379 .076
30–34 0 0 0 .023 .106 .237 .465 .585
35–40 0 0 0 0 .010 .031 .093 .339
q = 0.85
0–4 .285 .105 .021 0 0 0 0 0
5–9 .552 .423 .217 .038 0 0 0 0
10–14 .158 .328 .333 .202 .028 0 0 0
15–19 .005 .141 .303 .354 .231 .024 0 0
20–24 0 .003 .121 .287 .386 .265 .029 0
25–29 0 0 .005 .111 .266 .430 .351 .038
30–34 0 0 0 .008 .085 .252 .519 .569
35–40 0 0 0 0 .003 .029 .101 .393
q = 0.9
0–4 .361 .103 .009 0 0 0 0 0
5–9 .578 .486 .215 .012 0 0 0 0
10–14 .061 .348 .390 .190 .010 0 0 0
15–19 0 .063 .319 .422 .216 .005 0 0
20–24 0 0 .067 .304 .449 .238 .004 0
25–29 0 0 0 .071 .267 .500 .289 .006
30–34 0 0 0 0 .056 .239 .609 .494
35–40 0 0 0 0 0 .018 .098 .501
The tables show column percentages, which sum to 1 within rounding
error. The rows therefore do not sum to 1
Table 3 Distribution of diagnostic disagreements by four levels of correlation
between raters based on distribution of UK prisoners
PCL-R score
0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40
q = 0.75
0–4 .395 .158 .064 .020 .002 0 0 0
5–9 .477 .352 .228 .107 .035 .002 0 0
10–14 .128 .332 .309 .219 .131 .032 0 0
15–19 0 .153 .271 .336 .265 .198 .041 0
20–24 0 .005 .119 .215 .316 .297 .266 .031
25–29 0 0 .009 .086 .162 .261 .291 .254
30–34 0 0 0 .017 .083 .191 .328 .529
35–40 0 0 0 0 .006 .020 .075 .186
q = 0.8
0–4 .438 .163 .059 .011 0 0 0 0
5–9 .485 .387 .232 .097 .014 0 0 0
10–14 .077 .354 .328 .229 .111 .016 0 0
15–19 0 .095 .296 .354 .289 .154 .023 0
20–24 0 0 .083 .237 .331 .321 .214 .007
25–29 0 0 .002 .067 .176 .287 .303 .224
30–34 0 0 0 .005 .077 .200 .387 .545
35–40 0 0 0 0 .002 .022 .073 .224
q = 0.85
0–4 .470 .168 .042 .003 0 0 0 0
5–9 .474 .437 .224 .069 .003 0 0 0
10–14 .056 .342 .375 .226 .075 .003 0 0
15–19 0 .053 .308 .404 .287 .106 .001 0
20–24 0 0 .050 .251 .383 .325 .140 0
25–29 0 0 0 .047 .194 .321 .328 .145
30–34 0 0 0 0 .058 .229 .447 .578
35–40 0 0 0 0 0 .017 .084 .277
q = 0.9
0–4 .530 .169 .020 0 0 0 0 0
5–9 .450 .501 .219 .033 0 0 0 0
10–14 .019 .312 .457 .207 .037 0 0 0
15–19 0 .018 .286 .489 .275 .050 0 0
20–24 0 0 .017 .252 .450 .318 .062 0
25–29 0 0 0 .018 .214 .367 .325 .046
30–34 0 0 0 0 .024 .257 .528 .587
35–40 0 0 0 0 0 .008 .085 .367
7 Professional standards indicate that the CSEM is an important piece
of information that should be provided in a test manual. For example,
Standard 2.14 ‘‘Conditional standard error of measurements should be
reported at several score levels if constancy cannot be assumed.
Where cut scores are specified for selection or classification, the
standard errors of measurement should be reported in the vicinity of
each cut score.’’ (American Educational Research Association/
American Psychological Association, 1999; p. 35 emphasis added).
 evaluate the true level of agreement of diagnosis likely to
apply around a cut-off it is necessary to take the CSEM into
account.
Item Response Theory indicates that the error of measurement
varies with location on the trait (θ).

IRT gives SE(θ) = 1/√I(θ), where I(θ) is the information at θ.

CTT gives SEM = SD·√(1 − ρ).

Let ρ1 be the correlation at location 1 (θ1) and ρ2 be the
correlation at location 2 (θ2). Then

ρ2 = 1 − (1 − ρ1) · I(θ1)/I(θ2).

Location 1 is θ = 0.0 (PCL-R = 20) and Location 2 is
θ = 1.0 (PCL-R = 30) (approximate locations from Hare,
2003, Fig. 6.6; see also Cooke & Michie, 1997). Overall,
the impact of the location of the estimated ICC1  is limited,
dropping—at a maximum—from .75 to .69. However, as
noted above, even small drops in ICC1  (e.g., from .85 to
.80) can substantially affect the misclassification rate and
the range of likely score discrepancies (see Table 6 ). It is
noteworthy that the magnitude of the drop appears to be
proportionately larger the poorer the mean estimated level
of inter-rater reliability. This suggests that the effect of the
CSEM is larger in cases that start with a relatively poor
level of inter-rater reliability. Equally, this would suggest
that proportionately greater discrepancies would, in
general, be obtained when factor or facet scores are
considered because they have lower levels of reliability
than the total scores (Hare, 2003 ).
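The adjustment for conditional measurement error can be computed directly. The reported maximum drop from .75 to .69 implies an information ratio I(θ1)/I(θ2) of (1 − .69)/(1 − .75) = 1.24; applying the same ratio to higher starting reliabilities illustrates the proportionately smaller effect (a sketch; the ratio is back-calculated from the reported figures rather than given in the article):

```python
def adjusted_reliability(rho1: float, info_ratio: float) -> float:
    """rho2 = 1 - (1 - rho1) * I(theta1) / I(theta2)."""
    return 1 - (1 - rho1) * info_ratio

# information ratio implied by the reported drop from .75 to .69
ratio = (1 - 0.69) / (1 - 0.75)   # = 1.24

for rho1 in (0.75, 0.80, 0.85, 0.90):
    rho2 = adjusted_reliability(rho1, ratio)
    print(f"ICC1 at the mean {rho1:.2f} -> near the cut-off {rho2:.3f}")
```

The absolute drop shrinks as the starting reliability rises (.75 falls by .06, .90 by only about .024), consistent with the observation that the CSEM effect is largest where inter-rater reliability is already poor.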
STUDY TWO
 The use of the PCL-R in court is frequently justified based
on its predictive utility, the support being garnered from
between-subject designs (Edens & Petrila, 2006 ; Hare,
2003 ; Walsh & Walsh, 2006 ). In this study, we are concerned
with the individual. We examine the confidence that
can be placed in a prediction that an individual with a
particular PCL-R score will be reconvicted for a violent
offence.
All measurements and estimates entail error. As noted
above, the degree of error is expressed by CIs. For
Table 4 Cumulative distribution of expected discrepancies between two raters for different levels of correlation based on two sample
distributions
Point discrepancy    SEM^a    North American male offenders (correlation: 0.75, 0.80, 0.85, 0.90)    United Kingdom prisoners (correlation: 0.75, 0.80, 0.85, 0.90)
0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
1 .741 .934 .927 .919 .901 .936 .928 .918 .900
2 .503 .804 .785 .758 .704 .804 .785 .758 .703
3 .317 .679 .647 .595 .518 .681 .647 .595 .518
4 .303 .560 .516 .453 .352 .561 .518 .454 .357
5 .095 .449 .397 .327 .217 .452 .400 .332 .222
6 .046 .349 .292 .215 .121 .353 .298 .221 .125
7 .020 .262 .205 .136 .058 .269 .212 .141 .063
8 .007 .190 .137 .079 .023 .196 .142 .083 .026
9 .002 .132 .086 .041 .007 .137 .090 .044 .009
10 .001 .088 .050 .018 .002 .091 .054 .021 .003
11 .055 .026 .007 .058 .030 .009 .001
12 .032 .012 .002 .035 .015 .004
13 .017 .005 .019 .007 .001
14 .008 .002 .010 .003
15 .003 .005 .001
16 .001 .002
17 .001
a This column shows the cumulative distribution of discrepancies which was calculated assuming that discrepancies between two raters are
normally distributed and that the SEM is 3 (Hare, 2003, pp. 66–67)
example, while the mean rate of reoffending for a ''High
Risk'' group may be estimated as being 55%, the 95% CI
indicates that the true value of the mean rate of reoffending
for this group will lie between 44% and 66%, 95% of the
time, i.e., 19 times out of 20 (Hart, Michie, & Cooke,
2007). However, the clinician and the decision maker are
2007 ). However, the clinician and the decision maker are
interested in the individual case not  the group. Therefore,
how much confidence can the clinician and decision maker
have in predictions of reoffending in the individual case
based on PCL-R scores? We examine CIs for group and
individual predictions.
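The group-level interval quoted above is easy to reproduce with a normal-approximation CI for a proportion: a mean reoffending rate of 55% with a 95% CI of 44% to 66% is what one obtains from a high-risk group of about 79 cases (a sketch; the group size is back-calculated from the interval width for illustration, not taken from Hart et al., 2007):

```python
import math

p_hat = 0.55   # estimated group reoffending rate
n = 79         # illustrative group size back-calculated from the CI width

se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI for the group rate: {lo:.2f} to {hi:.2f}")  # 0.44 to 0.66
```

For the individual case the outcome is binary, so no comparable narrowing is available; how wide the corresponding individual interval becomes is the question Study Two takes up.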
Participants
 Two hundred fifty-five male prisoners between 18 and
40 years of age (M =  26.8, SD =  5.9) were interviewed in
Scotland’s largest prison for a study of psychological
characteristics and violence (Cooke, Michie, & Ryan,
2001 ; Michie & Cooke, 2006 ). Prisoners were selected by
systematic random sampling of the prison. The average
sentence length was 39 months (SD =  23 months; range
=  3 months to 10 years and life).
PCL-R Ratings
 PCL-R ratings were made according to instructions in the
test manual (Hare, 1991 ). All PCL-R evaluations were
conducted by trained raters using both interview and file
review (ICC1 =  .86).
Assessment of Recidivism
 Reconviction data were obtained from two sources: The
Scottish Criminal Records Office (SCRO) and the Police
National Computer (PNC). The average follow-up period
was 29 months. The point-biserial correlation between
PCL-R scores and recidivism (r =  .31) was above average
for the field (Walters, 2003 ). For the purposes of illustration,
we consider reconviction for violence that resulted in
a prison sentence (i.e., generally a more serious violent
offence). Follow-up data were available for 190 cases and
PCL-R data for 184 of these.
Table 5 The 68% and 95% confidence intervals for 2nd PCL-R total
score given 1st PCL-R score and ICC = 0.8
1st PCL-R    North American prisoners: LL.95  LL.68  UL.68  UL.95    UK prisoners: LL.95  LL.68  UL.68  UL.95
0 0 0 9 12 0 0 8 13
1 0 0 10 13