Psychiatry, Psychology and Predictive Criteria
Write a 1–2 page response paper discussing the utility of using psychological measures, as suggested below, in the diagnosis of mental illness: There is a movement in psychology to require the use of psychological measures to aid in the diagnosis of some mental illnesses. While developing the DSM-V, there was some discussion of including the results of specific measures as part of diagnostic criteria. Consider the two articles attached when drafting this short paper.

Diagnostic utility of the NAB List Learning test in Alzheimer's disease and amnestic mild cognitive impairment

Abstract

Measures of episodic memory are often used to identify Alzheimer's disease (AD) and mild cognitive impairment (MCI). The Neuropsychological Assessment Battery (NAB) List Learning test is a promising tool for the memory assessment of older adults due to its simplicity of administration, good psychometric properties, equivalent forms, and extensive normative data. This study examined the diagnostic utility of the NAB List Learning test for differentiating cognitively healthy, MCI, and AD groups. One hundred fifty-three participants (age: range = 57-94 years, M = 74 years, SD = 8 years; sex: 61% women) were diagnosed by a multidisciplinary consensus team as cognitively normal, amnestic MCI (aMCI; single and multiple domain), or AD, independent of NAB List Learning performance. In univariate analyses, receiver operating characteristic curve analyses were conducted for four demographically corrected NAB List Learning variables. Additionally, multivariate ordinal logistic regression and fivefold cross-validation were used to create and validate a predictive model based on demographic variables and NAB List Learning test raw scores. At optimal cutoff scores, univariate sensitivity values ranged from .58 to .92 and univariate specificity values ranged from .52 to .97. Multivariate ordinal regression produced a model that classified individuals with 80% accuracy and good predictive power.
(JINS, 2009, 15, 121-129.)

INTRODUCTION

Alzheimer's disease (AD) compromises episodic memory systems, resulting in the earliest symptoms of the disease (Budson & Price, 2005). Measures of anterograde episodic memory are useful in quantifying memory impairment and identifying performance patterns consistent with AD or its prodromal phase, mild cognitive impairment (MCI; Blacker et al., 2007; Salmon et al., 2002). List learning tests are commonly used measures of episodic memory that offer a means of evaluating a multitude of variables relevant to learning and memory. Some of the more common verbal list learning tasks are the California Verbal Learning Test (CVLT; Delis et al., 1987), Auditory Verbal Learning Test (AVLT; Rey, 1941, 1964), Hopkins Verbal Learning Test (HVLT; Brandt & Benedict, 2001), and the Word List Recall test from the Consortium to Establish a Registry for Alzheimer's Disease (CERAD; Morris et al., 1989). List learning tests have been shown to possess adequate sensitivity and specificity in differentiating participants with MCI (Mdn: sensitivity = .67, specificity = .86) (Ivanoiu et al., 2005; Karrasch et al., 2005; Schrijnemaekers et al., 2006; Woodard et al., 2005) and AD (Mdn: sensitivity = .80, specificity = .89) from controls (Bertolucci et al., 2001; Derrer et al., 2001; Ivanoiu et al., 2005; Karrasch et al., 2005; Kuslansky et al., 2004; Salmon et al., 2002; Schoenberg et al., 2006), as well as AD from MCI (Mdn: sensitivity = .85, specificity = .83) (de Jager et al., 2003). The current study was undertaken to evaluate the diagnostic utility of a new list learning test in a sample of older adults seen as part of a prospective study on aging and dementia.
The Neuropsychological Assessment Battery (NAB; Stern & White, 2003a, b) is a recently-developed comprehensive neuropsychological battery that has been standardized for use with individuals ages 18 to 97. It contains several measures of episodic memory, including a List Learning test similar to other commonly used verbal list learning tests. The NAB List Learning test was developed to "create a three trial learning test to avoid the potential difficulties that five trial tasks represent for impaired individuals, include three semantic categories to allow for examination of the use of semantic clustering as a learning strategy, avoid sex, education, and other potential biases, and include both free recall and forced-choice recognition paradigms" (White & Stern, 2003, p. 24). One major benefit of the NAB includes the fact that all of its 33 subtests, together encompassing the major domains of neuropsychological functioning, are co-normed on the same large sample of individuals (n = 1448), with demographic adjustments available for age, sex, and education. This normative group contains a large proportion of individuals ages 60 to 97 (n = 841), making it particularly well suited for use in dementia evaluations. Despite psychometric validation (White & Stern, 2003), its diagnostic utility has yet to be evaluated. For the last several years, several NAB measures have been included in the standard research battery in the Boston University (BU) Alzheimer's Disease Core Center (ADCC) Research Registry. The BU ADCC recruits both healthy and cognitively impaired older adults for comprehensive yearly neurological and neuropsychological assessments. After each individual is assessed, a multidisciplinary consensus diagnostic conference is held to diagnose each individual based on accepted diagnostic criteria. Importantly, the NAB measures have yet to be included for consideration when the consensus team meets to diagnose study participants. 
Therefore, the current study setting offers optimal clinical conditions (i.e., without neuropathological confirmation) for evaluating the diagnostic utility of the NAB List Learning test. In other words, NAB performance can be judged against current clinical diagnostic criteria without the tautological error that occurs when the reference standard is based on the test under investigation. Samples of participants from the BU ADCC Registry were used to evaluate the utility of the NAB List Learning test in the diagnosis of amnestic (a)MCI and AD. As the diagnostic utility of the NAB List Learning test has yet to be examined empirically, the present study was considered exploratory.

METHOD

Participants

Participant data were drawn from an existing database, the BU ADCC Research Registry, and retrospectively analyzed. Participants were recruited from the greater Boston area through a variety of methods, including newspaper advertisements, physician referrals, community lectures, and referrals from other studies. Participants diagnosed as cognitively healthy controls consisted of community-dwelling older adults, many of whom had neither expressed concern about nor been evaluated clinically for cognitive difficulties. Data collection and diagnostic procedures have been described in detail elsewhere (see Ashendorf et al., 2008; Jefferson et al., 2006). Briefly, after undergoing a comprehensive participant and informant interview, clinical history taking (i.e., psychosocial, medical), and assessment (i.e., neurological, neuropsychological), participants were diagnosed by a multidisciplinary consensus group that included at least two board-certified neurologists, two neuropsychologists, and a nurse practitioner. Of an initial pool of 490 participants, 18 were excluded from the present study because English was not their primary language. An additional 172 were excluded because they were not diagnosed as control, aMCI, or AD.
Of the remaining 300 participants, 153 completed all relevant portions of the NAB List Learning test. These 153 participants comprised the current sample, from which three groups were established: controls (i.e., cognitively normal older adults), participants diagnosed with single or multiple domain aMCI (based on Winblad et al., 2004), and participants diagnosed with possible or probable AD (based on NINCDS-ADRDA criteria; McKhann et al., 1984). The sample consisted of 93 women (60.8%) and 60 men (39.2%), ranging in age from 57 to 94 (M = 73.9; SD = 8.1). There were 128 (83.7%) non-Hispanic Caucasian participants and 25 (16.3%) African American participants. The data used in the current study were collected between 2005 and 2007 at each participant's most recent assessment, which ranged from the first to ninth visit (Mdn = 4.0) of their longitudinal participation.

Measures

Procedure

The BU ADCC Research Registry data collection procedures were approved by the Boston University Medical Center Institutional Review Board. All participants provided written informed consent to participate in the study. Participants were administered a comprehensive neuropsychological test battery designed for the assessment of individuals with known or suspected dementia, including all tests that make up the Uniform Data Set (UDS) of the National Alzheimer's Coordinating Center (Beekly et al., 2007; Morris et al., 2006). Neuropsychological assessment was carried out by a trained psychometrist in a single session. The identification of cognitive impairment in each of the domains assessed (language, memory, attention, visuospatial functioning, and executive functioning) was based on BU ADCC Research Registry procedures, which defined psychometric impairment a priori as a standardized score (e.g., Z-score, T-score) greater than or equal to 1.5 standard deviation units below appropriate normative means on one or more "primary" variables.
Primary variables in the memory domain include Trial 3 and Delayed Recall from the CERAD Word List, and both Immediate and Delayed portions of the Logical Memory and Visual Reproduction subtests from the Wechsler Memory Scale-Revised (WMS-R; Wechsler, 1987). WMS-R subtests were administered according to UDS procedures (e.g., only Story A from Logical Memory is administered) and no other WMS-R subtests were used. In addition to neuropsychological testing, participant information was also obtained via clinical interview with the participant and a close informant, neurological evaluation, review of medical history, and informant questionnaires.

Diagnosis

The results from the "primary" neuropsychological variables were used by the multidisciplinary consensus team, along with social and medical history, neurological examination results, and self/informant report (i.e., interviews and questionnaires), to arrive at a diagnosis for each participant. Diagnoses were made based only on information obtained during the participant's most recent visit. The NAB List Learning test was not a "primary" neuropsychological variable and, thus, was not considered for diagnostic purposes by the multidisciplinary consensus team.

Data Analysis

RESULTS

A breakdown of the participant demographics among the three diagnostic groups is provided in Table 1. Table 1 also depicts the level of global impairment for each group, based on both Clinical Dementia Rating (CDR; Morris, 1993) Global Score and Mini-Mental State Exam (MMSE; Folstein et al., 1975) scores. Significant group differences were found on age (control < aMCI = AD), education (control > aMCI = AD), CDR Global Score, and average MMSE score (control > aMCI > AD).

Table 1. Participant demographics and test results. Note. aMCI = amnestic mild cognitive impairment; AD = Alzheimer's disease; AA = African American; CDR = Clinical Dementia Rating; MMSE = Mini-Mental State Examination; NAB = Neuropsychological Assessment Battery.
a Possible AD: n = 6; Probable AD: n = 20.

Univariate Analyses

Independent samples t tests demonstrated significant group differences on each of the four NAB List Learning variables (Table 1). ROC curve analyses for the NAB List Learning variables are presented in Table 2. The cutoff scores presented in Table 2 were chosen to maximize sensitivity and specificity, with equal emphasis on both (Youden, 1950). The individual NAB List Learning test variables were able to differentiate aMCI from controls (Mdn: sensitivity = .73; specificity = .71), AD from controls (Mdn: sensitivity = .89; specificity = .94), and AD from aMCI (Mdn: sensitivity = .69; specificity = .78). Additional prevalence-free classification accuracy statistics (i.e., those independent of base rates, such as sensitivity, specificity, PLR, and NLR) for conventional cutoff scores are provided in Table 3.

Table 2. Prevalence-free classification accuracy statistics for NAB List Learning variables at optimal cutoff scores. Note. PLR = Positive Likelihood Ratio; NLR = Negative Likelihood Ratio; aMCI = Amnestic mild cognitive impairment; AD = Alzheimer's disease.

Table 3. Prevalence-free diagnostic accuracy statistics for NAB List Learning variables at conventional cutoff scores. Note. aMCI = Amnestic mild cognitive impairment; AD = Alzheimer's disease; Sn = Sensitivity; Sp = Specificity; PLR = Positive Likelihood Ratio; NLR = Negative Likelihood Ratio. Dashes represent a value of positive infinity due to a specificity of 1.00. a Includes both aMCI and AD.

Multivariate Analyses

Likelihood ratio and goodness-of-fit tests revealed that the multiple ordinal logistic regression model explained a significant portion of outcome variance and fit the data well, -2 Log Likelihood χ2(4, n = 153) = 127.80, p < .01; Pearson Goodness of Fit χ2(298, n = 153) = 216.86, p = 1.00.
Of the four independent variables, List B Immediate Recall (parameter estimate = -0.05; 95% CI = -0.09 to -0.01; Wald χ2(1, n = 153) = 5.30; p = .02) and List A Short Delay Recall (parameter estimate = -0.10; 95% CI = -0.15 to -0.05; Wald χ2(1, n = 153) = 14.5; p < .001) were found to be the two that contributed significantly to the model. List A Immediate Recall (parameter estimate = -0.02; 95% CI = -0.07 to 0.02; Wald χ2(1, n = 153) = 1.33; p = .25) and List A Long Delay Recall (parameter estimate = -0.03; 95% CI = -0.08 to 0.02; Wald χ2(1, n = 153) = 1.11; p = .29) were not significant contributors to the model.

Cross-validation

The estimated classification accuracy of the model using cross-validation was 80% (95% CI = 72-88%). In identifying aMCI, the model yielded a sensitivity of .47 (95% CI = .17-.77) and a specificity of .91 (95% CI = .83-.99; PLR = 4.96; NLR = .59). In identifying AD, the model yielded a sensitivity of .65 (95% CI = .41-.89) and a specificity of .97 (95% CI = .94-.99; PLR = 21.18; NLR = .36). A frequency table of predicted by actual diagnosis is presented in Table 4. Table 5 presents the positive predictive power (PPP) and negative predictive power (NPP) of the ordinal model across a range of clinically relevant base rates.

Table 4. Frequency of predicted diagnosis by actual consensus diagnosis. Note. aMCI = Amnestic mild cognitive impairment; AD = Alzheimer's disease.

Table 5. Positive and negative predictive power of the ordinal NAB List Learning model at various base rates. Note. aMCI = Amnestic mild cognitive impairment; PPP = Positive Predictive Power; NPP = Negative Predictive Power; AD = Alzheimer's disease.

DISCUSSION

The results of this study show that the NAB List Learning test can differentiate between cognitively normal older adults and those with aMCI and AD.
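The fivefold cross-validation scheme reported above can be sketched as follows. This is a generic illustration of the k-fold idea, not the authors' code: the sample is split into five folds, and each fold serves once as the held-out test set while the model is fit on the remaining four.

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists,
    with each fold serving once as the held-out test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With n = 153 (the study's sample size), every participant is held out
# exactly once across the five folds.
sizes = [len(test) for _, test in kfold_indices(153)]
print(sizes)  # [31, 31, 31, 30, 30]
```

Pooling predictions over the five held-out folds yields the overall accuracy estimate (here, 80%) without ever testing the model on data it was fit to.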
Univariate analyses showed that each of the four variables was able to make dichotomous classifications with sensitivity values ranging from .58 to .92 and specificity values ranging from .52 to .97. For instance, AD was differentiated from controls with over 90% sensitivity and specificity using a cutoff score of T ≤ 37 on List A Short Delay Recall or T ≤ 40 on List A Long Delay Recall. In addition, AD was differentiated from aMCI with over 70% sensitivity and 80% specificity using a cutoff score of T ≤ 30 on List A Short Delay Recall (see Table 2). The multivariate ordinal logistic regression model, which incorporated four NAB List Learning variables, yielded an overall accuracy estimate of 80% based on fivefold cross-validation. In particular, the model was able to identify participants diagnosed with aMCI and AD with high specificity (.91 for aMCI and .97 for AD), but lower sensitivity (.47 for aMCI and .65 for AD). Taking prevalence into account, the ordinal logistic regression model was found to perform best when ruling out aMCI or AD (i.e., higher NPP) at lower base rates and when ruling in aMCI or AD (i.e., higher PPP) at higher base rates (Table 5). More specifically, in settings with clinical base rates of aMCI and AD at 20% or below, good performance on the NAB List Learning test can yield high confidence (i.e., NPP ≥ .87) that the patient would not be diagnosed as aMCI or AD by our consensus team. Similarly, in a setting with base rates of aMCI or AD around 50% or greater, as may be seen in a memory disorders clinic, poor performance on the NAB List Learning test can provide a high degree of confidence (i.e., PPP ≥ .72) that the patient would be given a diagnosis of aMCI or AD by our consensus team. It should be noted that the current sample excluded individuals who did not complete the entire NAB List Learning test, which, for some participants, was due to excessive cognitive impairment.
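The base-rate adjustment behind this reasoning follows directly from Bayes' theorem. The sketch below is not the authors' code, and it applies the relationship to the cross-validated AD values reported above (sensitivity = .65, specificity = .97) as a simple dichotomous illustration, so its outputs are illustrative rather than reproductions of Table 5, which was derived from the full ordinal model:

```python
def predictive_power(sens, spec, base_rate):
    """Positive and negative predictive power of a test with the given
    sensitivity/specificity in a population with the given base rate."""
    # PPP = P(disease | positive test); NPP = P(no disease | negative test)
    ppp = sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))
    npp = spec * (1 - base_rate) / (spec * (1 - base_rate) + (1 - sens) * base_rate)
    return ppp, npp

# PPP rises and NPP falls as the base rate increases:
for br in (0.10, 0.20, 0.50):
    ppp, npp = predictive_power(0.65, 0.97, br)
    print(f"base rate {br:.0%}: PPP = {ppp:.2f}, NPP = {npp:.2f}")
```

This is why the same test supports ruling out in low-base-rate settings (high NPP) and ruling in within a memory disorders clinic (high PPP).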
In addition, the participants with AD in the current study were predominantly in the very mild (CDR = 0.5, n = 5 [19%]) to mild (CDR = 1.0; n = 13 [50%]) stages. Because the current sample is generally free from severe impairment, it may be a valid representation of the types of patients that clinicians are asked to evaluate for early diagnosis. Of the four variables entered into the ordinal logistic regression model, only two were found to contribute significantly: List B Immediate Recall and List A Short Delay Recall. Despite these findings, the results do not necessarily suggest that the nonsignificant variables lack value in differentiating healthy controls from individuals with aMCI from those with AD; in fact, both List A Immediate Recall and List A Long Delay Recall, in isolation, can differentiate control, aMCI, and AD groups with sensitivity values ranging from .58 to .92 and specificity values ranging from .52 to .97 (see Table 2). However, the results do suggest that these nonsignificant variables do not lead to a significant increase in explanatory power beyond what can be attained after considering List B Immediate Recall and List A Short Delay Recall performance. Despite the fact that the MCI and AD groups were older and less educated than the control group, these demographic differences are unlikely to be contributing to the current results. Although age and education differ across groups, the use of demographically-corrected normative data protects against their potential confounding influence. In other words, the use of demographically-corrected norms prevents age and education from being associated with the independent variables. In fact, the NAB Psychometric and Technical Manual (White & Stern, 2003) illustrates that age accounts for 0.0% of the variance and education accounts for 0.0% to 0.2% of the variance in scores on the independent variables that were used in the current study. 
The classification accuracy of the NAB List Learning test compares favorably to published data on other list learning tests. For instance, the median sensitivity and specificity values of the individual NAB List Learning variables are generally on par with those seen in tests such as the CVLT, AVLT, HVLT, and the CERAD Word List (Table 6). More specifically, for example, a recent study found that long delay free recall on the CVLT differentiated AD from controls with a sensitivity of .98 and a specificity of .88 (Salmon et al., 2002), similar to the reported values of NAB List A Long Delay Recall in the current study (sensitivity = .92, specificity = .97). However, a major strength of the current study is that it validates a single model, developed using multiple ordinal logistic regression, that combines several list learning variables simultaneously to discriminate between three diagnostic groups (i.e., control, aMCI, and AD). One advantage of this ordinal logistic regression model is that it combines the NAB List Learning variables quantitatively, yielding results that can be integrated with applicable base rates to estimate diagnostic likelihood. The use of this model allows for an empirically-validated, quantitative method of combining important variables, as opposed to using clinical judgment for "profile" analysis, which may be susceptible to limitations in human cognitive processing, such as interpreting patterns among multiple neuropsychological test variables (Wedding & Faust, 1989). Table 6. Comparison of sensitivity and specificity between individual variables from the NAB and the AVLT, CERAD, CVLT, and HVLT Note. 
MCI = Mild cognitive impairment; CERAD = Consortium to Establish a Registry for Alzheimer's Disease; IR = Immediate Recall; DR = Delayed Recall; %Ret = Percent Retention; HVLT = Hopkins Verbal Learning Test; NAB = Neuropsychological Assessment Battery; SDR = Short Delay Recall; LDR = Long Delay Recall; AD = Alzheimer's disease; CVLT = California Verbal Learning Test; AVLT = Auditory Verbal Learning Test. a Karrasch et al. (2005). b Ivanoiu et al. (2005). c Woodard et al. (2005). d Schrijnemaekers et al. (2006). e Current study (see Table 2). f Bertolucci et al. (2001). g Derrer et al. (2001). h Salmon et al. (2002). i Kuslansky et al. (2004). j Schoenberg et al. (2006). k de Jager et al. (2003). Although the current findings support the diagnostic utility of the NAB List Learning test, the generalizability of the current results is limited. For instance, the sample is highly educated; data were collected in a research setting where many individuals volunteered due to self-awareness of memory difficulties; and the specifics of the reference standard, such as the clinicians participating in the consensus team and the assessment protocol used, are unique to our setting. Although the sample contains a fair number of African American participants (16%), representation of other minority groups is lacking. An additional limitation is the fact that the NAB List Learning test was not directly compared with other list learning tests in the same sample, precluding more definitive statements about its diagnostic accuracy in relationship to alternate tests. Finally, the results are limited by the reference standard that was used to establish a diagnosis. 
Despite the documented advantages of actuarial approaches over subjective approaches to clinical decision making (Dawes et al., 1989; Grove et al., 2000), it is important to emphasize that the reference standard used in the current study is a multidisciplinary consensus diagnosis based on contemporary clinical diagnostic criteria, not neuropathological diagnosis. At the present time, diagnosis of definite AD requires neuropathological confirmation (McKhann et al., 1984). Consequently, the classification accuracy statistics reported herein cannot be interpreted to reflect the likelihood that a patient actually has AD; instead, they indicate the likelihood that this specific consensus diagnostic team would make a particular diagnosis when using the assessment methods described above. It should also be noted that the consensus diagnosis was made, in part, on the basis of other neuropsychological tests, some of which are methodologically and psychometrically similar to the NAB List Learning test. This may have introduced an inherent and unavoidable source of bias. However, the diagnoses were based on consensus after consideration of a wide range of information, thus reducing the likelihood that shared method variance between the NAB List Learning test and other episodic memory measures would have caused significant tautological concerns. From a methodological standpoint, there are other limitations that require future study. The data were analyzed retrospectively and at various points in the longitudinal assessment of participants. An important line of future research would be to longitudinally follow individuals diagnosed with aMCI to prospectively examine whether NAB List Learning test performance is associated with AD progression. Because the current study does not include other dementia subtypes, future studies should also examine non-AD dementias.
Finally, to limit the number of predictor variables in the ordinal logistic regression model, the NAB List Learning variables that are considered "secondary" or "descriptive" (White & Stern, 2003) were excluded. However, these additional variables may add diagnostic utility to the List Learning test, and future study is warranted. Despite its limitations, the current study has several strengths. For instance, diagnostic accuracy statistics are provided for a large number of cutoff scores, giving users of the test considerable flexibility in interpreting test results. For example, depending on the desired purpose of the examination, users may wish to choose cutoff scores that place a higher value on sensitivity (e.g., clinical settings, where false positive errors may be preferable to false negative errors) or specificity (e.g., research settings, where false negative errors may be preferable to false positive errors). Users of the test may choose to interpret results using traditional cutoff scores (e.g., Z-scores ≤ 1.5 or 2.0 standard deviations below the mean), or to use the empirically derived cutoff scores presented herein to emphasize sensitivity and specificity equally. In addition, test users may choose to examine each test variable individually, or to interpret the overall pattern of test scores using the multiple ordinal logistic regression model, which accounts for performance on the four primary NAB List Learning variables simultaneously. For the latter approach, positive and negative predictive values are provided for a range of base rates, allowing for a more individually tailored approach to test interpretation. An additional strength of the study was the lack of tautological error, as the NAB List Learning test was not used in diagnostic formulations. Instead, NAB List Learning performance was examined independently against the clinical "gold standard," a multidisciplinary consensus diagnostic conference.
The cross-validation of the ordinal logistic regression model allows for examination of the degree of precision in estimates of sensitivity, specificity, and overall accuracy. Based on the reported confidence intervals, there is a good degree of precision in the ordinal model's overall accuracy (accuracy = 80%; 95% CI = 72-88%) and in the model's specificity to the diagnosis of both aMCI (specificity = .91; 95% CI = .83-.99) and AD (specificity = .97; 95% CI = .94-.99). However, in examining the 95% confidence intervals surrounding the sensitivity estimates for both aMCI and AD, it is apparent that the sensitivity of the ordinal model is considerably lower and lacking precision. This may be due in part to the relatively small sizes of the clinical sample and in part due to the negative log-log link function that was used in the multiple ordinal logistic regression model. This link function makes an a priori assumption that the underlying distribution of the data is skewed toward "normality." In other words, the model was chosen based on the assumption that the prevalence of healthy controls is greater than the prevalence of individuals with aMCI and AD. As a result, the ordinal logistic regression model may be more prone to false negative errors (i.e., reduced sensitivity) than to false positive errors (i.e., reduced specificity). This decreased sensitivity to aMCI and AD may also reflect the fact that individuals with aMCI and AD perform similarly on measures of episodic memory, and that functional measures may be necessary to improve diagnostic sensitivity once a certain degree of cognitive decline has occurred in an individual. Although the current results present diagnostic accuracy statistics for the NAB List Learning test, it should be emphasized that a diagnosis of aMCI or AD cannot be made on the basis of a single neuropsychological instrument. 
The current results demonstrate that the NAB List Learning test was able to classify older adults into cognitively normal, AD, and aMCI groups with accuracy levels similar to other published list learning tests (Bertolucci et al., 2001; de Jager et al., 2003; Derrer et al., 2001; Ivanoiu et al., 2005; Karrasch et al., 2005; Kuslansky et al., 2004; Salmon et al., 2002; Schoenberg et al., 2006; Schrijnemaekers et al., 2006; Woodard et al., 2005). The NAB List Learning test possesses a large and up-to-date set of demographically-corrected normative data ( n = 1441) and it was co-normed as part of a comprehensive neuropsychological test battery. In addition, it was developed to include two equivalent forms; in fact, in the NAB standardization sample (n = 1448), test form accounted for less than 1.5% of the total variance seen in List Learning performance (White & Stern, 2003), making it suitable for clinical re-evaluation and longitudinal research applications. The findings from the current study, along with the overall strengths of the NAB, suggest that the NAB List Learning test is an appropriate and clinically useful tool for the evaluation of older adults with known or suspected Alzheimer's disease. Although the current study did not directly compare the diagnostic utility of the NAB List Learning test to other list learning measures, the classification accuracy data presented herein are similar to those reported in the literature investigating the diagnostic utility of other list learning tests in control, MCI, and AD samples (see Table 6). Future research is warranted to make direct comparisons of diagnostic utility to other list learning instruments. ACKNOWLEDGMENTS The project described was supported by Grant Number M01 RR000533 and the CTSA Grant Number 1UL1RR025771 from the National Center for Research Resources (NCRR), a component of the National Institute of Health (NIH). 
Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. This research was also supported by P30-AG13846 (Boston University Alzheimer's Disease Core Center), R03-AG026610 (ALJ), R03-AG027480 (ALJ), K12-HD043444 (ALJ), K23-AG030962 (ALJ), R01-HG02213 (RCG), R01-AG09029 (RCG), R01-MH080295 (RAS), and K24-AG027841 (RCG). Robert A. Stern is one of the developers of the NAB and receives royalties from its publisher, Psychological Assessment Resources Inc. Portions of this manuscript were presented at the International Conference on Alzheimer's Disease, Chicago, July 2008. With this exception, the information in this manuscript and the manuscript itself has never been published either electronically or in print. A scoring program based on the multiple ordinal regression model reported in this manuscript is available as an electronic addendum to this manuscript.

Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice

David J. Cooke and Christine Michie

Received: 24 August 2007 / Accepted: 2 February 2009 / Published online: 11 March 2009
© American Psychology-Law Society/Division 41 of the American Psychological Association 2009

Abstract

Knowledge of group tendencies may not assist accurate predictions in the individual case. This has importance for forensic decision making and for the assessment tools routinely applied in forensic evaluations. In this article, we applied Monte Carlo methods to examine diagnostic agreement at different levels of inter-rater agreement given the distributional characteristics of PCL-R scores. Diagnostic agreement and score agreement were substantially less than expected. In addition, we examined the confidence intervals associated with individual predictions of violent recidivism.
On the basis of empirical findings, statistical theory, and logic, we conclude that predictions of future offending cannot be achieved in the individual case with any degree of confidence. We discuss the problems identified in relation to the PCL-R in terms of their broader relevance to all instruments used in forensic decision making.

There is an important disjunction between the perspective of science and the perspective of the law: while science seeks universal principles that apply across cases, the law seeks to apply universal principles to the individual case. Bridging these perspectives is a major challenge for psychology (Faigman, 2007). It is recognized by statisticians that knowledge of group tendencies—even when precise—may not assist accurate evaluation of the individual case (e.g., Colditz, 2001; Henderson & Keiding, 2005; Rockhill, 2001; Tam & Lopman, 2003). It is a statistical truism that the mean of a distribution tells us about everyone, yet no one. This has serious implications for the use of psychological tests in forensic decision making. To illustrate these limitations, we focus on one of the most widely used, and perhaps the most extensively validated, tests in the forensic arena—the Psychopathy Checklist-Revised (PCL-R [1]; Hare, 2003). We emphasize, however, that all psychological tests used in the same way in the forensic arena will suffer from similar limitations (e.g., VRAG, Quinsey, Harris, Rice, & Cormier, 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005). Mental health professionals are frequently asked to opine whether an individual might be violent in the future; psychopathic personality disorder is an important risk factor to consider (Hart, 1998). The PCL-R is the most frequently used measure of psychopathic personality disorder; it has been described as the "gold standard" for that purpose (Edens, Skeem, Cruise, & Cauffman, 2001; as cited in Hare, 2003).
There can be little doubt that the PCL-R has made a major contribution to our understanding of violence (Hart, 1998); nonetheless, it is important for the field to consider both its strengths and its limitations. Findings for this instrument will have implications for less well-validated tools. In this introduction, we consider two issues: first, the use of PCL-R scores in forensic practice and, second, the general problem of the precision of predictions about an individual case.

[Correspondence: D. J. Cooke, C. Michie, Department of Psychology, Glasgow Caledonian University, Glasgow G4 0BA, UK; e-mail: djcooke@rgardens.vianw.co.uk]

[Footnote 1: The PCL-R is a 20-item rating scale of traits and behaviors intended for use in a range of forensic settings. Definitions of each item are provided and evaluators rate the lifetime presence of each item on a 3-point scale (0 = absent, 1 = possibly or partially present, and 2 = definitely present) on the basis of an interview with the participant and a review of case history information.]

Law Hum Behav (2010) 34:259–274, DOI 10.1007/s10979-009-9176-x

PCL-R SCORES AND FORENSIC PRACTICE

Much of the interest in the construct of psychopathy comes from the relationship between the PCL-R and future criminal behavior (Lyon & Ogloff, 2000). Previous research suggests that psychopathy—as assessed using the Psychopathy Checklist-Revised (PCL-R; Hare, 1991)—is an important risk marker for criminal and violent behavior (Douglas, Vincent, & Edens, 2006; Hart, 1998; Hart & Hare, 1997; Hemphill, Hare, & Wong, 1998; Leistico, Salekin, DeCoster, & Rogers, 2008; Salekin, Rogers, & Sewell, 1996). In fact, the PCL-R has been lauded as an "unparalleled" single predictor of violence (Salekin et al., 1996). Hart (1998) argued that failure to consider psychopathy in a violence risk assessment may constitute professional negligence.
This empirical base has resulted in the PCL-R being used not merely to measure the trait strength of psychopathy in an individual, but also to make predictions about what he or she will do in the future (Hare, 1993). As we demonstrate formally below, this additional step of prediction means that the potential for imprecision in forensic evidence is greatly increased: it expands the gulf between inferences about groups and inferences about individuals. The PCL-R has been incorporated into statutory or legal decision making (Hare, 2003). Within England and Wales, a PCL-R score above a cut-off of 25 or 30 can lead to detention in either a Special Hospital or a prison (Maden & Tyrer, 2003); in certain Canadian provinces parole boards explicitly consider PCL-R scores (Hare, 2003); and in Texas psychopathy assessments are mandated by statute for sexual predator evaluation (Edens & Petrila, 2006).[2] The PCL-R plays a role in criminal sentencing, including decisions regarding indefinite commitment and capital punishment, institutional placement and treatment, conditional release, juvenile transfer, child custody, witness credibility, civil torts, and indeterminate civil commitment (DeMatteo & Edens, 2006; Fitch & Ortega, 2000; Hart, 2001; Hemphill & Hart, 2002; Lyon & Ogloff, 2000; Walsh & Walsh, 2006; Zinger & Forth, 1998). The PCL-R is regarded by many as the best method for operationalizing the construct of psychopathy. For example, Lyon and Ogloff (2000) argued that "… it is critical that the assessment is made using the PCL-R" (p. 166) when evidence about violence risk, based on psychopathy, is provided. Because of its central role in forensic decision making, it is vital to assess its strengths and limitations and, by comparison, the limitations of less well-validated procedures.
PREDICTIONS FOR INDIVIDUALS VERSUS PREDICTIONS FOR GROUPS

Prediction is the raison d'être of many forensic instruments (e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005). While this is not true of the PCL-R, its frequent use in forensic practice is underpinned by the assumption—implicit or explicit—that it can predict future offending (Walsh & Walsh, 2006). How precise can such predictions be? The precision of any estimate of a parameter (e.g., the mean rate of recidivism of a group) can be measured by the width of a confidence interval (CI); a CI gives an estimated range of values that is likely to include an unknown population parameter. If independent samples are taken repeatedly from the same population, and a CI is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Typically, 95% of these intervals should include the unknown population parameter; other confidence levels may be used (e.g., 68% and 99%). The width of this interval provides a measure of the precision—or certainty—that we can have in the estimate of the population parameter. The width of a CI of a population parameter is linked, in part, to the sample size used to estimate the population parameter (see below for a more technical explanation).

The prevailing prediction paradigm has two stages. First, the parameters (mean, slope, and variance) of a regression model linking an independent variable (e.g., PCL-R score) to a dependent variable (e.g., likelihood of reconviction) are estimated. Each of these parameters has uncertainty associated with it, which can be expressed by confidence bands about the regression line. Second, a new case is selected, the PCL-R score is assessed, the model is applied, and the likelihood of reconviction is estimated.
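The two-stage paradigm, and the gap between group-level and individual-level precision, can be sketched in a few lines. The data below are simulated purely for illustration (they are not the authors' data); the structural point is that the standard error for a new individual's outcome carries an extra variance term that does not shrink as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: fit a regression of an outcome on PCL-R-like scores
# (simulated, illustrative data).
n = 200
x = rng.uniform(5, 35, n)                 # predictor scores
y = 0.02 * x + rng.normal(0, 0.3, n)      # noisy outcome

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * xbar
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard error

# Stage 2: apply the model to a new case with score x0.
x0 = 30.0
se_mean = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)      # CI of the group mean
se_pred = s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)  # PI for the new case

# The prediction interval's extra "1" term reflects individual variability;
# here it makes the individual interval several times wider than the group CI.
print(round(se_pred / se_mean, 1))
```

With these simulated data the interval for a new individual is several times wider than the confidence interval for the group mean at the same score, and the ratio grows as the sample size increases, because only the group-level term shrinks with n.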
The best estimate of the likelihood of reconviction for a new case will be identical to the point on the regression line for that PCL-R score. This new estimate has a CI—also known as a prediction interval—that expresses the precision, or certainty, that should be associated with the prediction made about the new case. Often the two steps are conflated, with the unrecognized assumption being made that the prediction interval for the new case is comparable to the CIs for the model. It is not (see below). The problem of making predictions for individuals from statistical models is now recognized in other disciplines. In relation to medical risks, Rose (1992) expressed the position clearly: "Unfortunately the ability to estimate the average risk for a group, which may be good, is not matched by any corresponding ability to predict which individuals are going to fall ill soon" (p. 48). In relation to reoffending, Copas and Marshall (1998) made a related point: "… the score is not a prediction about an individual [italics added], but an estimate of what rate of conviction might be expected of a group [italics added] of offenders who match that individual on the set of covariates used by the score" (p. 170) (see also Altman & Royston, 2000; Bradfield, Huntzickler, & Fruehan, 1970; Colditz, 2001; Elmore & Fletcher, 2006; Henderson, Jones, & Stare, 2001; Henderson & Keiding, 2005; Rockhill, 2001; Rockhill, Kawachi, & Colditz, 2000; Tam & Lopman, 2003; Wald, Hackshaw, & Frost, 1999). It is not generally recognized that a risk factor must have a very strong relative risk (i.e., >50) if it is to have utility as a screening instrument at the individual level (Rockhill et al., 2000; see also Kennaway, 1998).

[Footnote 2: The PCL-R is the most commonly used instrument for assessing psychopathy in this setting (Mary Alice Conroy, personal communication, 10 April 2007).]
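The weakness of a modest relative risk at the individual level can be shown with invented numbers (illustrative only, not taken from Hare, 2003): even when the high-scoring group reoffends at four times the rate of the low-scoring group, most high scorers are not reconvicted.

```python
# Invented, illustrative reconviction rates for high- and low-scoring groups
high_rate, low_rate = 0.28, 0.07

rr = high_rate / low_rate
print(round(rr, 1))             # → 4.0  (a relative risk in the PCL-R's reported range)

# Yet most individuals in the "high risk" group are not reconvicted here:
print(round(1 - high_rate, 2))  # → 0.72
```

A relative risk of 4 separates groups, but says little about which individual in the high-scoring group will be among the minority who reoffend.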
However, others set the bar higher:

A risk factor has to be extremely strongly associated with a disease within a population before it can be considered to be a potentially useful screening test. Even a relative odds of 200 between the highest and lowest fifths will yield a detection rate of no more than about 56% for a 5% false positive rate… (Wald et al., 1999, p. 1564).

To put this in perspective, the relative risk for the association between lung cancer and smoking is between 10 and 15 (Rockhill et al., 2000), depending on the definition of exposure. The relative risk for the PCL-R and recidivism is something of the order of 3 for general recidivism and 4 for violent recidivism (Hare, 2003). Does the application of current forensic tools provide an adequate basis for testimony concerning the individual case? In this article, we attempt to answer this question by considering three issues pertaining to PCL-R data. How confident can clinicians and legal decision makers be, first, in the use of critical diagnostic cut-offs; second, in the numerical value of PCL-R scores; and third, in individual predictions of violent recidivism? We describe two studies. The first study addresses the accuracy of diagnostic decisions and the potential range of discrepancies between two raters. The second study addresses the accuracy of prediction of future violence in the individual case. The results have relevance beyond the PCL-R to the use of other psychometric instruments in forensic practice: the same limitations may apply to many forensic assessment instruments.

STUDY ONE

In the first study, we examined diagnostic accuracy, specifically the allocation of individuals around two critical cut-offs, i.e., around 30 and around 25; the first is the standard PCL-R cut-off for the diagnosis of psychopathy, and the second, often adopted in the UK, has proven useful in that context, including in decisions regarding treatment allocation (Hare, 2003).
The inter-rater reliability figures presented in the PCL-R manual can be regarded as good (Nunnally & Bernstein, 1994); intraclass correlation coefficients for single ratings (ICC1) are estimated in some research studies as being above .80 (male offenders = .86; male forensic psychiatric patients = .88; Hare, 2003, Table 5.4).[3] Edens and Petrila (2006) indicated that these are probably "best case" estimates and "real world" reliabilities may be substantially poorer.[4] Murrie, Boccaccini, Johnson, and Janke (2008), in one "real world" study, demonstrated poor agreement (ICC1 = .39). These views and findings echo concerns expressed by Hare (1998) that, while researchers take great pains to ensure reliability in their studies, the level of reliability achieved by individual clinicians remains unknown—and, by implication, is likely to be poorer than in published studies.

Inter-rater reliability is not the only relevant consideration: diagnostic precision is also influenced by the underlying distribution of test scores. Diagnostic precision is influenced by the location of the cut-off and the shape of the distribution of scores—both skewness and kurtosis. Estimates of the precision of a test score (e.g., standard errors of measurement, SEM) are weighted toward the mean of the distribution, whereas cut-offs are generally located substantially above the mean. Item Response Theory (IRT) studies demonstrate that the measurement precision of the PCL-R—in terms of measurement information—falls toward the diagnostic cut-off (Cooke, Michie, & Hart, 2006); thus, the SEM estimated at the mean will provide an optimistic estimate of diagnostic precision. The SEM cannot be directly translated into estimates of precision of diagnosis because of the impact of the score distributions. Equally, it is not possible to estimate misclassification rates directly from ICC1 values; therefore, simulation approaches are required.
Study one describes a simulation that examines the impact of unreliability on diagnostic accuracy.

Method

Monte Carlo studies allow the investigation of the properties of distributions and estimates of parameters where results cannot be derived theoretically (Mooney, 1997; Robert, 2004). Large numbers of simulated datasets can be created based on an explicit and replicable data-generation process. The effect of known features designed into the data, such as levels of inter-rater reliability, on outcomes, such as diagnostic precision, can be assessed. Multiple trials of procedures are carried out to allow precise estimation of outcomes. Mooney (1997) argued that Monte Carlo simulations could allow social scientists to test classical parametric inference methods and provide more accurate statistical models. In our view, this mainstream statistical technique is underused in forensic research.

Materials

We used Monte Carlo techniques based on distribution information from two datasets of PCL-R total scores: (1) data for North American male offenders (Table 9.1, Hare, 2003) and (2) data from UK prisoners (Cooke, Michie, Hart, & Clark, 2005).[5] The first distribution, being the largest, probably provides the best estimate of the true distribution of scores underlying the PCL-R and is described as "approximately normal" (Hare, 2003, p. 55).

[Footnote 3: The estimates of reliability are frequently obtained by re-rating the same interview or with an observer simultaneously rating within an interview. This will tend to inflate reliability, but not validity, as the same information source is being used.]

[Footnote 4: The case of THE PEOPLE, Plaintiff and Respondent, v. KURT ADRIAN PARKER, a Sexually Violent Predator Act case, highlights the variability that can emerge in some cases; five accredited experts furnished five PCL-R scores that ranged from 10 to 25 (Edens, John, personal communication, 22 May 2006).]
Given the potential impact of a departure from normality, we tested whether this distribution was in fact normal. The departure from normality was highly significant (Kolmogorov–Smirnov = .068, df = 5408, p < .0001; skewness = −.33, kurtosis = −.570). Examination of Fig. 1 demonstrates that cases around the standard cut-off of 30 are over-represented, while cases in the right tail of the distribution are under-represented.

In the simulation study, we generated two random variables per case using MATHCAD 13 (2005). These random variables were scaled according to one of the two datasets referred to above, with mean μ and standard deviation σ.[6] This gives two uncorrelated ratings (x1 and x2) from the distribution of scores: x1 is our first rating on the subject, PCL1. We then calculated a linear combination of the two ratings to provide a second rating on the same subject, which has a correlation of ρ with the first rating. The linear combination is

PCL2 = round{ μ + (x1 − μ)ρ + (x2 − μ)√(1 − ρ²) },

using rounding to ensure an integer score. This process gives two random, correlated scores from the distribution. There is a very small probability of obtaining second ratings less than 0 or greater than 40: these scores have been taken as 0 or 40, respectively. Assuming that the ICC1 represents the best estimate of the correlation between the two scores, we estimated the distributions for four values of reliability, i.e., ICC1 values of .75, .80, .85, and .90. The .80 value is a lower-bound estimate for reasonable practice. Hare (1998) indicated that at least this level should be achievable with "… properly conducted assessments" (p. 107). The .85 level may be achievable by one rater with good training; the .90 level, perhaps the best-case scenario, is the level achievable where two independent sets of ratings are averaged.
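A minimal numpy sketch of this generation scheme follows, using a normal distribution with invented parameters as a stand-in for the empirical score distributions (the original simulation used MATHCAD and the published distributions, so the exact percentages differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the empirical PCL-R distributions used in the paper:
# the mean/SD below are illustrative, not the published values.
mu, sigma = 22.0, 8.0
rho = 0.80            # assumed inter-rater reliability (ICC1)
n = 1_000_000

x1 = rng.normal(mu, sigma, n)   # first rating, before rounding
x2 = rng.normal(mu, sigma, n)   # independent draw from the same distribution

# Linear combination: a second rating correlated rho with the first,
# rounded to an integer and clipped to the 0-40 PCL-R range.
pcl1 = np.clip(np.round(x1), 0, 40)
pcl2 = np.clip(np.round(mu + (x1 - mu) * rho
                        + (x2 - mu) * np.sqrt(1 - rho**2)), 0, 40)

# Diagnostic agreement at the standard cut-off of 30
agree = np.mean((pcl1 >= 30) == (pcl2 >= 30))

# Of cases rater 1 scores 30-34 (just above the cut-off), how often does
# rater 2 land below 30?
band = (pcl1 >= 30) & (pcl1 <= 34)
below = np.mean(pcl2[band] < 30)
print(f"agreement at cut-off 30: {agree:.3f}; "
      f"30-34 by rater 1 but <30 by rater 2: {below:.3f}")
```

Even under this smooth normal stand-in, a large share of cases scored just above the cut-off by one rater fall below it for the other, which is the pattern the paper reports from the empirical distributions.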
Values above .90 are rarely if ever achievable—Hare (1998) describes .95 and higher as "unbelievably high" (p. 107). The .75 value provides a lower-bound estimate of what may be obtained in clinical practice. These values probably represent optimistic estimates for actual clinical practice; we did not assess the "worst case" scenarios implied by Edens and Petrila (2006) and by Murrie et al. (2008). The estimation procedure was repeated 10,000,000 times for each of the four levels of ICC1 to provide stable estimates of the distribution of the correlated ratings and to ensure at least 10,000 cases within each of the extreme score bands. We examine discrepancies in two ways: first, in terms of diagnostic disagreement and, second, in terms of disagreements about total scores.

[Fig. 1: Distribution of PCL-R total scores for North American male prisoners, with a normal curve overlaid; x-axis, PCL-R score (0–40); y-axis, frequency (0–300).]

[Footnote 5: We carried out a similar analysis of data for male forensic psychiatric patients (Table 9.2, Hare, 2003); the results, which demonstrate the same pattern, can be obtained from the first author.]

What is the level of diagnostic agreement? Kappa (κ) coefficients measure the proportion of diagnostic agreements corrected for observed base rates (Fleiss, 1981). Conventionally, κ ≥ .75 represents excellent agreement, .40 ≤ κ < .75 represents fair to good agreement, and κ < .40 represents poor agreement (Gail & Benichou, 2000). Kappa values for the distributions and the four ICC1 values are given in Table 1. We calculated kappa coefficients for agreement in diagnosis between the two ratings using both common cut-offs, i.e., 30 and 25. The vast majority of kappa values are only in the fair to good range; a few values approach the poor range. Kappa is an omnibus statistic, which is useful for summarizing group results; however, it tells us little about agreement in the individual case. The potential for misclassification is clearer when distributions of disagreements
are considered. The distributions based on the North American male offenders are in Table 2. For ease of interpretation, we tabulated the distributions in 5-point ranges. Examination of the sub-table for ICC1 = .80 indicates that if one rater gives a score between 30 and 34, i.e., just above the diagnostic cut-off, then only on 46% of occasions—approximately half the time—will the other rater obtain a score within the same range. On 44% of occasions, the second rater would place the individual below the critical cut-off. Even in the best-case scenario, i.e., ICC1 = .90, if one rater gives a score between 30 and 34, then only on 60% of occasions will the other rater obtain a score within the same range. On 29% of occasions, the second rater would place the participant below the critical cut-off.

[Footnote 6: A full description of the simulation study, including the Mathcad code, can be obtained from the first author.]

The distributions based on the UK prisoners are in Table 3. Examination of the table for ICC1 = .80 indicates that if one rater gives a score between 30 and 34, i.e., just above the diagnostic cut-off, then only on 39% of occasions will the second rater obtain a score within the same range. In 54% of cases, the second rater would place the individual below the critical cut-off. As previously, even in the best-case scenario, i.e., ICC1 = .90, if one rater gives a score between 30 and 34, then only in 53% of cases will the other rater obtain a score within the same range. In 39% of cases, the second rater would place the participant below the critical cut-off.

In the UK, the cut-off of 25, as well as 30, is often applied (DSPD Programme, 2005; Hare, 2003). Examination of the table for ICC1 = .80 indicates that if one rater gives a score between 25 and 29, i.e., just above the UK diagnostic cut-off, then only on 29% of occasions will the other rater obtain a score within the same range.
On 49% of occasions, the second rater would place the individual below the critical cut-off. Even in the best-case scenario, i.e., ICC1 = .90, if one rater gives a score between 25 and 29, then only in 37% of cases will the other rater obtain a score within the same range. On 37% of occasions, the second rater would place the participant below the critical cut-off. Therefore, in broad terms, all of the findings reported above demonstrate that the allocation of an individual above or below diagnostic cut-offs is much less precise than previously thought.

Another way of considering the precision of PCL-R scores is to examine expected discrepancies in scores based on variations in ICC1 while taking into account the distributional characteristics of the PCL-R scores. The PCL-R manual suggests that in 68% of cases the discrepancy between two raters should be up to 3 points, and in 95% of cases it should be up to 6 points (Hare, 2003). This assumes normality of the PCL-R score distribution, an assumption that is not met (see above). The cumulative distributions of score discrepancies estimated from the Monte Carlo studies are tabulated in Table 4. With the North American prisoner sample and an ICC1 of .80, a discrepancy of between 8 and 9 points would be expected in 9% of cases, around 10 points in 5% of cases, and between 12 and 13 points in 1% of cases. With the UK prisoner sample and an ICC1 of .80, a discrepancy of between 8 and 9 points would be expected in 23% of cases, around 10 points in 5% of cases, and around 12 points in 1% of cases.

An alternative approach to summarizing the range of possible discrepancies is to estimate the distribution of a 2nd PCL-R rating given the 1st PCL-R rating. This conditional distribution can be summarized by a CI that contains 95% of the 2nd ratings. This interval is thus defined by the lower and upper limits LL and UL given by

Prob(LL ≤ 2nd rating ≤ UL | 1st rating) = 0.95.
Results for both 68% and 95% CIs for ICC1 = .80, and for both samples, are presented in Table 5. For example, in the North American prisoner sample, if rater one obtains a total score of 30, then the 95% CI for rater two's total score will be between 19 and 36 (i.e., between the 35th and 99th percentile).

All the estimates in this study are conservative; that is, they assume that the SEM that applies at the mean applies around the cut-off. However, this is an unwarranted assumption. The overall variance of errors of measurement is a weighted average of the errors that pertain across the range of true-score values. Precision of measurement of the PCL-R drops as scores approach the diagnostic cut-off (e.g., Cooke & Michie, 1997; Cooke et al., 2006). Thus, the degree of diagnostic misclassification and score discrepancy is likely to be greater in practice than demonstrated in the simulation above. The conditional SEM (CSEM)[7] is the square root of the variance of errors at a particular level of true scores.

Table 1  Kappa coefficients and levels of agreement (percentages) for four levels of correlation (ρ) for two distributions

North American male offenders
  ρ      Both <30   Both ≥30   Different    κ      Both <25   Both ≥25   Different    κ
  0.75     72.9       10.6       16.4      .46       48.0       29.6       22.4      .54
  0.80     73.3       11.5       15.1      .51       48.9       31.0       20.0      .59
  0.85     74.4       12.2       13.5      .56       50.5       32.3       17.2      .64
  0.90     75.5       13.4       11.1      .64       51.7       34.2       14.1      .71

United Kingdom prisoners
  ρ      Both <30   Both ≥30   Different    κ      Both <25   Both ≥25   Different    κ
  0.75     91.8        2.1        6.1      .38       79.8        7.6       12.6      .47
  0.80     92.0        2.4        5.6      .43       83.1        6.8       10.2      .52
  0.85     92.4        2.7        5.0      .49       81.2        9.1        9.7      .60
  0.90     92.7        3.1        4.2      .57       82.0       10.1        7.9      .67
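The kappa values in Table 1 follow directly from the agreement percentages. A minimal sketch of Cohen's kappa, with counts reconstructed from the ICC1 = .80 row for the North American sample at the cut-off of 30 (scaled to an arbitrary n, so this is an approximation, not the authors' computation):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' binary diagnoses."""
    r1, r2 = np.asarray(r1, bool), np.asarray(r2, bool)
    po = np.mean(r1 == r2)                # observed agreement
    p1, p2 = r1.mean(), r2.mean()
    pe = p1 * p2 + (1 - p1) * (1 - p2)    # agreement expected by chance
    return (po - pe) / (1 - pe)

# Counts shaped like Table 1's 0.80 row for the cut-off of 30:
# 73.3% both below, 11.5% both at/above, 15.1% split between raters.
r1 = np.array([0] * 733 + [1] * 115 + [1] * 76 + [0] * 75, bool)
r2 = np.array([0] * 733 + [1] * 115 + [0] * 76 + [1] * 75, bool)
print(round(cohens_kappa(r1, r2), 2))  # → 0.51, matching Table 1
```

This also shows why kappa sits well below the raw agreement rate (about 85% here): with most cases below the cut-off, much of the raw agreement is expected by chance alone.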
Table 2  Distribution of diagnostic disagreements by four levels of correlation between raters, based on the distribution of North American male offenders. Columns condition on the first rater's score band; rows give the second rater's score band. The tables show column percentages, which sum to 1 within rounding error.

ρ = 0.75
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .209   .086   .033   .006   0      0      0      0
5–9      .505   .305   .200   .079   .008   0      0      0
10–14    .245   .318   .250   .188   .072   .005   0      0
15–19    .041   .235   .266   .270   .239   .078   .002   0
20–24    0      .053   .192   .256   .310   .281   .097   0
25–29    0      .002   .056   .155   .236   .369   .371   .130
30–34    0      0      .003   .044   .121   .230   .445   .551
35–40    0      0      0      .001   .014   .037   .085   .319

ρ = 0.80
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .245   .096   .032   .002   0      0      0      0
5–9      .523   .372   .211   .065   .003   0      0      0
10–14    .217   .315   .288   .198   .055   .001   0      0
15–19    .014   .198   .280   .306   .240   .055   0      0
20–24    0      .019   .162   .269   .340   .283   .063   0
25–29    0      0      .026   .137   .247   .391   .379   .076
30–34    0      0      0      .023   .106   .237   .465   .585
35–40    0      0      0      0      .010   .031   .093   .339

ρ = 0.85
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .285   .105   .021   0      0      0      0      0
5–9      .552   .423   .217   .038   0      0      0      0
10–14    .158   .328   .333   .202   .028   0      0      0
15–19    .005   .141   .303   .354   .231   .024   0      0
20–24    0      .003   .121   .287   .386   .265   .029   0
25–29    0      0      .005   .111   .266   .430   .351   .038
30–34    0      0      0      .008   .085   .252   .519   .569
35–40    0      0      0      0      .003   .029   .101   .393

ρ = 0.90
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .361   .103   .009   0      0      0      0      0
5–9      .578   .486   .215   .012   0      0      0      0
10–14    .061   .348   .390   .190   .010   0      0      0
15–19    0      .063   .319   .422   .216   .005   0      0
20–24    0      0      .067   .304   .449   .238   .004   0
25–29    0      0      0      .071   .267   .500   .289   .006
30–34    0      0      0      0      .056   .239   .609   .494
35–40    0      0      0      0      0      .018   .098   .501
The rows therefore do not sum to 1.

Table 3  Distribution of diagnostic disagreements by four levels of correlation between raters, based on the distribution of UK prisoners. Columns condition on the first rater's score band; rows give the second rater's score band.

ρ = 0.75
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .395   .158   .064   .020   .002   0      0      0
5–9      .477   .352   .228   .107   .035   .002   0      0
10–14    .128   .332   .309   .219   .131   .032   0      0
15–19    0      .153   .271   .336   .265   .198   .041   0
20–24    0      .005   .119   .215   .316   .297   .266   .031
25–29    0      0      .009   .086   .162   .261   .291   .254
30–34    0      0      0      .017   .083   .191   .328   .529
35–40    0      0      0      0      .006   .020   .075   .186

ρ = 0.80
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .438   .163   .059   .011   0      0      0      0
5–9      .485   .387   .232   .097   .014   0      0      0
10–14    .077   .354   .328   .229   .111   .016   0      0
15–19    0      .095   .296   .354   .289   .154   .023   0
20–24    0      0      .083   .237   .331   .321   .214   .007
25–29    0      0      .002   .067   .176   .287   .303   .224
30–34    0      0      0      .005   .077   .200   .387   .545
35–40    0      0      0      0      .002   .022   .073   .224

ρ = 0.85
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .470   .168   .042   .003   0      0      0      0
5–9      .474   .437   .224   .069   .003   0      0      0
10–14    .056   .342   .375   .226   .075   .003   0      0
15–19    0      .053   .308   .404   .287   .106   .001   0
20–24    0      0      .050   .251   .383   .325   .140   0
25–29    0      0      0      .047   .194   .321   .328   .145
30–34    0      0      0      0      .058   .229   .447   .578
35–40    0      0      0      0      0      .017   .084   .277

ρ = 0.90
          0–4    5–9   10–14  15–19  20–24  25–29  30–34  35–40
0–4      .530   .169   .020   0      0      0      0      0
5–9      .450   .501   .219   .033   0      0      0      0
10–14    .019   .312   .457   .207   .037   0      0      0
15–19    0      .018   .286   .489   .275   .050   0      0
20–24    0      0      .017   .252   .450   .318   .062   0
25–29    0      0      0      .018   .214   .367   .325   .046
30–34    0      0      0      0      .024   .257   .528   .587
35–40    0      0      0      0      0      .008   .085   .367

[Footnote 7: Professional standards indicate that the CSEM is an important piece of information that should be provided in a test manual. For example, Standard 2.14: "Conditional standard errors of measurement should be reported at several score levels if constancy cannot be assumed. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score." (American Educational Research Association/American Psychological Association, 1999, p. 35, emphasis added).]
To evaluate the true level of agreement of diagnosis likely to apply around a cut-off, it is necessary to take the CSEM into account. Item Response Theory indicates that the error of measurement varies with location on the trait (θ). IRT gives

SE(θ) = 1 / √I(θ),

where I(θ) is the information at θ. Classical test theory (CTT) gives

SEM = SD × √(1 − ρ).

Let ρ1 be the correlation at location 1 (θ1) and ρ2 be the correlation at location 2 (θ2). Then

ρ2 = 1 − (1 − ρ1) × I(θ1)/I(θ2).

Location 1 is θ = 0.0 (PCL-R = 20) and Location 2 is θ = 1.0 (PCL-R = 30) (approximate locations from Hare, 2003, Fig. 6.6; see also Cooke & Michie, 1997). Overall, the impact of the location on the estimated ICC1 is limited, dropping—at a maximum—from .75 to .69. However, as noted above, even small drops in ICC1 (e.g., from .85 to .80) can substantially affect the misclassification rate and the range of likely score discrepancies (see Table 6). It is noteworthy that the magnitude of the drop appears to be proportionately larger the poorer the mean estimated level of inter-rater reliability. This suggests that the effect of the CSEM is larger in cases that start with a relatively poor level of inter-rater reliability. Equally, this would suggest that proportionately greater discrepancies would, in general, be obtained when factor or facet scores are considered, because they have lower levels of reliability than the total scores (Hare, 2003).

STUDY TWO

The use of the PCL-R in court is frequently justified based on its predictive utility, the support being garnered from between-subject designs (Edens & Petrila, 2006; Hare, 2003; Walsh & Walsh, 2006). In this study, we are concerned with the individual. We examine the confidence that can be placed in a prediction that an individual with a particular PCL-R score will be reconvicted for a violent offence. All measurements and estimates entail error.
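The CTT-IRT adjustment described above is a one-liner. The information ratio below is an assumed illustrative value (not read off Hare, 2003, Fig. 6.6); it is chosen so that the example reproduces the .75 to .69 drop quoted in the text.

```python
def reliability_at_cutoff(rho1, info_ratio):
    # rho2 = 1 - (1 - rho1) * I(theta1) / I(theta2)
    return 1 - (1 - rho1) * info_ratio

# If the test carries ~20% less information at the cut-off than at the mean,
# then info_ratio = I(theta1)/I(theta2) = 1.25 (assumed for illustration):
print(round(reliability_at_cutoff(0.75, 1.25), 2))  # → 0.69
```

The same formula shows the asymmetry the text notes: the absolute drop scales with (1 − ρ1), so tests that start with poorer reliability lose proportionately more near the cut-off.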
As noted above, the degree of error is expressed by CIs. For example, while the mean rate of reoffending for a "High Risk" group may be estimated as being 55%, the 95% CI indicates that the true value of the mean rate of reoffending for this group will lie between 44% and 66%, 95% of the time, i.e., 19 times out of 20 (Hart, Michie, & Cooke, 2007). However, the clinician and the decision maker are interested in the individual case, not the group. Therefore, how much confidence can the clinician and decision maker have in predictions of reoffending in the individual case based on PCL-R scores? We examine CIs for group and individual predictions.

Table 4  Cumulative distribution of expected discrepancies between two raters for different levels of correlation, based on two sample distributions

                            North American male offenders     United Kingdom prisoners
Point discrepancy   SEM^a     0.75   0.80   0.85   0.90         0.75   0.80   0.85   0.90
 0                  1.00      1.00   1.00   1.00   1.00         1.00   1.00   1.00   1.00
 1                  .741      .934   .927   .919   .901         .936   .928   .918   .900
 2                  .503      .804   .785   .758   .704         .804   .785   .758   .703
 3                  .317      .679   .647   .595   .518         .681   .647   .595   .518
 4                  .303      .560   .516   .453   .352         .561   .518   .454   .357
 5                  .095      .449   .397   .327   .217         .452   .400   .332   .222
 6                  .046      .349   .292   .215   .121         .353   .298   .221   .125
 7                  .020      .262   .205   .136   .058         .269   .212   .141   .063
 8                  .007      .190   .137   .079   .023         .196   .142   .083   .026
 9                  .002      .132   .086   .041   .007         .137   .090   .044   .009
10                  .001      .088   .050   .018   .002         .091   .054   .021   .003
11                            .055   .026   .007                .058   .030   .009   .001
12                            .032   .012   .002                .035   .015   .004
13                            .017   .005                       .019   .007   .001
14                            .008   .002                       .010   .003
15                            .003                              .005   .001
16                            .001                              .002
17                                                              .001

^a This column shows the cumulative distribution of discrepancies, which was calculated assuming that discrepancies between two raters are normally distributed and that the SEM is 3 (Hare, 2003, pp. 66–67).
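The group-level interval quoted above behaves like a standard Wald CI for a proportion. The group size below is an assumption chosen for illustration (it is not reported in this excerpt); with p = .55 and n = 78 the interval lands near the quoted 44% to 66%.

```python
from math import sqrt

def wald_ci(p, n, z=1.96):
    """Wald 95% CI for a group's observed reoffending proportion."""
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = wald_ci(0.55, 78)         # n = 78 is an illustrative assumption
print(round(lo, 2), round(hi, 2))  # → 0.44 0.66
```

The interval narrows as n grows, which is exactly the group-level precision the text contrasts with the stubbornly wide intervals for an individual prediction.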
Participants

Two hundred fifty-five male prisoners between 18 and 40 years of age (M = 26.8, SD = 5.9) were interviewed in Scotland's largest prison for a study of psychological characteristics and violence (Cooke, Michie, & Ryan, 2001; Michie & Cooke, 2006). Prisoners were selected by systematic random sampling of the prison. The average sentence length was 39 months (SD = 23 months; range = 3 months to 10 years and life).

PCL-R Ratings

PCL-R ratings were made according to the instructions in the test manual (Hare, 1991). All PCL-R evaluations were conducted by trained raters using both interview and file review (ICC1 = .86).

Assessment of Recidivism

Reconviction data were obtained from two sources: the Scottish Criminal Records Office (SCRO) and the Police National Computer (PNC). The average follow-up period was 29 months. The point-biserial correlation between PCL-R scores and recidivism (r = .31) was above average for the field (Walters, 2003). For the purposes of illustration, we consider reconviction for violence that resulted in a prison sentence (i.e., generally a more serious violent offence). Follow-up data were available for 190 cases and PCL-R data for 184 of these.

Table 5  The 68% and 95% confidence intervals for a 2nd PCL-R total score given the 1st PCL-R score and ICC = 0.8

                 North American prisoners             UK prisoners
1st PCL-R     LL.95   LL.68   UL.68   UL.95      LL.95   LL.68   UL.68   UL.95
0               0       0       9      12          0       0       8      13
1               0       0      10      13