NUR 811 Population-Based Nursing Assignment


Hello,

For the Cupp Curley assignment, answer items A-C using the article provided on prediabetes.

A. Population of interest: African American females aged 30-60 with prediabetes.

B. Read the article and identify two possible interventions from it. Explain what those two interventions are.

C. Discuss how you would evaluate the interventions and what outcome measures could be reported from these interventions.

D. Formulate 2 PICOT questions:

1. Among African American adult females (P), what is the effect of prescribing Metformin (I), compared to teaching on diet and exercise (C), on Hemoglobin A1c levels (O) within three months (T)?

2. Among African American adult females (P), what are the perceived barriers to the use of Metformin (I) compared to implementing diet and exercise (C) to lower Hemoglobin A1c levels (O) within three months (T)?

Pages 110-114 explain the process of PICOT development. Both questions should be about the same population but address either different aspects of the care or needs of this population, or another group that takes care of the population. See examples under letter D.

E. Discuss what type of data would need to be collected. Only one source is needed; see Chapter 6 for sources.

F. For only one of the PICOT questions, discuss how you would control selection bias (for example, random sampling).

G. Explain to a non-nursing person the terms validity and reliability as they relate to research.

H. Include sources from the article and the other sources that were requested.

The assignment for Gordis
• Gordis Problems: Ch 5 Problems 1-8.
1. Provide the correct answer for these problems 1-8
2. Show your math on Problems #: 1-3, 6-7
3. Provide an explanation or rationale for each of your answers both correct and incorrect answers.
4. Provide the page number of the concepts referenced in the text; I will provide this information.
• Gordis Problems: Ch 6 Complete Problem 1
1. Provide the correct answer for Problem #1
2. Show your math on Problem #: 1
3. Provide an explanation or rationale for each of your answers
4. Provide the page number of the concepts referenced in the text; I will provide this information.
• Gordis Ch 6 – Natural History of Disease or Survival Data
1. Locate an appropriate article or report from within the last 6 years related to natural history of disease, survival data, or data noted in Gordis Ch 6. An article on colon cancer is provided.
2. Report: Focus, Population, Outcomes, and what you have learned related to the text's concepts from the article provided.
3. Consider bookmarking this course for later use.
4. Post with your Gordis Chapter problems to Canvas Assignments. See the article on colon cancer attached. I will post the assignment to Canvas.

Module 4 (Wk 6-7): 2-Week Module, Week of Feb. 22-March 6, 2021, NUR 811

Module 4: Epidemiology Methods and Measurements: Part II  (Cupp Curley Ch 4);

                    Applying Evidence at the Population Level  (Cupp Curley Ch 5)

                    Using Informatics Technology to Improve Population Outcomes (Cupp Curley Ch 6)

                    Epi-Assessing the Validity and Reliability of Diagnostic and Screening Tests (Gordis Ch 5)

                    Epidemiology-Natural History of Disease: Ways of Expressing Prognosis (Gordis Ch 6)

2 Week Module – Feb. 28 – March 6

Module Objectives:

1.     The student will identify and locate appropriate population-focused sources of data and resources available to HCP. (CO 1,2,3,6)

2.     The student will utilize epidemiological methods to determine and apply data of populations including cultural and psychosocial aspects relevant to natural history of disease and survival over time. (CO 1,2)

3.     The student will utilize, apply, and evaluate a variety of epidemiological methods with health conditions in problem-solving current health status of the population. (CO 1,2,3)

4.     The student will further explore more epidemiological methods for use in evaluating the health of populations and translating this evidence into change in clinical practice. (CO 1)

5.     The student will compute and interpret epidemiologic problems and apply them to population-based health strategies and/or outcomes.

CO: Course Objectives

Required Readings:

·     Cupp Curley and Vitale, Ch 4, 5, 6

·     Gordis Ch 5 & 6

·     Review Gordis Ch 1-3 for Epi Quiz 1: Wednesday, Feb 24; opens at 11:00 AM CST and closes at 11:45 PM CST.

 

Materials-Review:

·      PowerPoint Lecture

·      Search EBP articles for application of population and topic

·      Link to Websites and Full Text Databases of Interest on USM Library Site

·      National Clearinghouse/Clinical Practice Guidelines and other Resources, Cupp Curley pgs. 151-153

·     Online Discussion Forum posting and response to another student

 

From Cupp Curley:

·     Review and Consider: Read and Review Concepts of Ch 4-5-6

Epidemiology Methods and Measurements in Population-based Nursing-Part II

Discussion 2

To Discussions-USE Headings with your post.

(A) Post your population of interest while in this course (this may have changed; list it).

(B) Explore the literature and evidence. List at least two (2) possible interventions from the literature (article) for your population of interest. This search can include the USM Library, Google Scholar, the CDC, a database, state or federal sponsored agencies (HRSA, SAMHSA, other), and/or the informational resources in Ch 6. Cite source(s).

(C) Report how you will evaluate these interventions and what measures you could report (Epi Methods).

(D) Propose 2 PICO(T) questions (pp. 110-114) for a population-focused study. Use the PICOT format and address each element (Population, Intervention, Comparison, Outcome, Time). The second PICOT should address another aspect of the care or needs of this population, or another group or participants who care for or interact with this population (think of this as an alternate).

Examples: Population: A. Children ages X-X or B. Parents or caregivers of the children or C. Nurses who care for Children ages X to X

Examples: Intervention: A. Tolerance to a new exercise program post-surgery for X or B. Knowledge gap in home care for the child with X or C. Nurses' knowledge of/need for current education on X with children.

Now put the PICOT together in this order in a sentence. Let’s collaborate for clarity with PICOT.

(E) What type of data would you need to collect (remember your outcomes)? Locate and post 1 source of this data (See Ch 6)

Note: you should begin to look for peer-reviewed journal articles on your topic and possible interventions for the populations (not blogs or editorials). Collect them in an online folder.

(F) With one of the PICOT questions listed above, discuss how you can control selection bias (pp. 89-90)

(G) Offer an explanation to a non-nursing person of these terms: validity and reliability in your field.

(H) Include your citations.

 

Submit all in 1 Discussion Post with A-H Headings used and responses to each.

Post your discussion responses to the posted questions on Canvas, along with responses to at least 2 other students. The primary post should be done by Thursday of the 2nd week of the module. Responses to the 2 other students are due by Thursday night of the 2nd week, the module due date (Tuesday of the 2nd week preferred). Cite source(s). (Discussion/Response)

Submission from Gordis in Assignments

Ø Gordis Problems CH 5 & 6 as listed below. See instructions.

 

From Gordis: Review Appendix 1 to Ch 5-p 119-120—For Later

·        Gordis Problems: Ch 5 Problems 1-8. 

1. Provide the correct answer for these problems 1-8

2. Show your math on Problems #: 1-3, 6-7

3. Provide an explanation or rationale for each of your answers

4. Provide the page number of the concepts referenced in the text

 

·        Gordis Problems: Ch 6 Complete Problem 1 and Review Problems 2-5 for content.

1. Provide the correct answer for Problem #1

2. Show your math on Problem #: 1

3. Provide an explanation or rationale for each of your answers

4. Provide the page number of the concepts referenced in the text

 

·        Gordis Ch 6 – Natural History of Disease or Survival Data

1.      Locate an appropriate article or report from within the last 6 years related to natural history of disease, survival data, or data noted in Gordis Ch 6.

2.      Report: Focus, Population, Outcomes, and what you have learned related to the text's concepts.

3.      Consider bookmarking this course for later use.

4.      Post with your Gordis Chapter problems to Canvas Assignments.

 

Submit all in 1 Assignment Post-Gordis Ch 5, Gordis Ch 6, Response to article or report Ch 6

 

Make plans for your Epidemiology Quiz 1: Wed., Feb 24; opens at 11:00 AM CST and closes at 11:45 PM CST

Cupp Curley Discussion Rubric:  (Post to Discussions followed by at least 2 Responses)

Cupp Curley Chapters 4, 5, 6-Use Headings

Post assignment to Discussion Board followed by 2 responses

Completed / Not Done:

(A) Post your population of interest (5 points; Not Done = 0)

(B) From the literature and evidence, list at least two (2) possible interventions (15 points; Not Done = 0)

(C) Report how you will evaluate these interventions and what measures you could report (10 points; Not Done = 0)

(D) Propose 2 PICO(T) questions for a population-focused study, using the PICOT format (20 points; 10 points x 2; Not Done = 0)

(E) What type of data would you need to collect (remember your outcomes)? Locate and post 1 source of this data (10 points; Not Done = 0)

(F) With one of the PICOT questions listed above, discuss how you can control selection bias (10 points; Not Done = 0)

(G) Offer an explanation to a non-nursing person of these terms: validity and reliability in your field (10 points; Not Done = 0)

(H) Respond to at least 2 other class members with meaningful posts by the due date (10 points; 5 pts x 2 each; Not Done = 0)

(I) Include APA citations and use headings for parts of the questions (10 points; 5 pts x 2 each; Not Done = 0)

On-time submission: late assignments receive -5 points per day late for up to 2 days; no further points are received after that.

Total: 100 points (tbd)

Gordis Rubric:    (Submit to Assignments)

Ch 5 Problems 1-8 and Ch 6 Problem 1

·        Complete Ch 5, all problems 1-8, correctly, showing your work (16 points; 2 pts x 8; Not Done = 0)

·        For problems 1-8 in Chapter 5, offer a rationale for why each choice is correct or incorrect (16 points; 2 pts x 8; Not Done = 0)

·        For the 8 problems in Chapter 5, include the page number in the text connected with the content (8 points; 1 pt x 8; Not Done = 0)

·        Complete Ch 6, Problem 1 only, correctly, showing your work (5 points; Not Done = 0)

·        For Problem 1 in Chapter 6, include the page number in the text connected with the content (1 point; Not Done = 0)
Gordis Ch 6 Rubric:    (Submit to Assignments)

Natural History of Disease or Survival Data

·        Locate an appropriate article related to natural history of disease or survival data (10 points; Not Done = 0)

·        Report: Focus (8 points; Not Done = 0)

·        Report: Population (8 points; Not Done = 0)

·        Report: Outcomes (8 points; Not Done = 0)

·        Report: What you have learned (10 points; Not Done = 0)

·        Include APA citations and use headings for parts of the questions (10 points; 5 pts x 2 each; Not Done = 0)

·        On-time submission: late assignments receive -5 points per day late for up to 2 days; no further points are received after that.

Submit to Assignments. Total: 100 points (tbd)

Gordis CHAPTER 5

Assessing the Validity and Reliability of Diagnostic and Screening Tests

Keywords

sensitivity; specificity; sequential (two-stage) and simultaneous testing; predictive value; reliability; percent agreement and kappa statistic; validity

 

A normal individual is a person who has not been sufficient

LEARNING OBJECTIVES

  • To define the validity and reliability of screening and diagnostic tests.
  • To compare measures of validity, including sensitivity and specificity.
  • To illustrate the use of multiple tests (sequential and simultaneous testing).
  • To introduce positive and negative predictive value.
  • To address measures of reliability, including percent agreement and kappa.

To understand how a disease is transmitted and develops and to provide appropriate and effective health care, it is necessary to distinguish between people in the population who have the disease and those who do not. This is an important challenge, both clinically, where patient care is the issue, and in the public health arena, where secondary prevention programs involving early disease detection through screening and interventions are fielded and where etiologic studies are conducted to provide a basis for primary prevention, if possible. Thus the quality of screening and diagnostic tests is a critical issue. Regardless of whether the test is a physical examination, a chest x-ray, an electrocardiogram, or a blood or urine assay, the same issue arises: How good is the test in identifying populations of people with and without the disease in question? This chapter addresses the question of how we assess the quality of newly available screening and diagnostic tests to make reasonable decisions about their use and interpretation.

 

Biologic Variation of Human Populations

In using a test to distinguish between individuals with normal and abnormal results, it is important to understand how characteristics are distributed in human populations.

 

Fig. 5.1 shows the distribution of newly reported confirmed cases of hepatitis C virus infection in Massachusetts for 2009. We can see that there are two peaks of hepatitis C virus infection cases among young adults and middle-aged persons. This type of distribution, in which there are two peaks, is called a bimodal curve. The bimodal distribution permits the identification of increased rates of new cases among these two distinct age groups, which could be related to different reasons. In this situation, there has been a dramatic increase in hepatitis among injection drug users, a practice associated with sharing of injection equipment that led to this bimodal distribution.

 

 

FIG. 5.1 Distribution of newly reported confirmed cases of hepatitis C virus infection in Massachusetts for 2009. (Modified from Centers for Disease Control and Prevention. Hepatitis C virus infection among adolescents and young adults: Massachusetts, 2002–2009. MMWR Morb Mortal Wkly Rep. 2011;60:537–541.)

In general, however, most human characteristics are not distributed bimodally. Fig. 5.2 shows the distribution of achieved low-density lipoprotein cholesterol (LDL-C) in participants of a clinical trial studying the safety of intensive LDL-C reduction as compared with less intensive LDL-C lowering in patients after acute coronary syndrome. In this figure, there is no bimodal curve; what we see is a unimodal curve, a single peak. Therefore, if we want to separate out those in the group who achieved a safe low level of LDL-C, a cutoff level of LDL-C must be set below which people are labeled as achieving the "safe low level" and above which they are not labeled as such. This study shows that there is no obvious level of LDL-C that should be a treatment target. Although we could choose a cutoff based on statistical considerations, as the authors of this study showed, we would ideally like to choose a cutoff on the basis of some biologic information; that is, we would want to know that an intensive LDL-C lowering strategy below the chosen cutoff level is associated with an increased risk of subsequent treatment side effects (adverse muscle, hepatobiliary, and neurocognitive events) or disease complications (hemorrhagic stroke, heart failure, cancer, and noncardiovascular death). Unfortunately, for many human characteristics, we do not have such information to serve as a guide in setting this level.

 

 

FIG. 5.2 Distribution of achieved calculated low-density lipoprotein cholesterol (LDL-C) level at 1 month among patients who did not have a primary efficacy or prespecified safety event prior to the sample. (Data from Giugliano RP, Wiviott SD, Blazing MA, et al. Long-term safety and efficacy of achieving very low levels of low-density lipoprotein cholesterol: a prespecified analysis of the IMPROVE-IT trial. JAMA Cardiol. 2017;2:547–555.)

In either distribution—unimodal or bimodal—it is usually easy to distinguish between the extreme values of abnormal and normal. With either type of curve, however, uncertainty remains about cases that fall into the gray zone.

 

Validity of Screening Tests

The validity of a test is defined as its ability to distinguish between who has a disease and who does not. Validity has two components: sensitivity and specificity. The sensitivity of the test is defined as the ability of the test to identify correctly those who have the disease. The specificity of the test is defined as the ability of the test to identify correctly those who do not have the disease.

 

Tests With Dichotomous Results (Positive or Negative)

Suppose we have a hypothetical population of 1,000 people, of whom 100 have a certain disease and 900 do not. A test is available that gives either positive or negative results. We want to use this test to distinguish persons who have the disease from those who do not. The results obtained by applying the test to this population of 1,000 people are shown in Table 5.1.

 

TABLE 5.1

 

Calculation of the Sensitivity and Specificity of Screening Examinations

 

How good was the test? First, how good was the test in correctly identifying those who had the disease? Table 5.1 indicates that of the 100 people with the disease, 80 were correctly identified as “positive” by the test, and a positive identification was missed in 20. Thus the sensitivity of the test, which is defined as the proportion of diseased people who were correctly identified as “positive” by the test, is 80/100, or 80%.

 

Second, how good was the test in correctly identifying those who did not have the disease? Looking again at Table 5.1, of the 900 people who did not have the disease, the test correctly identified 800 as “negative.” The specificity of the test, which is defined as the proportion of nondiseased people who are correctly identified as “negative” by the test, is therefore 800/900, or 89%.
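As a quick arithmetic check (not part of the text), the two proportions above can be reproduced in a few lines of Python using the counts from Table 5.1:

```python
# Counts from Table 5.1: 1,000 people screened, 100 with disease, 900 without
true_positives = 80     # diseased and correctly labeled "positive"
false_negatives = 20    # diseased but labeled "negative" (missed)
true_negatives = 800    # nondiseased and correctly labeled "negative"
false_positives = 100   # nondiseased but labeled "positive"

# Sensitivity: proportion of diseased people correctly called "positive"
sensitivity = true_positives / (true_positives + false_negatives)   # 80/100

# Specificity: proportion of nondiseased people correctly called "negative"
specificity = true_negatives / (true_negatives + false_positives)   # 800/900

print(f"Sensitivity = {sensitivity:.0%}, Specificity = {specificity:.0%}")
```

Note that the denominators differ: sensitivity is computed only among the diseased, and specificity only among the nondiseased.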

 

To calculate the sensitivity and specificity of a test, we must know who “really” has the disease and who “does not” from a source other than the test we are using. We are, in fact, comparing our test results with some gold standard—an external source of “truth” regarding the disease status of each individual in the population. Sometimes this truth may be the result of another test that has been in use, and sometimes it is the result of a more definitive, and often more invasive, test (e.g., tumor biopsy, cardiac catheterization, or tissue biopsy). However, in real life, when we use a test to identify diseased and nondiseased persons in a population, we clearly do not know who has the disease and who does not. (If this were already established, testing would be pointless.) But to quantitatively assess the sensitivity and specificity of a test, we must have another source of truth with which to compare the test results.

 

Table 5.2 compares the results of a dichotomous test (results are unambiguously either positive or negative) with the actual disease status. Ideally, we would like all of the tested subjects to fall into the two cells shown in the upper left and lower right on the table: people with the disease who are correctly called “positive” by the test (true positives) and people without the disease who are correctly called “negative” by the test (true negatives). Unfortunately, such is rarely if ever the case. Some people who do not have the disease are erroneously called “positive” by the test (false positives), and some people with the disease are erroneously called “negative” (false negatives).

 

TABLE 5.2

 

Comparison of the Results of a Dichotomous Test With Disease Status

 

Why are these issues important? When we conduct a screening program, we often have a large group of people who screened positive, including both people who really have the disease (true positives) and people who do not have the disease (false positives). The issue of false positives is important because all people who screened positive are brought back for more sophisticated and more expensive tests or perhaps undergo an invasive procedure that is not necessary. Of the several problems that result, the first is a burden on the health care system. Another is the anxiety and worry induced in persons who have been told that they have tested positive. Considerable evidence indicates that many people who are labeled “positive” by a screening test never have that label completely erased, even if the results of a subsequent evaluation are negative. For example, children labeled “positive” in a screening program for heart disease may be handled as handicapped by parents and school personnel even after being told that subsequent more definitive tests were negative. In addition, such individuals may be limited in regard to employment and insurability by erroneous interpretation of positive screening test results, even if subsequent tests fail to substantiate any positive finding.

 

Why is the problem of false negatives important? If a person has the disease but is erroneously informed that the test result is negative, and if the disease is a serious one for which effective intervention is available, the problem is indeed critical. For example, if the disease is a type of cancer that is curable only in its early stages, a false-negative result could represent a virtual death sentence. Thus the importance of false-negative results depends on the nature and severity of the disease being screened for, the effectiveness of available intervention measures, and whether the effectiveness is greater if the intervention is administered early in the natural history of the disease.

 

Tests of Continuous Variables

So far we have discussed a test with only two possible results: positive or negative. But we often test for a continuous variable, such as blood pressure or blood glucose level, for which there is no obvious “positive” or “negative” result. A decision must therefore be made in establishing a cutoff level above which a test result is considered positive and below which a result is considered negative. Let’s consider the diagrams shown in Fig. 5.3.

 

 

FIG. 5.3 (A to G) The effects of choosing different cutoff levels to define a positive test result when screening for diabetes using a continuous marker, blood sugar, in a hypothetical population. (See discussion in the text under the subheading “Tests of Continuous Variables” on page 97.)

Fig. 5.3A shows a population of 20 diabetics and 20 nondiabetics who are being screened using a blood sugar test whose scale is shown along the vertical axis from high to low. The diabetics are represented by blue circles and the nondiabetics by red circles. We see that although blood sugar levels tend to be higher in diabetics than in nondiabetics, no level clearly separates the two groups; there is some overlap of diabetics and nondiabetics at every blood sugar level. Nevertheless, we must select a cutoff point so that those whose results fall above the cutoff can be called “positive,” and can be called back for further testing, and those whose results fall below that point are called “negative,” and are not called back for further testing.

 

Suppose a relatively high cutoff level is chosen (see Fig. 5.3B). Clearly, many of the diabetics will not be identified as positive; on the other hand, most of the nondiabetics will be correctly identified as negative. If these results are distributed on a 2 × 2 table, the sensitivity of the test using this cutoff level will be 25% (5/20) and the specificity will be 90% (18/20). So, most of the diabetics will not be detected, but most of the nondiabetics will be correctly classified.

 

What if a low cutoff level is chosen (see Fig. 5.3C)? Very few diabetics would be misdiagnosed. What, then, is the problem? A large proportion of the nondiabetics are now identified as positive by the test. As seen in the 2 × 2 table, the sensitivity is now 85% (17/20), but the specificity is only 30% (6/20).
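The sensitivity and specificity figures for the two hypothetical cutoffs can be verified with a short helper; this is a sketch, with the 2 × 2 counts read off Fig. 5.3B and C (20 diabetics, 20 nondiabetics):

```python
def sens_spec(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from 2x2 table counts."""
    return tp / (tp + fn), tn / (tn + fp)

# High cutoff (Fig. 5.3B): only 5 of 20 diabetics test positive,
# but 18 of 20 nondiabetics are correctly negative
high_sens, high_spec = sens_spec(tp=5, fn=15, tn=18, fp=2)

# Low cutoff (Fig. 5.3C): 17 of 20 diabetics test positive,
# but only 6 of 20 nondiabetics are correctly negative
low_sens, low_spec = sens_spec(tp=17, fn=3, tn=6, fp=14)

print(high_sens, high_spec)   # 0.25 and 0.90
print(low_sens, low_spec)     # 0.85 and 0.30
```

Moving the cutoff trades one error for the other: raising it improves specificity at the cost of sensitivity, and lowering it does the reverse.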

 

The difficulty is that in the real world, no vertical line separates the diabetics and nondiabetics, and they are indeed mixed together (see Fig. 5.3D); in fact, they are not even distinguishable by red or blue circles (see Fig. 5.3E). So if a high cutoff level is used (see Fig. 5.3F), all those with results below the line will be assured they do not have the disease and will not be followed further; if the low cutoff is used (see Fig. 5.3G), all those with results above the line will be brought back for further testing.

 

Fig. 5.4A shows actual data from a historical report regarding the distribution of blood sugar levels in diabetics and nondiabetics. Suppose we were to screen this population. If we decide to set the cutoff level so that we identify all of the diabetics (100% sensitivity), we could set the level at 80 mg/dL (see Fig. 5.4B). The problem is, however, that in so doing we will also call many of the nondiabetics positive—that is, the specificity will be very low. On the other hand, if we set the level at 200 mg/dL (see Fig. 5.4C) so that we call all the nondiabetics negative (100% specificity), we now miss many of the true diabetics because the sensitivity will be very low. Thus there is a trade-off between sensitivity and specificity: if we increase the sensitivity by lowering the cutoff level, we decrease the specificity; if we increase the specificity by raising the cutoff level, we decrease the sensitivity. To quote an unknown sage: “There is no such thing as a free lunch.”

 

 

FIG. 5.4 (A) Distribution of blood sugar levels in hospital patients with diabetes and without diabetes. (The number of people with diabetes is shown for each specific blood sugar level in the [upper] distribution for persons without diabetes. Because of limited space, the number of people for each specific level of blood sugar is not shown in the [lower] distribution for persons with diabetes.) (B) and (C) show two different blood sugar cutpoints that were used in the study to define diabetes. Data from the graphs are presented to the right of each graph in a 2 × 2 table. (B) When a blood sugar cutpoint of ≥80 mg/dL is used to define diabetes in this population, sensitivity of the screening test is 100%, but specificity is low. (C) When a blood sugar cutpoint of ≥200 mg/dL is used to define diabetes in this population, sensitivity of the screening test is low, but specificity is 100%. (See explanation in the text under the subheading “Tests of Continuous Variables” on page 97.) FN, False negatives; FP, false positives; TN, true negatives; TP, true positives. (Modified from Blumberg M. Evaluating health screening procedures. Oper Res. 1957;5:351–360.)

The dilemma involved in deciding whether to set a high cutoff or a low cutoff rests in the problem of the false positives and the false negatives that result from the testing. It is important to remember that in screening we end up with groups classified only on the basis of the results of their screening tests, either positive or negative. We have no information regarding their true disease status, which, of course, is the reason for carrying out the screening. In effect, the results of the screening test yield not four groups, as seen in Fig. 5.5, but rather two groups: one group of people who tested positive and one group who tested negative. Those who tested positive will be notified of their test result and will be asked to return for additional examinations. The other group, who tested negative, will be notified that their test result was negative and will therefore not be asked to return for further testing (Fig. 5.6).

 

 

FIG. 5.5 Diagram showing four possible groups resulting from screening with a dichotomous test.

 

FIG. 5.6 Diagram showing the two groups of people resulting from screening with a dichotomous screening test: all people with positive test results and all people with negative test results.

The choice of a high or a low cutoff level for screening therefore depends on the importance we attach to false positives and false negatives. False positives are associated with costs—emotional and financial—as well as with the difficulty of “delabeling” a person who tests positive and is later found not to have the disease. In addition, false-positive results may pose a major burden to the health care system, in that a large group of people need to be brought back for a retest, when only a few of them may have the disease. Those with false-negative results, on the other hand, will be told they do not have the disease and will not be followed, so a serious disease might possibly be missed at an early treatable stage. Thus the choice of cutoff level relates to the relative importance of false positivity and false negativity for the disease in question.

 

Use of Multiple Tests

Often more than one screening test may be applied in the same individuals to detect an illness—either sequentially (one after another) or simultaneously (both conducted at the same time). The results of these approaches are described in this section.

 

Sequential (Two-Stage) Testing

In sequential (or two-stage) screening, a less expensive, less invasive, or less uncomfortable test is generally performed first, and those who screen positive are recalled for further testing with a more expensive, more invasive, or more uncomfortable test, which may have greater sensitivity and specificity. It is hoped that bringing back for further testing only those who screen positive will reduce the problem of false positives.

 

Consider the hypothetical example in Fig. 5.7A, in which a population is screened for diabetes using a test with a sensitivity of 70% and a specificity of 80%. How are the data shown in this table obtained? The disease prevalence in this population is given as 5%, so that in the population of 10,000, 500 persons have the disease. With a sensitivity of 70%, the test will correctly identify 350 of the 500 people who have the disease. With a specificity of 80%, the test will correctly identify as nondiabetic 7,600 of the 9,500 people who are free of diabetes; however, 1,900 of these 9,500 will have positive results. Thus a total of 2,250 people will test positive and will be brought back for a second test. (Remember that in real life we do not have the vertical line separating diabetics and nondiabetics, and we do not know that only 350 of the 2,250 have diabetes.)

 

 

FIG. 5.7 Hypothetical example of a two-stage screening program. (A) Findings using Test 1 in a population of 10,000 people. (B) Findings using Test 2 in participants who tested positive using Test 1. (See explanation in the text under the subheading “Sequential (Two-Stage) Testing” on page 99.)

Now those 2,250 people are brought back and screened using a second test (such as a glucose tolerance test), which, for purposes of this example, is assumed to have a sensitivity of 90% and a specificity of 90%. Fig. 5.7B shows test 1 together with test 2, which deals only with the 2,250 people who tested positive in the first screening test and have been brought back for second-stage screening.

 

Since 350 people (of the 2,250) have the disease and the test has a sensitivity of 90%, 315 of those 350 will be correctly identified as positive. Because 1,900 (of the 2,250) do not have diabetes and the test specificity is 90%, 1,710 of the 1,900 will be correctly identified as negative and 190 will be false positives.

 

We are now able to calculate the net sensitivity and the net specificity of using both tests in sequence. After finishing both tests, 315 people of the total 500 people with diabetes in this population of 10,000 will have been correctly called positive: 315/500 = 63% net sensitivity (which can also be calculated by multiplying the sensitivity of the first test times the sensitivity of the second test; i.e., 0.70 × 0.90 = 0.63). Thus there is a loss in net sensitivity by using both tests sequentially. To calculate net specificity, note that 7,600 people of the 9,500 in this population who do not have diabetes were correctly called negative in the first-stage screening and were not tested further; an additional 1,710 of those 9,500 nondiabetics were correctly called negative in the second-stage screening. Thus a total of 7,600 + 1,710 of the 9,500 nondiabetics were correctly called negative: 9,310/9,500 = 98% net specificity. Thus use of both tests in sequence has resulted in a gain in net specificity.
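The sequential-screening arithmetic above generalizes to any pair of tests. A minimal Python sketch (the helper function and its name are ours, not from the text):

```python
def sequential_net(sens1, spec1, sens2, spec2):
    """Net sensitivity/specificity when test 2 is given only to those
    who screen positive on test 1 (positive = positive on both tests)."""
    # A diseased person is detected only if both tests catch them.
    net_sensitivity = sens1 * sens2
    # A non-diseased person is cleared at stage 1 (spec1), or is a
    # stage-1 false positive who is correctly cleared at stage 2.
    net_specificity = spec1 + (1 - spec1) * spec2
    return net_sensitivity, net_specificity

# Diabetes example from Fig. 5.7: test 1 (70%, 80%), test 2 (90%, 90%)
sens, spec = sequential_net(0.70, 0.80, 0.90, 0.90)
print(f"net sensitivity = {sens:.0%}")  # 63%
print(f"net specificity = {spec:.0%}")  # 98%
```

The multiplication of sensitivities reproduces the 0.70 × 0.90 = 0.63 shortcut given in the text, and the specificity line accumulates the true negatives from both stages (7,600 + 1,710 of 9,500).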

 

Simultaneous Testing

Let’s now turn to the use of simultaneous tests. We assume that in a population of 1,000 people, the prevalence of a disease is 20%. Therefore 200 people have the disease, but we do not know who they are. In order to identify the 200 people who have this disease, we screen this population of 1,000 using two tests for this disease, test A and test B, at the same time. We assume that the sensitivity and specificity of the two tests are as follows:

 

Test A Test B

Sensitivity = 80%       Sensitivity = 90%

Specificity = 60%       Specificity = 90%

Net Sensitivity Using Two Simultaneous Tests

The first question we ask is, “What is the net sensitivity using test A and test B simultaneously?” To be considered positive and therefore included in the numerator for net sensitivity for two tests used simultaneously, a person must be identified as positive by test A, test B, or both tests.

 

To calculate net sensitivity, let’s first consider the results of screening with test A whose sensitivity is 80%: of the 200 people who have the disease, 160 test positive (Table 5.3). In Fig. 5.8A, the oval represents the 200 people who have the disease. In Fig. 5.8B the pink circle within the oval represents the 160 who test positive with test A. These 160 are the true positives using test A.

 

TABLE 5.3

 

Results of Screening With Test A

 

 

FIG. 5.8 (A to F) Net sensitivity: hypothetical example of simultaneous testing. (See explanation in the text under the subheading “Net Sensitivity Using Two Simultaneous Tests” on page 102.)

Consider next the results of screening with test B whose sensitivity is 90% (Table 5.4). Of the 200 people who have the disease, 180 test positive by test B. In Fig. 5.8C, the oval again represents the 200 people who have the disease. The blue circle within the oval represents the 180 who test positive with test B. These 180 are the true positives using test B.

 

TABLE 5.4

 

Results of Screening With Test B

 

In order to calculate the numerator for net sensitivity, we cannot just add the number of persons who tested positive using test A to those who tested positive using test B because some people tested positive on both tests. These people are shown in lavender by the overlapping area of the two circles, and we do not want to count them twice (see Fig. 5.8D). How do we determine how many people tested positive on both tests?

 

Test A has a sensitivity of 80% and thus identifies as positive 80% of the 200 who have the disease (160 people). Test B has a sensitivity of 90%. Therefore it identifies as positive 90% of the same 160 people who are identified by test A (144 people). Thus when tests A and B are used simultaneously, 144 people are identified as positive by both tests (see Fig. 5.8E).

 

Recall that test A correctly identified 160 people with the disease as positive. Because 144 of them were identified by both tests, 160 − 144, or 16 people, were correctly identified only by test A.

 

Test B correctly identified 180 of the 200 people with the disease as positive. Because 144 of them were identified by both tests, 180 − 144, or 36 people, were correctly identified only by test B. Thus, as seen in Fig. 5.8F, using tests A and B simultaneously:

Net sensitivity = (16 + 144 + 36) / 200 = 196/200 = 98%

Net Specificity Using Two Simultaneous Tests

The next question is, “What is the net specificity using test A and test B simultaneously?” To be included in the numerator for net specificity for two tests used simultaneously, a person must be identified as negative by both tests. In order to calculate the numerator for net specificity, we therefore need to determine how many people had negative results on both tests. How do we do this?

 

Test A has a specificity of 60% and thus correctly identifies 60% of the 800 who do not have the disease (480 people; Table 5.5). In Fig. 5.9A, the oval represents the 800 people who do not have the disease. The green circle within the oval in Fig. 5.9B represents the 480 people who test negative with test A. These are the true negatives using test A.

 

TABLE 5.5

 

Results of Screening With Test A

 

 

FIG. 5.9 (A to F) Net specificity: hypothetical example of simultaneous testing. (See explanation in the text under the subheading “Net Specificity Using Two Simultaneous Tests” on page 104.)

Test B has a specificity of 90% and thus identifies as negative 90% of the 800 people who do not have the disease (720 people; Table 5.6 and the yellow circle in Fig. 5.9C). However, to be called negative in simultaneous tests, only people who test negative on both tests are considered to have had negative results (see Fig. 5.9D). These people are shown in light green by the overlapping area of the two circles. Test B also identifies as negative 90% of the same 480 people identified as negative by test A (432 people). Thus, as shown by the overlapping circles, when tests A and B are used simultaneously, 432 people are identified as negative by both tests (see Fig. 5.9E). Thus when tests A and B are used simultaneously (see Fig. 5.9F):

Net specificity = 432/800 = 54%

TABLE 5.6

 

Results of Screening With Test B

 

Therefore when two simultaneous tests are used, there is a net gain in sensitivity (from 80% using test A and 90% using test B to 98% using both tests simultaneously). However, there is a net loss in specificity (net specificity = 54%) compared with using either test alone (specificity of 60% using test A and 90% using test B).
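Both net figures for simultaneous testing follow directly from the same reasoning: the disease is missed only when both tests miss it, and a person is cleared only when both tests agree on negative. A minimal Python sketch (helper name is ours):

```python
def simultaneous_net(sens_a, spec_a, sens_b, spec_b):
    """Net sensitivity/specificity when both tests are given to everyone
    (positive = positive on either test; negative = negative on both)."""
    # The disease is missed only when BOTH tests miss it.
    net_sensitivity = 1 - (1 - sens_a) * (1 - sens_b)
    # A person is called negative only when BOTH tests read negative.
    net_specificity = spec_a * spec_b
    return net_sensitivity, net_specificity

# Example from the text: test A (80%, 60%), test B (90%, 90%)
sens, spec = simultaneous_net(0.80, 0.60, 0.90, 0.90)
print(f"net sensitivity = {sens:.0%}")  # 98%
print(f"net specificity = {spec:.0%}")  # 54%
```

The output matches the counts derived above: 196/200 = 98% net sensitivity and 432/800 = 54% net specificity.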

 

Comparison of Simultaneous and Sequential Testing

In a clinical setting, multiple tests are often used simultaneously. For example, a patient admitted to a hospital may have an array of tests performed at the time of admission. When multiple tests are used simultaneously to detect a specific disease, the individual is generally considered to have tested “positive” if he or she has a positive result on any one or more of the tests. The individual is considered to have tested “negative” if he or she tests negative on all of the tests. The effects of such a testing approach on sensitivity and specificity differ from those that result from sequential testing. In sequential testing, when we retest those who tested positive on the first test, there is a loss in net sensitivity and a gain in net specificity. In simultaneous testing, because an individual who tests positive on any one or multiple tests is considered positive, there is a gain in net sensitivity. However, to be considered negative, a person would have to test negative on all the tests performed. As a result, there is a loss in net specificity.

 

In summary, as we have seen previously, when two sequential tests are used and those who test positive by the first test are brought in for the second test, there is a net loss in sensitivity, but a net gain in specificity, compared with either test alone. However, when two simultaneous tests are used, there is a net gain in sensitivity and a net loss in specificity, compared with either test alone.

 

Given these results, the decision to use either sequential or simultaneous testing often is based both on the objectives of the testing, including whether testing is being done for screening or diagnostic purposes, and on practical considerations related to the setting in which the testing is being done, including the length of hospital stay, costs, and degree of invasiveness of each of the tests, as well as the extent of third-party insurance coverage. Fig. 5.10 shows a physician dealing with perceived information overload.

 

 

FIG. 5.10 “Whoa—way too much information.” A physician comments on excessive information. (Alex Gregory/The New Yorker Collection/The Cartoon Bank.)

Predictive Value of a Test

So far we have asked, “How good is the test at identifying people with the disease and people without the disease?” This is an important issue, particularly in screening free-living populations who have no symptoms of the disease being evaluated. In effect, we are asking, “If we screen a population, what proportion of people who have the disease will be correctly identified?” This is clearly an important public health consideration. In the clinical setting, however, a different question may be important for the clinician: If the test results are positive in this patient, what is the probability that this patient has the disease? This is called the positive predictive value (PPV) of the test. In other words, what proportion of patients who test positive actually have the disease in question? To calculate the PPV, we divide the number of true positives by the total number who tested positive (true positives + false positives).

 

Let’s return to the example shown in Table 5.1, in which a population of 1,000 persons is screened. As seen in Table 5.7, a 2 × 2 table shows the results of a dichotomous screening test in that population. Of the 1,000 subjects, 180 have a positive test result; of these 180 subjects, 80 have the disease. Therefore the PPV is 80/180, or 44%.

 

TABLE 5.7

 

Predictive Value of a Test

 

A parallel question can be asked about negative test results: “If the test result is negative, what is the probability that this patient does not have the disease?” This is called the negative predictive value (NPV) of the test. It is calculated by dividing the number of true negatives by all those who tested negative (true negatives + false negatives). Looking again at the example in Table 5.7, 820 people have a negative test result, and of these, 800 do not have the disease. Thus the NPV is 800/820, or 98%.
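Both predictive values can be checked with a small helper (illustrative sketch; the counts are those described for Table 5.7):

```python
def predictive_values(tp, fp, fn, tn):
    """PPV = TP/(TP + FP); NPV = TN/(TN + FN) from a 2x2 screening table."""
    return tp / (tp + fp), tn / (tn + fn)

# Counts described for Table 5.7: 180 positives (80 with disease),
# 820 negatives (800 without disease).
ppv, npv = predictive_values(tp=80, fp=100, fn=20, tn=800)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")  # PPV = 44%, NPV = 98%
```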

 

Every test that a clinician performs—history, physical examination, laboratory tests, x-rays, electrocardiograms, and other procedures—is used to enhance the likelihood of making the correct diagnosis. What he or she wants to know after administering a test to a patient is: “Given this positive test result, what is the likelihood that the patient has the disease?”

 

Unlike the sensitivity and specificity of the test, which can be considered characteristic of the test being used, the PPV is affected by two factors: the prevalence of the disease in the population tested and, when the disease is infrequent, the specificity of the test being used. Both of these relationships are discussed in the following sections.

 

Relationship Between Positive Predictive Value and Disease Prevalence

In the discussion of predictive value that follows, the term predictive value is used to denote the positive predictive value of the test.

 

The relationship between predictive value and disease prevalence can be seen in the example given in Table 5.8. First, let’s direct our attention to the upper part of the table. Assume that we are using a test with a sensitivity of 99% and a specificity of 95% in a population of 10,000 people in which the disease prevalence is 1%. Because the prevalence is 1%, 100 of the 10,000 persons have the disease and 9,900 do not. With a sensitivity of 99%, the test correctly identifies 99 of the 100 people who have the disease. With a specificity of 95%, the test correctly identifies as negative 9,405 of the 9,900 people who do not have the disease. Thus, in this population with a 1% prevalence, 594 people are identified as positive by the test (99 + 495). However, of these 594 people, 495 (83%) are false positives and the PPV is therefore 99/594, or only 17%.

 

TABLE 5.8

 

Relationship of Disease Prevalence to Positive Predictive Value

Example: sensitivity = 99%, specificity = 95%

Disease Prevalence   Test Results   Sick    Not Sick   Totals   Positive Predictive Value
1%                   +                99        495       594   99/594 = 17%
                     −                 1      9,405     9,406
                     Totals          100      9,900    10,000
5%                   +               495        475       970   495/970 = 51%
                     −                 5      9,025     9,030
                     Totals          500      9,500    10,000

Let’s now apply the same test—with the same sensitivity and specificity—to a population with a higher disease prevalence, 5%, as seen in the lower part of Table 5.8. Using calculations similar to those used in the upper part of the table, the PPV is now 51%. Thus the higher prevalence in the screened population has led to a marked increase in the PPV using the same test. Fig. 5.11 shows the relationship between disease prevalence and predictive value from a classic example. Clearly most of the gain in predictive value occurs with increases in prevalence at the lowest rates of disease prevalence.
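The prevalence effect can be reproduced from rates alone, without building the full table. A short, illustrative Python sketch (function name is ours):

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value computed from population rates."""
    true_pos = prevalence * sensitivity          # diseased who test positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy who test positive
    return true_pos / (true_pos + false_pos)

# Table 5.8 conditions: sensitivity 99%, specificity 95%
for prev in (0.01, 0.05):
    print(f"prevalence {prev:.0%}: PPV = {ppv(prev, 0.99, 0.95):.0%}")
# prevalence 1%: PPV = 17%
# prevalence 5%: PPV = 51%
```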

 

 

FIG. 5.11 Relationship between disease prevalence and predictive value in a test with 95% sensitivity and 95% specificity. (From Mausner JS, Kramer S. Mausner and Bahn Epidemiology: An Introductory Text. Philadelphia: WB Saunders; 1985:221.)

Why should we be concerned about the relationship between predictive value and disease prevalence? As we have seen, the higher the prevalence, the higher the predictive value. Therefore a screening program is most productive and more cost-effective if it is directed to a high-risk target population. Screening a total population for a relatively infrequent disease can be a wasteful use of resources and may yield few previously undetected cases relative to the amount of effort involved. However, if a high-risk subset can be identified and screening can be directed to this group, the program is likely to be far more productive. In addition, a high-risk population may be more motivated to participate in such a screening program and more likely to take recommended action if their screening results are positive.

 

The relationship between predictive value and disease prevalence also shows that the results of any test must be interpreted in the context of the prevalence of the disease in the population from which the subject originates. An interesting example is seen with the measurement of the maternal serum α-fetoprotein (MSAFP) level for prenatal diagnosis of spina bifida. Fig. 5.12 shows the distribution of MSAFP levels in normal unaffected pregnancies and in pregnancies in which the fetus has Down syndrome; spina bifida, which is a neural tube defect; or anencephaly. For the purpose of this example, we will focus on the curves for unaffected pregnancies and spina bifida. Although the distribution of these two curves is bimodal, there is a range in which the curves overlap, and within that range, it may not always be clear to which curve the mother and fetus belong. If MSAFP is in the higher range for an unaffected pregnancy, the true prevalence of spina bifida will be low for the same range. Thus such overlap in the MSAFP in the unaffected pregnancies and those with fetuses with spina bifida has led to the test having a very low PPV, of only 2% to 6%.1

 

 

FIG. 5.12 Maternal serum alpha-fetoprotein (MSAFP) distribution for singleton pregnancies at 15 to 20 weeks. The screen cutoff value of 2.5 multiples of the median is expected to result in a false-positive rate of up to 5% (black hatched area) and false-negative rates of up to 20% for spina bifida (orange hatched area) and 10% for anencephaly (red hatched area). (Modified from Prenatal diagnosis. In: Cunningham F, Leveno KJ, Bloom SL, et al, eds. Williams Obstetrics. 24th ed. New York: McGraw-Hill; 2013. http://accessmedicine.mhmedical.com.ezp.welch.jhmi.edu/content.aspx?bookid=1057&sectionid=59789152. Accessed June 19, 2017.)

It is possible that the same test can have a very different predictive value when it is administered to a high-risk (high prevalence) population or to a low-risk (low prevalence) population. This has clear clinical implications: A woman may make a decision to terminate a pregnancy, and a physician may formulate advice to such a woman on the basis of the test results. However, the same test result may be interpreted differently, depending on whether the woman comes from a pool of high-risk or low-risk women, which will be reflected in the PPV of the test. Consequently, by itself, the test result may not be sufficient to serve as a guide without taking into account the other considerations just described.

 

The following true examples highlight the importance of this issue:

 

The head of a firefighters’ union consulted a university cardiologist because the fire department physician had read an article in a leading medical journal reporting that a certain electrocardiographic finding was highly predictive of serious, generally unrecognized, coronary heart disease. On the basis of this article, the fire department physician was disqualifying many young, able-bodied firefighters from active duty. The cardiologist read the paper and found that the study had been carried out in hospitalized patients.

 

What was the problem? Because hospitalized patients have a much higher prevalence of heart disease than does a group of young, able-bodied firefighters, the fire department physician had erroneously taken the high predictive value obtained in studying a high-prevalence population and inappropriately applied it to a low-prevalence population of healthy firefighters, in whom the same test would actually have a much lower predictive value.

 

Here is another example:

 

A physician visited his general internist for a regular annual medical examination, which included a stool examination for occult blood. One of the three stool specimens examined in the test was positive. The internist told his physician-patient that the result was of no significance because he regularly encountered many false-positive test results in his busy practice. The test was repeated on three new stool specimens, and all three of the new specimens were now negative. Nevertheless, sensing his patient’s lingering concerns, the internist referred his physician-patient to a gastroenterologist. The gastroenterologist said, “In my experience, the positive stool finding is serious. Such a finding is almost always associated with pathologic gastrointestinal disorders. The subsequent negative test results mean nothing, because you could have a tumor that only bleeds intermittently.”

 

Who was correct in this episode? The answer is that both the general internist and the gastroenterologist were correct. The internist gave his assessment of predictive value based on his experience in his general medical practice—a population with a low prevalence of serious gastrointestinal disease. On the other hand, the gastroenterologist gave his assessment of the predictive value of the test based on his experience in his referral practice—a practice in which most patients are referred because of a likelihood of serious gastrointestinal illness, a high-prevalence population.

 

Relationship Between Positive Predictive Value and Specificity of the Test

In the discussion that follows, the term predictive value is used to denote the PPV of the test.

 

A second factor that affects the predictive value of a test is the specificity of the test. Examples of this are shown first in graphical form and then in tabular form. Fig. 5.13A to D diagrams the results of screening a population; however, the 2 × 2 tables in these figures differ from those shown in earlier figures. Each cell is drawn with its size proportional to the population it represents. In each figure the cells that represent persons who tested positive are shaded blue; these are the cells that will be used in calculating the PPV.

 

 

FIG. 5.13 (A to D) Relationship of specificity to positive predictive value (PPV). (See explanation in the text under the subheading “Relationship Between Positive Predictive Value and Specificity of the Test” on page 109.)

Fig. 5.13A presents the baseline screened population that is used in our discussion: a population of 1,000 people in whom the prevalence is 50%; thus 500 people have the disease and 500 do not. In analyzing this figure, we also assume that the screening test that was used has a sensitivity of 50% and a specificity of 50%. Because 500 people tested positive, and 250 of these have the disease, the predictive value is 250/500, or 50%.

 

Fortunately, the prevalence of most diseases is much lower than 50%; we are generally dealing with relatively infrequent diseases. Therefore Fig. 5.13B assumes a lower prevalence of 20% (although even this would be an unusually high prevalence for most diseases). Both the sensitivity and the specificity remain at 50%. Now only 200 of the 1,000 people have the disease, and the vertical line separating diseased from nondiseased persons is shifted to the left. The predictive value is now calculated as 100/500, or 20%.

 

Given that we are screening a population with the lower prevalence rate, can we improve the predictive value? What would be the effect on predictive value if we increased the sensitivity of the test? Fig. 5.13C shows the results when we leave the prevalence at 20% and the specificity at 50%, but increase the sensitivity to 90%. The predictive value is now 180/580, or 31%—a modest increase.

 

What if, instead of increasing the sensitivity of the test, we increase its specificity? Fig. 5.13D shows the results when prevalence remains 20% and sensitivity remains 50%, but specificity is increased to 90%. The predictive value is now 100/180, or 56%. Thus an increase in specificity resulted in a much greater increase in predictive value than the same increase in sensitivity.

 

Why does specificity have a greater effect than sensitivity on predictive value? The answer becomes clear by examining these figures. Because we are dealing with infrequent diseases, most of the population falls to the right of the vertical line. Consequently, any change to the right of the vertical line affects a greater number of people than would a comparable change to the left of the line. Thus a change in specificity has a greater effect on predictive value than a comparable change in sensitivity. If we were dealing with a high-prevalence disease, the situation would be different.

 

The effect of changes in specificity on predictive value is also seen in Table 5.9 in a form similar to that used in Table 5.8. As seen in this example, even with 100% sensitivity, a change in specificity from 70% to 95% has a dramatic effect on the PPV.

 

TABLE 5.9

 

Relationship of Specificity to Positive Predictive Value

Example: prevalence = 10%, sensitivity = 100%

Specificity   Test Results   Sick     Not Sick   Totals   Predictive Value
70%           +              1,000      2,700     3,700   1,000/3,700 = 27%
              −                  0      6,300     6,300
              Totals         1,000      9,000    10,000
95%           +              1,000        450     1,450   1,000/1,450 = 69%
              −                  0      8,550     8,550
              Totals         1,000      9,000    10,000
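The predictive values in Table 5.9 can be verified from the rates alone (illustrative sketch; function name is ours):

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value computed from population rates."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Table 5.9 conditions: prevalence 10%, sensitivity 100%
for spec in (0.70, 0.95):
    print(f"specificity {spec:.0%}: PPV = {ppv(0.10, 1.00, spec):.0%}")
# specificity 70%: PPV = 27%
# specificity 95%: PPV = 69%
```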

Reliability (Repeatability) of Tests

Let’s consider another aspect of assessing diagnostic and screening tests—the question of whether a test is reliable or repeatable. Can the results obtained be replicated (getting the same result) if the test is repeated? Clearly, regardless of the sensitivity and specificity of a test, if the test results cannot be reproduced, the value and usefulness of the test are minimal. The rest of this chapter focuses on the reliability or repeatability of diagnostic and screening tests. The factors that contribute to the variation between test results are discussed first: intrasubject variation (variation within individual subjects), intraobserver variation (variation in the reading of test results by the same reader), and interobserver variation (variation between those reading the test results).

 

Intrasubject Variation

The values obtained in measuring many human characteristics often vary over time, even during a short period of 24 hours, or a longer period, such as seasonal variation. Fig. 5.14 shows changes in blood pressure readings over a 24-hour period in 28 normotensive individuals. Variability over time is considerable. This, as well as the conditions under which certain tests are conducted (e.g., shortly after eating or post-exercise, at home or in a physician’s office), clearly can lead to different results in the same individual. Therefore in evaluating any test result, it is important to consider the conditions under which the test was performed, including the time of day.

 

 

FIG. 5.14 Endogenous circadian variation in blood pressure during the constant routine protocol. DBP, Diastolic blood pressure; HR, heart rate; SBP, systolic blood pressure. (From Shea SA, Hilton MF, Hu K, et al. Existence of an endogenous circadian blood pressure rhythm in humans that peaks in the evening. Circ Res. 2011;108:980–984.)

Intraobserver Variation

Sometimes variation occurs between two or more readings of the same test results made by the same observer. For example, a radiologist who reads the same group of x-rays at two different times may read one or more of the x-rays differently the second time. Tests and examinations differ in the degree to which subjective factors enter into the observer’s conclusions, and the greater the subjective element in the reading, the greater the intraobserver variation in readings is likely to be (Fig. 5.15).

 

 

FIG. 5.15 “This is a second opinion. At first, I thought you had something else.” One view of a second opinion. (Leo Cullum/The New Yorker Collection/The Cartoon Bank.)

Interobserver Variation

Another important consideration is variation between observers. Two examiners often do not give the same result. The extent to which observers agree or disagree is an important issue, whether we are considering physical examinations, laboratory tests, or other means of assessing human characteristics. We therefore need to be able to express the extent of agreement in quantitative terms.

 

Percent Agreement

Table 5.10 shows a schema for examining variation between observers. Two observers were instructed to categorize each test result into one of the following four categories: abnormal, suspect, doubtful, and normal. This diagram might refer, for example, to readings performed by two radiologists. In this diagram, the readings of observer 1 are cross-tabulated against those of observer 2. The number of readings in each cell is denoted by a letter of the alphabet. Thus A x-rays were read as abnormal by both radiologists. C x-rays were read as abnormal by radiologist 2 and as doubtful by radiologist 1. M x-rays were read as abnormal by radiologist 1 and as normal by radiologist 2.

 

TABLE 5.10

 

Observer or Instrument Variation: Percent Agreement

 

As seen in Table 5.10, to calculate the overall percent agreement, we add the numbers in all of the cells in which readings by both radiologists agreed (A + F + K + P), divide that sum by the total number of x-rays read, and multiply the result by 100 to yield a percentage. Fig. 5.16A shows the use of this approach for a test with possible readings of either “positive” or “negative.”

 

 

FIG. 5.16 Calculating the percent agreement between two observers. (A) Percent agreement when examining paired observations between observer 1 and observer 2. (B) Percent agreement when examining paired observations between observer 1 and observer 2, considering that cell d (agreement on the negatives) is very high. (C) Percent agreement when examining paired observations between observer 1 and observer 2, ignoring cell d. (D) Percent agreement when examining paired observations between observer 1 and observer 2, using only cells a, b, and c for the calculation.

In general, most persons who are tested have negative results. This is shown in Fig. 5.16B, in which the size of each cell is drawn in proportion to the number of people in that cell. There is likely to be considerable agreement between the two observers about these negative, or normal, subjects (cell d). Therefore when percent agreement is calculated for all study subjects, its value may be high only because of the large number of clearly negative findings (cell d) on which the observers agree. Thus the high value may conceal significant disagreement between the observers in identifying subjects who are considered positive by at least one observer.

 

One approach to this problem, seen in Fig. 5.16C, is to disregard the subjects who were labeled negative by both observers (cell d) and to calculate percent agreement using as a denominator only the subjects who were labeled abnormal by at least one observer (cells a, b, and c; Fig. 5.16D).
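This adjustment is easy to express in code. A small illustrative sketch with invented counts (the large cell d is hypothetical, chosen to show how it inflates overall agreement):

```python
def percent_agreement(a, b, c, d, ignore_both_negative=False):
    """Percent agreement between two observers for a 2x2 table:
    a = both positive, b and c = disagreements, d = both negative."""
    if ignore_both_negative:
        # Drop cell d: agreement among pairs with at least one positive.
        return 100 * a / (a + b + c)
    return 100 * (a + d) / (a + b + c + d)

# Hypothetical counts: mostly clear negatives (large cell d)
print(percent_agreement(10, 5, 5, 80))                             # 90.0
print(percent_agreement(10, 5, 5, 80, ignore_both_negative=True))  # 50.0
```

With cell d included, the observers appear to agree 90% of the time; restricted to subjects at least one observer called positive, agreement falls to 50%.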

 

Thus in the paired observations in which at least one of the findings in each pair was positive, the following equation is applicable:

Percent agreement = (a / [a + b + c]) × 100

where a is the number of pairs read as positive by both observers, and b and c are the pairs on which the two observers disagree.

Kappa Statistic

Percent agreement between two observers is often of value in assessing the quality of their observations. The extent to which two observers, such as two physicians or two nurses, for example, agree with one another is often an important index of the quality of the health care being provided. However, the percent agreement between two observers does not entirely depend on the quality of their training and practice. The extent of their agreement is also significantly influenced by the fact that even if two observers use completely different criteria to identify subjects as positive or negative, we would expect the observers to agree about the observations made, at least in some of the participants, solely as a function of chance. What we really want to know is how much better their level of agreement is than that which results just from chance. The answer to this question will presumably tell us, for example, to what extent the education and training that the observers received improved the quality of their readings so that the percent agreement between them was increased beyond what we would expect from chance alone.

 

This can be shown intuitively in the following example: you are the director of a radiology department that is understaffed 1 day, and a large number of chest x-rays remain to be read. To solve your problem, you go out to the street and ask a few neighborhood residents, who have no background in biology or medicine, to read the unread x-rays and assess them as either positive or negative. The first person goes through the pile of x-rays, reading them haphazardly as positive, negative, negative, positive, and so on. The second person does the same, in the same way, but completely independent of the first reader. Given that both readers have no knowledge, criteria, or standards for reading x-rays, would any of their readings on a specific x-ray agree? The answer is clearly yes; they would agree in some cases, purely by chance alone.

 

However, if we want to know how well two observers read x-rays, we might ask, “To what extent do their readings agree beyond what we would expect by chance alone?” In other words, to what extent does the agreement between the two observers exceed the level of agreement that would result just from chance? One approach to answering this question is to calculate the kappa statistic, proposed by Cohen in 1960. 2 In this section, we will first discuss the rationale of the kappa statistic and the questions that the kappa statistic is designed to answer. This will be followed by a detailed calculation of the kappa statistic to serve as an example for intrepid readers. Even if you do not follow through the detailed calculation presented here, it is important to be sure that you understand the rationale of the kappa statistic because it is frequently applied both in clinical medicine and in public health.

 

Rationale of the Kappa Statistic.

In order to understand kappa, we ask two questions. First, how much better is the agreement between the observers’ readings than would be expected by chance alone? This can be calculated as the percent agreement observed minus the percent agreement we would expect by chance alone. This is the numerator of kappa:

    Percent agreement observed − Percent agreement expected by chance alone

Our second question is, “What is the most that the two observers could have improved their agreement over the agreement that would be expected by chance alone?” Clearly the maximum that they could agree would be 100% (full agreement, where the two observers agree completely). Therefore the most that we could expect them to be able to improve (the denominator of kappa) would be:

    100% − Percent agreement expected by chance alone

Kappa expresses the extent to which the observed agreement exceeds that which would be expected by chance alone (i.e., percent agreement observed − percent agreement expected by chance alone [numerator]) relative to the maximum that the observers could hope to improve their agreement (i.e., 100% − percent agreement expected by chance alone [denominator]).

 

Thus kappa quantifies the extent to which the observed agreement that the observers achieved exceeds that which would be expected by chance alone, and expresses it as the proportion of the maximum improvement that could occur beyond the agreement expected by chance alone. The kappa statistic can be defined by the equation:

    Kappa = (Percent agreement observed − Percent agreement expected by chance alone) / (100% − Percent agreement expected by chance alone)

Calculation of the Kappa Statistic: An Example.

To calculate the numerator for kappa, we must first calculate the amount of agreement that might be expected on the basis of chance alone. As an example, let’s consider data on the radiologic classification of breast density on synthetic 2D images as compared with digital 2D mammograms. 3 Fig. 5.17A shows data comparing the findings of the two methods in classifying 309 such cases.

 

 

 

 

FIG. 5.17 (A) Radiologic classification of breast density on synthetic 2D images as compared with digital 2D mammograms. (B) Percent agreement by synthetic and digital 2D mammograms. (C) Percent agreement by synthetic and digital 2D mammograms expected by chance alone. (From Alshafeiy TI, Wadih A, Nicholson BT, et al. Comparison between digital and synthetic 2D mammograms in breast density interpretation. AJR Am J Roentgenol. 2017;209:W36–W41. Reprinted with permission from the American Journal of Roentgenology.)

The first question is, “What is the observed agreement between the two types of mammograms?” Fig. 5.17B shows the classifications using the synthetic 2D mammography along the bottom of the table and those of digital 2D mammography along the right margin. Thus synthetic 2D mammography identified 179 (or 58%) of all of the 309 breast images as nondense and 130 (or 42%) of the images as dense. Digital 2D mammography identified 182 (or 59%) of all of the images as nondense and 127 (or 41%) of the images as dense. As discussed earlier, the percent agreement is calculated by the following equation:

    Percent agreement = (Images classified nondense by both methods + Images classified dense by both methods) / All images = 284/309 = 91.9%

That is, the two mammography devices had the same breast image classification on 91.9% of the readings.

 

The next question is, “If the two types of mammography had used entirely different sets of criteria for classifying a breast image as dense versus nondense, how much agreement would have been expected solely on the basis of chance?” Synthetic 2D mammography read 58% of all 309 images (179 images) as being nondense and 42% (130 images) as dense. If these readings had used criteria independent of those used by digital 2D mammography, we would expect that synthetic 2D mammography would read as nondense 58% both of the images that digital 2D mammography had identified as nondense and of the images that it had identified as dense. Therefore we would expect that 58% (105.56) of the 182 images identified as nondense by digital 2D mammography would be identified as nondense by synthetic 2D mammography, and that 58% (73.66) of the 127 images identified as dense by digital 2D mammography would also be identified as nondense by synthetic 2D mammography (see Fig. 5.17C). Of the 127 images called dense by digital 2D mammography, 42% (53.34) would also be classified as dense by synthetic 2D mammography.

 

Thus the agreement expected by chance alone would be

    (105.56 + 53.34) / 309 = 158.90 / 309 = 51.4%

of all images read.

 

Having calculated the figures needed for the numerator and denominator, kappa can now be calculated as follows:

    Kappa = (91.9% − 51.4%) / (100% − 51.4%) = 40.5% / 48.6% = 0.83

Landis and Koch 4 suggested that a kappa greater than 0.75 represents excellent agreement beyond chance, a kappa below 0.40 represents poor agreement, and a kappa of 0.40 to 0.75 represents intermediate to good agreement. Testing for the statistical significance of kappa is described by Fleiss. 5 Considerable discussion has arisen about the appropriate use of kappa, a subject addressed by MacLure and Willett. 6
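The three-step calculation above can be sketched in code. The following is an illustration only (not from the text); the individual cell counts are reconstructed from the worked example's marginal totals and its 91.9% observed agreement, so treat them as illustrative.

```python
# A sketch (not from the text) of Cohen's kappa for a 2x2 agreement table.

def cohens_kappa(table):
    """table[i][j]: count of subjects rated category i by one observer
    and category j by the other."""
    total = sum(sum(row) for row in table)
    k = len(table)
    # Step 1 ingredient: observed agreement, the diagonal cells where
    # both observers concur.
    p_observed = sum(table[i][i] for i in range(k)) / total
    # Chance agreement: for each category, the product of the two observers'
    # marginal proportions, summed over categories.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_chance = sum(row_totals[i] * col_totals[i] / total**2 for i in range(k))
    # Steps 1-3 combined: excess agreement over chance, relative to the
    # maximum possible excess (1 - chance agreement).
    return (p_observed - p_chance) / (1 - p_chance)

# Rows: digital 2D readings; columns: synthetic 2D readings
# (nondense, dense). Margins match Fig. 5.17: 182/127 and 179/130.
table = [[168, 14], [11, 116]]
print(round(cohens_kappa(table), 2))   # 0.83
```

By the Landis and Koch guideline quoted above, 0.83 would count as excellent agreement beyond chance.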

 

Validity of Tests With Multicategorical Results.

Validity, as a concept, can be applied to any test against a gold standard. As we explained earlier, we use sensitivity and specificity to validate the results of tests with dichotomous results against a gold standard. What about tests with multicategorical results? In this case, we can calculate the kappa statistic, which we demonstrated earlier as a tool for assessing reliability.

 

Validity of Self-Reports.

Often we obtain information on health and disease status by directly asking patients or study participants about their medical history, their habits, and other factors of interest. Most people today know their date of birth, so the assessment of age is usually without significant error. However, many people underreport their weight, their drinking and smoking practices, and other types of risks. Self-reports of sexual behaviors are considered to be subject to considerable error. To overcome these reporting biases, biomarkers have become commonly used in field studies. For example, Zenilman et al. 7 used a polymerase chain reaction (PCR) assay to detect Y chromosome fragments in self-collected vaginal swabs. This biomarker can detect coitus in women for a 2-week period, and can validate self-reports of condom use. 8

 

Relationship Between Validity and Reliability

To conclude this chapter, let’s compare validity and reliability using a graphical presentation.

 

The horizontal line in Fig. 5.18 is a scale of values for a given variable, such as blood glucose level, with the true value indicated. The test results obtained are shown by the curve. The curve is narrow, indicating that the results are quite reliable (repeatable); unfortunately, however, they cluster far from the true value, so they are not valid. Fig. 5.19 shows a curve that is broad and therefore has low reliability. However, the values obtained cluster around the true value and thus are valid. Clearly, what we would like to achieve are results that are both valid and reliable (Fig. 5.20).

 

 

FIG. 5.18 Graph of hypothetical test results that are reliable, but not valid.

 

FIG. 5.19 Graph of hypothetical test results that are valid, but not reliable.

 

FIG. 5.20 Graph of hypothetical test results that are both valid and reliable.

It is important to point out that in Fig. 5.19, in which the distribution of the test results is a broad curve centered on the true value, we describe the results as valid. However, the results are valid only for a group (i.e., they tend to cluster around the true value). It is important to remember that what may be valid for a group or a population may not be so for an individual in a clinical setting. When the reliability or repeatability of a test is poor, the validity of the test for a given individual also may be poor. The distinction between group validity and individual validity is therefore important to keep in mind when assessing the quality of diagnostic and screening tests.

 

Conclusion

This chapter has discussed the validity of diagnostic and screening tests as measured by their sensitivity and specificity, their predictive value, and the reliability or repeatability of these tests. Clearly, regardless of how sensitive and specific a test may be, if its results cannot be replicated, the test is of little use. All these characteristics must therefore be borne in mind when evaluating such tests, together with the purpose for which the test will be used.

 

References

1 Prenatal diagnosis. In: Cunningham F, Leveno KJ, Bloom SL, et al., eds. Williams Obstetrics. 24th ed . New York: McGraw-Hill; 2013. http://accessmedicine.mhmedical.com.ezp.welch.jhmi.edu/content.aspx?bookid=1057&sectionid=59789152 .

 

2 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37.

 

3 Alshafeiy TI, Wadih A, Nicholson BT, et al. Comparison between digital and synthetic 2D mammograms in breast density interpretation. AJR Am J Roentgenol. 2017;209:W36–W41.

 

4 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.

 

5 Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York: John Wiley & Sons; 1981.

 

6 MacLure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:161–169.

 

7 Zenilman JM, Yeunger J, Galai N, et al. Polymerase chain reaction detection of Y chromosome sequences in vaginal fluid: preliminary studies of a potential biomarker. Sex Transm Dis. 2005;32:90–94.

 

8 Ghanem KG, Melendez JH, McNeil-Solis C, et al. Condom use and vaginal Y-chromosome detection: the specificity of a potential biomarker. Sex Transm Dis. 2007;34:620.

 

Appendices to Chapter 5

The text of Chapter 5 focuses on the logic behind the calculation of sensitivity, specificity, and predictive value. Appendix 1 summarizes measures of validity for screening tests to detect the absence or presence of a given disease, the pages in the text where the measures are first introduced, and the interpretation of each measure. For those who prefer to see the formulae for each measure, they are provided in the right-hand column of this table; however, they are not essential for understanding the logic behind the calculation of each measure.

 

Appendix 1 to Chapter 5: Measures of Test Validity and Their Interpretation

INDIVIDUAL screening tests
    Sensitivity (p. 95): The proportion of those with the disease who test positive. Formula: TP/(TP + FN)
    Specificity (p. 95): The proportion of those without the disease who test negative. Formula: TN/(TN + FP)
    Positive predictive value (pp. 106–107): The proportion of those who test positive who do have the disease. Formula: TP/(TP + FP)
    Negative predictive value (p. 107): The proportion of those who test negative who do NOT have the disease. Formula: TN/(TN + FN)

SEQUENTIAL screening tests
    Net sensitivity (pp. 99–101): The proportion of those with the disease who test positive on BOTH Test 1 and Test 2
    Net specificity (pp. 99–101): The proportion of those without the disease who test negative on EITHER Test 1 or Test 2

SIMULTANEOUS screening tests
    Net sensitivity (pp. 101–104): The proportion of those with the disease who test positive on EITHER Test 1 or Test 2
    Net specificity (pp. 105–106): The proportion of those without the disease who test negative on BOTH Test 1 and Test 2

FN, False negatives; FP, false positives; TN, true negatives; TP, true positives.
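The individual measures in the table can be sketched directly from the cell counts; the net measures for sequential screening are shown here under the simplifying assumption that the two tests err independently (a sketch, not the chapter's derivation).

```python
# A sketch (not from the text) of the Appendix 1 validity measures.

def sensitivity(tp, fn):
    # Proportion of those with the disease who test positive
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of those without the disease who test negative
    return tn / (tn + fp)

def ppv(tp, fp):
    # Proportion of those who test positive who do have the disease
    return tp / (tp + fp)

def npv(tn, fn):
    # Proportion of those who test negative who do NOT have the disease
    return tn / (tn + fn)

def net_sensitivity_sequential(sens1, sens2):
    # A case is detected only if positive on BOTH Test 1 and Test 2
    # (assumes the two tests err independently)
    return sens1 * sens2

def net_specificity_sequential(spec1, spec2):
    # A disease-free person is net-negative if negative on EITHER test
    return spec1 + spec2 - spec1 * spec2

# Example with hypothetical counts: 90 TP, 10 FN, 80 TN, 20 FP
print(sensitivity(90, 10), specificity(80, 20))
```

Sequencing two tests therefore trades sensitivity (0.9 × 0.8 = 0.72) for specificity (0.9 + 0.8 − 0.72 = 0.98), the pattern the table's BOTH/EITHER wording captures.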

 

Appendix 2 summarizes the three steps required to calculate kappa statistic.

 

Appendix 2 to Chapter 5: The Three Steps Required for Calculating Kappa Statistic (κ)

STEP 1 (the NUMERATOR): How much better is the observed agreement than the agreement expected by chance alone?

STEP 2 (the DENOMINATOR): What is the maximum the observers could have improved upon the agreement expected by chance alone?

STEP 3 (KAPPA): Of the maximum improvement in agreement beyond chance alone that could have occurred, what proportion has in fact occurred?

A full discussion of kappa and a sample calculation starts on page 113.

 

Review Questions for Chapter 5

Questions 1, 2, and 3 are based on the information given below:

 

A physical examination was used to screen for breast cancer in 2,500 women with biopsy-proven adenocarcinoma of the breast and in 5,000 age- and race-matched control women. The results of the physical examination were positive (i.e., a mass was palpated) in 1,800 cases and in 800 control women, all of whom showed no evidence of cancer at biopsy.

 

1  The sensitivity of the physical examination was: ______

2  The specificity of the physical examination was: ______

3  The positive predictive value of the physical examination was: ______
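As a sketch (not an official answer key), the chapter's definitions can be applied to the data above in a few lines; note that the positive predictive value computed this way treats the combined cases and controls as the tested group, so it reflects the artificial "prevalence" of that study group.

```python
# A sketch applying the chapter's formulas to the question 1-3 data.

cases, cases_positive = 2_500, 1_800        # biopsy-proven cancer
controls, controls_positive = 5_000, 800    # no cancer at biopsy

sens = cases_positive / cases                                # TP / (TP + FN)
spec = (controls - controls_positive) / controls             # TN / (TN + FP)
ppv = cases_positive / (cases_positive + controls_positive)  # TP / (TP + FP)

print(f"sensitivity {sens:.0%}, specificity {spec:.0%}, PPV {ppv:.1%}")
```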

Question 4 is based on the following information:

A screening test is used in the same way in two similar populations, but the proportion of false-positive results among those who test positive in population A is lower than that among those who test positive in population B.

4  What is the likely explanation for this finding?

  1. It is impossible to determine what caused the difference
  2. The specificity of the test is lower in population A
  3. The prevalence of disease is lower in population A
  4. The prevalence of disease is higher in population A
  5. The specificity of the test is higher in population A

 

Question 5 is based on the following information:

A physical examination and an audiometric test were given to 500 persons with suspected hearing problems, of whom 300 were actually found to have them. The results of the examinations were as follows:

Result                      Hearing Problems
                            Present        Absent
Physical Examination
    Positive                240            40
    Negative                60             160
Audiometric Test
    Positive                270            60
    Negative                30             140

5  Compared with the physical examination, the audiometric test is:

  1. Equally sensitive and specific
  2. Less sensitive and less specific
  3. Less sensitive and more specific
  4. More sensitive and less specific
  5. More sensitive and more specific

 

Question 6 is based on the following information:

Two pediatricians want to investigate a new laboratory test that identifies streptococcal infections. Dr. Kidd uses the standard culture test, which has a sensitivity of 90% and a specificity of 96%. Dr. Childs uses the new test, which is 96% sensitive and 96% specific.

6  If 200 patients undergo culture with both tests, which of the following is correct?

  1. Dr. Kidd will correctly identify more people with streptococcal infection than Dr. Childs
  2. Dr. Kidd will correctly identify fewer people with streptococcal infection than Dr. Childs
  3. Dr. Kidd will correctly identify more people without streptococcal infection than Dr. Childs
  4. The prevalence of streptococcal infection is needed to determine which pediatrician will correctly identify the larger number of people with the disease

 

Questions 7 and 8 are based on the following information:

A colon cancer screening study is being conducted in Nottingham, England. Individuals 50 to 75 years old will be screened with the Hemoccult test. In this test, a stool sample is tested for the presence of blood.

7  The Hemoccult test has a sensitivity of 70% and a specificity of 75%. If Nottingham has a prevalence of 12/1,000 for colon cancer, what is the positive predictive value of the test? _____

8  If the Hemoccult test result is negative, no further testing is done. If the Hemoccult test result is positive, the individual will have a second stool sample tested with the Hemoccult II test. If this second sample also tests positive for blood, the individual will be referred for more extensive evaluation. What is the effect on net sensitivity and net specificity of this method of screening?

  1. Net sensitivity and net specificity are both increased
  2. Net sensitivity is decreased and net specificity is increased
  3. Net sensitivity remains the same and net specificity is increased
  4. Net sensitivity is increased and net specificity is decreased
  5. The effect on net sensitivity and net specificity cannot be determined from the data

 

Questions 9 through 12 are based on the information given below:

Two physicians were asked to classify 100 chest x-rays as abnormal or normal independently. The comparison of their classification is shown in the following table:

Classification of Chest X-Rays by Physician 1 Compared With Physician 2

                     Physician 2
Physician 1      Abnormal      Normal      Total
Abnormal         40            20          60
Normal           10            30          40
Total            50            50          100

9  The simple percent agreement between the two physicians out of the total is: ______

10  The percent agreement between the two physicians, excluding the x-rays that both physicians classified as normal, is: ______

 

CHAPTER 6

The Natural History of Disease

Ways of Expressing Prognosis

Keywords

case fatality; person-years; life table; survival analysis; Kaplan-Meier method

LEARNING OBJECTIVES

  • To compare five different ways of describing the natural history of disease: case-fatality, 5-year survival, observed survival, median survival time, and relative survival.
  • To describe two approaches for calculating observed survival over time: the life table approach and the Kaplan-Meier method.
  • To illustrate the use of life tables for examining changes in survival.
  • To describe how improvements in available diagnostic methods may affect the estimation of prognosis (stage migration).

At this point, we have learned how diagnostic and screening tests permit the categorization of sick and healthy individuals. Once a person is identified as having a disease, the question arises: How can we characterize the natural history of the disease in quantitative terms? Such quantification is important for several reasons. First, it is necessary to describe the severity of a disease to establish priorities for clinical services and public health programs. Second, patients often ask questions about prognosis (Fig. 6.1). Third, such quantification is important to establish a baseline for natural history, so that as new treatments become available, the effects of these treatments can be compared with the expected outcome without them. This is also important to identify different treatments or management strategies for different stages of the disease. Furthermore, if different types of therapy are available for a given disease, such as surgical or medical treatments or two different types of surgical procedures, we want to be able to compare the effectiveness of the various types of therapy. Therefore, to allow such a comparison, we need a quantitative means of expressing the prognosis in groups receiving the different treatments.

This chapter describes some of the ways in which prognosis can be described in quantitative terms for a group of patients. Thus the natural history of disease (and hence its prognosis) is discussed in this chapter; later chapters discuss the issue of how to intervene in the natural history of disease to improve prognosis: Chapters 10 and 11 discuss how randomized trials are used to select the most appropriate intervention (medical, surgical, or lifestyle), and Chapter 18 discusses how, through screening, disease can be detected at an earlier point than usual in its natural history to maximize the effectiveness of treatment. To discuss prognosis, let’s begin with a schematic representation of the natural history of disease in a patient, as shown in Fig. 6.2.

Point A marks the biologic onset of disease. Often, this point cannot be identified because it occurs subclinically, perhaps as a subcellular change, such as an alteration in DNA. At some point in the progression of the disease process (point P), pathologic evidence of disease could be obtained if it were sought by population screening or by a physician, probably during a routine screening; this evidence can also be an incidental finding while managing another disease or complaint in the same patient. Subsequently, signs and symptoms of the disease develop in the patient (point S), and at some time after that, the patient may seek medical care (point M). The patient may then receive a diagnosis (point D), after which treatment may be given (point T). The subsequent course of the disease might result in cure or remission, control of the disease (with or without disability), or even death.

At what point do we begin to quantify survival time? Ideally we might prefer to do so from the onset of disease. However, this is not generally possible, because the time of biologic onset in an individual is most often not known. If we were to count from the time at which symptoms begin, we would introduce considerable subjective variability in measuring length of survival, because we would be ignoring the interval between the biologic onset of disease and the first symptoms and signs, which could range from hours or days (for an acute infection) to months or years (e.g., in prostate cancer). In general, in order to standardize the calculations, duration of survival is counted from the time of diagnosis. However, even with the use of this starting point, variability still occurs, because patients differ in the point at which they seek medical care. In addition, some diseases, such as certain types of arthritis, are indolent (pain-free) and develop slowly, so that patients may not be able to accurately pinpoint the onset of symptoms or recall the point in time at which they sought medical care. Furthermore, when survival is counted from the time of diagnosis, any patients who may have died before a diagnosis was made are excluded from the count. What effect would this have on our estimates of prognosis?

An important related question is, “How is the diagnosis made?” Is there a clear pathognomonic test for the disease in question? Such a test is often not available. Sometimes a disease may be diagnosed by the isolation of an infectious agent, but because people can be carriers of organisms without actually being infected, we do not always know that the isolated organism is the cause of disease. For some diseases, we might prefer to make a diagnosis by tissue confirmation taken by biopsy, but there is often variability in the interpretation of tissue slides by different pathologists. An additional issue is that with certain health problems, such as headaches, lower back pain, and dysmenorrhea, a specific tissue diagnosis is not possible. Consequently, when we say that survivorship is measured from the time of diagnosis, the time frame is not always clear. These issues should be kept in mind as we proceed to discuss different approaches to estimating prognosis.

Prognosis can be expressed either in terms of deaths from the disease or in terms of survivors with the disease. Although both approaches are used in the following discussion, the final end point used for the purposes of our discussion in this example is death. Because death is inevitable, we are not talking about dying versus not dying, but rather about extending the interval until death occurs following diagnosis. Other end points might be used, including the interval from diagnosis to recurrence of disease or from diagnosis to the time of functional impairment, disease-specific complication, disability, or changes in the patient’s quality of life, all of which may be affected by the invasiveness of the available treatment, when the treatment was initiated, or the extent to which some of the symptoms can be relieved—even if the patient’s life span cannot be extended. These are all important measures, but they are not discussed in this chapter.

Case-Fatality

The first way to express prognosis is case-fatality, which was discussed in Chapter 4. Case-fatality is defined as the number of people who die of a disease divided by the number of people who have the disease. Given that a person has the disease, what is the likelihood that he or she will die of the disease? Note that the denominator for case-fatality is the number of people who have the disease, which makes it a proportion, although it is sometimes incorrectly referred to as a rate. This differs from a mortality rate, in which the denominator includes anyone at risk of dying of the disease—both persons who have the disease and persons who do not (yet) have the disease, but in whom it could develop.

 

Case-fatality does not include any explicit statement of time. However, time is expressed implicitly, because case-fatality is generally used for acute diseases in which death, if it occurs, occurs relatively soon after diagnosis. Thus if the usual natural history of the disease is known, the term case-fatality refers to the period after diagnosis during which death might be expected to occur.

Case-fatality is ideally suited to short-term, acute conditions. In chronic diseases, in which death may occur many years after diagnosis and the possibility of death from other causes becomes more likely, case-fatality becomes a less useful measure. For example, because prostate cancer typically progresses very slowly, most men with this diagnosis die of some other cause. We therefore use different approaches for expressing prognosis in such diseases.
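The denominator distinction just described can be made concrete with a small sketch (the numbers below are hypothetical, not from the text):

```python
# Case-fatality divides deaths by the sick; a mortality rate divides the
# same deaths by everyone at risk, diseased or not.

cases = 200               # people who have the disease
deaths_from_disease = 30  # deaths among those cases
population = 100_000      # everyone at risk of the disease

case_fatality = deaths_from_disease / cases   # a proportion, not a rate
mortality_per_100k = deaths_from_disease * 100_000 / population

print(case_fatality, mortality_per_100k)   # 0.15 30.0
```

The same 30 deaths yield a 15% case-fatality but a mortality rate of only 30 per 100,000, because the denominators answer different questions.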

 

Person-Years

A useful way of expressing mortality is in terms of the number of deaths divided by the person-years over which a group is observed. Because individuals are often observed for different periods of time, the unit used for counting observation time is the person-year. (Person-years were previously discussed in Chapter 3, pages 47–50.) The number of person-years for two people, each of whom is observed for 5 years, is equal to that for 10 people, each of whom is observed for 1 year—that is, 10 person-years. The person-years for all individuals can then be added together, and the number of events, such as deaths, can be expressed per number of person-years observed.

One problem in using person-years is that each person-year is assumed to be equivalent to every other person-year (i.e., the risk is the same in any person-year observed). However, this may not be true. Consider the situation in Fig. 6.3 showing two examples of 10 person-years: two people each observed for 5 years and five people each observed for 2 years. Are they equivalent?

Suppose the situation is that shown in Fig. 6.4, in which the period of greatest risk of dying is from shortly after diagnosis until about 20 months after diagnosis. Clearly most of the person-years in the first example (i.e., two persons observed for 5 years) will be outside the period of greatest risk (Fig. 6.5), the times from 20 months to 60 months. In contrast, most of the 2-year intervals of the five persons shown in the second example will occur during the period of highest risk (Fig. 6.6). Consequently, when we compare the two examples (Fig. 6.7), more deaths would be expected in the example of five persons observed for 2 years than in the example of two persons observed for 5 years. Despite this issue, person-years are useful as denominators of rates of events in many situations, such as randomized trials (see Chapters 10 and 11) and cohort studies (see Chapter 8). Note that, as discussed in other textbooks, 1 a rate per person-years is equivalent to an average yearly rate. Thus a rate per person-years can be compared to a Vital Statistics yearly rate based on the period’s midpoint population estimate. This is useful when it is of interest to compare rates per person-years in a study with population rates.
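The person-years bookkeeping itself is simple addition, as this sketch with hypothetical follow-up data (mirroring the two-versus-five-person comparison above) shows:

```python
# Person-years as the denominator when people are observed for unequal
# lengths of time. Both groups below contribute 10 person-years, even
# though the risk carried by each person-year may differ.

years_per_person_a = [5.0, 5.0]   # two people, followed 5 years each
years_per_person_b = [2.0] * 5    # five people, followed 2 years each

person_years_a = sum(years_per_person_a)   # 10.0
person_years_b = sum(years_per_person_b)   # 10.0

deaths = 2
rate_per_1000_py = deaths / person_years_a * 1_000
print(rate_per_1000_py)   # 200.0 deaths per 1,000 person-years
```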

Five-Year Survival

Another measure used to express prognosis is 5-year survival. This term is frequently used in clinical medicine, particularly in evaluating treatments for cancer.

 

The 5-year survival is the percentage of patients who are alive 5 years after treatment begins or 5 years after diagnosis. (Although 5-year survival is often referred to as a rate, it is actually a proportion.) Despite the widespread use of the 5-year interval, it should be pointed out that there is nothing magical about 5 years. Certainly no significant biologic change occurs abruptly at 5 years in the natural history of a disease that would justify its use as an end point. However, when this measure was first used in the 1950s, most deaths from cancer occurred within 5 years of diagnosis, so 5-year survival has since served as an index of success in cancer treatment. Some literature on chronic diseases, such as cardiovascular disease, uses 10-year survival instead of 5-year survival.

One problem with the use of 5-year survival has become more prominent in recent years with the advent of better screening programs. Let’s examine a hypothetical example: Fig. 6.8 shows a timeline for a woman who had biologic onset of breast cancer in 2005. Because the disease was subclinical at that time, she had no symptoms. In 2013, she felt a lump in her breast, which precipitated a visit to her physician, who made the diagnosis. The patient then underwent a mastectomy. In 2015, she died of metastatic cancer. As measured by 5-year survival, which is often used in oncology as a measure of whether therapy has been successful, this patient is not a “success,” because she survived for only 2 years after diagnosis.

Let’s now imagine that this woman lived in a community in which there was an aggressive breast cancer mammography screening campaign (lower timeline in Fig. 6.9). As before, biologic onset of disease occurred in 2005, but in 2010, she was identified through screening as having a very small mass in her breast. She had surgery in 2010 but died in 2015. Because she survived for 5 years after diagnosis and therapy, she would now be identified as a therapeutic “success” in terms of 5-year survival. However, this apparently longer survival is an artifact. Death still occurred in 2015; the patient’s life was not lengthened by early detection and therapy. What has happened is that the interval between her diagnosis (and treatment) and her death was increased through earlier diagnosis, but there was no delay in the time of death. (The interval between the earlier diagnosis in 2010, made possible by the screening test, and the later usual time of diagnosis in 2013 is called the lead time. This concept is discussed in detail in Chapter 18 in the context of evaluating screening programs.) It is misleading to conclude that, given the patient’s 5-year survival, the outcome of the second scenario is any better than that of the first, because no change in the natural history of the disease has occurred, as reflected by the year of death. Indeed, the only change that has taken place is that when the diagnosis was made 3 years earlier (2010 vs. 2013), the patient received medical care for breast cancer, with all its attendant difficulties, for an additional 3 years. Thus, when screening is performed, a higher 5-year survival may be observed, not because people live longer, but only because an earlier diagnosis has been made. This type of potential bias (known as lead time bias) must be taken into account in evaluating any screening program before it can be concluded that the screening is beneficial in extending survival.

Another problem with 5-year survival is that if we want to look at the survival experience of a group of patients who were diagnosed less than 5 years ago, we clearly cannot use this criterion, because 5 years of observation are necessary in these patients to calculate 5-year survival. Therefore if we want to assess a therapy that was introduced less than 5 years ago, 5-year survival is not an appropriate measure.

 

A final issue relating to 5-year survival is shown in Fig. 6.10. Here we see survival curves for two populations, A and B. Five-year survival is about 10% in both. However, the curves leading to the same 5-year survival are quite different. Although survival at 5 years is the same in both groups, most of the deaths in group A did not occur until the fifth year, whereas most of the deaths in group B occurred in the first year; group B thus generally had a shorter time to event (death) than group A. Thus despite the identical 5-year survivals, survival during the 5 years is clearly better for those in group A.

Observed Survival

Rationale for the Life Table

Another approach to quantifying prognosis is to use the actual observed survival of patients followed over time, based on knowing the interval within which each event occurred. For this purpose, we use a life table. Life tables have been used for centuries by actuaries to estimate risk in populations when data on individuals were not available. Actuarial methods and models have been applied in a large number of situations, including property/casualty insurance, life insurance, pensions, and health insurance, among others. Actuaries are credentialed professionals with a foundation in statistics and probability, stochastic processes, and actuarial methods and models.

 

Let’s examine the conceptual framework underlying the calculation of survival rates using a life table, especially when the exact event time is not known, but rather we use the interval within which the event took place.

 

Table 6.1 shows a hypothetical study of treatment results in patients who were treated from 2010 to 2014 and followed to 2015. (By just glancing at this table, you can tell that the example is hypothetical, because the title indicates that no patients were lost to follow-up!)

TABLE 6.1

 

Hypothetical Study of Treatment Results in Patients Treated From 2010–2014 and Followed to 2015 (None Lost to Follow-Up)

Year of     No. of Patients    No. Alive on Anniversary of Treatment
Treatment   Treated            2011    2012    2013    2014    2015

2010        84                 44      21      13      10      8
2011        62                         31      14      10      6
2012        93                                 50      20      13
2013        60                                         29      16
2014        76                                                 43

For each calendar year of treatment, the table shows the number of patients enrolled in treatment and the number of patients alive at each calendar year after the initiation of that treatment. For example, of 84 patients enrolled in treatment in 2010, 44 were alive in 2011, a year after beginning treatment; 21 were alive in 2012; and so on.

 

The results in Table 6.1 include all the data available for assessing the treatment. If we want to describe the prognosis in these treated patients using all of the data in the table, obviously we cannot use 5-year survival, because the entire group of 375 patients has not been observed for 5 years. We could calculate 5-year survival using only the first 84 patients who were enrolled in 2010 and observed until 2015, because they were the only ones observed for 5 years. However, this would require us to discard the rest of the data, which would be unfortunate, given the effort and expense involved in obtaining the data, and also given the additional light that the survival experience of those patients would cast on the effectiveness of the treatment. The question is: How can we use all of the information in Table 6.1 to describe the survival experience of the patients in this study?

 

To use all of the data, we rearrange the data from Table 6.1 as shown in Table 6.2. In this table, the data show the number of patients who started treatment each calendar year and the number of those who remained alive on each anniversary of the initiation of treatment. The patients who started treatment in 2014 were observed for only 1 year, because the study ended in 2015.

TABLE 6.2

 

Rearrangement of Data in Table 6.1, Showing Survival Tabulated by Years Since Enrollment in Treatment (None Lost to Follow-Up)

Year of     No. of Patients    No. Alive at End of Year
Treatment   Treated            1st     2nd     3rd     4th     5th

2010        84                 44      21      13      10      8
2011        62                 31      14      10      6
2012        93                 50      20      13
2013        60                 29      16
2014        76                 43

With the data in this format, how do we use the table? First we ask, “What is the probability of surviving for 1 year after the beginning of treatment?” To answer this, we divide the total number of patients who were alive 1 year after the initiation of treatment (197) by the total number of patients who started treatment (375; Table 6.3).

TABLE 6.3

 

Analysis of Survival in Patients Treated From 2010–2014 and Followed to 2015 (None Lost to Follow-Up): I

Year of     No. of Patients    No. Alive at End of Year
Treatment   Treated            1st     2nd     3rd     4th     5th

2010        84                 44      21      13      10      8
2011        62                 31      14      10      6
2012        93                 50      20      13
2013        60                 29      16
2014        76                 43

Totals      375                197

P1 = probability of surviving the 1st year = 197/375 = 0.525

Next, we ask, “What is the probability that, having survived the first year after beginning treatment, the patients will survive the second year?” We see in Table 6.4 that 197 people survived the first year, but for 43 of them (the ones who were enrolled in 2014), we have no further information because they were observed for only 1 year. Because 71 survived the second year, we calculate the probability of surviving the second year, if the patient survived the first year (P 2), as

P2 = 71/(197 − 43) = 71/154 = 0.461

In the denominator we subtract the 43 patients for whom we have no data for the second year.

 

Following this pattern, we ask, “Given that a person has survived to the end of the second year, what is the probability, on average, that he or she will survive to the end of the third year?”

 

In Table 6.5, we see that 36 survived the third year. Although 71 had survived the second year, we have no further information on survival for 16 of them because they were enrolled late in the study. Therefore we subtract 16 from 71 and calculate the probability of surviving the third year, given survival to the end of the second year (P3), as:

 

P3 = 36/(71 − 16) = 36/55 = 0.655

 

We then ask, “If a person survives to the end of the third year, what is the probability that he or she will survive to the end of the fourth year?”

 

As seen in Table 6.6, a total of 36 people survived the third year, but we have no further information for 13 of them. Because 16 survived the fourth year, the probability of surviving the fourth year, if the person has survived the third year (P4), is:

 

P4 = 16/(36 − 13) = 16/23 = 0.696

 

Finally, we do the same calculation for the fifth year (Table 6.7). We see that 16 people survived the fourth year, but that no further information is available for 6 of them.

 

Because 8 people were alive at the end of the fifth year, the probability of surviving the fifth year, if the person has survived the fourth year (P 5), is:

 

P5 = 8/(16 − 6) = 8/10 = 0.800

 

Using all of the data that we have calculated, we ask, “What is the probability of surviving for all 5 years?” Box 6.1 shows all of the probabilities of surviving for each individual year that we have calculated.

Now we can answer the question, “If a person is enrolled in the study, what is the probability that he or she will survive 5 years after beginning treatment?” The probability of surviving for 5 years is the product of each of the probabilities of surviving each year, shown in Box 6.1. So the probability of surviving for 5 years is:

P1 × P2 × P3 × P4 × P5 = 0.525 × 0.461 × 0.655 × 0.696 × 0.800 = 0.088

The probabilities for surviving different lengths of time are shown in Box 6.2. These calculations can be presented graphically in a survival curve, as seen in Fig. 6.11. Note that these calculations use all of the data we have obtained, including the data for patients who were not observed for the full 5 years of the study. As a result, the use of data is economical and efficient.
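The chain of calculations above can be sketched in a few lines of Python; this is an illustration using the counts from Tables 6.3 to 6.7, and the variable names are ours:

```python
# Conditional survival probabilities from Tables 6.3-6.7 (hypothetical data
# from the text). Each year's probability excludes patients withdrawn (no
# further information) for that year, exactly as in the calculations above.

alive_at_start = [375, 197, 71, 36, 16]   # alive at start of each year
withdrew       = [0, 43, 16, 13, 6]       # no further information available
survived       = [197, 71, 36, 16, 8]     # alive at end of each year

cumulative = 1.0
for year, (alive, w, s) in enumerate(zip(alive_at_start, withdrew, survived), 1):
    p = s / (alive - w)                   # conditional probability of surviving this year
    cumulative *= p
    print(f"P{year} = {s}/({alive} - {w}) = {p:.3f}, cumulative = {cumulative:.3f}")
```

The final cumulative figure reproduces the 0.088 five-year survival computed above.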

 

Calculating a Life Table

Let’s now view the data from this example in the standard tabular form in which they are usually presented for calculating a life table. In the example just discussed, the persons for whom data were not available for the full 5 years of the study were those who were enrolled sometime after the study had started, so they were not observed for the full 5-year period. In virtually every survival study, however, subjects are also lost to follow-up. Either they cannot be found or they decline to continue participating in the study. In calculating the life table, persons for whom data are not available for the full period of follow-up—either because follow-up was not possible or because they were enrolled after the study was started—are called withdrawals (or losses to follow-up or censored observations).

 

Table 6.8 shows the data from this example with information provided about the number of deaths and the number of withdrawals in each interval. The columns are numbered merely for reference (i.e., there is no meaning inherent to the numbering). The row directly under the column labels gives the terms that are often used in life table calculations. The next five rows of the table give data for the 5 years of the study.

 

TABLE 6.8

 

Rearrangement of Data in Standard Format for Life Table Calculations

(1) Interval Since Beginning Treatment        (2) Alive at Beginning of Interval      (3) Died During Interval  (4) Withdrew During Interval

x          lx         dx        wx

1st year           375      178      0

2nd year          197      83        43

3rd year           71        19        16

4th year           36        7          13

5th year           16        2          6

 

The columns are as follows:

 

Column (1): The interval since beginning treatment.

Column (2): The number of study subjects who were alive at the beginning of each interval.

Column (3): The number of study subjects who died during that interval.

Column (4): The number who “withdrew” during the interval—that is, the number of study subjects who could not be followed for the full study period, either because they were lost to follow-up or because they were enrolled after the study had started.

 

Table 6.9 adds four additional columns to Table 6.8. These columns show the calculations. The new columns are as follows:

 

Column (5): The number of people who are effectively at risk of dying during the interval. Losses to follow-up (withdrawals) during each time interval are assumed to have occurred uniformly during the entire interval. (This assumption is most likely to hold when the interval is short.) We therefore assume that, on average, they were at risk for half the interval. Consequently, to calculate the number of people at risk during each interval, we subtract half the withdrawals during that interval as indicated in the heading for column 5.

Column (6): The proportion who died during the interval, calculated by dividing the number who died during the interval (column 3) by the number effectively at risk of dying during the interval (column 5).

Column (7): The proportion who did not die during the interval—that is, the proportion of those who were alive at the beginning of the interval and who survived that entire interval = 1.0 − proportion who died during the interval (column 6).

Column (8): The proportion who survived from the point at which they were enrolled in the study to the end of this interval (cumulative survival). This is obtained by multiplying the proportion who were alive at the beginning of this interval and who survived this interval by the proportion who had survived from enrollment through the end of the previous interval. Thus each of the figures in column 8 gives the proportion of people enrolled in the study who survived to the end of each interval. This will be demonstrated by calculating the first two rows of Table 6.9.

 

TABLE 6.9

 

Calculating a Life Table

(1) Interval Since Beginning Treatment        (2) Alive at Beginning of Interval      (3) Died During Interval  (4) Withdrew During Interval (5) Effective No. Exposed to Risk of Dying During Interval: Col (2) − ½ Col (4)           (6) Proportion Who Died During Interval: Col (3) ÷ Col (5)          (7) Proportion Who Did Not Die During Interval: 1 − Col (6)          (8) Cumulative Proportion Who Survived From Enrollment to End of Interval: Cumulative Survival

x          lx         dx        wx       l′x        qx        px        Px

1st year           375      178      0          375.0   0.475   0.525   0.525

2nd year          197      83        43        175.5   0.473   0.527   0.277

3rd year           71        19        16        63.0     0.302   0.698   0.193

4th year           36        7          13        29.5     0.237   0.763   0.147

5th year           16        2          6          13.0     0.154   0.846   0.124

 

Let’s look at the data for the first year. (In these calculations, we will round the results at each step and use the rounded figures in the next calculation. In reality, however, when life tables are calculated, the unrounded figures are used for calculating each subsequent interval, and at the end of all the calculations, all the figures are rounded for purposes of presenting the results.) There were 375 subjects enrolled in the study who were alive at the beginning of the first year after enrollment (column 2). Of these, 178 died during the first year (column 3). All subjects were followed for the first year, so there were no withdrawals (column 4). Consequently 375 people were effectively at risk for dying during this interval (column 5). The proportion who died during this interval was 0.475: 178 (the number who died [column 3]) divided by 375 (the number who were at risk for dying [column 5]). The proportion who did not die during the interval is 1.0 − the proportion who died, or 1.0 − 0.475 = 0.525 (column 7). For the first year after enrollment, this is also the proportion who survived from enrollment to the end of the interval (column 8).

 

Now let’s look at the data for the second year. These calculations are important to understand because they serve as the model for calculating each successive year in the life table.

 

To calculate the number of subjects alive at the start of the second year, we start with the number alive at the beginning of the first year and subtract from that number the number of deaths and withdrawals during that year. At the start of the second year, therefore, 197 subjects were alive at the beginning of the interval (column 2 [375 − 178 − 0]). Of these, 83 died during the second year (column 3). There were 43 withdrawals who had been observed for only 1 year (column 4). As discussed earlier, we subtract half of the withdrawals, 21.5 (43/2), from the 197 who were alive at the start of the interval, yielding 175.5 people who were effectively at risk for dying during this interval (column 5). The proportion who died during this interval (column 6) was 0.473—that is, 83 (the number who died [column 3]) divided by 175.5 (the number who were at risk for dying [column 5]). The proportion who did not die during the interval is 1.0 − the proportion who died (1.0 − 0.473) = 0.527 (column 7). The proportion of subjects who survived from the start of treatment to the end of the second year is the product of 0.525 (the proportion who had survived from the start of treatment to the end of the first year—that is, the beginning of the second year) multiplied by 0.527 (the proportion of people who were alive at the beginning of the second year and survived to the end of the second year) = 0.277 (column 8). Thus 27.7% of the subjects survived from the beginning of treatment to the end of the second year. Looking at the last entry in column 8, we see that 12.4% of all individuals enrolled in the study survived to the end of the fifth year.

 

Work through the remaining years in Table 6.9 to be sure you understand the concepts and calculations involved.
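The full table can be generated with the same few steps for every row; here is a sketch of the Table 6.9 calculation in Python. (Unrounded figures are carried through, so the final cumulative survival prints as 0.125 rather than the 0.124 obtained with the stepwise rounding described above.)

```python
# Life table of Table 6.9. Withdrawals are assumed to occur uniformly within
# each interval, so each one contributes half an interval of exposure:
# effective number at risk = alive - withdrawals / 2.

alive    = [375, 197, 71, 36, 16]   # column 2: alive at beginning of interval
died     = [178, 83, 19, 7, 2]      # column 3: died during interval
withdrew = [0, 43, 16, 13, 6]       # column 4: withdrew during interval

cumulative = 1.0
for year, (l, d, w) in enumerate(zip(alive, died, withdrew), 1):
    effective = l - w / 2            # column 5: effectively at risk of dying
    q = d / effective                # column 6: proportion who died
    p = 1 - q                        # column 7: proportion who did not die
    cumulative *= p                  # column 8: cumulative survival
    print(f"Year {year}: n'={effective:6.1f}  q={q:.3f}  p={p:.3f}  P={cumulative:.3f}")
```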

 

The Kaplan-Meier Method

In contrast to the life table approach just demonstrated, in the Kaplan-Meier method, 2 predetermined intervals (such as 1 month or 1 year) are not used. Rather, we identify the exact point in time when each death, or event of interest, occurred, so that each death or event terminates the previous interval and starts a new interval (and a new row in the Kaplan-Meier table). The number of persons who died at that point is used as the numerator, and the number alive up to that point (including those who died at that time point) is used as the denominator, after any withdrawals that occurred before that point are subtracted.

 

Let’s look at the small hypothetical study shown in Fig. 6.12. Six patients were studied, of whom four died and two were lost to follow-up (“withdrawals”). The deaths occurred at 4, 10, 14, and 24 months after enrollment in the study. The data are set up as shown in Table 6.10:

 

 

 

Column (1): The times for each death from the time of enrollment (time that treatment was initiated).

Column (2): The number of patients who were alive and followed at the time of that death, including those who died at that time.

Column (3): The number who died at that time.

Column (4): The proportion of those who were alive and followed (column 2) who died at that time (column 3) (column 3/column 2).

Column (5): The proportion of those who were alive and survived (1.0 − column 4).

Column (6): Cumulative survival (the proportion of those who were initially enrolled and survived to that point).

TABLE 6.10

 

Calculating Survival Using the Kaplan-Meier Method a

(1) Times to Deaths From Starting Treatment (Months)       (2) No. Alive at Each Time   (3) No. Who Died at Each Time        (4) Proportion Who Died at That Time: Col (3) ÷ Col (2)             (5) Proportion Who Survived at That Time: 1 − Col (4)       (6) Cumulative Proportion Who Survived to That Time: Cumulative Survival

4          6          1          0.167   0.833   0.833

10        4          1          0.250   0.750   0.625

14        3          1          0.333   0.667   0.417

24        1          1          1.000   0.000   0.000

a See text and Fig. 6.12 regarding withdrawals.

Let’s consider the first row of the table. The first death occurred at 4 months, at which time six patients were alive and followed (see Fig. 6.12). One death occurred at this point (column 3), for a proportion of 1/6 = 0.167 (column 4). The proportion who survived at that time is 1.0 − column 4, or 1.0 − 0.167 = 0.833 (column 5), which is also the cumulative survival at this point (column 6).

 

The next death occurred 10 months after the initial enrollment of the six patients in the study, and data for this time are seen in the next row of the table. Although only one death had occurred before this one, the number alive and followed is only four because there had also been a withdrawal before this point (not shown in the table, but seen in Fig. 6.12). Thus there was one death (column 3), and as seen in Table 6.10, the proportion who died is 1/4, or 0.250 (column 4). The proportion who survived is 1.0 − column 4, or 1.0 − 0.250 = 0.750 (column 5). Finally, the cumulative proportion surviving (column 6) is the product of the proportion who survived to the end of the previous interval (until just before the previous death) seen in column 6 of the first row (0.833) and the proportion who survived from that time until just before the second death (second row in column 5, 0.750). The product = 0.625—that is, 62.5% of the original enrollees survived to this point. Review the next two rows of the table to be sure that you understand the concepts and calculations involved.

The values calculated in column 6 are plotted as seen in Fig. 6.13. Note that the data are plotted in a stepwise fashion rather than in a smoothed slope because, after the drop in survival resulting from each death, survival then remains unchanged until the next death occurs.
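The stepwise estimate of Table 6.10 can be reproduced in a few lines of Python. Note that the two withdrawal times used here (7 and 20 months) are assumptions, since Fig. 6.12 is not reproduced; only their position between deaths matters for the calculation.

```python
# Kaplan-Meier estimate for the six-patient example of Table 6.10.
# Death times are from the text; the two censoring times are assumed.

observations = [(4, True), (7, False), (10, True), (14, True), (20, False), (24, True)]
# (months from start of treatment, True = died / False = withdrawn)

at_risk = len(observations)
survival = 1.0
curve = []                            # (time, cumulative survival) at each death
for t, died in sorted(observations):
    if died:
        survival *= 1 - 1 / at_risk   # one death at this time point
        curve.append((t, survival))
        print(f"t={t:2d} months: at risk={at_risk}, cumulative survival={survival:.3f}")
    at_risk -= 1                      # a death or a withdrawal leaves the risk set
```

The printed cumulative survivals (0.833, 0.625, 0.417, 0.000) match column 6 of Table 6.10.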

When information on the exact time of death is available, the Kaplan-Meier method clearly makes the fullest use of this information, because the data themselves define the intervals instead of the predetermined arbitrary intervals used in the life table method. Modern technology for communicating with patients, even simultaneously across different study sites, together with electronic linkage of mortality data to research databases, allows researchers to identify the exact time of each event. In addition, computer programs are readily available that make the Kaplan-Meier method easy to calculate even for large data sets. The majority of longitudinal studies in the published literature now report data on survival using the Kaplan-Meier method. For example, in 2000, Rosenhek and colleagues reported a study of patients with asymptomatic, but severe, aortic stenosis. 3 An unresolved issue was whether patients with asymptomatic disease should have their aortic valves replaced. The investigators examined the natural history of this condition to assess the overall survival of these patients and to identify predictors of outcome. Gibson and colleagues 4 studied the association between coronary artery calcium (CAC) and incident cerebrovascular events (CVE) in 6,779 participants of the Multi-Ethnic Study of Atherosclerosis (MESA) who were followed for an average of 9.5 years. Fig. 6.14A shows their Kaplan-Meier analysis of CVE-free survival by the presence or absence of CAC at baseline. Participants with CAC present at the baseline examination had lower CVE-free survival than participants without CAC at the baseline visit. In Fig. 6.14B, the authors divided the participants into four groups according to their CAC at the baseline visit (CAC: 0, 0 to 100, >100 to 400, and >400 Agatston units), and we can clearly see a separate curve for each group, showing a significant gradation in CVE-free survival.

 

Assumptions Made in Using Life Tables and Kaplan-Meier Method

Two important assumptions are made in using life tables and Kaplan-Meier methods. The first is that there has been no secular (temporal) change in the effectiveness of treatment or in survivorship over calendar time. That is, we assume that over the period of the study, there has been no improvement in treatment and that survivorship in one calendar year of the study is the same as in another calendar year of the study. Clearly, if a study is conducted over many years, this assumption may not be valid because, fortunately, therapies improve over time. If we are concerned that the effectiveness of therapy may have changed over the course of the study, we could examine the early data separately from the later data. If they seem to differ, the early and later periods could be analyzed separately and the effects compared.

The second assumption relates to follow-up of persons enrolled in the study. In virtually every real-life study, participants are lost to follow-up. People can be lost to follow-up for many reasons. Some may die and may not be traced. Some may move or seek care elsewhere. Some may be lost because their disease disappears and they feel well. In most studies, we do not know the actual reasons for losses to follow-up. How can we deal with the problem of people lost to follow-up for whom we therefore have no further information on survival? Because we have baseline data on these people, we could compare the characteristics of the persons lost to follow-up with those of persons who remained in the study. If a large proportion of the study population is lost to follow-up, the findings of the study will be less valid. The challenge is to minimize losses to follow-up. In any case, the second assumption made in life table analysis is that the survival experience of people who are lost to follow-up is the same as the experience of those who are followed up. Although this assumption is made for purposes of calculation, in actual fact its validity may often be questionable. For mortality, however, the assumption can be verified by means of linkage with the United States National Death Index, which allows comparing the mortality of those lost to follow up with those who continue to be followed up.

A third assumption is specific to traditional life tables, not to the Kaplan-Meier method, and concerns the use of predetermined intervals. The main reason to use the life table method rather than the Kaplan-Meier method is that when we cannot identify the exact time of each event, we must use an arbitrary interval within which the event took place; likewise, we cannot identify the exact times of withdrawals from the study. We must therefore assume that risk and withdrawal are uniformly distributed within each time interval, with no rapid change in either during an interval. A reasonable way to make this assumption plausible is to make each interval as short as possible.

Example of Use of a Life Table

Life tables are used in virtually every clinical area, although they are less commonly used nowadays, having largely been replaced by the Kaplan-Meier method when investigators can identify the exact time of the event for each study participant. For a long time before the Kaplan-Meier method became established, life tables were the standard means by which survival was expressed and compared. Let’s examine a few examples. One of the great triumphs of pediatrics in recent decades has been the treatment of leukemia in children. However, the improvement has been much greater for whites than for blacks, and the reasons for this difference are not clear. At a time when survival rates from childhood acute leukemia were increasing rapidly, a study was conducted to explore the racial differences in survivorship. Figs. 6.15 to 6.17 show data from this study. 5 The curves are based on life tables that were constructed using the approach discussed earlier.

Fig. 6.15 shows survival for white and black children with leukemia in Baltimore over a 16-year period. No black children survived longer than 4 years, but some white children survived as long as 11 years in this 16-year period of observation.

 

What changes took place in survivorship during the 16 years of the study? Fig. 6.16 and Fig. 6.17 show changes in leukemia mortality over time in whites and blacks, respectively. The 16-year period was divided into three periods: 1960 to 1964 (solid line), 1965 to 1969 (dashed line), and 1970 to 1975 (dotted line).

 

In whites (see Fig. 6.16), survivorship increased in each successive period. For example, if we examine 3-year survival by looking at the 3-year point on each successive curve, we see that survival improved from 8% to 25% to 58%. In contrast, in blacks (see Fig. 6.17) there was much less improvement in survival over time; the curves for the two later 5-year periods almost overlap.

What accounts for this racial difference? First, we must take account of the small numbers involved and the possibility that the differences could have been due to chance. Let’s assume, however, that the differences are real. During the past several decades, tremendous strides have occurred in the treatment of leukemia through combined therapy, including central nervous system radiation added to chemotherapy. Why, then, does a racial difference exist in survivorship? Why is it that the improvement in therapy that has been so effective in white children has not had a comparable benefit in black children? Further analyses of the interval from the time the mother noticed symptoms to the time of diagnosis and treatment indicated that the differences in survival did not appear to be due to a delay in black parents seeking or obtaining medical care. Because acute leukemia is more severe in blacks and more advanced at the time of diagnosis, the racial difference could reflect biologic differences in the disease, such as a more aggressive and rapidly progressive form of the illness. The definitive explanation is not known.

Apparent Effects on Prognosis of Improvements in Diagnosis

We have discussed the assumption made in using a life table that no improvement in the effectiveness of treatment has occurred over calendar time during the period of the study. Another issue in calculating and interpreting survival rates is the possible effect of improvements in diagnostic methods over calendar time.

 

An interesting example was reported by Feinstein and colleagues. 6 They compared survival in a cohort of patients with lung cancer first treated in 1977 with survival in a cohort of patients with lung cancer treated from 1953 to 1964. Six-month survival was higher in the more recent (1977) cohort, both for the total group and for subgroups formed on the basis of stage of disease. The authors found that the apparent improvement in survival was due in part to stage migration, a phenomenon shown in Fig. 6.18A to C.

 

In Fig. 6.18A, patients with cancer are divided into “good” and “bad” stages, on the basis of whether they had detectable metastases in 1980. Some patients who would have been assigned to a “good” stage in 1980 may have had micro-metastases at that time, which would have been unrecognized (see Fig. 6.18B). However, by 2000, as diagnostic technology improved, many of these patients would have been assigned to a “bad” stage, because their micro-metastases would now have been recognized using improved diagnostic technology that had become available (see Fig. 6.18C). If this had occurred, survival by stage would appear to have improved even if treatment had not become any more effective during this time.

Let’s consider a hypothetical example that illustrates this effect of such stage migration. Fig. 6.19A to C show a hypothetical study of cancer case-fatality for 300 patients in two time periods, 1980 and 2000, assuming no improvement in the effectiveness of available therapy between the two periods. We will assume that as shown in Fig. 6.19A, in both time periods, the case-fatality is 10% for patients who have no metastases, 30% for those with micro-metastases, and 80% for those with metastases. Looking at Fig. 6.19B, we see that in 1980, 200 patients were classified as stage I. One hundred of these patients had no metastases, and 100 had unrecognized micro-metastases. Their case-fatalities were thus 10% and 30%, respectively. In 1980, 100 patients had clearly evident metastases and were classified as stage II; their case-fatality was 80%.

 

As a result of improved diagnostic technology in 2000, micro-metastases were detected in the 100 affected patients, and these patients were classified as stage II (see Fig. 6.19C). Because the prognosis of the patients with micro-metastases is worse than that of the other patients in stage I, and because, in the later study period, patients with micro-metastases are no longer included in the stage I group (because they have migrated to stage II), the case-fatality for stage I patients appears to decline from 20% in the early period to 10% in the later period. However, although the prognosis of the patients who migrated from stage I to stage II was worse than that of the others in stage I, the prognosis for these patients was still better than that of the other patients in stage II, who had larger, more easily diagnosed metastases and a case-fatality of 80%. Consequently, the case-fatality for patients in stage II also appears to have improved, having declined from 80% in the early period to 55% in the later period, even in the absence of any improvement in treatment effectiveness.

 

The apparent improvements in survival in both stage I and stage II patients result only from the changed classification of patients with micro-metastases in the later period. Looking at the bottom line of the figure, we see that the case-fatality of 40% for all 300 patients has not changed from the early period to the later period. Only the apparent stage-specific case-fatalities have changed. It is therefore important to exclude the possibility of stage migration before attributing any apparent improvement in prognosis to improved effectiveness of medical care.
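The arithmetic of this hypothetical example can be verified directly; a short sketch using the case-fatalities and counts given above (the group labels are ours):

```python
# Stage-migration arithmetic from the hypothetical example of Fig. 6.19.
# Case-fatality by true disease extent is fixed in both periods; only the
# stage labels change once micro-metastases become detectable.

case_fatality = {"none": 0.10, "micro": 0.30, "overt": 0.80}
patients = {"none": 100, "micro": 100, "overt": 100}

def stage_cf(groups):
    """Deaths divided by patients for the groups pooled into one stage."""
    deaths = sum(patients[g] * case_fatality[g] for g in groups)
    n = sum(patients[g] for g in groups)
    return deaths / n

# 1980: micro-metastases undetected, so they are counted in stage I
print(f"1980 stage I:  {stage_cf(['none', 'micro']):.0%}")   # 20%
print(f"1980 stage II: {stage_cf(['overt']):.0%}")           # 80%
# 2000: improved diagnosis moves micro-metastases into stage II
print(f"2000 stage I:  {stage_cf(['none']):.0%}")            # 10%
print(f"2000 stage II: {stage_cf(['micro', 'overt']):.0%}")  # 55%
# overall case-fatality is unchanged between the two periods
print(f"Overall:       {stage_cf(['none', 'micro', 'overt']):.0%}")  # 40%
```

Both stage-specific figures improve while the overall 40% is untouched, which is the Will Rogers phenomenon in miniature.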

 

The authors call stage migration the Will Rogers phenomenon. The reference is to Will Rogers, an American humorist during the time of the economic depression of the 1930s. At that time, because of economic hardship, many residents of Oklahoma left the state and migrated to California. Rogers commented, “When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.”

 

Median Survival Time

Another approach to expressing prognosis is the median survival time, which is defined as the length of time that half (50%) of the study population survives. Why should we use median survival time rather than mean survival time, which is an average of the survival times? Median survival offers two advantages over mean survival. First, it is less affected by extremes, whereas the mean can be significantly affected by even a single outlier. One or two persons with a very long survival time could significantly affect the mean, even if all of the other survival times were much shorter. Second, if we used mean survival, we would have to observe all of the deaths in the study population before the mean could be calculated. However, to calculate median survival, we would only have to observe the deaths of half of the group under observation.
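The first advantage is easy to see with made-up numbers. In this sketch the survival times are hypothetical, chosen to include a single long-term survivor:

```python
import statistics

# Hypothetical survival times in months; one long-term survivor is an outlier.
survival_months = [6, 7, 8, 9, 10, 11, 12, 13, 14, 120]

print(statistics.mean(survival_months))    # 21.0 -- pulled far above most values
print(statistics.median(survival_months))  # 10.5 -- barely affected
```

Note also that the median here was determined as soon as half the group had died, whereas the mean could not be calculated until the final death at 120 months.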

 

Relative Survival

Let’s consider 5-year survival for a group of 30-year-old men with colorectal cancer. What would we expect their 5-year survival to be if they did not have colorectal cancer? Clearly, it would be nearly 100%. Thus we are comparing the survival observed in young men with colorectal cancer to a survival of almost 100% that is expected in those without colorectal cancer. What if we consider a group of 80-year-old men with colorectal cancer? We would not expect anything near 100% 5-year survival in a population of this age, even if they do not have colorectal cancer. We would want to compare the observed survival in 80-year-old men with colorectal cancer to the expected survival of 80-year-old men without colorectal cancer. So for any group of people with a disease, we want to compare their survival to the survival we would expect in this age group even if they did not have the disease. This is known as the relative survival.

 

Relative survival is thus defined as the ratio of the observed survival to the expected survival:

Relative survival = Observed survival / Expected survival

Does relative survival really make any difference? Table 6.11 shows data for patients with cancer of the colon and rectum, both relative survival and observed survival from 1990 to 1998. When we look at the older age groups, which have high rates of mortality from other causes, there is a large difference between the observed and the relative survival. However, in young persons, who generally do not die of other causes, observed and relative survival for cancer of the colon and rectum do not differ significantly.

 

TABLE 6.11

 

Five-Year Observed and Relative Survival (%) by Age for Colon and Rectum Cancer, 1990–1998: SEER Program, 1970–2011

Age (years)   Observed Survival (%)   Relative Survival (%)
<50           64                      65
50–64         61.9                    65.4
65–74         54.3                    62.9
>75           35.5                    55.8

 

Another way to view relative survival is by examining the hypothetical 10-year survival curves of 80-year-old men shown in Fig. 6.20A to D. For reference, Fig. 6.20A shows a perfect survival curve of 100% (the horizontal curve at the top) over the 10 years of the study period. Fig. 6.20B adds a curve of observed survival—that is, the actual survival observed in this group of patients with the disease over the 10-year period. As seen in Fig. 6.20C, the expected survival for this group of 80-year-old men is clearly less than 100% because deaths from other causes are significant in this age group. The relative survival is the ratio of observed survival to expected survival. Since expected survival is less than perfect (100%) survival, and expected survival is the denominator for these calculations, the relative survival will be higher than the observed survival (see Fig. 6.20D).
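Because relative survival is the ratio of observed to expected survival, the expected survival implied by a table like Table 6.11 can be recovered by rearranging the ratio. A short sketch, using the observed and relative percentages read from the table (the derived expected values are illustrative back-calculations, not published SEER figures):

```python
# Observed and relative 5-year survival (%) by age, as read from Table 6.11.
table_6_11 = {
    "<50":   (64.0, 65.0),
    "50-64": (61.9, 65.4),
    "65-74": (54.3, 62.9),
    ">75":   (35.5, 55.8),
}

# relative = observed / expected  =>  expected = observed / relative
expected_survival = {
    age: 100 * observed / relative
    for age, (observed, relative) in table_6_11.items()
}

for age, expected in expected_survival.items():
    print(f"{age}: implied expected 5-year survival = {expected:.1f}%")
```

The implied expected survival falls from about 98% below age 50 to about 64% above age 75, which is exactly why observed and relative survival diverge only in the older age groups.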

 

 

Generalizability of Survival Data

A final point in connection with the natural history and prognosis of disease is the question of which patients are selected for study. Let’s look at one example.

 

Febrile seizures are common in infants. Children who are otherwise healthy often experience a seizure in association with high fever. The question arises as to whether these children should be treated with a regimen of phenobarbital or another long-term anticonvulsant medication. That is, is a febrile seizure a warning of subsequent epilepsy, or is it simply a phenomenon associated with fever in infants, in which case children are unlikely to have subsequent nonfebrile seizures?

 

To make a rational decision about treatment, the question we must ask is, “What is the risk that a child who has had a febrile seizure will have a subsequent nonfebrile seizure?” Fig. 6.21 shows the results of an analysis by Ellenberg and Nelson of published studies.7

 

Each dot shows the percentage of children with febrile seizures who later developed nonfebrile seizures in a different study. The authors divided the studies into two groups: population-based studies and studies based in individual clinics, such as epilepsy or pediatric clinics. The results from different clinic-based studies show a considerable range in the risk of later development of nonfebrile seizures. However, the results of population-based studies show little variation in risk, and the results of all of these studies tend to cluster at a low level of risk.

 

Why should the two types of studies differ? Which results would you believe? Each of the clinics probably had different selection criteria and different referral patterns. Consequently, the different risks observed in the different clinic-based studies are probably the result of the selection of different populations in each of the clinics. In contrast, in the population-based studies (which may in fact be randomly selected), this type of variation due to selection is reduced or eliminated, which accounts for the close clustering of the data, and for the resultant finding that the risk of nonfebrile seizures is very low. The important point is that it may be very tempting to look at patient records in one hospital and generalize the findings to all patients in the general population. However, this is not a legitimate approach because patients who come to a certain clinic or hospital often are not representative of all patients in the community. This does not mean that studies conducted at a single hospital or clinic cannot be of value. Indeed, there is much to be learned from conducting studies at single hospitals. However, these studies are particularly prone to selection bias, and this possibility must always be kept in mind when the findings from such studies and their potential generalizability are being interpreted.

 

Conclusion

This chapter has discussed five ways of expressing prognosis (Box 6.3). Which approach is best depends on the type of data that are available, data collection methods, and the purpose of the data analysis.

 

Box 6.3

Five Approaches to Expressing Prognosis

 

  1. Case-fatality
  2. 5-year survival
  3. Observed survival
  4. Median survival time
  5. Relative survival

 

 

Review Questions for Chapter 6

Question 1 is based on the information given in the table below:

 

Year of Treatment   No. of Patients Treated   No. of Patients Alive on Each Anniversary of Beginning Treatment
                                              1st     2nd     3rd
2012                75                        60      56      48
2013                63                        55      31
2014                42                        37
Total               180                       152     87      48

A total of 180 patients were treated for disease X from 2012 to 2014, and their progress was followed to 2015. The treatment results are given in the table. No patients were lost to follow-up.


 

1  What is the probability of surviving for 3 years? _______
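One standard way to attack a calculation like this is the life-table (actuarial) approach described earlier in the chapter: multiply the conditional probabilities of surviving each successive year, counting only the patients who were observed through that year. The sketch below assumes the three cohorts were treated in 2012, 2013, and 2014 and that, as stated, no patients were lost to follow-up:

```python
# Number alive at each anniversary (index 0 = number treated) per cohort.
cohorts = {
    2012: [75, 60, 56, 48],
    2013: [63, 55, 31],
    2014: [42, 37],
}

prob = 1.0
for year in (1, 2, 3):
    # Only cohorts followed through this anniversary contribute.
    observed = [c for c in cohorts.values() if len(c) > year]
    at_risk = sum(c[year - 1] for c in observed)
    alive = sum(c[year] for c in observed)
    prob *= alive / at_risk   # conditional probability of surviving this year

print(round(prob, 3))   # (152/180) * (87/115) * (48/56) = 0.548
```

Note that this differs from simply following the 2012 cohort alone (48/75 = 0.64); the life-table method uses the partial follow-up from every cohort.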

 

 
