CTE Tech Report No. 18 * EDC Center for Children and Technology

Education Development Center, Inc.
Center for Children and Technology

Exploring the Sampling Laboratory

CTE Technical Report Issue No. 18

December 1991

Prepared by:

Chip Bruce

Scientists once sought a deterministic understanding of phenomena, one which had no place for variability and uncertainty. Today, across fields as diverse as quantum mechanics, genetics, epidemiology, cognitive psychology, education, economics, and astrophysics, scientists not only expect stochastic processes, but incorporate probabilistic and statistical concepts into their theories. This change in science has been called the "probabilistic revolution" (Gigerenzer & Murray, 1987). As a result of the probabilistic revolution, statistical reasoning has become indispensable for interpreting scientific statements, making inferences, and engaging in scientific inquiry. Similarly, the everyday world, as represented in the daily newspaper, is one which demands statistical literacy. In order to understand environmental hazards, economic conditions, tests of new drugs, or political surveys, the reader must be able to assess quantitative data in terms of variability, sample size, bias, measures of central tendency, and other statistical concepts.

Recognizing the growing importance of statistics, educators added data analysis, probability, and statistics to the mathematics curriculum. The National Council of Teachers of Mathematics curriculum and evaluation standards (NCTM, 1987) call for statistics and probability in all grades, K-12, with particular emphasis on data exploration, analysis, and interpretation. Supporting this call, a joint committee of the American Statistical Association and NCTM has developed Quantitative Literacy (QL) (Landwehr & Watkins, 1987; Landwehr, Swift & Watkins, 1987; Newman, Obremski & Scheaffer, 1987; Gnanadesikan, Scheaffer & Swift, 1987), a set of materials on statistics and probability for middle school students.

At the core of statistical reasoning lies an understanding of sampling processes. In order to make inferences about a population, students must understand what information a sample contains and what it can or cannot reveal about a population. But sampling can be complex and difficult to understand. For many students, it may be the first time they are asked to think of the world in terms of estimates and probabilities rather than in terms of knowable, quantifiable facts. The ability to conceptualize a problem as a question of confidence in a method rather than as a question of identifying the appropriate formula for calculation requires students to revise their mental models of mathematics in basic ways. This revision is one reason why statistics is difficult to learn. While many students may be able to state useful definitions for "sample" and "population" or manipulate the formula for a confidence interval, they exhibit confusion about the conceptual bases of statistical inference, even after completion of a course on statistical reasoning. We chose sampling as one focus within the Reasoning Under Uncertainty (RUU) project (Rubin, Bruce, Conant, DuMouchel, Goodman, Horwitz, Lee, Mesard, Pringle, Rosebery, Snyder, Tenney, & Warren, 1990; Rubin, Rosebery, & Bruce, 1988) because of its importance within statistical reasoning and because of the difficulty many students have in mastering basic sampling concepts.

This report documents a program (Sampling Laboratory) with which students can explore the processes of sampling and making inferences from samples. It also describes a curriculum built around the Sampling Laboratory, a field test of its use in high school classrooms, and studies of the learning of statistical reasoning related to sampling. The report is intended as a resource for those interested in the teaching and learning of reasoning from samples and, in particular, in the Sampling Laboratory.

Section 1 presents background on the Sampling Laboratory, including previous research on the learning of statistical reasoning and earlier curricula, such as QL and RUU. The Sampling Laboratory software and information on how to use it are described in Section 2. A module for teaching about sampling as realized in high school classrooms is presented in Section 3. Results of the implementation of these modules, including a study of students' learning of statistical concepts, are given in Section 4. Future directions are discussed in Section 5.

1 Background

Most of the research on statistical reasoning has compared student models for statistical reasoning with what we call the "standard model" of statistical reasoning. Although there is an active debate among statisticians (see for example the historical analysis in Gigerenzer & Murray, 1987) on underlying inference models, there is general agreement about the central aspects of statistically-based reasoning. The "standard model" runs as follows: A set of data can be represented pictorially in a number of ways. In particular, a sample can be represented as a histogram. The relationship between a sample and its pictorial representation is notational, or definitional. Thus, one can talk about (a) correctness - does the picture accurately represent the data according to the definition of the graph type? and (b) usefulness - within the allowable parameters (e.g., bin size), does the picture show the data in a clear and productive way for some purpose?

The relationship between a sample and a population is one of contingent similarity; that is, if the sample is unbiased, it will tend to have similar shape, spread, central tendency, etc. to the population, and this tendency will be greater for larger samples. Thus, the appropriate value terms are (a) randomness - is the sample in fact unbiased with respect to the population of interest? and (b) goodness - is the sample large enough to merit the appropriate level of confidence in any conclusions drawn from it?

Some other aspects of the standard model are the following:

(a) There is a clear separation between the real world, conceptualized through the sample and the population, and abstractions, such as graphical representations, measures of central tendency, confidence judgments, etc.

(b) Samples are not more or less "right." It is not "wrong" to have a sample that looks very unlike the population. Following good statistical practice in drawing a sample does not ensure that the sample will look like the population; it merely allows one to specify precisely the likelihood of similarity.

(c) One expects samples to vary, not because of bad design or bias (although they can have a large effect), but because of randomness inherent in the sampling process.

(d) A histogram is supposed to represent a sample accurately; a sample is supposed to represent a population. These representations have radically different epistemological status. One is definitional; the second is probabilistic.

(e) Reliability of estimation by sampling is dependent on sample size, but relatively independent of population size.

(f) For a given sample, the size of a confidence interval is directly proportional to the sample spread, increases with the confidence level, and is inversely proportional to the square root of the sample size.

(g) The process of inferring population parameters from sample statistics is critically dependent upon the assumption that the sample is unbiased.
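Items (e) and (f) can be illustrated with the usual normal-approximation formulas for a sample proportion. The sketch below is not part of the Sampling Laboratory; the function names, the multiplier z = 1.96 (a conventional choice for roughly 95% confidence), and the finite population correction for sampling without replacement are assumptions introduced here for illustration.

```python
import math

def standard_error(p, n, population_size=None):
    """Standard error of a sample proportion under simple random sampling.

    If population_size is given, apply the finite population correction
    for sampling without replacement (an assumption added for illustration).
    """
    se = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

def interval_half_width(p_hat, n, z=1.96):
    """Half-width of a normal-approximation confidence interval."""
    return z * standard_error(p_hat, n)

# (e) Same sample size, very different population sizes:
# the standard error barely changes.
for population_size in (10_000, 1_000_000):
    print(population_size, standard_error(0.5, 100, population_size))

# (f) Quadrupling the sample size halves the interval.
for n in (100, 400):
    print(n, interval_half_width(0.5, n))
```

Under these assumptions, the population size enters only through a correction factor that is close to 1 whenever the sample is a small fraction of the population, while the sample size enters through its square root.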

1.1 Research on Statistical Reasoning

In this section we simply want to mention some of the background research for our work on the Sampling Laboratory, not, by any measure, to give a complete review of research on statistical reasoning. Many researchers have studied people's statistical heuristics and judgments of subjective probability. One heuristic proposed is the use of representativeness (Kahneman & Tversky, 1974) as a measure of the likelihood of a sample being drawn from a population. Kahneman and Tversky define representativeness for a sample as "the degree to which it is (i) similar in essential properties to its parent population and (ii) reflects the salient features of the process by which it is generated." (p. 431). If the sample is unordered, this definition reduces to saying that the closer the sample statistic is to the population parameter, the more representative the sample. Bar-Hillel (1982) added the notion of "accuracy" to that of representativeness. In her experiments, subjects described as "accurate" those samples whose sample statistic exactly matched the population parameter.

Kahneman and Tversky (1982) also found that people use an availability heuristic to judge the frequency of a sample "by assessing the ease with which the relevant mental operation of retrieval, construction, or association can be carried out." (p. 164). In examples where availability guided people's thinking, they found, the problem was often stated so that the mechanism of constructing the sample was emphasized, rather than its final composition. Thus, there is some evidence from their research that the way a problem is stated influences the representation subjects use to explore it. Finally, research has shown that the sequence of questions can influence final estimates because initial estimates tend to have an effect on the entire sequence of answers subjects offer. Slovic and Lichtenstein (1971) report that subjects often construct a final estimate by small adjustments from their initial estimate; the implication is that different orders of questions can influence subjects to follow different reasoning routes and arrive at different answers.

Rubin, Bruce, and Tenney (1990) report a study of students reasoning from samples, which shows students struggling with dual aspects of the central idea of statistical inference: that a sample gives us some information about a population - not nothing, not everything, but something. In practice, this allows us to put bounds on the value of a characteristic of the population - usually either a proportion or a measure of center (mean or median), but not to know precisely what that characteristic is.

Under this view, sample representativeness is the idea that a sample taken from a population will have characteristics similar to those of its parent population. Thus, the proportion of girls in a classroom is likely to be close to the proportion of girls in the entire school. Sample variability is the contrasting idea that samples from a single population are not all the same and thus do not all match the population. Thus, some classrooms in a school are likely to have many more girls than boys, even if the school population is evenly divided.

One of the keys to mastering statistical inference is balancing these two ideas, interpreting more precisely the meaning of "likely" in each. Because they are contradictory when seen in a deterministic framework, students may over-respond to one or the other depending on the context. Over-reliance on sample representativeness is likely to lead to the notion that a sample tells us everything about a population; over-reliance on sample variability implies that a sample tells us nothing. Finding the appropriate point on the continuum between the two extremes is complex and needs to take into account confidence level, population variance and sample size. For a given confidence level and population variance, the effect of sample size relates closely to the representativeness/variability continuum: the larger the sample, the more likely it is to be representative of the population. Smaller samples are more likely to vary.
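The balance between representativeness and variability can be explored with a small simulation. The sketch below is illustrative only: the function name, the tolerance of 10 percentage points, the number of trials, and the random seed are all assumptions made here. It treats each "classroom" as a random sample from an evenly divided school and counts how often the classroom proportion lands close to the school proportion.

```python
import random

def fraction_close(p, sample_size, tolerance=0.1, trials=2000, seed=0):
    """Fraction of simulated samples whose proportion lies within
    `tolerance` of the population proportion `p`."""
    rng = random.Random(seed)
    close = 0
    for _ in range(trials):
        # Each individual independently falls in the category with probability p.
        hits = sum(rng.random() < p for _ in range(sample_size))
        # The small epsilon guards against floating-point rounding at the boundary.
        if abs(hits / sample_size - p) <= tolerance + 1e-12:
            close += 1
    return close / trials

for size in (10, 40, 160):
    print(size, fraction_close(0.5, size))
```

Small samples are often, but far from always, close to the population proportion; as the sample size grows, nearly all samples fall within the tolerance. Neither "a sample tells us everything" nor "a sample tells us nothing" describes the result.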

The analysis of student responses indicated that most students have inconsistent models of the relationship between samples and populations, even for problems in which the underlying mathematical models are isomorphic. In some situations, the notions of sample representativeness hold sway, in others, those of sample variability do. Sample size does not seem to operate appropriately to separate the two; in fact, of the three problems analyzed, sample representativeness appears to be a stronger guiding factor in the problem with the smallest sample size.

In related work, Snyder (1989) conducted an ethnography of a high school classroom, focusing on student learning in a statistics course (Reasoning Under Uncertainty, see next section). He found that students did not pick up the connection between science and statistical reasoning, which is prominent in experts' discussion of statistics. In interviews, students were unanimous in noting scant use of statistical concepts in their science courses. Some students came away with the idea that the world is full of fuzzy variables, and that statistical techniques are often helpless in the face of this chaos.

This course required students to go beyond manipulating equations to make connections between mathematics and the real world, and to grasp underlying concepts such as the distinction between populations and samples. Generally speaking, the students had little trouble mastering the few equations in the course, but had difficulty with conceptual distinctions. They were particularly confused about when to apply the formula for proportions versus the formula for means. Students were hampered in their conceptual grasp of sampling by not being clear about the status of unknown population parameters and about how randomness produces variability, even in unbiased samples.

In a study reported in Rubin, et al. (1990), Bruce and Snyder interviewed all the students in a high school statistics class. They also interviewed two teachers, a statistician, a physicist, a demographer, two experimental psychologists, and a computer scientist. They organized the interview around a scenario that appears in several problems and examples in Moore (1985). It was selected because it was complex enough to elicit a variety of responses and because it called for interpretation and policy judgments that went beyond simple calculations (see Appendix E).

Analyses of the interviews revealed that students had strengths in several areas, especially in the area of descriptive statistics. They also had difficulties in several areas related to more inferential thinking, a few of which we mention below:

Sample = population. One problem reflects what appears to be a conflation of the standard model for inference so that the relation between sample and population is almost an equality relation. Thus, several students said that a sample is supposed to represent the population; further questioning revealed that they meant that it was supposed to look like the population in terms of location, shape and spread. To the extent that it did not, students thought the person doing the sampling had made an error. They did not distinguish sample-to-population representation from that in the statement: The histogram is supposed to represent the sample.

Sampling variability. Related to the first idea is students' notion that samples should not vary. If the work is done correctly, they think, there should be no sampling error. Here we may be seeing confusion in part traceable to the unfortunate choice of "error" as the term within standard statistics for the effect of random sampling variation. Other problematic terms are "normal," "bias," "random," "standard," "population," "individual," and "confidence."

Data and process. A striking difference between most of the students and some of the adult experts was that the students rarely asked questions about the processes that generated the data set. This may simply reflect the social setting of the interview and the outside-adult/student relationship. But we suspect that students considered a problem in statistics to be complete as stated; there was no need to ask further questions, or to know the underlying process. It is noteworthy that even an excellent text, such as Moore (1985), tends to present many short problems, so that no problem is presented with much detail on the domain of study (in these interviews, milk production). Other texts (such as Tanur, Mosteller, Kruskal, Link, Pieters, & Rising, 1989), contain longer case studies, but tend to be used as supplementary materials. Although students often carry out surveys of their own in class, they do not generalize their insights about the importance of the details of data collection to problems from a textbook.

Normality and niceness. Many students seemed to equate normality, as in normal distribution, with perfection or niceness. Their goal was to have a nice-looking picture, but often the picture wasn't nice because it was difficult to do work perfectly: "it's hard to get everything just right." A related idea was that statistical work (the survey, the calculations) should be done correctly "to show you've learned it." This desire conflicted with the indeterminacy inherent in sampling.

Randomness. Students struggled, as one might expect, with the difficult concept of randomness. In some cases it seemed to be equated with fairness.

Explanation and persuasion. Perhaps the most disturbing outcome of the interviews was that a number of students seemed to interpret the question about explaining complex statistical ideas to the public as "how could you distort the statistical analysis to mislead the public?" Thus they saw the purpose of explanation to be persuasion. In this case, they saw the job of the health official to be reassuring the public, no matter what the data showed.

Statistics and the real world. Statistics is about using mathematical concepts in relation to real world data and important questions. But the interviews showed that the surrounding school context did not support this message. Despite the assertion that statistics was important in science, social studies, and humanities areas, students saw only trivial instances of statistical reasoning in their other courses, e.g., a mention of "means" in a science class. One student said she signed up for the statistics course because it was "something different to take...it's not like it comes up in everyday life...not in my math courses or anything."

1.2 Curricula

The Sampling Laboratory builds upon previous curricular work on sampling, in particular, Quantitative Literacy and Reasoning Under Uncertainty. QL "is an introduction to statistics. In addition to learning the most up-to-date statistical techniques,...students...get practice in division, percents, ratios, ordering numbers, and many other topics in arithmetic.

Familiar statistical concepts such as reading tables, the mean (average), and scatter plots are included as well as less familiar ones such as the median, stem-and-leaf plots, box plots, and smoothing. All of these techniques are part of a new emphasis in statistics referred to as data analysis (or sometimes as exploratory data analysis - EDA). The techniques of data analysis are easy to do and are often graphical. They can reveal interesting patterns and features in the data.

The techniques in QL encourage students to ask questions and generate hypotheses about the data. This is an important part of data analysis. By using these methods students will be able to interpret data that are interesting and important to them." (Landwehr, Swift, & Watkins, 1984, introduction page).

The objective of the Reasoning Under Uncertainty project has been to develop and test a computer-supported environment in which high school students learn how to think in probabilistic and statistical terms. The central ideas are to use the computer as a tool for data gathering, manipulation, and display, and to have students investigate questions that are meaningful to them. In contrast to the usual emphasis in statistics courses on formulas and computational procedures, RUU emphasizes reasoning about statistical problems. The students should be able to engage in statistical reasoning about uncertainties that either they or society face. Such a course conforms well to the National Science Board's suggestion that "elementary statistics and probability should now be considered fundamental for all high school students."

To facilitate involving the students in statistical and probabilistic thinking, the computer-supported environment provides a series of data sets that the students explore for meaning in terms of statistical principles. Data sets that interest them - for example, on sports, health issues, and social trends - promote functional learning via activities that can serve the students' own goals. In this setting, students can discover and construct their knowledge via participatory, experimental learning.

RUU is a semester-long course with four modules, each with several units:

1 Describing Groups

1.1 What do statistical questions and answers look like?

1.2 Measures of central tendency

1.3 Understanding variability

2 Answering Questions--Sampling from Groups

2.1 Why sample?

2.2 Confidence

3 Making Comparisons

3.1 Asking statistical questions: Collecting data through surveys and experiments

3.2 Answering statistical questions: Visualizing and analyzing data

4 Understanding Relationships

4.1 Answering questions about multivariate data

4.2 Making predictions

4.3 Association versus causation

4.4 Newspaper stories (optional unit)

The primary software used in the curriculum is ELASTIC, a statistical spreadsheet that handles both categorical and numerical variables. Users can easily display summary statistics, build new variables from those already defined, and create histograms, bar graphs, scatter plots, and box plots. It also allows the student to look at subsets of data - for example, all females earning over $30,000, or all boys under five feet tall, or all maze times under two minutes - and to create any of these graphs for any of the variables or for selected subsets. ELASTIC also includes two exploratory environments:

Stretchy Histograms allows students to create and manipulate distributions interactively. Measures of location and variability - mean, median, and quartiles - change dynamically as the distribution is modified.

Shifty Lines provides an environment for experimenting with lines on scatter plots. A potential best-fit line can be moved around the screen while a scale records how it differs from the best possible fit. The software also allows students to identify particular points on their scatter plot and to investigate how a regression line would change if points were deleted from the data set.

Module 2 makes use of a precursor of the Sampling Laboratory, called Sampler, a program in which students can explore the behavior of multiple samples drawn from a single population. Experiments using Sampler can illuminate relations among sample size, number of samples, and confidence limits of inferences about the underlying population.

Ethnographies of Reasoning Under Uncertainty (RUU) classrooms, which involved systematic observation of many class sessions and focal interviews with teachers, students and administrators, are described in Page (1989), Rubin, et al. (1990), and Snyder (1989). One of these (Page, 1989) focused on implementation questions, examining supports within the school for implementing RUU. Its key conclusions are as follows:

The use of the computer played an important role in fostering student-centered learning. Working in pairs was effective and seemed to lead to greater understanding.

There was an apparent absence of competition among students in the classroom, and an unexpected presence of competition among the faculty, in the area of recruiting students for elective courses. RUU was significant in this regard because the use of the computer and the data collection activities made the course attractive to students.

The introduction of the innovation (RUU) was "relatively easy...[It] was right out of a textbook: Proper training, proper planning, an established course, a teacher with the right abilities and attitude. The results are excellent, from all indications."

2 Sampling Laboratory

Based on the research sketched above, we identified a set of goals for teaching about sampling processes. A major decision, based on the Snyder and Bruce research reported in Rubin, et al. (1990) as well as the Quantitative Literacy curriculum, was to focus on estimates of population proportions. Within that area we have identified a set of basic concepts related to sampling that we would like our activities to support. Although they are stated here in terms of estimating population proportions, they extend easily to other population parameters such as median or mean:

(a) In general, you cannot calculate a population proportion directly: the population is too large; it costs too much to take measurements; or you don't have access to all the individuals in the population.

(b) A sample is not the same as a population, but it can give you some information about a population.

(c) Randomly-chosen samples vary considerably, especially with a small sample size. Thus the sample proportion you get is likely to be different from the population proportion.

(d) This variation occurs even if you are careful to avoid bias in choosing a sample. It is a consequence of randomness in the sampling process, not of human error.

(e) Although the sample proportion may vary from the population proportion, it varies in a predictable way. Samples are more likely to be similar to the population than very different from it. If you could look at many samples from the same population, you would see that the greatest number of samples would have the same proportion as the population, and the further from the population proportion you looked the fewer samples you'd see. Thus, despite sampling variability, a random sample can be used to make a reasonable estimate of a population proportion.

(f) The goodness of the estimate is directly dependent upon the size of the sample. As the sample size increases, it is less and less likely that the sample proportion will differ greatly from the population proportion.

(g) This sample size effect is non-linear. (The size of the confidence interval is inversely related to the square root of the sample size.)
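Concept (e) can be checked exactly for small samples. The following sketch, with a hypothetical function name, assumes individuals are drawn independently (so that counts follow a binomial distribution) and computes the probability of every possible sample proportion for samples of size 10 from a population that is 20% red.

```python
import math

def sampling_distribution(p, n):
    """Exact probability of each possible sample proportion k/n when
    n individuals are drawn independently from a population in which
    a fraction p belongs to the category of interest."""
    return {k / n: math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n + 1)}

dist = sampling_distribution(0.2, 10)
for proportion, probability in dist.items():
    # Text histogram: one '#' per two percentage points of probability.
    print(f"{proportion:.1f} {'#' * round(probability * 50)} {probability:.3f}")
```

The tallest column sits at the population proportion, and the columns shrink as one moves away from it in either direction, which is the predictable pattern that makes estimation from a single random sample reasonable.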

The Sampling Laboratory, which runs on a Macintosh Plus, addresses most of these goals. It has most of the functionality of our original Sampler, although it is restricted to proportions.

2.1 Special Features of the Sampling Laboratory

The Sampling Laboratory supports the following:

Concrete representation of samples. The Sampling Laboratory uses icons to represent each individual in a small sample in order to emphasize the difference between samples and populations, which are represented as histograms. We are working on other concrete representation ideas for populations, samples and the sampling process, including one based on sampling by specifying a region in a space of colored pixels and one based on sampling from a pipeline in which individuals are produced temporally (see Figure 6).

Relationship among populations, samples, and sampling distribution. In an earlier program (Sampler), students had trouble understanding how the histogram of the sampling distribution was derived from the individual samples. The Sampling Laboratory indicates the relationship for each sample by flashing lines on both the sample and sampling distribution graphs, pointing out the correspondence (see Figure 6).

Comparison of distributions of sample proportions. The Sampling Laboratory allows students to compare and contrast sets of samples from different populations or sets of different size samples from the same population (see Figure 7, below). Students can re-examine a sampling distribution produced earlier or compare it to a later distribution.

Separation between setup and run. Students or teachers can set up a sampling process in terms of population, sample size, and number of samples, but postpone the actual production of the sampling distribution. This allows activities such as drawing 10 samples of sizes 10, 20, 40, and 80 from a population as a single operation (see Figure 8).

Box plot summaries of sampling distributions. The Sampling Laboratory can display a box plot summary for a sampling distribution and allows students to set the percentage of samples contained in the box (see Figure 9).

Confidence intervals and summary window. A summary window displays a set of box plots representing many sampling distributions. From a chart of box plots representing a set of sampling distributions, the software can then display a confidence interval for any sample proportion (see Figure 10). This approach follows that taken in Landwehr, Swift, & Watkins (1987).

2.2 Using the Program

2.2.1 Data Sets

Figure 1 shows the opening screen for the Sampling Laboratory. The user can open an existing data set or create a new one.

Figure 1

2.2.2 Objects

The Sampling Laboratory allows a student to create any number of objects, which can be used to construct populations. Each object type has one or more possible categories. The screen for creation of an object type is shown in Figure 2. In the example, objects of type M&M's are defined as having the categories, red, brown, yellow, green, tan and orange. Objects of type voters might have the categories, Bush, Dukakis, and undecided.

Figure 2

Figure 3 shows two object types and the associated categories. The user is focusing on the Voters object type. The categories for each object type are shown on the right in a scrollable list.

Figure 3

2.2.3 Populations

After selecting a particular object, students can define different populations by assigning a set of weights to the categories, and optionally assigning a name to the population. This is done by typing in weights or percentages, or by selecting uniform. In the example below (Figure 4), the student has selected the object type, M&M's, has labeled the population being created, "30% red," and is setting weights for each category.

Figure 4
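The idea of defining a population by weights can be sketched in a few lines. This is not the Sampling Laboratory's own code; `make_population` is a hypothetical helper, and the weights given to the non-red colors are made up for the example. Weights are normalized to proportions, and omitting them corresponds to selecting uniform.

```python
def make_population(categories, weights=None):
    """Return a mapping from category to population proportion.

    `weights` may be any non-negative numbers (counts, percentages, ...);
    if omitted, every category gets the same weight ("uniform").
    """
    if weights is None:
        weights = [1] * len(categories)
    if len(weights) != len(categories):
        raise ValueError("need exactly one weight per category")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    total = sum(weights)
    return {c: w / total for c, w in zip(categories, weights)}

MM_COLORS = ["red", "brown", "yellow", "green", "tan", "orange"]

# Hypothetical weights: 30% red, remainder split evenly.
thirty_percent_red = make_population(MM_COLORS, [30, 14, 14, 14, 14, 14])
uniform = make_population(MM_COLORS)
print(thirty_percent_red)
print(uniform)
```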

2.2.4 Experiments

For each population so defined, the student can run experiments, which are not full experiments in the sense of experimental design, but the production of sampling distributions from a specified population. To set up an experiment, the student sets a sample size and a number of samples to be drawn. A comment of any length can be added to describe the experiment further. The experiment can be run immediately or at any later time. After setting up and running several experiments, the student has a computer record of these experiments. In the example below (Figure 5), the student has run and commented on two experiments, one for a population with 20% red M&M's and one for a population with 30%.

Figure 5. Experiments on two populations of M&M's.

When the student decides to run an experiment, three windows are shown. One shows the population, the second shows the sample, and the third shows the sampling distribution from all the samples taken in that experiment. This third window thus approximates the set of likely samples from the given population. The samples can be drawn, as in the original Sampler, in either a step-by-step mode or a continuous run mode, which the student can PAUSE at any time.

In Figure 6, the process has been interrupted after 19 of 210 samples have been taken. The 19th sample is shown in the upper right. It has 10% red M&M's, even though the population percentage is 20%. The sampling distribution in the lower left shows a large spread, which one expects from the small sample size (10). It does seem to be centered near the population percentage. In addition to continuing the simulation at this point, the student could also investigate the pattern of proportions in the other categories (green, orange etc.) by changing the focus category.

Figure 6. Viewing the sampling process.
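An experiment of this kind can be approximated outside the program. The sketch below is a minimal simulation under stated assumptions: `run_experiment` and `tally` are hypothetical names, the population is collapsed to the focus category "red" versus "other", and individuals are drawn independently according to the population weights (that is, from an effectively unlimited population).

```python
import random
from collections import Counter

def run_experiment(population, sample_size, num_samples, focus, seed=None):
    """Draw `num_samples` samples of `sample_size` individuals and return
    the proportion of the `focus` category in each sample."""
    rng = random.Random(seed)
    categories = list(population)
    weights = [population[c] for c in categories]
    proportions = []
    for _ in range(num_samples):
        sample = rng.choices(categories, weights=weights, k=sample_size)
        proportions.append(sample.count(focus) / sample_size)
    return proportions

def tally(proportions):
    """Count how many samples produced each proportion (the columns
    of the sampling distribution), in increasing order of proportion."""
    return dict(sorted(Counter(proportions).items()))

population = {"red": 0.2, "other": 0.8}
proportions = run_experiment(population, sample_size=10, num_samples=20,
                             focus="red", seed=1)
for proportion, count in tally(proportions).items():
    print(f"{proportion:4.0%} {'#' * count}")
```

Changing the `focus` argument plays the role of changing the focus category, and rerunning with a different seed shows how much a set of 20 small samples can differ from one run to the next.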

Sampling distributions can then be compared. In Figure 7, the student is comparing two distributions of 20 samples each, one drawn from a population with 20% red M&M's and one drawn from a population with 30% red M&M's. In the first case, the modal column for the sampling distribution is at 30%, even though the population percentage is 20%. In the second case, the 20% and 30% columns have the same height, and the population percentage is 30%. Thus, the spreads are not clearly distinguishable. This is not too surprising. With a small sample size (10) and population proportions that are close together (20% and 30%), one would need to see a large number of samples to have a sharp distinction between the two sampling distributions.

Figure 7. Comparing two sampling distributions
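One way to see why the two distributions are hard to tell apart is to compute the exact chance of each sample proportion under an idealized model. The sketch below assumes independent draws (a binomial model), which is an assumption rather than a description of the program; `binomial_pmf` and `overlap` are illustrative names.

```python
import math

def binomial_pmf(n, p):
    """Probability of each count k = 0..n of red M&M's in a sample of size n."""
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n = 10
pmf_20 = binomial_pmf(n, 0.20)
pmf_30 = binomial_pmf(n, 0.30)

# Shared probability mass: 1.0 means identical distributions, 0.0 means disjoint.
overlap = sum(min(a, b) for a, b in zip(pmf_20, pmf_30))

# Standard deviation of the sample proportion under each population.
sd_20 = math.sqrt(0.20 * 0.80 / n)
sd_30 = math.sqrt(0.30 * 0.70 / n)

for k in range(n + 1):
    print(f"{k / n:4.0%}  p=20%: {pmf_20[k]:.3f}   p=30%: {pmf_30[k]:.3f}")
print(f"overlap = {overlap:.2f}, sd(20%) = {sd_20:.3f}, sd(30%) = {sd_30:.3f}")
```

Under this model the two populations differ by 10 percentage points, which is less than one standard deviation of the sample proportion at a sample size of 10 (about 13 and 14 points respectively). With only 20 samples each, a modal column at the "wrong" value is therefore unremarkable.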

2.2.6 Comparing Sampling Distribution Box Plots

Another way to compare sampling distributions is to compare box plot summaries for the distributions. Each box plot is a summary representation for the set of samples produced in a Sampling Laboratory experiment for a given population proportion. This distribution approximates the theoretical "likely sample set."

The Sampling Laboratory allows the student to set a percentage of sample proportions to be included within the box, the remaining sample proportions to be represented by the whiskers. A large number of experiments can then be compared easily. In Figure 8, a student has set up five experiments, to look at samples from populations with proportions ranging from 10% to 50%. In this example, each of the experiments has already been run with 20 samples of size 10 each. The comment column shows that the actual sample proportions are close to but not identical with the population proportions.

Figure 8. Five experiments on populations of M&M's.

Choosing the "Show Box Plots" button produces a single display of the box plots for the five populations (Figure 9). In this case, the student has set the box plot percentage to be 90%, meaning that the box must include at least 90% of the sample proportions. There are toggle controls for indicating the actual percentage of samples included within each box and each whisker, or for showing the whiskers at all.

This sort of display is the type used in Landwehr, Swift, & Watkins (1987). For each population proportion (reading along the y-axis), it shows the set of samples produced in the experiment. This approximates the set of theoretically likely samples from the population. Reading up from a point on the x-axis, one can see the populations whose likely sample sets include a given sample proportion. For example, an actual sample proportion of 15% which might be produced by collecting real data falls within the box plots of 10%, 20%, and 30% populations, in this example. Thus, an approximation to the theoretical 90% confidence interval is the interval [.1, .3].

Figure 9. Box plot comparison of five sampling distributions.
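The reading of Figure 9 can be sketched in Python. The report does not say how the program chooses the box edges, so the rule used here (trim the same number of samples from each tail while keeping at least the requested percentage inside the box) is an assumption, and `central_box` and `sample_proportions` are hypothetical names. Draws are again assumed to be independent.

```python
import random

def sample_proportions(p, sample_size, num_samples, rng):
    """Simulate one experiment: the proportion of red in each of num_samples samples."""
    return [sum(rng.random() < p for _ in range(sample_size)) / sample_size
            for _ in range(num_samples)]

def central_box(proportions, coverage_pct=90):
    """Return (low, high) edges of a box holding at least coverage_pct% of the samples."""
    ordered = sorted(proportions)
    n = len(ordered)
    needed = -(-coverage_pct * n // 100)   # ceiling of coverage_pct% of n, in integers
    trim = (n - needed) // 2               # samples dropped from each tail
    return ordered[trim], ordered[n - 1 - trim]

rng = random.Random(0)
boxes = {}
for pct in (10, 20, 30, 40, 50):
    props = sample_proportions(pct / 100, sample_size=10, num_samples=20, rng=rng)
    boxes[pct] = central_box(props, coverage_pct=90)

# Which populations have a box that includes an observed sample proportion of 15%?
observed = 0.15
plausible = [pct for pct, (low, high) in boxes.items() if low <= observed <= high]
print(boxes)
print("populations consistent with 15%:", plausible)
```

Because each experiment here has only 20 samples, the list of consistent populations changes from run to run; it will not necessarily match the 10%, 20%, and 30% populations read off Figure 9.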

2.2.7 Confidence Intervals

As the sample size and the number of samples increase, and as we examine finer gradations of population proportions, we can come arbitrarily close to the theoretical confidence interval. The Sampling Laboratory also supports construction of confidence intervals. The user simply clicks on the x-axis at the point corresponding to an actual sample. The program performs a linear extrapolation to connect the box plots in an envelope. It then highlights the region of population proportions that corresponds to the confidence interval about that sample (Figure 10), in this case, [.066, .338].

Figure 10. A constructed confidence interval for an actual sample.
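The envelope construction can be approximated as follows. This is a sketch under stated assumptions, not the program's algorithm: it joins neighboring box edges with straight lines, closes the envelope at populations of 0% and 100% (where every sample must match the population), and scans a fine grid of population proportions. The box edges in the example are made-up values, not those of Figure 9, and `interp` and `confidence_interval` are hypothetical names.

```python
def interp(xs, ys, x):
    """Piecewise-linear interpolation through (xs[i], ys[i]); xs must be increasing."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the interpolation range")

def confidence_interval(pops, lows, highs, observed, steps=1000):
    """Range of population proportions whose interpolated box contains `observed`."""
    grid = [i / steps for i in range(steps + 1)]
    inside = [p for p in grid
              if interp(pops, lows, p) <= observed <= interp(pops, highs, p)]
    return (min(inside), max(inside)) if inside else None

# Hypothetical 90% box edges for populations of 10% to 50%, plus the two
# end points 0% and 100% added to close the envelope (an assumption).
pops  = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0]
lows  = [0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 1.0]
highs = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

ci = confidence_interval(pops, lows, highs, observed=0.15)
print(ci)
```

For these made-up edges the interval comes out near [.075, .35]. A finer set of populations and more samples per experiment would tighten the envelope and move the result toward the theoretical interval.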

3 Sampling Laboratory Curriculum

The Sampling Laboratory curriculum has the following characteristics:

(a) A connection to sampling issues in the real world of high school students (see section 3.1 below).

(b) Awareness of the misleading nature of many words in the standard statistics vocabulary, e.g., "normal," "error," "confidence" (see section 3.2 below).

(c) Activities using concrete materials (e.g., bottle caps, see Appendix A) and real-world data (e.g., gender distribution in families).

(d) Inquiry-oriented activities, in which students explore statistical questions such as "do a coin and a tack have the same chance of landing UP?," defining their own experimental method and decision criteria.

(e) Significant use of the Sampling Laboratory, especially in conjunction with inquiry-oriented activities.

The curriculum was realized in two classrooms, which provided a wide range of student abilities and challenges for incorporating sampling lessons into different contexts. In each class we conducted before and after interviews with students to assess what they were learning. The classes were the following:

(a) A statistics class at Belmont High School (BHS) which has been taught using the Reasoning Under Uncertainty curriculum and the ELASTIC software for the past three years. Here the Sampling Laboratory activities were used as a four-week module in a semester-long course on statistics.

(b) A general math course at Cambridge Rindge and Latin High School (CRLS). This is a course for students who have not been successful in standard mathematics courses. The focus of the four-week Sampling Laboratory module was on relating concepts of statistical reasoning to students' everyday concerns. Statistics was not a part of the rest of the course.

In addition, students in another class at CRLS also used the Sampling Laboratory:

(c) An advanced placement math course at CRLS. Sampling was introduced near the end of the semester after students had taken the advanced placement test.

3.1 Examples of Sampling

Everyday experience is one place to begin discussion of sampling, showing where it enters into the fabric of a typical day. For example, Peter Mili, a teacher at Cambridge Rindge and Latin High School (CRLS) suggested three questions as tapping the topics students discuss frequently between classes and outside of school: How many students want condoms to be available in school? How many students support a Coke boycott? Are there different probabilities of violence in different parts of the city? Other topics we used or envision using are the following:

Breakfast. One may sample oatmeal, randomizing it by stirring. Breakfast foods advertise various proportions of nutrients on the package, and their sugar content is the subject of an RUU activity sheet (1-10). Samples taken for testing have to be destroyed (another example is sampling flashbulbs). Marketing as well as quality control involves accurate sampling. The debate about the healthful effects of oat bran illustrates reasoning based on samples. As for food generally, the story of the removal and the return of red M&M's (RUU, p. 2-12) gives a good example of marketing research. A question might be: If you sampled your friends, would that give you a good idea of what Americans (or other populations) have for breakfast?

Clothes. Here again is the issue of quality control and inspection: Are all instances of a product the same? Here, too, are marketing surveys of what items (e.g., shoes or some appealing example) are popular with whom, what features will make a new item saleable, how much people are willing to pay. Pump-up basketball shoes are a spectacular example. What proportion of students at the high school wear certain items? Would this be regional or national?

Media. Radio stations do surveys in order to aim music at specific groups (RUU, p. 2-12). The Nielsen ratings for TV are another example of surveys (RUU activity sheet 1-18). How are royalty payments for recordings played on the air figured? This is a good stumper explained in Moore (p. 5). An interesting case is trying out a new movie or piece of music, based on marketing knowledge, but where the expectation can go wrong. We are sampling variables that change through time. Do we really decide what to like or is it decided for us? How do we decide what movies to go to? Sampling determines much of the entertainment that is offered to us.

Risk. CRLS students showed some interest in this topic. Are teens unusually at risk for accidents, violence, early death? Highway safety is one aspect (RUU, pp. 4-30 ff). Deaths in Boston is another aspect - there's a discussable map in RUU (p. 4-39). Medical tests sample both population trends and one's own individual health, and samples may vary in both cases. Drug testing is a hotly debated example. Has teen drug use declined, as the government claims? Our very health system rests on testing and sampling. One extended case of how that works is a description of the introduction of the Salk vaccine at the end of Module 3 in RUU.

Politics and Government. The heavy use of polling in elections is a prominent example, and there's now an obligatory mention of "margin of error." Politicians are guided in their strategies by opinion polls. How would one predict the outcome of a school election? The census provides crucial data and determines how government money is spent. There are interesting questions here of undercounts of ghettos and of the homeless, and whether the final figures should be adjusted on the basis of estimates drawn from samples. The government provides non-census data too - what is the reason for this and is it reliable? Moore gives a good discussion of the Bureau of Labor Statistics (p. 111 ff.) and unemployment which might interest kids. Should one be skeptical of this data? On what basis?

Schools. SATs, IQs, and other such national tests are built into our educational system. Are they fair? Classroom testing is an interesting if subtle example too, given individual variation and differences of context. For example, is it fair to test students after a long vacation? The issue of standardization across samples is a fascinating one: At Belmont High School (BHS) a good discussion was initiated by asking kids their lowest and highest grades for the term, which prompted some accurate estimates of which teachers the grades came from. Can we compare grades or tests from different schools or classrooms?

3.2 Key Terms

One of the problems identified in research on statistical reasoning is that many terms are confusable with ordinary language terms. The module addressed several of these:

Normal. Common sense contrasts this term with abnormal, which leads some students to expect to escape abnormality by obtaining normal curves, to want the sample to be a normal distribution. Perhaps the stress should be on the verb "norming," the curve as a convenient norm. One can build out from histograms made from equally distributed variables such as heads and tails, or Moore's dark and white beads (p. 15). Then a normal curve can be shown to arise when many samples are tabulated by their proportion of heads or dark beads. The peak is at the "break-even point," necessarily tailing off in both directions.

Error. Moore's employment of the statistical sense of this term runs in one case against the ordinary sense. He distinguishes between sampling errors, which cause results to be different from the results of a census, and non-sampling errors, which might be present in a census (p. 22). Moore's text commits us to the term, but perhaps the first and confusing sense, having to do with a properly planned act of sampling, might be put in scare quotes - "error" - and shown to be different from error in the sense of doing something wrong. Certainly the data from our interviews suggests it should be flagged in some way, since some students weren't ready to encounter sampling variability.

Moore goes on to distinguish random sampling error, which is just this ordinary variation of samples, and nonrandom sampling error, such as through convenience sampling or an inappropriate sampling frame. Randomness can be used to explain this, but there is a further pitfall. This second category of sampling error, while intuitively O.K., is likely to be confused with non-sampling error (missing data, response error, processing error, etc.) because both involve doing something wrong. In this context Moore's distinction is not that useful, and perhaps could be played down. The key distinction is between sampling variability and bungled procedure.

Random. This is a subtle idea - trying to give a rule for the ruleless, to delineate the incarnately slippery. Some Belmont students got the idea of simple random sampling so firmly in mind that they thought doing things right brings one to the zero case of randomness, ruling out variability from the population. Moore usually contrasts randomness with long term coherence. The idea of randomness might stick better if it were presented not as a correlative or contrastive idea but as the name of an autonomous process that is simply there. Dice throwing is a nice example which does not get conflated with later normal curves of a set of samples of some proportion. It might help to dwell on random sampling variability and such ideas as independence and small causes acting independently, such as talking about the physical basis for the results of dice-throwing and cheating with loaded dice. Perhaps also it would help if students were asked to struggle with defining randomness themselves.

Confidence. In our interviews we asked about the continuity of ordinary and statistical senses of this term, and our experts and students had it both ways. But there is an intuitive basis here to build on. The BHS teacher Alice Mandel gave a good practical example in her interview of taking on everyday issues (elections, etc.) with scaling and weighted numbers as a way of focussing our subjective expectations so they're not "a nebulous cloud that you're trying to pack down" (p. 7). Such ideas as betting odds were employed by some of the Belmont students to explicate what confidence level means.

Moore explains clearly what confidence level does not mean (pp. 302-303), but he lays down a fine-spun taboo about phrasing it for the student who wants to say the true p falls within the confidence interval with a certain probability. The conceptual difference here is between following a method that achieves a rational result in general and the strong tendency to want to make an assertion about the unique case at hand. The point about method seems likely to require extensive explanation if it is raised (neither Alice nor her students raised it), but it might be interesting to introduce it late in a course to see how students respond.

Bias. Here the intuition is so robust that the problem is to make a transition from the human tendency to objective method. The student must grasp that a person being biased is only one matter that can affect a sample being biased, and that the latter and more objective case of bias may or may not be brought about by bias in the person sense. What has to be added is an appreciation of what bias does to a distribution - how it makes it diverge in a certain direction from the features of the true population. The student must see bias as an effect in the distribution. Telephone surveying (Moore, p. 22) and voluntary response to questionnaires (Moore, p. 7) are good examples for discussion of bias: you can estimate the direction of bias.
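The idea of bias as an effect in the distribution can be made concrete with a small simulation of voluntary response. Everything specific here is a hypothetical illustration rather than data from Moore or from the classes: the true proportion of "yes" opinions (30%), the response rates (80% for "yes" holders, 40% for others), and the function name `biased_sample_proportion`.

```python
import random

def biased_sample_proportion(true_p, respond_yes, respond_no, sample_size, rng):
    """Proportion of 'yes' among the first sample_size people who choose to respond."""
    yes = responses = 0
    while responses < sample_size:
        holds_yes = rng.random() < true_p
        # Assumption: people holding the 'yes' opinion respond at a different rate.
        if rng.random() < (respond_yes if holds_yes else respond_no):
            responses += 1
            yes += holds_yes
    return yes / sample_size

rng = random.Random(42)
true_p = 0.30
props = [biased_sample_proportion(true_p, 0.8, 0.4, 20, rng) for _ in range(2000)]
mean_prop = sum(props) / len(props)

# Proportion of 'yes' among responders implied by the assumed response rates.
expected = true_p * 0.8 / (true_p * 0.8 + (1 - true_p) * 0.4)
print(f"true proportion {true_p:.2f}, center of biased samples {mean_prop:.2f}, "
      f"predicted {expected:.2f}")
```

The samples still vary around a center, but the center sits well above the true 30%. Taking more samples or larger samples does not remove the shift, which is what distinguishes bias from ordinary sampling variability.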

3.3 The Sampling Module at BHS

Below is an account by Alice Mandel of the sampling module developed for use at BHS. Appendix A is a module handout and Appendices B and C are quizzes for the module. Section 4.1 gives more details about the actual classroom experiences.

Day 1: Population size experiment using a deck of cards with the diamonds missing and then 3 decks of cards with the face cards missing. This activity demonstrates that inferences from samples do not depend on population size and, by implication, the power of sampling as a procedure for estimating population parameters.

Day 2: Bottle cap experiment. (See Appendix A.)

Day 3: Finish bottle cap experiment. Worksheet on Random Digits from QL.

Day 4: Use Sampling Laboratory to simulate bottle cap experiment.

Day 5: Constructing 90% box plots on Macs using Sampling Laboratory. Design survey sampling procedure.

Day 6: Continue interpretation of charts of 90% box plots on Macs and on paper.

Day 7: Continue 90% box plots on Mac. Discuss application 8 from QL.

Day 8: Begin discussion of 90% confidence intervals (without formulas).

Day 9: Reading 90% box plots and change to 90% confidence intervals.

Day 10: Confidence intervals in the news; margin of error.

Day 11: Quiz on 90% box plots (see Appendix B). Quiz included question that required use of the Sampling Laboratory.

Day 12: Confidence Interval formula for proportions.

Day 13: Using confidence interval to find desired sample size. Collect survey project data.

Day 14: M&M Day. Took repeated samples of M&M's and calculated percent brown and margin of error with graph of 1/n. Also made a histogram of sample proportions which looked normal almost immediately. HW: Work on survey project.

Day 15: Work on JC Penney/Phone Book/Dictionary activities.

Day 16: Continuation

Day 17: Return quizzes. Review.

Day 18: Module test. (See Appendix C.)
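Days 12 and 13 of the outline above refer to a confidence interval formula for proportions and to its use in choosing a sample size. The report does not reproduce the formula used in class, so the sketch below uses the standard large-sample form, p plus or minus z times the square root of p(1 - p)/n, with z = 1.645 for 90% confidence. The function names and the example numbers are assumptions.

```python
import math

Z_90 = 1.645  # standard normal multiplier for 90% confidence

def proportion_ci(p_hat, n, z=Z_90):
    """Large-sample confidence interval for a population proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def required_sample_size(margin, p=0.5, z=Z_90):
    """Smallest n whose margin of error is at most `margin` (p = 0.5 is the worst case)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Hypothetical survey: 40 of 100 respondents say yes.
low, high = proportion_ci(0.40, 100)
print(f"90% interval: {low:.3f} to {high:.3f}")

# Sample size needed for a margin of error of 5 percentage points.
n_needed = required_sample_size(0.05)
print("sample size needed:", n_needed)
```

The formula is an approximation that works poorly for very small samples or proportions near 0 or 1, which is one reason the box plot charts are a useful complement.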

A major part of the course was to do a survey project. Each student formulated a hypothesis, defined independent and dependent variables, devised a sampling procedure, designed and conducted a survey, gathered data, organized and analyzed the data using ELASTIC as well as hand-generated graphs, analyzed results, formulated conclusions, and did a critique of their own study. Excerpts from these projects are included in Appendix D.

3.4 The Sampling Modules at CRLS

Below is an account by Peter Mili of the sampling module developed for use at CRLS. Section 4.1 describes the classroom implementation in more detail.

"We introduced the students to sampling using some newspaper and magazine reports, along with much discussion. Included was an activity where we tried to estimate how long it would take to ask a large number of people a simple question (minutes, hours, days,...).

"We had the students each create a question that they were interested in knowing about, and then had them ask 20 students at CRLS. Our final activity of the unit was to interpret these results and report them with a margin of error.

"We did the coin flipping in class and worked on collecting and organizing the information in tables and graphs. Then we went to the software where we had the students run experiments and try to understand all the representations (windows). This involved getting printouts and working with the students individually. We also had to "take a step back" at this point and have the students do some paper and pencil constructions of histograms and box plots. We felt that they needed this for better understanding.

"We found it beneficial to add written interpretation to the printouts. Specifically, we wrote the actual number of samples above each bar to correspond with the indicated percentages from the horizontal axis. With the printout of the box plot, we listed the actual numbers that corresponded to the range of the box and whiskers just beneath the graph. We then reinforced and summarized the data by writing a series of sentences which interpreted the information provided by the graphs.

"At this point we went back to the software so that the students could run experiments with different population proportions (10, 20...) in order to create the box plots so that we could talk about confidence intervals. We took printouts back to class and discussed how we use them to get a confidence interval for the sample proportion. We did not use this vocabulary, instead we used "margin of error" and "between" and "likely/unlikely" to describe the confidence. We then had the students write a sentence to report the results of their surveys."

Below are examples of survey questions devised by students at CRLS. Each student interviewed 20 people for the project.

(a) If you had to choose between "rap" music or "rock" music, which would you prefer?

Pick one. "Rock" / "Rap"

(b) Do you agree with the graduation requirement of 16 credits (4 years) in physical education?

Pick one. Agree / Disagree

Pick one. Yes / No

(d) Should the United States government legalize cocaine?

Pick one. Yes / No

4 Learning About Reasoning From Samples

As a complement to our work on developing new software and hardware configurations, we have been conducting studies of student learning. These studies have included ethnographic studies of classrooms and open-ended interviews. There are several goals of this work:

(a) To identify and characterize areas of difficulty for students in learning statistical reasoning, especially concepts related to confidence judgements on estimations of population parameters from sample statistics.

(b) To identify and characterize the impact of current teaching practices (classroom activities, text materials, software) on these areas.

(d) To develop guidelines for the design of software and new classroom activities.

4.1 Ethnography

Preliminary results of an ethnographic study of Sampling Laboratory classrooms are based on observations of 12 sessions of the BHS classroom and 4 of the CRLS classroom. They show both the workings of the Sampling Laboratory and the new sampling activities, which in the current year focussed on proportional sampling and the multiplicity of samples and populations, and the persisting challenges and difficulties the students experience, into which the program and its history offer at least experience and insight. The 1990 focus of the sampling module was a creative challenge for the teachers in the two experimental sites, and their adaptation of it to their particular contexts was imaginative.

In a pivotal session of Alice Mandel's course, students were asked what was wrong with a shuffled deck as the cards were turned up one at a time. A wrong guess meant being out of the game, the prize of which was a candy or cake which the winner could eat in front of the class. Two kids guessed too soon, but Chris detected the absence of diamonds on the seventh card. Then Alice tried the same experiment with four decks. Ryan was out at two cards, though he confessed that his premature guess was "so dumb." Elise guessed at seven that there were no face cards. After a follow-up discussion with Alice, the students went to the Macs. Here they referred to their previous physical experience with tossing bottle caps and tested the effect of different sample sizes on the width of 90% box plots (April 30th).

In the class at CRLS, where the students were at a lower level of math skills, Peter Mili started the unit by talking about the current U.S. census and reading some clippings of recent surveys. He zeroed in on a 1986 study of cocaine use which reported that 5.8 million Americans used the substance in the month studied. Peter then asked the kids to imagine the logistics of surveying this many subjects by phone, to think through the arithmetic of the time required for phone calls and the money required for surveyors. The students collaborated with Peter in answering these questions. Wayne got confused between thousands and millions, but finally the group decided it would take 100 people 580 ten-hour workdays each to complete the task. At this point Cassandra said: "So how did they do that? So they take a small amount and..." This session was followed by a coin-tossing class, then a class with sampling on the Macs (April 10th-12th). As in Alice's class, there was a balance between students' intuitive ideas and math potentials, and an underscoring of the multiplicity of samples and populations, as well as the task of making a reasonable estimate of the population proportion from the sample.
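The class's figure can be checked with a few lines of arithmetic. The report does not say how long each call was assumed to take; six minutes per call is an assumption chosen here because it reproduces the 580-day result.

```python
subjects = 5_800_000       # people to be reached by phone
minutes_per_call = 6       # assumption: not stated in the report
surveyors = 100
hours_per_workday = 10

total_hours = subjects * minutes_per_call / 60
workdays_each = total_hours / surveyors / hours_per_workday

print(f"total calling time: {total_hours:,.0f} hours")
print(f"workdays per surveyor: {workdays_each:,.0f}")
```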

In the BHS course, the chart of samples and populations (Figure 10) - represented in software, printouts, paper handouts, overhead projector, and blackboard sketches - became the working schema to which students and Alice referred. This was so true that on May 16th David accused Jen of getting a confidence interval not by actual inspection of the box plots but by just guessing - i.e., by using the general shape and units of the chart. The working lingo of Alice's class shifted from last year's means and distributions to box plots and proportions. Students seemed to connect margin of error and proportion more easily than standard deviation and mean in 1989, possibly because the order of percentages is simpler and more familiar.

Alice built up slowly to the idea of statistical confidence, using it at first in the everyday sense as she talked about the graphic apparatus. Confidence in the technical sense was introduced on May 1st with the proportion formula. Alice tied this explanation of confidence to the box of the box plot as described in a QL handout (Section III, Application 6), where the proportions inside the box are said to be "likely sample proportions." She emphasized the multiplicity theme the same day by having Will chart the calculations of width of confidence interval beside sample size and Jen graph the same findings. Questioning the students about the graph, she got them to talk about the bearing of its non-linearity on deciding how much of a sample to pay for. Alice addressed the issue of sample variability and randomness by injecting into the lore of her class the expression "bleep happens," whose memorability was heightened by adolescent humor. Like the graphic scheme, this notion seemed to sink in, and students enjoyed embroidering on it. David expressed the increasing accuracy of larger samples as "the bleep gets bleeped out" (May 7th).
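The non-linearity the students discussed can be tabulated directly. This sketch assumes the standard large-sample formula for a 90% interval (z = 1.645) with the worst-case proportion of 0.5; the report does not list the sample sizes or widths that Will charted, so the values of n below are illustrative.

```python
import math

def ci_width(n, p=0.5, z=1.645):
    """Full width of the large-sample 90% confidence interval for a proportion."""
    return 2 * z * math.sqrt(p * (1 - p) / n)

for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}   width = {ci_width(n):.3f}")
```

Each fourfold increase in sample size only halves the width, so the cost of additional precision rises quickly. That is the trade-off behind deciding how much of a sample to pay for.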

Some of the 1989 BHS difficulties (see section 1.1) surfaced. Early on, in the discussion of varying widths of confidence intervals resulting from change in sample size, Kim asked "What is it?," reflecting a desire for one number and displaying puzzlement over the suspension of this central tendency over a plurality of samples (May 1st). Another previous problem was "getting it right." This year gave more examples: Jen said, "So if you have a perfect sample, you get 40%?" (May 2nd).

A further challenge came in understanding the idea of independent and dependent variables, which the students had to employ in their final projects. Chris explained his application of the concept with good understanding in his project paper, but it was harder for others. On May 21st, David, who was running late in his project work, had a running dialogue with Alice on this point while she was meeting the demands of the kids at the Macs. His main confusion seemed to be the idea that something could alternatively be one or the other, depending on the question. After this, he talked with Bart and Ryan. Ryan asked Bart, "What's your hypothesis [in your project]?" Bart: "I didn't have any." In his project paper, Ryan said there was no dependent variable in his study of whether school interferes with student meals. As in the 1989 BHS class, the relation between mathematical variables, which can be connected constructively, and the causal or functional variables of science, about which people often have knowledge or strong intuitions, was not easy for the students to sort out. On their behalf it should be said that they tackled slippery social and psychological items in their project and struggled with issues that trouble professional researchers.

Alice kept up a running distinction between simulations and real-world samples (as did Peter), an issue closely linked to the foregoing. That the students were alerted to this is shown in Jen's query, two weeks after the card demonstration: "Did you shuffle the cards? Did you have it all planned out?" (May 16th).

The four CRLS students were a very diverse group, and their response to the curriculum was highly individual. Wayne and, to a lesser degree, Darla, had difficulty with the sheer size of some of the numbers involved and with keeping straight the difference between actual counts and percentages. Wayne believed he could influence the toss of a coin but came around by watching the others to the opinion that "it depends on how you flip it...you have to flip it the same every time" (April 11th). Both were excited by their real world survey question and by working on the Macs, and both seemed to gain understanding in their handling of percentage and their survey question. George had an intuitive sense of some probability issues, such as polls and gambling, but missed the middle part of the unit. Cassandra showed the clearest benefits. She started with skimpy math skills but showed in class discussion good progress in grasping the point of surveys and the variability of samples. Cassandra was able to give Wayne help in calculating his confidence interval in the wind-up session (May 4th).

Peter Mili exhibited wonderful skill and ingenuity in bringing the ideas of sampling to the group. In the opinion of the ethnographer, his approach merits a repeat trial with a class more similar in skills. It should also be noted that Peter was assisted during the unit by the students' regular teacher, Julie Hochstadt, and by a graduate student aide in education.

4.2 Interviews

Using pre- and post-course interviews based on two similar problems (see Appendix F) we interviewed four students each from the CRLS and BHS classes. There were thus eight pre- and eight post-interviews, each organized around two problems.

In the CRLS group, Wayne had trouble keeping numbers in mind to answer the questions and perhaps also had difficulty with standard English, being from Jamaica. He said the unit helped him with percentages. He tended to translate statistical questions - such as B4, B5 - into causal questions where he had some opinion. George had a sense of betting odds and referred many of the questions to his own scheme of smart/middle/dumb or high/middle/low in both pre- and post-interviews. He did not know what a confidence level was and had a vague sense of margin of error, interpreting A.1.3 in follow-up questions as meaning "It's wrong...it's wrong by 9 points...the other guy's ahead...not necessarily."

Darla was also more comfortable with causal language, though she took this reasoning farther than Wayne or George. In her pre-interview, she came up with her own solution to the educational problems of American students saying that the problem was hanging out and not studying, and recommended work with parents and scholarships for students. In the post-, she spoke of the ethnic and racial terrain of elections. Darla seemed to lack confidence in her ability to venture into numerical reasoning. At the end of the pre-interview, she reproached the interviewer, saying "These questions are complicated," and gave the example of an arithmetic problem which she could solve. She found it hard to distinguish sample and population in her post-interview.

The unit changed Cassandra's knowledge of margin of error from admission of ignorance in the pre- to a clear verbal explanation in post-. To B7, she said she would bet based on the margin of error. Cassandra was casual in her use of numbers but otherwise adept at getting at the ideas, though she did not grasp confidence level. A sample of her thinking is shown in the following stretch of interview in respect to question A.2.6 in her pre-interview:

Interviewer: "How did you figure that out?"

Cassandra: "I just thought...Cause I figured if like, there's so many green and so little red, that, even though you mix them up, it's like the greens overpowering, dominate...you know."

In the exit interview she said she liked working on the computer: "It had all of charts right there." Box plot was the main thing she learned. Of political surveys, Cassandra said: "I won't look at them the way I used to. There's a margin of error."

Elise was Cassandra's counterpart in the BHS interviewees. She began by wondering whether "margin of error" means "outlier," then answered candidly she didn't know. In the post-, she clearly identified margin of error and how it worked. Elise said "confidence" meant "they're 90% sure that it is 53%...the other 10% they're not sure about." Unlike Cassandra, Elise had solid math skills and explained the shrinking of confidence interval as sample size increases in terms of square root in the formula. When asked about distributions in the second problem, Elise groped at first but then tied the discussion to box plots. The contrast in Elise's answers is seen in one of her pre- remarks about B.1.7.:

Interviewer: "Can you ever make a bet on the basis of a poll?"

Elise: "If you're stupid, yeah!"

Cindy had some knowledge - perhaps sheerly semantic - of margin of error in her first interview: "That the person winning might win more, or might not win." In her post-session, she gestured outwards to mean "that there's an error." She added, "It's easier for me to look at something and do it out." Confidence Cindy explained as being 90% sure that the data is accurate. She said that 100% confidence wouldn't really be a sample, "so 90% is good because you can really get an excellent idea of it." Like all the other BHS kids except Elise, Cindy said she simply used the formula.

Will had an initial sense of margin of error and answered B.1.2 as meaning that 86 Japanese students passed "more or less seven." Like George, Will seemed to have some familiarity with betting, spoke of "spread" (B.1.7) and explained margin of error in the second interview by saying to A.1.2. that "it could mean something or could just be the luck of the draw." Will had difficulty explaining confidence level: "Ms. Mandel would kill me...uh...90% accurate." When asked about distribution, Will said: "You gotta distribute a certain amount into different...if you were doing this on a computer, there'd be different columns, and you'd have to distribute it in each column." After answering the questions in the second interview, Will said that he was not good with graphs, that other kids in the class had more computer courses, and that "I'm going into business and I don't like computers that much, but I know that they're there and I'm gonna have to use them."

Kim moved from characterizing margin of error as "it could be wrong" to a correct numerical use. In fact, she overused it in post-, ascribing overlapping margins of error to the American population and the Japanese sample in problem B.1. Kim said that confidence level meant that they're "not sure...They're 90% sure it's valid." When pressed on the ambiguous "it," Kim identified it with "the percentage of students in the population." She referred questions about the formula (POST, B.1.11) to the graphic scheme: "If you use 2 for z that's for a 90% box plot."

As might have been expected, the interviews show the BHS kids having a better grasp of the concepts, since they started with better math skills and took the sampling unit as part of a larger statistics course. On the other hand, in such matters as causal reasoning, intuitions, and misconceptions (for example, the idea that a larger sample will include more differences and consequently make for a wider confidence interval), the groups were similar. As perhaps can be seen from some of the student remarks, the interview itself and the kind of understanding it explores were secondary from the student point of view, for which the matters of importance were passing tests, working on the computer, and doing the individual projects. In this sense the mastery of statistics in use is probably better than a strict interpretation of the interviews indicates, because what the students aimed at was - in Cindy's words - "to look at something and do it out."
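
The misconception mentioned above, that a larger sample makes for a wider confidence interval, can be checked numerically. The following is a minimal sketch and not part of the Sampling Laboratory. It assumes the standard normal-approximation margin of error for a proportion, z * sqrt(p(1 - p) / n), with z = 1.645 as the usual value for 90% confidence; the function name and default are illustrative, and the formula taught in the unit may use a different convention for z.

```python
import math

def margin_of_error(p_hat, n, z=1.645):
    """Normal-approximation margin of error for a sample proportion.

    p_hat: observed sample proportion
    n:     sample size
    z:     multiplier for the confidence level (1.645 is assumed here for 90%)
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Each fourfold increase in sample size halves the margin of error.
for n in (25, 100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 3))
```

Because n sits under a square root, quadrupling the sample size halves the margin of error, so the interval narrows rather than widens as the sample grows.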

5 Future Directions

Further work on the Sampling Laboratory is needed. Some of the ideas that have emerged from our field testing are these:

Visual representation of the sampling process. The current Sampling Laboratory displays a population as a bar chart. A sample from that population magically appears as another standard bar chart or, under user control, as a bar made up of small triangles, each representing a unit in the sample. A sampling distribution is another bar chart showing where each sample proportion falls within the set of all samples generated in an "experiment."

This method of representing the sampling process is too abstract for some students. We have considered a number of ways to make the process more concrete. One idea is to build on Judah Schwartz's Ample Sample program. In this approach, the population is an array of icons with variation in color or shape. A sample is a selected region of the population. Both the population and the sample proportion are indicated directly by the density of the focus category's icon. The sampling distribution could be constructed by stacking up samples of similar densities.

An alternative is to use the time dimension to show the population. Elements of the population could spew out of a pipe. If one did not know the generating function for the population, this representation would make the need for sampling more apparent. Samples in this representation would be portions of the population stream.

Context-dependent help and explanation features. These would be connected to each window and data object. Because the Sampling Laboratory maintains extensive information about each sampling distribution, it can serve as a resource for the student who wants to explore questions such as "Which samples contributed to this box plot having the shape it has?"

Bias. Bias in sampling can invalidate any information about a population inferred from a sample. It would be valuable to have a way to introduce bias in the sampling process in order to observe its effects.
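
One way such a feature might behave can be sketched in a few lines. This is an illustration only, not a design for the Sampling Laboratory: the population (30% in the focus category), the selection weights, and all names are assumptions made for the example. Units in the focus category are given twice the chance of selection, and the resulting sample proportions are compared with those from unbiased draws.

```python
import random

def sample_proportions(population, weights, n, trials, rng):
    """Return the proportion of 1s in each of `trials` samples of size n.

    Units are drawn with replacement; `weights` gives each unit's relative
    chance of being selected.
    """
    proportions = []
    for _ in range(trials):
        sample = rng.choices(population, weights=weights, k=n)
        proportions.append(sum(sample) / n)
    return proportions

rng = random.Random(18)  # fixed seed so the run is repeatable

# Hypothetical population: 300 units in the focus category (1), 700 not (0).
population = [1] * 300 + [0] * 700
fair_weights = [1] * len(population)
# Assumed form of bias: focus-category units are twice as likely to be drawn.
biased_weights = [2 if unit == 1 else 1 for unit in population]

fair = sample_proportions(population, fair_weights, n=100, trials=500, rng=rng)
biased = sample_proportions(population, biased_weights, n=100, trials=500, rng=rng)

fair_mean = sum(fair) / len(fair)
biased_mean = sum(biased) / len(biased)
print("true proportion:      0.30")
print("mean, unbiased draws:", round(fair_mean, 3))
print("mean, biased draws:  ", round(biased_mean, 3))
```

Under this weighting the expected sample proportion is 0.6 / 1.3, or about 0.46, so the biased sampling distribution centers well away from the true value of 0.30 no matter how many samples are taken.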

Stratified sampling. Stratified sampling can increase the accuracy of information about populations that have multiple subpopulations of interest. It would be useful to have methods for defining subpopulations and for sampling from these subpopulations disproportionately.
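
A rough sketch of disproportionate stratified sampling follows. It is not an existing feature; the two subpopulations, their sizes, and the function names are hypothetical. Each stratum is sampled equally, even though one is nine times larger, and the estimate weights each stratum's sample proportion by that stratum's share of the population. An unweighted pooled estimate is shown for comparison.

```python
import random

def stratified_draw(strata, n_per_stratum, rng):
    """Draw n_per_stratum units without replacement from each stratum."""
    return [rng.sample(units, n_per_stratum) for units in strata]

def weighted_estimate(strata, samples):
    """Weight each stratum's sample proportion by its population share."""
    total = sum(len(units) for units in strata)
    return sum(
        (len(units) / total) * (sum(sample) / len(sample))
        for units, sample in zip(strata, samples)
    )

def pooled_estimate(samples):
    """Ignore the strata and treat all sampled units as one sample."""
    pooled = [x for sample in samples for x in sample]
    return sum(pooled) / len(pooled)

rng = random.Random(18)  # fixed seed so the run is repeatable

# Hypothetical strata: 9,000 units at 20% and 1,000 units at 80%.
strata = [
    [1] * 1800 + [0] * 7200,
    [1] * 800 + [0] * 200,
]
true_proportion = (1800 + 800) / 10000  # 0.26

weighted, pooled = [], []
for _ in range(200):
    samples = stratified_draw(strata, n_per_stratum=100, rng=rng)
    weighted.append(weighted_estimate(strata, samples))
    pooled.append(pooled_estimate(samples))

mean_weighted = sum(weighted) / len(weighted)
mean_pooled = sum(pooled) / len(pooled)
print("true proportion:       ", true_proportion)
print("mean weighted estimate:", round(mean_weighted, 3))
print("mean pooled estimate:  ", round(mean_pooled, 3))
```

The comparison shows why the weighting step matters: when the smaller subpopulation is oversampled, pooling the samples without weights pulls the estimate toward that subpopulation's proportion.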

Links to sampling theory. The current Sampling Laboratory provides the basis for an empirical approach to understanding concepts ordinarily encountered in abstract forms. Its power as a tool would be enhanced if its operation could be linked to theoretical constructs. For example, we could show the theoretical binomial curve on the sampling distribution window. Then, the student could select a region and see the probability that a sample would fall in that region and compare that probability to the actual sampling distribution. Other features could help students see the relation between sample size and confidence interval or between confidence level and confidence interval. We could also allow specification of population size to show that it does not affect the reliability of estimates of the population proportion.
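
The comparison between a selected region of the binomial curve and the empirical sampling distribution might look like the sketch below. It is illustrative only: the sample size, population proportion, region, and function names are assumptions, and samples are simulated as independent draws rather than taken from the program.

```python
import math
import random

def binomial_pmf(k, n, p):
    """Probability of exactly k focus-category units in a sample of size n."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def region_probability(lo, hi, n, p):
    """Theoretical probability that the sample count falls in [lo, hi]."""
    return sum(binomial_pmf(k, n, p) for k in range(lo, hi + 1))

def empirical_fraction(lo, hi, n, p, trials, rng):
    """Fraction of simulated samples whose count falls in [lo, hi]."""
    hits = 0
    for _ in range(trials):
        count = sum(rng.random() < p for _ in range(n))
        if lo <= count <= hi:
            hits += 1
    return hits / trials

rng = random.Random(18)  # fixed seed so the run is repeatable
n, p = 50, 0.4           # assumed sample size and population proportion
lo, hi = 15, 25          # selected region, in counts (30% to 50% of the sample)

theory = region_probability(lo, hi, n, p)
observed = empirical_fraction(lo, hi, n, p, trials=2000, rng=rng)
print("theoretical probability:", round(theory, 3))
print("empirical fraction:     ", round(observed, 3))
```

With enough samples in the experiment, the two numbers should agree closely, which is the link between the empirical and theoretical views that such a feature would make visible.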

Decision theory. Questions about sample size, confidence levels, and confidence intervals ultimately are meaningful only if there are costs and benefits associated with choices one makes, e.g., that each sample imposes a cost, or that there is some value in being correct about which population a given sample came from. This suggests a successor program in which students must make choices about sampling, using limited resources in order to achieve some goal, such as finding a "small" region that "almost certainly" includes the population proportion. Definitions of parameters like region size (width of confidence interval) would arise from some real problem context.
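
The trade-off between sampling cost and the width of the region can be made concrete with a small calculation. The sketch below is an assumption-laden illustration, not a specification for a successor program: it uses the normal-approximation margin of error with the worst case p = 0.5, takes z = 1.645 for 90% confidence, and applies a hypothetical fixed cost per sampled unit.

```python
import math

def required_sample_size(max_margin, z=1.645, p=0.5):
    """Smallest n whose normal-approximation margin of error is <= max_margin.

    Solves z * sqrt(p * (1 - p) / n) <= max_margin for n and rounds up.
    """
    return math.ceil((z / max_margin) ** 2 * p * (1.0 - p))

def total_cost(n, cost_per_unit):
    """Hypothetical cost model: every sampled unit costs the same amount."""
    return n * cost_per_unit

# A narrower region costs disproportionately more sampling.
for target in (0.10, 0.05, 0.025):
    n = required_sample_size(target)
    print(target, n, total_cost(n, cost_per_unit=2.0))
```

Halving the acceptable margin roughly quadruples the required sample size, so a student with limited resources has to decide how "small" a region is worth paying for.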

Extension to other measures. Sampling Laboratory was restricted to estimations of population proportions for several reasons. But many students could go beyond this to explore other measures of central tendency such as mean or median.