How could a study “in the green” still carry replication risk?
Introduction to this series:
This series of posts frames its analysis through the lens of risk, rather than the conventional program-provider lens of potential and promises, or the conventional evaluator lens of a pass/fail checkmark that a program “works”. This series is for district-level decision-makers choosing or renewing programs. Your real-world use may differ substantially from the ‘lab’ setting in studies, and those differences can affect your results. They could lie in how you implement the program (see our blog on implementation risk) or in whether you ultimately replicate the studied outcomes. It’s important to understand and assess the major risk areas. Understanding these risks helps you evaluate programs, weigh tradeoffs, choose risk-mitigation strategies, and learn how to apply a program effectively to get results.
Related Posts:
- Lens: Why the lens of “risk”?
- Info: How risky are different sources of program information?
- Studies: How risky is replicating study results?
Overview and a Note
This post highlights several study risks and stresses the need for more than one study. It ends with recommendations for repeating studies in different settings to find reliable patterns and ensure results across all student groups.
Note: Highlighting the risks of a dearth of studies doesn’t mean single studies aren’t valuable—they are. Each study contributes essential evidence about how a program can be applied to get results. But to address the myriad risks below, we need many more studies than one or a few, each with comprehensive evaluation and reporting.
The risk types:
Seven study risks follow, and the conclusion describes a continual-evaluation strategy that addresses them. Information from this strategy helps program users determine specifically what must be done to achieve results.
1] Program revision risk:
Programs evolve, but studies only capture them at one point in time. In some cases, that point may be over a decade ago. It’s important to assess how program changes since the study might affect results. Vendor support for educators may also have changed, such as training courses or the support model. These changes can affect how much or how well the program is used and, in turn, the results.
2] High stakes assessment transfer risk:
Studies often focus on a specific state’s standardized test, which varies by state and year. States usually outsource standardized tests and may use different providers. Grade-level standards lists also vary by state and year. Bottom line: a historical study or a study from a different state likely didn’t use your current test. While this may not be a high risk, since results are usually consistent across independent, standardized tests, it’s still important to consider which assessments were used. For example, would a 3rd-party assessment, at the Kindergarten grade level, in one charter district, in 2011, give you sufficient confidence to predict impacts on your current state assessment for grades 3-5?
3] Studied situation risk:
Each study, of course, reports on its particular districts, grade levels, and classrooms. But districts vary, and a program can have different effects depending on how it’s used in different settings, so district-level differences can lead to differences in results. This is known as a risk to “external validity”: will your (different) scenario replicate the results? Variations could include student groups, grade levels, support strategies, content assignments, teacher training, usage time, or device availability. Get the full details of the studied conditions, especially if they are limited to a single district, then consider how your planned situation might differ and what the risks are of replicating both the usage and the outcomes.
4] Low rigor study risk:
Unfortunately, single success stories are often promoted, and decisions are often made from one chart that summarizes a study. To manage risk, ask for the full documentation and the full report’s rigor level. To address this issue of rigor, the U.S. Department of Education’s WWC has provided comprehensive and accessible descriptions delineating requirements for study rigor. As one example of the WWC handbook’s thoroughness, WWC deems outcome measures developed by the program developers themselves “nonindependent” and thus not sufficiently rigorous for WWC review. A second highly recommended, practical, and accessible resource on studies, rigor, and reporting is provided by the research firm Empirical Education, in collaboration with the edtech industry trade group SIIA.
Beyond WWC comparison-group studies (Tiers 1 and 2), the U.S. Department of Education has added two lower rungs on the ladder of rigor: Tier 3, correlational studies, and Tier 4, research basis. Tier 3 is, in practice, about dose. Understanding the amount of dose needed to get results, and its achievability, is indeed crucial. The risk with dose-based study claims is excessive extrapolation. More use does get more results, but be cautious of marketing claims that extrapolate results to impractically high doses, which may carry substantial risk of not being attainable at scale (see below). Tier 4 covers a product’s basis in research and stands on its own. To evaluate whether a program will work for you, review both its research basis (you don’t need to read the papers) and its published logic model (you should demand and understand this).
5] Experimental design fidelity risk:
Full experimental study designs direct the studied schools and teachers in precisely how to use the program, from training through application and dosage. These “as-designed” usage conditions will never be implemented with 100% fidelity. Moreover, as a standard research practice, fidelity to the design is ignored to first order, out of concern that some “x-factor” could be causing both the changes in usage and the changes in the outcome. This approach is called ‘Intent to Treat’ (ITT): only the yes/no assignment is counted as the binary ‘dose’ variable. Meanwhile, in the real world, entire assigned classes or even schools can fail to implement the experimental design. Treatment sites could end up barely participating, with just a couple of logins or very few program minutes. Control sites might ignore their assignment and log full-program minutes. Obviously these “leakages” dilute the measured differences between program users and their comparison controls, and ITT misses this dilution. At a minimum, you need to be told the fidelity of use for the studied schools, so you can estimate replicability, and for the control schools (see below).
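To see how leakage shrinks an ITT estimate, here is a minimal back-of-envelope sketch in Python. All the numbers are hypothetical, and it assumes a simple compliance model in which the program affects outcomes only through actual use:

```python
# Hypothetical sketch: how non-compliance "leakage" dilutes an
# Intent-to-Treat (ITT) effect estimate.
true_effect = 0.20          # effect of actually using the program (SD units)
treatment_uptake = 0.70     # share of assigned-treatment sites that really used it
control_crossover = 0.10    # share of control sites that used it anyway

# Under these assumptions, the ITT contrast shrinks by the
# net difference in actual use between the two groups.
itt_estimate = (treatment_uptake - control_crossover) * true_effect
print(round(itt_estimate, 3))  # 0.12, well below the true 0.20
```

With 70% treatment uptake and 10% control crossover, the measured ITT effect is only 60% of the true effect of actual use, which is why fidelity-of-use reporting matters for estimating replicability.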
6] Dose Achievability Risk:
Results should be reported based on how much the program was used. Yes, results will increase with more use. To apply the results to your situation, then, you need to know how to reach the reported dose. What training, startup, support, and weekly usage are required? Then, given your expected use plan, what percent of your students are estimated to achieve that minimum dose? What percent of your schools are estimated to achieve that dose on average? Results based on unrealistically high usage aren’t meaningful.
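The achievability check above can be sketched as a back-of-envelope calculation. All inputs here are assumed, hypothetical numbers, not figures from any real study:

```python
# Hypothetical check: can students plausibly reach a studied dose?
studied_dose_min = 900            # minutes/year reported in a study (assumed)
weeks_available = 30              # instructional weeks after startup and testing windows
planned_minutes_per_week = 25     # your realistic weekly schedule
session_rate = 0.92               # fraction of planned sessions actually held

expected_dose = weeks_available * planned_minutes_per_week * session_rate
print(expected_dose)                       # 690.0 minutes
print(expected_dose >= studied_dose_min)   # False: short of the studied 900
```

If the plan falls short, as in this sketch, either the weekly schedule must grow or the study’s results at lower dose levels become the relevant benchmark.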
7] Control condition risk:
Control classroom activities are often called ‘business as usual,’ meaning we don’t track what really happened. There is a risk that the control group’s activities were NOT business as usual, because a full experimental study, with one group using the program, is itself a perturbation to a district’s “business as usual”. Particularly if the study covers just a single district, there could be systematic, districtwide deviations in the control condition. For example, the control group may have had less access to devices. The control condition might receive less professional development in the academic content area being studied. The control teachers across the district, or the district itself, could choose to change their “instruction as usual” from the prior year during the experimental year. Potential bias in impact estimates from such districtwide control-condition changes should be anticipated and at least qualitatively reported in any study’s discussion section.
What is a Low Risk Studies Strategy?
Many risks, too few studies—how do we move forward? Here’s a set of strategies from a program provider’s perspective:
Study All Schools:
Studying all school cohorts on a large scale reduces risks from small sample sizes or concentrated local factors.
Study Every Year:
Each new school year provides new subjects to study: new school cohorts adopting the program. And of course each new year’s study covers the latest program version, program supports, and outcome assessments.
Study All Assessments:
Reporting results from all state standardized tests and aggregating them into national studies reduces the risk of relying on just one measure. It also makes it easier to reveal patterns of program effectiveness that are robust across assessments.
Study Use “in the Wild”:
Studying programs implemented in real-world educational settings via quasi-experimental design (QED) helps reduce the risks associated with infidelity to controlled experimental conditions, and also the risk of a perturbed control that was not “business as usual”. In a QED using universally available school-level outcomes data, the control districts have zero use of the program being studied; the controls are literally doing their “business as usual” with zero regard for any “study”. Pairing “study all schools” (above) with this QED “in the wild” also mitigates the “studied situation” risk, because it organically includes the full real-world range of demographics and conditions.
Report all Dose Levels:
Report outcomes at different dose levels. This provides valuable insights into program effectiveness and informs implementation objectives and strategies. Reporting the percentage of students reaching specific dose levels, including subgroup breakouts, helps assess program impact and identifies potential risks associated with dose attainment.
Report How to Get to a Dose:
Identify pragmatic implementation strategies and practices that help students attain specific dose levels. These “implementation guides” manage the risks around implementation levels, and thus around achieving results that depend on meeting a dose.
Report Subgroup Results:
Break out study results for different student subgroups. This helps identify potential risks and benefits for diverse populations. Breakouts should include students at different proficiency levels.
Conclusion
The risk of too few studies—or studies that don’t fully describe their conditions—is not a reason to dismiss evidence. It’s a reason to interpret it with care, and to demand more from providers. It’s critical to understand what’s missing, and to make a thoughtful judgment about how likely the studied conditions and outcomes are to match your own. Even a program with a strong evidence claim still carries risk if used differently, in a different setting, or studied under unrealistic conditions.
But when district leaders ask the right questions through a risk lens, they become better equipped to serve students and teachers. By demanding studies that reflect today’s scenarios, that apply across a range of users, and that address real implementation factors, your elevated expectations can reshape the entire research market.