The Development of Computer Science Concepts Inventory for Middle School Students: Preliminary Results
Background, Purposes and the Intended Uses of the Assessment
The CS Concepts Inventory is intended to measure students’ understanding of the four core concepts of CS (variables, conditionals, loops and algorithms) taught at the middle school level. Additionally, we incorporated the concepts of debugging, comprehension and development into the assessment. The assessment was guided by a conceptual framework informed by the Focal Knowledge, Skills and Abilities (FKSAs) framework developed by Grover and Basu (2017), the K-12 CS Framework (K–12 Computer Science Framework, 2016) and the Computer Science Teachers Association (CSTA) Standards (CSTA, 2017). The assessment uses elements from a block-based programming environment as the context for every question, based on findings suggesting that learners, especially novices, experience fewer conceptual and cognitive difficulties when using these tools (e.g., Grover, Pea & Cooper, 2015; Robins, Rountree, & Rountree, 2003).
The current version of the CS Concepts Inventory was written for students in grades six through eight. We believe the assessment can be used either in longitudinal contexts, such as a pre-intervention-post design, or as a single administration, such as a pre- or post-assessment only. The assessment uses a multiple-choice format with four to five answer choices per item. There are 22 items in total. We suggest giving students approximately 40-45 minutes to complete the assessment.
The Process of the Development of the Assessment
Some of the items used in the assessment were adapted from Weintrop and Wilensky (2015) and from work done in the labs of Thomas Price and Tiffany Barnes (Department of Computer Science, NC State University). The remaining items were developed based on the FKSAs, K-12 CS Framework, and CSTA Standards (noted above). Two pilot studies were conducted, each testing a different version of the instrument. The original draft version used in the first pilot study consisted of 20 items and was administered to 22 middle school students. The results of this first study were used to determine the spread of item difficulty levels, as well as the readability of the assessment. They showed that the items were concentrated at the medium and hard levels of difficulty. Thus, for the second pilot study, we added seven items in the hope of obtaining a broader range of item difficulty, especially at the lower levels. For each pilot study, we asked three undergraduate and graduate students who were considered novices in computer science to give feedback on the clarity and readability of the assessment items and directions.
The revised version of the assessment consisted of 27 items and was administered to 245 middle school students in the second pilot study. Confirmatory Factor Analysis (CFA), Cronbach’s alpha, Composite Reliability (CR; Raykov, 1997) and Rasch modeling analysis were used to assess the reliability and validity of this revised version of the assessment. Those analyses yielded results indicating a satisfactory level of model characteristics for assessment in research settings. This second round of analysis resulted in a model that consisted of 22 final items.
Evidence of Construct Validity and Reliability
To collect evidence of validity and reliability, we combined the Classical Test Theory (CTT) method and the Item Response Theory (IRT) method. We first ran a CFA to confirm that the assessment is unidimensional and measures one latent trait: students’ understanding of CS concepts. Next, we calculated Cronbach’s alpha and CR to examine the internal consistency of the assessment. Finally, we ran Rasch modeling analysis to test the validity and reliability of the assessment, given that Rasch modeling analysis provides more robust validation results. Based on these results, we decided which items to retain and which to remove. Items were removed from the model when they did not reach statistical significance (p > .05) in the CFA. Removing these items increased the value of Cronbach’s alpha. Items were also removed when their outfit or infit MNSQ values fell outside the acceptable range of 0.70 to 1.30 (Wright & Linacre, 1994). All the analyses were done using Winsteps 4.0.1 and Stata 15. Figure 1 is a diagram depicting the procedure used to validate the assessment.
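The MNSQ screening rule described above can be sketched in Python. This is an illustrative sketch only, not the Winsteps procedure used in the study: it assumes the dichotomous Rasch model, assumes that person abilities and the item difficulty are already estimated, and uses the standard definitions of outfit (mean of squared standardized residuals) and infit (information-weighted mean square). The function names and the sample data are hypothetical.

```python
import math

# Acceptable MNSQ range cited in the text (Wright & Linacre, 1994).
ACCEPTABLE_MNSQ = (0.70, 1.30)

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-square statistics for one item.

    responses: 0/1 scores, one per person (hypothetical data).
    abilities: person ability estimates in logits, assumed already known.
    """
    sq_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        sq_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(sq_residuals)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit

def is_misfitting(infit, outfit, bounds=ACCEPTABLE_MNSQ):
    """Flag an item when either statistic falls outside the accepted range."""
    low, high = bounds
    return not (low <= infit <= high and low <= outfit <= high)

if __name__ == "__main__":
    infit, outfit = item_fit([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)
    print(infit, outfit, is_misfitting(infit, outfit))
```

A strict application of this rule flags any value outside the range, however narrowly; whether a borderline item is kept remains a judgment call for the analyst.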
After four misfitting items were removed from the model, the CFA yielded χ²/df = 1.33, p = .001, RMSEA = .039 (90% CI = .026, .051), CFI = .863, TLI = .849 and SRMR = .059. Even though the CFI and TLI values did not meet the acceptable value of > 0.95 (Hu & Bentler, 1999), the χ²/df and RMSEA values met their cutoffs, which are < 2 (Tabachnick & Fidell, 2007) and < .06 (Hu & Bentler, 1999), respectively. Given that we are still in the process of refining this assessment, and CFA is dependent upon sample size, we expect this result to change with future administrations of the instrument as the sample size increases. Moreover, Diamantopoulos and Siguaw (2000) argue that RMSEA is an important aspect of CFA because it detects the lack of fit between the obtained data and the model. Even though we consider all the results from the CTT method (in this case, the CFA results), we rely more heavily on the results computed through Rasch analysis. One reason is that Rasch analysis is deemed to be sample-independent and thus does not depend on sample size (Bond & Fox, 2013; Boone, Staver & Yale, 2013). Moreover, Rasch analysis assumes that higher-ability students have a higher probability of correctly answering both more difficult items and easier items than lower-ability students (Bond & Fox, 2013; Boone, Staver & Yale, 2013), which is the primary role of an assessment. This assumption is reflected in the MNSQ values: when the values fall outside the range mentioned above, the item does not behave as it should. Table 1 shows all the MNSQ values; they are within the acceptable range except for the outfit MNSQ value for Item_4_Loo3, which is 0.69. We believe this value is still acceptable due to its close proximity to the cutoff.
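The comparison of the reported fit indices with their cutoffs can be laid out as a small check. This sketch only restates the values and cutoffs given in the paragraph above; the dictionary keys and the function name are hypothetical, and SRMR is omitted because no cutoff for it is stated in the text.

```python
# Fit indices as reported in the text.
REPORTED = {"chi2_df": 1.33, "rmsea": 0.039, "cfi": 0.863, "tli": 0.849}

# Cutoffs as cited in the text: (direction, threshold).
CUTOFFS = {
    "chi2_df": ("below", 2.0),   # Tabachnick & Fidell (2007)
    "rmsea": ("below", 0.06),    # Hu & Bentler (1999)
    "cfi": ("above", 0.95),      # Hu & Bentler (1999)
    "tli": ("above", 0.95),      # Hu & Bentler (1999)
}

def check_fit(values, cutoffs):
    """Return a dict mapping each index to True when it meets its cutoff."""
    results = {}
    for name, (direction, threshold) in cutoffs.items():
        value = values[name]
        results[name] = value < threshold if direction == "below" else value > threshold
    return results

if __name__ == "__main__":
    for name, ok in check_fit(REPORTED, CUTOFFS).items():
        print(f"{name}: {'meets' if ok else 'does not meet'} cutoff")
```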
Regarding the internal consistency of the assessment, all of the reliability values computed through both CTT (α and CR) and Rasch methods (person and item reliabilities) were acceptable, that is, > .70 (DeVellis, 2003). The final analyses yielded the following values: Cronbach’s alpha = 0.784, CR = 0.828, and Rasch person and item reliabilities were 0.76 and 0.94, respectively.
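For readers unfamiliar with the two CTT reliability coefficients, the following Python sketch shows how they are commonly computed. It is not the Stata code used in the study. Cronbach’s alpha follows its standard definition; the composite reliability function assumes standardized factor loadings with error variances equal to 1 minus the squared loading, which is a common form for congeneric measures and may differ from the exact computation used here. The function names and sample inputs are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondent rows of item scores (e.g. 0/1)."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def composite_reliability(loadings):
    """Composite reliability from standardized loadings.

    Assumes each item's error variance is 1 - loading**2 (an assumption of
    this sketch, not a detail stated in the text).
    """
    squared_sum = sum(loadings) ** 2
    error_sum = sum(1 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + error_sum)

if __name__ == "__main__":
    # Hypothetical responses: 4 students, 3 items.
    responses = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
    print(cronbach_alpha(responses))
    print(composite_reliability([0.6, 0.7, 0.5]))
```

Both coefficients would then be compared with the .70 threshold cited above.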