
Eliminating Hiring Bias with the CyberGEN.IQ Cybersecurity Assessment

How the Test Allows You to Easily Meet EEOC Standards

The CyberGEN.IQ cybersecurity aptitude assessment is truly the only one of its kind, developed to help businesses across all industries find the perfect candidate for cybersecurity roles quickly and reliably while reducing the potential for hiring bias. We’ve developed a whitepaper that dives into how and why the assessment was developed for hiring managers, HR departments, professionals, and students, and that walks you through each part of the assessment to demonstrate how it accurately analyzes a candidate’s cognitive abilities.

For the full story, make sure to download your own copy today!

1. Introduction: What is CyberGEN.IQ, and How it Came to Be

The CyberGEN.IQ Assessment is a computer-based aptitude test, developed and used specifically to identify aptitude for cybersecurity work. Originally created in 2015 by the University of Maryland’s Center for the Advanced Study of Language (CASL – now known as the Applied Research Laboratory for Intelligence and Security, or ARLIS), the test was designed around the cyber needs of both industry and military organizations. The test contains no knowledge-based questions; it relies strictly on psychometric tasks and personality-type questionnaires.

The National Initiative for Cybersecurity Education (NICE) has built a framework, the NICE Cybersecurity Workforce Framework (NCWF) [1], which classifies 52 cyber work roles into seven categories and, further, into 33 specialty areas. Each work role is characterized by the tasks that people within it perform and by the knowledge, skills, and abilities (KSAs) required to perform those tasks successfully. The specialty areas and categories are defined by the functions that people within them perform for an organization, such as securely provisioning a network or analyzing information. This functional classification is useful for building a recruitment plan, but functional requirements may not fully determine the cognitive abilities that jobs require; work roles in different categories may be more cognitively similar to a targeted work role than work roles that share a category with it.

From this framework, CASL created a model of the cognitive abilities that are required for particular cybersecurity jobs. As shown in Figure 1, jobs that primarily require real-time cognition are contrasted with those that primarily require exhaustive consideration of options, and jobs that primarily require generating proactive products are contrasted with those that primarily require reacting to external threats or circumstances.

Figure 1: Schematic of the dimensions on which example cyber careers differ. The quadrant names (in bold uppercase font; e.g., ATTACKING) correspond to a major job task that has the characteristics described on its axes (for instance, “defending” requires real-time responding, while “development” requires proactive, exhaustive deliberation). Example job titles, which appear within quadrants, are taken from the NICE framework.

In the test development, they also interviewed a set of cybersecurity professionals to extract more information about their specific roles and requirements [2]. This allowed the team to build a theoretical framework and then make predictions about job roles. Through the interviews, they were able to identify common responses and traits held across a variety of cyber roles. 

Once the model was built, CASL then designed an aptitude test to identify individuals with the specified cognitive abilities. Some tasks were taken from well known psychological studies or tasks, while others were developed and tested internally. A list of the tasks within the test can be found in Figure 2, along with the cognitive construct they fall into.

Figure 2: Constructs measured by the tasks administered in the CyberGEN.IQ Assessment. 

Due to the high demand for cybersecurity professionals within the US military, CASL submitted the test to the Defense Manpower Data Center (DMDC) for official approval. It was approved for military use and continues to be used today to help identify untrained military personnel with the aptitude to perform at a high level in the cyber field.

The CyberGEN.IQ Assessment was adapted from the initial test, with slight variations to more directly target industry professionals. It has now been successfully deployed by top companies across the nation. 

The rest of the paper is organized as follows: Section 2 gives a brief description of the items in the test, Section 3 examines our scoring procedure relative to discrimination laws, Section 4 lays out the US laws relevant to the use of aptitude tests in hiring, Section 5 addresses the Americans with Disabilities Act, Section 6 addresses the Age Discrimination in Employment Act, Section 7 explains the theoretical underpinnings of the test and the rationale for including each task, and Section 8 discusses future work.

2. CyberGEN.IQ Assessment

This section describes the tasks involved in the CyberGEN.IQ Assessment. The content is summarized from the developers’ work; for a more complete description, please see the previous work by Tseng [3], Campbell [4], and O’Rourke [5].

Anomaly Detection Rule-Based. 

The Anomaly Detection Rule-Based (ADR) task measures the construct of anomaly detection, in the Responsive Thinking construct category. Anomaly detection represents the ability to detect information that is anomalous in a larger context, such that it does not conform to the expected pattern. ADR specifically assesses this ability by having examinees explicitly learn the rules that govern a system, and then detect patterns in the system that break these rules. 

Coding Speed. 

The Coding Speed (CS) task measures the construct of pattern recognition and scanning, in the Responsive Thinking construct category. Pattern recognition and scanning represents the ability to scan, detect patterns in, and react quickly to incoming information. 

CS specifically assesses this ability by having examinees quickly form associations between numbers and symbols, and then correctly represent those associations. 

Dynamic Systems Control. 

The Dynamic Systems Control (DSC) task measures the construct of complex problem solving, in the Critical Thinking construct category. Complex problem solving represents the ability to learn and effectively manipulate systems which are complex, opaque, and dynamic [6]. DSC specifically assesses this ability by having examinees learn the rules of a complex, dynamic system, and then use these rules to manipulate the system into a specific state. 

Matrix Reasoning. 

The Matrix Reasoning (MR) task measures the construct of rule induction, in the Critical Thinking construct category. Rule induction ability represents the ability to determine the rules that govern a pattern. 

MR specifically assesses this ability by having examinees select which of eight response-option figures correctly completes a 3×3 grid of figures, based on the rules that govern the pattern of the other figures in the 3×3 grid.

Need for Cognition.

The Need for Cognition (NFC) task is an index of the need for cognition construct, in the Exhaustive Consideration construct category. Need for cognition represents the degree to which individuals enjoy participating in mentally demanding tasks. NFC can be assessed using a well-established survey [7]. 

Need for Cognitive Closure. 

The Need for Cognitive Closure (NFCC) task is an index of the need for cognitive closure construct, in the Exhaustive Consideration construct category. Need for cognitive closure represents the need to arrive at a solution during problem-solving [8], and can be decomposed into five component traits: 1) desire for predictability, 2) preference for order and structure, 3) tolerance for ambiguity, 4) decisiveness, and 5) close-mindedness. NFCC can be assessed using a well-established survey by Roets and Van Hiel [9]. 

Number Picker. 

The Number Picker (NP) task measures the construct of tolerance for risk, in the Exhaustive Consideration construct category. Tolerance for risk represents the likelihood of an individual to be risk-taking or risk-averse, and is a factor known to influence decision making. NP specifically assesses this ability by having examinees select high- or low-risk strategies for amassing points. 

Paper Folding. 

The Paper Folding (PF) task measures the construct of spatial visualization, in the Critical Thinking construct category. Spatial visualization represents the ability to form and manipulate visuospatial representations. PF specifically assesses this ability by having examinees use their spatial visualization ability to determine where holes punched in folded pieces of paper would appear once the paper is unfolded. 

Remember and Count.

The Remember and Count (RAC) task measures the construct of visuospatial working memory, in the Critical Thinking construct category. Visuospatial working memory is the workspace for briefly holding and manipulating visuospatial information [10]. 

RAC specifically assesses this ability by having examinees recall the color, location, and order of a sequence of triangles, with a spatial processing task between the presentation and the recall of the sequence of triangles. 

Recent Probes 1-Shape. 

The Recent Probes 1-shape (RP) task, developed by Sternberg [11], measures the construct of psychomotor speed, in the Real-Time Action construct category. Psychomotor speed represents the ability to respond quickly and to control the speeded motor response in the face of interference. RP specifically assesses this ability by having examinees monitor a sequence of images for a single target. 

Spatial Integration. 

The Spatial Integration (SI) task [12] measures mental model ability (i.e., modeling program execution), in the Proactive Thinking construct category. Mental model ability represents the ability to construct abstract, internal representations of a situation, real or imagined, derived from a narrative or other form of input [13], which provide a basis for inference-making and successful recall of information [14]. SI specifically assesses this ability by having examinees construct a mental model to represent the spatial arrangement of four items, when the spatial relationships of these four items are presented separately.

Statistical Learning. 

The Statistical Learning (SL) task measures a requisite ability for the construct of anomaly detection, in the Responsive Thinking construct category. Anomaly detection requires at least two abilities: the ability to detect patterns in input, and the ability to apply rules based on those patterns. In this case, SL measures an individual’s ability to learn sequences by extracting the transitional probabilities between successive items in a continuous stream during passive viewing. SL specifically assesses this ability by having examinees view a sequence of images, and then perform a recognition test for the high probability sub-sequences. 
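As a concrete illustration of what “transitional probabilities between successive items” means, the short sketch below computes them for a made-up stream of symbols. It is purely illustrative and is not part of the assessment or its scoring; the item names and example sequence are hypothetical.

```python
# Illustrative sketch only (not the assessment's implementation): compute the
# transitional probability P(next item | current item) from a continuous stream.
from collections import Counter, defaultdict

def transitional_probabilities(stream):
    """Return P(next item | current item) for each adjacent pair in the stream."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(stream, stream[1:]):
        pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }

# Hypothetical stream in which "A" is always followed by "B" (the kind of
# high-probability sub-sequence examinees are later asked to recognize).
stream = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print(transitional_probabilities(stream))  # e.g., P(B | A) = 1.0
```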

Vigilance Task. 

The Vigilance Task (VT) measures the construct of vigilance, which is in the Responsive Thinking construct category. Vigilance represents the ability to remain vigilant or sustain attention during a task that occurs over a prolonged period of time. VT specifically assesses vigilance by having examinees monitor a sequence of rapidly presented symbols nested in visually complex backgrounds for specific, low probability targets. 

The CyberGEN.IQ Assessment is unlikely to contain any sensitive material, as many of the task items are colors, graphs, letters, numbers, shapes, or symbols. However, the task items in SI are images of common objects, which can be reviewed for potential sensitivity issues. Also, NFC is a scale taken from Cacioppo, Petty, and Kao [7] and NFCC is a scale taken from Roets and Van Hiel [9], which can also be reviewed for potential sensitivity issues. It should also be noted that the instructions and the stimuli for NP were designed to avoid any mention of gambling, in case an examinee has religious objections to gambling. 

3. Scoring Procedures 

For a full description of the scoring process, including calculations and methods, please see Tseng et al. [3].

Title VII imposes restrictions on how to score candidate tests such as the CyberGEN.IQ Assessment: “Employers are not permitted to (1) adjust the scores of, (2) use different cutoff scores for, or (3) otherwise alter the results of employment-related tests on the basis of race, color, religion, sex, or national origin.” (Id. at 2000e-2(l)).

It is important to note here that Haystack Solutions is not the employer and does not make any hiring decisions. We do not set cutoff scores; we simply provide a platform and performance data to the employer. Haystack does, however, play a vital role in ensuring that the data the employer receives is not biased or discriminatory.

To that end, we do not adjust or alter scores in any way based on race, color, religion, sex, or national origin. Our scoring algorithms see only the raw candidate responses to the aptitude tasks and report only on those. There are no instances in which an individual’s demographic information factors into their scores.

4. Formal Statement of Laws

The United States Equal Employment Opportunity Commission (EEOC), created in 1965, is “responsible for enforcing federal laws that make it illegal to discriminate against a job applicant or an employee because of the person’s race, color, religion, sex (including pregnancy, transgender status, and sexual orientation), national origin, age (40 or older), disability or genetic information” [15]. The EEOC website gives guidance and official guidelines specifically for employment testing and selection procedures [16]. It is therefore vital to provide evidence showing that the CyberGEN.IQ testing procedure acknowledges these guidelines and actively works to ensure it does not violate any discrimination laws.

Title VII of the Civil Rights Act of 1964 prohibits employment discrimination based on race, color, religion, sex, or national origin. Title VII permits employment and candidate testing, so long as the tests are not “designed, intended or used to discriminate because of race, color, religion, sex or national origin” (42 U.S.C. 2000e-2(h)).

While the test itself must be shown not to discriminate unjustly, the scoring procedures must also be applied equally: “Employers are not permitted to (1) adjust the scores of, (2) use different cutoff scores for, or (3) otherwise alter the results of employment-related tests on the basis of race, color, religion, sex, or national origin.” (Id. at 2000e-2(l)).

Also of note is Title I of the Americans with Disabilities Act (ADA), which “prohibits private employers and state and local governments from discriminating against qualified individuals with disabilities on the basis of their disabilities.”

The Age Discrimination in Employment Act (ADEA) “prohibits discrimination based on age (40 and over) with respect to any term, condition, or privilege of employment… The ADEA also prohibits employers from using neutral tests or selection procedures that have a discriminatory impact on persons based on age (40 or older), unless the challenged employment action is based on a reasonable factor other than age.”

The EEOC has adopted the Uniform Guidelines on Employee Selection Procedures (UGESP) under Title VII to provide guidance for employers in designing and using employee testing. The rest of the sections in this paper are dedicated to identifying each of these issues and providing evidence for methods used to validate the CyberGEN.IQ Assessment. 

5. Americans with Disabilities Act

The CyberGEN.IQ Assessment has been created with the goal of maximizing accessibility, whether across language, culture, prior work experience, or disability. While the test is administered via a computer and requires clicking and typing, all examinees are applying for, or interested in, roles within computer science. Our testing, therefore, is “job-related and consistent with business necessity” [16].

Extensive thought and research went into making the test as inclusive as possible. While testing did not focus specifically on individuals with disabilities, we have seen no indication that the test reflects or measures an individual’s impairment. We will continue to monitor user performance and to expand the disability-related options examinees can self-select.

6. Age Discrimination in Employment Act

As with the ADA, Haystack Solutions has assessed the CyberGEN.IQ test against the Age Discrimination in Employment Act (ADEA). There is literature showing correlations between age and some of the features being tested, such as Risk Aversion [17] and Need for Cognition [18]. However, we have also seen correlations between these features and course performance, and there is research supporting a relationship between these features and job performance [19] [2].

Each job role also requires a different skillset, so some roles rely more heavily on certain areas of the assessment. For example, a defensive cyber operative may need to be more risk-averse than an offensive one. These nuances can be learned both quantitatively (through course performance) and qualitatively (through subject matter expert knowledge).

7. Guidelines 

Title VII prohibits intentional discrimination based on race, color, religion, sex, or national origin, and covers both “disparate treatment” and “disparate impact” discrimination [16].

Since all participants are given the same test, disparate treatment is not applicable. Much research has been done to identify possible sources of disparate impact and to limit their effects.

The tasks in CyberGEN.IQ have been approved by the Defense Manpower Data Center (DMDC) for use within the US Military. As part of this approval process, the ARLIS team completed the DMDC Checklist [3]. This work identifies areas of known or expected subgroup differences, as well as data from two main studies – one during test development with university students and one during field testing with the US Air Force. 

7.1 Expected Subgroup Differences

All tasks in the CyberGEN.IQ Assessment are computer-administered, so no differences across administration modes are expected. No differences in scores across subgroups are expected to be observed for ADR, CS, DSC, NFCC, RP, and SL. The remainder of this section discusses possible differences in scores across subgroups that may be observed for the other tasks in the assessment. 

In MR, the figures can vary on one or more of the following dimensions: number, rotation, shading, shape, and size. Therefore, differences in computer monitor settings (such as brightness and screen resolution) may affect an examinee’s score on the task. If computer monitor settings differ across subgroups (such as gender, race, or ethnicity), systematic differences in MR scores across these subgroups may be observed. There is some evidence that matrix reasoning tasks tend to favor males over females to a small degree (e.g., [20] [21] [22]), but the finding is debated [23] [24] [25]. One study [26] found that racial differences on matrix reasoning existed under conditions that emphasized mental ability, but not under low-stakes conditions; this may be due to stereotype threat [27] rather than being specific to the task. Finally, there is evidence that nonverbal reasoning abilities tend to decrease with age [28].

Based on prior research, it is expected that certain subgroups will perform differently on NFC. For example, women have a tendency to score higher than men in previous research [29], and there may be a significant correlation between scores on NFC and educational attainment. 

The Even Gamble measure of NP, a measure of risk, may show higher scores for males than females. It is established in the literature that males have a greater risk tolerance than females [30] [31].

Given that PF indexes spatial visualization ability, it is possible that male examinees will outperform females, as there is an established advantage for males in spatial processing [32].

In RAC, examinees recall the sequence of triangles presented in the Remember portion by indicating the color, the location, and the order of each triangle in the sequence. Further, examinees indicate the number of dark blue circles in the Count portion. Therefore, color vision deficiency may decrease an examinee’s score on the task, and differences in computer monitor settings (such as brightness and screen resolution) may affect an examinee’s score on the task. The sequence of triangles is presented rapidly; therefore, differences in computer hardware may affect an examinee’s score on the task. If color vision deficiency, computer monitor settings, or computer hardware differ across subgroups (such as gender, race, or ethnicity), systematic differences in RAC scores across these subgroups may be observed.

Given that SI indexes the ability to create spatial mental models, it is possible that male examinees will outperform females, given the established advantage for males in spatial processing [32]. While color does not play a systematic role in the task stimuli, the task stimuli do contain bright colors. It is possible, therefore, that individuals with color vision deficiency will perform differently than their normal-vision peers due to reduced availability of color-based cues.

In VT, examinees monitor a stream of letters (O, D, and backward D), responding only to the target letter (O). The letters are nested in a visually complex background. Therefore, differences in computer monitor settings (such as brightness and screen resolution) may affect an examinee’s score on the task. The letters are presented rapidly; therefore, differences in computer hardware may affect an examinee’s score on the task. If computer monitor settings or computer hardware differ across subgroups (such as gender, race, or ethnicity), systematic differences in VT scores across these subgroups may be observed. Colorblind individuals with certain kinds of color deficiency tend to have better pattern detection for non-colored stimuli, so it is possible that they will perform better than their normal-vision peers [33].

7.2 Data 

Coefficients of the standard error of measurement (SEM) were calculated for the female and male subgroups in the USAF group (see Figure 3). Cronbach’s alphas, odd-even split-half correlations, and the weighted sum of deviation scores were used to estimate the SEM. During test development, the only options for gender were male or female; in future versions, we will use more inclusive language.
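For readers who want a concrete sense of how a reliability estimate translates into an SEM, here is a minimal sketch using the standard formula SEM = SD × sqrt(1 − reliability) with Cronbach’s alpha and simulated data. It is not the full procedure described in Tseng et al. [3] (which also uses odd-even split-half correlations and weighted deviation scores); the data and the number of items are made up.

```python
# Minimal illustrative sketch, not the CyberGEN.IQ scoring code: estimate
# Cronbach's alpha for a task and convert it to a standard error of
# measurement via SEM = SD * sqrt(1 - reliability). Data are simulated.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: examinees x items array of item-level scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def sem(total_scores, reliability):
    """Standard error of measurement from observed-score SD and reliability."""
    return np.std(total_scores, ddof=1) * np.sqrt(1 - reliability)

rng = np.random.default_rng(0)
# 50 simulated examinees x 10 items; a shared person factor makes items correlate.
items = rng.normal(size=(50, 10)) + rng.normal(size=(50, 1))
alpha = cronbach_alpha(items)
print(round(alpha, 3), round(sem(items.sum(axis=1), alpha), 3))
```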

There are no large differences in SEM coefficients between the female and male subgroups. The somewhat larger difference in SEM coefficients between the female and male subgroups for RP is most likely due to an outlier in the male subgroup (i.e., an examinee who had a substantially higher score than the other examinees). Coefficients of the SEM were also calculated for the not Hispanic/Latino and Hispanic/Latino subgroups in the USAF group (see Figure 4); again, Cronbach’s alphas, odd-even split-half correlations, and the weighted sum of deviation scores were used to estimate the SEM.

Figure 3: Standard error of measurement (SEM) estimates for the USAF study by female and male subgroups

Figure 4: Standard error of measurement (SEM) estimates for the USAF study by Hispanic/Latino and not Hispanic/Latino subgroups

Figure 5: Differences in test functioning for the ITF group by female and male subgroups. Note: p < .10, * p < .05, ** p < .01, *** p < .001.

To determine whether a given task had a normal or non-normal distribution of scores, the skew of the sample’s score distribution was divided by the standard error of the skewness statistic. If the result had an absolute value less than 2, the scores were considered normally distributed and the subgroups were compared using a t-test. If the result had an absolute value greater than 2, the scores were considered non-normally distributed and the subgroups were compared using a Wilcoxon signed-rank test.
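To make the decision rule above concrete, the following sketch applies it to simulated subgroup scores. It assumes the Wilcoxon comparison of two independent subgroups is the rank-sum (Mann-Whitney) form of the test; the subgroup data, sample sizes, and the exact skewness standard-error formula shown are illustrative rather than taken from the original analysis.

```python
# Illustrative sketch of the test-selection rule described above; data are simulated.
import numpy as np
from scipy import stats

def skew_standard_error(n):
    # Standard error of the sample skewness statistic for n observations.
    return np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def compare_subgroups(group_a, group_b):
    scores = np.concatenate([group_a, group_b])
    standardized_skew = stats.skew(scores) / skew_standard_error(len(scores))
    if abs(standardized_skew) < 2:
        # Scores treated as normally distributed: independent-samples t-test.
        return "t-test", stats.ttest_ind(group_a, group_b)
    # Non-normal scores: rank-based comparison (assumed Wilcoxon rank-sum / Mann-Whitney).
    return "rank test", stats.mannwhitneyu(group_a, group_b)

rng = np.random.default_rng(1)
female = rng.normal(0.30, 0.10, size=140)  # hypothetical subgroup scores
male = rng.normal(0.41, 0.10, size=150)
print(compare_subgroups(female, male))
```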

There are no large differences in SEM coefficients between the not Hispanic/Latino and Hispanic/Latino subgroups. The somewhat larger difference in SEM coefficients between the not Hispanic/Latino and Hispanic/Latino subgroups for RP is most likely due to an outlier in the not Hispanic/Latino subgroup (i.e., an examinee who had a substantially higher score than the other examinees). 

Differences in test functioning between the female and male subgroups were statistically significant for RAC and RP, and were marginally significant for DSC. For RAC, the male subgroup had significantly higher scores (M = .406) than the female subgroup (M = .304), t(286) = -3.256, p = .001. For RP, the male subgroup also had significantly higher scores (M = 82.733) than the female subgroup (M = 74.657), W = 5086.0, p = .049. The RP difference may be due to a single exceptionally high score in the male subgroup. For DSC, the male subgroup had marginally higher scores (M = .337) than the female subgroup (M = .242), W = 5217.0, p = .075. No other differences in test functioning between the female and male subgroups were observed. 

Figure 6: Differences in test functioning for the ITF group by not Hispanic/Latino and Hispanic/Latino subgroups. Note: p < .10, * p < .05, ** p < .01, *** p < .001. The same distribution check and test-selection procedure described above for Figure 5 was applied here.

Differences in test functioning between the Hispanic/Latino and not Hispanic/Latino subgroups were marginally significant for NP and SL. For NP, the Hispanic/Latino subgroup had marginally higher scores (M = .429) than the not Hispanic/Latino subgroup (M = .376), W = 4847.0, p = .069. For SL, the not Hispanic/Latino subgroup had marginally higher scores (M = .636) than the Hispanic/Latino subgroup (M = .615), t(283) = 1.883, p = .061. No other differences in test functioning between the Hispanic/Latino and not Hispanic/Latino subgroups were observed.

7.3 Rationale for Inclusion 

Anomaly Detection Rule-Based. 

The Anomaly Detection Rule-Based task was developed to measure the ability to learn and apply a set of given rules. This is one component of the overall “anomaly detection” ability hypothesized to be part of success in the cybersecurity field [19]. While overall ability to detect anomalies is analogous to cybersecurity tasks such as identifying signs of a compromised system, detecting events that deviate from normal behavior requires first learning what is normal behavior, or learning the patterns and rules of a system. The Anomaly Detection Rule-Based task targets the pattern-application component of anomaly detection. 

Coding Speed. 

The Coding Speed task measures the ability to recognize and scan for patterns. This ability may be associated with fluid intelligence [34] [35], memory [36], and processing speed [37] and is hypothesized to relate to success in cybersecurity fields because these professionals need to recognize and react to threats quickly [19]. 

The Coding Speed task displays a row of symbols and corresponding numbers and asks participants to press the appropriate number for each symbol, similarly to early paper-and-pencil versions of Coding Speed. Scores are based on both speed and accuracy. 

Dynamic Systems Control. 

The Dynamic Systems Control task is designed to measure complex problem-solving ability, which is the ability to gain and demonstrate an understanding of systems that are complex, opaque, and dynamic [6]. The task was adapted from the MicroDYN task used in research on education and cognitive psychology (see Schweizer, Wüstenberg, and Greiff [38], for an example). It includes both rule identification and rule application components, assessed through an exploration phase and a rule application (‘control’) phase for each item. The ability to discover and then effectively apply rules to control an abstract system should indicate an ability to learn complex systems and react to unique challenges in a cyber context [19]. 

Matrix Reasoning. 

The Matrix Reasoning task was developed by CASL as a measure of reasoning ability, modeled after existing matrix reasoning type measures, such as the Raven’s Progressive Matrices [39] test. The ability to inductively learn rules may relate to success in the cybersecurity field because professionals may need to determine the rules that a system follows without receiving prior documentation on the system [19]. Additionally, rule induction has been found to predict programming abilities with novice learners [40]. 

Need for Cognition. 

Individuals who tend to enjoy and participate in cognitively strenuous activities are said to have a high Need for Cognition. Cacioppo, Petty, and Kao [7] created a survey specifically assessing this trait, and their survey continues to be used to predict how people respond to various tasks and information. 

The Need for Cognition survey [7] is an 18-item survey consisting of statements that respondents can agree or disagree with to varying degrees. This survey measures a trait that may be related to greater fluid intelligence [41] [42], which is defined as the ability to solve new problems by inductive and deductive reasoning. In fact, a higher Need for Cognition in conjunction with strong performance on working memory tasks such as Operation Span and Matrix Reasoning can better explain fluid intelligence than cognitive constructs alone [43].

Those who score high on Need for Cognition are also more adept at problem-solving [44] [45] [46] and decision making [47].

Need for Cognitive Closure.

The Need for Cognitive Closure scale measures individual differences in need for cognitive closure, which is a complex trait related to wanting to have answers to questions [8]. This need can be measured by five related subconstructs: desire for predictability, preference for order and structure, tolerance for ambiguity, decisiveness, and close-mindedness. According to Campbell, O’Rourke, and Bunting [48], the ability of an individual to persist in difficult information search tasks can be linked to tolerating a lack of cognitive closure. Thus, an individual who may score low on the NFCC scale may perform well in intensive search tasks in cybersecurity roles. 

Number Picker (NP). 

The Number Picker task is a measure of tolerance of risk. This construct was included in the USAF CATA, because weighing and responding to risk is a factor that is important for a variety of job tasks, including offensive and defensive actions [19]. This specific task was developed to assess whether test-takers could weigh the evidence and make decisions under risk conditions. 

Paper Folding (PF). 

Spatial visualization has been found to be a predictor of performance in virtual information and interface navigation [49] [50], a predictor of speed in hierarchical information retrieval system searches [51], an indicator of success in computer-based programming, and linked to long-term success in science, technology, engineering, and mathematics [52]. Visuospatial abilities may be linked to cyber success in many ways; for example, being able to think through many possibilities can help when writing a complicated program or plotting out courses of action. This construct may be relevant for success both in cybersecurity training and on the job [19].

The Paper Folding test developed by CASL (PF-CASL) uses new shapes and folding patterns to assess spatial visualization and the manipulation of objects. These items were developed after analyzing shape congruency, fold directionality and obfuscation, and positioning of punched holes in previous paper folding tasks, including Thurstone’s Punched Holes task and the Paper Folding task developed for the ETS Kit of Factor-Referenced Tests [53]. Results from previous studies ([48], [54]) on which item characteristics affected item difficulty were consulted during item development.

Recent Probes 1-Shape (RP). 

The Recent Probes task, developed by Sternberg [11], assesses working memory, particularly the susceptibility to proactive interference (i.e., when prior learning impairs current processing). Prime and probe images are presented with a delay in between, and participants are asked to decide whether the images match or not.

The task is designed specifically to measure the ability to resist proactive interference [55] [56], since participants may have difficulty rejecting incorrect probes that were recently presented as primes (i.e., in the item immediately before). In the case of the single-object prime and probe task (where the object may be a word or a figure, for example), participants may be less prone to interference, since fewer objects are presented as primes at the same time. Cowan and Saults [57] found that individuals with high working memory capacities showed speed advantages when presented with fewer objects as primes (3-4 items) compared to 6 or 8 object primes, and there were no differences between the high and low working memory capacity groups with the more challenging 6 or 8 object primes. These findings suggest that a single-object task, like the Recent Probes 1-shape task, would be a better indicator of working memory capacity and speed than of the ability to resist proactive interference.

Remember and Count. 

Remember and Count was developed as a measure of visuospatial working memory, the workspace for briefly holding and manipulating information from the spatial domain [58]. As detailed by Campbell and colleagues [19], working memory is hypothesized to relate to success in cyber training and job performance because it has been shown to relate to a variety of ability measures [59], to learning logic rules [60], and to learning computer programming specifically [61]. 

Spatial Integration. 

The Spatial Integration task [12] measures mental model ability (i.e., modeling program execution) in the Proactive Thinking construct category. Mental model ability is the ability to construct abstract, internal representations of a situation, real or imagined, derived from a narrative or other form of input [62], which provide a basis for inference-making and successful recall of information [63]. The Spatial Integration task requires examinees to construct a mental model to represent the spatial arrangement of four items, when their spatial relationships are presented separately.

Statistical Learning.

Given that it is necessary to be aware of patterns and regularities in data in order to detect anomalous items, statistical learning (i.e., the ability to detect patterns in stimuli without awareness) is a key ability underpinning anomaly detection. The ability to discern patterns in a data stream is likely to be a key responsive thinking skill for cyber operators [19]. The goal of the statistical learning task is, therefore, to assess an individual’s ability to learn sequences by extracting the transitional probabilities between successive items in a continuous stream during passive viewing. 

Vigilance Task.

The Vigilance task measures an individual’s ability to remain vigilant, or sustain attention, during a task that occurs over a prolonged period of time. It was hypothesized that some cybersecurity fields require individuals to monitor information for long periods of time, often with targets occurring very rarely [19].

8. Future Work 

There are four main areas of future work planned for the continued monitoring and improvement of the CyberGEN.IQ Assessment: literature reviews, test development, constant test monitoring, and testing more under-represented groups. 

8.1 Literature Reviews 

While much research went into the initial design of the CyberGEN.IQ, the psychological and psychometric fields are constantly changing and new research is continually published. At Haystack, we partner with the creators of the test to keep us all up to date on the most recent developments in the field. Also of vital importance is the ability to identify changes in the cybersecurity field itself: as job demands change, we may need to adapt the test to fit new requirements.

8.2 Test Development 

Through updates in science and research findings, we plan to continue enhancing the test in a variety of ways. If we find that some tasks are displaying signs of disparate impact, we will look to identify alternative ways to test those constructs. We also plan to introduce a pre-testing procedure, in which we add new questions to the existing tasks. This will allow us to perform more specific item discrimination analyses and to add or remove sections of the test.

8.3 Constant Test Monitoring 

A large part of the disparate impact guidelines involves analyzing course performance to identify any areas of discrimination. As part of the DMDC checklist, a large analysis was completed, looking at many key metrics such as SEM and test reliability across groups. We are in the process of building an automated pipeline to manage these tasks, allowing us to access the most up-to-date data at any time. This will also allow us to catch issues as they happen, rather than through retrospective analyses.
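As a rough sketch of what such monitoring could look like, the snippet below recomputes a reliability-based SEM per subgroup for a single task and flags large gaps for human review. This is one possible design, not a description of the actual pipeline; the subgroup labels, reliabilities, scores, and flagging threshold are all hypothetical.

```python
# Hypothetical monitoring sketch, not Haystack's production pipeline: recompute a
# reliability-based SEM per subgroup for one task and flag large gaps for review.
import numpy as np

def sem_from_scores(scores, reliability):
    return np.std(scores, ddof=1) * np.sqrt(1 - reliability)

def flag_sem_gap(scores_by_subgroup, reliability_by_subgroup, threshold=0.05):
    """scores_by_subgroup: dict of subgroup label -> array of task scores."""
    sems = {group: sem_from_scores(scores, reliability_by_subgroup[group])
            for group, scores in scores_by_subgroup.items()}
    gap = max(sems.values()) - min(sems.values())
    return sems, gap > threshold  # True means "flag this task for review"

rng = np.random.default_rng(2)
scores = {"subgroup_a": rng.normal(0.50, 0.10, 200),
          "subgroup_b": rng.normal(0.50, 0.12, 180)}
reliability = {"subgroup_a": 0.85, "subgroup_b": 0.83}
print(flag_sem_gap(scores, reliability))
```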

8.4 Under-Represented Groups

Many test-takers in the initial study were male, under 30 years old, native English speakers, and white. We plan to test a more diverse group of individuals so we can present more and better data on subgroup differences (if any exist).