Criterion validity is the most powerful way to establish a pre-employment test's validity. Also called concrete validity, criterion validity refers to a test's correlation with a concrete outcome. In the case of pre-employment tests, the two variables being compared most frequently are test scores and a particular business metric, such as employee performance or retention rates.
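To make this concrete, here is a minimal sketch in Python of such a comparison: it correlates test scores with a performance metric for a handful of employees. All names and numbers are invented for illustration, and the statistic it computes is exactly the correlation coefficient discussed next.

```python
# Hypothetical criterion-validity check: correlate employees' test scores
# with a measure of their job performance. All data are invented.
import numpy as np
from scipy.stats import pearsonr

test_scores = np.array([62, 71, 55, 80, 68, 74, 59, 85, 66, 77, 70, 63])
performance = np.array([58, 69, 50, 83, 64, 70, 61, 88, 60, 79, 72, 65])

r, p_value = pearsonr(test_scores, performance)
print(f"r = {r:.2f}, p = {p_value:.3f}")
# The closer r is to +1.0, the stronger the evidence that test scores
# track the business metric. Note the tiny sample: as discussed below,
# stable criterion-validity estimates require much larger samples.
```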
The relationship between test performance and a business metric can be quantified by a correlation coefficient (ranging from -1.0 to +1.0), which indicates how strongly two variables are related: the closer the coefficient is to -1.0 or +1.0, the stronger the relationship. The more correlated the two variables are, the more predictive validity the test has. In the case of pre-employment testing, the more correlated test scores are with job performance, the more likely the test is to predict future job performance. As with most correlational evidence, criterion validity can only be established with large sample sizes, making it somewhat challenging to measure.

There are two main types of criterion validity: concurrent validity and predictive validity. Concurrent validity is determined by comparing test scores of current employees to a measure of their job performance. Comparing test scores with current performance ratings demonstrates how correlated the test is for current employees in a particular position. For example, a company could administer a sales personality test to its sales staff to see if there is an overall correlation between their test scores and a measure of their productivity. Predictive validity, by contrast, is determined by seeing how well test scores predict future job performance. If an employer's selection testing program is truly job-related, it follows that the results of its selection tests should accurately predict job performance. In other words, there should be a positive correlation between test scores and future job performance. Determining predictive validity is a long-term process that involves testing job candidates and then comparing their test scores to a measure of their job performance after they have occupied their positions for a long period of time.

In quantitative research, you have to consider the reliability and validity of your methods and measurements. Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity: construct validity, content validity, face validity, and criterion validity.
Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity, which deal with the experimental design and the generalizability of results.

Construct validity
Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It's central to establishing the overall validity of a method.

What is a construct?
A construct refers to a concept or characteristic that can't be directly observed, but can be measured by observing other indicators that are associated with it. Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organizations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.
There is no objective, observable entity called "depression" that we can measure directly. But based on existing psychological research and theory, we can measure depression based on a collection of symptoms and indicators, such as low self-confidence and low energy levels.

What is construct validity?
Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent's mood, self-esteem, or some other construct? To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression. The other types of validity described below can all be considered as forms of evidence for construct validity.

Content validity
Content validity assesses whether a test is representative of all aspects of the construct. To produce valid results, the content of a test, survey or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened.
A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every form of algebra that was taught in the class. If some types of algebra are left out, then the results may not be an accurate indication of students' understanding of the subject. Similarly, if she includes questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.

Face validity
Face validity considers how suitable the content of a test seems to be on the surface. It's similar to content validity, but face validity is a more informal and subjective assessment.
You create a survey to measure the regularity of people's dietary habits. You review the survey items, which ask questions about every meal of the day and snacks eaten in between for every day of the week. On its surface, the survey seems like a good representation of what you want to test, so you consider it to have high face validity. As face validity is a subjective measure, it's often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity
Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?
A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a "gold standard" measurement. Criterion variables can be very difficult to find.

What is criterion validity?
To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.
A university professor creates a new test to measure applicants’ English writing ability. To assess how well the test really does measure students’ writing ability, she finds an existing test that is considered a valid measurement of English writing ability, and compares the results when the same group of students take both tests. If the outcomes are very similar, the new test has high criterion validity.
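Assuming both tests yield numeric scores for the same group of students, the comparison in this example boils down to a single correlation. A minimal sketch, with invented scores:

```python
# The same eight students take the new writing test and an established
# ("gold standard") test; correlate the two score sets. Scores invented.
import numpy as np

new_test = [72, 65, 88, 54, 79, 91, 60, 83]
established_test = [70, 68, 85, 50, 81, 94, 58, 80]

r = np.corrcoef(new_test, established_test)[0, 1]
print(f"criterion validity estimate: r = {r:.2f}")
# A high positive r suggests the new test measures the same ability
# as the established criterion measure.
```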
Test reliability and validity are two technical properties of a test that indicate its quality and usefulness. These are the two most important features of a test, and you should examine them when evaluating the suitability of a test for your use. This chapter provides a simplified explanation of these two complex ideas. These explanations will help you understand the reliability and validity information reported in test manuals and reviews, and use that information to evaluate a test's suitability for your purposes.

Chapter Highlights
Principles of Assessment Discussed
- Use only reliable assessment instruments and procedures.
- Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.
- Use assessment tools that are appropriate for the target population.

What makes a good test?
An employment test is considered "good" if the following can be said about it:
Test reliability
Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the test again, will he or she get a similar test score, or a much different score? A test that yields similar scores for a person who repeats the test is said to measure a characteristic reliably.

How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the following:
- The test taker's temporary psychological or physical state, such as anxiety, illness, or fatigue
- Environmental factors, such as noise, temperature, or lighting
- Differences between alternate forms of the test
- Subjectivity in scoring, when scores are assigned by different raters
These factors are sources of chance or random measurement error in the assessment process. If there were no random errors of measurement, the individual would get the same test score, the individual's "true" score, each time. The degree to which test scores are unaffected by measurement errors is an indication of the reliability of the test. Reliable assessment tools produce dependable, repeatable, and consistent information about people. In order to meaningfully interpret test scores and make useful employment or career-related decisions, you need reliable tools. This brings us to the next principle of assessment.

Principle of Assessment: Use only reliable assessment instruments and procedures. In other words, use only assessment tools that provide dependable and consistent information.

Interpretation of reliability information from test manuals and reviews
Test manuals and independent reviews of tests provide information on test reliability. The following discussion will help you interpret the reliability information about any test. The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r," and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00 indicating perfect reliability. Do not expect to find a test with perfect reliability. Generally, you will see the reliability of a test as a decimal, for example, r = .80 or r = .93. The larger the reliability coefficient, the more repeatable or reliable the test scores. Table 1 serves as a general guideline for interpreting test reliability. However, do not select or reject a test solely based on the size of its reliability coefficient. To evaluate a test's reliability, you should consider the type of test, the type of reliability estimate reported, and the context in which the test will be used.

Table 1. General Guidelines for Interpreting Test Reliability
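As a concrete illustration of where such a coefficient comes from, here is a minimal test-retest sketch, assuming the same eight people take the test twice; all scores are invented. The reliability coefficient is simply the correlation between the two administrations.

```python
# Minimal test-retest reliability sketch (hypothetical scores).
import numpy as np

first_administration = [88, 75, 92, 64, 81, 70, 95, 59]
second_administration = [85, 78, 90, 66, 79, 73, 93, 62]

# r near 1.00 means scores are highly repeatable; r near 0 means
# the test gives inconsistent results from one sitting to the next.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")
```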
Types of reliability estimates
There are several types of reliability estimates, each influenced by different sources of measurement error. Test developers have the responsibility of reporting the reliability estimates that are relevant for a particular test. Before deciding to use a test, read the test manual and any independent reviews to determine if its reliability is acceptable. The acceptable level of reliability will differ depending on the type of test and the reliability estimate used.

The discussion in Table 2 should help you develop some familiarity with the different kinds of reliability estimates reported in test manuals and reviews.

Table 2. Types of Reliability Estimates
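One estimate commonly reported in test manuals is internal consistency, often summarized by Cronbach's alpha. Below is a hedged sketch of the standard textbook formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the item responses are invented.

```python
# Sketch of an internal consistency estimate (Cronbach's alpha).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows are respondents, columns are test items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Invented responses: 5 respondents answering 4 items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```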
Standard error of measurement
Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error that you should expect in an individual test score because of the imperfect reliability of the test. The SEM represents the degree of confidence that a person's "true" score lies within a particular range of scores. For example, an SEM of 2 indicates that a test taker's "true" score probably lies within 2 points in either direction of the score he or she receives on the test. This means that if an individual receives a 91 on the test, there is a good chance that the person's "true" score lies somewhere between 89 and 93.

The SEM is a useful measure of the accuracy of individual test scores. The smaller the SEM, the more accurate the measurements. When evaluating the reliability coefficients of a test, it is important to review the explanations provided in the manual.
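The SEM is related to the reliability coefficient by a standard psychometric formula, SEM = SD * sqrt(1 - r), where SD is the standard deviation of the test scores and r is the reliability coefficient. A quick sketch with hypothetical values reproduces the example above:

```python
# SEM = SD * sqrt(1 - r): hypothetical values chosen to reproduce
# the SEM of 2 used in the example above.
import math

sd = 10.0  # standard deviation of test scores (hypothetical)
r = 0.96   # reliability coefficient (hypothetical)

sem = sd * math.sqrt(1 - r)  # -> 2.0
score = 91
print(f"SEM = {sem:.1f}")
print(f"likely 'true' score band: {score - sem:.0f} to {score + sem:.0f}")  # 89 to 93
```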
Test validity
Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.
Principle of Assessment: Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.

A test's validity is established in reference to a specific purpose; the test may not be valid for different purposes. For example, the test you use to make valid predictions about someone's technical proficiency on the job may not be valid for predicting his or her leadership skills or absenteeism rate. Similarly, a test's validity is established in reference to specific groups. These groups are called the reference groups. The test may not be valid for different groups. For example, a test designed to predict the performance of managers in situations requiring problem solving may not allow you to make valid or meaningful predictions about the performance of clerical employees. If, for example, the kind of problem-solving ability required for the two positions is different, or the reading level of the test is not suitable for clerical applicants, the test results may be valid for managers, but not for clerical employees. Test developers have the responsibility of describing the reference groups used to develop the test. The manual should describe the groups for whom the test is valid, and the interpretation of scores for individuals belonging to each of these groups. You must determine if the test can be used appropriately with the particular type of people you want to test. This group of people is called your target population or target group. This leads to the next principle of assessment.
Principle of Assessment: Use assessment tools that are appropriate for the target population.

Your target group and the reference group do not have to match on all factors; they must be sufficiently similar so that the test will yield meaningful scores for your group. For example, a writing ability test developed for use with college seniors may be appropriate for measuring the writing ability of white-collar professionals or managers, even though these groups do not have identical characteristics. In determining the appropriateness of a test for your target groups, consider factors such as occupation, reading level, cultural differences, and language barriers.

Recall that the Uniform Guidelines require assessment tools to have adequate supporting evidence for the conclusions you reach with them in the event adverse impact occurs. A valid personnel tool is one that measures an important characteristic of the job you are interested in. Use of valid tools will, on average, enable you to make better employment-related decisions. Both from business-efficiency and legal viewpoints, it is essential to use only tests that are valid for your intended use. In order to be certain an employment test is useful and valid, evidence must be collected relating the test to a job. The process of establishing the job relatedness of a test is called validation.

Methods for conducting validation studies
The Uniform Guidelines discuss the following three methods of conducting validation studies:
- Criterion-related validation
- Content validation
- Construct validation
The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.
Professionally developed tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted. If you develop your own tests or procedures, you will need to conduct your own validation studies. As the test user, you have the ultimate responsibility for making sure that validity evidence exists for the conclusions you reach using the tests. This applies to all tests and procedures you use, whether they have been bought off-the-shelf, developed externally, or developed in-house.

Validity evidence is especially critical for tests that have adverse impact. When a test has adverse impact, the Uniform Guidelines require that validity evidence for that specific employment decision be provided. The particular job for which a test is selected should be very similar to the job for which the test was originally developed. Determining the degree of similarity will require a job analysis. Job analysis is a systematic process used to identify the tasks, duties, responsibilities and working conditions associated with a job and the knowledge, skills, abilities, and other characteristics required to perform that job. Job analysis information may be gathered by direct observation of people currently in the job, interviews with experienced supervisors and job incumbents, questionnaires, personnel and equipment records, and work manuals. In order to meet the requirements of the Uniform Guidelines, it is advisable that the job analysis be conducted by a qualified professional, for example, an industrial and organizational psychologist or other professional well trained in job analysis techniques. Job analysis information is central in deciding what to test for and which tests to use.

Using validity evidence from outside studies
Conducting your own validation study is expensive, and, in many cases, you may not have enough employees in a relevant job category to make it feasible to conduct a study. Therefore, you may find it advantageous to use professionally developed assessment tools and procedures for which documentation on validity already exists. However, care must be taken to make sure that validity evidence obtained for an "outside" test study can be suitably "transported" to your particular situation. The Uniform Guidelines, the Standards, and the SIOP Principles state that evidence of transportability is required. Consider the following when using outside tests:
How to interpret validity information from test manuals and independent reviews
To determine if a particular test is valid for your intended use, consult the test manual and available independent reviews. (Chapter 5 offers sources for test reviews.) The information below can help you interpret the validity evidence reported in these publications.

In evaluating validity information, it is important to determine whether the test can be used in the specific way you intended, and whether your target group is similar to the test reference group. Test manuals and reviews should describe:
Table 3. General Guidelines for Interpreting Validity Coefficients
Above .35 — very beneficial
.21 to .35 — likely to be useful
.11 to .20 — depends on circumstances
Below .11 — unlikely to be useful
Scenario One
You are in the process of hiring applicants where you have a high selection ratio and are filling positions that do not require a great deal of skill. In this situation, you might be willing to accept a selection tool that has validity considered "likely to be useful" or even "depends on circumstances" because you need to fill the positions, you do not have many applicants to choose from, and the level of skill required is not that high. Now, let's change the situation.

Scenario Two
You are recruiting for jobs that require a high level of accuracy, and a mistake made by a worker could be dangerous and costly. With these additional factors, a slightly lower validity coefficient would probably not be acceptable to you because hiring an unqualified worker would be too much of a risk. In this case you would probably want to use a selection tool that reported validities considered to be "very beneficial" because a hiring error would be too costly to your company.

Here is another scenario that shows why you need to consider multiple factors when evaluating the validity of assessment tools.

Scenario Three
A company you are working for is considering using a very costly selection system that results in fairly high levels of adverse impact. You decide to implement the selection tool because the assessment tools you found with lower adverse impact had substantially lower validity, were just as costly, and making mistakes in hiring decisions would be too much of a risk for your company. Your company decided to implement the assessment given the difficulty in hiring for the particular positions, the "very beneficial" validity of the assessment, and your failed attempts to find alternative instruments with less adverse impact. However, your company will continue efforts to find ways of reducing the adverse impact of the system.

Again, these examples demonstrate the complexity of evaluating the validity of assessments. Multiple factors need to be considered in most situations. You might want to seek the assistance of a testing expert (for example, an industrial/organizational psychologist) to evaluate the appropriateness of particular assessments for your employment situation. When properly applied, the use of valid and reliable assessment instruments will help you make better decisions. Additionally, by using a variety of assessment tools as part of an assessment program, you can more fully assess the skills and capabilities of people, while reducing the effects of errors associated with any one tool on your decision making.

A document by the U.S. Department of Labor, Employment and Training Administration, 1999