In a world of continuous upskilling, assessments are the compass that guides us on our learning journeys. They show us where we’re at and what areas could use improvement. Whether you're a corporate trainer, an educator, or a student, chances are that you have come across a variety of assessments or have even put together an assessment yourself. One way or another, you may have wondered: “Is this really the best way to assess this?”
To answer this question, we will look into the science and practice behind crafting effective assessments. This marks the first in a series of blog posts on data-driven upskilling. In the upcoming posts, we'll dive deeper into topics such as a framework for assessing data literacy and using assessment results to tailor personalized upskilling plans. Throughout this series, we aim to provide a comprehensive guide to help you enhance your team’s data skills and mindset.
Let's take a closer look at what it takes to craft an effective assessment.
The Power of Assessments
Assessments are a way of measuring and evaluating someone’s skills, knowledge and even mindsets and preferences. Commonly known applications range from job candidate assessments to driving license tests and personality inventories.
In the corporate setting, assessments are the go-to method for uncovering hidden talents and identifying skill gaps in the employee pool. And their results are used to make important decisions, such as hiring or upskilling decisions, that can have a significant impact on the organization.
Currently in high demand are data literacy assessments. With stricter data security regulations and AI tools becoming a fixture of office work, organizations have recognized that virtually all employees come into contact with data. Organizations want to understand how well their employees can interpret data, handle data tools, and make data-driven decisions. They not only want insights but also want to be able to intervene and upskill their employees where necessary. That is where assessments come into play.
Given the impact assessments can have, one would hope that much care goes into their development. So, what are some best practices for crafting assessments?
The Art of Crafting Assessments
“Intelligence is what the [intelligence] test measures.” (Boring, 1923, The New Republic)
Every psychology student will recognize this quote. Whether it’s a useful definition of intelligence is debatable; it is, however, a helpful mantra to keep in mind when designing a test or assessment. It should remind you to always begin with asking yourself: “What is it exactly that I want to test?”
In other words:
Define a clear objective
Let’s assume your goal is to assess data literacy. What specific skills or knowledge areas do you want to measure? Are you interested in the aggregate level of data literacy across the organization? Or do you intend to measure each individual’s data literacy level and match it to the best fitting learning journey? Think about the purpose of your assessment and describe your objective as clearly and distinctly as possible.
With your objective in mind, you move on to selecting the materials that will make up your assessment:
Choose the right format
Assessments can take various forms, such as surveys, practical skill tests, or scenario-based simulations. Consider what format aligns best with your objectives and audience. While text-based assessments are an efficient way of assessing knowledge and mindsets, task-based assessments are a great way of assessing performance and skills. The latter can even be used to tap into habits and preferences that the participants aren’t able to put into words.
Use the appropriate question types
Sometimes, an assessment is as easy as asking a list of questions. Sounds simple, right? Well, not to ruin the mood or anything, but there are indeed different ways of asking questions, and each comes with pros and cons.
Take open-ended questions in a survey for example. They prompt the participant to enter their response in a text input field. As such, they give the participant the freedom to express their thoughts, but the use of natural language may complicate the analysis. Closed-ended questions such as multiple-choice questions, on the other hand, provide the participant with limited response options to choose from. This makes closed-ended questions easier to analyze at a larger scale, but they may miss a crucial answer that would have been informative.
To get the best of both worlds, it’s common practice to end multiple-choice questions with a free text input option for “other”. If you’re looking to get a train-of-thought type of response, though, nothing beats a well-phrased open-ended question.
Plan the question flow
A series of questions can be either static, meaning each question appears in a predefined order, or dynamic, meaning the questions adapt to the participant’s responses. For example, if an initial block of questions reveals that the participant is a data novice, a static question order would still march them through programming questions that are pointless at their level, whereas a dynamic flow could skip ahead. The downside of dynamic question series, however, is that they require more planning and are not supported by every survey tool.
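As a minimal sketch, a dynamic flow can be expressed as a routing function over question IDs. The question IDs, texts, and branching rules below are invented for illustration:

```python
# A minimal sketch of a dynamic (adaptive) question flow.
# Question IDs, texts, and branching rules are made up for illustration.

QUESTIONS = {
    "q_experience": "How often do you work with data?",
    "q_basics": "What does a bar chart's y-axis typically show?",
    "q_sql": "Which SQL clause filters rows before aggregation?",
    "q_python": "What does pandas.DataFrame.groupby return?",
}

def next_question(current_id, answer):
    """Pick the next question based on the previous answer."""
    if current_id == "q_experience":
        # Self-reported novices skip the programming questions entirely.
        return "q_basics" if answer == "rarely" else "q_sql"
    if current_id == "q_basics":
        return None  # end of the novice track
    if current_id == "q_sql":
        return "q_python"
    return None  # end of the advanced track

# A novice is routed past the SQL/Python items; an expert is not.
print(next_question("q_experience", "rarely"))  # q_basics
print(next_question("q_experience", "daily"))   # q_sql
```

In a real survey tool, the same routing logic would typically be configured through the tool's branching feature rather than written by hand.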
Calibrate the difficulty
Even though difficult assessments may draw an angry mob to your office or leave job candidates confused and disappointed, they have their advantages. A stricter assessment can be necessary to distinguish between participants with stronger and weaker skills. By contrast, an easy, participation-trophy type of assessment may result in very little variance in the scores and therefore little insight into individual differences. In general, it’s good to aim for a balance between easy and challenging assessment elements so that participants feel engaged and complete the assessment.
Reverse-engineer the scoring
Think backwards: start by picturing the analyses you wish to run on the assessment scores. Maybe you would like to infer distinct categories of data literacy, or maybe you’d like to compute a correlation between data skills and data mindsets. Then, based on the analysis type, reverse-engineer the measurement scales and statistical criteria your assessment scores need to meet. For example, to compute a linear correlation, the two variables need to be on an interval or ratio level of measurement, meaning the values on the measurement scale can be ranked and there is an equal distance between adjacent values.
Fig. 1: With your assessment's objective in mind, you’ll need to strike a balance between the various assessment formats, question types, item sequences and difficulty levels best suited for your purpose.
Ok, now you’ve got the materials to craft an assessment. But how do you know whether your assessment measures what it’s supposed to? Enter: psychometrics.
The Science of Psychometrics
Psychometrics is the science of measuring psychological attributes such as skills, knowledge, mindsets, and opinions. It provides the methodology to check whether your assessment is accurate, precise, and trustworthy.
At the core of psychometrics is a series of criteria that represent the North Star of test quality. If you want to design a truly accurate and trustworthy assessment, you should aim to meet these criteria. Spoiler: you will never meet them 100%; everything above 70% is acceptable, above 80% is good and above 90% is excellent. That's why, in practice, using common sense and following the tips outlined in the next section often gets the job done. It doesn't hurt to be aware of the North Star, though, so you know which direction you should be heading.
The test quality criteria are:
Objectivity
Your assessment’s results should be unaffected by who administers the assessment.
This is especially relevant for in-person assessments, for which testers need to be trained carefully to standardize the way they guide participants through the assessment. It is less of an issue for computerized assessments that remove the person in the middle. In that case, however, it’s crucial that the instructions and test content are crystal clear and unambiguous for each and every participant.
Reliability
Your assessment should be consistent and dependable. That means, if you were to administer the same assessment to the same person twice, it should yield the same score.
To achieve high reliability, it is best practice to conduct pilot tests, use consistent scoring, and eliminate ambiguous questions. To check if you have succeeded, repeat an assessment (test-retest reliability), or compare test items that should measure the same latent variable (internal consistency reliability), and compute a correlation coefficient as a metric of reliability. A common practice in questionnaire design is to include the same question in two or more versions and check if people respond consistently.
Validity
The assessment should measure what it’s supposed to.
For a data literacy assessment, for example, you want to ensure it evaluates actual data skills and not unrelated knowledge. You can do so by testing whether participants’ skills in the assessment correlate with their skills on the job. For instance, check if participants who have advanced programming skills according to your assessment also perform well according to code reviews on the job.
Fig 2: High validity means your assessment hits the target. High reliability means your assessment repeatedly hits the same spot (whether it’s the target or not). With low reliability, it’s impossible to achieve high validity, because when an assessment yields different results each time, you simply don’t know which result is on target. Therefore, you should always aim for both high validity and high reliability.
Fairness and prevention of bias
Avoid assessment elements that favor certain demographics or backgrounds. If your assessment instructions exclusively provide male examples, don’t be surprised if female participants interpret the instructions differently. It is also helpful to be aware of cultural biases. For example, in some regions it is normal to absolutely agree or disagree, whereas in other regions extremes are considered impolite and participants will at most somewhat agree or disagree. Factor this in when designing your response options.
Further, avoid questions that imply what type of answer would be socially desirable. Given the opportunity, participants will – consciously or not – tend to present themselves in a better light. This can be detrimental if you are trying to uncover skill areas that need improvement. To reduce this type of response bias, choose your wording such that participants are incentivized to present themselves as truthfully as possible.
So, we know that we should aim towards the North Star of test quality. How do we do that in practice?
Crafting an Assessment
Time to roll up our sleeves and walk through the practical steps of crafting an assessment. For illustration purposes let’s stick to our data literacy example.
Build a question bank
Create a collection of questions that cover different aspects of data literacy. You could also look into automated item generation, for which you build a test item template and let a computer algorithm generate test items. Make sure to include multiple questions on the same topic and phrase them both positively and negatively to counteract “yea-saying” tendencies, the phenomenon of respondents tending to agree with questions if in doubt.
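A toy sketch of what template-based item generation can look like; the templates and task topics below are invented examples, and each topic gets one positively and one negatively keyed item to counteract yea-saying:

```python
# Sketch of automated item generation from a template.
# Templates and topics are invented; a real question bank
# would be reviewed by domain experts.

TEMPLATE_POSITIVE = "I feel confident when {task}."
TEMPLATE_NEGATIVE = "I tend to avoid situations where I must {task}."

TASKS = [
    "interpreting a line chart",
    "writing a database query",
    "explaining a statistic to a colleague",
]

def generate_items(tasks):
    """Generate a positively and a negatively keyed item per topic."""
    items = []
    for task in tasks:
        items.append({"task": task, "keyed": "+",
                      "text": TEMPLATE_POSITIVE.format(task=task)})
        items.append({"task": task, "keyed": "-",
                      "text": TEMPLATE_NEGATIVE.format(task=task)})
    return items

bank = generate_items(TASKS)
print(len(bank))  # two items per topic
```

Remember that negatively keyed items need to be reverse-scored before any totals or correlations are computed.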
Randomize questions and answers
By shuffling the order of questions and answer options per participant you reduce the risk that the order of questions or assessment elements affects your results.
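A minimal sketch of per-participant shuffling; seeding the random generator with the participant ID is an assumption on our part, not a requirement, but it makes each participant's order reproducible across sessions:

```python
# Sketch: shuffle question order and answer options per participant.
# Seeding with the participant ID keeps each participant's order
# reproducible (an illustrative choice, not a requirement).
import random

def shuffled_assessment(questions, participant_id):
    rng = random.Random(participant_id)
    # Shuffle answer options within each question (copies, not in place).
    shuffled = [dict(q, options=rng.sample(q["options"], k=len(q["options"])))
                for q in questions]
    # Then shuffle the question order itself.
    rng.shuffle(shuffled)
    return shuffled

questions = [
    {"text": "Which chart suits a trend over time?",
     "options": ["bar", "line", "pie"]},
    {"text": "What does a median describe?",
     "options": ["spread", "center", "shape"]},
]

print([q["text"] for q in shuffled_assessment(questions, participant_id=42)])
```

The same participant ID always produces the same order, while different participants see different orders, which is exactly what you want when checking for order effects.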
Pilot test the assessment
Administer the assessment to a small group and gather data on whether each assessment element or question contributes to capturing the objective in an effective and meaningful way. Are some questions answered identically by everyone? Toss them out. Do you see any evidence of response bias, like a larger than expected group of individuals who present themselves as “highly skilled”? Rephrase the affected questions and edit the answer options.
Evaluate the pilot results using your domain knowledge and common sense. Do the results tell a coherent story? Does anything stand out as unusual or surprising? Now is the time to play devil’s advocate and investigate if any results may be due to flaws in the assessment design.
Consider item analysis
If precision and accuracy of your assessment are your highest priorities, you should consider item analysis, the statistical method for evaluating and selecting the test items that make up an assessment. It is used to analyze the difficulty and discriminative power of each assessment element and – in an iterative process – optimize for a test that accurately measures the attributes you’re trying to measure. For example, item analysis can help you select items that effectively differentiate between levels of data literacy.
In cases where a false result can have far-reaching consequences, such as in diagnostic tools, item analysis is an absolute necessity. Because it requires multiple iterations and more resources, however, it is often skipped in the development of corporate assessments. So, if you're after the most effective assessment, budget for item analysis and optimization.
Bonus: Give participants immediate feedback to reinforce learning
Use the momentum your assessment results created and follow up with an action plan. After taking our data literacy assessment, the participants would ideally get immediate feedback on their strengths and weaknesses and be directed to their personalized upskilling journey.
Fig 3: A well-designed assessment is the ideal starting point for a tailored upskilling journey. Shown here are the development steps from a data literacy assessment to an upskilling program that can be dynamically linked to each participant's assessment results.
Assessments help us navigate and understand learning progress. They can be powerful tools to tap into hidden talents, uncover skill gaps, and illuminate paths to improvement.
With a good understanding of the challenges and pitfalls of assessment design, you will be well prepared to embark on the journey of crafting an effective assessment. Follow the principles of objectivity, reliability, validity and fairness and you will design assessments that not only measure learning but foster an environment of growth.
Next time in the "Data-Driven Upskilling" series, we will zero in on a framework for assessing data literacy specifically. Stay tuned for more insights on how to harness the full potential of data for upskilling your team!
Boring, E. G. (1923). Intelligence as the test measures it. The New Republic, 35, 35–37.