Perhaps one of the most challenging aspects of teaching a course is developing exams. Tests, when created effectively, can be very useful measures of student mastery of course concepts. This is especially true when they are specifically linked to course objectives or outcomes. Effective exams have the following characteristics:
- They are reliable. Reliability is demonstrated when an exam produces data that is consistent over time (Banta and Palomba, 2015). Tests that are too long, have confusing directions, and/or have an unclear scoring protocol are all examples of unreliable assessments.
- They are valid. Validity is achieved when a test measures exactly what it was created to measure (Banta and Palomba, 2015). An exam has major problems with validity when its items are not connected to course learning outcomes or when it produces unexpected results. For instance, if all high-performing students in a class answer a test question incorrectly, the item is most likely invalid.
- They are free from bias. There are two types of bias when it comes to testing. Both forms have to do with validity. Construct validity bias refers to whether the exam measures what it was intended to measure. Content validity bias refers to whether the test items are comparatively more challenging for one group of students than for others.
The following sections describe types of exam items and related examples, as well as general strategies for effective exam and test question development.
There are numerous types of exam items that can be used to assess student comprehension and competence. When deciding which type of exam item to use, instructors should consider what skills, concepts, or knowledge they want students to demonstrate. Each type of exam item offers advantages and disadvantages, and both should be weighed before deciding which item type best measures student learning. Below you will find descriptions of some common types of exam items:
Multiple choice– consists of a statement or question (the stem) drawn from a concept or learning objective, followed by multiple possible options to select from. Typically, only one answer is correct, but the test developer may include multiple correct answers.
True/false– a special case of the multiple choice item in which only two possible answer choices (true or false) exist. The answer options are preceded by a statement that stems from a major concept or learning objective from the course.
Essay– consists of an open-ended question that allows the test-taker to elaborate, in their own words, on one or more major concepts or learning objectives from the course. Typically, directions on what is expected in the answer are provided before the question is posed. Questions should be specific, but should allow the test-taker to share their understanding of the major concept(s).
Fill-in-the-blank– an incomplete statement that requires the test-taker to write in the missing word(s) to make the statement true and sensible. These statements typically require the test-taker to show they can identify keywords within a major concept.
Computational– an item that requires the test-taker to demonstrate analytical understanding of a stated problem through justifiable and logical calculations. These questions are normally found on math-based or quantitative exams. Test-takers must show the steps used to connect the given information to the answer.
Oral– test-takers are prompted with a question and justify their answer through a spoken response.
Performance/demonstrative– test-takers exhibit an understanding of key concepts and skills by physically demonstrating the skill in a controlled environment. Oftentimes, this form of testing is conducted in a role-playing or simulated setting.
1. Multiple choice
Question: Suppose you want to measure the difference in students’ responses between a pre-test and a post-test. Which statistical test would be most appropriate for this scenario?
- A. Independent t-test
- B. Dependent t-test
- C. Factorial ANOVA
- D. One-way ANOVA
Answer: B. Dependent t-test
2. True/false
A conjecture is a statement that is believed to be true based on observations, but has yet to be proven true.
True or False?
Answer: True
3. Essay
Provide a response to the following question. Be sure to provide examples that illustrate and support your argument.
Question: How was international diplomacy conducted and viewed after the end of the First World War?
4. Fill-in-the-blank
Movement towards chemical attractants and away from repellents is known as __________.
Answer: Chemotaxis
5. Computational
Solve for x: x + 2(9) = 54
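A complete response would show the steps used to reach the answer, for example:
x + 2(9) = 54
x + 18 = 54
x = 36
Answer: x = 36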
6. Oral
Explain the difference between inductive and deductive reasoning and provide examples of each.
7. Performance/demonstrative
The student will demonstrate the proper procedures for performing CPR using a dummy.
- Connect individual test items to course learning outcomes/objectives. A test blueprint can help with this; see the Test Blueprints section below to read more about test blueprints.
- Consult with multiple colleagues in designing test questions.
- Involve students in the process by having them submit possible test questions that align with course outcomes.
- Avoid creating tests that differ too much from other assessments in the course. If you must deviate, explain to students the format of the exam up front.
- Avoid “trivia” questions. These questions focus on “nice to know” information that may have been mentioned once or twice in class but is not relevant to major course concepts.
References:
Banta, T.W. & Palomba, C.A. (2015). Assessment essentials: Planning, implementing, and improving assessment in higher education (2nd ed.). San Francisco, CA: Jossey-Bass.
Eberly Center for Teaching Excellence at Carnegie Mellon University. (n.d.). Creating Exams. Retrieved September 28, 2018, from https://www.cmu.edu/teaching/assessment/assesslearning/creatingexams.html
McMillan, J.H. (2018). Classroom assessment: Principles and practice that enhance student learning and motivation (7th ed.). New York, NY: Pearson.
Union University Center for Faculty Development. (n.d.). Qualities of an Effective Examination. Retrieved September 28, 2018, from https://www.uu.edu/centers/faculty/resources/article.cfm?ArticleID=135
Determining Learning Outcomes & Exam Type
Before creating your exam, it is important to first review the learning outcomes from your course, or, more specifically, from the unit/module you are covering. Your student learning outcomes should explicitly describe what students should know and be able to do by the end of the unit/module. Moreover, they should serve as the basis for all the items on your exam. For help with creating clear learning outcomes, visit our Writing Clear Learning Outcomes webpage.
Once you have established the skills that the test should measure, it is time to create the test items. First, however, you need to decide what type of exam items to create in order to determine student mastery of the learning outcomes. The most common types of exams, along with their pros and cons, are included in this table.
Some considerations for selecting the right type of exam items include:
- What skills do you want students to demonstrate? This is not only a consideration for your course, but also for the program curriculum.
- How long do students have to take the exam? This will determine how many items the exam should have.
- Will students have the resources they need to complete the test? In what type of space will students be able to take the exam?
- Is there a departmental policy for the types of exams administered?
- What is the purpose of the exam? Are there other ways that students can demonstrate their learning?
Test Blueprints
Once you have created your test items, it is time to determine whether the test is measuring exactly what it needs to measure. One useful tool that can help with this process is the test blueprint. A test blueprint outlines the learning outcomes that you intend to assess and helps to specify what content is most important to cover in the exam (McMillan, 2018). Test blueprints can be organized in a number of different ways; however, one of the most common forms of this tool uses Bloom’s Taxonomy as its basis. An example of this type of blueprint template can be found in this downloadable PowerPoint. Other examples of this tool are also included below:
- Educational Research Exam example by Linda Suskie: This example includes a breakdown of the scoring, and the unit learning outcomes are organized by topic.
- Test Blueprinting, A Course-Embedded Tool: This tool from Bloomsburg University defines test blueprints and explains the advantages of using them. It also includes a sample test blueprint that can be used as a model for your own exam planning.
When using a test blueprint, be sure to think about the following (a minimal illustrative blueprint follows this list):
- How will the material be distributed across the test?
- How will the assessment grade/points be distributed and weighted?
- How will the test balance knowledge/recognition items with higher-order items (using Bloom’s Taxonomy or another framework)?
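For illustration only, a minimal blueprint for a hypothetical three-topic unit might distribute items and points like this (the topics, item counts, and point weights are invented for this example):

| Unit topic | Remember/Understand items | Apply/Analyze items | Evaluate/Create items | Points |
| --- | --- | --- | --- | --- |
| Topic A | 4 | 2 | 0 | 25 |
| Topic B | 3 | 3 | 1 essay | 40 |
| Topic C | 2 | 2 | 1 essay | 35 |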
While it can seem like additional work on the front end, using a test blueprint can help you not only focus on questions that promote higher-order thinking skills, but also place appropriate emphasis on important learning objectives.
In sum, creating strong exams requires intentional thinking and planning for how your students will demonstrate mastery of the concepts you present in class. To do this effectively, we at TLI recommend that you consider the following strategies when designing your tests:
- When creating multiple-choice exams, avoid answer choices such as “all of the above” or “none of the above.” These choices can be leading, allowing students to arrive at an answer by eliminating options rather than demonstrating mastery of the content.
- Keep answer alternatives parallel in form and similar in length when writing multiple-choice questions.
- When creating essay tests, use language that reflects higher-order levels of learning (e.g., analyze, evaluate, justify, etc.).
- Keep test instructions clear, simple, and unambiguous.
- Keep the length of the test reasonable given the timeframe students have to take it. Remember that you are a content expert and your students are – in many cases – novices. What might take you 30 minutes to complete will most likely take them longer.
- Make the grading process clear on the exam itself so that students know how they will be assessed before you grade it.
References:
Banta, T.W. & Palomba, C.A. (2015). Assessment essentials: Planning, implementing, and improving assessment in higher education (2nd ed.). San Francisco, CA: Jossey-Bass.
McMillan, J.H. (2018). Classroom assessment: Principles and practice that enhance student learning and motivation (7th ed.). New York, NY: Pearson.
Stallbaumer-Beishline, L.M. (n.d.). Outcomes Assessment Essentials: Test Blueprinting, A Course-Embedded Tool. Retrieved September 28, 2018, from http://facstaff.bloomu.edu/lstallba/_documents/OAE4_TestBlueprinting – Copy.pdf
University of Texas at Arlington. (n.d.). Advantages and Disadvantages of Various Assessment Methods. Retrieved September 28, 2018, from https://www.uta.edu/ier/Resources/docs/AssessmentMethods.pdf
Weimer, M. (2018, February 12). Multiple-Choice Tests: Revisiting the Pros and Cons. Retrieved September 28, 2018, from https://www.facultyfocus.com/articles/teaching-professor-blog/multiple-choice-tests-pros-cons/
Every test should have its content examined for accuracy and consistency. A valid and reliable test helps ensure students were tested fairly and on the intended outcomes. If a test is too difficult or too easy (or if individual items are), it should be re-examined so that items can be corrected, amended, or discarded. This document outlines some of the “after test” checks an instructor can use to determine that a good test was constructed and used; in particular, item analysis, test reliability, and test validity will be discussed.
Item analysis is the process of determining the quality of a test item. The purpose is to make sure that all items in the test display adequate variation in responses and a strong correlation with other items measuring the same construct.
Each item in the test should have an adequate number of varying responses from students. That is, not all students should be responding to the item in the same way. If many students respond to the question similarly, then the item is either too easy (almost all students are correct) or there is a strong distractor in the question or response options (almost all are incorrect). Likewise, if only a small percentage of students answer a question correctly, then the question should be re-examined or discarded from future use.
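As a quick, optional illustration (not drawn from the sources cited here), item difficulty can be estimated as the proportion of students answering each item correctly. A minimal Python sketch, assuming a hypothetical 0/1 scored response matrix:

```python
import numpy as np

# Hypothetical scored responses: rows = students, columns = items
# (1 = correct, 0 = incorrect). Values are invented for illustration.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
])

# Item difficulty ("p-value"): the proportion of students who answered each
# item correctly. Values near 1.0 suggest the item may be too easy; values
# near 0.0 suggest it is too hard or that a distractor (or the key) needs review.
difficulty = responses.mean(axis=0)

for i, p in enumerate(difficulty, start=1):
    print(f"Item {i}: proportion correct = {p:.2f}")
```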
A few statistical procedures can be used to determine if an item is well constructed:
- Point biserial is a correlation statistic that compares a dichotomous variable (only two possible answers, e.g., right or wrong) with a continuous variable (e.g., the student’s total score) to determine how well the question discriminates between higher- and lower-scoring students. Caution: the point biserial correlation does not indicate that the test is valid, but it can be an indicator of consistency (reliability) (Linn & Gronlund, 2000).
- Cronbach’s alpha can help determine how well all items correlate with one another. A high coefficient (on a scale of 0 to 1) can indicate that the items measure the same underlying construct, which is important if groups of items are intended to measure the same learning outcome. Note: if removing an item from the test would significantly increase the coefficient, that item is likely not measuring the same construct and should be reworded or removed.
- Kuder-Richardson Formula 20 (KR-20) is a special case of Cronbach’s alpha that can be used for tests or constructs that are dichotomous (e.g., true/false or right/wrong). A brief computational sketch of these statistics follows this list.
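The sketch below is a hypothetical illustration, not a prescribed procedure: the response matrix is invented, the point biserial is computed with scipy.stats.pointbiserialr, and KR-20 is computed directly from its formula.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical scored responses: rows = students, columns = items
# (1 = correct, 0 = incorrect). Values are invented for illustration.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
])
total_scores = responses.sum(axis=1)

# Point-biserial correlation for each item: how strongly answering the item
# correctly relates to the overall test score (an item discrimination index).
for j in range(responses.shape[1]):
    r, _ = pointbiserialr(responses[:, j], total_scores)
    print(f"Item {j + 1}: point-biserial r = {r:.2f}")

# KR-20, a special case of Cronbach's alpha for dichotomous items:
# KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)
k = responses.shape[1]
p = responses.mean(axis=0)   # proportion correct per item
q = 1 - p                    # proportion incorrect per item
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / np.var(total_scores, ddof=1))
print(f"KR-20 = {kr20:.2f}")
```

In practice, these calculations would be run on the full scored response matrix exported from your testing platform, with low or negative point-biserial values flagging items to review.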
See the following resources for additional information on item analysis:
Test reliability means that the test produces consistent responses and results (Banta & Palomba, 2015). A few ways of measuring the reliability of a test are:
- Parallel forms reliability is a measure in which a group of students is administered two forms of the same test. To reduce fatigue, pairs of similar questions can instead be embedded within a single test and later correlated to measure consistency.
- Inter-rater reliability measures the consistency of two or more raters’ assessments of a student’s performance. Inter-rater reliability is often used in performance assessments to make sure that subjective grading is minimized. For example, two instructors can use the same rubric to measure student performance on a presentation and compare their scores for consistency. A brief computational sketch of both measures follows this list.
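A minimal, hypothetical Python sketch of both checks (the scores, rubric ratings, and sample sizes are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Parallel forms: correlate the same students' scores on two comparable forms.
form_a = np.array([78, 85, 62, 90, 74, 88])
form_b = np.array([75, 88, 65, 86, 70, 91])
r, _ = pearsonr(form_a, form_b)
print(f"Parallel-forms correlation: {r:.2f}")

# Inter-rater reliability: agreement between two raters scoring the same
# presentations on a 1-4 rubric. Cohen's kappa corrects for chance agreement.
rater_1 = [4, 3, 2, 4, 3, 1]
rater_2 = [4, 3, 3, 4, 2, 1]
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```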
See the following references for additional information on test reliability:
Test validity means that the test measures what it is intended to measure (Banta & Palomba, 2015; Colton & Covert, 2007; McMillan, 2018). There are a few common procedures to use when testing for validity:
- Content validity is a measure of the overlap between the test items and the learning outcomes/major concepts. Judgment of content validity typically involves a panel of subject-matter experts who rate the content of the test and come to a consensus, often supported by inter-rater reliability measures. A brief illustrative sketch follows this list.
- Construct validity is the degree to which the test actually measures the underlying construct (often the learning outcomes or major concepts). Cronbach’s alpha or KR-20 could serve as reasonable indicators of construct validity.
- Face validity means that, on inspection, the items appear to measure the learning outcomes/major concepts (i.e., the “eyeball test”).
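One common way to quantify the expert-panel approach to content validity is an item-level content validity index (I-CVI), the proportion of experts who rate an item as relevant. The sketch below is illustrative only and is not drawn from the sources cited here; the ratings and scale are invented.

```python
import numpy as np

# Hypothetical expert relevance ratings: rows = experts, columns = test items,
# each item rated 1 (not relevant) to 4 (highly relevant). Values are invented.
ratings = np.array([
    [4, 3, 2, 4],
    [4, 4, 2, 3],
    [3, 4, 1, 4],
    [4, 3, 2, 4],
])

# Item-level content validity index (I-CVI): the proportion of experts who
# rate the item as relevant (3 or 4). Items with a low I-CVI are candidates
# for revision or removal.
i_cvi = (ratings >= 3).mean(axis=0)
for j, cvi in enumerate(i_cvi, start=1):
    print(f"Item {j}: I-CVI = {cvi:.2f}")
```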
See the following references for additional information on test validity:
References:
Banta, T.W. & Palomba, C.A. (2015). Assessment essentials: Planning, implementing, and improving assessment in higher education (2nd ed.). San Francisco, CA: Jossey-Bass.
Colton, D. & Covert, R.W. (2007). Designing and constructing instruments for social research and evaluation. San Francisco, CA: Jossey-Bass.
Linn, R.L. & Gronlund, N.E. (2000). Measurement and assessment in teaching (8th ed.). Upper Saddle River, NJ: Prentice-Hall.
McMillan, J.H. (2018). Classroom assessment: Principles and practice that enhance student learning and motivation (7th ed.). New York, NY: Pearson.