Measurement in rehabilitation

Last updated: 14 September 2025

“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.” Lord Kelvin, 1883

Every week at least one new rehabilitation measure is published, often more, despite the thousands that already exist. These measures are used daily to evaluate change and to assess patients. Many measures are called assessments, and the terms are often used interchangeably. Measures are regarded as objective or scientific; simply asking someone whether they feel better, however polite, is discouraged as a way of determining whether they actually do. Clinicians and researchers often feel vulnerable when discussing the measures they employ, worrying about their psychometric validity.

Measurement in rehabilitation is essential clinically, for service quality enhancement, and in research. In this post, I present an alternative approach that might encourage rehabilitation researchers. I am often asked to speak about measures and measurement, and many years ago, I decided I needed to move away from the standard approach of considering validity, reliability, etc., as it was so dull. This post adopts the approach I used to look at measurement in rehabilitation differently.


Prelude

One day, someone phoned to ask me to “recommend a valid measure for use with stroke patients because we have been told you are an expert”. I asked what they were studying, but they replied, “That doesn’t matter. We just want a valid outcome measure.” They refused to tell me, so I said, “Well, one simple and valid measure is the Frenchay Aphasia Screening Test.” They replied that that test was no good because they were physiotherapists researching gait retraining. They did not understand what validity meant.

Introduction

The Oxford English Dictionary [OED] gives two primary meanings for ‘to measure’:

  • “ascertain the size, amount, or degree of (something) by using an instrument or device marked in standard units”
  • “assess the importance, effect, or value of (something)”, where the verb, to assess, means “evaluate or estimate the nature, ability, or quality of”.

This immediately illustrates the conflation of measurement with assessment. Later, I will consider how they relate to each other.

Nevertheless, it highlights one crucial aspect of measurement, namely the comparison with a metric or standard unit. The issue of a standard metric, or a way to define a unit, so that numbers have a fixed meaning, remains a challenge in rehabilitation. One approach widely used is to form standard units, such as logits, using statistical techniques, including Rasch Analysis, which is discussed later.

This issue arises because most of the phenomena measured are constructs; they do not have an independent reality. Constructs are themselves challenging because often two measures nominally of the same thing will differ greatly, as the construct is interpreted differently by different people. Measurement of spasticity illustrates the difficulty; it is “poorly defined and poorly measured.”

I will review these and other background issues before proposing my alternative approach.

Background matters.

“Words are never precise: the variegated cloud of meanings they carry about with them is their expressive power. But it also generates confusion, ‘cause you know sometimes words have two meanings’.”

Carlo Rovelli. Helgoland. Chapter: Information. p.90

The matters explored here primarily concern the meaning of the words used and, as Carlo Rovelli emphasises, words themselves are imprecise.

Constructs

A construct is a collection of objects, observations, or concepts that are considered to share a common characteristic. Every noun names a construct. For example, a chair seems a quite definite object: something you sit on, “typically with a back and four legs” [OED]. Yet many other objects may be used as chairs and will be called chairs in that context: suitcases, boxes, short stepladders, logs of wood, and so on.

Constructs can be abstract, such as justice or fairness, or drawn from multiple observations, such as a medical diagnosis of fibromyalgia, but many, like the chair, concern physical objects. Plato called the construct ‘the ideal form’.

In healthcare, including rehabilitation, most physical signs are constructs. Spasticity, which has already been mentioned, is one; visuospatial neglect is another. We summarise a set of behavioural and other observations and attribute them to the supposed deficit.

Thus, one must appreciate that, when measuring anything, the phenomenon is based on some central feature that is typically associated with the range of observations used to define it. This is a universal fact; a medical diagnosis such as Parkinson’s Disease is never certain – nor is the absence of the diagnosis.

The main messages are that:

  • The items that measure a phenomenon also define the phenomenon, and
  • If two different measures use different items to measure something with the same name, they will not be measuring the same thing.
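The second point can be made concrete with a few lines of Python comparing the item content of two hypothetical instruments both labelled ‘mobility’. The item lists below are invented for illustration, not taken from real scales; the Jaccard index simply measures the proportion of items the two sets share.

```python
# Hypothetical item lists for two instruments both labelled "mobility";
# the items are invented for illustration, not taken from real scales.
measure_a = {"turning in bed", "sitting balance", "standing", "walking indoors"}
measure_b = {"walking indoors", "walking outdoors", "stairs", "running"}

shared = measure_a & measure_b
jaccard = len(shared) / len(measure_a | measure_b)  # shared / all items
print(sorted(shared), round(jaccard, 2))  # ['walking indoors'] 0.14
```

With only one item in common, the two instruments are plainly not measuring the same thing, whatever their names suggest.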

Measurement, quantification, data types

Measurement is a constrained, single activity that aims to quantify a phenomenon. Measurement data fall into four categories.

Some data are simple categories, such as sex or handedness: a person is male or female, and may be right-handed, left-handed, or ambidextrous. These data are called nominal because they name the person’s category, and they are commonly used in rehabilitation measurement. Quantification is limited to counting the number of people in each category; an individual cannot be quantified, only described.
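A minimal sketch of what quantifying nominal data amounts to, using invented handedness data: all one can do is count category membership.

```python
from collections import Counter

# Invented handedness data: nominal categories, so quantification is
# limited to counting how many people fall into each category.
handedness = ["right", "left", "right", "right", "ambidextrous", "left", "right"]
counts = Counter(handedness)
print(counts)  # Counter({'right': 4, 'left': 2, 'ambidextrous': 1})
```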

The next type of data is ordinal, the most frequent type of data in rehabilitation measures.  Quantification ranks people into an order, and one can compare the ranking of one person with another in the same group or with people in some other group. The other group is usually a control group in trials. The comparative data can also be from representative groups of, for example, all people with the same diagnosis or all people seen in some other study. These datasets are sometimes referred to as normative data, but one must review critically the population used.

Because these data are ordinal, the size of each unit is not standardised and varies. There may be considerable differences between two individuals with the same score, and a difference of one unit might be trivial or significant; scores are relative rather than absolute. Examples include the Functional Independence Measure, the Rivermead Post-Concussion Symptoms Scale, and the Barthel ADL index, all commonly used in rehabilitation measurement.

One method for standardising the units is to convert scores into logit units using Rasch Analysis. This theory is mathematically sound, but it is only valid if the population being studied is the same as that used to derive the logit scores, both clinically and culturally. A detailed review in 2023, “Application of the Rasch measurement model in rehabilitation research and practice: early developments, current practice, and future challenges,” outlines the method and its limitations.

Rasch analysis converts ordinal data into interval data, which has standard units allowing two people with the same score to be considered equal on that measure. Additionally, a one-unit difference indicates the same amount of difference regardless of where the scores fall on the spectrum. However, there is no true zero point; for example, the Celsius temperature scale lacks a true zero, whereas the absolute temperature scale does. Rasch analysis logits, too, lack a true zero.
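As a sketch of the underlying mathematics, not a full estimation routine: in the dichotomous Rasch model, the probability that a person of ability theta passes an item of difficulty b depends only on the difference between them, with both expressed in logits.

```python
import math

# Sketch of the dichotomous Rasch model. The probability that a person of
# ability theta passes an item of difficulty b is
#   exp(theta - b) / (1 + exp(theta - b));
# theta and b are both in logits, giving the scale interval-level units.
def rasch_probability(theta: float, b: float) -> float:
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# The logit transform itself, mapping a proportion onto the logit scale.
def logit(p: float) -> float:
    return math.log(p / (1 - p))

print(rasch_probability(1.0, 1.0))  # 0.5 when ability equals difficulty
print(logit(0.5))                   # 0.0
```

Fitting the model to real data (estimating theta and b from item responses, and checking fit) is what the Rasch analysis software does; this sketch only shows why the resulting units are interval-level.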

Lastly, some data are derived from a continuous physical phenomenon, such as weight or length, and the quantification is absolute because there is a true zero; these are ratio data. In rehabilitation there are a few ratio measures, such as the distance walked in six minutes or the number of pegs moved on a board in one minute.

Assessment

Assessment was first used in the context of taxation; tax officials collected data about a rich person to determine how much money he (usually) should pay. Further back still, a king would assess nobles and others to determine how many soldiers each should supply in support of a war.

In rehabilitation, assessment involves collecting and interpreting data to develop a formulation. In other words, data are gathered and analysed to achieve a specific goal, acquiring a good understanding of a person’s situation that is sufficient for planning necessary actions. Because it is goal-oriented, the data needed will vary depending on the individual’s circumstances, and initial data may determine what additional data should be collected. Information on the knowledge and skills required for formulation is available here and further guidance on rehabilitation assessment is available here.

An assessment will almost always include nominal data as well as measures. The term applies to the process and the totality of the data collected, not to the individual measures.

The common feature between measurement and assessment is data collection, which is typically undertaken using data collection tools, frequently in the form of a list of items to be considered. When collecting data on a phenomenon, the same tool can be used for either process.

What data?

It is essential to consider critically the data you need before selecting a data collection tool. You must ask yourself questions such as:

  • Why do I need this information?
  • How am I going to use it?
  • Would it matter if I did not have it?
  • What is the minimum I require to satisfy the identified need?

Too often, people:

  • Cannot identify the purpose behind collecting an item
  • Choose a measure that collects more data than is required to satisfy the purpose
  • Add a measure “because others use it” or “because it might be interesting” with no other purpose

Each set of data added will increase:

  • the burden imposed on the patient
  • the likelihood of missing or inaccurate data
  • the workload of people collecting and handling the information.

Data collection tools.

The premise of this approach is that subjects are a source of data, and data collection tools filter the data produced to extract the data sought. This approach will be used to discuss measurement in rehabilitation.

The range of data.

The first step is to identify a range of data items that encompass all possible manifestations of the construct of interest.  For example, if considering mobility, one may only be interested in walking and above, or one might be interested in everything from turning over in bed to marathon running.

The data may consist of many individual items, such as the Rivermead Mobility Index, or it could be an interval measure like the distance walked in six minutes. The former covers a broad range; the latter applies only to people who can walk. More information on the Rivermead Mobility Index, including a copy of a form, is available here.

Most of the following discussion concerns the typical measures used in rehabilitation, comprising a group of individual items either categorised as present/absent or, as with the Functional Independence Measure, rated on an ordinal numeric rating scale.

The crucial point is that the data should encompass the whole range of the construct, but no more.

Collecting data.

The idea is shown in the first figure below (“1. Basic set-up”), and I will enlarge on it here.

On the left is the subject, a patient. The subject can produce many types of data. A small proportion of these will relate to the construct of interest. The researcher will define the range of data that, among the items, covers the construct’s range of manifestations. This is shown in the central pink box. Other data from the subject are displayed above and below in yellow and blue boxes.

The arrows represent data coming from the subject. The data are not positively radiated, like light or radio signals, but they are available and can be obtained by observation, for example of spontaneous behaviour; by structured questioning; by using physical tools such as gait analysis; or by asking the person to perform tasks. Often, significant data can be obtained from family, friends, or carers.

[Figure 1: Basic set-up]

Exceeding the range.

In the figure above, the dashed line within a blue oval represents the data collection tool used for measurement in rehabilitation. In this illustration, the tool covers a greater range of data than that of interest. For example, if one used the Barthel ADL index to measure toileting and continence, only three of the ten items would be relevant: bowel control, bladder control, and toileting. Arguably, one might include mobility and transferring. Feeding, dressing, bathing, and grooming are outside the construct.

The figure also shows how irrelevant data contribute to the output. The consequence is that the output produced contains additional information (noise), a signal unrelated to the desired construct. This reduces the likelihood of detecting change or difference between people, and if one set of the irrelevant data differs systematically according to toileting and continence status, it will bias the observed output.

Thus, the Barthel ADL index would be an invalid measure of the toileting and continence construct because it includes items not directly related to it.
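A toy simulation, with entirely invented numbers, illustrates the dilution described above: irrelevant items add noise that shrinks the standardised effect size of a real difference between groups.

```python
import math
import random
import statistics

random.seed(1)

def cohens_d(a, b):
    """Standardised difference between two group means (pooled SD)."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

n = 200
# Relevant items: the treated group really scores about one point higher.
relevant_treated = [random.gauss(6.0, 1.0) for _ in range(n)]
relevant_control = [random.gauss(5.0, 1.0) for _ in range(n)]

# Irrelevant items: identically distributed in both groups, but noisy.
total_treated = [r + random.gauss(5.0, 3.0) for r in relevant_treated]
total_control = [r + random.gauss(5.0, 3.0) for r in relevant_control]

d_relevant = cohens_d(relevant_treated, relevant_control)
d_total = cohens_d(total_treated, total_control)
print(round(d_relevant, 2), round(d_total, 2))  # total-score effect is diluted
```

The effect size computed from the relevant items alone is markedly larger than that from the total score, even though the irrelevant items carry no group difference at all.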

Not covering the range.

The second figure (below) shows the situation when the data items do not cover the whole range associated with the construct. For example, the Rivermead Mobility Index covers the whole range of mobility, whereas the Functional Ambulation Categories do not, nor do they claim to. Only the former should be used to cover all mobility.

The consequence is that some relevant data are missed, which lessens sensitivity and specificity. It will generally introduce bias, as some better (or worse) performing subjects do not contribute the correct signal. It leads to floor (or ceiling) effects.

Therefore, Functional Ambulation Categories would be an invalid measure of mobility, but a valid, albeit rough, measure of walking, focusing on the lower end of the range.
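A minimal sketch of the floor effect described above, with invented numbers: a hypothetical walking-only scale cannot register anything below walking, so real differences among the least mobile patients disappear.

```python
# Invented underlying mobility scores for six patients, lowest to highest.
true_mobility = [0.5, 1.5, 2.5, 4.0, 6.0, 8.0]

# A hypothetical walking-only scale that cannot register anything below
# walking (here, anything under 3 on the invented underlying scale).
walking_scale = [max(0, int(m - 3)) for m in true_mobility]

print(walking_scale)  # [0, 0, 0, 1, 3, 5]: the three least mobile all tie at 0
```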

[Figure 2: Not covering the range]

Discrimination – insufficient or biased.

The third figure illustrates a data collection tool which is insufficiently sensitive because it cannot discriminate differences that are significant to the patient, clinician, or researcher. The person measuring needs to determine what level of discrimination is relevant. Large-scale epidemiological studies often only measure survival (i.e., death), and long-term studies on stroke typically use two questions.

However, most research involves relatively small numbers, and better discrimination is needed to study change or differences between groups. If a tool has too few items, data will be lost, reducing sensitivity.  If the tool is more discriminatory at one end of the range, bias is introduced, as illustrated in the fourth figure.

[Figure 3: Insufficient discrimination]

[Figure 4: Biased discrimination]

Oversensitive discrimination.

Conversely, the data collection tool may gather data that are overly detailed. Timing a ten-metre walk by hand to a tenth of a second ignores the unavoidable variability and uncertainty in judging exactly when the walk starts and ends.

Figure five (below) shows this schematically. A single piece of data may influence two or more items, so the score will contain a lot of noise. This is one reason for using only yes/no categorisation of items. We showed that classifying the items of the Rivermead Mobility Index into four levels decreased its sensitivity.

Adding extra items unnecessarily also increases the burden on the patient and researchers, increasing the risk of incomplete data.

[Figure 5: Oversensitive discrimination]

Applicability

The most critical and most often overlooked feature of any data collection tool is whether it will actually collect the data when used. The Paced Auditory Serial Addition Test is a measure of complex cognitive function. However, it is stressful, and the difficulty increases until the person fails; unsurprisingly, many patients will not complete it. While it may be an excellent measure of its construct, it cannot be applied to many people.

Figure 6 (below) shows three ways that a data collection tool may fail. The diagram illustrates patients on the left producing a potential dataset for the observer to ‘capture’.

The first red line represents a situation where some people’s data are not collected, and this is systematic. For example, a test of reaction time that depends on pressing a finger on a button will not be usable by anyone with significant impairment of hand control. This will cause a biased loss of data and reduced sensitivity.

The second line illustrates random loss of data, which reduces the likelihood of detecting the signal; the apparent random loss may also be systematic and introduce bias.

The third line shows different observers, and one would need to check they did not differ systematically in using the tool.
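The biased loss described for the first red line can be sketched with invented numbers: when the most impaired patients cannot complete a button-press reaction-time test, their (slower) times are never recorded and the observed mean no longer reflects the whole group.

```python
import statistics

# Invented reaction times (seconds) for six patients; the slowest have
# significant hand impairment and cannot complete a button-press test.
true_times = [0.30, 0.35, 0.40, 0.55, 0.70, 0.90]
observed = [t for t in true_times if t < 0.6]  # impaired patients drop out

print(round(statistics.mean(true_times), 2))  # 0.53: the whole group
print(round(statistics.mean(observed), 2))    # 0.4: biased towards less impaired
```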

[Figure 6: Applicability]

 

This approach highlights that, when considering how to collect data, you must:

  • specify in as much detail as possible what the purpose is, what you want to know
  • identify the minimum set of data that will
    • cover the range needed to encompass all likely cases
    • provide sufficient sensitivity to meet your requirements, and no more
  • find a data collection tool that
    • satisfies the specification
    • maximises data collection in the whole population studied

Given the thousands of measures already published, it is probable that a measure exists, but it may be challenging to find it. Looking at papers investigating similar problems to yours is the best first step.

Some tools will be close to your exact specification, and the issue is whether to

  • use an existing tool that has only minor defects
  • use two existing tools with slight defects and compare their performance
    • this is an additional research question but could provide invaluable information
  • modify an existing tool to overcome defects
  • devise a new tool

I advise against creating a new tool unless nothing currently available meets your requirements. Adding a small project that enhances understanding of existing tools, possibly improving them, or comparing the performance of two tools is more valuable. It will also result in an additional publication!

Psychometric properties.

In a useful review, Psychometrics: Trust but Verify, Thomas Vetter and Catherine Cubbin describe psychometrics thus: “… in its broader sense, psychometrics is concerned with the objective measurement of the skills, knowledge, and abilities, as well as the subjective measurement of the interests, values, and attitudes of individuals—both patients and their clinicians.” (“Trust but verify” is itself a Russian proverb.)

Psychometric principles are commonly invoked when discussing measurement because psychology is largely concerned with measuring constructs such as intelligence, depression, and motivation. As almost all measurements in rehabilitation concern constructs, psychometric principles are relevant. This section discusses the main properties in the light of the model described above.

Validity.

One approach to validity is whether the tool measures what it is supposed to, such as depression. This leads to the kind of question I was asked when contacted: “for a valid measure of outcome after stroke”. The person believed that validity was inherent to the measure. It is not.

A better approach is to ask whether the tool will answer the question you are asking and whether the tool is valid for your purpose. As the questioner discovered, a measure that is valid for one purpose may not achieve the objective of a different purpose.

For example, if you want data that help establish prognosis after stroke, you must first consider which outcome interests you and then which data will provide the relevant prognosis. If asked what predicts a global outcome, one might choose urinary incontinence observed at three days post-stroke; if, instead, one were studying arm function, that would be a weak prognosticator compared with voluntary finger extension at one week.

This emphasises yet again that you must specify in as much detail as possible what you want to know before choosing a measure, and validity refers to whether the data collected will achieve your purpose.

Reliability; repeatability.

Reliable has two meanings.

One is equivalent to validity; the Oxford English Dictionary considers that reliability refers to being trustworthy and accurate as a measure. One interpretation of the definition is that the measure measures what it claims to measure. In psychometric jargon, this is also considered in terms of internal reliability, which establishes whether all data items are related to the construct being measured.

It is much better to realise that internal reliability is another aspect of validity and investigates whether there are items in the tool that are irrelevant to its purpose.

The other, specific meaning is used extensively in psychometrics but rarely outside that field; it refers to repeatability. It asks whether two people measuring a phenomenon, or one person measuring it twice, will get the same result. It also concerns trustworthiness and accuracy, but from a different perspective.

Therefore, it is much better to refer to repeatability, which makes the meaning explicit.

There are three types of repeatability:

  • inter-rater, which evaluates the difference when two people undertake the measure within a short interval
  • intra-rater, which assesses the difference when the same person repeats the measure within a short interval
  • test-retest, which evaluates the difference when the measure is repeated over a longer interval by the same or different people. This is only applicable when no change in the phenomenon is expected.
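One common statistic for quantifying inter-rater repeatability of categorical judgements is Cohen’s kappa, which compares the observed agreement between two raters with the agreement expected by chance. A minimal implementation, with invented yes/no (1/0) ratings:

```python
from collections import Counter

# Cohen's kappa for two raters' categorical judgements; ratings invented.
def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum(c1[k] / n * c2[k] / n for k in set(r1) | set(r2))
    return (observed - expected) / (1 - expected)

rater_a = [1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.467
```

Here the raters agree on six of eight patients (75%), but because chance alone would produce about 53% agreement, kappa is a much more modest 0.467.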

These are all measures of the random noise that inevitably occurs in the measurement process. Random noise arises from inherent ambiguities in items, observer fatigue, normal variation in the subject, and other causes.

This emphasises that tools must be as simple, short, and straightforward as possible to minimise factors that increase random errors.

Feasibility.

This aspect of a measure is often overlooked, but it is essential. One should strive to use tools that:

  • impose the least burden on the patient and observer
  • are perceived as relevant by the patient and observer
  • require the least training and are self-explanatory
  • collect no more data than is essential

The psychometric approach is complementary to the first approach. However you approach measurement, you must analyse why you really need every single data item and whether a shorter, simpler tool will achieve your goals.

Conclusion

You can only draw valid conclusions from a well-designed study. This principle applies to all research, audits, and clinical practice involving individual patients. An essential part of study design is determining the minimum data required to reach your objective. You should be able to clearly justify why each data set is needed and how it will be utilised in the analysis. Only then should you begin searching for suitable data collection tools that are appropriate for your study’s context.
