To measure something is to compare its quantity against some standard unit of the same phenomenon; a metre, for example, was originally compared against the standard metre bar kept in Paris. Many ‘measures’ used in rehabilitation are instead counts of items that are, or are not, considered passed in some way. For example, the Barthel Activities of Daily Living index is a count of 20 different states covering 10 domains. These counts quantify, but not against any standard. This page discusses various aspects of measurement as undertaken within rehabilitation, concluding that measurement in day-to-day clinical practice is probably over-valued, because it pays insufficient regard to a patient’s experience, values, or wishes.
Why measure in rehabilitation?
Measurement in research is essential. Quantification of factors within the biopsychosocial model allows investigation of the inter-relationships between factors:
- at a particular time, for example discovering how much leg weakness alters gait speed, or hand sensory loss affects dexterity;
- over time, for example discovering whether urinary incontinence three days after stroke is related to independence in dressing three months after stroke;
- in terms of combinations, for example exploring whether the presence of pain in a shoulder increases or decreases gait speed.
Quantification of factors also allows comparisons to be made, between:
- individuals who differ in some way, such as men and women, or people who have or do not have anyone living at home with them;
- individuals who received programme ‘A’ or programme ‘B’ in a randomised trial, which is the most important comparison of all;
- an individual at different times after an event, in other words studying the natural history of change – recovery or deterioration – either related to onset, if acute, or to enrolment into a study if slowly progressive.
If these data are collected within high-quality cohort studies involving representative samples, one can establish the natural history of change and what factors, if any, relate to prognosis. Next, one may also identify particular factors that, if altered, might improve outcome. Third, one can establish whether or not specific interventions are likely to be beneficial, harmful, or just ineffective.
Three types of measure
This section will outline the three broad categories of measure used in rehabilitation.
The first group are those that do quantify something against a standard. The commonest example of a standard used to quantify a measure, often overlooked, is time. One can measure the time it takes to undertake a fixed task repeatedly, and the variation is measured in fixed units (seconds, minutes, etc.). The great advantage of using time is that, when measuring for a particular patient, one can choose a relevant task. It could be how long it takes to walk to the post office and back, how long it takes to get dressed, or how long it takes to type out a fixed paragraph. The measure is time, not the task.
There are other examples such as strength, weight, forced expiratory volume, and distance walked in six minutes. All are measured against a standard.
Timing performance of an activity, or counting repetitions performed in a certain time, is an effective way of quantifying performance, both in research and clinically.
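As a minimal sketch, summarising repeated timings of a fixed task might look like the following. The task, the number of trials, and the timings are all hypothetical; the point is only that time yields a quantity in standard units whose average and variability can be reported directly.

```python
from statistics import mean, stdev

def summarise_timed_task(times_seconds):
    """Summarise repeated timings of one fixed task: mean, spread,
    and coefficient of variation (spread relative to the mean)."""
    m = mean(times_seconds)
    sd = stdev(times_seconds)
    cv = sd / m * 100  # coefficient of variation, as a percentage
    return {"mean_s": m, "sd_s": sd, "cv_pct": cv}

# Hypothetical data: five timed attempts at getting dressed, in seconds
trials = [312.0, 298.5, 305.2, 321.8, 300.1]
summary = summarise_timed_task(trials)
print(f"Mean {summary['mean_s']:.1f} s, CV {summary['cv_pct']:.1f}%")
```

Because the unit (seconds) is standard, the same summary works whether the chosen task is walking to the post office or typing a fixed paragraph.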
The second group is counting defined items. This is the most commonly used type of measure. For example, this may be the number of activities that a person can do independently, from a defined list of activities. (As an example, see the Rivermead Mobility Index here.)
Sometimes an activity may have more than the two default classes of ‘unable’ (scoring 0) and ‘able’ (scoring 1): for example, a person may be able to do half the task (score 1) or the whole task (score 2), giving a score of 0, 1, or 2 for that activity. The activities should all relate to some clinically relevant construct such as mobility, or domestic activities. (As an example, see the Barthel ADL index scale here.)
Likert scales, which rate an answer across five or seven ordered categories, such as from “never” to “all the time”, again simply increase the number of items counted: each successive category adds one more countable step to the total.
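The counting described above can be sketched in a few lines. The three-item checklist below is hypothetical (it is not the Rivermead or Barthel scoring); it only illustrates that the total score of a multi-item index is nothing more than a sum of counted steps per item.

```python
# Hypothetical three-item mobility-style checklist; each item is scored
# 0 (unable), 1 (can do part of the task), or 2 (can do the whole task).
ITEMS = ["transfer from bed to chair", "walk indoors", "climb stairs"]

def total_score(item_scores):
    """The index total is simply the sum of the counted steps per item."""
    assert len(item_scores) == len(ITEMS), "one score per item"
    assert all(s in (0, 1, 2) for s in item_scores), "scores are 0, 1, or 2"
    return sum(item_scores)

print(total_score([2, 1, 0]))  # → 3, out of a maximum of 6
```

Note that the total quantifies nothing against any external standard: a score of 3 could arise from several different combinations of abilities.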
This group of measures covers a wide range of impairments, activities, and social activities. Some, such as the Ashworth Scale (for spasticity) or Medical Research Council muscle strength scale, depend upon an examiner’s judgement. Most depend upon a patient’s answer to a question, or the way that a carer or clinician answers a question.
For the purposes of the next discussion, we will assume that patients, carers, and clinicians answer honestly and that no-one is wilfully and knowingly attempting to mislead or deceive.
There are three important points to recognise about this type of measure. The first is that, in almost all cases, the countable item is a behaviour: saying ‘yes’ or ‘no’, or ticking a particular box. This leads to the second point: the behaviours are inevitably subject to bias within the person giving the answer. It also leads to the third point: the item’s answer is almost always the result of a judgement made by the respondent.
In one subcategory within this group, the questionnaires that concern performance of activities, it is in principle possible to verify answers through observation. One could watch someone dress, or cook a meal. Of course the very fact of observing someone may itself alter performance, which weakens the principle.
In reality, measures of this type are, in many cases, extremely good when used in research, despite all the potential weaknesses and problems mentioned.
The third group relies entirely on the patient’s report of experiences, rated either on a numerical rating scale (usually 0-10, but others are used) or on a visual analogue scale (usually 100 mm long). These can be used to rate more or less any construct wanted. When used, it is important to consider the end anchor points (e.g. “no pain at all” to “the worst pain I can imagine”) and to specify them in any research report or clinical letter.
This group of measures has two advantages. It can be used to rate whatever is wanted, provided the patient experiences the phenomenon and has the cognitive and communicative ability to provide a rating. It also reflects the patient’s perspective. These scales can also be used by carers and others, not only to rate the carer’s own experience, but also to rate the carer’s judgement of what the patient’s experience is. The validity of such surrogate responses is debatable.
In research studies these can be useful; in studies of pain they are often the primary outcome measure. In clinical practice, they can be accepted as a true reflection of a patient’s experience, but clinical judgement is required to interpret their meaning in the overall clinical context. For example, the experience of pain may be due to anxiety or depression rather than to tissue damage.
This does not necessarily mean that they are equally good in an individual clinical case, when many other and different biasing factors come into play. If gaining a benefit depends upon how you answer, a patient’s judgement and response may be biased towards the more dependent rather than the less dependent answer.
The use of visual or numerical rating scales to measure a patient’s experience or perception has a long pedigree in research and, less so, in clinical practice. It allows great flexibility, as it can be adapted to measure many difficult-to-measure constructs. Interpretation of change or differences is less easy.
Choosing a measure
When selecting a particular measure for some particular purpose, there are a series of questions that need to be asked:
- will the data collected give the information I need? This relates to validity: is it valid for my purpose? It is important to note that validity is set in the context of the purpose.
- what are the error limits of the measure? No measure is perfect, with errors being introduced through the observer, the patient, and the tool itself. Collectively these are often referred to as reliability, although repeatability is a more precise word.
- will the measure detect the difference I expect (or hope) to find? This relates to its sensitivity, which may be between groups of people, or in change seen in a person or group over time.
- can the measure be used? This relates to feasibility, and covers many practical issues such as time taken, effort required, simplicity, training needed, etc. The most important aspect is acceptability to the patient, who needs to understand why it is useful and to be willing to put in the effort.
- how will the data be recorded and analysed? This is a subset of feasibility, and concerns the nature of the data and what statistical manipulations are available. Statistical ‘rules’ are reasonably flexible, and methods are reasonably robust, so this is a lesser consideration in most instances.
There are thousands of measures available, and most phenomena that anyone might wish to measure may have more than ten measures that could be used. Therefore there are further questions to ask:
- what measures already exist measuring the construct I am interested in? These do not need to be specific to the condition you are interested in, but simply need to measure the construct. For example, a measure of walking will work for anyone who can walk, regardless of their condition.
- how do the identified measures compare on the criteria of feasibility, validity, sensitivity, and repeatability? (Usually in that order of importance.)
- is the best or most accessible one good enough for my purpose? It is important to appreciate that perfection is rarely achieved, and that adding one more new measure to the thousands of measures already in existence is probably a waste of your time.
When choosing a measure for research it is important to take an overview, and not to be obsessed with, for example, a particular psychometric property such as the availability of studies on validity. In a research study or audit, it is much better to have a complete data-set on everyone where the data appears ‘less than perfect‘ than it is to have only 70% of data-sets complete albeit with ‘perfect data‘.
It is always wise to look at what other people researching your topic have used, and to use the best of those if possible, because it allows direct comparison and, eventually, meta-analysis.
Systematic reviews may help. There are some systematic reviews comparing measures, and, if one is available, it should be used to select a measure. There will almost certainly be a systematic review of whatever topic you are researching, and that will usually also list the measures used in the studies. You should consider them.
It is also important to focus on what you really want to know. If your question concerns solely the ability to dress, then do not use a measure such as the Barthel ADL index as dressing is only one item of ten (this is a real example). And if the question is something like “Does the patient go shopping to buy all the food they need?“, then be prepared simply to ask that question. But do consider whether it is “to go” that you are interested in, or simply the arrival of the shopping (ordered on-line and delivered).
Last, do not forget that a measure is only a tool, something that provides you with data. It cannot answer any question. The use and interpretation of the data depends upon the design of the study, the characteristics of the population, the context and many other factors.
Clinical use of a measure
Clinically, the term assessment is used as if it were synonymous with measurement. As this section covers the clinical use of measures, which are often referred to as assessments, it will first discuss the term assessment.
The term assessment has two meanings. The first is the process of collecting information as a prelude to making a formulation. This process is covered in two other pages, here and here. It may entail the use of data-collection tools (called ‘assessments‘) to collect the data in a structured way. This meaning, referring to a process, will not be discussed here.
The other meaning is one of several words used for structured data-collection tools or instruments; other terms include measure and assessment tool. This meaning encompasses everything from a two-hour session using the Rivermead Perceptual Assessment Battery, through to timing a patient walking ten metres or measuring (‘assessing’) their weight. Assessment is used almost interchangeably with measurement, although measurement is much more specifically concerned with quantification, whereas assessment includes collecting categorical data and the process of analysis.
This section will now consider the use of data-collection tools clinically as a part of the process of assessment, which is a necessary initial part of the formulation of a patient’s position. These data-collection tools may collect simple, categorical data, or may quantify the data in one way or another as discussed earlier.
Under these circumstances one must recognise the following facts.
The process has only one goal: understanding the situation. Collecting data that does not serve this purpose wastes your time and the patient’s time. Formal, structured data-collection tools that cover many items of data often include items that are irrelevant. Two reasons are usually given to people who question collecting the superfluous data.
“They will ensure we do not miss anything.” This is only acceptable if the person stating that has good evidence that the data item is sufficiently sensitive and specific in the context of the patient population being seen to justify using it. It almost always cannot be justified on that basis.
“The items will help in understanding (or auditing) our service.” This is only acceptable if there is a genuine audit process being carried out, and the value of every data item collected has been fully established. Even then, given that a significant if not large proportion of patients will have large amounts of data missing, it is difficult to justify.
In other words, arguing that items should be collected “because they might be useful” is very weak, and may lead to people not having the time to collect other important data, or not bothering to collect the data that is important.
All data-collection tools have an error rate. Using them as measures in research is reasonable because (a) the process of collection is likely to be more standardised and more careful and (b) no decisions are being made about individual patients.
However, in the context of an individual patient being seen clinically, measurement is much less dependable. For many measures a 20% difference is within experimental error. Therefore using any quantified data to make decisions needs to be placed into the context of all information, and it is rarely acceptable to make a decision on the basis of a single quantified piece of information.
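The point that a measured change may lie within experimental error can be sketched as a simple check. The 20% error margin below is taken as an illustrative default from the figure mentioned above; it is not a property of any specific measure, and the walking-speed values are hypothetical.

```python
def change_exceeds_error(before, after, error_fraction=0.20):
    """Return True only if the observed change is larger than an
    assumed measurement-error margin (a fraction of the baseline).
    The 20% default is illustrative, not a validated property."""
    if before == 0:
        raise ValueError("baseline of zero: use an absolute error margin")
    relative_change = abs(after - before) / before
    return relative_change > error_fraction

# Hypothetical walking speeds (m/s) before and after an intervention
print(change_exceeds_error(0.50, 0.55))  # 10% change: within error, False
print(change_exceeds_error(0.50, 0.65))  # 30% change: exceeds error, True
```

Even when a change does exceed the error margin, the result is only one piece of information, to be weighed alongside everything else known about the patient.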
Therefore, when undertaking the assessment process, the data collected should be determined by the data already known. Algorithms can be helpful, such as “if the patient scores less than 24 on a short orientation-memory-concentration test, then further assessment of cognition or language should be considered”. Even then, the clinical context will determine whether language or cognition is the more appropriate next area to be assessed. When measuring, the data should always be interpreted and used within the context of all relevant information, not on its own.
To conclude, measures (data-collection tools) used in a research or formal audit process are not necessarily useful or usable in a clinical context. Relatively short and simple structured data-collection tools focused on a restricted construct, such as walking, or short-term memory are likely to be helpful. An assessment proforma with many items covering many constructs, and many domains of the biopsychosocial model is unlikely to be useful. All data collected must be interpreted in the clinical context.
This overview of measurement in the context of research, audit, and clinical practice has drawn attention to some points often overlooked. The word assessment refers both to a process and to a data-collection tool. Measures provide quantified data. Timing of activities is an under-recognised means of quantification. Multi-item questionnaires are data-collection tools where data are quantified simply by counting ‘positive’ items. The answers to the questions always depend upon someone’s judgement, record someone’s behaviour, and are subject to bias and uncertainty. A measure should be selected on its feasibility and its ability to provide the data needed to answer the question being asked. Good research measures are not necessarily good in clinical practice. Clinical decisions require interpretation of all available data, and they should rarely if ever be determined by one measure or data-item.