Dispersion, Shape, and Normality in Appraisal Data

Introduction
This chapter focuses on understanding the dispersion, shape, and normality of appraisal data. These concepts are crucial for selecting appropriate statistical methods and making valid inferences about the real estate market. We will explore various measures of dispersion and shape, along with methods for assessing normality, and their practical applications in real estate appraisal.
Measures of Dispersion
Measures of dispersion quantify the spread or variability of data. Understanding dispersion is vital because it indicates the homogeneity of the data and influences the reliability of statistical inferences. High dispersion suggests greater heterogeneity, potentially requiring different analytical approaches.
Standard Deviation and Variance
The standard deviation and variance are fundamental measures of dispersion, reflecting how data points deviate from the mean.
- Variance: The variance measures the average squared deviation from the mean.
- Population Variance (σ²):
σ² = Σ(xi - μ)² / N
where:
- xi = each data point in the population
- μ = population mean
- N = population size
- Sample Variance (S²):
S² = Σ(xi - X)² / (n - 1)
where:
- xi = each data point in the sample
- X = sample mean
- n = sample size
Note the use of (n - 1) instead of n; this adjustment (Bessel's correction) makes S² an unbiased estimator of the population variance.
- Standard Deviation: The standard deviation is the square root of the variance. It provides a more interpretable measure of spread in the original units of the data.
- Population Standard Deviation (σ):
σ = √[Σ(xi - μ)² / N]
- Sample Standard Deviation (S):
S = √[Σ(xi - X)² / (n - 1)]
Practical Application and Related Experiment
Consider a dataset of recent sale prices of similar residential properties in a neighborhood.
Property | Sale Price ($)
---|---
1 | 350,000
2 | 375,000
3 | 400,000
4 | 425,000
5 | 450,000
Calculate the sample mean (X), sample variance (S²), and sample standard deviation (S).
- Sample Mean (X): (350,000 + 375,000 + 400,000 + 425,000 + 450,000) / 5 = 400,000
- Sample Variance (S²):

Property | Sale Price (xi) | (xi - X) | (xi - X)²
---|---|---|---
1 | 350,000 | -50,000 | 2,500,000,000
2 | 375,000 | -25,000 | 625,000,000
3 | 400,000 | 0 | 0
4 | 425,000 | 25,000 | 625,000,000
5 | 450,000 | 50,000 | 2,500,000,000

S² = (2,500,000,000 + 625,000,000 + 0 + 625,000,000 + 2,500,000,000) / (5 - 1) = 1,562,500,000
- Sample Standard Deviation (S):
S = √(1,562,500,000) = 39,528.47
This indicates that, on average, sale prices deviate from the mean by approximately $39,528.47.
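The worked example above can be checked with Python's standard `statistics` module, which uses the same (n - 1) divisor for the sample variance:

```python
import statistics

# Sale prices from the worked example above
prices = [350_000, 375_000, 400_000, 425_000, 450_000]

mean = statistics.mean(prices)     # 400,000
var = statistics.variance(prices)  # sample variance (n - 1 divisor): 1,562,500,000
sd = statistics.stdev(prices)      # square root of the variance, ~39,528.47

print(mean, var, round(sd, 2))
```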
Coefficient of Variation
The coefficient of variation (CV) is a relative measure of dispersion, expressing the standard deviation as a percentage of the mean. This allows for comparing the variability of datasets with different units or scales.
CV = (S / X) * 100
where:
- S = sample standard deviation
- X = sample mean
Practical Application
Compare the variability of sale prices in two neighborhoods.
- Neighborhood A: Mean sale price = $400,000, Standard deviation = $40,000
- Neighborhood B: Mean sale price = $800,000, Standard deviation = $60,000
CV (Neighborhood A) = (40,000 / 400,000) * 100 = 10%
CV (Neighborhood B) = (60,000 / 800,000) * 100 = 7.5%
Although Neighborhood B has a higher standard deviation, Neighborhood A has a higher relative variability in sale prices, as indicated by the CV.
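A minimal Python sketch of the CV comparison, using the neighborhood figures from the example (the function name is illustrative):

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """CV as a percentage: relative dispersion, independent of scale."""
    return std_dev / mean * 100

# Figures from the two-neighborhood example above
cv_a = coefficient_of_variation(40_000, 400_000)  # 10.0%
cv_b = coefficient_of_variation(60_000, 800_000)  # 7.5%
print(cv_a, cv_b)
```

Because the CV is unitless, the same function could compare, say, price-per-square-foot data against total sale prices.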
Range
The range is the simplest measure of dispersion, calculated as the difference between the maximum and minimum values in a dataset.
Range = Maximum Value - Minimum Value
Practical Application
Consider a dataset of appraisal values for a commercial property. If the highest appraised value is $1,200,000 and the lowest is $1,000,000, the range is $200,000.
Interquartile Range
The interquartile range (IQR) measures the spread of the middle 50% of the data, providing a more robust measure of dispersion than the range, as it is less sensitive to outliers. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1).
IQR = Q3 - Q1
Determining Quartiles
Given n observations arranged in ascending order, the position of quartiles can be determined as follows:
- Q1 Position: (n + 1) / 4
- Q2 Position (Median): 2(n + 1) / 4 = (n+1)/2
- Q3 Position: 3(n + 1) / 4
If the position is not an integer, interpolation is used to find the quartile value.
Practical Application
Given the following sorted dataset of land values per square foot:
10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35
n = 11
- Q1 Position: (11 + 1) / 4 = 3. Q1 = 15
- Q2 Position: (11+1)/2 = 6. Q2 = 22
- Q3 Position: 3(11 + 1) / 4 = 9. Q3 = 30
IQR = 30 - 15 = 15. This means the middle 50% of the land values have a range of $15 per square foot.
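The (n + 1)-position quartile convention described above, including linear interpolation for non-integer positions, can be sketched in Python (the `quartile` helper is illustrative, not a standard API):

```python
def quartile(sorted_data, q):
    """Quartile q (1, 2, or 3) by the (n + 1)-position convention,
    with linear interpolation when the position is not an integer."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4
    lo = int(pos)            # integer part of the 1-based position
    frac = pos - lo
    if frac == 0:
        return sorted_data[lo - 1]
    # interpolate between the two surrounding observations
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

# Sorted land values per square foot from the example above
land_values = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35]
q1 = quartile(land_values, 1)   # position 3 -> 15
q3 = quartile(land_values, 3)   # position 9 -> 30
print(q1, q3, q3 - q1)          # IQR = 15
```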
Measures of Shape
Measures of shape describe the symmetry and peakedness of a data distribution. These measures help assess how closely a dataset resembles a normal distribution.
Skewness
Skewness measures the asymmetry of a distribution.
- Symmetrical Distribution: Skewness = 0. The mean and median are equal.
- Left-Skewed (Negatively Skewed) Distribution: Skewness < 0. The mean is less than the median. The tail is longer on the left side.
- Right-Skewed (Positively Skewed) Distribution: Skewness > 0. The mean is greater than the median. The tail is longer on the right side.
Skewness = [n / ((n-1)(n-2))] * Σ[(xi - X) / S]³
where:
- X = sample mean
- n = sample size
- S = sample standard deviation
Practical Application
Analyze a dataset of time-on-market (in days) for residential properties. Positive skewness indicates that some properties take a very long time to sell, pulling the mean above the median; negative skewness would instead indicate a cluster of unusually fast sales pulling the mean below the median.
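The adjusted skewness formula above can be sketched in Python; the days-on-market figures below are hypothetical, chosen so that a single slow sale produces a long right tail:

```python
import statistics

def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# Hypothetical days-on-market data: one very slow sale creates a right tail
days = [20, 25, 30, 35, 40, 45, 180]
print(sample_skewness(days) > 0)  # positive: mean pulled above the median
```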
Kurtosis
Kurtosis measures the “peakedness” of a distribution and the thickness of its tails.
- Mesokurtic: Kurtosis ≈ 3. Similar peakedness to a normal distribution.
- Leptokurtic: Kurtosis > 3. More peaked than a normal distribution, with heavier tails (more extreme values).
- Platykurtic: Kurtosis < 3. Less peaked than a normal distribution, with thinner tails (fewer extreme values).
Kurtosis = {n(n+1) / [(n-1)(n-2)(n-3)]} * Σ[(xi - X) / S]^4 - [3(n-1)² / ((n-2)(n-3))]
Note that this sample formula computes excess kurtosis: the final term subtracts the normal-distribution benchmark of 3, so values near 0 indicate a mesokurtic distribution, positive values a leptokurtic one, and negative values a platykurtic one.
Practical Application
Analyze a dataset of vacancy rates for commercial properties. A leptokurtic distribution would indicate that most vacancy rates cluster tightly around the mean while a few extreme values populate the heavy tails.
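A sketch of the excess-kurtosis formula in Python, applied to hypothetical vacancy rates that include one heavy-tail value (the function name and data are illustrative):

```python
import statistics

def excess_kurtosis(data):
    """Adjusted sample excess kurtosis (the formula above); ~0 for normal data."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    term = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * term \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

# Hypothetical vacancy rates (%): tightly clustered, with one extreme value
rates = [5, 6, 6, 7, 7, 7, 8, 8, 9, 25]
print(excess_kurtosis(rates) > 0)  # positive: leptokurtic, heavy right tail
```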
Box and Whisker Plot
A box and whisker plot (box plot) is a graphical representation of the five-number summary: minimum, Q1, median, Q3, and maximum. It provides a visual representation of the distribution’s shape, skewness, and potential outliers.
Practical Application
Creating a box plot of appraisal values can visually identify skewness and outliers. A longer whisker on one side of the box indicates skewness in that direction. Points outside the whiskers are considered potential outliers.
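The five-number summary behind a box plot can be computed with the standard `statistics` module; `statistics.quantiles` with its default "exclusive" method matches the (n + 1)-position convention used earlier in this chapter:

```python
import statistics

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum: the ingredients of a box plot."""
    s = sorted(data)
    # default method='exclusive' uses (n + 1)-based positions
    q1, q2, q3 = statistics.quantiles(s, n=4)
    return s[0], q1, q2, q3, s[-1]

# Land values per square foot from the quartile example
values = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35]
print(five_number_summary(values))  # min 10, Q1 15, median 22, Q3 30, max 35
```

In practice these five numbers feed directly into a plotting routine; values beyond Q1 - 1.5·IQR or Q3 + 1.5·IQR are the usual candidates for the outlier points drawn past the whiskers.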
Normality
Normality refers to whether a dataset follows a normal distribution (also known as Gaussian distribution). The normal distribution is symmetrical, bell-shaped, and characterized by its mean (μ) and standard deviation (σ). Many statistical tests assume normality.
Assessing Normality
Several methods can be used to assess normality:
1. Visual Inspection:
- Histograms: Check for a bell-shaped, symmetrical distribution.
- Normal Probability Plots (Q-Q plots): Check whether the data points fall close to a straight line. Deviations from the line indicate departures from normality.
- Box Plots: Check for symmetry and outliers.
2. Quantitative Tests:
- Shapiro-Wilk Test: Tests the null hypothesis that the data is normally distributed. A p-value less than a chosen significance level (e.g., 0.05) leads to rejection of the null hypothesis, suggesting non-normality.
- Kolmogorov-Smirnov Test: Similar to Shapiro-Wilk, but generally less powerful.
- Anderson-Darling Test: Another test for normality, often used as an alternative to Shapiro-Wilk.
3. Rule of Thumb Using Standard Deviation:
In a normal distribution, approximately 68% of the data falls within ±1 standard deviation of the mean, 95% within ±2 standard deviations, and 99.7% within ±3 standard deviations. Substantial departures from these percentages suggest non-normality.
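The 68-95-99.7 rule of thumb can be checked directly in Python; the appraisal values below are hypothetical:

```python
import statistics

def empirical_rule_check(data):
    """Share of observations within 1, 2, and 3 standard deviations of the
    mean; compare against the normal benchmarks of ~68%, ~95%, ~99.7%."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    n = len(data)
    return tuple(
        sum(abs(x - mean) <= k * sd for x in data) / n
        for k in (1, 2, 3)
    )

# Hypothetical appraisal values ($000s)
values = [380, 390, 395, 400, 400, 405, 410, 420, 400, 405]
within_1sd, within_2sd, within_3sd = empirical_rule_check(values)
print(within_1sd, within_2sd, within_3sd)
```

For small samples such as this one, the observed shares can differ noticeably from 68/95/99.7 even for normal data, so the rule is a screening device rather than a formal test.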
Normal Probability Plot
A normal probability plot graphs observed data values against expected values from a normal distribution. If the data is normally distributed, the points will fall approximately along a straight line.
Practical Application
Create a normal probability plot of a dataset of rents of commercial buildings. If the points deviate significantly from a straight line, especially in the tails, it suggests that the rent data is not normally distributed.
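The coordinate pairs behind a normal probability plot can be computed with `statistics.NormalDist` from the standard library; the rent figures and the (i - 0.5)/n plotting positions are illustrative choices:

```python
import statistics

def qq_points(data):
    """(theoretical normal quantile, observed value) pairs for a normal
    probability plot, using (i - 0.5)/n plotting positions."""
    s = sorted(data)
    n = len(s)
    nd = statistics.NormalDist(statistics.mean(s), statistics.stdev(s))
    return [(nd.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(s, start=1)]

# Hypothetical commercial rents ($/sq ft); near-normal data hugs a straight line
rents = [18, 20, 21, 22, 22, 23, 24, 26]
for theoretical, observed in qq_points(rents):
    print(round(theoretical, 2), observed)
```

Plotting these pairs and looking for curvature, especially in the tails, gives the visual diagnostic described above.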
Central Limit Theorem and Inference
The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the underlying population distribution. This is crucial for statistical inference.
Even if the original data is not normally distributed, the distribution of sample means will tend towards normality as the sample size grows. This allows us to use statistical tests that assume normality (e.g., t-tests, ANOVA) even when the underlying data is non-normal, provided the sample size is sufficiently large (generally, n > 30 is considered sufficient).
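A quick simulation illustrates the CLT: even when individual observations come from a heavily skewed exponential population, the means of samples of size 40 cluster tightly and symmetrically around the population mean. The seed and parameters below are arbitrary:

```python
import random
import statistics

random.seed(42)

def draw_sale_time():
    """One draw from a heavily right-skewed population (exponential, mean 50)."""
    return random.expovariate(1 / 50)

# Distribution of sample means for 2,000 samples of size 40
sample_means = [
    statistics.mean(draw_sale_time() for _ in range(40))
    for _ in range(2000)
]

# The sample means center on the population mean of 50, and their spread
# is far smaller than the population's (roughly 50 / sqrt(40)).
print(round(statistics.mean(sample_means), 1))
```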
Parametric and Nonparametric Statistics
- Parametric Statistics: These methods rely on assumptions about the distribution of the underlying population, often assuming normality. Examples include t-tests, ANOVA, and regression analysis.
- Nonparametric Statistics: These methods do not require assumptions about the population distribution. They are useful when data is non-normal or when dealing with small sample sizes. Examples include the Mann-Whitney U test, Wilcoxon signed-rank test, and Kruskal-Wallis test.
Practical Application
- If appraisal data is normally distributed, use parametric tests (e.g., t-test) to compare means.
- If appraisal data is non-normal, use nonparametric tests (e.g., Mann-Whitney U test) to compare medians.
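A minimal sketch of the Mann-Whitney U statistic, counting the pairs in which one sample's value exceeds the other's; in practice one would use a library routine such as scipy.stats.mannwhitneyu, which also supplies the p-value. The submarket data below is hypothetical:

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: count of pairs (x in a, y in b) with
    x > y, counting ties as 1/2. Extreme U values (near 0 or near
    len(a) * len(b)) suggest the two groups' locations differ."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical price-per-square-foot samples from two submarkets
submarket_a = [110, 115, 120, 125, 140]
submarket_b = [130, 135, 145, 150, 160]
u_a = mann_whitney_u(submarket_a, submarket_b)
u_b = mann_whitney_u(submarket_b, submarket_a)
print(u_a, u_b)  # the two statistics always sum to len(a) * len(b)
```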
Conclusion
Understanding dispersion, shape, and normality is essential for effective statistical analysis in real estate appraisal. By using the appropriate measures and tests, appraisers can make more informed decisions and draw more accurate conclusions about property values and market trends. When data deviates significantly from normality, consider using nonparametric methods or larger sample sizes to ensure the validity of statistical inferences.
Chapter Summary
This chapter, “Dispersion, Shape, and Normality in Appraisal Data,” from a training course on Statistical Analysis for Real Estate Appraisal, focuses on understanding the distribution of appraisal data to facilitate appropriate statistical inference. The chapter begins by describing the importance of understanding measures of dispersion, such as standard deviation and variance, to compare data sets and determine if parametric statistical methods based on the normal distribution can be applied. The standard deviation, calculated as the square root of the variance, is highlighted for its utility in further statistical analysis.
The chapter then introduces the concept of the coefficient of variation (CV) to compare relative dispersion among different data sets, calculated as the standard deviation divided by the mean, expressed as a percentage. It also discusses the range and interquartile range as simple measures of spread, emphasizing their relationship to the standard deviation in a normal distribution.
Moving beyond dispersion, the chapter delves into measures of shape, specifically skewness and kurtosis, to assess how closely a data distribution approximates a normal distribution. Skewness, which indicates the symmetry of the distribution, is examined using box and whisker plots and histograms, while kurtosis, which describes the peakedness of the distribution, is illustrated using examples of leptokurtic, mesokurtic (normal), and platykurtic curves.
The chapter also discusses normality, which is crucial for many parametric statistical tests. Quantitative tests and normal probability plots are presented as tools for assessing departures from normality. The Kolmogorov-Smirnov (KS) test is introduced, with its p-value used to determine whether the hypothesis of a normally distributed population can be rejected.
The chapter concludes by emphasizing the importance of understanding the shape of the data distribution to choose appropriate statistical tests. If the data deviates significantly from normality, nonparametric tests may be more suitable, especially for small samples. The chapter suggests that if extreme values distort the arithmetic mean, the median is likely a better measure of central tendency.