Measures of Dispersion and Shape: Assessing Normality

Introduction
In statistical analysis for real estate appraisal, understanding the distribution of data is crucial for making accurate inferences about property values and market trends. While measures of central tendency (mean, median, mode) provide information about the “typical” value, measures of dispersion and shape reveal the variability and form of the data, allowing us to assess how well the data conforms to a normal distribution. The normal distribution is a fundamental concept in statistics, and many parametric statistical tests rely on the assumption of normality. This chapter will cover key measures of dispersion and shape, and their role in assessing the normality of data sets commonly encountered in real estate appraisal.
1. Measures of Dispersion
Measures of dispersion quantify the spread or variability within a dataset. These measures are important because they provide insight into the range of values and how tightly clustered the data are around the central tendency. Comparing measures of dispersion to known distributions, such as the normal distribution, can help determine if parametric inferential statistics are appropriate.
1.1 Standard Deviation and Variance
- Definition: Variance and standard deviation are the most fundamental measures of dispersion. The variance quantifies the average squared deviation of each data point from the mean; the standard deviation is the square root of the variance, which returns the measure to the original units of the data.
- Formulas:
- Population Variance (σ²):
σ² = Σ(Xᵢ - μ)² / N
where:
- Xᵢ = individual data point
- μ = population mean
- N = population size
- Sample Variance (S²):
S² = Σ(Xᵢ - X̄)² / (n - 1)
where:
- Xᵢ = individual data point
- X̄ = sample mean
- n = sample size
- Population Standard Deviation (σ):
σ = √σ² = √[Σ(Xᵢ - μ)² / N]
- Sample Standard Deviation (S):
S = √S² = √[Σ(Xᵢ - X̄)² / (n - 1)]
- Interpretation:
- A larger standard deviation or variance indicates greater variability in the data.
- The standard deviation is in the same units as the original data, making it easier to interpret.
- Example: Consider the garden-level apartment rents in Table 14.1. The sample standard deviation (S) is calculated as $84.71 (see Table 14.2 for the detailed calculation).
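The sample variance and standard deviation can be computed directly from the definitions above. The sketch below uses a small set of hypothetical rents (illustrative values only, not the chapter's Table 14.1 data) and checks the manual calculation against the standard library:

```python
import statistics

# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]

n = len(rents)
x_bar = sum(rents) / n

# Sample variance: S² = Σ(Xᵢ - X̄)² / (n - 1)
s_squared = sum((x - x_bar) ** 2 for x in rents) / (n - 1)
s = s_squared ** 0.5  # sample standard deviation

# The standard library's functions use the same (n - 1) denominator
assert abs(s_squared - statistics.variance(rents)) < 1e-9
assert abs(s - statistics.stdev(rents)) < 1e-9

print(f"mean = {x_bar:.2f}, S² = {s_squared:.2f}, S = {s:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` compute the *sample* versions (denominator n − 1); the population versions are `statistics.pvariance` and `statistics.pstdev`.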
1.2 Coefficient of Variation
- Definition: The coefficient of variation (CV) is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean.
- Formula:
CV = (S / X̄) * 100
where:
- S = sample standard deviation
- X̄ = sample mean
- Interpretation:
- The CV is useful for comparing the variability of datasets with different units or scales.
- A higher CV indicates greater relative variability.
- Example: For the apartment rent data, the CV is ($84.71 / $815.83) * 100% = 10.38%.
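Because the CV is unit-free, it lets us compare the variability of datasets on very different scales, such as monthly rents versus sale prices. A minimal sketch, using hypothetical values for both samples:

```python
import statistics

# Hypothetical samples on very different scales (illustrative values)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]        # monthly rent, $
prices = [210_000, 235_000, 250_000, 265_000, 280_000, 310_000]   # sale price, $

def coefficient_of_variation(data):
    """CV = (S / X̄) * 100, expressed as a percentage."""
    return statistics.stdev(data) / statistics.mean(data) * 100

cv_rents = coefficient_of_variation(rents)
cv_prices = coefficient_of_variation(prices)

# The CV puts both datasets on a common, unit-free scale
print(f"CV(rents)  = {cv_rents:.2f}%")
print(f"CV(prices) = {cv_prices:.2f}%")
```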
1.3 Range
- Definition: The range is the simplest measure of dispersion, defined as the difference between the maximum and minimum values in the dataset.
- Formula:
Range = Maximum Value - Minimum Value
- Interpretation:
- The range provides a quick indication of the overall spread of the data.
- Sensitive to outliers, which can inflate the range.
- Example: For the apartment rent data, the range is $995 - $600 = $395.
1.4 Interquartile Range (IQR)
- Definition: The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of the data.
- Calculation:
- Order the data from smallest to largest.
- Q1 is the value that separates the bottom 25% of the data from the top 75%.
- Q2 is the median (the value that separates the bottom 50% from the top 50%).
- Q3 is the value that separates the bottom 75% of the data from the top 25%.
- IQR = Q3 - Q1
- Interpretation:
- The IQR represents the range of the middle 50% of the data.
- Less sensitive to outliers than the range.
- Example: For the apartment rent data, Q1 = $760, Q2 (median) = $825, and Q3 = $860. The IQR is $860 - $760 = $100.
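The quartiles and IQR can be obtained with the standard library's `statistics.quantiles`. One caveat worth flagging: different software packages use different interpolation rules for quartiles, so results can differ slightly from the chapter's figures. The data below are hypothetical:

```python
import statistics

# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]

# statistics.quantiles(..., n=4) returns [Q1, Q2, Q3]; note that different
# packages use different interpolation rules, so quartile values may vary
# slightly between software.
q1, q2, q3 = statistics.quantiles(rents, n=4)
iqr = q3 - q1

assert q2 == statistics.median(rents)  # Q2 is always the median
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```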
2. Measures of Shape
Measures of shape describe the overall form of the data distribution. They help determine whether the data is symmetrical or skewed, and whether it is peaked or flat.
2.1 Skewness
- Definition: Skewness measures the asymmetry of the data distribution.
- Types of Skewness:
- Symmetrical: The data is evenly distributed around the mean (skewness ≈ 0).
- Left-Skewed (Negatively Skewed): The tail of the distribution extends to the left (skewness < 0). The mean is typically less than the median.
- Right-Skewed (Positively Skewed): The tail of the distribution extends to the right (skewness > 0). The mean is typically greater than the median.
- Formula:
Skewness = [n / ((n - 1)(n - 2))] * Σ[(Xᵢ - X̄) / S]³
where:
- Xᵢ = individual data point
- X̄ = sample mean
- S = sample standard deviation
- n = sample size
- Interpretation:
- A skewness value close to zero indicates a symmetrical distribution.
- A negative skewness indicates left skewness.
- A positive skewness indicates right skewness.
- Example: The apartment rent data has a skewness of -0.312, indicating a slight left skew.
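The skewness formula above translates directly into code. The following sketch applies it to a hypothetical rent sample (not the chapter's data, so the value differs from the −0.312 cited above):

```python
# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]

n = len(rents)
x_bar = sum(rents) / n
s = (sum((x - x_bar) ** 2 for x in rents) / (n - 1)) ** 0.5

# Sample skewness: [n / ((n - 1)(n - 2))] * Σ[(Xᵢ - X̄) / S]³
skew = n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in rents)

print(f"skewness = {skew:.3f}")  # negative value → left (negative) skew
```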
2.2 Kurtosis
- Definition: Kurtosis measures the “peakedness” or “tailedness” of the data distribution.
- Types of Kurtosis:
- Mesokurtic: The kurtosis is similar to that of a normal distribution (kurtosis ≈ 3).
- Leptokurtic: The distribution is more peaked and has heavier tails than a normal distribution (kurtosis > 3).
- Platykurtic: The distribution is flatter and has thinner tails than a normal distribution (kurtosis < 3).
- Interpretation:
- A kurtosis value around 3 indicates a shape similar to the normal distribution.
- A high kurtosis indicates a peaked distribution with more extreme values.
- A low kurtosis indicates a flatter distribution with fewer extreme values.
- Example: The apartment rent data has a reported kurtosis of 0.42, indicating a platykurtic distribution (less peaked than a normal distribution) under the convention above. Be aware, however, that many statistical packages report *excess* kurtosis (kurtosis − 3), for which a normal distribution yields 0; always check which convention your software uses before interpreting the value.
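A moment-based kurtosis calculation, under the convention that a normal distribution has kurtosis 3, can be sketched as follows (hypothetical data; the excess-kurtosis line shows the alternative convention many packages report):

```python
# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]

n = len(rents)
x_bar = sum(rents) / n

# Moment-based kurtosis: m4 / m2², which equals 3 for a normal distribution
m2 = sum((x - x_bar) ** 2 for x in rents) / n
m4 = sum((x - x_bar) ** 4 for x in rents) / n
kurt = m4 / m2 ** 2

excess = kurt - 3  # the "excess kurtosis" convention used by many packages

print(f"kurtosis = {kurt:.3f}, excess kurtosis = {excess:.3f}")
```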
3. Assessing Normality
Normality assessment is a critical step in statistical analysis, especially when using parametric tests that assume a normal distribution.
3.1 Empirical Rule and Standard Deviations
- Rule: For a normally distributed dataset, approximately:
- 68% of the data falls within ± 1 standard deviation of the mean.
- 95% of the data falls within ± 2 standard deviations of the mean.
- 99.7% of the data falls within ± 3 standard deviations of the mean.
- Application: Calculate the percentage of data points falling within these ranges and compare them to the expected percentages for a normal distribution.
- Example: For the apartment rent data, 69% of the observations lie within ± 1 standard deviation of the mean and 94% within ± 2 standard deviations, close to the 68% and 95% expected for a normal distribution.
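Checking the empirical rule amounts to counting the share of observations inside each band around the mean. A minimal sketch on hypothetical rents:

```python
import statistics

# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]
x_bar = statistics.mean(rents)
s = statistics.stdev(rents)

def share_within(data, mean, sd, k):
    """Fraction of observations within ±k standard deviations of the mean."""
    return sum(mean - k * sd <= x <= mean + k * sd for x in data) / len(data)

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    actual = share_within(rents, x_bar, s, k)
    print(f"within ±{k} SD: {actual:.0%} (normal expects ~{expected:.1%})")
```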
3.2 Range and Interquartile Range
- Relationship to Standard Deviation:
- For a normal distribution, the range is approximately equal to 6 standard deviations (± 3 standard deviations from the mean).
- The IQR is approximately equal to 1.33 standard deviations.
- Application: Compare the actual range and IQR to these expected values.
- Example: The range for the apartment rent data is 4.66 standard deviations (less than the expected 6 for a normal distribution), and the IQR is 1.18 standard deviations (close to the expected 1.33).
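The same comparison is easy to automate: express the range and IQR in standard-deviation units and compare them to the normal benchmarks of roughly 6 and 1.33. A sketch on hypothetical data (so the ratios differ from the chapter's 4.66 and 1.18):

```python
import statistics

# Hypothetical rent sample (illustrative; not the chapter's Table 14.1 data)
rents = [600, 650, 700, 750, 800, 820, 850, 875, 900, 995]
s = statistics.stdev(rents)

rng = max(rents) - min(rents)
q1, _, q3 = statistics.quantiles(rents, n=4)
iqr = q3 - q1

# For a normal distribution: range ≈ 6·S, IQR ≈ 1.33·S
print(f"range = {rng / s:.2f} SDs (normal benchmark ≈ 6)")
print(f"IQR   = {iqr / s:.2f} SDs (normal benchmark ≈ 1.33)")
```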
3.3 Visual Inspection: Histograms and Boxplots
- Histograms: Examine the shape of the histogram for symmetry and bell-shapedness.
- Boxplots: Boxplots visually represent the median, quartiles, and outliers. Assess the symmetry of the box and the length of the whiskers.
- Interpretation: Departures from symmetry or unusually long whiskers may indicate non-normality.
- Example: The boxplot of the apartment rent data (Figure 14.3) shows a slight left skewness.
3.4 Normal Probability Plots (Q-Q Plots)
- Construction: A Q-Q plot graphs the quantiles of the dataset against the quantiles of a normal distribution.
- Interpretation:
- If the data is normally distributed, the points on the Q-Q plot will fall along a straight line.
- Deviations from the straight line indicate non-normality.
- Example: The normal probability plot for the apartment rent data (Figure 14.6) shows a reasonably close fit to the straight line, suggesting approximate normality.
3.5 Normality Tests
- Kolmogorov-Smirnov (K-S) Test: Tests the null hypothesis that the data comes from a specified distribution (in this case, the normal distribution).
- Shapiro-Wilk Test: A more powerful test than K-S, especially for smaller sample sizes.
- Anderson-Darling Test: Another powerful test that is sensitive to deviations in the tails of the distribution.
- Interpretation:
- The p-value from these tests is the probability of observing a test statistic at least as extreme as the one computed from the sample, assuming the null hypothesis (normality) is true.
- If the p-value is less than a predetermined significance level (e.g., 0.05), the null hypothesis of normality is rejected.
- Example: The p-value from the K-S test in Figure 14.6 is 0.15; because this exceeds 0.05, we cannot reject the hypothesis that the data was drawn from a normally distributed population.
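To make the K-S test concrete, the statistic itself is simply the largest gap between the empirical CDF and the CDF of a normal distribution fitted to the sample. The sketch below computes only the statistic D on hypothetical data; note that when the normal's parameters are estimated from the same sample (the Lilliefors variant), the p-value requires special tables or software and is not computed here:

```python
import statistics

# Hypothetical rent sample, sorted (illustrative; not the chapter's data)
rents = sorted([600, 650, 700, 750, 800, 820, 850, 875, 900, 995])
n = len(rents)

# Fit a normal distribution to the sample (Lilliefors-style setup; p-values
# for this variant require tables or software, so only D is computed here)
fitted = statistics.NormalDist(statistics.mean(rents), statistics.stdev(rents))

# K-S statistic: largest gap between the empirical and fitted normal CDFs
d = max(
    max((i + 1) / n - fitted.cdf(x), fitted.cdf(x) - i / n)
    for i, x in enumerate(rents)
)
print(f"K-S statistic D = {d:.4f}")  # small D → empirical CDF close to normal
```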
4. Consequences of Non-Normality and Remedies
- Impact on Parametric Tests: Many parametric statistical tests (e.g., t-tests, ANOVA) assume normality. Violating this assumption can lead to inaccurate results, especially with small sample sizes.
- Nonparametric Tests: When data is not normally distributed, nonparametric tests can be used. These tests do not rely on assumptions about the distribution of the data.
- Data Transformations: In some cases, data transformations (e.g., logarithmic transformation, square root transformation) can be used to make the data more closely approximate a normal distribution.
- Central Limit Theorem: The Central Limit Theorem (CLT) states that the distribution of sample means will approach a normal distribution as the sample size increases, regardless of the underlying population distribution. This can justify the use of parametric tests with larger sample sizes, even if the data is not perfectly normal.
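The effect of a logarithmic transformation on right-skewed data is easy to demonstrate: applying the log compresses the long right tail, pulling the skewness toward zero. A sketch using hypothetical, deliberately right-skewed sale prices and the sample skewness formula from Section 2.1:

```python
import math

# Hypothetical right-skewed sale prices (one large value stretches the tail)
prices = [100, 120, 130, 150, 160, 180, 200, 250, 400, 900]

def sample_skewness(data):
    """Sample skewness: [n / ((n - 1)(n - 2))] * Σ[(Xᵢ - X̄) / S]³"""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# The log transform compresses the right tail, reducing positive skewness
logged = [math.log(x) for x in prices]
print(f"skewness before log: {sample_skewness(prices):.2f}")
print(f"skewness after  log: {sample_skewness(logged):.2f}")
```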
Conclusion
Understanding and assessing the dispersion and shape of data distributions are essential skills for real estate appraisers. By using the measures and techniques described in this chapter, appraisers can evaluate the normality of their data, select appropriate statistical methods, and make more informed decisions about property valuation and market analysis. Remember that perfect normality is rare, and the decision of whether to use parametric or nonparametric methods should be based on a careful evaluation of the data and the potential consequences of violating the assumptions of normality.
Chapter Summary
This chapter focuses on measures of dispersion and shape and their use in assessing the normality of data, particularly relevant in statistical analysis for real estate appraisal. Understanding data distribution is crucial for selecting appropriate statistical methods. Measures of dispersion, such as standard deviation and variance, quantify the variation within a dataset. The standard deviation, being the square root of the variance, is particularly useful for further statistical analysis. For a population, standard deviation is denoted by σ, while for a sample, it is denoted by S. The coefficient of variation (CV) provides a relative measure of dispersion by expressing the standard deviation as a percentage of the mean, enabling comparisons across different datasets. The range, a simple measure, is the difference between the maximum and minimum values. The interquartile range (IQR), the difference between the third quartile (Q3) and the first quartile (Q1), describes the spread of the central 50% of the data.
Measures of shape, specifically skewness and kurtosis, are essential for evaluating the normality of a distribution. Skewness indicates the symmetry of the data; a normal distribution is symmetrical (skewness = 0), while left-skewed data has a negative skewness, and right-skewed data has a positive skewness. Kurtosis describes the “peakedness” of the distribution. A normal distribution (mesokurtic) has a kurtosis of 3. Leptokurtic distributions are more peaked (kurtosis > 3), while platykurtic distributions are less peaked (kurtosis < 3).
Assessing normality involves examining these measures. A dataset is considered approximately normal if roughly 68% of the observations fall within one standard deviation of the mean, and about 95% within two. Visual tools like box and whisker plots, histograms, and normal probability plots can also reveal deviations from normality. Quantitative tests, such as the Kolmogorov-Smirnov test, provide a statistical assessment of normality, with p-values indicating the probability of observing the data if the population were normally distributed. Low p-values (e.g., < 0.05) suggest departure from normality. If data significantly deviates from a normal distribution, nonparametric statistical methods are more appropriate. Also, the median may be a better indicator of central tendency than the mean if extreme values are distorting the arithmetic mean. The chapter emphasizes that while data may approximate a normal distribution, perfect normality is rare.