Descriptive Statistics: Central Tendency, Dispersion, and Shape

Introduction
Descriptive statistics are fundamental tools for summarizing and understanding data. In real estate appraisal, these techniques are invaluable for analyzing property values, market trends, and other relevant factors. This chapter covers three essential aspects of descriptive statistics: central tendency, dispersion, and shape. Understanding these concepts allows appraisers to effectively describe and compare different data sets, identify potential outliers, and assess the suitability of various statistical methods for further analysis.
1. Central Tendency
Measures of central tendency aim to identify a typical or representative value within a dataset. These measures provide a single number that summarizes the overall location or “center” of the data distribution.
- Mean:
- The mean (or average) is calculated by summing all values in a dataset and dividing by the number of values.
- Formula: X = (Σxi) / n
- Where:
- X = sample mean
- xi = individual data point
- n = number of data points in the sample
- Σ = summation operator
- The mean is sensitive to extreme values (outliers). A single unusually high or low value can significantly affect the mean.
- Example: Using the sample apartment rents (Table 14.1), the mean is calculated by summing all rent values ($29,370) and dividing by the number of units (36), resulting in a mean rent of $815.83.
- Median:
- The median is the middle value in a dataset when the values are arranged in ascending or descending order.
- If there is an even number of values, the median is the average of the two middle values.
- The median is less sensitive to outliers than the mean. It is a robust measure of central tendency when the data contains extreme values.
- Example: Using the sample apartment rents (Table 14.1), the sample has 36 observations, so the median is the average of the 18th and 19th ordered values, which is $825.
- Mode:
- The mode is the value that appears most frequently in a dataset.
- A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). If all values appear with equal frequency, there is no mode.
- The mode is useful for identifying the most common or typical value in a dataset.
- Example: Using the sample apartment rents (Table 14.1), the mode is $850, since this value appears most frequently.
- Stratified Random Sampling: When the goal is statistical inference, stratified random sampling can improve the reliability of the estimates. For example, if garden apartment units in a given market come in one-bedroom and two-bedroom configurations, and one-bedroom units are known to represent 35% of the garden apartment population, then a stratified random sample made up of 35% one-bedroom units and 65% two-bedroom units would support better inferences about population parameters, because the mix of unit types in the sample matches the mix in the underlying population.
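The mean, median, and mode described in this section can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the `rents` list is a small hypothetical sample, not the Table 14.1 data, and `central_tendency` is a helper name introduced here. The final line simply rechecks the chapter's arithmetic for the mean ($29,370 divided by 36 units).

```python
from statistics import mean, median, multimode

# Hypothetical rents for illustration only; these are NOT the Table 14.1 data.
rents = [600, 750, 800, 825, 850, 850, 995]


def central_tendency(values):
    """Return the mean, median, and list of modes for a sequence of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),  # a list, so bimodal data yields two entries
    }


summary = central_tendency(rents)

# Recheck of the chapter's arithmetic: total rent divided by number of units.
chapter_mean = round(29370 / 36, 2)
```

One design point to keep in mind: `multimode` returns every value when all values occur equally often, whereas the chapter treats that case as having no mode.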
2. Dispersion
Measures of dispersion quantify the spread or variability of data points in a dataset. They indicate how closely the data cluster around the central tendency. A small dispersion indicates data points are close to the mean.
- Range:
- The range is the difference between the maximum and minimum values in a dataset.
- Formula: Range = Maximum value – Minimum value
- The range is a simple measure of dispersion but is highly sensitive to outliers.
- Example: Using the sample apartment rents (Table 14.1), the range is $995 - $600 = $395.
- When data is normally distributed, the range will be approximately equal to 6 standard deviations (+3S to -3S), and 99.7% of a normal distribution falls within this 6S range.
- Variance:
- Variance measures the average squared deviation of each data point from the mean.
- Population Variance Formula: σ² = Σ(xi - μ)² / N
- Where:
- σ² = population variance
- xi = individual data point
- μ = population mean
- N = population size
- Sample Variance Formula: S² = Σ(xi - X)² / (n-1)
- Where:
- S² = sample variance
- xi = individual data point
- X = sample mean
- n = number of data points in the sample
- The variance is expressed in squared units, which can be difficult to interpret.
- Standard Deviation:
- The standard deviation is the square root of the variance. It measures the typical distance of data points from the mean.
- Population Standard Deviation Formula: σ = √[Σ(xi - μ)² / N]
- Sample Standard Deviation Formula: S = √[Σ(xi - X)² / (n-1)]
- The standard deviation is expressed in the same units as the original data, making it easier to interpret than the variance.
- Example: Using the sample apartment rents, the sample standard deviation is calculated to be $84.71 (see Table 14.2).
- When data is normally distributed, approximately 68% of the observations are expected to lie within ± 1 standard deviation of the mean, 80% within ± 1.28 standard deviations of the mean, and 95% within ± 2 standard deviations of the mean.
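The variance and standard deviation formulas above, and the normal-distribution percentages quoted with them, can be checked with a short sketch. It assumes a small hypothetical `rents` sample (not the Table 14.1 data); the function names are introduced here for illustration. The coverage calculation uses the standard relationship between the normal distribution and the error function.

```python
import math
import statistics

# Hypothetical rents for illustration only; NOT the Table 14.1 data.
rents = [600, 750, 800, 825, 850, 850, 995]


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def normal_coverage(k):
    """Share of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


sample_sd = math.sqrt(sample_variance(rents))
population_sd = math.sqrt(population_variance(rents))

# Cross-check the hand-written formulas against the standard library.
assert math.isclose(sample_variance(rents), statistics.variance(rents))
assert math.isclose(population_variance(rents), statistics.pvariance(rents))

coverage = {k: normal_coverage(k) for k in (1, 1.28, 2, 3)}
```

The `coverage` values come out at roughly 68.3%, 80.0%, 95.4%, and 99.7%, in line with the rounded figures in the text. Because the sample formula divides by n - 1 rather than N, `sample_sd` is always slightly larger than `population_sd` for the same data.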
- Coefficient of Variation:
- The coefficient of variation (CV) is a relative measure of dispersion. It expresses the standard deviation as a percentage of the mean.
- Formula: CV = (S / X) * 100
- Where:
- S = sample standard deviation
- X = sample mean
- The CV is useful for comparing the variability of datasets with different means or different units of measurement.
- Example: Using the apartment rent data set, the coefficient of variation is $84.71 / $815.83 * 100% = 10.38%.
- The sample having the greatest coefficient of variation has the most widely dispersed data.
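As a quick check of the coefficient of variation arithmetic, the sketch below applies the CV formula to the standard deviation and mean quoted in the chapter, then compares the result with a second sample. The second sample's figures (a $42,000 standard deviation on a $350,000 mean sale price) are hypothetical and included only to show a comparison across different units of scale.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100


# Figures quoted in the chapter for the apartment rent sample.
rent_cv = coefficient_of_variation(84.71, 815.83)

# Hypothetical second sample (sale prices), for comparison only.
price_cv = coefficient_of_variation(42_000, 350_000)

# The sample with the larger CV is the more widely dispersed one.
more_dispersed = "rents" if rent_cv > price_cv else "prices"
```

Although the hypothetical price sample has a far larger standard deviation in dollars, the comparison is made on the relative measure, which is the point of the CV.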
- Interquartile Range:
- A data set’s ordered array can be divided into four subsets of identical size by identifying quartiles.
- Quartiles are useful for analyzing the shape of the data distribution.
- Quartile 1 (Q1) ends at the midpoint between the lowest value and the median.
- Quartile 2 (Q2) ends at the median, and Quartile 3 (Q3) ends at the midpoint between the highest value and the median.
- Fifty percent of the ordered array of data falls between Q1 and Q3 in this interquartile range.
- When data is normally distributed, the interquartile range should be approximately equal to 1.33 standard deviations.
- Example: The interquartile range is Q3 – Q1, or $860 – $760 = $100. The interquartile range is 1.18 standard deviations ($100 / $84.71).
- Decision Rules:
- If the position point calculation is an integer, then the ordered observation occupying that position point is the quartile boundary.
- If the position point is halfway between two integers, then the midpoint between the next-largest and next-smallest ordered observation is the quartile boundary.
- If the position point is neither an integer nor halfway between two integers, then the position point is rounded to the nearest integer and the corresponding ordered observation is the quartile boundary.
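The three decision rules can be sketched in code. The chapter does not state how the position point itself is calculated, so the sketch assumes a common textbook convention: the position point for quartile q is q × (n + 1) / 4, counted from 1 in the ordered array. That formula, the function names, and the sample data are assumptions introduced here, not taken from the source.

```python
def quartile_boundary(values, q):
    """Return the q-th quartile boundary (q = 1, 2, or 3) of a sample.

    ASSUMPTION: the position point is q * (n + 1) / 4, counted from 1 in
    the ordered array. The chapter does not state this formula.
    """
    if q not in (1, 2, 3):
        raise ValueError("q must be 1, 2, or 3")
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    # Integer arithmetic: the fractional part is 0, 1, 2, or 3 quarters.
    whole, quarters = divmod(q * (n + 1), 4)
    if quarters == 0:
        # Rule 1: integer position point, take that observation.
        return ordered[whole - 1]
    if quarters == 2:
        # Rule 2: halfway, take the midpoint of the two neighbors.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule 3: round to the nearest integer position (.25 down, .75 up).
    position = whole if quarters == 1 else whole + 1
    return ordered[position - 1]


def interquartile_range(values):
    """Q3 minus Q1 under the rules above."""
    return quartile_boundary(values, 3) - quartile_boundary(values, 1)


# Hypothetical rents for illustration only; NOT the Table 14.1 data.
rents = [600, 750, 800, 825, 850, 850, 995]
q1 = quartile_boundary(rents, 1)
q3 = quartile_boundary(rents, 3)
iqr = interquartile_range(rents)
```

Under the assumed position formula, a sample of 36 observations has position points of 9.25 and 27.75, so the third rule would select the 9th and 28th ordered observations as Q1 and Q3. Other software may use different quartile conventions and return slightly different boundaries.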
3. Shape
Measures of shape describe the overall form of a data distribution. Key aspects of shape include symmetry and peakedness.
- Skewness:
- Skewness measures the asymmetry of a distribution.
- A symmetrical distribution has a skewness of zero.
- A left-skewed (negatively skewed) distribution has a longer tail extending to the left, and the mean is typically less than the median.
- A right-skewed (positively skewed) distribution has a longer tail extending to the right, and the mean is typically greater than the median.
- Formula: Skewness = [n / ((n - 1) * (n - 2))] * Σ[(xi - X) / S]³
- Where:
- X = sample mean
- n = sample size
- S = sample standard deviation
- xi = individual data point
- Example: For the apartment rent data, the skewness is -0.312, indicating left skewness.
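The skewness formula above translates directly into code. The sketch below implements it as written, using the sample standard deviation (n - 1 divisor) for S. The three small samples are hypothetical and chosen only to show a zero, a negative, and a positive result; `sample_skewness` is a name introduced here.

```python
import math


def sample_skewness(values):
    """Skewness = [n / ((n - 1)(n - 2))] * sum(((x - mean) / S) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("the formula needs at least three observations")
    x_bar = sum(values) / n
    # S is the sample standard deviation (n - 1 in the denominator).
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


# Hypothetical samples, for illustration only.
symmetric = [700, 750, 800, 850, 900]
long_left_tail = [500, 780, 800, 820, 830, 840]
long_right_tail = [760, 770, 780, 800, 820, 1100]

skew_symmetric = sample_skewness(symmetric)
skew_left = sample_skewness(long_left_tail)
skew_right = sample_skewness(long_right_tail)
```

The sign of the result follows the direction of the longer tail: negative for `long_left_tail`, positive for `long_right_tail`, and effectively zero for the symmetric sample.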
- Kurtosis:
- Kurtosis measures the “peakedness” of a distribution, as well as the thickness of its tails.
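- A normal distribution has a kurtosis of 3 (mesokurtic). Some software instead reports excess kurtosis (kurtosis minus 3), for which a normal distribution scores 0.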
- A leptokurtic distribution is more peaked than a normal distribution and has heavier tails (kurtosis > 3).
- A platykurtic distribution is flatter than a normal distribution and has lighter tails (kurtosis < 3).
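- Example: The apartment rent data set has a reported kurtosis of 0.42, which is below 3 and is therefore read as less peaked than a normal distribution. This reading assumes the figure is not an excess kurtosis value; software that reports excess kurtosis uses 0 as the normal baseline, so the convention should be confirmed before interpreting the number.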
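The chapter gives no formula for kurtosis, so the sketch below assumes the simple moment-based definition (fourth central moment divided by the squared second central moment), under which a normal distribution scores 3. Sample-adjusted formulas used by spreadsheet and statistics packages differ and will not reproduce these numbers exactly. The function names and the sample are introduced here for illustration.

```python
def moment_kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


def excess_kurtosis(values):
    """Kurtosis measured relative to the normal distribution's value of 3."""
    return moment_kurtosis(values) - 3


def classify(kurtosis):
    """Label a kurtosis value using the chapter's baseline of 3."""
    if kurtosis > 3:
        return "leptokurtic"
    if kurtosis < 3:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]  # hypothetical, evenly spread values
flat_kurtosis = moment_kurtosis(flat)
flat_label = classify(flat_kurtosis)
```

Keeping the two conventions in separate functions makes it explicit which baseline (3 or 0) a reported figure should be compared against.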
- Normality:
- Many statistical tests assume that data is normally distributed. It is important to assess the normality of a dataset before applying such tests.
- Methods for assessing normality include:
- Visual inspection of histograms and box plots
- Calculation of skewness and kurtosis
- Normal probability plots (Q-Q plots)
- Statistical tests for normality (e.g., Kolmogorov-Smirnov test)
- On a normal probability plot, data points from a perfectly normal distribution fall along a straight line, whereas data points that depart from normality deviate from that line.
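The idea behind a normal probability plot can be sketched without any plotting library: pair each ordered observation with the normal quantile expected at its rank, then measure how straight the resulting points are. The sketch below makes two assumptions that are not from the chapter: the plotting position (i - 0.5) / n for the i-th ordered value, and the use of a Pearson correlation as a rough straightness score. Both samples are hypothetical.

```python
import math
from statistics import NormalDist


def probability_plot_points(values):
    """Pair each ordered observation with a standard normal quantile.

    ASSUMPTION: plotting position (i - 0.5) / n for the i-th ordered value.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]


def straightness(points):
    """Pearson correlation of the plotted points (1.0 = perfectly straight)."""
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    cov = sum((q - mean_q) * (x - mean_x) for q, x in points)
    var_q = sum((q - mean_q) ** 2 for q, _ in points)
    var_x = sum((x - mean_x) ** 2 for _, x in points)
    return cov / math.sqrt(var_q * var_x)


# Hypothetical samples: one built from normal quantiles, one with a long right tail.
normal_like = [NormalDist(815, 85).inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
right_tailed = [700, 710, 715, 720, 730, 740, 760, 800, 950, 1500]

r_normal = straightness(probability_plot_points(normal_like))
r_skewed = straightness(probability_plot_points(right_tailed))
```

A score near 1 corresponds to points hugging the reference line; the long-tailed sample scores noticeably lower. This is an informal screen, not a substitute for a formal test such as Kolmogorov-Smirnov.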
4. Parametric and Nonparametric Statistics
A parametric statistic is a statistic whose interpretation and validity depend on understanding the distribution of the underlying population data from which a representative sample has been drawn. Many parametric statistics rely on an assumption that the population is normally distributed.
In contrast, a nonparametric statistic is a statistic whose interpretation and validity do not rely on knowing the distribution of the underlying population data from which a representative sample has been drawn. Nonparametric statistics involves the use of inferential methods that are valid regardless of the underlying population data distribution.
Inferences on medians (as opposed to inferences on means) derived from nonparametric statistics are useful for analyzing small samples when the underlying population distribution is unknown and the sample is so small that the central limit theorem cannot be relied upon to ensure approximate normality of the sampling distribution of the mean.
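As one illustration of a nonparametric inference on a median, the sketch below implements a two-sided sign test. The chapter does not name a specific test, so the choice of the sign test is an assumption made here; it is used because it relies only on whether each observation falls above or below a hypothesized median, not on the shape of the population distribution. The sample and the hypothesized median of $800 are hypothetical.

```python
from math import comb


def sign_test_p_value(values, hypothesized_median):
    """Two-sided sign test p-value for a hypothesized population median."""
    above = sum(1 for x in values if x > hypothesized_median)
    below = sum(1 for x in values if x < hypothesized_median)
    n = above + below  # observations equal to the hypothesized median are dropped
    if n == 0:
        raise ValueError("no observations differ from the hypothesized median")
    k = min(above, below)
    # Under the null hypothesis each sign is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical small sample of rents, for illustration only.
rents = [790, 805, 820, 830, 845, 860, 875, 910]
p_value = sign_test_p_value(rents, 800)
```

With seven of the eight hypothetical rents above $800, the p-value is about 0.07, which shows how little power a very small sample provides even when the split looks lopsided.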
5. Central Limit Theorem and Inference
Although the most popular and user-friendly inference tests assume that a sample has been drawn from a normally distributed population, that assumption cannot always be made. An understanding of nonparametric statistics is therefore useful.
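The central limit theorem named in this section's heading can be illustrated by simulation: sample means drawn from a clearly non-normal population still cluster around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size. The sketch below assumes an exponential population (mean 1, standard deviation 1), a sample size of 30, and a fixed random seed; all of these are illustrative choices, not values from the chapter.

```python
import random
import statistics


def sample_means(draw, sample_size, repetitions):
    """Mean of each of `repetitions` samples produced by calling `draw`."""
    return [
        statistics.fmean(draw() for _ in range(sample_size))
        for _ in range(repetitions)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable


def draw_skewed():
    # Exponential population with mean 1 and standard deviation 1:
    # strongly right-skewed, chosen only to illustrate a non-normal population.
    return rng.expovariate(1.0)


means_n30 = sample_means(draw_skewed, sample_size=30, repetitions=2000)
center = statistics.fmean(means_n30)
spread = statistics.stdev(means_n30)
theoretical_spread = 1 / 30 ** 0.5  # sigma / sqrt(n)
```

Rerunning with a smaller `sample_size` shows the other side of the theorem: the distribution of sample means stays visibly skewed, which is the situation in which nonparametric methods become attractive.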
Conclusion
Descriptive statistics are essential for summarizing and understanding data in real estate appraisal. By calculating measures of central tendency, dispersion, and shape, appraisers can gain valuable insights into property values, market trends, and other relevant factors. These insights can then be used to make informed decisions and provide accurate appraisals.
Chapter Summary
This chapter on descriptive statistics in the context of real estate appraisal focuses on three key aspects: central tendency, dispersion, and shape of data distributions. Understanding these concepts is crucial for making informed inferences about property values and market trends.
Central tendency measures, such as the mean, median, and mode, provide a single, representative value for a dataset. While the mean is commonly used, the median is often a more robust measure when dealing with skewed data or outliers, which are common in real estate. Stratified random sampling can improve inferences about population central tendency by ensuring sample variability reflects the underlying population’s characteristics (e.g., proportion of one-bedroom vs. two-bedroom apartments).
Measures of dispersion, including standard deviation, variance, coefficient of variation, range, and interquartile range, quantify the variability within a dataset. Standard deviation, the square root of variance, is particularly useful for statistical inference. The coefficient of variation allows for comparing variability across different datasets with differing means. The range and interquartile range provide simple measures of spread. Comparing dispersion measures to known distributions, like the normal distribution, helps determine if parametric statistical methods are appropriate. For instance, if data closely follows a normal distribution, statistical methods based on this distribution can be used to infer population parameters.
Measures of shape, specifically skewness and kurtosis, describe the symmetry and peakedness of a distribution. Skewness indicates whether a distribution is symmetrical or has a longer tail on one side. Positive skewness signifies a longer right tail (mean typically greater than the median), while negative skewness indicates a longer left tail (mean typically less than the median). Kurtosis describes the “peakedness” of the distribution. A normal distribution has a kurtosis of 3 (mesokurtic). Higher values (leptokurtic) indicate a more peaked distribution with heavier tails, while lower values (platykurtic) suggest a flatter distribution with lighter tails. Box and whisker plots and histograms are useful graphical tools to visualize skewness.
Assessing normality is crucial for selecting appropriate statistical tests. While real estate data rarely exhibits perfect normality, tests like the Kolmogorov-Smirnov (KS) test and normal probability plots can evaluate the degree of departure from normality. If data deviates significantly from a normal distribution, nonparametric statistical methods might be more suitable, particularly for small sample sizes. Nonparametric tests do not rely on assumptions about the underlying population distribution.
Finally, the chapter touches upon the central limit theorem, which states that the sampling distribution of the mean approaches a normal distribution as the sample size increases, regardless of the underlying population distribution. This theorem is fundamental to many inferential statistical tests. In summary, by analyzing central tendency, dispersion, and shape, appraisers can gain a comprehensive understanding of real estate data, make informed decisions about appropriate statistical techniques, and draw more reliable conclusions about property values and market trends.