{"id":1365,"date":"2023-03-29T16:24:51","date_gmt":"2023-03-29T10:24:51","guid":{"rendered":"https:\/\/agribusinessedu.com\/?p=1365"},"modified":"2023-03-29T16:58:32","modified_gmt":"2023-03-29T10:58:32","slug":"importance-of-robust-statistics","status":"publish","type":"post","link":"https:\/\/agribusinessedu.com\/importance-of-robust-statistics\/","title":{"rendered":"Importance of Robust Statistics"},"content":{"rendered":"
Importance of Robust Statistics

Robust statistics is traditionally concerned with outliers, or more generally with perturbations of the data measured in total variation distance. The corruption of a dataset, however, can take many different forms, including systematic measurement errors and missing confounders. Robust statistics are statistics that perform well for data drawn from a wide range of probability distributions, particularly non-normal distributions.

Sample statistics such as the mean, median, standard deviation, and interquartile range estimate the values of their corresponding population parameters. Ideally, the sample values should be close to the population value and should not be systematically too high or too low (i.e., they should be unbiased). However, outliers and extreme values in the long tail of a skewed distribution can turn some sample statistics into biased, poor estimates: they drift away from the correct value and become systematically too high or too low. In contrast, a robust statistic remains effective when outliers and long tails are present, has negligible bias, and is asymptotically unbiased as the sample size increases. Robust statistics do not systematically overestimate or underestimate the population value even in the presence of outliers and long tails; they stay reasonably close to the correct value for a given sample size and approach the true value as the sample grows. Outliers and long tails have little or no effect on robust statistics, and they perform well across a wide range of probability distributions, particularly non-normal ones.

Example: Take a data set such as 1, 2, 3, 4, 5, 6 and increase one of its values to a much higher number, say 1000. The mean grows substantially, but the median stays at 3.5. The median, in other words, is resistant to the extreme observation. This is because the median depends only on the middle of the distribution, whereas the mean depends on every observation in the data set.
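This behavior is easy to check numerically. Below is a minimal Python sketch; the six-point data set is the illustrative one assumed in the example above:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6], dtype=float)
corrupted = data.copy()
corrupted[-1] = 1000.0  # replace one value with an extreme outlier

print(np.mean(data), np.median(data))            # 3.5    3.5
print(np.mean(corrupted), np.median(corrupted))  # 169.17 3.5
```

The mean jumps by two orders of magnitude while the median does not move at all.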
The Meaning of Robust Statistics

The classical assumptions of normality, independence, and linearity are frequently not met in practice. Depending on the "magnitude" of the departure and the "sensitivity" of the method, such violations lead to biased results from statistical estimators and tests. To obtain accurate results, a statistical theory is needed that accounts for this kind of departure from parametric models. Nonparametric statistics admits many probability distributions; it is no longer necessary to restrict attention to, say, normally distributed data. Nonparametric statistics, however, also rests on several significant assumptions, such as symmetry and absolute continuity, and departing from these requirements again produces biased and distorted results. Robust statistics operates in the "neighborhood" of a parametric model: it allows for deviations while still exploiting the benefits of parametric models. One way to think of robust statistics is as a theory of approximately parametric models. According to Hampel et al., robust statistics is, "in a broad informal sense, a body of knowledge, partly formalized into 'theories of robustness,' relating to deviations from idealized assumptions in statistics."

What is the aim of robust statistics?

Classical statistical techniques aim to fit all available data points as closely as possible. The most common criterion is least squares, in which the parameters are estimated by minimizing the sum of the squared residuals. If the data set contains outliers, the parameter estimates may differ substantially from those obtained from the "clean" data; for instance, the regression line may be pulled toward the outliers. Under the least squares criterion, every data point receives the same weight, so large deviations are spread across all the residuals, which frequently makes outliers difficult to identify.

Reducing the influence of outliers is one goal of robust statistics. Robust approaches aim to fit the majority of the data, on the assumption that the reliable observations outnumber the outliers. The residuals from a robust fit, which remain large for outlying points, can then be used to detect outliers. A crucial follow-up task is to ask what caused these outliers; they must be examined and interpreted, not simply disregarded.

Robust statistics should also guarantee accurate results when the actual data deviate from idealized assumptions in other ways. Besides outliers, another type of deviation is unexpected serial correlation arising from violations of the independence assumption. Removing a few extreme data points is therefore not sufficient for robustness. Robust statistical techniques should also guard against loss of efficiency, which would reduce the precision of the statistical estimates.
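To make the contrast concrete, here is a small Python sketch (not taken from any reference cited here; the data, outlier fraction, and random seed are made up) comparing ordinary least squares with one classical robust alternative, the Theil-Sen estimator, which takes the median of all pairwise slopes:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=20)
y[-3:] = 60.0  # plant three gross outliers

# Ordinary least squares: every point gets equal weight,
# so the outliers pull the fitted slope away from the true 2.0.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Theil-Sen: the median of all pairwise slopes, a classical
# robust estimator with a breakdown point of about 29%.
slopes = [(y[j] - y[i]) / (x[j] - x[i])
          for i, j in combinations(range(len(x)), 2)]
ts_slope = np.median(slopes)
ts_intercept = np.median(y - ts_slope * x)

print(f"OLS slope:       {ols_slope:.2f}")  # noticeably biased
print(f"Theil-Sen slope: {ts_slope:.2f}")   # close to 2.0

# Residuals from the robust fit stay large at the outliers,
# so they can be flagged for examination.
resid = np.abs(y - (ts_slope * x + ts_intercept))
print("suspected outliers:", np.nonzero(resid > 5 * np.median(resid))[0])
```

As discussed above, the flagged points should then be examined and interpreted rather than silently discarded.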
What are Robust Statistical Analyses?

Robust statistical analyses yield reliable conclusions even when real-world data do not meet ideal conditions. These methods work well when the sample data contain unusual values or come from a variety of distributions. In other words, the results can be trusted even if the assumptions are not entirely met.

For instance, parametric hypothesis tests that evaluate the mean, such as t-tests and ANOVA, assume that the data are normally distributed. Yet when the sample size per group is sufficiently large, the central limit theorem makes these tests robust to departures from the normal distribution.

Nonparametric analyses, in contrast, make no assumptions about the distribution of the data and evaluate the median instead of the mean. Like the median, nonparametric analyses also resist the effects of outliers.

Measures of Robustness

A number of robustness measures have been proposed in the literature. The three most important are the breakdown point, the sensitivity curve, and the influence function.

The breakdown point

The breakdown point is a measure of a statistical procedure's overall reliability: it indicates the smallest fraction of contamination a procedure can tolerate before it can be made arbitrarily wrong. It describes the worst-case scenario and can usually be obtained by a standard back-of-the-envelope calculation. For example, a single corrupted observation can drive the sample mean beyond any bound, so the mean has a breakdown point of 0%, whereas the median tolerates corruption of just under half the sample, for a breakdown point of 50%.

This idea has encouraged the search for high-breakdown-point procedures, which make it possible to separate the structure formed by the bulk (or majority) of the data from structure that may constitute an important minority group. As a result, such procedures are helpful exploratory tools that enable the discovery of patterns in data. Their development has revived older ideas, such as the depth of a data cloud, and opened new research trajectories in numerous fields, with a significant impact on computational statistics and data analysis.

Sensitivity Curve

The sensitivity curve measures the effect of a single outlier on an estimator. The idea is to compute the difference between the estimate for a given sample $x_1, \ldots, x_N$ and the estimate when an observation $x$ is added to the sample, normalized by the fraction of contamination $1/(N+1)$. Hence, for a given estimator $T$, the sensitivity curve is defined as:

$$\mathrm{SC}(x) = \frac{T(x_1, \ldots, x_N, x) - T(x_1, \ldots, x_N)}{1/(N+1)}.$$

Influence Function (IF)

The influence function (IF) measures the impact of an infinitesimal fraction of outliers on an estimator. Define a metric of a probability distribution as $\theta(X)$, where $X$ is any probability distribution function. This metric can be anything, for example a measure of the spread of the distribution (such as the standard deviation). Now suppose we have a "normal" distribution with thin tails, $f$. We contaminate it with an outlier distribution $g$, generally assumed to be a point mass $\delta_z$ at $z$, in proportion $\epsilon$. This gives the resultant distribution:

$$f_\epsilon = (1 - \epsilon)\, f + \epsilon\, \delta_z.$$

The influence function is then the limiting effect of this contamination on the metric, per unit of contamination:

$$\mathrm{IF}(z; \theta, f) = \lim_{\epsilon \to 0} \frac{\theta(f_\epsilon) - \theta(f)}{\epsilon}.$$

Robust Estimator

A robust estimator is an estimation technique that is insensitive to small departures from the idealized assumptions under which the algorithm was developed. Robust estimators are preferable when the observed data contain substantial noise and outliers, whereas minimal-zone fitting is an appropriate fitting-error metric for obtaining a narrow tolerance band.

Robust Estimators of the Central Tendency

The central tendency in statistics refers to the tendency of quantitative data to cluster around a single central value. The mean is the classical measure of central tendency, but it is not robust. The most useful robust estimators of the central tendency are the median and the trimmed mean.

Robust Estimators of the Dispersion

Statistical dispersion represents the variability of the observations in a dataset. The standard deviation is the classical measure of statistical dispersion, but it is not robust, because a single outlier can make it arbitrarily large. The two most frequently used robust estimators of the dispersion are the median absolute deviation and the interquartile range.
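A short Python sketch (NumPy only; the data and the 10% trimming proportion are illustrative choices) makes the contrast between the classical and robust estimators of location and dispersion visible:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=1.0, size=100)
sample[:5] = 500.0  # contaminate 5% of the observations

def trimmed_mean(a, prop=0.10):
    """Mean after discarding the lowest and highest `prop` fraction."""
    a = np.sort(a)
    k = int(len(a) * prop)
    return a[k:len(a) - k].mean()

def mad(a):
    """Median absolute deviation from the median (unscaled)."""
    return np.median(np.abs(a - np.median(a)))

iqr = np.percentile(sample, 75) - np.percentile(sample, 25)

print("mean:", sample.mean(), " median:", np.median(sample),
      " trimmed mean:", trimmed_mean(sample))
print("std:", sample.std(ddof=1), " MAD:", mad(sample), " IQR:", iqr)
```

On this sample the mean and standard deviation are dominated by the 5% contamination, while the median, trimmed mean, MAD, and IQR barely move from their values on the clean data.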
The Importance of Using Robust Analysis to Understand Change

By using data to unearth knowledge, leaders gain insight into aspects of the organization that they might not otherwise be able to perceive.

Understanding new customer expectations and gathering solid data to track the progress of each change strategy are essential. All too frequently, however, the discrepancy between customers' actual expectations and the company's perception of their needs results in business process changes that fall short of the desired outcomes.

Deep and robust analysis can find the signals in the noise. Correlation is a relationship or pattern between the values of two variables; causation is the occurrence of one event as a result of another. When the two are confused, business judgments end up resting on bad assumptions, which can significantly undermine success. The right analytical tools can get to the root of how organizational structures meet customer needs.

Robust Statistical Methods in Quantitative Finance

Data on fundamental factor exposures and financial asset returns frequently contain outliers: observations that are inconsistent with the majority of the data. Academic finance researchers and quantitative finance practitioners are well aware of this and work to reduce the impact of outliers on their analyses. However, frequently used outlier-mitigation strategies assume that handling outliers in each variable independently is sufficient. Such methods are prone to missing multivariate outliers, observations that deviate from the norm across several dimensions jointly without being extreme in any single variable. In the presence of multivariate outliers, robust statistical approaches provide a better strategy for building trustworthy financial models, but academic researchers and practitioners regrettably underuse them. A recent dissertation makes this case with two applications, to outlier detection and asset-pricing research, and encourages broader use of robust statistical approaches in quantitative finance.
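To see why coordinate-wise screening can miss multivariate outliers, consider this illustrative Python sketch (synthetic "returns," not real financial data): the planted point is modest in each coordinate on its own, but it contradicts the correlation structure, and a Mahalanobis-distance check exposes it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
a = rng.normal(0, 1, n)
# Two strongly correlated synthetic return series.
returns = np.column_stack([a, 0.9 * a + rng.normal(0, 0.2, n)])
# A multivariate outlier: 1.5 and -1.5 are unremarkable on their
# own, but the pair contradicts the strong positive correlation.
returns = np.vstack([returns, [1.5, -1.5]])

mu = returns.mean(axis=0)
cov = np.cov(returns, rowvar=False)
diff = returns - mu
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# The planted point has by far the largest squared distance.
print("most outlying observation:", int(np.argmax(d2)))  # 500
```

In practice the mean vector and covariance matrix would themselves be estimated robustly (for example with a minimum covariance determinant estimator), since a large enough cluster of multivariate outliers can inflate the classical covariance matrix and mask itself.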
Machine Learning Algorithms and Robust Statistics

Machine learning techniques depend on the ability to reconcile idealized mathematical models with noisy, outlier-contaminated real-world data. Robust statistics is a family of estimation approaches that find patterns in such imprecise data while lessening the influence of outliers and smaller subgroups. Without robust statistical techniques, data analysis may yield biased answers and conclusions. Many applications in computer vision rely heavily on robust statistics, because outlier-contaminated data are a common occurrence there: imaging sensors, depth sensors, laser scanners, and other data-gathering devices are inherently imperfect and cannot entirely avoid recording some erroneous values. Moreover, the quantities of interest usually cannot be extracted directly from the sensors, so some form of pre-processing is required, and this pre-processing stage often introduces errors or outliers of its own. For computer vision algorithms to function consistently and accurately in real-world contexts, the input data must therefore be processed robustly.
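One widely used robust fitting procedure in computer vision is RANSAC (random sample consensus). The sketch below is a minimal illustrative Python implementation, with made-up data, trial count, and inlier threshold, that fits a line to heavily contaminated 2-D points by repeatedly fitting minimal random subsets and keeping the candidate with the largest consensus set:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 * x + 2.0 + rng.normal(0, 0.3, n)
outlier_idx = rng.choice(n, size=60, replace=False)  # 30% gross outliers
y[outlier_idx] = rng.uniform(0, 40, size=60)

best_inliers, best_model = 0, None
for _ in range(200):                       # number of random trials (made up)
    i, j = rng.choice(n, size=2, replace=False)
    if x[i] == x[j]:
        continue
    slope = (y[j] - y[i]) / (x[j] - x[i])  # exact fit to a minimal sample
    intercept = y[i] - slope * x[i]
    residuals = np.abs(y - (slope * x + intercept))
    inliers = int((residuals < 1.0).sum())  # inlier threshold (made up)
    if inliers > best_inliers:
        best_inliers, best_model = inliers, (slope, intercept)

print("estimated slope, intercept:", best_model)  # close to (3.0, 2.0)
print("consensus set size:", best_inliers)
```

A common refinement is to refit by least squares on the final consensus set once the outliers have been excluded.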
Conclusion

Robust statistics has contributed significantly to the growth of contemporary statistics by supplying many ideas, concepts, and tools that are now commonplace. Robustness will undoubtedly continue to develop in step with the current progress of statistics and data analysis, and it will face the same wide range of challenges. A notable case in point is the development of reliable methods for the multidimensional, difficult problems posed by machine learning.

References

Atkinson, A. C., & Riani, M. (2000). Robust Diagnostic Regression Analysis. New York: Springer.

Atkinson, A. C., Riani, M., & Cerioli, A. (2004). Exploring Multivariate Data with the Forward Search. New York: Springer.

Hampel, F. R. (1968). Contributions to the Theory of Robust Estimation. University of California, Berkeley.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383–393. [Introduces the influence curve, an important tool in robust statistics.]

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley. [Gives a survey of robust statistical techniques.]

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101. [Introduces M-estimators; started much work on robustness.]

Huber, P. J. (1981). Robust Statistics. New York: Wiley. [Presents a summary of mathematical concepts of robust statistics.]

Lecué, G., & Lerasle, M. (2020). Robust machine learning by median-of-means: Theory and practice.

Liu, R. Y. (1990). On a notion of data depth based on random simplices. The Annals of Statistics, 405–414.

Prasad, A., Suggala, A. S., Balakrishnan, S., & Ravikumar, P. (2020). Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(3), 601–627.

Rousseeuw, P. J., & Hubert, M. (1999). Regression depth. Journal of the American Statistical Association, 94(446), 388–402.

Rousseeuw, P. J., & Leroy, A. M. (2005). Robust Regression and Outlier Detection. John Wiley & Sons.

Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (pp. 448–485). Stanford: Stanford University Press. [Showed how inefficient classical estimators can be in the presence of outliers, and called for a systematic study of robustness.]

Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975 (Vol. 2, pp. 523–531).

Zhu, B., Jiao, J., & Steinhardt, J. (2022). Generalized resilience and robust statistics. The Annals of Statistics, 50(4), 2256–2283.

Retrieved from https://statisticsbyjim.com/basics/robust-statistics/
Retrieved from https://projecteuclid.org/journals/annals-of-statistics/volume-50/issue-4/Generalized-resilience-and-robust-statistics/10.1214/22-AOS2186.short
Retrieved from https://www.forbes.com/sites/ellevate/2021/06/08/the-importance-of-using-robust-analysis-to-understand-change/?sh=1c2d6b943dab
Retrieved from https://www.adelaide.edu.au/aiml/our-research/machine-learning/robust-statistics
Retrieved from https://www.sciencedirect.com/topics/engineering/robust-estimator
Retrieved from https://mathworld.wolfram.com/RobustEstimation.html
Retrieved from https://www.baeldung.com/cs/robust-estimators-in-robust-statistics
Retrieved from https://researchhubs.com/post/ai/data-analysis-and-statistical-inference/robust-statistics.html
Retrieved from https://rohan-tangri.medium.com/robust-statistics-the-influence-function-d71ac687d046
Retrieved from https://digital.lib.washington.edu/researchworks/handle/1773/40304

Written by Mahamudul Hasan Millat
Research Scholar
Statistics Discipline
Science, Engineering & Technology School