Unlocking the right approach to data analysis: descriptive statistics

What do you do when you get a data set?

Do you roll up your sleeves and start analyzing right away? That is not a good idea. Experience shows again and again that if the analyst does not first understand the quality of the data set, the subsequent inferential analysis takes twice the effort for half the result.

The correct method is to start with descriptive statistics.

What are descriptive statistics?

Descriptive statistics is a way to summarize a data set comprehensively, covering how the data is processed and displayed, the characteristics of its distribution, and so on. It is the counterpart of inferential statistics.

Before entering the study of statistics, let us make the basic concepts clear.

Data can be divided into categorical data and numerical data. Categorical data identifies the type a variable belongs to, such as gender, region or any other category. Numerical data expresses magnitude or count, such as ages of 18, 19 and 20.

The most obvious difference is that addition and subtraction are meaningless for categorical data but valid for numerical data. The two can be converted into each other to some extent. For example, an age of 18 is numerical data, but it can also be converted into the categorical value "teenager". We can also use numbers to represent categorical data, such as 0 for women and 1 for men. Such codes still have no arithmetic meaning; they are simply more convenient for a computer to store.
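The two conversions described above can be sketched in a few lines of Python. The records, the age cut-off of 20 and the 0/1 coding are assumptions chosen only for illustration; the article does not fix any of them.

```python
# Hypothetical records: (age, gender) pairs used only for illustration.
people = [(18, "female"), (35, "male"), (19, "female")]

# Assumed cut-off: ages under 20 count as "teenager".
def age_group(age):
    return "teenager" if age < 20 else "adult"

# Assumed coding: 0 for women, 1 for men. The codes are labels, not quantities.
GENDER_CODE = {"female": 0, "male": 1}

# Numerical -> categorical
groups = [age_group(age) for age, _ in people]
# Categorical -> numeric code
codes = [GENDER_CODE[gender] for _, gender in people]

print(groups)  # ['teenager', 'adult', 'teenager']
print(codes)   # [0, 1, 0]
```

Adding or averaging the values in `codes` would run without error, but the result would carry no meaning, which is exactly the point made above.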

The specific applications of categorical and numerical data will be covered in more depth in later study. This article focuses on numerical data first.

Measures of data

The mean (average) is a measure of location used to get an overall sense of the data, and everyone learns it in primary school. However, the mean is not an authoritative measure. Whenever the national average wage is mentioned, we ordinary people are all being "averaged up" by the likes of Ma Yun and Wang Jianlin.

The mean is easily influenced by extreme values. A data set is never guaranteed to be "clean", and operational data of all kinds is often disturbed: for example, bonus hunters inflate the averages of a marketing campaign. In general we can use a trimmed mean to remove abnormal fluctuations: delete a certain proportion, such as 5%, of the largest and smallest values in the data set, and then recalculate the mean.
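A minimal sketch of a trimmed mean follows. It assumes the stated proportion is dropped from each end of the sorted data, which is one common reading of the rule; `trimmed_mean` and the sample amounts are hypothetical.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values to drop per end
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return mean(trimmed)

# Hypothetical order amounts; the last value plays the role of an extreme outlier.
amounts = [10] * 19 + [1000]

print(mean(amounts))          # 59.5, dominated by the single extreme value
print(trimmed_mean(amounts))  # 10, after one value is dropped from each end
```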

Since the mean is unreliable, let us turn to the median. After all the data are sorted in ascending order, the value in the middle is the median. When the data set has an odd number of values, the median is the middle value; when it has an even number, the median is the average of the two middle values. This is also primary-school material.

Another measure is the mode, the value that appears most frequently in the data set. When there are several modes, the data is called multimodal. The mode is used less often than the mean and median, and is mostly applied to categorical data.

Mean, median and mode are the standard measures of location. But they are not enough.
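The three measures can be compared side by side with Python's standard `statistics` module. The wage figures are made up for illustration, with one extreme value included on purpose.

```python
from statistics import mean, median, multimode

# Hypothetical monthly wages (in thousands); one very high earner skews the mean.
wages = [5, 6, 6, 7, 8, 9, 200]

print(mean(wages))       # about 34.4, pulled up by the single extreme value
print(median(wages))     # 7, the middle value of the sorted list
print(multimode(wages))  # [6], the most frequent value

# With an even number of values the median averages the two middle ones.
print(median([5, 6, 8, 9]))  # 7.0

# Several values tied for most frequent: the data is multimodal.
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```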

Data analysts often divide the data into four parts, each containing 25% of the data set; the dividing points are called quartiles.

Arrange the data in ascending order. The 25th percentile is called the first quartile Q1, the 50th percentile is the second quartile Q2, which is also the median, and the 75th percentile is the third quartile Q3. These three points help measure the distribution of the data.
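As a rough illustration, the quartiles can be computed with `statistics.quantiles`. The data set is hypothetical, and the "inclusive" interpolation method is an assumption; other quartile conventions give slightly different cut points.

```python
from statistics import median, quantiles

# Hypothetical data set, already sorted for readability.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)          # 4.0 7.0 9.0
print(q2 == median(data))  # True: Q2 is the median
```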

Dispersion and variation of data

Let us consider a new problem. An e-commerce company wants to sell two products of the same type, and their weekly sales (in units) are as follows:

Product A: 10, 10, 11, 12, 12

Product B:

Are the two products equally good to sell? Of course not. For a product, we prefer the one whose sales are stable.

Variance is a measure of how "stable" the data is. Put more plainly, it measures the variability of the data, also called its degree of dispersion.

The variance is the mean of the squared differences between each value and the mean of the data:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

where the x_i are the data values, \mu is their mean and n is the number of values. The formula above is the variance of a whole population. When the data is only a sample drawn from the population, n should be replaced by n-1. When the data set is large enough, the difference between the two can be ignored.

Now calculate the variance of the two products. The population variance function in Excel is VARP(), and for sample data it is VAR(). Function names differ slightly between Excel versions (newer versions also provide VAR.P() and VAR.S()).
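The same two calculations can be sketched in Python, where `pvariance` divides by n and `variance` divides by n-1. The sales lists below are hypothetical stand-ins, not the article's figures; both have a mean of 11.

```python
from statistics import pvariance, variance

# Hypothetical weekly sales for two products with the same mean (11).
steady = [9, 10, 11, 12, 13]
volatile = [3, 7, 11, 15, 19]

# Population variance divides by n (the counterpart of Excel's VARP).
print(pvariance(steady))    # 2
print(pvariance(volatile))  # 32

# Sample variance divides by n - 1 (the counterpart of Excel's VAR).
print(variance(steady))     # 2.5
print(variance(volatile))   # 40
```

The two products have identical means, yet the variance of the second is sixteen times larger, which is the instability the mean alone cannot show.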

The greater the variance, the greater the dispersion of the data set, so the sales of product A are clearly more stable than those of product B. Because the calculation of variance involves a sum of squares, its unit is the square of the original unit (for products A and B, units squared), which is hard to interpret intuitively. So we introduce the standard deviation.

The standard deviation is the square root of the variance:

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

In Excel, the population standard deviation function is STDEVP(), and for sample data it is STDEV().

Variance and standard deviation carry the same meaning, but the standard deviation has the same unit as the original data, so it is easier to compare with the mean and other measures. For example, the average weekly sales of product A is 11 and the standard deviation is 0.85, so we know this product sells steadily.
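A short sketch of the standard deviation in Python, using hypothetical sales figures rather than the article's. `pstdev` treats the data as a population and `stdev` treats it as a sample.

```python
from statistics import mean, pstdev, stdev

# Hypothetical weekly sales, not the article's figures.
sales = [9, 10, 11, 12, 13]

print(mean(sales))    # 11
print(pstdev(sales))  # population standard deviation, sqrt(2), about 1.41
print(stdev(sales))   # sample standard deviation, sqrt(2.5), about 1.58
```

Because the result is in the same unit as the sales themselves, it can be read directly against the mean: weekly sales of about 11 units, give or take roughly 1.4.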

Chebyshev's theorem states that at least 75% of the data values lie within 2 standard deviations of the mean, at least 89% within 3 standard deviations, and at least 94% within 4 standard deviations. It is a very convenient theorem for quickly grasping the range of the data.

If the average salary in Shanghai is 20k and the standard deviation is 5k, then at least 89%, or roughly 90%, of salaries fall in the range 5k ~ 35k, that is, within 3 standard deviations of the mean.
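The percentages in Chebyshev's theorem come from the bound 1 - 1/k² for k standard deviations. The sketch below reproduces them and the salary range, taking the mean as 20k, which is what the 5k ~ 35k range implies; `chebyshev_lower_bound` is a hypothetical helper name.

```python
def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 4))
# 2 0.75
# 3 0.8889
# 4 0.9375

# Salary example: mean 20k, standard deviation 5k, k = 3.
mean_salary, sd, k = 20, 5, 3
print(mean_salary - k * sd, mean_salary + k * sd)  # 5 35
```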

If the data itself follows a normal (bell-shaped) distribution, a sharper estimate than Chebyshev's theorem is available: about 68% of the data fall within one standard deviation of the mean, about 95% within two standard deviations, and almost all (about 99.7%) within three.
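For a normal distribution these shares can be checked with the error function from the standard library, since the share within k standard deviations equals erf(k/√2). `normal_share_within` is a hypothetical helper name.

```python
from math import erf, sqrt

def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Compared with the Chebyshev bounds of 75% and 89% for 2 and 3 standard deviations, the normal assumption gives much tighter figures, but only when the data really is bell-shaped.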

In Excel there is an important add-in called the Analysis ToolPak (in some versions of Excel it has to be found and enabled by hand), which bundles a large number of statistical tools.

Click Descriptive Statistics, select the range to be calculated, set the grouping to by column, and choose cell U2 next to the data as the output range. Then output the calculation results.

Everything in the first column of the output is one of the measures of descriptive statistics. We do not have to call each function separately.
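A comparable one-shot summary can be sketched in Python. The `describe` helper and its choice of measures are assumptions made for illustration; it reports sample variance and sample standard deviation and does not try to reproduce every row of the Excel output.

```python
from statistics import mean, median, multimode, stdev, variance

def describe(values):
    """Return a dictionary of common descriptive measures for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),
        "sample variance": variance(values),
        "sample std dev": stdev(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sum": sum(values),
    }

# Hypothetical column of values.
summary = describe([9, 10, 11, 12, 13])
for name, value in summary.items():
    print(f"{name}: {value}")
```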

Variance and standard deviation are important concepts that will keep appearing in later statistics topics.

Box plot of data

Back to measures: everything above is a numerical method, and numbers alone are still not intuitive enough.

First, summarize the data with five numbers: the minimum, the first quartile Q1, the median, the third quartile Q3 and the maximum.

Take the salary data of data analysts as a case.

The above is the data after cleaning. We use Excel functions to calculate these five measures: MEDIAN(), MAX(), MIN() and QUARTILE(), computed separately for each city.
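The five numbers can also be computed per city in Python. The city names and salary values below are placeholders, not the article's data set, and the "inclusive" quartile method is an assumption.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical monthly salaries (in thousands) grouped by city.
salaries = {
    "City A": [8, 10, 12, 15, 18, 20, 25, 30, 40],
    "City B": [5, 6, 8, 9, 10, 12, 13, 15, 18],
}

for city, values in salaries.items():
    print(city, five_number_summary(values))
# City A (8, 12.0, 18.0, 25.0, 40)
# City B (5, 8.0, 10.0, 13.0, 18)
```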

From these numbers we can already understand the salary distribution of data analysts in each city. We then turn them into a box plot, one of the most commonly used charts in descriptive statistics.

The box plot is positioned by the five numbers we have just worked out.

The upper and lower whiskers of the box plot are the maximum and minimum values respectively (strictly speaking they are not, but think of them this way for now), and the upper and lower edges of the box are the 75% quantile and the 25% quantile. The horizontal line inside the box is the median. Outliers are the values beyond the whiskers, and they need to be removed.

Excel 2016 can draw a box plot directly. In earlier versions there are two ways to draw one.

The first is to use a stock chart. Arrange the data in the order 25% quantile, maximum, minimum and 75% quantile.

Then generate the chart directly:

This chart has no median, so the median needs to be added. Create a new series in the data source and move it to the middle position of the data source.

Open the format settings of the median data series, change the marker to "-", set its size to 12 and its color to black. At this point we have the prototype of a box plot.

Another approach is to draw it with the error bars of a scatter chart, the same principle used for Gantt charts. Try it yourself.

In fact, we can see from the chart that although we have drawn a box plot, the differences between cities are not easy to see, because the maximum values stretch the whiskers of the box plot. We often encounter such outliers that hurt the quality of an analysis (even when extreme values are legitimate, many analyses have to remove them). We need to clean up these outliers.

Define the interquartile range IQR = Q3 (75% quantile) - Q1 (25% quantile). The boundaries of the box plot are (Q1 - 1.5 IQR, Q3 + 1.5 IQR), and all values outside the boundaries are outliers.

Bottom and top are the new boundaries, and data outside them are treated as outliers. The data inside the boundaries form the main body of the box plot; find the maximum and minimum values within the boundaries. For example, if the boundaries for Shanghai are -5 and 39 and the actual data within them range over 1.5 ~ 37.5, then draw the box plot with 1.5 and 37.5 as its ends.
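The boundary rule above can be sketched as follows. The function name `box_plot_stats`, the sample salaries and the "inclusive" quartile method are assumptions for illustration; `bottom` and `top` follow the names used in the text.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Boundaries, whisker ends and outliers using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    bottom, top = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values inside the boundaries form the body of the box plot.
    inside = [v for v in values if bottom <= v <= top]
    outliers = [v for v in values if v < bottom or v > top]
    return {
        "bottom": bottom,
        "top": top,
        "whisker_low": min(inside),    # new minimum within the boundaries
        "whisker_high": max(inside),   # new maximum within the boundaries
        "outliers": outliers,
    }

# Hypothetical salaries (in thousands) with one extreme value.
stats = box_plot_stats([8, 10, 12, 15, 18, 20, 25, 30, 100])
print(stats)
```

Here Q1 is 12 and Q3 is 25, so the boundaries are -7.5 and 44.5; the value 100 is flagged as an outlier and the whiskers end at 8 and 30.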

Now that the real five numbers have been worked out, the box plot can be redrawn (bottom and top are needed to find the new maximum and minimum within the range). For convenience of demonstration I generated it directly in Python (the BI tools I covered earlier also work, and look better).

It is much more intuitive than the chart drawn in Excel. The red line marks the salary a mid-level data analyst can expect in each city. The blue segment above it is the upper-middle range, the blue segment below it is the lower-middle range, and so on. In short, the population is divided into four tiers.

Let us interpret it: the salary ranges of data analysts in Shanghai, Beijing and Shenzhen are similar, but upper-middle analysts can earn more in Beijing because its median sits higher. Xi'an, Changsha and Tianjin are less favorable for data analysts. Hangzhou is close to Beijing and Shenzhen in level, but its salary ceiling is limited.

You can read a lot from this chart at a glance, and by now you understand what the box plot is for: it shows the overall distribution and the skewness of the data.

Interpreting data quickly through charts is one of the basic abilities of a data analyst (histograms and scatter charts are also descriptive statistics).

Think about it. For O2O data analysis, can we quickly judge the business situation of each city? For finance, can we split users into groups and see how their business distributions differ? For e-commerce, do marketing figures differ greatly between categories? Combined with segmentation along different dimensions, the box plot is of great value.

The box plot is an excellent chart. Although it is a little cumbersome in Excel (unless you update to 2016), in Python and R it takes ten seconds.
