Statistical data analysis

Filipe Barbosa
Jan 12, 2024
Updated: Feb 15, 2024

OVERVIEW

Statistical data processing, or data analysis, is the process of transforming raw data into information that is useful for understanding a situation or making a decision. This article presents the most common statistical techniques, along with some practical examples.

INTRODUCTION

It is difficult to identify the first case of data analysis, since humans have been analyzing data for centuries. However, one of the first recorded cases of data analysis dates back to the 17th century, when John Graunt, a London merchant, analyzed weekly mortality lists.

The mortality lists were weekly reports documenting the number of deaths in London and their causes. Graunt used this data to create the first life tables, which showed the probability of dying at different ages and the average life expectancy. This analysis helped advance the field of demography, the study of populations, and is considered a fundamental work in statistics.

Nowadays, some practical situations in which data analysis is useful are:

1. Informed decision-making: Data analysis helps decision-makers make informed decisions by providing evidence-based insights.

2. Problem identification: Data analysis helps identify and solve problems by highlighting patterns, trends and exceptions.

3. Improving efficiency: Data analysis can help organizations identify inefficiencies and areas for improvement, leading to increased productivity and reduced costs.

4. Increased sales: By analyzing data, organizations can identify new opportunities for growth and expansion.

5. Predictive modeling: Data analysis can be used to develop predictive models that forecast future trends and patterns based on the most important variables.

Here are some of the most common techniques and some examples.

MOST COMMON STATISTICAL TOOLS

A large number of statistical tools have been developed over time, with varying degrees of complexity. The good news is that nowadays there are several software applications that put the use of these tools within almost everyone's reach. Some of the main statistical tools used in a business environment are:

1. Descriptive statistics: This is a way of summarizing and describing data using measures such as mean, median, mode, variance and standard deviation. Descriptive statistics can be used to better understand the characteristics of a data set.

2. Inferential statistics: This involves making predictions or drawing conclusions about a population based on one or more samples of data. Inferential statistics are often used in market research, where a sample of consumers is used to make inferences about the whole population. It is the basis of hypothesis testing, where you try to determine whether there is a statistically significant difference between two groups or whether there is a relationship between two variables.

3. Regression analysis: Regression is a statistical method used to identify the relationship between two or more variables. Regression analysis can be used to develop predictive models and make forecasts.

4. Time series analysis: This is a statistical method used to analyze data over time. Time series analysis can be used to identify trends and patterns in data, make predictions and detect anomalies.

5. Statistical process control: This method uses control charts to monitor a process and determine whether its results are consistent and predictable. It also flags abnormal situations so that their causes can be investigated.

These are just some of the main statistical tools used in business. The choice of the appropriate tool or tools will depend on the specific business problem or issue to be addressed.

EXAMPLES

To illustrate the use of some of the tools presented above, let's use an example in which a company is analyzing data on the time it takes to prepare orders, measured in minutes. The aim is to understand which factors most influence this time and then take steps to reduce it.

The data collected over several weeks looks like this:

We can see that, in addition to the "Time" variable, which is the dependent variable, data was also collected for each order on 6 independent variables: whether the order is urgent or not, day of the week, number of items in the order, number of lines in the order, time of day and operator who prepared the order.

In a first analysis, we typically use descriptive statistics to quantify the central tendency and dispersion of the "Time" variable:
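As a rough illustration of this first step, the measures of central tendency and dispersion can be computed with Python's standard `statistics` module. This is only a sketch: the `times` values below are hypothetical and are not the data collected in this example.

```python
import statistics

# Hypothetical preparation times in minutes (illustrative only,
# NOT the data set analyzed in the article).
times = [12.0, 15.0, 11.0, 14.0, 13.0, 15.0, 18.0]

summary = {
    "n": len(times),
    "mean": statistics.mean(times),          # central tendency
    "median": statistics.median(times),      # central tendency, robust to outliers
    "mode": statistics.mode(times),          # most frequent value
    "variance": statistics.variance(times),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(times),        # sample standard deviation
    "min": min(times),
    "max": max(times),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median is a quick way to spot skew or extreme values before moving on to the charts.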

Next, we try to visualize the dependent variable over time in a time series graph to see if there are any patterns of interest:

In this case, there seem to be several "strange" observations, but how do we decide with some degree of statistical confidence which ones are really abnormal compared to the rest? To do this, we use a control chart:
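The article does not say which type of control chart was used. As a minimal sketch, we assume here an individuals (I) chart, whose limits are the mean plus or minus 2.66 times the average moving range between consecutive observations. The function name `individuals_chart_limits` and the `times` values are hypothetical.

```python
import statistics

def individuals_chart_limits(values):
    """Return (center, lower, upper) limits for an individuals (I) control chart."""
    center = statistics.mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = statistics.mean(moving_ranges)
    # 2.66 = 3 / d2, where d2 = 1.128 for moving ranges of two observations.
    half_width = 2.66 * mr_bar
    return center, center - half_width, center + half_width

# Hypothetical preparation times in minutes (illustrative only).
times = [12.0, 14.0, 13.0, 15.0, 13.0, 14.0, 40.0, 12.0, 13.0, 14.0, 15.0, 13.0]

center, lcl, ucl = individuals_chart_limits(times)

# Observations outside the control limits are candidates for investigation.
flagged = [(i, t) for i, t in enumerate(times) if t < lcl or t > ucl]
print(f"center={center:.2f}, LCL={lcl:.2f}, UCL={ucl:.2f}, flagged={flagged}")
```

With this made-up series, only the 40-minute observation falls outside the limits. A point being flagged is a reason to investigate, not by itself a reason to delete the observation.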

Based on this chart, we can see which observations are genuine outliers and investigate their causes. In some cases we will find a good reason to discard the data from the analysis; in others we will not, and we should then keep all the observations for the next steps of the analysis. In our example, we found that the outliers were due to data recording errors or system malfunctions, so we discarded these observations and recalculated the descriptive statistics and the control chart:

Next, we will try to find out if, based on this sample, we can conclude that urgent orders are prepared more quickly than non-urgent orders. This is where inferential statistics and hypothesis testing come in handy:

Without going into too much detail about the results of the analysis, we can see graphically that the confidence intervals for the two groups do not overlap and the p-value is less than 0.05, so we conclude that there is a statistically significant difference between the preparation times for urgent and non-urgent orders. In practice, this confirms what we expected: the process for handling urgent orders is faster than the standard process.
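The comparison above was carried out in Minitab, and the exact test is not shown. To illustrate the idea of a hypothesis test without any external library, the sketch below uses a permutation test on the difference in means. This is a substitute technique, not necessarily the one behind the results above, and the `urgent` and `standard` values are hypothetical.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Hypothetical preparation times in minutes (illustrative only).
urgent = [9.0, 10.0, 8.0, 11.0, 9.0, 10.0, 8.0, 9.0]
standard = [13.0, 15.0, 12.0, 14.0, 16.0, 13.0, 15.0, 14.0]

diff, p_value = permutation_test(urgent, standard)
print(f"difference in means: {diff:.2f} min, p-value: {p_value:.4f}")
```

A p-value below 0.05 here is read the same way as in the text: the observed difference would be very unlikely if urgency had no effect on preparation time.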

Another hypothesis is whether "Day of the week" has an influence on order preparation times. Looking only at the descriptive statistics, we see that the averages per day are different:

However, these are just averages of random samples. What we are really looking for is a comparison between the populations from which these samples were taken, based on their averages and variations. To do this, we again have to resort to inferential statistics and hypothesis testing:

The analysis shows graphically that the confidence intervals for the means of the populations we are comparing overlap, and the p-value is greater than 0.05, which means that we cannot conclude that "Day of the week" significantly influences order preparation time.
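When more than two groups are compared, a common choice is one-way analysis of variance (ANOVA). We assume that here for illustration, since the article does not name the test used. The sketch computes the F statistic, the ratio of between-group to within-group variation. The function name `one_way_anova_f` and the three days of data are hypothetical.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical preparation times in minutes for three days (illustrative only).
monday = [12.0, 14.0, 13.0, 15.0]
tuesday = [13.0, 12.0, 15.0, 14.0]
wednesday = [14.0, 13.0, 12.0, 16.0]

f_stat, df_b, df_w = one_way_anova_f([monday, tuesday, wednesday])
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

An F value close to zero, as with this made-up data, means the averages per day differ far less than the orders vary within a single day. The p-value is then obtained from the F distribution with those degrees of freedom, which statistical software reports directly.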

Following the same approach, we could repeat the analysis for the independent variables "Operator" and "Time of day":

From these analyses we conclude that:

-Operator B is significantly faster than operators A and C, which prompts us to examine the differences in the way they work and thus understand how we can help operators A and C to be as efficient as B.

-There are no significant differences in preparation time throughout the day, depending on when the order is prepared.

Looking now at the importance of the remaining independent variables in our process, "Qty of items" and "Number of lines", since they are quantitative variables, we can use regression to try to obtain meaningful models.

Because the p-value is greater than 0.05, we cannot conclude that the variable "Qty of items" has a significant influence on preparation time.

The p-value of less than 0.05 shows that the variable "Number of lines" has a significant influence on preparation time. We can therefore use the model, or equation, to predict how long it will take to prepare an order depending on the number of lines.
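To show what using the model to predict looks like, here is a minimal least-squares sketch of a straight line relating preparation time to number of lines. The fitted equation from the article is not reproduced; the `lines` and `times` values and the function name `fit_line` are hypothetical, and the sketch does not compute the p-value of the slope.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical observations (illustrative only): order lines vs. minutes.
lines = [1, 2, 3, 4, 5]
times = [7.1, 8.9, 11.2, 12.8, 15.0]

slope, intercept = fit_line(lines, times)

# Predict the preparation time of an order with 8 lines.
predicted = intercept + slope * 8
print(f"time = {intercept:.2f} + {slope:.2f} * lines; 8 lines -> {predicted:.2f} min")
```

Predictions are most trustworthy within the range of line counts actually observed; the 8-line order above extrapolates beyond this made-up sample and should be read with that caution.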

To conclude this example, the variables or factors that significantly influence order preparation time are:

-Urgency

-Operator

-Number of lines

From this knowledge we can look for solutions to make the process faster and more consistent, setting aside the other variables studied: "Day of the week", "Qty of items" and "Time of day".

IN SUMMARY

Data analysis is important because it helps individuals and organizations make informed decisions based on evidence and facts. By analyzing data, we can identify patterns, trends and insights that may not be immediately apparent in the raw data.

Note: the analyses presented in this article were carried out using Minitab statistical software.
