Dr. Mark Gardener 


Managing Data Using Excel 
Writer's BlocOn this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date. The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.


Managing Data Using Excel: Available now from Get £5 discount when you buy direct from the publisher. Enter voucher code MDUE20 in the shopping basket at Pelagic Publishing 
Making sense of your data.Once you have got your data into a sensible arrangement and carefully builtup yourt dataset you'll need to think about exploring the data in order to make sense of it. After all, this is generally the point of collecting the data in the first place. You need to be able to work out what kind of data you've got and produce summary statistics, to help reduce the information into the data to a more manageable form (for the human brain). You also need to be able to visualize the data, this helps highlight patterns in the data in a consice and helpful manner. Summary statistics and charts are also essential to convey information to others when it comes to sharing your results (Chapter 9 in the book). Excel has various tools that can help you to explore your data, including Pivot Tables and Chart Tools. This page gives an overview of the material covered in Chapters 4–8 of the book. 

Main types of dataset:
Data can be timerelated 
Types of datasetIn a broad sense you can think of data as being in one of three camps:
How you deal with your data depends to a large extent on which camp your data is most allied to. Some datasets may be a combination and contain elements of more than one kind. Once you know what you are looking at you can use the most appropriate tools to help you explore and make sense of your data. 

Cornerstones of data summary: Centrality  averages 
Fundamentals of data summaryIt is hard to make sense of a load of numbers. The sensible thing to do is to distil the numbers into a summary. Ideally you should do a visual summary before a numerical one but in practice you have to do the latter before the former. Such summary statistics reduce the original data to a few key values, which help the reader view the data. Summary statistics come in four main sorts:
Summary statistics are useful but most readers will find a graphic more helpful in the long run. 

Averages: Mean 
CentralityThere are three main sorts of average:
The most appropriate average is related to the shape of the data, its distribution. 

Dispersion: Standard deviation 
DispersionThis is a measure of how spreadout the data are from the middle. There are several commonly used measures:
There are two camps" of measure, the first 2 are in one camp the rest are in the other camp. Which camp you go for depends on the distribution of the data. 

The size of the sample is called the replication or number of observations  ReplicationThis is the simplest, yet often overlooked. The number of observations (called replications or records) is used in many other calculations. It is important to know if the sample is small (few replicates) or large because this can affect the level of certainty. 

The shape of the data determines the summary statistics and analytical methods Describe shape using: Tally plot 
DistributionThe shape of the data is important because it will affect the kinds of summary statistics you can use and also the statistical approach you can use later on. Essentially you split the data into chunks (called size classes or bins) and work out the frequency of observations in each bin. You usually make a graphical representation of the resulting table but there are "shape statistics" that can help:
If your data are normally distributed the tally plot or histogram looks symmetrical, with a central "hump", which tails off gracefully either side (forming a bellshape). The properties of the normal distribution are well understood and this kind of shape is also called parametric. With parametric data you can use the following:
If your data shape is nonparametric (i.e. is skewed) the mathematical properties are rather more uncertain, you have to use the following for summary:
Of course the replication is independent of the shape of the data. 

Different graphs have different uses: Bar chart 
Graphs and chartsGraphical methods of presentation are usually the best way to explore your data. Graphs are also more easily understood by others so are useful for sharing your findings. Excel has quite powerful graphical capabilities but you often have to put in a bit of effort to overcome the default settings. However, since the graphs of your work represent the pinnacle of your research you probably ought to make it look like you care. Different charts are useful for different purposes:
Some graphs can be pressed into service for more than one use. All of these charts types are easily made in Excel, although the boxplot requires some lateral thinking! 

Exploring correlation & regression: Excel's default correlation is Pearson's proiduct moment, which is a parametric correlation coefficient Use correlation matrix 
Exploring Correlation & Regression datasetsIn these kinds of dataset you are looking for links between variables. Sometimes you have grouping variables and sometimes not. Generally you will check the variables for normality, as this affects the sort of regression you can carry out. You may or may not choose to sumarize each variable, depending how many there are. In any event you need to work out correlation or regression values for the links between pairs of variables. There are two commonly used sorts of correlation coefficient:
In the main you'll use Pearson's coefficient for basic exploration simply because it is convenient to compute in Excel.You can also use the Analysis ToolPak to make a correlation matrix, which allows you to see all the coefficients in one go. You can make a correlation matrix by other methods, but they are a bit more work. When you need to visualize the relationships you will use scatter plots. You cannot use these from a Pivot Table (as a Pivot Chart) but it is relatively easy to make scatter plots, which you can alter. You can most easily do this using selection outlines, which appear when you click in a scatter plot. You simply drag the outline to a new variable, and the chart updates. 

Timerelated data: Line plots allow you to follow a variable across (equal) time intervals Sparklines are mini incell charts 
Exploring Timerelated datasetsWhen you have timerelated data you'll often want to "follow" a variable over time. The line plot is the way to do this graphically. The line plot shows the value of a variable at various categories, these would correspond to the time intervals. They are equal in "size" so if the time intervals are not regular then you should consider an alternative, such as a scatter plot. You can use a Line plot from a Pivot Table, as a Pivot Chart. This makes it easy to explore your data as you can alter the chart rapidly and easily. You can also use Sparklines, which are mini line charts (although there is a bar version), which can be useful to follow several variables at the same time as separate plots. The Sparklines are incell plots so you can place regular text or functions in the "background", i.e. the cells work like normal cells, they somply have a chart inside too. 

Association data: Contingency table crosstabulates frequencies over categories Visualize with bar charts or pie charts 
Exploring Association datasetsWhen you have association data you have frequencies of observations at categories. The usual way to proceed is to make a contingency table – the Pivot Table is a useful tool for this. The contingency table shows the frequencies crosstabulated over two ranges of category. You can have grouping variables and the Pivot Table makes it easy to incorporate them. The contingency table itself is a useful summary but large tables can be hard to take in. Using charts is a good way to visualize association data. You can make bar charts from the Pivot Tables very easily as Pivot Charts, which allow you to arrange and rearrange the data easily. Whilst bar charts are the most useful way to visualize the data Pie charts are also an option. Pie charts show proportions of things in a sample. This means that you may need several pie charts, one for each category. Pie charts are not ideal, as the human eye is not good with angular measures. You can make a pie chart as a Pivot Chart but a bar chart is likely to be a better option! 

Differences data: Regular summary statistics for sample groups Visualize using: Bar charts (with error bars) Charts can be filtered to help explore the data 
Exploring Differences datasetsWhen you are looking for differences between sample groups you'll use summary statistics on each sample group. You can use Pivot Tables to produce mean values but not medians. If you need to use medians you'll have to assemble the data into sample groups (using the Pivot Table and an index variable) and regular Excel functions. Visualizing differences data usually means the bar chart, which allows you to show the average and a measure of dispersion (as error bars). You can make bar charts easily from Pivot Tables, although you need a bit of work to add the error bars. Boxwhisker plots are a good alternative to bar charts. They show several statistics in a compact manner, usually median, IQR and the max and min. Many statistians use boxplots to show nonparametric statistics, even if the data are normally distributed. It is not so easy to make a boxplot in Excel, but it can be done. You need to use the "Other" types of chart, specifically a HighLowOpenClose type of Stock chart. With a little bit of coercion you can show the median, IQR and range (as error bars) and make a very tidy plot. If you arrange your data appropriately you can use filters on your data to alter the chart, even boxplots can be filtered. You start with a Pivot Table to get the data into sample groups, then prepare the summary statistics required for the chart. You then build the chart and can apply filters afterwards to explore the data. 

Top 


Providing training for:


Follow me... 

Top  Contact  DataAnalytics Homepage 