10 Essential Advance Excel Features For Data Analysts (And How To Use Them)
Spreadsheets are always in! The use of electronic spreadsheets dates back to 1979 and remains one of the most popular ways to examine and manipulate data.
Today, Microsoft’s popular spreadsheet application, Excel, has more than 750 million users and is used by some of the world’s largest companies. Personally, I use Excel almost every day because it can sometimes lead to faster results than running Python or loading data into a database.
One of the reasons why Excel is so popular is because it is packed with features and functions that can be used to clean, aggregate, pivot, and chart data. In this article, we will go through 10 features and functions that use data analysis in Excel.
I think every analyst should know:
- Pivot tables and pivot charts
- Conditional formatting
- Remove duplicates
- DAYS and NETWORKDAYS
1. Pivot tables and Pivot charts
One of my favorite features in Excel is Pivot Tables, and as a follow-up, Pivot Tables. Pivot Charts visualize the data represented in pivot tables, giving us insights at a glance. Pivot tables provide a simple approach to reformat columns and rows, turning them into groups, statistics, or summaries. We can create a chart with table using the PivotChart feature under Insert.
Let’s say we want to count records grouped by Marital_Status. Using pivot tables makes this very simple, as it transforms the data and aggregates it for us.
To create a pivot chart and pivot table, first select the range of data you want to include then click Insert > PivotChart > PivotChart & PivotTable. The Create PivotTable editor will pop up.
The selected range will appear in the Table/Range field by default. Click OK and the pivot table will generate.
In the PivotChart Fields, drag Marital_Status from the field list into the Axis (Categories) box. Then drag Marital_Status from the field list into the Values box. Since the Marital_Status data is a string, the Values aggregation should default to Count. If the data type were numeric, the aggregation defaults to Sum.
The pivot table and chart should populate. You can add additional dimensions or filters by dragging new fields into the corresponding box. It only takes a few clicks to aggregate data and visualize it in Excel, which is why it is such a widely-used tool.
2. Conditional formatting
Come to think of it, I probably use conditional formatting more than any other feature in Excel. Conditional formatting allows you to highlight or hide cells based on the rules you specify. Apply rules to one or more cells in the same worksheet. It is useful to highlight outliers, duplicates, or patterns in data
Let’s say we want to highlight all Year_Birth values greater than 1995 in the dataset.
Select the Year_Birth column and click Conditional Formatting > Highlight Cells Rules > Greater Than.
Enter 1995 and click OK. The cells in the column with a value greater than 1995 will turn light red.
If we decide we want to adjust the conditional formatting rule we just created, follow the path Conditional Formatting > Conditional Formatting Rules Manager.
3. Remove duplicates
Data is often messy, so it is important that you know how to remove duplicates. Using conditional formatting rules, you can highlight the duplicate data to review it before deleting it. The Remove Duplicates feature is available under Data > Data Tools > Remove Duplicates.
Highlight the dataset in Excel and click Remove Duplicates.
The editor allows us to select columns that should be included when deleting the duplicates. Make sure the My data has headers checkbox is marked if the column names are not displaying.
Click OK. The duplicates will be dropped from the dataset. It will tell us how many unique values remain.
XLOOKUP is like a combination of VLOOKUP and HLOOKUP because it can go vertically or horizontally to search for a value in a range. It basically allows you to use a selected range as a lookup table and return “looked up” results in a cell.
The syntax is as follows:
=XLOOKUP(lookup_value, lookup_array, return_array, [if_not_found], [match_mode], [search_mode])
Let’s say we want to look up the Year_Birth based on an entered ID value. In cell AD2, enter an ID value—for example, 8755. In cell AE2, enter the XLOOKUP formula:
The lookup_value is the value we want looked up by the function, so we enter AD2.
The lookup_array is a column or row that contains the lookup value, so we enter A2:A2241 since that will give us an array of IDs.
The return_array is the column or row that contains the value we want to return, so we select B2:B2241 since that will give us Year_Birth values.
The completed formula will look like this: =XLOOKUP(AD2, A2:A2241, B2:B2241)
Enter in different IDs and the corresponding Year_Birth will return.
Lookup functions are very powerful, and you can even join data from different sheets or nest lookup functions within each other. For example, you could sum the value of multiple lookups.
The IFERROR function is used to generate custom error messages when a formula results in an error. For example, we can use it to wrap our XLOOKUP function so that it returns an explicit message if the ID is not found. The syntax is relatively simple.
Going back to the XLOOKUP function, if we enter an ID into AD2 that doesn’t exist in the lookup array, AE2 outputs #N/A. Instead, let’s return “ID Not Found.” For the value in the IFERROR function, use the XLOOKUP function. It should look like this:
=IFERROR(XLOOKUP(AD2, A2:A2241, B2:B2241), “ID Not Found”)
In addition to text, you can point the value_if_error at another cell too. If you target a blank cell as the value for value_if_error, 0 will appear in the cell.
Similar to the find functions, MATCH can be used when you need the position of a value in a range instead of the value itself.
This is the syntax for MATCH:
=MATCH(lookup_value, lookup_array, [match_type])
When writing the function, it is important to know which match type to use. Although it is optional and defaults to 1, the available options are -1, 0, or 1.
- -1: Finds the smallest value that is greater than or equal to lookup_value.
- 0: Finds the first value that is exactly equal to lookup_value.
- 1: Finds the largest value that is less than or equal to lookup_value.
If we want to find the first time the looked up birth year occurs, add a the following MATCH formula to cell AF2:
The COUNTBLANK function is quite intuitive, but it is an important function for data management in analytics, as many machine learning algorithms are very sensitive to null values. When you know how many null values there are, you have a better understanding of how to approach them. For example, if a lot of null values, you should delete the column. If several values are null, you need to assign a value to fill in the null value. COUNTBLANK counts the number of empty cells in a range.
The syntax is simple:
We want to count the number of nulls in the Income column, so add this formula to cell AG2:
8. DAYS and NETWORK DAYS
DAYS and NETWORKDAYS are separate functions, but they are so similar that there is no guarantee of separate entries in my list. The DAYS function only returns the number of days between two dates, while the NETWORKDAYS function is slightly different in that it does not include weekends and specified holidays. It just returns the number of business days between two dates.
The syntax for both is easy to remember:
=NETWORKDAYS(start_date, end_date, [holidays])
At my job, one of the things I analyze is usage data. I use these functions for things like counting the number of days since a user has logged in or used a tool. Since our software is used in schools, sometimes it makes sense to exclude weekends and holidays from our usage data so I’m thankful for the NETWORKDAYS function.
The RANK function ranks a number based on its size relative to other values in a list and returns the desired rank. This means that the rank of the number will be its position if the list is sorted in ascending or descending order. For example, sort Sales in descending order so that the largest value is at the top, and that value could be rank 1. RANK for duplicates with the same rank but cumulative. This means that if two values are rank four, the next rank will be six, not five (1,2,3,4,4,6).
The syntax for RANK takes 3 arguments.
=RANK(number, ref, [order])
It is important to note that [order] can be set to 0 for descending and 1 (or greater) for ascending.
We want to rank the income (column E) of our customers, so we will use the following formula in cell AH2:
=RANK(E2, E2:E2241, 0)
In the bottom right corner of cell AH2, click the square and drag it down to the last row of data and the formula will automatically copy allowing us to quickly generate a rank for each income value.
The last function on my top 10 list is SUMMARY. It’s great when you need to perform arithmetic on specific groups of values. It can be used to add, subtract, multiply or divide selected numeric values by the respective ranges. Although it sounds complicated, the logic is quite intuitive once you try it.
The syntax, however, is not very intuitive.
=SUMPRODUCT(array1, [array2], [array3], …)
We want to sum the KidHome for all rows where Education equals Graduation (C2) and Marital_Status equals Single (D2). We will add this formula to cell AI:
If we wanted to see a different Education value, PhD for example, we could change C2 to C6 in the formula.