-Beginner

You are currently browsing the archive for the -Beginner category.

For those of you in the business of peer reviewing other people’s data entries, the ability to minimize tedious checks is essential.  Today we’ll look at how =SUM(ABS(…)) is a convenient tool to reduce a simple 3-step process to a simpler 1-step process.

Your intern has just copied a large table of numbers from Worksheet1 to Worksheet2, and you want to make sure nothing went wrong in the process.  In best practice, you’d want to check not only the numbers but also the relative position of the numbers… did the intern switch some rows or columns around?  A good system of lookup formulas should be employed to ensure the integrity of this data transfer.  However, for now let’s just focus on the most basic case where all we care is that the two tables match identically.

It’s a huge table of 100 rows and 20 columns, so you really don’t want to look at each number with your bare eyes.  A typical way to check is to set up a third table of the same dimentions, and subtract the corresponding cells in the two tables:

(Worksheet3!A1) =Worksheet1!A1-Worksheet2!A1
(Worksheet3!A2) =Worksheet1!A2-Worksheet2!A2
….

After that, a quick glance for nonzero values will help you find places where discrepancies exist.  This is not difficult or time consuming by any means, but you still have to (1) type in a formula, (2) copy formula to the correct range, and (3) check for nonzero values.  Instead, try this array formula:

{=SUM(ABS(Worksheet1!A1:T100-Worksheet2!A1:T100))}     (Don’t forget to press CTRL+ALT+Enter)

This formula does an element-wise subtraction of the second table from the first table, and totals the absolute differences into a single number.  If the result of this formula is nonzero, then we’ve got a discrepancy problem worth further investigation – otherwise the job is done!  The ABS() function here serves an important role at making sure positive discrepancies don’t happen to offset negative discrepancies.

With the debates heating up more than ever, we step closer to the General Election each day.  The question of the day is – will you vote?

I love democracy.  I’m also a statistician.  Mixing those two and you get the following conclusion: allowing every single citizen to vote is an unnecessary waste of the nation’s resources, which is therefore, unpatriotic.

Let’s make some simplifying assumptions regarding the election process:

  • Popular vote is what determines the outcome – let’s assume no electoral college here.  It’s how election should be done, anyway.
  • There are only two candidates, and every voter will vote for one or the other.  This is almost true in the United States, with less than 1% of the popular votes going toward a non-Republican non-Democrat in 2004.

Now, let’s see the following scenario.  121 million citizens voted in the 2004 presidential election and mattered (no offense to those voting for Nader, Badnarik, etc… but they really didn’t count for much).  51.2% of them voted for Bush, who continued to become our president for the next four years.  The remaining 48.8%, of course, wishfully voted for Kerry.  Without going far into the statistical details, we know that if we drew a random sample of 2.6 million voters, we can be 99.9% confident that their voting outcome will be within 0.1% of the entire country’s voting results.

Download the Sample File

I made a simple Sample Size Calculator based on the formulas from this website.  It takes four parameters regarding the population size, estimate of the result, and the required confidence level, and gives you the necessary sample size.  The noteworthy formula used is the inverse normal function:

=NORMINV(probability, mean, standard_dev)

It gives you the value below which the distribution is expected p% of the time.  Use 0 for mean and 1 for standard deviation to get the normalized z value.  Do take caution to adjust the two-tailed confidence level to one-tailed.

The variables and formulas for this problem include:

N = size of the population (number of eligible voters)
p = expected % outcome (% of the population voting for Democrat or Republican); assume 50% for the “worse case” scenario
c = confidence interval expressed as a %
CL = confidence level, as a %
Z = NORMINV(1-(1-CL)/2, 0, 1) = the number of standard deviations that we are CL% confident the result is within
ss = (Z² * (p) * (1-p)) / c² = the sample size needed for an infinite population
ssf = ss / (1 + (ss-1) / N) = the sample size needed for a finite population with size N

Another thing to point out is that the statistical significance heavily depends on the expected outcome p.  The closer p is to the mid-point (50%), the less confidence we have and the larger the sample size we require.  As historically observed, the U.S. voters are split fairly evenly down the middle.  However, that’s frequently not the case when you examine voting results by state or county.  When a locality is as heavily as 80% leaning toward red or blue, we’d only need a very small sample of voters to be very confident about the results.

If you aren’t familiar with the concept of confidence intervals, this article probably isn’t making much sense to you.  Otherwise, it’s proving a practical political point – in an age where everything needs to go green and at a time when everyone is on a tight budget, let’s not waste all the applications, ballots, and time by allowing every citizen to vote!  Instead, we should randomly select 2.2% of the eligible voters in the country and rest assured that their votes are nearly 100% representative of ALL Americans!

 Download the Sample File

 

The Beijing Olympics games is all about the shimmer, shine and glamor: the Olympic Medals. If I were interested in comparing the number of Olympic Medals by country using Excel, how would I do this?

Download the Sample File

To make tasks such as this a bit more dynamic, I often like to create user interfaces. When creating an interface, the general rule of thumb is that it should be simple, allow the user to select his/her preferences and produce accurate results (both detailed and executive summaries).

Download the Sample File

To create this user interface, I will use the Excel functions MATCH and OFFSET to fetch data. This data is the basis to populate the graph and perform the simple difference calculation.

MATCH

  • =MATCH(lookup_value, lookup_array,[match_type])
  • MATCH returns the index / element position for the specified lookup value based on an associated array. The optional match type argument is FALSE if a lookup value must match an element within the array. TRUE or blank (default) in combination with a sorted array will return an index based on the range of values.
  • Ex, lookup_array = {1, 3, 5, 7, 9}.
    • MATCH(5, lookup_array, FALSE) = 3 because 5 is the 3rd element within the lookup_array.
    • MATCH(4, lookup_array, FALSE) = #N/A.  Since match_type = FALSE an exact match must occur which means that the lookup value must exist within the lookup array!
    • MATCH(4, lookup_array, TRUE) = 2. lookup_array must be sorted if match type is TRUE.
      • lookup values from 3 to <5 will return an index of 2,
      • lookup values from 5 to <7 will return an index of 3, etc, etc.

OFFSET

  • =OFFSET(reference, rows, columns, [height], [width])
  • Think of OFFSET in terms of Excel key navigation. Excel defaults at cell A1.
    • Want to get to cell B1? It is easy- simply press the right arrow or in other words, shift forward one column.
    • Cell A2? Shift forward one row / press the down arrow.
    • In Excel, OFFSET(A1, 0, 1) = B1 and OFFSET(A1, 1, 0) = A2. Therefore, OFFSET(A1, 2, 1) = B3.
  • OFFSET can also retrieve array / matrices of information when using the optional height and width arguments.

How does it work?  MATCH allows us to determine the position of each country and the medal type.  The MATCH results filter to OFFSET where Excel can retrieve the appropriate data from our table.

We have created this exhibit with just a few tools: Excel’s built in graph formats, conditional formatting, and the MATCH and OFFSET Excel functions. This type of model can have its various uses whether it is to create an exhibit to quickly compare financial ratios of equities, varying benefit plan design costs, and even Olympic medal counts.

Note: OFFSET and MATCH are not the only functions we can utilize to create the output, some of you may prefer MATCH with either VLOOKUP or HLOOKUP.

Download the Sample File

While arrays are great for grouping a system of formulas, their real advantage is to enable complex subtotals and lookups that ordinary Excel functions fail to support, through a technique known as the array sum or the conditional sum.

The general syntax is:

{=SUM(
     [condition 1]*
     [condition 2]*…
     [value 1]*…
     )
}

Note the brackets enclosing the formula.  To enter the formula as an array, press CTRL+ALT+Enter.

Each of the conditions and values is an array of corresponding size (equal size most of the time, but not necessarily).  The enclosing SUM() function and the array brackets serve as a SUMPRODUCT().  Note that

=SUMPRODUCT([array 1], [array 2], …)

is equivalent to

{=SUM([array 1]*[array 2]*…)}

The array sum multiplies the value array by one or more condition arrays, which contain 0’s and 1’s (FALSE and TRUE).  These condition arrays serve as filters (or masks) to zero-out values not corresponding to the specified conditions.  By picking up the remaining values this way, we can take SUMIF(), COUNTIF(), VLOOKUP() and other functions beyond one condition and one dimension.

Below is an example of box office sales records.  To make sense of this data, we’d often need to analyze subsets of the records.

Download the Sample File

With SUMIF() and COUNTIF() functions, we can easily obtain a subtotal for all sales in July.  We’ll practice using an array sum instead:

{=SUM(
     ([month column]=”July”)*
     ([sales column])
     )
}

Note that by omitting the [sales column], we’d be counting the number of records instead of summing their values.

Moving on, let’s get a subtotal for all June sales in Roanoke.  SUMIF() and COUNTIF() fail to be useful here.

{=SUM(
     ([month column]=”June”)*
     ([city column]=”Roanoke”)*
     ([sales column])
     )
}

Similarly, we can add a third condition (or in fact, as many as necessary):

{=SUM(
     ([month column]=”August”)*
     ([city column]=”New York”)*
     ([movie column]=”Transformers”)*
     ([sales column])
     )
}

In the last example, only one record satisfies all conditions.  The result is similar to that of a VLOOKUP(), except of course, we have a lot more freedom by using an array sum.  Sometimes, you might find yourself needing to specify more than one condition in a lookup and not wanting to double count any potential duplication of records.  What you can do is to use a MAX() function in place of the SUM() to limit the result to one record.

Here is a typical data snapshot summarizing the subtotals in a two-dimensional table:

Lastly, let’s see how an array sum can obtain subtotals (or single-value lookups) in such a two dimensional array.  The trick is to line up the dimensions of the condition and value arrays:

{=SUM(
     ([3×1 month column]=”July”)*
     ([1×3 movie row]=”Batman”)*
     ([3×3 sales table array])
     )
}

This article provides an introduction to the more advanced methods of using Excel formulas.  Once you get familiar with these methods, you will find a much more satisfying experience analyzing data in Excel.

Download the Sample File

 

Since many of this blog’s posts will be about array-based tricks, it might be a good idea to have a proper introduction to Excel’s array formulas.

By default, Excel’s functions and formulas take one or more input values and produce a single output value. Think of the typical multiplication and SUMPRODUCT() function.  When solving a more complex system of problems, however, it is often desired to process an array of similar values concurrently.  Hence the array formulas.

Download the Sample File

Let’s start with the real basics.  You’re the manager of a video game store, and for recordkeeping, you’ve just created a spreadsheet listing the prices of your inventory of video game consoles.  Now you want the identical price listing to appear on another part of the spreadsheet, and here’s what you can do:

  1. Select an area identical in size to the source data array (in this case, 3 rows x 1 column)
  2. Type the “=” to get a formula started, then drag to select the source data array
  3. Press Ctrl+Shift+Enter to complete your first array formua

Note that, in the formula bar, a pair of brackets {} enclose your formula.  This tells you that it’s recognized as an array formula.

Note: if you pressed the Enter key without holding down Ctrl and Shift, the brackets would not appear and only the first element in the array would be recognized.  Additionally, you cannot type in those brackets.

An array does not have to be a one-dimensional vector.  You can also reproduce an entire table by employing a two-dimentional array (3 rows x 2 columns):

So what now?  Pretty boring and useless, huh?  For this type of tasks you would typically use a simple “=” formula and copy it across the table.  In fact, we’d usually recommend that as well.  But this approach has its advantages – because the entire array (1D or 2D) is produced by one formula, you cannot change any of its elements (try typing over any of the cells… Excel won’t let you).  When processing large amounts of data, it’s not always easy to check for human errors, i.e. if your coworker’s son messed up one of your 5,000 rows of identical formulas.  Array formulas help ensure data integrity.

Moving on… now it’s the end of the month and your intern compiled the sales records for the store.  But he listed the consoles horizontally on three rows instead of vertically!  To fix this, you use the TRANSPOSE() function to “rotate” the table… turning rows into columns and columns into rows.  This is a very handy function that’s strictly used with arrays.


To turn this 2×4 table into a 4×2 table


Use the TRANSPOSE() function… don’t forget, Ctrl+Shift+Enter!

Okay, before we wrap up this post, let’s look at something actually useful, i.e. more than simply duplicating values.  Given your price list (a 3×1 array) and your intern’s sales records (a 1×3 array), you now want to calculate the total revenue.  SUMPRODUCT() comes to mind since you want to multiply the price by unit for each product, and add them all together.

However, SUMPRODUCT() requires all of its input parameters to be of identical size.  If you simply entered in the row array and the column array, you’d get a #VALUE! error.  You could TRANSPOSE() one of the two arrays to match the dimension of the other, or you could simply use TRANSPOSE() within your SUMPRODUCT():

{=SUM([array1], TRANSPOSE([array2]))}

Note the brackets – don’t forget to press Ctrl+Shift+Enter.  This calculation gives you the total revenue by first turning the two arrays into the same dimension, and then apply the SUMPRODUCT() function.

Download the Sample File

In the next article, we’ll talk about array sums, which lead into a whole wide world of possibilities.

Let’s recap.

  1. You want to learn about Excel.
  2. You want to buy computer parts to build a PC.
  3. You research the prices at Best Buy and Circuit City.
  4. You research the prices at more stores.

You have optimized the computer part prices and determined that the lowest total cost is $640.63. Before you pat yourself on the back for your hard work, you must realize that although cost is important, it means little unless you determine where to shop. The following Exhibit cleanly indicates the best price option and its respective store.

Download the Sample File

One option to create the exhibit is to use a manual “select cell, highlight, change font” (rinse, repeat, recycle many many times). This manual process may be fine for small projects, but the task can quickly become tedious if information changes or expands quickly.  Let’s create the exhibit using Excel’s highly powerful and usable Conditional Formatting (Excel 2003: Format > Conditional Formatting).

  • If an additional “cheaper” column is available, this task is easy. Choose the appropriate cells and set the “Cell Value Is” to refer to the Cheaper column.  (In the example below, column I refers to the before mentioned “cheaper” column.)  I have chosen the “equal to” option, but there are many other options that are easy to understand.

Download the Sample File

  • Now, lets make the assumption that you have no interest in creating an additional “cheaper” column.  No problem.  Instead of the default “Cell Value Is“, change the condition type to “Formula Is“.  This allows you to write an Excel formula which must solve to True or False (or equivalent).  In the example below, columns C through G refer to prices specified in Best Buy through Walmart.

Download the Sample File

Conditional Formatting is a very useful tool. When working for a financial firm, I used Conditional Formatting to error check, show real time changes in Equity Prices (Bberg BDP function), and highlight cells for concern.  What is really nice about Conditional Formatting is that you can utilize all the popular Excel formulas.  In many ways, you can create instant feedback for your spreadsheets because you can set up cells to format automatically based on your input/formulas.  Unfortunately, there are restrictions.   Excel 2003 only allows 3 conditions, the size of the file will increase, and large spreadsheets can become noticeably slower.  Microsoft expanded this tool in Excel 2007 and offers >3 Conditional Formatting formulas and more intuitive default conditions.  I have yet to explore all the Excel 2007 additions, but perhaps this could serve as a post in the future!?  Thanks to Peter for his original post.

Note*  If you need to build a computer, do your own research.  These prices most likely do not reflect actual store prices!

Download the Sample File

Peter explained an useful array based formula in his post. I won’t repeat what he was trying to accomplish, but a piece of his post really got me thinking. So for my very first, real post, I will blog about a real issue that many of us will face at one point.

Here’s the scenario. You are the valedictorian from the USA Today’s #1 ranked Liberal Arts University in Northern Kentucky. Your summa cum laude in Art History has taken you to the most prestigious consulting firm, ABC Consultants LLC (“ABC”). Your manager, Peter, is very excited to have you on board and tells you about a really exciting project that he worked on last year. It deals with the super-innovative and enterprising firm that is breaking new grounds everywhere it turns. This firm is known as XYZ Corporation (“XYZ”). Peter is really proud of the work he did last year (after all, XYZ renewed its contract with ABC, so why not be proud Peter?) and directs you to a spreadsheet named “XYZ Model.xls“. He wants you to take a look at the model and understand it by tomorrow morning so you can have a long conversation with XYZ’s CEO about its methodologies. Before you leave, Peter mentions:

“XYZ Model.xls may have a ‘convoluted approach and a super long formula‘”.

Here’s that formula:

{=SUM(C4:C12*((C4:C12-D4:D12)/IF(C4:C12=D4:D12,1,
ABS(C4:C12-D4:D12))-1)/(-2)+D4:D12*((C4:C12-D4:D12)/
IF(C4:C12=D4:D12,1,ABS(C4:C12-D4:D12))+1)/2)}

Array formulas are great because you can cram a ton of information into one single cell. The problem is, they are often difficult to understand because they can be very involved! So where would you begin?

1) Organize the array formula. When programmers code, they love to use indenting to help organize their thoughts. Let’s use a similar approach:

{
_____=SUM
_____ (
__________C4:C12 *
__________(
_______________(C4:C12-D4:D12) /
_______________IF(C4:C12=D4:D12,1,ABS(C4:C12-D4:D12))
_______________-1
__________) /
__________(-2)

__________+

__________D4:D12 *
__________(
_______________(C4:C12-D4:D12) /
_______________IF(C4:C12=D4:D12,1,ABS(C4:C12-D4:D12))
_______________+1
__________ ) /
__________2
_____)
}

2) Look for patterns. Many Excel array formulas process similar cells of information. Let’s define variable lists A through E as the following:

A = C4:C12
B = D4:D12
C = C4:C12-D4:D12
D = IF(C4:C12=D4:D12,1,ABS(C4:C12-D4:D12))
Let’s also have … E = C ÷ D
_____Note (if we want to go wild)*
_____C = A – B, D = IF( A=B,1,ABS(C) )

3) Now that we have organized our formula and defined variables, we can combine the results from Steps 1 & 2 together to create a cleaner formula.

= ?(A x (E – 1) ÷ -2 + B x (E + 1) ÷ 2)
= ½ x ?(-A x (E – 1) + B x (E + 1)) | E = C ÷ D

4) The formula seen in Step #3 looks complex, but to me it looks a lot simpler (and less confusing) then the original formula!  Let’s take advantage of Excel and expand out the array formula.

Download the Sample File

5) Great.  Now that we have confirmed our result with the original array formula’s result, we are done, right?  No way!  Remember Peter’s initial request.  He would like you to be able to  hold a conversation about the model.  Let’s understand the formula!

= ½ x ?(-A x (E – 1) + B x (E + 1)) | E = C ÷ D

  • C is simply the difference between A and B.
  • D is the absolute value of C or 1 if C is zero.  Why is D not just zero when C is zero?  I will tell you soon.
  • E as mentioned is C ÷ D.  This results in E becoming either -1, 0, 1.  Back to our non-zero D riddle above.  Since E is C ÷ D, dividing by D = 0 would result in a #Div/0! (Divide by Zero) error.  Mathematically speaking, anything divided by zero gets everybody confused!
  • Now we know that E is -1, 0 or 1.  Since we have calculated E, lets forget about arrays C and D.
  • Let’s take a look at each scenario.
    • If A < B Then E = -1
      • = ½ x (-A x -2 + B x 0)
      • = ½ x (-A x -2)
      • = A
    • If A = B Then E =0

      • = ½ x (-A x -1 + B x 1)
      • = ½ x (A + B) |  A = B
      • = A or B (it does not matter!)
    • If B < A Then E = 1

      • = ½ x (A x 0 + B x 2)
      • = ½ x (B x 2)
      • = B
  • Based on the results above.  Peter has duplicated a Min like function for A and B.  This formula does achieve what Peter was originally set to accomplish “Summing Element-Wise Minimums“.

6) If you become completely confused by the formula, removing yourself from the spreadsheet will often clear your mind.  Do your best to understand what you can, but do not spin your wheels!  Don’t be shy.  Have a conversation with your manager (or a knowledgable colleague) to either confirm or gain further understanding.  As you gain practice and experience with array formulas, you will begin to understand these formulas quickly.  Just be sure when you create a model with complicated array formulas, document the process to avoid further confusion!

Download the Sample File

My intern came to me today and asked about displaying multiple lines within a cell.  I’ll write the first real post in his honor.

Most people know that when entering text in Excel, pressing ALT+ENTER makes a new line.  However, he needed to generate this multi-line string dynamically (via a formula).  The new line characters, carriage return (CR) and line feed (LF), are not as obvious in Excel as most programming languages make them.

Note: This refers to the ‘\r\n’ in C-styled languages and vbCrLf in Visual Basic. In many environments, however, only the latter (‘\n’ and vbLf) is necessary to create a new line.


Download the Sample File

One way to do this, which may be more intuitive to non-programmers, is to use the same ALT+ENTER key stroke by entering this formula:

= [first line] & “[press ALT+ENTER here]” & [second line]…

Another way to do this, which is longer but probably preferred by programmers, is to employ the new line characters (CR + LF):

= [first line] & CHAR(13) & CHAR(10) & [second line]…

Similar to aforementioned note, only CHAR(10) is necessary for the desired results.  Add CHAR(13) only if you have the pet peeve from certain programming styles.

This entry may sound obvious, but here’s the catch: by using the formula above, you probably will still see all your text concatenated on one line.  You still need to wrap text (Format -> Cells -> Alignment tab -> Wrap text) in order to see multiple lines!  Note: Excel automatically sets the wrap text property when you enter plain text (instead of a formula) with a line break in a cell.

Download the Sample File

Tags: , , , ,