Wednesday, September 25, 2019

To attempt or not to attempt - A risk management approach toward negative marking

1.0 Introduction

Some competitive examinations use negative marking. When a wrong answer costs as many marks as a right answer earns, the risk of losing marks is high. In multiple choice questions, there is always some probability of answering correctly by chance, and it increases when more than one option is right.

In risk analysis, we have the following formula for the expected value of a variable X.

$$E[X] = \sum_i p_i X_i$$  summed over all possible values X_i of X,

where
E[X] = expected value of X
p_i = probability of X taking the value X_i

2.0 Joint Entrance Exam Main (JEE Main) - A case study

2.1 Introduction - JEE Main


In India, we have the famous Joint Entrance Examination (JEE) for admission into various engineering, architecture and planning courses across the nation offered by various academic institutions run by or funded by the Government of India. This is conducted in two phases i.e. JEE Main and JEE Advanced. JEE Main is the basic qualifying examination for the said courses.

For 2020, the information bulletin for JEE Main can be found at the following link:


From the above bulletin (see pages 6 and 7), we find that every correct answer is awarded 4 marks and every wrong answer is awarded one negative mark (-1).

2.2 Definition: Uncertain question

Let us define an uncertain question as the MCQ for which the examinee doesn't know the right answer.

2.3 Answering uncertain questions at random


Going by the question paper for 2019, there are four options for each multiple choice question (MCQ).

The probability of choosing the right answer at random is 1/4=0.25.

The probability of choosing the wrong answer at random is 3/4=0.75.

So, the expected value of attempting a question by randomly choosing an answer is 0.25*4 - 0.75*1 = 0.25.
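The expected value above can be checked with a short Python sketch. The function names `expected_mark` and `break_even_probability` are illustrative, and the default arguments assume the +4/-1 marking scheme quoted from the bulletin.

```python
def expected_mark(p_correct, reward=4, penalty=1):
    """Expected marks from attempting one question.

    p_correct: probability of picking the right option.
    reward / penalty: marks gained for a right answer, lost for a wrong one.
    """
    return p_correct * reward - (1 - p_correct) * penalty


def break_even_probability(reward=4, penalty=1):
    """Probability of a right answer at which the expected mark is zero."""
    return penalty / (reward + penalty)


random_guess = expected_mark(1 / 4)   # four options, one picked at random
threshold = break_even_probability()  # 1 / (4 + 1)

print(f"Expected mark for a random guess: {random_guess:.2f}")
print(f"Break-even probability: {threshold:.2f}")
```

Under this scheme, any question on which the chance of being right exceeds the break-even value has a positive expected mark.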

This means, one may still score positively by taking a chance. However, the probability of the answer being wrong is very high.

When four questions are answered wrongly and one is answered rightly out of five, the outcome is 4*(-1)+4=0. From this, it is evident that marks are lost only when fewer than 1/5th of the uncertain questions attempted are answered correctly.

When p is the probability of answering a question correctly, 1-p is the probability of answering it wrong. In this case, p and 1-p are 0.25 and 0.75 respectively.

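The probability of answering r or fewer questions correctly out of n is

$$P(X \le r) = \sum_{i=0}^{r} \binom{n}{i} p^i (1-p)^{n-i}$$

where i is the number of right answers.

When n = 1 to 5, a negative total is scored only when i = 0, that is, when every answer is wrong.
The probability of obtaining negative marks for n = 1 to 5 is as follows: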


| n | P |
|---|---|
| 5 | 0.237 |
| 4 | 0.316 |
| 3 | 0.422 |
| 2 | 0.563 |
| 1 | 0.750 |

Each question paper has 20 MCQs. If all 20 are answered at random, the total is negative when fewer than 4 answers are right, i.e. when the number of right answers is 3 or less.

The probability of obtaining negative marks when 20 questions are answered at random is P=0.225.
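The probabilities quoted in this section can be reproduced with a few lines of Python. This is a minimal sketch, assuming independent guesses and the +4/-1 scheme; the function name `prob_negative` is illustrative.

```python
from math import comb


def prob_negative(n, p, reward=4, penalty=1):
    """Probability that the total over n guessed questions is below zero.

    Adds up the binomial probability of every count of right answers i
    for which reward*i - penalty*(n - i) is negative.
    """
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n + 1)
        if reward * i - penalty * (n - i) < 0
    )


# Random guessing among four options: p = 0.25
for n in (1, 2, 3, 4, 5, 20):
    print(f"n={n:2d}  P(negative)={prob_negative(n, 0.25):.4f}")
```

The values are printed to four decimals; the figures in the text are the same numbers rounded to three.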

From this, the probability of scoring zero or positive marks on the uncertain questions is higher than the probability of scoring negative, except when there are only one or two such questions. Still, a good amount of risk is involved. Whether one can take this much risk depends on the future plans one has.

2.4 Making intelligent guesses for MCQs

The risk of scoring negative can be reduced by guesstimating, that is, by choosing the option that appears most likely to be right.

Assumption: Let us assume the probability of choosing the right answer increases to p=0.5 when one makes an intelligent guess combined with subject matter knowledge.


In this case, the expected value of attempting one uncertain question is 0.5*4 - 0.5*1 = 1.5, which is much higher than before.

Now, the probability of a negative score when 8 uncertain questions are answered is P=0.035. This is below the 0.05 threshold that corresponds to the commonly used 95% confidence level (probability of 1-0.05). This means we can say with a confidence level of 0.965 (1-0.035), or 96.5%, that making intelligent guesses will either improve one's score or at least do no harm when the number of uncertain questions is 8 or more.

Even when there is only one uncertain question, the probability of scoring a negative mark is only 0.5, provided the assumption made above holds. For a higher number of uncertain questions, it will be less than 0.5, which means the probability of a net benefit or no loss from the whole exercise is higher.
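The figures for intelligent guessing can be tabulated in the same way. The sketch below assumes p = 0.5 as in the assumption above and the +4/-1 scheme; `binom_cdf` and `max_right_with_negative_score` are illustrative helper names.

```python
from math import comb

P_GUESS = 0.5  # assumed success rate of an intelligent guess (section 2.4)


def binom_cdf(r, n, p):
    """P(X <= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r + 1))


def max_right_with_negative_score(n, reward=4, penalty=1):
    """Largest number of right answers that still leaves the total below zero."""
    r = -1
    while reward * (r + 1) - penalty * (n - (r + 1)) < 0:
        r += 1
    return r


for n in range(1, 13):
    r_max = max_right_with_negative_score(n)
    risk = binom_cdf(r_max, n, P_GUESS)
    print(f"n={n:2d}  P(negative)={risk:.3f}  confidence={1 - risk:.3f}")
```

Changing `P_GUESS` shows how the confidence level moves with the quality of the guesses.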

3.0 Applicability of the case study

The above case study is applicable in similar cases when the following conditions are met:

  1. There are four options, of which only one is correct. When more than one option is correct, the risk comes down.
  2. The ratio of positive marking to negative marking is 4:1.

4.0 Conclusions

  1. For the above case study, it is shown that the expected value from taking the risk of attempting uncertain questions is always positive.
  2. The probability of scoring positive is higher than the probability of scoring negative, when more than two multiple choice questions are answered at random.
  3. The probability of scoring positive is higher than the probability of scoring negative when more than one multiple choice question is answered by making an intelligent guess, when one can guesstimate correctly with a probability of 0.5.
  4. The risk of scoring negative marks comes down when one makes an intelligent guess of the answer rather than selecting the answer at random.
  5. When 8 or more questions are answered by an intelligent guess, there is a very good possibility of scoring positive or suffering no loss from the guess work, when the probability of guesstimating the right answer is 0.5.

5.0 Path forward

  1. One can do similar math when the number of options and/or the marking scheme are different.
  2. The confidence level of scoring no negative mark can be recalculated by varying the probability of success for an intelligent guess.
  3. Coaching centers can find out the probability of their students guessing the answer right in each subject separately and advise the students individually. As they have math and stat experts, they can even come up with better models (Good if they are already doing it, but I have not heard of any).
  4. Coaching centers also need to teach and demonstrate how to make better guesses. Those preparing themselves for competitive exams on their own need to develop their own strategies.
  5. When one has multiple options for pursuing higher studies, or multiple career options, one can take bigger risks, as bigger rewards may come from bigger risks. 















Tuesday, September 17, 2019

Importance of data literacy for non-IT engineers - Some random thoughts

The big question(s)

Often one may ask: why should I get into all this statistics, six sigma, data analytics and similar jargon to do my day-to-day engineering work? I know my stuff and have years of experience behind me. Can a statistical model interpret or predict better than me? Don't we lose focus by doing all this? Will it not require extra effort?

Well, my article may not provide an answer to each and every question a traditional engineer like many of us may have, but it will try to address the concerns and apprehensions at a broader level.

What is data analytics and where is it applied?

I do not want to provide any standard textbook or classroom definition of what is meant by data analytics, but one can always find it by googling or referring to the right resources available aplenty online. Many are aware of it by now, and many aren't aware too!

With or without saying, analytics is applied everywhere. When you have numerical data, as an engineer, you will at least try to find its mean and its range (maximum and minimum).
This is a very elementary application of data analytics. As you probe further, you will find other parameters like standard deviation, median, mode etc. which statistically describe your data. For large populations, you will do sampling. This is followed by finding the probability distributions associated with your data, making inferences, and the journey continues...
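As a small illustration of these elementary descriptors, here is a sketch using Python's standard `statistics` module. The sample values are hypothetical and serve only to show the calls.

```python
import statistics

# Hypothetical sample: five measured strengths in MPa (illustrative values only)
strengths = [28.0, 30.0, 30.0, 32.0, 35.0]

summary = {
    "mean": statistics.mean(strengths),
    "range": (min(strengths), max(strengths)),
    "median": statistics.median(strengths),
    "mode": statistics.mode(strengths),
    "std_dev": statistics.stdev(strengths),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```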

From the above examples, it is evident that the concept of data analytics is not at all new. Analytics at its most elementary level does not require a computer at all!

I am a non-IT engineer. Why do I need it?

In engineering, a lot of data gets generated. One good example is traffic studies. For a structured review and for drawing meaningful conclusions, one needs to apply analytics.

Well, I design a structure or machine. I have my formulas and software. When I use the formulas rightly, I will get the right output. Why do I need analytics? One may say.

In engineering, many of the formulas are either complex or empirical. When we use such formulas, it is always essential to study the variation in the output as the inputs vary.

When the formulas are not derived from first principles but were empirically arrived at from observations, regression etc., it will be to one's own advantage to study and analyze the data generated by those formulas. This enables one to have control over one's designs and take informed decisions. 

If an equation is empirically arrived at using regression, then studying it further generates additional insights. One can make new models, re-validate the equation, or may even come up with a better equation!

We often find these types of empirical equations in fluid mechanics, for example. Unfortunately, in most of the textbooks I studied during my engineering course in the 90s, the rationale behind the empirical formulas was explained almost nowhere.

Even when you have a derived complex formula, the formula will be based on several assumptions. Also, inputs like material properties have their own confidence levels attached to them. A statistical study will bail us out from the complexities of the equation as we get to study the numerical data and the trends it follows.

Needless to say, the data from experimentation and observations needs to be studied statistically, which many do, with or without calling it analytics. If a statistical study is not performed, the findings cannot be relied upon, and it is high time one realized that.

In any case, one can always find out which inputs influence the output most. This can be achieved by design of experiments and sensitivity analyses.

Two types of engineering data


Without getting into the standard technical terms in data analytics, I would classify the numerical data into two types from an engineer's perspective. The first one is visible data, abundantly available everywhere. The second type of data is invisible data, an example of which is hidden in the form of formulas and gets generated only when you calculate the output using those formulas and varying the inputs.

The visible data needs to be handled, using relevant data analytics techniques.

The invisible data can also be handled the same way, by generating data, arranging it in tables, plotting it graphically for visualization and then analyzing.
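To make the idea of "invisible" data concrete, here is a rough sketch that generates data from a formula and then checks which input moves the output most. Manning's equation for open-channel flow is used purely as an example of an empirical formula; it is not discussed in this article, and all input values are hypothetical.

```python
def manning_velocity(n, hydraulic_radius_m, slope):
    """Mean flow velocity in m/s from Manning's equation (SI units)."""
    return (1.0 / n) * hydraulic_radius_m ** (2.0 / 3.0) * slope ** 0.5


# Hypothetical base case: roughness coefficient, hydraulic radius, bed slope
base = {"n": 0.013, "hydraulic_radius_m": 0.5, "slope": 0.001}
base_velocity = manning_velocity(**base)

# Step 1: generate a table of data by varying one input (bed slope)
table = [
    (s, manning_velocity(base["n"], base["hydraulic_radius_m"], s))
    for s in (0.0005, 0.001, 0.002, 0.004)
]
for s, v in table:
    print(f"slope={s:.4f}  velocity={v:.3f} m/s")

# Step 2: one-at-a-time sensitivity, % change in velocity for +10% in each input
sensitivity = {}
for name in base:
    bumped = dict(base, **{name: base[name] * 1.10})
    sensitivity[name] = (manning_velocity(**bumped) / base_velocity - 1) * 100

for name, pct in sensitivity.items():
    print(f"+10% in {name}: {pct:+.2f}% change in velocity")
```

The table from step 1 can then be plotted or analyzed like any measured data set.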

Sources of visible data include data from experimentation, published and validated data, data gathered from observations and so on.

Sources of invisible data include not only equations, but also computer simulations. For example, the output data generated by a finite element software can be put in this category.

Should I test my data when there is a clear formula?

People say "seeing is believing", but seeing is not always believing. Hence the data needs to be put to statistical tests. This helps in decision making for those who are unsure of the direction in which to go. Even for those who are "sure" by experience, it helps in re-validating their experience-based knowledge.

Examples of some applications

In manufacturing engineering, one often uses the principles of six sigma. There are many statistical techniques that are common between six sigma and data analytics. By exposing oneself to data analytics, one will have a wider range of tools to handle one's data.

Design of experiments is often conducted in engineering, particularly mechanical engineering.

I see the term sensitivity analysis often being used in the analysis of high rise buildings. There are some who use this term loosely just because the standards use it. Maybe the standards, and the industry leaders who develop them, need to elaborate more on what is meant by sensitivity analysis, in line with its standard definition. It will be of great help if a standardized definition is included in the Indian tall concrete buildings code IS 16700.

As one is aware, advanced concepts like artificial intelligence, machine learning etc. are used in all industries.

Epilogue


One may say what I have written is nothing but statistics and hence old wine in new bottle! I would rather call it old nectar in new bottle, often left untasted by many.

The techniques that I have discussed, and even more complex techniques, have traditionally been used by researchers in engineering. But knowledge and application of the basic concepts at working level will be of great help, as discussed, and will elevate society from a computer literate society to the next level of a data literate society.

Path forward


Below is the path forward I would suggest for promoting data literacy among engineers.

  1. Engineers should consider refreshing their undergraduate level statistics knowledge as part of their Continuous Professional Development.
  2. Guidelines should be framed to include a minimum number of questions on statistics and probability in exams like Professional Engineer's exams.
  3. A fixed number of minimum Professional Development Units (PDUs) in statistics and probability/data analytics/data science should be prescribed for retaining a Professional Engineer's or a project manager's certification.
  4. Non-IT engineering students interested in analytics with the intention of using it in their core field or making it their career, can consider doing a project in analytics (wait for my next article).
  5. Authors of engineering textbooks should include brief notes on how the empirical formulas have been arrived at, and provide reference to the original sources.



































Saturday, September 7, 2019

Higher concrete grade or more reinforcement? - A data driven approach for optimal shear strength

1. Introduction


Many a time the structural engineer has to choose between higher concrete grade and higher percentage of reinforcement. Some believe that increasing the concrete grade can result in savings and some intuitively view a high concrete grade to be very uneconomical for a small project.

The shear strength of a concrete section varies with the % of reinforcement and the grade of concrete.

In this article, let us examine whether a higher concrete grade or a higher reinforcement % makes a significant difference to the shear strength of the section, by referring to the Indian concrete code IS 456:2000.

2.0 Data collection

2.1 Shear strength:

The below table (table 1) from IS 456:2000 is our data to be examined.

The parameters As, b and d in table 1 are defined as below:

As=area of tension reinforcement
b=width of the concrete member
d=effective depth of the concrete member (overall depth minus distance between reinforcement centroid and extreme concrete fiber on tension side).

Table 1



2.2 Material price

The below material prices have been assumed.

Table 2  

                                                     
| Grade of Conc. | Price (Rs/cum) | Shuttering (Rs/Sqm) |
|---|---|---|
| M15 | 4500 | 500 |
| M20 | 4800 | 500 |
| M25 | 5000 | 500 |
| M30 | 5500 | 500 |
| M35 | 5800 | 500 |
| M40 | 6000 | 500 |

Reinforcement cost (Fe500): Rs. 60000/- per ton

2.3 Data clean-up

In the absence of an excel file for the shear strength table, the above picture was converted into Excel by using the below online character recognition tool.


The data converted into excel had some decimal points missing, and some numbers not properly recognized. The data table being small, this was cleaned up by visual inspection.

3.0 Case study

Let us consider a one way slab of 200 mm thickness. Assume 20mm clear cover and 16mm dia rebar. Then effective depth of the section d=200-20-16/2=172 mm.

Our shear strength studies will be based on this slab.

4.0 Assumptions

The following assumptions have been made.
  1. Maximum concrete grade is M40.
  2. Minimum % of reinforcement is 0.15.
  3. Maximum % of reinforcement is 3.0
  4. The depth of the section is constrained to be constant.
  5. Costing data is assumed.
  6. The slab requires no shear links.

5.0 Exclusion

The following is excluded:
  • Distribution reinforcement is not considered in estimation and costing.

6.0 Data generation

For the slab under consideration, the following data is further generated from the shear strength and material pricing data collected.

6.1 Material price of reinforced concrete

The following material prices have been arrived at per cu m.

Table 3  



Cost of reinforced concrete in Rs. per cu m

| 100As/bd | M15 | M20 | M25 | M30 | M35 | M40 |
|---|---|---|---|---|---|---|
| 0.15 | 7608 | 7908 | 8108 | 8608 | 8908 | 9108 |
| 0.25 | 8013 | 8313 | 8513 | 9013 | 9313 | 9513 |
| 0.5 | 9025 | 9325 | 9525 | 10025 | 10325 | 10525 |
| 0.75 | 10038 | 10338 | 10538 | 11038 | 11338 | 11538 |
| 1 | 11051 | 11351 | 11551 | 12051 | 12351 | 12551 |
| 1.25 | 12063 | 12363 | 12563 | 13063 | 13363 | 13563 |
| 1.5 | 13076 | 13376 | 13576 | 14076 | 14376 | 14576 |
| 1.75 | 14089 | 14389 | 14589 | 15089 | 15389 | 15589 |
| 2 | 15101 | 15401 | 15601 | 16101 | 16401 | 16601 |
| 2.25 | 16114 | 16414 | 16614 | 17114 | 17414 | 17614 |
| 2.5 | 17127 | 17427 | 17627 | 18127 | 18427 | 18627 |
| 2.75 | 18139 | 18439 | 18639 | 19139 | 19439 | 19639 |
| 3 | 19152 | 19452 | 19652 | 20152 | 20452 | 20652 |
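The article does not spell out how the figures in Table 3 were computed. The sketch below is one reading that appears to reproduce them, under assumptions that are inferred rather than stated: steel density of 7.85 t per cu m, shuttering charged on the slab soffit only, and steel volume taken as the stated percentage of b x d. All names in the code are illustrative.

```python
SLAB_THICKNESS_M = 0.200         # one-way slab from the case study
EFFECTIVE_DEPTH_M = 0.172        # d = 200 - 20 - 16/2 mm
STEEL_DENSITY_T_PER_M3 = 7.85    # assumption, not stated in the article
STEEL_PRICE_RS_PER_T = 60000     # Fe500 price from section 2.2
SHUTTERING_RS_PER_M2 = 500       # from Table 2
CONCRETE_RS_PER_M3 = {           # from Table 2
    "M15": 4500, "M20": 4800, "M25": 5000,
    "M30": 5500, "M35": 5800, "M40": 6000,
}


def rc_cost_per_m3(grade, pct_steel):
    """Cost in Rs. of 1 cu m of slab: concrete + soffit shuttering + main steel."""
    concrete = CONCRETE_RS_PER_M3[grade]
    # 1 cu m of a 200 mm slab covers 1 / 0.2 = 5 sq m of soffit (assumption)
    shuttering = SHUTTERING_RS_PER_M2 / SLAB_THICKNESS_M
    # Steel volume per cu m of concrete, with pct_steel = 100 As / (b d)
    steel_volume = pct_steel / 100 * EFFECTIVE_DEPTH_M / SLAB_THICKNESS_M
    steel = steel_volume * STEEL_DENSITY_T_PER_M3 * STEEL_PRICE_RS_PER_T
    return concrete + shuttering + steel


base_cost = rc_cost_per_m3("M15", 0.15)
for grade in CONCRETE_RS_PER_M3:
    cost = rc_cost_per_m3(grade, 0.15)
    print(f"{grade}: Rs. {cost:.0f} per cu m, normalized {cost / base_cost:.2f}")
```

Dividing each cost by the M15 @ 0.15% value, as in the last line, gives the normalized price used later in the article.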

6.2 Normalized material price of reinforced concrete

In table 3 above, M15 @ 0.15% reinforcement costs the least. Let us normalize the reinforced concrete costs with respect to the same, presented in the below table.

Table 4



Cost of reinforced concrete per cu m (normalized w.r.t. M15 @ 0.15% As)

| 100As/bd | M15 | M20 | M25 | M30 | M35 | M40 |
|---|---|---|---|---|---|---|
| 0.15 | 1.00 | 1.04 | 1.07 | 1.13 | 1.17 | 1.20 |
| 0.25 | 1.05 | 1.09 | 1.12 | 1.18 | 1.22 | 1.25 |
| 0.5 | 1.19 | 1.23 | 1.25 | 1.32 | 1.36 | 1.38 |
| 0.75 | 1.32 | 1.36 | 1.39 | 1.45 | 1.49 | 1.52 |
| 1 | 1.45 | 1.49 | 1.52 | 1.58 | 1.62 | 1.65 |
| 1.25 | 1.59 | 1.63 | 1.65 | 1.72 | 1.76 | 1.78 |
| 1.5 | 1.72 | 1.76 | 1.78 | 1.85 | 1.89 | 1.92 |
| 1.75 | 1.85 | 1.89 | 1.92 | 1.98 | 2.02 | 2.05 |
| 2 | 1.99 | 2.02 | 2.05 | 2.12 | 2.16 | 2.18 |
| 2.25 | 2.12 | 2.16 | 2.18 | 2.25 | 2.29 | 2.32 |
| 2.5 | 2.25 | 2.29 | 2.32 | 2.38 | 2.42 | 2.45 |
| 2.75 | 2.38 | 2.42 | 2.45 | 2.52 | 2.56 | 2.58 |
| 3 | 2.52 | 2.56 | 2.58 | 2.65 | 2.69 | 2.71 |

6.3 Shear strength per unit normalized price of reinforced concrete

By dividing the shear strength in table 1 by the unit normalized price of reinforced concrete in table 4, let us arrive at the strength we are achieving per unit concrete price (normalized), as in table 5 below.

Table 5




Shear strength per unit normalized price (MPa)

| 100As/bd | M15 | M20 | M25 | M30 | M35 | M40 |
|---|---|---|---|---|---|---|
| 0.15 | 0.28 | 0.27 | 0.27 | 0.26 | 0.25 | 0.25 |
| 0.25 | 0.33 | 0.33 | 0.32 | 0.31 | 0.30 | 0.30 |
| 0.5 | 0.39 | 0.39 | 0.39 | 0.38 | 0.37 | 0.37 |
| 0.75 | 0.41 | 0.41 | 0.41 | 0.41 | 0.40 | 0.40 |
| 1 | 0.41 | 0.42 | 0.42 | 0.42 | 0.41 | 0.41 |
| 1.25 | 0.40 | 0.41 | 0.42 | 0.41 | 0.42 | 0.42 |
| 1.5 | 0.40 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
| 1.75 | 0.38 | 0.40 | 0.41 | 0.40 | 0.41 | 0.41 |
| 2 | 0.36 | 0.39 | 0.40 | 0.40 | 0.40 | 0.40 |
| 2.25 | 0.34 | 0.38 | 0.39 | 0.39 | 0.39 | 0.40 |
| 2.5 | 0.32 | 0.36 | 0.38 | 0.38 | 0.38 | 0.39 |
| 2.75 | 0.30 | 0.34 | 0.37 | 0.37 | 0.38 | 0.38 |
| 3 | 0.28 | 0.32 | 0.36 | 0.36 | 0.37 | 0.37 |

7.0 Data visualization

From table 5, we see that the shear strength achieved per unit price initially increases as the % of reinforcement increases and then starts decreasing. This happens for all concrete grades, from M15 through M40. On the other hand, the strength per unit price decreases as the concrete grade increases at lower values of 100As/bd. This reverses at higher values of 100As/bd, with the strength per unit price increasing with concrete grade.

Since the strength per unit price follows a similar trend against 100As/bd for all concrete grades, let us plot the mean shear strength per unit price across all grades considered against 100As/bd, as in the chart below.



8.0 Data analytics

8.1 Correlation

From the above data visualization, we understand that there is a steep rise in the shear strength per unit price as the reinforcement increases up to 1%, followed by a relatively mild fall.

The correlation coefficient calculated in Excel works out to be around 0.208, which indicates mild positive correlation. 
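The Excel calculation can be cross-checked in Python. This sketch assumes the coefficient is the Pearson correlation between 100As/bd and the mean of each row of Table 5 (the quantity plotted in section 7.0); the article does not state this explicitly. The variable and function names are illustrative.

```python
from math import sqrt

steel_pct = [0.15, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3]

# Table 5: one row per 100As/bd value, columns M15 to M40
table5 = [
    [0.28, 0.27, 0.27, 0.26, 0.25, 0.25],
    [0.33, 0.33, 0.32, 0.31, 0.30, 0.30],
    [0.39, 0.39, 0.39, 0.38, 0.37, 0.37],
    [0.41, 0.41, 0.41, 0.41, 0.40, 0.40],
    [0.41, 0.42, 0.42, 0.42, 0.41, 0.41],
    [0.40, 0.41, 0.42, 0.41, 0.42, 0.42],
    [0.40, 0.41, 0.41, 0.41, 0.41, 0.41],
    [0.38, 0.40, 0.41, 0.40, 0.41, 0.41],
    [0.36, 0.39, 0.40, 0.40, 0.40, 0.40],
    [0.34, 0.38, 0.39, 0.39, 0.39, 0.40],
    [0.32, 0.36, 0.38, 0.38, 0.38, 0.39],
    [0.30, 0.34, 0.37, 0.37, 0.38, 0.38],
    [0.28, 0.32, 0.36, 0.36, 0.37, 0.37],
]

row_means = [sum(row) / len(row) for row in table5]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson(steel_pct, row_means)
print(f"Correlation between 100As/bd and mean strength per unit price: {r:.3f}")
```

Because Table 5 holds values rounded to two decimals, the result should come out close to the reported figure rather than identical to it.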

8.2 ANOVA

Though ANOVA is ideally for categorical variables, it can also be used against continuous independent variables as in this case.

By performing single factor ANOVA across the rows and across the columns in Excel, below are the results obtained.

Table 6






Table 7


9.0 Interpretation of results
  1. The mild positive correlation in the chart plotted is due to the initial steep increase in the value of strength per unit price followed by the subsequent, relatively mild decline.
  2. From the ANOVA in Table 6, the P-value is very high across the columns i.e. for varying concrete grades. So we fail to reject the null hypothesis and infer that the difference in the mean values of shear strength per unit price across varying concrete grades is not statistically significant.
  3. From the ANOVA in Table 7, the P-value is very low and nearly zero across rows i.e. for varying % reinforcement. So we reject the null hypothesis and infer that the difference in the mean values of shear strength per unit price across varying % reinforcement is statistically significant.
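The single factor ANOVA of section 8.2 can be sketched in plain Python using the values of Table 5. The code below computes only the F statistic for each grouping; obtaining a P-value additionally needs the F distribution, for which a library routine such as SciPy's `scipy.stats.f_oneway` could be used. The function name `one_way_f` is illustrative.

```python
# Table 5: one row per 100As/bd value (0.15 to 3), columns M15 to M40
table5 = [
    [0.28, 0.27, 0.27, 0.26, 0.25, 0.25],
    [0.33, 0.33, 0.32, 0.31, 0.30, 0.30],
    [0.39, 0.39, 0.39, 0.38, 0.37, 0.37],
    [0.41, 0.41, 0.41, 0.41, 0.40, 0.40],
    [0.41, 0.42, 0.42, 0.42, 0.41, 0.41],
    [0.40, 0.41, 0.42, 0.41, 0.42, 0.42],
    [0.40, 0.41, 0.41, 0.41, 0.41, 0.41],
    [0.38, 0.40, 0.41, 0.40, 0.41, 0.41],
    [0.36, 0.39, 0.40, 0.40, 0.40, 0.40],
    [0.34, 0.38, 0.39, 0.39, 0.39, 0.40],
    [0.32, 0.36, 0.38, 0.38, 0.38, 0.39],
    [0.30, 0.34, 0.37, 0.37, 0.38, 0.38],
    [0.28, 0.32, 0.36, 0.36, 0.37, 0.37],
]


def one_way_f(groups):
    """F statistic of a single factor ANOVA over a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


by_reinforcement = table5                          # groups = rows
by_grade = [list(col) for col in zip(*table5)]     # groups = columns

f_rows = one_way_f(by_reinforcement)
f_cols = one_way_f(by_grade)
print(f"F across % reinforcement (rows): {f_rows:.2f}")
print(f"F across concrete grades (columns): {f_cols:.2f}")
```

A large F across rows together with a small F across columns is consistent with the interpretation given above.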

10.0 Conclusions

The below conclusions are drawn from this study.
  1. It is the % of reinforcement that influences the shear strength achieved per unit of currency, more than the grade of concrete.
  2. From data visualization, it can be concluded that the % reinforcement has a positive effect on the cost effectiveness of the cross-section with respect to shear strength only up to a certain % of reinforcement.

11.0 Practical Applications

  1. Reinforcement % increased up to some percentage (1% in the present study) in a flexural member will improve the cost efficiency of the section in terms of its shear strength.
  2. Where the section design is governed by shear strength, it will be judicious to increase the reinforcement only up to some percentage. Beyond that, other options like revising the section dimensions/slab depth etc. need to be explored.