Friday, October 4, 2019

Undergrad data science projects by non-IT engineering students - A few tips

1.0 Introduction

As some people rightly say, "Data is the new oil". It does not necessarily mean one can make a lot of money by getting into data science, though it is one of the lucrative career options at this point in time. Data is the new oil, or a new ore, for that matter. Just as an ore can be extracted in many different ways, from manual excavation to the most modern mechanized and automated operations, data can be explored in many ways, from finding the mean and standard deviation of a small data set to applying the most modern computational techniques.

In India, there is a wide gap between the supply of and demand for engineers, with engineering colleges in even the remotest parts of the country. It is practically impossible for all the graduates to get jobs in their own fields of study, even if the curriculum improves significantly enough to make students industry-ready by the time they pass out.

Just as the engineers of the last 2-3 decades had to become computer literate by the time they passed out of college, even to survive in their own fields, the engineers of the present and coming generations need to be data literate. This data literacy can be achieved without extensive effort, unless one wishes to be highly specialized in data science.

Right from high school, students of mathematics are exposed to the rudiments of probability and statistics, and so are engineering students at the undergrad level! Applying these concepts to the data in their own field of study enhances their understanding of the subject matter and enables them to make data-driven decisions. It also opens an additional avenue for job opportunities.

1.1 Target readers

The target readers of this blog article are engineering students of non-IT streams who want to improve their data skills, as well as faculty members and managements who wish to improve their campus placements.

2. Developing the skills

A non-IT engineering student already has many of the skills necessary for data science: one studies advanced concepts of probability and statistics as part of engineering mathematics, and acquires a good amount of coding skill. Is that enough?

2.1 Courses

To understand the terminology used in, and get acquainted with, state-of-the-art data science, an engineering undergraduate can take some online courses. To enroll for a new course offered by the IITs and get certified, one can visit www.swayam.gov.in. To listen to archived courses, one can visit www.nptel.ac.in. Search using the keywords "Data Science" and "Data Analytics" and you will find many courses, from introductory to advanced. A beginner can start with the introductory ones.

One can also go through the video lectures on Coursera, Udacity, etc.


2.2 Coding

One can use any programming language to write a program for solving the equations. However, to capitalize on the several readily available libraries, one can learn Python and/or R.

2.3 Software packages

Students have the privilege of downloading many software packages from the net for free by furnishing their credentials. Look for the below:
1. Student versions available for free
2. Packages available with your academic institution
3. The Analysis ToolPak available as an add-in in MS Excel. See the link below to find how to activate it:

Get in touch with experts in the field for their opinion which can include your teachers in the IT/Computer Science/Mathematics/Statistics departments.


3. Applying the skills in your project work

One's undergraduate project work is a good opportunity to apply the skills. To infuse data analytics into one's own project work, the following steps can be taken.

3.1 Define your problem

This can be as visualized by you or your project guide. You may also let your guide know that you are interested in a project that involves some sort of data analysis.

3.2 Gather data

This can be done in several ways: from experimentation, from data generated using engineering formulas, from the literature, or from databases available in records, standards, archives, etc.

3.3 Visualize the data

This can be achieved by plotting the data. If the data set is too large, you may want to do sampling first.
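As a minimal sketch of sampling and plotting, the snippet below uses pandas and matplotlib on a synthetic data set (the column name "load_kN" and the numbers are stand-ins for whatever data your project generates):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Stand-in for real project data: 10,000 synthetic "measurements".
rng = np.random.default_rng(0)
df = pd.DataFrame({"load_kN": rng.normal(100, 15, 10_000)})

# For a large data set, visualize a random sample first.
sample = df.sample(n=500, random_state=42)

sample["load_kN"].hist(bins=30)
plt.xlabel("Load (kN)")
plt.ylabel("Frequency")
plt.savefig("load_histogram.png")
```

A histogram like this is often enough to spot outliers and get a feel for the spread before any formal analysis.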

3.4 Analyze the data

How much data - the entire data set or a sample? What is the right tool? How do we test? The answers depend on the nature of your problem, the number of factors involved, the size of the data, etc. Here, the knowledge gained from the coursework will help.

3.5 Interpret the data


After the analysis, interpret what you have observed, i.e. the results of the analysis.

3.6 Develop solution

If your problem definition is to find a trend, you may not have to execute this step and can conclude your project at the previous step itself. If you have found the cause of the problem, you will have to present a solution. Note that it is generally expected in the engineering profession that the person who finds a problem also finds its solution!

3.7 Test the solution

Once you have developed a solution, you will have to test it by iterating: reanalyze the new data set and check whether the solution works.

3.8 Conclusion and presenting the report

Like in any other project work, this is the final step!

3.9 DMAIC - A popular technique

Define, Measure, Analyze, Improve and Control (DMAIC) is one popular technique with good scope for applying data analytics skills. To understand and appreciate the process, visit:
https://asq.org/quality-resources/dmaic

4. Summary and additional thoughts

  • Brush up your probability and statistics basics.
  • Do some good data analytics/data science courses online, preferably from swayam.gov.in and/or nptel.ac.in.
  • Apply the skills in your project work.
  • Interact with your peers in the IT stream who should be doing some data science courses as part of their curriculum.
  • Choose your mathematics/statistics/IT faculty member as your co-guide.

5. Note for the academic institutions

Showcase the skills of your students to the prospective recruiters and invite more and more companies from both IT and non-IT streams to your campus!

Wednesday, September 25, 2019

To attempt or not to attempt - A risk management approach toward negative marking

1.0 Introduction

In some competitive examinations, there is negative marking. When equal weightage is given to right and wrong answers, the risk of losing marks by answering wrongly is high. In the case of multiple choice questions, there is always some probability of answering a question correctly, which increases when more than one option is right.

In risk analysis, we have the following formula for the expected value of a variable X.

E[X] = Σ pᵢXᵢ, summed over all possible values Xᵢ of X,

where
E[X] = expected value of X
pᵢ = probability of the value of X being Xᵢ
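As a quick numeric illustration of the formula, the expected value can be computed in a couple of lines of Python; the values and probabilities used here are those of the random-guessing case examined below (+4 marks with probability 0.25, -1 mark with probability 0.75):

```python
# Expected value E[X] = sum of p_i * X_i over all possible values.
values = [4, -1]          # marks for a right / wrong answer
probabilities = [0.25, 0.75]

expected = sum(p * x for p, x in zip(probabilities, values))
print(expected)  # 0.25
```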

2.0 Joint Entrance Exam Main (JEE Main) - A case study

2.1 Introduction - JEE Main


In India, we have the famous Joint Entrance Examination (JEE) for admission into various engineering, architecture and planning courses across the nation offered by various academic institutions run by or funded by the Government of India. This is conducted in two phases i.e. JEE Main and JEE Advanced. JEE Main is the basic qualifying examination for the said courses.

For 2020, the information bulletin for JEE Main can be found at the following link:


From the above bulletin (see pages 6 and 7), we find that every correct answer is awarded 4 marks and every wrong answer attracts one negative mark (-1).

2.2 Definition: Uncertain question

Let us define an uncertain question as the MCQ for which the examinee doesn't know the right answer.

2.3 Answering uncertain questions at random


Going by the question paper for 2019, there are four options for each multiple choice question (MCQ).

The probability of choosing the right answer at random is 1/4=0.25.

The probability of choosing the wrong answer at random is 3/4=0.75.

So, the expected value of attempting a question by randomly choosing an answer is 0.25×4 − 0.75×1 = 0.25 marks.

This means, one may still score positively by taking a chance. However, the probability of the answer being wrong is very high.

When four questions are answered wrongly and one question rightly out of five, the outcome is 4×(−1) + 4 = 0. From this, it is evident that a net loss of marks occurs only when fewer than one-fifth of the randomly attempted questions are answered correctly.

When p is the probability of answering a question correctly, 1-p is the probability of answering it wrong. In this case, p and 1-p are 0.25 and 0.75 respectively.

The probability of answering r questions or fewer correctly out of n questions is

P(X ≤ r) = Σ (i = 0 to r) C(n, i) p^i (1 − p)^(n − i)

where C(n, i) is the number of ways of choosing i questions out of n.

When n = 1 to 5, negative marks are scored only when i = 0, i.e. when no answer is correct.
The probability of obtaining negative marks for n = 1 to 5 is as follows:


n    P
5    0.237
4    0.316
3    0.422
2    0.563
1    0.750

Given that there are 20 MCQs in each question paper, negative marks are obtained when answering all questions at random only if fewer than 4 answers are right, i.e. if the number of right answers is 3 or less.

The probability of obtaining negative marks when 20 questions are answered at random is P=0.225.

From this, although the probability of scoring positive marks or zero on the uncertain questions is higher, except when the number of MCQs is one or two, there is still a good amount of risk involved. Whether one can take this much risk depends on one's future plans.
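The probabilities quoted above can be checked with a short sketch of the binomial calculation (math.comb gives the binomial coefficient C(n, i)):

```python
from math import comb

def prob_at_most(r, n, p):
    """P(at most r correct answers out of n attempts),
    binomial with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r + 1))

# n = 1..5 questions answered at random (p = 0.25):
# negative marks only when i = 0 correct answers.
for n in range(1, 6):
    print(n, round(prob_at_most(0, n, 0.25), 3))

# n = 20: negative marks when 3 or fewer answers are correct.
print(round(prob_at_most(3, 20, 0.25), 3))  # 0.225
```

The loop reproduces the table above (0.750, 0.563, 0.422, 0.316, 0.237), and the last line reproduces the 20-question figure.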

2.4 Making intelligent guesses for MCQs

The risk of scoring negative can be minimized by making intelligent guesses, choosing the answers that appear most likely to be right.

Assumption: Let us assume the probability of choosing the right answer increases to p=0.5 when one makes an intelligent guess combined with subject matter knowledge.


In this case, the expected value of attempting one uncertain question is 0.5×4 − 0.5×1 = 1.5 marks, which is much higher than before.

Now, the probability of a negative score when 8 uncertain questions are answered is
P = 0.035, which is statistically insignificant, as statistical significance is usually attached to a confidence level of 95% (probability of 1 − 0.05) or more. This means we can say with a confidence level of 0.965 (1 − 0.035), or 96.5%, that one will either improve one's score or at least suffer no harm by making intelligent guesses, when the number of uncertain questions is 8 or more.

Even when there is only one uncertain question, the probability of scoring a negative mark is only 0.5, provided the above assumption holds. For a higher number of uncertain questions it is less than 0.5, which means the probability of a net benefit, or at least no loss, from the whole exercise is higher.
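The 0.035 figure follows from the same binomial model; the only changes are p = 0.5 and the condition for a negative net score, 4i − (8 − i) < 0, i.e. i ≤ 1 correct answers out of 8:

```python
from math import comb

def prob_at_most(r, n, p):
    """P(at most r correct answers out of n attempts)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r + 1))

# 8 uncertain questions, intelligent guesses with p = 0.5:
# net score 4i - (8 - i) is negative only when i <= 1.
p_negative = prob_at_most(1, 8, 0.5)
print(round(p_negative, 3))  # 0.035
```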

3.0 Applicability of the case study

The above case study is applicable in similar cases when the following conditions are met:

  1. Four options are provided, of which only one is correct. (When more than one option is correct, the risk comes down.)
  2. The ratio of positive to negative marking is 4:1.

4.0 Conclusions

  1. For the above case study, it is shown that the expected value from taking the risk of attempting uncertain questions is always positive.
  2. The probability of scoring positive is higher than the probability of scoring negative, when more than two multiple choice questions are answered at random.
  3. The probability of scoring positive is higher than the probability of scoring negative when more than one multiple choice question is answered by making an intelligent guess, when one can guesstimate correctly with a probability of 0.5.
  4. The risk of scoring negative marks comes down when one makes an intelligent guess of the answer rather than selecting the answer at random.
  5. When 8 or more questions are answered by an intelligent guess, there is a very good possibility of scoring positive or suffering no loss from the guess work, when the probability of guesstimating the right answer is 0.5.

5.0 Path forward

  1. One can do similar math when the number of options and/or the marking scheme are different.
  2. The confidence level of scoring no negative mark can be recalculated by varying the probability of success for an intelligent guess.
  3. Coaching centers can find out the probability of their students guessing the answer right in each subject separately and advise the students individually. As they have math and stat experts, they can even come up with better models (Good if they are already doing it, but I have not heard of any).
  4. Coaching centers also need to teach and demonstrate how to make better guesses. Those preparing themselves for competitive exams on their own need to develop their own strategies.
  5. When one has multiple options for pursuing higher studies, or multiple career options, one can take bigger risks, as bigger rewards may come from bigger risks. 
Tuesday, September 17, 2019

Importance of data literacy for non-IT engineers - Some random thoughts

The big question(s)

Often one may ask: why should I get into statistics, six sigma, data analytics and all such jargon to do my day-to-day engineering work? I know my stuff and have years of experience behind me. Can a statistical model interpret or predict better than I can? Don't I lose focus by doing all this? Won't it require extra effort?

Well, this article may not answer each and every question a traditional engineer like many of us may have, but it will try to address the concerns and apprehensions at a broader level.

What is data analytics and where is it applied?

I do not want to provide any standard textbook or classroom definition of data analytics; one can always find it by googling or by referring to the right resources, available aplenty online. Many are aware of it by now, and many aren't!

Stated or not, analytics is applied everywhere. When you have numerical data, as an engineer you will at least try to find its mean and its range (maximum and minimum). This is a very elementary application of data analytics. As you probe further, you will find other parameters like the standard deviation, median, mode, etc., which statistically describe your data. For large populations, you will do sampling. This is followed by finding the probability distributions associated with your data, making inferences, and the journey continues...
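This elementary level of analysis needs nothing beyond Python's standard library; the sketch below computes the descriptive measures just mentioned on a small made-up data set:

```python
import statistics

# A made-up set of repeated measurements (units are arbitrary).
data = [23.1, 24.0, 22.8, 23.5, 24.2, 23.1, 22.9, 23.7]

print("mean   :", round(statistics.mean(data), 2))
print("range  :", round(max(data) - min(data), 2))
print("median :", statistics.median(data))
print("stdev  :", round(statistics.stdev(data), 2))
print("mode   :", statistics.mode(data))
```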

From the above examples, it is evident that the concept of data analytics is not at all new. Analytics at its most elementary level does not require a computer at all!

I am a non-IT engineer. Why do I need it?

In engineering, a lot of data gets generated. One good example is traffic studies. For a structured review and for drawing meaningful conclusions, one needs to apply analytics.

"Well, I design a structure or machine. I have my formulas and software. When I use the formulas rightly, I will get the right output. Why do I need analytics?" one may ask.

In engineering, many of the formulas are either complex or empirical. When we use such formulas, it is always essential to study the variation in the output as the inputs vary.

When the formulas are not derived from first principles but were empirically arrived at from observations, regression etc., it will be to one's own advantage to study and analyze the data generated by those formulas. This enables one to have control over one's designs and take informed decisions. 

If an equation is empirically arrived at using regression, then studying it further generates additional insights. One can make new models, re-validate the equation, or may even come up with a better equation!

We often find these types of empirical equations in fluid mechanics, for example. Unfortunately, in most of the textbooks I studied during my engineering course in the 90s, the rationale behind the empirical formulas was almost nowhere explained.

Even when you have a derived, complex formula, it will be based on several assumptions. Also, inputs like material properties have their own confidence levels attached to them. A statistical study bails us out of the complexities of the equation, as we get to study the numerical data and the trends it follows.

Needless to say, the data from experimentation and observations needs to be studied statistically, which many do, with or without calling it analytics. If a statistical study is not performed, the findings simply cannot be relied upon, and it is high time one realized that.

In any case, one can always find out which inputs influence the output the most. This can be achieved by design of experiments and sensitivity analyses.
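A one-at-a-time sensitivity check of this kind takes only a few lines. The sketch below uses Manning's empirical formula for open-channel flow velocity, V = (1/n)·R^(2/3)·S^(1/2); the baseline input values are made up for illustration:

```python
def velocity(n, R, S):
    """Manning's formula: n = roughness, R = hydraulic radius (m),
    S = channel slope. Returns velocity in m/s (SI form)."""
    return (1.0 / n) * R ** (2.0 / 3.0) * S ** 0.5

# Illustrative baseline values.
base = {"n": 0.013, "R": 1.2, "S": 0.001}
v0 = velocity(**base)

# Perturb each input by +10% while holding the others fixed.
for name in base:
    perturbed = dict(base)
    perturbed[name] *= 1.10
    change = (velocity(**perturbed) - v0) / v0 * 100
    print(f"+10% in {name}: {change:+.1f}% change in V")
```

For this formula, the roughness coefficient n turns out to be the most influential input (about −9% output change for a +10% perturbation, versus roughly +7% for R and +5% for S), which is exactly the kind of insight a sensitivity study is after.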

Two types of engineering data


Without getting into the standard technical terms of data analytics, I would classify numerical data into two types from an engineer's perspective. The first is visible data, abundantly available everywhere. The second is invisible data, which lies hidden in the form of formulas and gets generated only when you calculate the outputs of those formulas while varying the inputs.

The visible data needs to be handled, using relevant data analytics techniques.

The invisible data can also be handled the same way, by generating data, arranging it in tables, plotting it graphically for visualization and then analyzing.

Sources of the visible data include data from experimentation, published and validated data, data gathered from observations, and so on.

Sources of invisible data include not only equations but also computer simulations. For example, the output data generated by finite element software can be put in this category.
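For instance, the data "hidden" in the standard midspan-deflection formula for a simply supported beam under uniform load, delta = 5wL^4/(384EI), can be generated over a grid of loads and spans and then tabulated like any observed data set (the E and I values below are illustrative assumptions, not a real design):

```python
E = 200e9      # Pa, steel (assumed)
I = 8.0e-6     # m^4, an assumed section

# Generate the "invisible" data over a grid of inputs.
rows = []
for w in (5e3, 10e3, 20e3):          # uniform load, N/m
    for L in (4.0, 6.0, 8.0):        # span, m
        delta = 5 * w * L**4 / (384 * E * I)
        rows.append((w, L, delta))

# Arrange it as a table, ready for plotting and analysis.
for w, L, d in rows:
    print(f"w={w/1e3:>5.0f} kN/m  L={L:.0f} m  delta={d*1000:8.2f} mm")
```

Once tabulated like this, the formula's behavior can be visualized and analyzed exactly as one would treat measured data.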

Should I test my data when there is a clear formula?

People say "seeing is believing", but seeing is not always believing. Hence the data needs to be put to statistical tests. This helps in decision making for those who are unsure of the direction in which to go. Even for those who are "sure" by experience, it helps in re-validating their experience-based knowledge.

Examples of some applications

In manufacturing engineering, one often uses the principles of six sigma. There are many statistical techniques that are common between six sigma and data analytics. By exposing oneself to data analytics, one will have a wider range of tools to handle one's data.

Design of experiments is often conducted in engineering, particularly mechanical engineering.

I often see the term sensitivity analysis being used in the analysis of high-rise buildings. Some use the term loosely just because the standards use it. Maybe the standards, and the industry leaders who develop them, need to elaborate on what is meant by sensitivity analysis, in line with its standard definition. It would be of great help if a standardized definition were included in the Indian tall concrete buildings code IS 16700.

As one is aware, advanced concepts like artificial intelligence, machine learning etc. are used in all industries.

Epilogue


One may say that what I have written is nothing but statistics, and hence old wine in a new bottle! I would rather call it old nectar in a new bottle, often left untasted.

The techniques I have discussed, and even more complex ones, have traditionally been used by researchers in engineering. But knowledge and application of the basic concepts at the working level will be of great help, as discussed, and will elevate society from a computer-literate one to the next level: a data-literate one.

Path forward


Below is the path forward I would suggest for promoting data literacy among engineers.

  1. Engineers should consider refreshing their undergraduate level statistics knowledge as part of their Continuous Professional Development.
  2. Guidelines should be framed to include a minimum number of questions on statistics and probability in exams like Professional Engineer's exams.
  3. A fixed number of minimum Professional Development Units (PDUs) in statistics and probability/data analytics/data science should be prescribed for retaining a Professional Engineer's or a project manager's certification.
  4. Non-IT engineering students interested in analytics with the intention of using it in their core field or making it their career, can consider doing a project in analytics (wait for my next article).
  5. Authors of engineering textbooks should include brief notes on how the empirical formulas have been arrived at, and provide reference to the original sources.

Saturday, September 7, 2019

Higher concrete grade or more reinforcement? - A data driven approach for optimal shear strength

1. Introduction


Many a time, the structural engineer has to choose between a higher concrete grade and a higher percentage of reinforcement. Some believe that increasing the concrete grade can result in savings, while some intuitively view a high concrete grade as very uneconomical for a small project.

The shear strength of concrete section varies based on the % of reinforcement and the grade of concrete.

In this article, let us examine whether a higher concrete grade or a higher reinforcement % makes a significant difference to the shear strength of the section, by referring to the Indian concrete code IS 456:2000.

2.0 Data collection

2.1 Shear strength:

The below table (table 1) from IS 456:2000 is our data to be examined.

The parameters As, b and d in Table 1 are defined as below:

As = area of tension reinforcement
b = width of the concrete member
d = effective depth of the concrete member (overall depth minus the distance between the reinforcement centroid and the extreme concrete fiber on the tension side)

Table 1



2.2 Material price

The below material prices have been assumed.

Table 2

Grade of Conc.    Price (Rs/cum)    Shuttering (Rs/sqm)
M15               4500              500
M20               4800              500
M25               5000              500
M30               5500              500
M35               5800              500
M40               6000              500

Reinforcement cost (Fe500): Rs. 60000/- per ton

2.3 Data clean-up

In the absence of an Excel file for the shear strength table, the above picture was converted into Excel by using the below online character recognition tool.


The data converted into Excel had some decimal points missing and some numbers improperly recognized. The data table being small, it was cleaned up by visual inspection.

3.0 Case study

Let us consider a one-way slab of 200 mm thickness. Assume 20 mm clear cover and 16 mm dia rebar. Then the effective depth of the section is d = 200 − 20 − 16/2 = 172 mm.

Our shear strength studies will be based on this slab.

4.0 Assumptions

The following assumptions have been made.
  1. Maximum concrete grade is M40.
  2. Minimum % of reinforcement is 0.15.
  3. Maximum % of reinforcement is 3.0
  4. The depth of the section is constrained to be constant.
  5. Costing data is assumed.
  6. The slab requires no shear links.

5.0 Exclusion

The below are the exclusions:
  • Distribution reinforcement is not considered in the estimation and costing.

6.0 Data generation

For the slab under consideration, the following data is further generated from the shear strength and material pricing data collected.

6.1 Material price of reinforced concrete

The following material prices have been arrived at per cu m.

Table 3

Cost of reinforced concrete in Rs per cu m

100As/bd    M15      M20      M25      M30      M35      M40
0.15        7608     7908     8108     8608     8908     9108
0.25        8013     8313     8513     9013     9313     9513
0.5         9025     9325     9525     10025    10325    10525
0.75        10038    10338    10538    11038    11338    11538
1.0         11051    11351    11551    12051    12351    12551
1.25        12063    12363    12563    13063    13363    13563
1.5         13076    13376    13576    14076    14376    14576
1.75        14089    14389    14589    15089    15389    15589
2.0         15101    15401    15601    16101    16401    16601
2.25        16114    16414    16614    17114    17414    17614
2.5         17127    17427    17627    18127    18427    18627
2.75        18139    18439    18639    19139    19439    19639
3.0         19152    19452    19652    20152    20452    20652
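The normalization carried out in section 6.2 is a one-liner once Table 3 is in code. The sketch below reproduces the first three rows (each entry divided by the cheapest combination, M15 at 0.15% reinforcement):

```python
# First three rows of Table 3: cost in Rs/cum for grades M15..M40.
costs = {
    0.15: [7608, 7908, 8108, 8608, 8908, 9108],
    0.25: [8013, 8313, 8513, 9013, 9313, 9513],
    0.50: [9025, 9325, 9525, 10025, 10325, 10525],
}

# Normalize with respect to the smallest entry (M15 @ 0.15%).
base = min(min(row) for row in costs.values())

normalized = {p: [round(c / base, 2) for c in row]
              for p, row in costs.items()}
print(normalized[0.15])  # [1.0, 1.04, 1.07, 1.13, 1.17, 1.2]
```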

6.2 Normalized material price of reinforced concrete

In Table 3 above, M15 @ 0.15% reinforcement costs the least. Let us normalize the reinforced concrete costs with respect to this value, as presented in the below table.

Table 4

Cost of reinforced concrete per cu m (normalized w.r.t. M15 @ 0.15% As)

100As/bd    M15     M20     M25     M30     M35     M40
0.15        1.00    1.04    1.07    1.13    1.17    1.20
0.25        1.05    1.09    1.12    1.18    1.22    1.25
0.5         1.19    1.23    1.25    1.32    1.36    1.38
0.75        1.32    1.36    1.39    1.45    1.49    1.52
1.0         1.45    1.49    1.52    1.58    1.62    1.65
1.25        1.59    1.63    1.65    1.72    1.76    1.78
1.5         1.72    1.76    1.78    1.85    1.89    1.92
1.75        1.85    1.89    1.92    1.98    2.02    2.05
2.0         1.99    2.02    2.05    2.12    2.16    2.18
2.25        2.12    2.16    2.18    2.25    2.29    2.32
2.5         2.25    2.29    2.32    2.38    2.42    2.45
2.75        2.38    2.42    2.45    2.52    2.56    2.58
3.0         2.52    2.56    2.58    2.65    2.69    2.71

6.3 Shear strength per unit normalized price of reinforced concrete

By dividing the shear strength in Table 1 by the normalized price of reinforced concrete in Table 4, let us arrive at the strength achieved per unit of (normalized) price, as in Table 5 below.

Table 5

Shear strength per unit normalized price (MPa)

100As/bd    M15     M20     M25     M30     M35     M40
0.15        0.28    0.27    0.27    0.26    0.25    0.25
0.25        0.33    0.33    0.32    0.31    0.30    0.30
0.5         0.39    0.39    0.39    0.38    0.37    0.37
0.75        0.41    0.41    0.41    0.41    0.40    0.40
1.0         0.41    0.42    0.42    0.42    0.41    0.41
1.25        0.40    0.41    0.42    0.41    0.42    0.42
1.5         0.40    0.41    0.41    0.41    0.41    0.41
1.75        0.38    0.40    0.41    0.40    0.41    0.41
2.0         0.36    0.39    0.40    0.40    0.40    0.40
2.25        0.34    0.38    0.39    0.39    0.39    0.40
2.5         0.32    0.36    0.38    0.38    0.38    0.39
2.75        0.30    0.34    0.37    0.37    0.38    0.38
3.0         0.28    0.32    0.36    0.36    0.37    0.37

7.0 Data visualization

From Table 5, we see that the shear strength achieved per unit price first increases as the % of reinforcement increases and then starts decreasing. This happens for all concrete grades, from M15 through M40. On the other hand, at lower values of 100As/bd, the strength per unit price decreases as the concrete grade increases. This reverses at higher values of 100As/bd, where the strength per unit price increases with concrete grade.

Since the trend of strength per unit price with 100As/bd is similar for all concrete grades, let us plot the mean shear strength per unit price (averaged across all the grades considered) against 100As/bd, as in the chart below.
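A chart of this kind can be produced with matplotlib; the lists below are transcribed from Table 5, and each row is averaged over the six grades:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

pct = [0.15, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3]
table5 = [  # rows: 100As/bd values above; columns: M15 .. M40
    [0.28, 0.27, 0.27, 0.26, 0.25, 0.25],
    [0.33, 0.33, 0.32, 0.31, 0.30, 0.30],
    [0.39, 0.39, 0.39, 0.38, 0.37, 0.37],
    [0.41, 0.41, 0.41, 0.41, 0.40, 0.40],
    [0.41, 0.42, 0.42, 0.42, 0.41, 0.41],
    [0.40, 0.41, 0.42, 0.41, 0.42, 0.42],
    [0.40, 0.41, 0.41, 0.41, 0.41, 0.41],
    [0.38, 0.40, 0.41, 0.40, 0.41, 0.41],
    [0.36, 0.39, 0.40, 0.40, 0.40, 0.40],
    [0.34, 0.38, 0.39, 0.39, 0.39, 0.40],
    [0.32, 0.36, 0.38, 0.38, 0.38, 0.39],
    [0.30, 0.34, 0.37, 0.37, 0.38, 0.38],
    [0.28, 0.32, 0.36, 0.36, 0.37, 0.37],
]
means = [sum(row) / len(row) for row in table5]

plt.plot(pct, means, marker="o")
plt.xlabel("100As/bd")
plt.ylabel("Mean shear strength per unit normalized price (MPa)")
plt.savefig("mean_strength_vs_reinforcement.png")
```

The mean peaks at 100As/bd = 1, consistent with the observations made from Table 5.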



8.0 Data analytics

8.1 Correlation

From the above data visualization, we understand there is a steep rise in the shear strength per unit price as the reinforcement increases up to 1%, followed by a relatively mild fall.

The correlation coefficient between 100As/bd and the mean shear strength per unit price, calculated in Excel, works out to around 0.208, which indicates a mild positive correlation.

8.2 ANOVA

Though ANOVA is ideally meant for categorical factors, it can also be used with continuous independent variables, as in this case.

By performing single factor ANOVA across the rows and across the columns in Excel, below are the results obtained.

Table 6






Table 7


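For readers who prefer Python to Excel, the same single-factor ANOVA can be sketched with scipy.stats.f_oneway on the Table 5 values; the exact P-values may differ slightly from the Excel output, but the inferences are the same:

```python
from scipy.stats import f_oneway

table5 = [  # Table 5: rows are 100As/bd = 0.15 .. 3.0, columns M15 .. M40
    [0.28, 0.27, 0.27, 0.26, 0.25, 0.25],
    [0.33, 0.33, 0.32, 0.31, 0.30, 0.30],
    [0.39, 0.39, 0.39, 0.38, 0.37, 0.37],
    [0.41, 0.41, 0.41, 0.41, 0.40, 0.40],
    [0.41, 0.42, 0.42, 0.42, 0.41, 0.41],
    [0.40, 0.41, 0.42, 0.41, 0.42, 0.42],
    [0.40, 0.41, 0.41, 0.41, 0.41, 0.41],
    [0.38, 0.40, 0.41, 0.40, 0.41, 0.41],
    [0.36, 0.39, 0.40, 0.40, 0.40, 0.40],
    [0.34, 0.38, 0.39, 0.39, 0.39, 0.40],
    [0.32, 0.36, 0.38, 0.38, 0.38, 0.39],
    [0.30, 0.34, 0.37, 0.37, 0.38, 0.38],
    [0.28, 0.32, 0.36, 0.36, 0.37, 0.37],
]

# Across columns: one group per concrete grade (M15 .. M40).
by_grade = [[row[j] for row in table5] for j in range(6)]
f_grade, p_grade = f_oneway(*by_grade)

# Across rows: one group per reinforcement percentage.
f_rebar, p_rebar = f_oneway(*table5)

print(f"concrete grade : p = {p_grade:.3f}")   # high -> not significant
print(f"reinforcement %: p = {p_rebar:.2e}")   # near zero -> significant
```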
9.0 Interpretation of results
  1. The mild positive correlation is due to the initial steep increase in the value of strength per unit price, followed by the relatively mild subsequent decline.
  2. From the ANOVA in Table 6, the P-value is very high across the columns, i.e. for varying concrete grades. So we fail to reject the null hypothesis and infer that the difference in the mean values of shear strength per unit price across varying concrete grades is not statistically significant.
  3. From the ANOVA in Table 7, the P-value is very low and nearly zero across rows i.e. for varying % reinforcement. So we reject the null hypothesis and infer that the difference in the mean values of shear strength per unit price is statistically significant.

10.0 Conclusions

The below conclusions are drawn from this study.
  1. It is the % of reinforcement that influences the shear strength achieved per unit of currency, more than the grade of concrete does.
  2. From the data visualization, it can be concluded that the % of reinforcement has a positive effect on the cost effectiveness of the cross-section with respect to shear strength only up to a certain % of reinforcement.

11.0 Practical Applications

  1. Increasing the reinforcement % up to some value (1% in the present study) in a flexural member will improve the cost efficiency of the section in terms of its shear strength.
  2. Where the section design is governed by shear strength, it is judicious to increase the reinforcement only up to that percentage. Beyond it, other options, like revising the section dimensions/slab depth, need to be explored.