#HackerRank: Computing the Correlation

Problem

You are given the scores of N students in three different subjects – MathematicsPhysics and Chemistry; all of which have been graded on a scale of 0 to 100. Your task is to compute the Pearson product-moment correlation coefficient between the scores of different pairs of subjects (Mathematics and Physics, Physics and Chemistry, Mathematics and Chemistry) based on this data. This data is based on the records of the CBSE K-12 Examination – a national school leaving examination in India, for the year 2013.

Pearson product-moment correlation coefficient

This is a measure of linear correlation described well on this Wikipedia page. The formula, in brief, is given by:

where x and y denote the two vectors between which the correlation is to be measured.

Input Format

The first row contains an integer N.
This is followed by N rows containing three tab-space (‘\t’) separated integers, M P C corresponding to a candidate’s scores in Mathematics, Physics and Chemistry respectively.
Each row corresponds to the scores attained by a unique candidate in these three subjects.

Input Constraints

1 <= N <= 5 x 105
0 <= M, P, C <= 100

Output Format

The output should contain three lines, with correlation coefficients computed
and rounded off correct to exactly 2 decimal places.
The first line should contain the correlation coefficient between Mathematics and Physics scores.
The second line should contain the correlation coefficient between Physics and Chemistry scores.
The third line should contain the correlation coefficient between Chemistry and Mathematics scores.

So, your output should look like this (these values are only for explanatory purposes):

0.12
0.13
0.95

Test Cases

There is one sample test case with scores obtained in Mathematics, Physics and Chemistry by 20 students. The hidden test case contains the scores obtained by all the candidates who appeared for the examination and took all three tests (Mathematics, Physics and Chemistry).
Think: How can you efficiently compute the correlation coefficients within the given time constraints, while handling the scores of nearly 400k students?

Sample Input

20
73  72  76
48  67  76
95  92  95
95  95  96
33  59  79
47  58  74
98  95  97
91  94  97
95  84  90
93  83  90
70  70  78
85  79  91
33  67  76
47  73  90
95  87  95
84  86  95
43  63  75
95  92  100
54  80  87
72  76  90

Sample Output

0.89  
0.92  
0.81

There is no special library support available for this challenge.

Solution(Source)

 

#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
from scipy.stats import norm

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A quick recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi,x,y):
    slope, intercept, rvalue, pvalue, stderr=sp.stats.linregress(x,y);
    return slope*pi+ intercept

predict(10,physicsScores,historyScores)
15.458333333333332

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not lead any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

x=[i for i in range(0,20)]

'''
    4x - 5y + 33 = 0
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5
    
    20x - 9y - 107 = 0
    x = (9y + 107)/20
    y = (20x - 107)/9
'''
t=7
print( ( 9 * t + 107 ) / 20 )
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

find the variance of y when σx= 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not lead any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

  • calculate the arithmetic means of x and y respectively.
  • estimate the value of x when y = 7. – find the variance of y when σx = 3.
Solution : –

We have,

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 ————— (i)

And

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 ————- (ii)

(i) Solving (i) and (ii) we get, mean of x = 13 and mean of y = 17.[Ans.]

(ii) Second line is line of x on y

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) byx = r(σy/σx) => 4/5 = 0.6 × σy/3 [r = √(byx.bxy) = √{(4/5)(9/20)]= 0.6 => σy = (4/5)(3/0.6) = 4 [Ans.]

variance= σ**2=> 16

What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same, whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or x given y should be the same, but that’s the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. These are defined as follows:

In terms of our scatter plot, this means we are minimizing the sum of the vertical distances between the observed data points and the line.

enter image description here

On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the above equation with x and y switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

enter image description here

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.

A correlation is symmetrical x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtracted off the mean from each observation, and then divided the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope would be the same whether you regressed y onto x, or x onto y.

enter image description here

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 70’s and early 80’s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed the one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

Here’s another way to think about this that approaches the topic through the formulas instead of visually:

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard Ordinary Least Squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms; one of which I call the ‘intuitive’ formula for the slope. Consider this form for both the situation where you are regressing y on x, and where you are regressing x on y:

Now, I hope it’s obvious that these would not be the same unless Var(xequals Var(y). If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and thus the variances would both also equal SD(x)SD(y). In this case, β^1 would equal Pearson’s r, which is the same either way by virtue of the principle of commutativity:

 

Source

HackerEarth: Battle Of Bots 6: Draughts

Problem:

Sample Game

Draughts is a two player board game which is played on a 8X8 grid of cells and is played on opposite sides of the game-board. Each player has an allocated color, Red ( First Player ) or White ( Second Player ) being conventional. Players take turns involving diagonal moves of uniform game pieces in the forward direction only and mandatory captures by jumping over opponent pieces.

Rules:

  • Player can only move diagonally to the adjacent cell and in forward direction, if the diagonally adjacent cell is vacant.
  • A player may not move an opponent’s piece.
  • If the diagonally adjacent cell contains an opponent’s piece, and the cell immediately beyond it is vacant, the opponent’s piece may be captured (and removed from the game) by jumping over it in the forward direction only.
  • If a player made a jump, then its mandatory to keep on jumping as long as the jump is possible.
  • Player cannot move to the diagonally adjacent cell once the player made a jump.

The game will end when any of the players don’t have any move left. At the end of the game the player with majority of pieces will win.

We will play it on an 8X8 grid. The top left of the grid is [0,0] and the bottom right is [7,7].

Input:
The input will be a 8X8 matrix consisting only of 0o2. Then another line will follow which will contain a number –  1 or 2 which is your player id. In the given matrix, top-left is [0,0] and bottom-right is [7,7]. The x-coordinate increases from left to right, and y-coordinate increases from top to bottom.

The cell marked 0 means it doesn’t contain any stones. The cell marked 1 means it contains first player’s stone which is Red in color. The cell marked 2 means it contains second player’s stone which is white in color.

Output:
In the first line print the coordinates of the cell separated by space, the piece he / she wants to move.
In second line print an integer N, number of steps or jumps the piece will make in one move.
In the next N lines print the coordinates of the cells in which the piece will make jump.
You must take care that you don’t print invalid coordinates. For example, [1,1] might be a valid coordinate in the game play if [1,1] in diagonal to the piece in which is going to jump, but [9,10] will never be. Also if you play an invalid move or your code exceeds the time/memory limit while determining the move, you lose the game.

Starting state
The starting state of the game is the state of the board before the game starts.

0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 2 0 2 0 2 0 2
2 0 2 0 2 0 2 0

First Input
This is the input give to the first player at the start of the game.

0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 2 0 2 0 2 0 2
2 0 2 0 2 0 2 0
1
SAMPLE INPUT
0 1 0 1 0 1 0 1
1 0 1 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0
0 0 0 2 0 0 0 2
2 0 2 0 2 0 2 0
1
SAMPLE OUTPUT
2 5
2
4 3
6 1

Explanation

This is player 1’s turn, and the player will move the piece at [2,5] and will make two jumps. First jump will be at [4,3and second jump will be at [6,1]

After his/her move the state of game becomes:

0 1 0 1 0 1 0 1
1 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 0 2 0 2 0 2
2 0 2 0 2 0 2 0

This state will be fed as input to program of player 2.

Other valid move for the first player is

2 5
1
3 6

But the following are invalid moves.
Case 1:

2 5
1
4 3

Because after making a jump its possible to jump again and its mandatory to jump as long as its possible to jump.

Case 2:

2 5
2
4 3
5 4

Because after making a jump its invalid to move to diagonally adjacent cell.

Here is the code of the Random Bot.

Time Limit:1.0 sec(s) for each input file.
Memory Limit:256 MB
Source Limit:1024 KB

Solution

This is the solution submitted by me

Converting Java BufferedImage to OpenCV Mat and vice versa

The code below allows us to convert BufferedImage to Mat (in OpenCV) and vice versa.

This becomes handy when getting images from network or certain other sources:

HackerEarth: Battle Of Bots #5: Reversi

Problem:

Reversi is a two player board game which is played on a 10 x 10 grid of cells. Each player has an allocated color, Black ( First Player ) or White ( Second Player ) being conventional. Players take turns placing a stone of their color on a single cell. A player must place a stone on the board, in such a position that there exists at least one straight (horizontal, vertical, or diagonal) occupied line between the new stone and another stone of same color, with one or more contiguous other color stone between them.

During a game, any stone of the opponent’s color that are in a straight line and bounded by the stone just placed and another stone of the current player’s color are turned over to the current player’s color. The game will end when the board is completely filled or both the players don’t have any move left. At the end of the game the player with majority of stone will win.

We will play it on an 10 x 10 grid. The top left of the grid is [0,0] and the bottom right is [9,9]. The rule is that a cell[i,j] is connected to any of top, left, right, or bottom cell.

Input:
The input will be a 10 x 10 matrix consisting only of 0,1,2 or 3. Then another line will follow which will contain a number – 1 or 2 which is your player id.

In the given matrix, top-left is [0,0] and bottom-right is [9,9].

In cell[row,column], row increases from top to bottom and column increases from left to right.

The cell marked 0 means it doesn’t contain any stones. The cell marked 1 means it contains first player’s stone which is Black in color. The cell marked 2 means it contains second player’s stone which is white in color. The cell marked 3 means it is a valid place for player whose turn it is.

Output:
Print the coordinates of the cell separated by space, where you want to play your move. You must take care that you don’t print invalid coordinates. For example, [1] might be a valid coordinate in the game play if cell[i,j]=3, but [10] will never be. Also if you play an invalid move or your code exceeds the time/memory limit while determining the move, you lose the game.

Starting state
The starting state of the game is the state of the board before the game starts.

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 1 0 0 0 0
0 0 0 0 1 2 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

First Input
This is the input give to the first player at the start of the game.

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0
0 0 0 3 2 1 0 0 0 0
0 0 0 0 1 2 3 0 0 0
0 0 0 0 0 3 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1

Scoring
The scores are calculated by running tournament of all submissions. Your latest submission will be taken into tournament. Scores are assigned according to the Glicko-2 rating system. For more information and questions, see Bot problem judge.

SAMPLE INPUT

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0
0 0 0 3 2 1 3 0 0 0
0 0 0 0 2 2 0 0 0 0
0 0 0 3 1 1 2 3 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1
SAMPLE OUTPUT
4 3

Explanation

This is player 1’s turn, and the player puts his/her stone in cell[4,3].
After his/her move the state of game becomes:

0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0  
0 0 0 3 0 3 0 0 0 0  
0 0 0 1 1 1 0 0 0 0  
0 0 0 3 1 2 0 0 0 0  
0 0 0 3 1 1 2 0 0 0  
0 0 0 1 0 3 0 0 0 0  
0 0 3 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0  

This state will be fed as input to program of player 2.


Time Limit:1.0 sec(s) for each input file.
Memory Limit: 256 MB
Source Limit: 1024 KB
Marking Scheme:Marks are awarded if any testcase passes.
Allowed Languages:C, CPP, CLOJURE, CSHARP, D, ERLANG, FSHARP, GO, GROOVY, HASKELL, JAVA, JAVA8, JAVASCRIPT, JAVASCRIPT_NODE, LISP, LISP_SBCL, LUA, OBJECTIVEC, OCAML, OCTAVE, PASCAL, PERL, PHP, PYTHON, PYTHON3, R, RACKET, RUBY, RUST, SCALA, SWIFT, VB

My Solution:

HackerRank: BotClean Partially Observable

Problem Statement

The game Bot Clean took place in a fully observable environment, i.e., the state of every cell was visible to the bot at all times. Let us consider a variation of it where the environment is partially observable. The bot has the same actuators and sensors. But the sensors visibility is confined to its 8 adjacent cells.

Input Format
The first line contains two space separated integers which indicate the current position of the bot. The board is indexed using Matrix Convention

5 lines follow, representing the grid. Each cell in the grid is represented by any of the following 4 characters:
‘b’ (ascii value 98) indicates the bot’s current position,
‘d’ (ascii value 100) indicates a dirty cell,
‘-‘ (ascii value 45) indicates a clean cell in the grid, and
‘o’ (ascii value 111) indicates the cell that is currently not visible.

Output Format
Output is the action that is taken by the bot in the current step. It can either be any of the movements in 4 directions or the action of cleaning the cell in which it is currently located. Hence the output formats are LEFT, RIGHT, UP, DOWN or CLEAN.

Sample Input

0 0
b-ooo
-dooo
ooooo
ooooo
ooooo

Sample Output

RIGHT

Task
Complete the function next_move that takes in 3 parameters: posr and posc denote the co-ordinates of the bot’s current position, and board denotes the board state, and print the bot’s next move.

Scoring
The goal is to clean all the dirty cells in as few moves as possible. Your score is (200 – #bot moves)/25. All bots in this challenge will be given the same input. CLEAN is also considered a move.

Education Links

Solution:

Solution tester posted alongside the program for users to test their input as well.