NumPy Tutorial with Exercises – Data Analysis

NumPy (short for ‘Numerical Python’ or ‘Numeric Python’) is one of the most essential packages for fast mathematical computation on arrays and matrices in Python. It is also very useful when dealing with multi-dimensional data, and it makes it easier to integrate C, C++ and Fortran code. In addition, it provides numerous functions for Fourier transforms (FT) and linear algebra.

Why NumPy instead of lists?

One might ask why NumPy arrays should be preferred when we can simply create lists whose elements share the same data type. If you have wondered the same, the following reasons may convince you:

  1. NumPy arrays are stored in contiguous memory. The same data stored as a list therefore requires more space than it does as an array.
  2. Operations on arrays are faster, so arrays are more efficient than lists.
  3. They are more convenient to work with, since operations apply to whole arrays at once.
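As a rough illustration of these points, the sketch below stores the same 1000 integers as a list and as an array and doubles every value. The variable names are illustrative, and the exact size of the list depends on the Python version and platform, so only the comparison is meaningful.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list holds pointers to separate Python int objects;
# an array holds the raw values in one contiguous buffer.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes          # 1000 elements * 8 bytes each

# Elementwise arithmetic needs an explicit loop for a list,
# but a single expression for an array.
doubled_list = [v * 2 for v in values]
doubled_arr = arr * 2

print(list_bytes > array_bytes)
```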

NumPy vs. Pandas

Pandas is built on top of NumPy. In other words, NumPy is required for pandas to work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.

Importing numpy

First, you need to import the numpy library. This can be done by running the following command:

import numpy as np

It is conventional to import numpy under the alias ‘np’. Without the alias, functions are accessed as numpy.function; with it, we can simply write np.function. Some common numpy functions and attributes are listed below:

Function   Task
array      Create a numpy array
ndim       Number of dimensions of the array
shape      Size of the array (number of rows and columns)
size       Total number of elements in the array
dtype      Type of the elements in the array, e.g. int64, float64
reshape    Returns the array in a new shape without changing the original array
resize     Reshapes the array in place, changing the original array
arange     Create a sequence of numbers in an array
itemsize   Size in bytes of each item
diag       Create a diagonal matrix
vstack     Stack arrays vertically
hstack     Stack arrays horizontally

1D array

Using numpy an array is created by using np.array:

a = np.array([15,25,14,78,96])

a
Output: array([15, 25, 14, 78, 96])

print(a)
Output: [15 25 14 78 96]

Notice that the values passed to np.array are enclosed in square brackets. Omitting the square brackets raises an error. To print the array we can use print(a).

Changing the datatype

np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.

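a.dtype
Output: dtype('int32')

a = np.array([15,25,14,78,96], dtype = "float")

a.dtype
Output: dtype('float64')

Initially the datatype of ‘a’ was ‘int32’, which becomes ‘float64’ once dtype is specified. The default integer type depends on the platform, so you may see ‘int64’ instead of ‘int32’.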

  1. int32 refers to a 32-bit integer, i.e. a number without a decimal point. ‘32’ means the number can lie between -2147483648 and 2147483647. Similarly, int16 implies the number can range from -32768 to 32767.
  2. float64 refers to a 64-bit floating-point number, i.e. a number with a decimal part.
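The ranges quoted above can be read from NumPy itself. The following is a small sketch using np.iinfo, together with astype as one way to convert an existing array; the variable names are illustrative.

```python
import numpy as np

# Smallest and largest values each integer type can hold.
int32_range = (np.iinfo(np.int32).min, np.iinfo(np.int32).max)
int16_range = (np.iinfo(np.int16).min, np.iinfo(np.int16).max)

a = np.array([15, 25, 14, 78, 96])
a_float = a.astype(np.float64)   # astype returns a converted copy; a is unchanged

print(int32_range, int16_range, a.dtype, a_float.dtype)
```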

Creating the sequence of numbers

To create a sequence of numbers we can use np.arange. To get the sequence of numbers from 20 to 29 we run the following command.

b = np.arange(start = 20,stop = 30, step = 1)

array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In np.arange the end point is always excluded.

np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, up to but excluding 30.

c = np.arange(20,30,2) #30 is excluded.

array([20, 22, 24, 26, 28])

Note that np.arange( ) excludes the value passed as stop.

Indexing in arrays

It is important to note that Python indexing starts from 0. The syntax of indexing is as follows –

  1. x[start:end:step]: Elements in array x start through the end (but the end is excluded), default step value is 1.
  2. x[start:end] : Elements in array x start through the end (but the end is excluded)
  3. x[start:] : Elements start through the end
  4. x[:end] : Elements from the beginning through the end (but the end is excluded)

If we want to extract the 3rd element we write the index as 2, since indexing starts from 0.

x = np.arange(10)

x
Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[2]
Output: 2

x[2:5]
Output: array([2, 3, 4])

x[::2]
Output: array([0, 2, 4, 6, 8])

x[1::2]
Output: array([1, 3, 5, 7, 9])

Note that x[2:5] selects the elements from index 2 up to, but not including, index 5.

To set every third element, from the start up to (but excluding) index 7, to 123 we write:

x[:7:3] = 123

x

array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])

To reverse a given array we write:

x = np.arange(10)

x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Note that the above command does not modify the original array.

Reshaping the arrays

To reshape the array we can use reshape( ).

f = np.arange(101,113)

f.reshape(3,4)

f

array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

Note that reshape() does not alter the shape of the original array. To modify the original array in place we can use resize( ):

f.resize(3,4)

array([[101, 102, 103, 104],
       [105, 106, 107, 108],
       [109, 110, 111, 112]])

If a dimension is given as -1 in a reshaping, that dimension is calculated automatically, provided that the total number of elements in the array is a multiple of the product of the dimensions that are given.

f.reshape(3,-1)

array([[101, 102, 103, 104],
       [105, 106, 107, 108],
       [109, 110, 111, 112]])

In the above code we only specified that we want 3 rows. numpy automatically calculates the size of the other dimension, i.e. 4 columns.
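The difference between reshape( ) and resize( ), and the rule for -1, can be checked with a short sketch. The names r, auto and g are illustrative and not part of the original example.

```python
import numpy as np

f = np.arange(101, 113)          # 12 elements

r = f.reshape(3, 4)              # returns a 3x4 array; f itself stays 1D
shape_after_reshape = f.shape    # (12,)

auto = f.reshape(3, -1)          # -1 is inferred as 12 / 3 = 4 columns

# -1 only works when the total size is divisible by the given dimension.
reshape_failed = False
try:
    f.reshape(5, -1)             # 12 is not divisible by 5
except ValueError:
    reshape_failed = True

g = np.arange(101, 113)
g.resize(3, 4)                   # changes g in place and returns None

print(shape_after_reshape, auto.shape, g.shape, reshape_failed)
```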

Missing Data

Missing data is represented by NaN (short for Not a Number). A missing value can be written as np.nan:

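val = np.array([15,10, np.nan, 3, 2, 5, 6, 4]) # values after np.nan are illustrative, chosen so that the non-missing entries sum to 45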

val.sum()
Out: nan

To ignore missing values, you can use np.nansum(val), which returns 45.0.

To check whether an array contains missing values, you can use the function isnan( ):

np.isnan(val)
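np.isnan returns a Boolean array, which can be used to count or drop the missing entries. A minimal sketch with hypothetical data (the array below is not the one used above):

```python
import numpy as np

val = np.array([1.0, np.nan, 3.0, np.nan, 5.0])   # hypothetical data

mask = np.isnan(val)            # True where a value is missing
n_missing = int(mask.sum())     # number of missing values
total = np.nansum(val)          # sum that skips NaN
mean = np.nanmean(val)          # mean that skips NaN
cleaned = val[~mask]            # array without the missing entries

print(n_missing, total, mean, cleaned)
```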

2D arrays

A 2D array in numpy can be created in the following manner:

g = np.array([(10,20,30),(40,50,60)])

The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:

g.ndim
Output: 2

g.size
Output: 6

g.shape
Output: (2, 3)

Creating some usual matrices

numpy provides utilities to create some standard matrices that are commonly used in linear algebra.

To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):

np.zeros( (2,4) )

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

Here the dtype can also be specified. For a zero matrix the default dtype is ‘float’. To change it to integer we write ‘dtype = np.int16’

np.zeros([2,4],dtype=np.int16)

array([[0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int16)

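np.empty creates a matrix without initializing its entries, so it contains whatever values happen to be in memory. These are arbitrary values, not random numbers between 0 and 1.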

np.empty( (2,3) )

array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
       [  2.29175545e-312,   2.33419537e-312,   2.37663529e-312]])

Note: The results may vary every time you run np.empty.
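np.empty only reserves memory and makes no promise about the values it contains. When random values between 0 and 1 are what is needed, a random generator can be used instead. A minimal sketch, assuming NumPy 1.17 or later for np.random.default_rng; the seed 0 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make runs repeatable
r = rng.random((2, 3))           # uniform samples from [0, 1)

e = np.empty((2, 3))             # uninitialized: contents are arbitrary
e[:] = 0.0                       # assign values before reading from it

print(r.shape, e)
```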

To create a matrix of ones we write np.ones( ). We can create a 3 * 3 matrix of all ones by:

np.ones([3,3])

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:

np.diag([14,15,16,17])

array([[14,  0,  0,  0],
       [ 0, 15,  0,  0],
       [ 0,  0, 16,  0],
       [ 0,  0,  0, 17]])

To create an identity matrix we can use np.eye( ) .

np.eye(5,dtype = "int")

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

By default the datatype in np.eye( ) is ‘float’, thus we write dtype = "int" to get integers.

Reshaping 2D arrays

To get a flattened 1D array we can use ravel( )

g = np.array([(10,20,30),(40,50,60)])

g.ravel()

array([10, 20, 30, 40, 50, 60])

To change the shape of a 2D array we can use reshape. Writing -1 lets numpy calculate the other dimension automatically. reshape does not modify the original array.

g.reshape(3,-1) # returns the array with a modified shape

g.shape
Output: (2, 3)

Similar to 1D arrays, using resize( ) will modify the shape in the original array.

g.resize((3,2))

array([[10, 20],
       [30, 40],
       [50, 60]])

Time for some matrix algebra

Let us create the arrays A, b and B that will be used in this section:

A = np.array([[2,0,1],[4,3,8],[7,6,9]])

b = np.array([1,101,14])

B = np.array([[10,20,30],[40,50,60],[70,80,90]])

To get the transpose, trace and inverse we use A.transpose( ), np.trace( ) and np.linalg.inv( ) respectively.

A.T #transpose

A.transpose()  #transpose
Output:
array([[2, 4, 7],
       [0, 3, 6],
       [1, 8, 9]])

np.trace(A)  # trace
Output: 14

np.linalg.inv(A)  #Inverse
Output:
array([[ 0.53846154, -0.15384615,  0.07692308],
       [-0.51282051, -0.28205128,  0.30769231],
       [-0.07692308,  0.30769231, -0.15384615]])

Note that transpose does not modify the original array.

Matrix addition and subtraction can be done in the usual way:

A+B
Output:
array([[12, 20, 31],
       [44, 53, 68],
       [77, 86, 99]])

A-B
Output:
array([[ -8, -20, -29],
       [-36, -47, -52],
       [-63, -74, -81]])

Matrix multiplication of A and B can be accomplished by A.dot(B), where A is the first matrix (on the left-hand side) and B is the second matrix (on the right-hand side).

A.dot(B)

array([[  90,  120,  150],
       [ 720,  870, 1020],
       [ 940, 1160, 1380]])
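It is easy to confuse the matrix product with the elementwise product. The sketch below contrasts the two; the values of A and B are inferred from the outputs shown in this section, and the @ operator is used as an equivalent of A.dot(B).

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])
B = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

matrix_product = A @ B    # rows of A times columns of B, same as A.dot(B)
elementwise = A * B       # multiplies entries at matching positions

print(matrix_product)
print(elementwise)
```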

To solve the system of linear equations: Ax = b we use np.linalg.solve( )

np.linalg.solve(A,b)

array([-13.92307692, -24.69230769,  28.84615385])
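A quick way to check a solution is to substitute it back into the system. A minimal sketch, with A and b inferred from the outputs shown in this section:

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])
b = np.array([1, 101, 14])

x = np.linalg.solve(A, b)

# Substituting x back should reproduce b up to rounding error.
residual_ok = np.allclose(A @ x, b)

# inv(A) @ b gives the same answer, but solve avoids forming the inverse.
x_via_inverse = np.linalg.inv(A) @ b

print(x, residual_ok)
```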

The eigenvalues and eigenvectors can be calculated using np.linalg.eig( )

np.linalg.eig(A)

(array([ 14.0874236 ,   1.62072127,  -1.70814487]),
 array([[-0.06599631, -0.78226966, -0.14996331],
        [-0.59939873,  0.54774477, -0.81748379],
        [-0.7977253 ,  0.29669824,  0.55608566]]))

The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector for the corresponding eigenvalue.
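The pairing between eigenvalues and eigenvector columns can be verified from the defining relation A v = λ v. A small sketch, with A inferred from the outputs shown in this section:

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Column k of eigenvectors belongs to eigenvalues[k].
checks = [
    np.allclose(A @ eigenvectors[:, k], eigenvalues[k] * eigenvectors[:, k])
    for k in range(len(eigenvalues))
]

# The eigenvalues of a matrix sum to its trace.
trace_matches = np.isclose(eigenvalues.sum(), np.trace(A))

print(checks, trace_matches)
```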

Some Mathematics functions

numpy provides trigonometric functions such as np.sin and np.cos, which are applied element by element:

B = np.array([[0,-20,36],[40,50,1]])

np.sin(B)

array([[ 0.        , -0.91294525, -0.99177885],
       [ 0.74511316, -0.26237485,  0.84147098]])

The result is a matrix containing the sine of each element, with the elements interpreted as radians.

To raise each element to a power we use the ** operator:

B**2

array([[   0,  400, 1296],
       [1600, 2500,    1]], dtype=int32)

We get the matrix of the square of all elements of B.

To check whether a condition is satisfied by the elements of a matrix we write the condition directly. For instance, to check if the elements of B are more than 25 we write:

B>25

array([[False, False,  True],
       [ True,  True, False]], dtype=bool)

We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.

In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.

np.absolute(B)

Now we consider a matrix A of shape 3*3:

A = np.arange(1,10).reshape(3,3)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:

A.sum()
Output: 45

A.min()
Output: 1

A.max()
Output: 9

A.mean()
Output: 5.0

A.std()   #Standard deviation
Output: 2.5819888974716112

A.var()
Output: 6.666666666666667

In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.

A.argmin()
Output: 0

A.argmax()
Output: 8

If we wish to find the above statistics for each row or column then we need to specify the axis:

A.sum(axis=0)                 # sum of each column, it will move in downward direction
Output: array([12, 15, 18])

A.mean(axis = 0)
Output: array([ 4.,  5.,  6.])

A.std(axis = 0)
Output: array([ 2.44948974,  2.44948974,  2.44948974])

A.argmin(axis = 0)
Output: array([0, 0, 0], dtype=int64)

By defining axis = 0, calculations will move in downward direction i.e. it will give the statistics for each column. To find the min and index of maximum element for each row, we need to move in right-wise direction so we write axis = 1:

A.min(axis=1)                  # min of each row, it will move in rightwise direction
Output: array([1, 4, 7])

A.argmax(axis = 1)
Output: array([2, 2, 2], dtype=int64)

To find the cumulative sum along each row we use cumsum( )

A.cumsum(axis=1)

array([[ 1,  3,  6],
       [ 4,  9, 15],
       [ 7, 15, 24]], dtype=int32)

Creating 3D arrays

Numpy also provides the facility to create 3D arrays. A 3D array can be created as:

X = np.array( [[[ 1, 2, 3],
                [ 4, 5, 6]],
               [[ 7, 8, 9],
                [10,11,12]]])

X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.

To calculate the sum along a particular axis we use the axis parameter as follows:

X.sum(axis = 0)
Output:
array([[ 8, 10, 12],
       [14, 16, 18]])

X.sum(axis = 1)
Output:
array([[ 5,  7,  9],
       [17, 19, 21]])

X.sum(axis = 2)
Output:
array([[ 6, 15],
       [24, 33]])

axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
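One way to remember the axis rule is that summing along an axis removes that axis from the shape. A minimal sketch, building the same 2 x 2 x 3 values with arange and reshape (an equivalent construction chosen for brevity):

```python
import numpy as np

X = np.arange(1, 13).reshape(2, 2, 3)

# Shape of the result for each choice of axis.
shapes = {axis: X.sum(axis=axis).shape for axis in range(X.ndim)}

# keepdims=True keeps the summed axis with length 1.
kept = X.sum(axis=2, keepdims=True).shape

print(shapes, kept)
```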

X.ravel()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

ravel( ) writes all the elements in a single array.

Consider a 3D array:

X = np.array( [[[ 1, 2, 3],
                [ 4, 5, 6]],
               [[ 7, 8, 9],
                [10,11,12]]])

To extract the 2nd matrix we write:

X[1,...] # same as X[1,:,:] or X[1]

array([[ 7,  8,  9],
       [10, 11, 12]])

Remember python indexing starts from 0 that is why we wrote 1 to extract the 2nd 2D array.

To extract the first element of every row in each 2D array we write:

X[...,0] # same as X[:,:,0]

array([[ 1,  4],
       [ 7, 10]])

Find out position of elements that satisfy a given condition

a = np.array([8, 3, 7, 0, 4, 2, 5, 2])

np.where(a > 4)
Output: (array([0, 2, 6]),)

np.where returns the positions in the array where the element is greater than 4.
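With a single argument np.where returns a tuple with one index array per dimension; with three arguments it chooses between two values element by element. A short sketch (the replacement value -1 is arbitrary):

```python
import numpy as np

a = np.array([8, 3, 7, 0, 4, 2, 5, 2])

positions = np.where(a > 4)[0]       # index array for the only axis
values = a[positions]                # the elements at those positions
replaced = np.where(a > 4, a, -1)    # keep a where the condition holds, else -1

print(positions, values, replaced)
```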

Indexing with Arrays of Indices

Consider a 1D array.

x = np.arange(11,35,2)

array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])

We form a 1D array i of indices and use it to select elements of x as follows:

i = np.array( [0,1,5,3,7,9 ] )

x[i]

array([11, 13, 21, 17, 25, 29])

In a similar manner we create a 2D array j of indices to subset x. The result has the same shape as j.

j = np.array( [ [ 0, 1], [ 6, 2 ] ] )

x[j]

array([[11, 13],
       [23, 15]])

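Similarly, both i and j can be 2D arrays of indices for a 2D array x:

x = np.arange(15).reshape(3,5)

i = np.array( [ [ 0, 1], [ 2, 0 ] ] )

j = np.array( [ [ 1, 1], [ 2, 0 ] ] )

To pick, for each position, the element whose row index comes from i and whose column index comes from j we write: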

x[i,j] # i and j must have equal shape

array([[ 1,  6],
       [12,  0]])

To take the rows given by i from the 3rd column (index 2) we write:

x[i,2]

array([[ 2,  7],
       [12,  2]])

To take the columns given by j from every row we write:

x[:,j]

array([[[ 1,  1],
        [ 2,  0]],

       [[ 6,  6],
        [ 7,  5]],

       [[11, 11],
        [12, 10]]])

That is, the columns given by j are taken from the 1st row, then from the 2nd row, and then from the 3rd row.
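The shape of the result follows the shape of the index arrays. The sketch below repeats the three cases; the values of i and j are inferred from the outputs shown above.

```python
import numpy as np

x = np.arange(15).reshape(3, 5)
i = np.array([[0, 1], [2, 0]])   # row indices (inferred from the outputs above)
j = np.array([[1, 1], [2, 0]])   # column indices (inferred from the outputs above)

paired = x[i, j]      # i and j are paired element by element: shape (2, 2)
fixed_col = x[i, 2]   # rows from i, always column 2:           shape (2, 2)
all_rows = x[:, j]    # every row, columns from j:              shape (3, 2, 2)

print(paired.shape, fixed_col.shape, all_rows.shape)
```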

You can also use indexing with arrays to assign the values:

x = np.arange(10)

x[[4,5,8,1,2]] = 0

x
Output: array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])

0 is assigned to indices 4, 5, 8, 1 and 2 of x.

When the list of indices contains repetitions, the last value assigned to a repeated index is the one that remains:

x = np.arange(10)

x[[4,4,2,3]] = [100,200,300,400]

x
Output: array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])

Notice that for the 5th element (i.e. index 4) the value assigned is 200, not 100.

Caution: If the += operator is used with repeated indices, the operation is carried out only once for each repeated index.

x = np.arange(10)

x[[1,1,1,7,7]] += 1

x
Output: array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])

Although indices 1 and 7 are repeated, they are incremented only once.
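If every repetition should count, np.add.at is one option, since it applies the operation once per occurrence of an index. A minimal sketch; the index list is illustrative.

```python
import numpy as np

idx = [1, 1, 1, 7, 7]    # illustrative list with repeated indices

x = np.arange(10)
x[idx] += 1              # index 1 and index 7 each increase by 1 only

y = np.arange(10)
np.add.at(y, idx, 1)     # index 1 increases by 3, index 7 by 2

print(x)
print(y)
```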

Indexing with Boolean Arrays

We create a 2D array ‘a’ and store our condition in ‘b’. Where the condition holds the entry is True, otherwise it is False.

a = np.arange(12).reshape(3,4)

b = a > 4

b

array([[False, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

Note that ‘b’ is a Boolean array with the same shape as ‘a’.

To select the elements from ‘a’ which adhere to condition ‘b’ we write:

a[b]

array([ 5,  6,  7,  8,  9, 10, 11])

The result is a 1D array with the selected elements; ‘a’ itself is not changed.

This property can be very useful in assignments:

a[b] = 0

array([[0, 1, 2, 3],
       [4, 0, 0, 0],
       [0, 0, 0, 0]])

All elements of ‘a’ higher than 4 become 0
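Conditions can also be combined with & (and), | (or) and ~ (not). A short sketch that keeps only the values strictly between 4 and 9; the bounds are arbitrary, and each comparison needs its own parentheses.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

between = (a > 4) & (a < 9)    # parentheses are required around each comparison
selected = a[between]          # 1D array of the matching elements
count = int(between.sum())     # True counts as 1

clipped = a.copy()             # work on a copy so a stays intact
clipped[~between] = 0          # zero out everything outside the range

print(selected, count)
print(clipped)
```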

As with integer indexing, we can also select rows and columns with Boolean arrays.

Let x be the original matrix and ‘y’ and ‘z’ be the Boolean arrays used to select the rows and columns.

x = np.arange(15).reshape(3,5)

y = np.array([True, True, False])

z = np.array([True, True, False, True, False])

Writing x[y,:] will select only those rows where y is True.

x[y,:] # selecting rows

Writing x[:,z] will select only those columns where z is True.

x[:,z] # selecting columns

x[y,:]                                   # selecting rows
Output:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

x[y]                                     # same thing
Output:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

x[:,z]                                   # selecting columns
Output:
array([[ 0,  1,  3],
       [ 5,  6,  8],
       [10, 11, 13]])

Statistics on Pandas DataFrame

Let’s create a dummy data frame for illustration:

import numpy as np
import pandas as pd

np.random.seed(234)
mydata = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                     "x2"  : range(10)
                     })

1. Calculate mean of each column of data frame

np.mean(mydata, axis=0)

2. Calculate median of each column of data frame

np.median(mydata, axis=0)

axis = 0 means the median function would be run on each column. axis = 1 implies the function to be run on each row.

Stacking various arrays

Let us consider 2 arrays A and B:

A = np.array([[10,20,30],[40,50,60]])

B = np.array([[100,200,300],[400,500,600]])

To join them vertically we use np.vstack( ).

np.vstack((A,B)) #Stacking vertically

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [100, 200, 300],
       [400, 500, 600]])

To join them horizontally we use np.hstack( ).

np.hstack((A,B)) #Stacking horizontally

array([[ 10,  20,  30, 100, 200, 300],
       [ 40,  50,  60, 400, 500, 600]])

newaxis helps in transforming a 1D array into a 2D column vector.

from numpy import newaxis

a = np.array([4.,1.])

b = np.array([2.,8.])

a[:,newaxis]

array([[ 4.],
       [ 1.]])

The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. It is equivalent to np.hstack only for 2D arrays:

np.column_stack((a[:,newaxis],b[:,newaxis]))
Output:
array([[ 4.,  2.],
       [ 1.,  8.]])

np.hstack((a[:,newaxis],b[:,newaxis]))
Output:
array([[ 4.,  2.],
       [ 1.,  8.]])
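The two functions differ when they are given 1D arrays directly, as the following sketch shows (a and b take the values implied by the outputs above):

```python
import numpy as np

a = np.array([4., 1.])
b = np.array([2., 8.])

cols = np.column_stack((a, b))   # 1D inputs become columns: shape (2, 2)
flat = np.hstack((a, b))         # 1D inputs are concatenated: shape (4,)

# After adding a new axis the inputs are 2D, and the two functions agree.
same = np.hstack((a[:, np.newaxis], b[:, np.newaxis]))

print(cols.shape, flat.shape)
```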

Splitting the arrays

Consider an array ‘z’ of 15 elements:

z = np.arange(1,16)

Using np.hsplit( ) one can split the arrays

np.hsplit(z,5) # Split z into 5 arrays

[array([1, 2, 3]),
 array([4, 5, 6]),
 array([7, 8, 9]),
 array([10, 11, 12]),
 array([13, 14, 15])]

It splits ‘z’ into 5 arrays of equal length.

On passing 2 elements we get:

np.hsplit(z,(3,5))

[array([1, 2, 3]),
 array([4, 5]),
 array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15])]

It splits ‘z’ after the third and the fifth element.

For 2D arrays np.hsplit( ) works as follows:

A = np.arange(1,31).reshape(3,10)

np.hsplit(A,5) # Split A into 5 arrays

[array([[ 1,  2],
        [11, 12],
        [21, 22]]), array([[ 3,  4],
        [13, 14],
        [23, 24]]), array([[ 5,  6],
        [15, 16],
        [25, 26]]), array([[ 7,  8],
        [17, 18],
        [27, 28]]), array([[ 9, 10],
        [19, 20],
        [29, 30]])]

In the above command A gets split into 5 arrays of same shape.

To split after the third and the fifth column we write:

np.hsplit(A,(3,5))

[array([[ 1,  2,  3],
        [11, 12, 13],
        [21, 22, 23]]), array([[ 4,  5],
        [14, 15],
        [24, 25]]), array([[ 6,  7,  8,  9, 10],
        [16, 17, 18, 19, 20],
        [26, 27, 28, 29, 30]])]
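np.hsplit with an integer requires that the array can be divided into equal parts. When that is not possible, np.array_split is one alternative, since it allows parts of unequal size. A minimal sketch with illustrative variable names:

```python
import numpy as np

z = np.arange(1, 16)                  # 15 elements

equal_parts = np.hsplit(z, 5)         # five arrays of length 3

split_failed = False
try:
    np.hsplit(z, 4)                   # 15 cannot be divided into 4 equal parts
except ValueError:
    split_failed = True

uneven_parts = np.array_split(z, 4)   # allows parts of unequal size
sizes = [len(p) for p in uneven_parts]

print(split_failed, sizes)
```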

Copying

Consider an array x

x = np.arange(1,16)

We assign x to y and then check ‘y is x’:

y = x

y is x
Output: True

Let us change the shape of y

y.shape = 3,5

Note that this also alters the shape of x, because y and x are the same object:

x.shape
Output: (3, 5)

Creating a view of the data

Let us store z as a view of x by:

z = x.view()

z is x
Output: False

Thus z is not x.

Changing the shape of z

z.shape = 5,3

Changing the shape of the view does not alter the shape of x:

x.shape
Output: (3, 5)

Changing an element in z

z[0,0] = 1234

Note that the value in x is altered as well:

array([[1234,    2,    3,    4,    5],
       [   6,    7,    8,    9,   10],
       [  11,   12,   13,   14,   15]])

Thus changing the shape of a view does not affect the original array, but changing the values of a view does change the original data.

Creating a copy of the data:

Now let us create z as a copy of x:

z = x.copy()

Note that z is not x

Changing the value in z

z[0,0] = 9999

No alterations are made in x.

array([[1234,    2,    3,    4,    5],
       [   6,    7,    8,    9,   10],
       [  11,   12,   13,   14,   15]])
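The three cases (plain assignment, view and copy) can be told apart with the ‘is’ operator and np.shares_memory. A short sketch with illustrative names; it also shows that basic slicing returns a view.

```python
import numpy as np

x = np.arange(1, 16).reshape(3, 5)

alias = x             # same object under a second name
view = x.view()       # new array object, same underlying data
copy = x.copy()       # new array object, new data
sliced = x[:, :2]     # basic slicing also returns a view

checks = {
    "alias is x": alias is x,
    "view is x": view is x,
    "view shares memory": np.shares_memory(view, x),
    "slice shares memory": np.shares_memory(sliced, x),
    "copy shares memory": np.shares_memory(copy, x),
}

sliced[0, 0] = -1     # writes through to x, but not to copy

print(checks)
print(x[0, 0], copy[0, 0])
```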

pandas sometimes gives a ‘SettingWithCopyWarning’ because it cannot tell whether a new dataframe created as a subset of another dataframe is a view or a copy. In such situations the user should state the intent explicitly, for example by calling copy( ); otherwise the results may be unexpected.

Exercises : Numpy

1. How to extract even numbers from array?

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Desired Output: array([0, 2, 4, 6, 8])

arr[arr % 2 == 0]

2. How to find out the position where elements of x and y are same

x = np.array([5,6,7,8,3,4])

y = np.array([5,3,4,5,2,4])

Desired Output: (array([0, 5]),)

np.where(x == y)

3. How to rescale values so that they lie between 0 and 1

k = np.array([5,3,4,5,2,4])

Hint: (k - min(k))/(max(k) - min(k))

kmax, kmin = k.max(), k.min()

k_new = (k - kmin)/(kmax - kmin)

4. How to calculate the percentile scores of an array

p = np.array([15,10, 3,2,5,6,4])

np.percentile(p, q=[5, 95])

5. Print the number of missing values in an array

p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])

print("Number of missing values =", np.isnan(p).sum())
