BSC Computer Science - Data Science Assignment 3 (SET A)

Create your own dataset and perform simple preprocessing

Dataset Name: Data.csv (enter the following data in Excel and save it with the .csv extension)

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
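If Excel is not at hand, the same file can be produced from Python. The sketch below is one possible way to do it. It assumes the two incomplete rows (Germany with age 40, Spain with salary 52000) are meant to have an empty Salary field and an empty Age field respectively, which is what the NaN values in the loaded output suggest. The name CSV_TEXT is only illustrative.

```python
from pathlib import Path

import pandas as pd

# The ten rows of the assignment; an empty field between two commas
# marks a missing value (assumption: Salary in row 4, Age in row 6).
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# Write the text to Data.csv in the current working directory.
Path("Data.csv").write_text(CSV_TEXT)

# Read the file back to confirm that it parses into 10 rows and 4 columns.
check = pd.read_csv("Data.csv")
print(check.shape)
print(check.isna().sum())  # number of missing values per column
```

Writing the header without spaces after the commas keeps the column names clean, so they can be used later as 'Age' and 'Salary' rather than ' Age' and ' Salary'.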



import numpy as np
import pandas as pd

# Load the CSV file into a DataFrame; empty fields are read as NaN
df = pd.read_csv('Data.csv')
df

Output:

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
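The NaN entries in rows 4 and 6 are the missing values that the later tasks deal with. A minimal sketch of how they could be located is shown here; it rebuilds the dataset from an inline string (CSV_TEXT is an illustrative name) so that it runs on its own, and it assumes the blank Salary and Age fields described above.

```python
import io

import pandas as pd

# Inline copy of the dataset so the example does not depend on a file.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select only the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```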


Write a Python program to perform the following tasks.


1. Import the dataset and do the following:
     a) Describe the dataset
     b) Show the shape of the dataset
     c) Display the first 3 rows of the dataset


a) Describing the dataset


df.describe()

Output:

             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000
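By default, describe() summarises only the numeric columns, which is why Country and Purchased do not appear above. As a possible extension, passing include="all" also reports the count, number of unique values, most frequent value and its frequency for the text columns. The sketch below rebuilds the data from a dictionary so that it is self-contained.

```python
import pandas as pd

nan = float("nan")  # marker for the two missing values

df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# include="all" adds summary rows for the non-numeric columns as well.
summary = df.describe(include="all")

# Show only the rows that apply to the categorical columns.
print(summary.loc[["count", "unique", "top", "freq"], ["Country", "Purchased"]])
```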

b) Shape of the dataset

df.shape

Output:

(10, 4)


c) Display first 3 rows from dataset

df.head(3)

Output:

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No


2. Handling Missing Values:
     a) Replace the missing values in the Salary and Age columns with the mean of each column.


from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 are Age and Salary
imputer.fit(df.iloc[:, 1:3])
df.iloc[:, 1:3] = imputer.transform(df.iloc[:, 1:3])
df

Output:

   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes
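The filled-in values (38.777778 for Age and 63777.777778 for Salary) are simply the column means reported by describe(). The same result can be obtained with pandas alone; the sketch below is an alternative to SimpleImputer, not a replacement for the assignment's method, and it uses only the two numeric columns to stay short.

```python
import pandas as pd

nan = float("nan")  # marker for the two missing values

df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, nan,
               58000, 52000, 79000, 83000, 67000],
})

# mean() skips NaN by default, so each mean is taken over 9 values.
means = df[["Age", "Salary"]].mean()

# Passing a Series to fillna fills each column with the value
# stored under that column's name.
filled = df.fillna(means)

print(means)
print(filled.loc[[4, 6]])  # the two rows that had missing values
```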

3. Data.csv has two categorical columns (the Country column and the Purchased column).

a. Apply one-hot encoding to the Country column.


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
df = pd.DataFrame(ct.fit_transform(df))
df

Output:

   0  1  2  3  4        5        6    7
0  1  0  1  0  0       44    72000   No
1  0  1  0  0  1       27    48000  Yes
2  0  1  0  1  0       30    54000   No
3  0  1  0  0  1       38    61000   No
4  0  1  0  1  0       40  63777.8  Yes
5  1  0  1  0  0       35    58000  Yes
6  0  1  0  0  1  38.7778    52000   No
7  1  0  1  0  0       48    79000  Yes
8  0  1  0  1  0       50    83000   No
9  1  0  1  0  0       37    67000  Yes
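Because ColumnTransformer returns a plain array, the resulting DataFrame has numbered columns and it is not obvious which indicator belongs to which country. As an alternative sketch (not the method asked for in the task), pandas' get_dummies performs the same kind of one-hot encoding and keeps readable column names. The small DataFrame below is rebuilt inline, with the two missing values left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Replace the Country column with one 0/1 indicator column per country.
# dtype=int gives 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["Country"], dtype=int)

print(list(encoded.columns))
print(encoded.head(3))
```

Each row has exactly one indicator set to 1, and the indicator columns are created in alphabetical order of the category names.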

b. Apply label encoding to the Purchased column.



from sklearn.preprocessing import LabelEncoder

# Encode the last column (Purchased) as integers: No -> 0, Yes -> 1
le = LabelEncoder()
df.iloc[:, -1] = le.fit_transform(df.iloc[:, -1])
df

Output:

   0  1  2  3  4        5        6  7
0  1  0  1  0  0       44    72000  0
1  0  1  0  0  1       27    48000  1
2  0  1  0  1  0       30    54000  0
3  0  1  0  0  1       38    61000  0
4  0  1  0  1  0       40  63777.8  1
5  1  0  1  0  0       35    58000  1
6  0  1  0  0  1  38.7778    52000  0
7  1  0  1  0  0       48    79000  1
8  0  1  0  1  0       50    83000  0
9  1  0  1  0  0       37    67000  1
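In the last column, No has become 0 and Yes has become 1, because the labels are numbered in sorted order. The sketch below reproduces that mapping with pandas' factorize as an illustrative alternative to LabelEncoder, and shows how the integer codes can be turned back into the original labels.

```python
import pandas as pd

purchased = pd.Series(["No", "Yes", "No", "No", "Yes",
                       "Yes", "No", "Yes", "No", "Yes"])

# sort=True numbers the labels in sorted order: 'No' -> 0, 'Yes' -> 1.
codes, classes = pd.factorize(purchased, sort=True)

print(list(classes))  # ['No', 'Yes']
print(list(codes))    # [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Look up each code in the class list to recover the original labels.
decoded = [classes[c] for c in codes]
print(decoded == list(purchased))  # True
```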
