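Resource overview
This archive applies the random forest algorithm to the classification problem on the adult dataset. It has four parts: (1) preprocessing of the adult dataset, written in Python; (2) a hand-written random forest implementation applied to the adult dataset; (3) the same classification problem solved with Python's sklearn module; (4) a MATLAB script that calls five machine-learning classification algorithms on the adult problem to compare which one achieves the best classification performance.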
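As a rough illustration of how the preprocessing and sklearn parts fit together, here is a minimal sketch. It is not the code from the archive: the toy rows, the mode-based filling, the sorted label codes, the `-1` code for unseen labels, and the hyperparameters are all assumptions chosen for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy rows imitating a few adult-style columns (not the real data).
train = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60, 33, 47],
    "workClass": ["Private", None, "State-gov", "Private",
                  "Private", "Self-emp", None, "State-gov"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", "<=50K", ">50K"],
})
test = pd.DataFrame({
    "age": [30, 55],
    "workClass": ["Private", "Local-gov"],  # "Local-gov" never occurs in train
    "salary": ["<=50K", ">50K"],
})

# Fill missing categories with the training-set mode; .mode() returns a
# Series because ties are possible, so [0] picks the first mode.
mode_value = train["workClass"].mode()[0]
train["workClass"] = train["workClass"].fillna(mode_value)
test["workClass"] = test["workClass"].fillna(mode_value)

# Build label -> integer codes from the training set only. sorted() keeps the
# codes reproducible between runs; labels unseen in training become -1 here
# (an assumption of this sketch).
codes = {label: idx for idx, label in enumerate(sorted(train["workClass"].unique()))}
train["workClass"] = train["workClass"].map(codes)
test["workClass"] = test["workClass"].map(codes).fillna(-1).astype(int)

salary_mapping = {"<=50K": 0, ">50K": 1}
y_train = train["salary"].map(salary_mapping)
y_test = test["salary"].map(salary_mapping)

# Fit a small random forest and score it on the held-out rows.
features = ["age", "workClass"]
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train[features], y_train)
predictions = model.predict(test[features])
accuracy = (predictions == y_test.to_numpy()).mean()
print(predictions, accuracy)
```

Building the codes from the training set alone mirrors the snippet on this page; handling unseen test labels explicitly avoids passing NaN values to the classifier.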
Code snippet and file information
# -*- coding: utf-8 -*-
"""
Created on Tue Nov  6 13:29:41 2018
@author: 28770
"""
import pandas as pd

excelFile = r'ML_data2.xlsx'
# Read sheet 0 of the workbook at the given path into a DataFrame
train_df = pd.DataFrame(pd.read_excel(excelFile, sheet_name=0))
# Read sheet 1 of the workbook at the given path into a DataFrame
test_df = pd.DataFrame(pd.read_excel(excelFile, sheet_name=1))
'''
# workClass_loss marks the missing entries in the 'workClass' column of train_df; it is True where data is missing
workClass_loss = train_df['workClass'].isnull()  # .notnull() gives the opposite result.
'''
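'''
Missing-value filling step (each missing value is filled with the value from the previous row).
Alternatives for filling the missing values in train_df: .mode() fills with the most frequent value
of the column, and mean() fills with the column mean.
Because a column may have several modes, .mode() returns a Series rather than a single number as
mean() does, so .mode()[0] is used to fill with the first mode.
'''
# Forward fill; a missing value in the very first row has no predecessor and stays NaN
train_df_fill = train_df.ffill()
test_df_fill = test_df.ffill()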
'''
Drop the redundant column
'''
train_df_fill = train_df_fill.drop(columns=['education'])
test_df_fill = test_df_fill.drop(columns=['education'])
'''
Discrete feature mapping
'''
salary_mapping = {'<=50K': 0, '>50K': 1}
train_df_fill['salary'] = train_df_fill['salary'].map(salary_mapping)
test_df_fill['salary'] = test_df_fill['salary'].map(salary_mapping)
# 'education' was dropped above, so it is not listed here (indexing it would raise a KeyError)
Discrete_attribute = ['workClass', 'marital_status', 'occupation',
                      'relationship', 'race', 'sex', 'native_country']
for attribute in Discrete_attribute:
    # Build the label -> integer mapping from the training set only;
    # labels that occur only in the test set are mapped to NaN
    attribute_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill[attribute]))}
    train_df_fill[attribute] = train_df_fill[attribute].map(attribute_mapping)
    test_df_fill[attribute] = test_df_fill[attribute].map(attribute_mapping)
'''
workClass_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['workClass']))}
train_df_fill['workClass'] = train_df_fill['workClass'].map(workClass_mapping)
test_df_fill['workClass'] = test_df_fill['workClass'].map(workClass_mapping)
education_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['education']))}
train_df_fill['education'] = train_df_fill['education'].map(education_mapping)
test_df_fill['education'] = test_df_fill['education'].map(education_mapping)
marital_status_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['marital_status']))}
train_df_fill['marital_status'] = train_df_fill['marital_status'].map(marital_status_mapping)
test_df_fill['marital_status'] = test_df_fill['marital_status'].map(marital_status_mapping)
occupation_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['occupation']))}
train_df_fill['occupation'] = train_df_fill['occupation'].map(occupation_mapping)
test_df_fill['occupation'] = test_df_fill['occupation'].map(occupation_mapping)
relationship_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['relationship']))}
train_df_fill['relationship'] = train_df_fill['relationship'].map(relationship_mapping)
test_df_fill['relationship'] = test_df_fill['relationship'].map(relationship_mapping)
race_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['race']))}
train_df_fill['race'] = train_df_fill['race'].map(race_mapping)
test_df_
  Attribute      Size  Date       Time   Name
----------- ---------  ---------- -----  ----
       File      4575  2018-11-13 23:33  Random_Forest\excel_change.py
       File      1589  2018-11-13 20:55  Random_Forest\Matlab_xlr\excel_run.m
       File   2677491  2018-11-06 20:50  Random_Forest\Matlab_xlr\ML_data2_trans.xlsx
       File   2918697  2018-11-01 21:57  Random_Forest\ML_data2.xlsx
       File    642592  2018-11-08 10:55  Random_Forest\ML_data2_test.csv
       File   1285749  2018-11-08 10:55  Random_Forest\ML_data2_train.csv
       File   2677491  2018-11-06 20:50  Random_Forest\ML_data2_trans.xlsx
       File    642435  2018-11-08 10:59  Random_Forest\Random Forest\ML_data2_test.csv
       File   1285592  2018-11-08 10:59  Random_Forest\Random Forest\ML_data2_train.csv
       File   2677491  2018-11-06 20:50  Random_Forest\Random Forest\ML_data2_trans.xlsx
       File     10260  2018-11-14 13:26  Random_Forest\Random Forest\Random_Forest.py
       File    642435  2018-11-08 10:59  Random_Forest\RF_sklearn\ML_data2_test.csv
       File   1285592  2018-11-08 10:59  Random_Forest\RF_sklearn\ML_data2_train.csv
       File   2677491  2018-11-06 20:50  Random_Forest\RF_sklearn\ML_data2_trans.xlsx
       File      1259  2018-11-14 14:15  Random_Forest\RF_sklearn\RF_sklearn.py
       File       214  2018-11-14 13:51  Random_Forest\文本描述(首先閱讀).txt
  Directory         0  2018-12-14 10:51  Random_Forest\Matlab_xlr
  Directory         0  2018-12-14 10:51  Random_Forest\Random Forest
  Directory         0  2018-12-14 10:51  Random_Forest\RF_sklearn
  Directory         0  2018-12-14 10:51  Random_Forest
----------- ---------  ---------- -----  ----
             19430953                    20