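I. Resource Overview
Open http://archive.ics.uci.edu/ml/, click the "view all data sets" link to open the full data set listing, click Instances to sort the data sets by number of instances in descending order, and filter for data sets whose task is Classification. Our group finally chose the "Letter Recognition Data Set".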
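II. Data Analysis
The Letter Recognition data set contains 20,000 data objects, each described by 16 features, and every feature takes integer values. It was donated on January 1, 1991 and is mainly used for classification experiments. The classification goal is to identify which of the 26 letters of the English alphabet is represented by each rectangular image of black-and-white pixels. The images are based on 20 different fonts, and each was randomly distorted, giving 20,000 simulated instances. Each instance was converted into 16 primitive numerical features. We use 10,000 instances for training and the other 10,000 for letter prediction. Because every sample carries an explicit class label, this is a supervised learning task.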
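As a minimal sketch of the record layout described above, the following Python parses one record into a letter label plus 16 integer features and splits a list of records into two halves. It assumes the label comes first and the fields are comma-separated, which matches how the listing further down reads `curLine[0]` as the label. The names `parse_record` and `split_half` are hypothetical, and the two sample rows are illustrative values in the assumed format, not a claim about the actual file contents.

```python
from typing import List, Tuple

Record = Tuple[str, List[int]]


def parse_record(line: str) -> Record:
    """Split one comma-separated record into (letter, 16 integer features).

    Assumption: the class label is the first field, followed by 16 integers.
    """
    fields = line.strip().split(",")
    label = fields[0]
    features = [int(v) for v in fields[1:]]
    if len(features) != 16:
        raise ValueError(f"expected 16 features, got {len(features)}")
    return label, features


def split_half(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Use the first half for training and the second half for prediction."""
    mid = len(records) // 2
    return records[:mid], records[mid:]


# Illustrative rows in the assumed format (label, then 16 integers).
sample_lines = [
    "T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8",
    "I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10",
]
records = [parse_record(line) for line in sample_lines]
train, test = split_half(records)
```

With all 20,000 records loaded, `split_half` would give the 10,000 / 10,000 split described above. Whether the group split the file by position or at random is not stated in the source.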

Code Snippet and File Information
from numpy import *
import string

# Parse a data file; every feature value is an integer.
# NOTE: the field separator was lost from this listing; ',' is assumed.
def loadDataSet(filename):
    with open(filename) as fr:
        numFeat = len(fr.readline().split(','))
    dataMat = []
    labelMat = []
    fr = open(filename)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip('\n').split(',')
        # field 0 is the class label, so features start at index 1
        for i in range(1, numFeat):
            lineArr.append(int(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(curLine[0])
    fr.close()
    return dataMat, labelMat
'''
purpose: classify data by comparing one feature against a threshold
'''
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    retArray = ones((shape(dataMatrix)[0], 1))
    # NOTE: the comparisons were stripped from this listing;
    # the usual decision-stump rule is assumed here.
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray
'''
purpose: build a single-level decision tree (decision stump), the weak classifier
input:   dataArr: data set; classLabels: class labels; D: sample weights
output:  bestStump: the stump with the minimum weighted error rate; minError: that minimum error rate
         bestClasEst: estimated class labels
'''
def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0
    # empty dictionary that stores the best stump found for the given weights D
    bestStump = {}
    bestClasEst = mat(zeros((m, 1)))
    minError = inf  # init minimum error to +infinity
    for i in range(n):  # loop over all dimensions
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # loop over all thresholds in the current dimension
            for inequal in ['lt', 'gt']:  # try both "less than" and "greater than"
                threshVal = (rangeMin + float(j) * stepSize)
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)  # classify with dimension i and this threshold
                errArr = mat(ones((m, 1)))  # error array: 1 where misclassified
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T * errArr  # total error weighted by D
                # print("split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
                # NOTE: the comparison was stripped from this listing; '<' is assumed.
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
'''
purpose: the complete AdaBoost algorithm
input parameters:
dataArr: data set
classLabels: class labels
numIt: number of iterations (the only parameter the user needs to specify)
output parameters:
weakClassArr:seve
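
The listing breaks off inside the docstring above, so the AdaBoost driver itself is not shown. As a rough illustration only, here is the textbook update that one boosting round performs on the sample weights once a weak classifier has produced its predictions. This is a sketch under stated assumptions, not necessarily the authors' code: labels and predictions are taken to be in {-1, +1}, the weights are taken to sum to 1, and the name `adaboost_round` is hypothetical.

```python
import math
from typing import List, Tuple


def adaboost_round(weights: List[float],
                   labels: List[int],
                   predictions: List[int]) -> Tuple[float, List[float]]:
    """One textbook AdaBoost round for labels/predictions in {-1, +1}.

    Returns (alpha, new_weights): alpha is the vote of this weak classifier,
    new_weights is the re-normalised sample weight distribution.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Clamp so the logarithm below stays finite when err is 0 or 1.
    eps = 1e-12
    err = min(max(err, eps), 1.0 - eps)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct samples (y*h = +1) shrink, misclassified ones (y*h = -1) grow.
    scaled = [w * math.exp(-alpha * y * h)
              for w, y, h in zip(weights, labels, predictions)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Four equally weighted samples; the weak classifier gets the second one wrong.
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right. Note that this update is defined for two classes, and the source does not show how the 26 letter labels are mapped onto it.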
 Attribute       Size        Date   Time  Name
----------- ---------  ---------- -----  ----
     File      356180  2016-11-24 20:38  traindata.txt
     File        7150  2016-11-26 22:02  TreeAdaBoost.py
     File       36042  2017-03-18 09:31  文檔.docx
     File      356383  2016-11-24 20:39  testdata.txt
----------- ---------  ---------- -----  ----
               755755                    4