C4.5 algorithm
{{Short description|Algorithm for making decision trees}}
[[File:Diagramm_beispiel_sarah_geht_segeln.png|thumb|right]]
'''C4.5''' is an algorithm developed by [[Ross Quinlan]] for generating a [[decision tree]].Quinlan, J. R. ''C4.5: Programs for Machine Learning''. Morgan Kaufmann Publishers, 1993. C4.5 is an extension of Quinlan's earlier [[ID3 algorithm]]. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a [[Statistical classification|statistical classifier]]. In 2011, the authors of the [[Weka (machine learning)|Weka]] machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".{{cite web |url=http://www.cs.waikato.ac.nz/~ml/weka/book.html |title=Data Mining: Practical machine learning tools and techniques, 3rd Edition |author=Ian H. Witten |author2=Eibe Frank |author3=Mark A. Hall |year=2011 |publisher=Morgan Kaufmann, San Francisco |page=191 |access-date=2017-07-04 |archive-date=2020-11-27 |archive-url=https://web.archive.org/web/20201127014857/https://www.cs.waikato.ac.nz/~ml/weka/book.html |url-status=dead }} It became popular after ranking #1 in ''Top 10 Algorithms in Data Mining'', a pre-eminent paper published by [[Springer Science+Business Media|Springer]] [[Lecture Notes in Computer Science|LNCS]] in 2008.[http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf Umd.edu - Top 10 Algorithms in Data Mining]
==Algorithm==
#Create a decision ''node'' that splits on ''a_best''.
#Recurse on the sublists obtained by splitting on ''a_best'', and add those nodes as children of ''node''.
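The split-and-recurse steps above can be sketched in Python. This is an illustrative sketch, not Quinlan's implementation: the toy weather data, the attribute names, and the `(attribute, branches)` tuple representation of internal nodes are all assumptions made for the example, and it handles only discrete attributes using the gain-ratio criterion.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5's splitting criterion: information gain of splitting on attr,
    normalised by the entropy of the split itself."""
    n = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, attrs):
    """Pick a_best, create a node that splits on it, and recurse on the
    sublists obtained by splitting -- the steps listed above."""
    if len(set(labels)) == 1:            # pure subset: return a leaf
        return labels[0]
    if not attrs:                        # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    a_best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    remaining = [a for a in attrs if a != a_best]
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[a_best], []).append((row, label))
    branches = {value: build_tree([r for r, _ in pairs],
                                  [l for _, l in pairs], remaining)
                for value, pairs in groups.items()}
    return (a_best, branches)            # internal decision node

def classify(tree, row):
    """Walk from the root to a leaf, following the branch matching the
    row's value for each node's split attribute."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Hypothetical toy data: should Sarah go sailing?
rows = [{"outlook": "sunny",    "windy": "true"},
        {"outlook": "sunny",    "windy": "false"},
        {"outlook": "overcast", "windy": "false"},
        {"outlook": "rain",     "windy": "false"},
        {"outlook": "rain",     "windy": "true"}]
labels = ["no", "no", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

Leaves are plain class labels, so `classify` simply descends until it leaves the tuple-encoded internal nodes; on this small, conflict-free dataset the tree reproduces every training label.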
==Implementations==
'''J48''' is an [[open source]] [[Java (programming language)|Java]] implementation of the C4.5 algorithm in the [[Weka (machine learning)|Weka]] [[data mining]] tool.
==Improvements from ID3 algorithm==
C4.5 made a number of improvements to ID3. Some of these are:
* Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.J. R. Quinlan. Improved use of continuous attributes in C4.5. ''Journal of Artificial Intelligence Research'', 4:77–90, 1996.
* Handling training data with missing attribute values: C4.5 allows attribute values to be marked as missing. Missing attribute values are simply not used in gain and entropy calculations.
* Handling attributes with differing costs.
* Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.
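The continuous-attribute handling described in the first point can be sketched as follows: sort the values, consider a candidate threshold between each pair of adjacent distinct values, and keep the one whose binary split (value ≤ threshold vs. value > threshold) yields the highest information gain. The use of midpoints as candidate thresholds and the toy humidity data are illustrative assumptions, not details taken from the C4.5 source.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain): the binary split value <= t / value > t
    with the highest information gain.  Candidates are midpoints between
    consecutive distinct sorted values (a common simplification of
    C4.5's choice of cut point)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:                 # no boundary between equal values
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, (lo + hi) / 2
    return best_t, best_gain

# Hypothetical humidity readings with play / don't-play labels
t, g = best_threshold([65, 70, 75, 80, 85, 90],
                      ["yes", "yes", "yes", "no", "no", "no"])
# t == 77.5 separates the classes perfectly, so the gain equals the
# full entropy of the labels (1.0 bit)
```

Scanning only boundaries between distinct values keeps the search linear in the number of samples after sorting, which is why a single continuous attribute adds little cost over a discrete one.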
==Improvements in C5.0/See5 algorithm==