Algorithms

Decision Trees - (dtree)

(wikipedia) A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal.

See also: [QR1986] and [QR1987]

[QR1986]Quinlan, J. R. Ross. Induction of decision trees. Machine learning, 1986, 1. Jg., Nr. 1, S. 81-106.
[QR1987]Quinlan, J. R. Ross. Simplifying decision trees. International journal of man-machine studies, 1987, 27. Jg., Nr. 3, S. 221-234.

Parameters

splitter - Which strategy should be used to choose the next split attribute?

`Best` always chooses the attribut that maximizes the score obtained via the split quality criterion, `random` just chooses a ... random... criterion.

  • name: Splitting strategy
  • default:
  • values: [u’best’, u’random’]
  • type: list

criterion - Measure of split quality

`gini` uses Gini’s impurity measure, `entropy` uses the Information Gain.

  • name: Split criterion
  • default:
  • values: [u’gini’, u’entropy’]
  • type: list

max_depth - Maximum depth of a tree.

If 0 then leaves are expanded until completely pure or all attributes have been used. Otherwise tree growth will stop after the set amount of attributes has been used.

  • min: 0
  • default:
  • type: int
  • name: Maximum tree depth

k-Nearest Neighbors - (kNN)

[WPknn] In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

[WPknn]http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Parameters

distance - Used distance measure

Based on this metric the `k` nearest neighbors are chosen.

  • name: Distance measure
  • default:
  • values: {u’chebyshev’: {u’short_desc_html’: u’<div class=”document”>n<p>The Chebyshev (maximum norm) distance measure</p>n</div>n’, u’short_desc’: u’The Chebyshev (maximum norm) distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\max_{i} |x_i-y_i|.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005max<sub><i>i</i></sub>|<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>|</span>n.</p>n</div>n’}, u’euclidean’: {u’short_desc_html’: u’<div class=”document”>n<p>The common euclidean distance measure</p>n</div>n’, u’short_desc’: u’The common euclidean distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\sqrt{\\sum_{i=0}^n (x_i-y_i)^2}.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005<span class=”sqrt”><span class=”radical”>u221a</span><span class=”ignored”>(</span><span class=”root”><span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup>2</sup></span><span class=”ignored”>)</span></span></span>n.</p>n</div>n’}, u’sqeuclidean’: {u’short_desc_html’: u’<div class=”document”>n<p>The squared euclidean distance measure</p>n</div>n’, u’short_desc’: u’The squared euclidean distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\sum_{i=0}^n (x_i-y_i)^2.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005<span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup>2</sup></span>n.</p>n</div>n’}, u’hamming’: {u’short_desc_html’: u’<div class=”document”>n<p>The hamming distance measure</p>n</div>n’, u’short_desc’: u’The hamming distance measure’, u’description’: u’The distance is calculated as the number of components which differ over the number of components.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as the number of components which differ over the number of components.</p>n</div>n’}, u’weuclidean’: {u’short_desc_html’: u’<div class=”document”>n<p>The weighted euclidean distance measure</p>n</div>n’, u’short_desc’: u’The weighted euclidean distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\sqrt{\\sum_{i=0}^n w_i(x_i-y_i)^2}.’, u’parameters’: {u’weights’: {u’name’: u’Weigths’, u’default’: u’‘, u’short_desc’: u’Comma-separated list of weights for each dimension.’, u’short_desc_html’: u’<div class=”document”>n<p>Comma-separated list of weights for each dimension.</p>n</div>n’, u’validator’: <function invalid_weights at 0x7f071f231050>, u’validators’: [<function invalid_optional at 0x7f071f282aa0>, <function invalid_weights at 0x7f071f231050>], u’type’: u’string’}}, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005<span class=”sqrt”><span class=”radical”>u221a</span><span class=”ignored”>(</span><span class=”root”><span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span><i>w</i><sub><i>i</i></sub>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup>2</sup></span><span class=”ignored”>)</span></span></span>n.</p>n</div>n’}, u’minkowski’: {u’short_desc_html’: u’<div class=”document”>n<p>The Minkowski distance with parameter <tt class=”docutils literal”>`p`</tt>. To turn this into euclidean distance, choose p=2.</p>n</div>n’, u’short_desc’: u’The Minkowski distance with parameter `p`. To turn this into euclidean distance, choose p=2.’, u’description’: u’The distance is calculated as d(x,y) = (\\sum_{i=0}^n (x_i-y_i)^p)^{1/p}.’, u’parameters’: {u’p’: {u’name’: u’p’, u’min’: 0.001, u’default’: u’‘, u’parser’: <type ‘float’>, u’short_desc’: u’Parameter of the Minkowski distance (exponent).’, u’short_desc_html’: u’<div class=”document”>n<p>Parameter of the Minkowski distance (exponent).</p>n</div>n’, u’validators’: [<function invalid_optional at 0x7f071f282aa0>, <function invalid_range at 0x7f071f230ed8>], u’type’: u’float’}}, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005(<span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup><i>p</i></sup>)<sup>1u2005u2044u2005<i>p</i></sup></span>n.</p>n</div>n’}, u’seuclidean’: {u’short_desc_html’: u’<div class=”document”>n<p>The standardized euclidean distance measure</p>n</div>n’, u’short_desc’: u’The standardized euclidean distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\sqrt{\\frac{\\sum_{i=0}^n (x_i-y_i)^2}{s^2_i}}, where s^2_i is the standard deviation between the x_i and y_i in the data set.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005<span class=”sqrt”><span class=”radical”>u221a</span><span class=”ignored”>(</span><span class=”root”><span class=”fraction”><span class=”ignored”>(</span><span class=”numerator”><span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup>2</sup></span><span class=”ignored”>)/(</span><span class=”denominator”><i>s</i><span class=”scripts”><sub class=”script”><i>i</i></sub><sup class=”script”>2</sup></span></span><span class=”ignored”>)</span></span></span><span class=”ignored”>)</span></span></span>n, where <span class=”formula”><i>s</i><span class=”scripts”><sub class=”script”><i>i</i></sub><sup class=”script”>2</sup></span></span>n is the standard deviation between the <span class=”formula”><i>x</i><sub><i>i</i></sub></span>n and <span class=”formula”><i>y</i><sub><i>i</i></sub></span>n in the data set.</p>n</div>n’}, u’manhattan’: {u’short_desc_html’: u’<div class=”document”>n<p>The manhatten (or cityblock) distance measure</p>n</div>n’, u’short_desc’: u’The manhatten (or cityblock) distance measure’, u’description’: u’The distance is calculated as d(x,y) = \\sum_{i=0}^n |x_i-y_i|.’, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005<span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>|<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>|</span>n.</p>n</div>n’}, u’wminkowski’: {u’short_desc_html’: u’<div class=”document”>n<p>The Minkowski distance with parameter p. To turn this into euclidean distance, choose p=2.</p>n</div>n’, u’short_desc’: u’The Minkowski distance with parameter p. To turn this into euclidean distance, choose p=2.’, u’description’: u’The distance is calculated as d(x,y) = (\\sum_{i=0}^n (x_i-y_i)^p)^{1/p}.’, u’parameters’: {u’p’: {u’name’: u’p’, u’min’: 0.001, u’default’: u’‘, u’parser’: <type ‘float’>, u’short_desc’: u’Parameter of the Minkowski distance (exponent).’, u’short_desc_html’: u’<div class=”document”>n<p>Parameter of the Minkowski distance (exponent).</p>n</div>n’, u’validators’: [<function invalid_optional at 0x7f071f282aa0>, <function invalid_range at 0x7f071f230ed8>], u’type’: u’float’}, u’weights’: {u’name’: u’Weights’, u’default’: u’‘, u’short_desc’: u’Comma-separated list of weights for each dimension.’, u’short_desc_html’: u’<div class=”document”>n<p>Comma-separated list of weights for each dimension.</p>n</div>n’, u’validator’: <function invalid_weights at 0x7f071f231050>, u’validators’: [<function invalid_optional at 0x7f071f282aa0>, <function invalid_weights at 0x7f071f231050>], u’type’: u’string’}}, u’description_html’: u’<div class=”document”>n<p>The distance is calculated as <span class=”formula”><i>d</i>(<i>x</i>,u2005<i>y</i>)u2005=u2005(<span class=”limits”><span class=”limit”><span class=”symbol”>u2211</span></span></span><span class=”scripts”><sup class=”script”><i>n</i></sup><sub class=”script”><i>i</i>u2005=u20050</sub></span>(<i>x</i><sub><i>i</i></sub>u2005u2212u2005<i>y</i><sub><i>i</i></sub>)<sup><i>p</i></sup>)<sup>1u2005u2044u2005<i>p</i></sup></span>n.</p>n</div>n’}}
  • type: dict

k - Number of neighbors

The higher this value is chosen the more neighboring points can `vote` for the class label of the target instance. Too low values may be prone to noise, too high values may be prone to unbalanced class distributions.

  • min: 1
  • default: 3
  • type: int
  • name: k

weights - Method used for instance weighting.

If `uniform` is used, each of the nearest neighbours will have equal weight in the prediction of the new class label while `weighted` uses the inverses of their respective distances as weights.

  • name: Weighting method
  • default:
  • values: [u’uniform’, u’distance’]
  • type: list