Bayesian classifier and Python implementation

0. Preface

Bayesian classification is a general term for a class of classification algorithms that are all based on Bayes' theorem, hence the collective name. This article consists of my notes from studying Bayesian classifiers, plus a practical text-classification example written in Python.

1. Bayesian decision theory

Bayesian decision theory is the basic method of decision making under the framework of probability. For classification tasks, when all relevant probabilities are known, Bayesian decision theory considers how to select the optimal class label based on these probabilities and the misjudgment loss.

Assume there are $N$ possible class labels, i.e. $\mathcal{Y} = \{c_1, c_2, \dots, c_N\}$, and let $\lambda_{ij}$ denote the loss incurred by misclassifying a sample whose true label is $c_j$ as $c_i$. Based on the posterior probability $P(c_i \mid x)$, the expected loss of classifying sample $x$ as $c_i$, i.e. the "conditional risk", is

$$R(c_i \mid x) = \sum_{j=1}^{N} \lambda_{ij}\, P(c_j \mid x)$$

The main task is to find a criterion $h: \mathcal{X} \mapsto \mathcal{Y}$ that minimizes the overall risk

$$R(h) = \mathbb{E}_x\big[ R(h(x) \mid x) \big]$$

Obviously, if $h$ minimizes the conditional risk $R(h(x) \mid x)$ for every sample $x$, the overall risk $R(h)$ is also minimized.

This yields the Bayes decision rule: to minimize the overall risk, it suffices to choose, on each sample, the class label that minimizes the conditional risk, i.e.

$$h^*(x) = \arg\min_{c \in \mathcal{Y}} R(c \mid x)$$

Here $h^*$ is called the Bayes optimal classifier, and the corresponding overall risk $R(h^*)$ is called the Bayes risk. It reflects the best performance the classifier can achieve.

If the goal is to minimize the classification error rate, the misjudgment loss can be written as

$$\lambda_{ij} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{otherwise} \end{cases}$$

and the conditional risk becomes

$$R(c \mid x) = 1 - P(c \mid x)$$

so the Bayes optimal classifier that minimizes the classification error rate is

$$h^*(x) = \arg\max_{c \in \mathcal{Y}} P(c \mid x)$$

i.e. for each sample $x$, choose the class that maximizes the posterior probability $P(c \mid x)$. Based on Bayes' theorem,

$$P(c \mid x) = \frac{P(c)\, P(x \mid c)}{P(x)}$$

The proof of this formula is easy: it follows directly from the definition of conditional probability, since $P(c \mid x)\,P(x) = P(x, c) = P(x \mid c)\,P(c)$; interested readers can look up the details. Although the theorem is very simple, it builds a bridge between the prior probability and the posterior probability.
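As a quick numerical illustration (the numbers are my own, not from the article): suppose class $c_1$ has prior $P(c_1) = 0.3$ and likelihood $P(x \mid c_1) = 0.8$, while the only other class $c_2$ has $P(c_2) = 0.7$ and $P(x \mid c_2) = 0.2$. Then

$$P(c_1 \mid x) = \frac{P(c_1)\,P(x \mid c_1)}{P(x)} = \frac{0.3 \times 0.8}{0.3 \times 0.8 + 0.7 \times 0.2} = \frac{0.24}{0.38} \approx 0.63$$

so observing $x$ raises the probability of $c_1$ from a prior of 0.3 to a posterior of about 0.63.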

In Bayes' theorem above:

  • $P(c)$ is the class prior probability.
  • $P(x \mid c)$ is the class-conditional probability of sample $x$ given class $c$, also called the likelihood.
  • $P(x)$ is the evidence factor used for normalization; it is the same for every class.

  • Prior probability: The probability obtained based on past experience and analysis.
  • Posterior probability: the probability obtained by revising the original prior probability in light of new information; it is closer to the actual situation.

The class prior probability $P(c)$ is the proportion of each class of samples in the sample space. By the law of large numbers (when there are enough samples, the frequency tends to be stable and equal to the probability), $P(c)$ can be estimated by the frequency of each class in the training set when the training samples are sufficient. That leaves the class-conditional probability $P(x \mid c)$, the probability of $x$ appearing within class $c$, which involves the joint probability over all attributes. With a single discrete attribute this is manageable, but with many attributes estimating it by frequency becomes very difficult, since many attribute combinations never appear in the training set. Therefore, maximum likelihood estimation is generally used here.
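As a minimal sketch (the labels below are hypothetical, not from the article), estimating class priors by frequency could look like this:

import numpy as np

# Hypothetical training labels; in practice these come from the training set.
labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# Estimate P(c) by the frequency of each class among the training labels.
classes, counts = np.unique(labels, return_counts=True)
priors = counts / counts.sum()
print(dict(zip(classes.tolist(), priors.tolist())))  # {0: 0.5, 1: 0.5}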

2. Maximum likelihood estimation

Maximum Likelihood Estimation (MLE) is a classical method of estimating a probability distribution from data samples. The common strategy is to first assume that the population follows some probability distribution and then estimate the parameters of that distribution from the training samples. Applied to the class-conditional probability, we assume $P(x \mid c)$ has a fixed form determined by a parameter vector $\theta_c$; the problem then becomes estimating $\theta_c$ from the training samples. The core idea of maximum likelihood is that the estimated parameters should maximize the probability of the observed samples, that is, maximize the likelihood of the training data.

Let $D_c$ denote the set of samples of class $c$ in the training set. Assuming these samples are independent and identically distributed, the likelihood of the parameter $\theta_c$ for the dataset $D_c$ is

$$P(D_c \mid \theta_c) = \prod_{x \in D_c} P(x \mid \theta_c)$$

Maximum likelihood estimation looks for the parameter value $\hat{\theta}_c$ that maximizes this likelihood. Intuitively, it searches among all possible values of $\theta_c$ for the one that makes the observed data most "probable".

The product of many probabilities makes the computation inconvenient (and easily causes numerical underflow), so the log-likelihood is generally used instead:

$$LL(\theta_c) = \log P(D_c \mid \theta_c) = \sum_{x \in D_c} \log P(x \mid \theta_c)$$

The maximum likelihood estimate of the parameter $\theta_c$ is then

$$\hat{\theta}_c = \arg\max_{\theta_c} LL(\theta_c)$$

Therefore, the training process of a Bayesian classifier amounts to parameter estimation. Summarizing, estimating parameters by maximum likelihood generally takes the following four steps (a small numerical sketch follows the list):

  • 1. Write the likelihood function;
  • 2. Take the logarithm of the likelihood function and sort it out;
  • 3. Find the derivative, set the partial derivative to 0, and get the likelihood equation set;
  • 4. Solve the system of likelihood equations and get all the parameters.
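For example, when the assumed distribution is a one-dimensional Gaussian, steps 3 and 4 have a closed-form solution: the MLE of the mean is the sample mean and the MLE of the variance is the biased sample variance. A minimal sketch with made-up numbers:

import numpy as np

# Hypothetical attribute values observed for one class (not from the article).
x = np.array([0.697, 0.774, 0.634, 0.608, 0.556])

# Setting the partial derivatives of the Gaussian log-likelihood to zero gives
# mu_hat = sample mean and sigma2_hat = mean squared deviation (ddof=0).
mu_hat = x.mean()
sigma2_hat = x.var(ddof=0)
print(mu_hat, sigma2_hat)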

3. Naive Bayes Classifier

Naive Bayes classification is a very simple classification algorithm; it is called naive because its idea really is simple. The basic idea is: for a given item to be classified, compute the probability of each class given this item, and assign the item to whichever class has the largest probability. In plain terms: you see a black man on the street and I ask you to guess where he comes from; you will most likely guess Africa. Why? Because Africans have the highest proportion among black people. Of course he may also be American or Asian, but with no other information available we choose the class with the highest conditional probability. This is the basis of naive Bayes thinking.

Based on the attribute conditional independence assumption, we can get

$$P(c \mid x) = \frac{P(c)\,P(x \mid c)}{P(x)} = \frac{P(c)}{P(x)} \prod_{i=1}^{d} P(x_i \mid c)$$

where $d$ is the number of attributes and $x_i$ is the value of $x$ on the $i$-th attribute.

Since $P(x)$ is the same for all classes, the naive Bayes classifier can be written as

$$h_{nb}(x) = \arg\max_{c \in \mathcal{Y}} P(c) \prod_{i=1}^{d} P(x_i \mid c)$$

In this way, estimating the class-conditional probability of the whole sample $x$ reduces to estimating the class-conditional probability of each attribute value of $x$.

For a discrete attribute, the conditional probability can be estimated as

$$P(x_i \mid c) = \frac{|D_{c,x_i}|}{|D_c|}$$

where $D_{c,x_i}$ is the set of class-$c$ samples that take value $x_i$ on the $i$-th attribute.

For a continuous attribute, if the attribute is assumed to follow a normal distribution $p(x_i \mid c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2)$, the conditional probability can be estimated as

$$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$

where $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of the $i$-th attribute over the class-$c$ samples.
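A minimal sketch of these two per-attribute estimates (the helper functions below are my own illustration, not from the article):

import numpy as np

def discrete_cond_prob(values, labels, x_i, c):
    """P(x_i | c) for a discrete attribute: frequency of value x_i among class-c samples."""
    in_class = labels == c
    return np.sum(values[in_class] == x_i) / np.sum(in_class)

def gaussian_cond_prob(values, labels, x_i, c):
    """p(x_i | c) for a continuous attribute under the normal assumption."""
    v = values[labels == c]
    mu, sigma = v.mean(), v.std()
    return np.exp(-(x_i - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)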

Example

Here the watermelon dataset 3.0 from Zhou Zhihua's Machine Learning (the "watermelon book") is used as an example. Only the two continuous attributes, density and sugar content, are reproduced below; the values of the categorical attributes and the good/bad labels are not shown.

No., density, sugar content
1, 0.697, 0.46
2, 0.774, 0.376
3, 0.634, 0.264
4, 0.608, 0.318
5, 0.556, 0.215
6, 0.403, 0.237
7, 0.481, 0.149
8, 0.437, 0.211
9, 0.666, 0.091
10, 0.243, 0.267
11, 0.245, 0.057
12, 0.343, 0.099
13, 0.639, 0.161
14, 0.657, 0.198
15, 0.36, 0.37
16, 0.593, 0.042
17, 0.719, 0.103

Use the above data to train a naive Bayes classifier and then classify the test sample "Test 1":

No., density, sugar content
Test 1, 0.697, 0.46

4. Practice: text classification with Python

To get features from text, you first need to split the text into tokens. How exactly? A feature here is a token from the text, and a token can be any combination of characters. You can think of tokens as words, or you can use non-word tokens such as URLs, IP addresses, or any other strings.
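For English text, a minimal tokenization sketch (my own helper, not part of the article's later code) could look like this:

import re

def text_parse(big_string):
    """Split raw text into lower-cased word tokens, dropping very short strings."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

print(text_parse("My dog has flea problems, help please!"))
# ['dog', 'has', 'flea', 'problems', 'help', 'please']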

Take the message board of an online community as an example. In order not to harm the community's development, we need to block insulting comments, so we build a quick filter: if a message uses negative or insulting language, it is flagged as inappropriate. Filtering this type of content is a very common requirement. For this problem we establish two categories, insulting and non-insulting, represented by 1 and 0 respectively.

data:

"my","dog","has","flea","problems","help","please","null"
"maybe","not","take","him","to","dog","park","stupid","null"
"my","dalmation","is","so","cute","I","love","him","null"
"stop","posting","stupid","worthless","garbage","null"
 "mr","licks","ate","my","steak","how","to","stop","him","null"
"quit","buying","worthless","dog","food","stupid","null"
 

Python implementation:

# _*_ coding:utf-8 _*_
import numpy as np

def loadDataSet():
    """
      1 
    @ return postingList:  
    @ return classVec:  
    """
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = insulting, 0 = not insulting
    return postingList, classVec

def createVocabList(dataSet):
    """
     
    @ param dataSet:  
    @ return vocabSet:  
    """
    vocabSet = set([])
    for document in dataSet:
        # take the union of the current vocabulary and this document's words
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """
     . 1 0
    @ param vocabList:  
    @ param inputSet:  
    @ return returnVec:  
    """
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print(" : %s  !" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """
     
    @ param trainMatrix:  
    @ param trainCategory:  
    """
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # Laplace smoothing: initialize word counts to 1 and denominators to 2
    # so that words unseen in one class do not zero out the whole product
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take the log to avoid underflow when many small probabilities are multiplied
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    Classify a document vector by comparing the log posterior of the two classes
    """
    p1 = sum(vec2Classify*p1Vec)+np.log(pClass1)
    p0 = sum(vec2Classify*p0Vec)+np.log(1-pClass1)
    if p1>p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(np.array(trainMat),np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))

if __name__=='__main__':
    testingNB()
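If everything is wired up as above, running the script should print that ['love', 'my', 'dalmation'] is classified as 0 (non-insulting) and ['stupid', 'garbage'] is classified as 1 (insulting).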