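Apriori is the founding algorithm of frequent pattern mining in data mining. It was proposed by Agrawal and Srikant in 1994, and its idea is simple and straightforward. First, mine the frequent patterns of length 1. Then, starting from k = 2, merge the frequent patterns of length k-1 into candidate patterns of length k, count how many transactions contain each candidate, and keep only the candidates whose count reaches the minimum frequency and whose subsets of length k-1 are all frequent as well. Note that, to avoid generating the same candidate twice, two patterns are merged only when their first k-2 characters are identical and the (k-1)-th character of one is smaller than that of the other.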
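The merge rule described above can be illustrated in isolation. The following is a minimal sketch, assuming itemsets are stored as sorted tuples; the function name `join_step` and the sample data are made up for illustration and are not part of the original script.

```python
def join_step(frequent_itemsets):
    """Build length-k candidates from sorted tuples of length k-1."""
    candidates = []
    for a in frequent_itemsets:
        for b in frequent_itemsets:
            # Merge only when the first k-2 items agree and a's last item
            # sorts strictly before b's, so each candidate is built once.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
    return sorted(candidates)


# Hypothetical frequent patterns of length 2.
frequent_2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(join_step(frequent_2))
# prints [('a', 'b', 'c'), ('b', 'c', 'd')]
```

Here `('a', 'b')` and `('a', 'c')` share the prefix `('a',)` and are merged once, while the reversed pair is skipped because `'c' < 'b'` is false. The candidates still have to pass the subset check and the frequency count before they are accepted.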
The following is a Python implementation of the algorithm:
__author__ = 'linfuyuan'

# Items are assumed to be single characters: an itemset is stored as a
# string of its items in sorted order, e.g. 'abc'.
min_frequency = int(input('please input min_frequency:'))
file_name = input('please input the transaction file:')
transactions = []


def has_infrequent_subset(candidate, Lk):
    # Despite its name, this returns True when every (k-1)-subset of
    # candidate is in Lk, and False as soon as one subset is missing.
    for i in range(len(candidate)):
        subset = candidate[:-1]
        subset.sort()
        if ''.join(subset) not in Lk:
            return False
        # Rotate the list so the next pass drops a different item.
        lastitem = candidate.pop()
        candidate.insert(0, lastitem)
    return True


def countFrequency(candidate, transactions):
    # Number of transactions that contain every item of candidate.
    count = 0
    for transaction in transactions:
        if transaction.issuperset(candidate):
            count += 1
    return count


# Each line of the file is one transaction: comma-separated items.
with open(file_name) as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            transactions.append(set(line.split(',')))

# Count single items and keep the frequent ones as L1.
currentFrequencySet = {}
for transaction in transactions:
    for item in transaction:
        currentFrequencySet[item] = currentFrequencySet.get(item, 0) + 1
Lk = set()
for (itemset, count) in currentFrequencySet.items():
    if count >= min_frequency:
        Lk.add(itemset)
print(', '.join(sorted(Lk)))

# Grow the frequent itemsets one item at a time until none are left.
while len(Lk) > 0:
    newLk = set()
    for itemset1 in Lk:
        for itemset2 in Lk:
            # Join only if the first k-2 characters match and the last
            # character of itemset1 is smaller than that of itemset2.
            cancombine = (itemset1[:-1] == itemset2[:-1]
                          and itemset1[-1] < itemset2[-1])
            if cancombine:
                newitemset = list(itemset1) + [itemset2[-1]]
                if (has_infrequent_subset(newitemset, Lk)
                        and countFrequency(newitemset, transactions) >= min_frequency):
                    newLk.add(''.join(newitemset))
    if newLk:
        print(', '.join(sorted(newLk)))
    Lk = newLk
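Because the script above stores each itemset as a string, it only works when every item is a single character. As a hedged alternative, the sketch below keeps the same join, prune and count steps but stores itemsets as sorted tuples and takes its input from memory instead of a file. The function name `apriori`, its signature and the sample transactions are assumptions made for illustration, not the author's code.

```python
from itertools import combinations


def apriori(transactions, min_frequency):
    """Return {itemset_tuple: count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_frequency}
    frequent = dict(level)

    while level:
        next_level = {}
        for a in level:
            for b in level:
                # Join: same first k-2 items, a's last item sorts before b's.
                if a[:-1] != b[:-1] or not a[-1] < b[-1]:
                    continue
                candidate = a + (b[-1],)
                # Prune: every (k-1)-subset must already be frequent.
                if any(s not in level
                       for s in combinations(candidate, len(a))):
                    continue
                # Count the transactions that contain the candidate.
                count = sum(1 for t in transactions
                            if t.issuperset(candidate))
                if count >= min_frequency:
                    next_level[candidate] = count
        frequent.update(next_level)
        level = next_level
    return frequent


# Hypothetical sample data with multi-character item names.
sample = [
    ["beer", "bread", "milk"],
    ["beer", "bread"],
    ["beer", "milk"],
    ["bread", "milk"],
    ["beer", "bread", "milk"],
]
result = apriori(sample, 3)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

With a minimum frequency of 3, every single item and every pair in this sample is frequent, while the triple appears in only two transactions and is dropped, which ends the loop.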