[ML] Association Rules Discovery Algorithms

machine learning

by ~지우~ 2022. 12. 21. 21:08

728x90

Association Rules

- What are associatin rule?

"what goes with what"에 대한 규칙
예) "X를 산 고객들은 Y도 샀다", "어떤 증상은 어떤 진단을 받는다"

- Definition

◼ item: transaction의 항목/물품
◼ item set은 하나 이상의 항목의 집합
◼ transaction: 항목의 집합
◼ transaction dataset: 트랜잭션의 집합
◼ association rule: 항목 집합 X가 발생할 때 항목 집합 Y가 특정 확률로 발생한다는 패턴
◼ Support(X): 다음을 포함하는 transaction의 일부
◼ Confidence(X->Y): 항목 집합 X를 포함하고 항목 집합 Y를 포함하는 transaction의 일부 (P(Y | X))
◼ Lift(X->Y) : X와 Y가 함께 일어날 확률과 Y 혼자 일어날 확률의 비율 (confidence(X->Y)/support(Y))

◼ Lift(X->Y)은 Y 단독, X 공동 판매 비율이 증가하는 것을 말한다.
◼ Lift가 1이면 X와 Y의 매출은 세트된 품목 내에서 상관관계가 없다.
◼ Lift가 >1이면 항목 집합 내에 양의 상관관계가 있습니다. 즉, X와 Y가 함께 구입될 가능성이 높다.
◼ Lift가 1 미만이면 항목 집합 내에 음의 상관 관계가 있습니다. 즉, X와 Y를 함께 구입할 가능성이 낮다.
◼ Lift가 <=1인 경우 선행 및 결과와 관련된 규칙을 도출할 수 없다.
◼ Lift가 > 1인 경우 규칙은 향후 데이터 세트의 결과를 예측하는 데 잠재적으로 유용하다.

아래와 같은 transaction dataset이 있을 때,

◼ itemset {Chicken, Clothes, Milk}의 support는 [sup = 3/7]이고,
◼ {Clothes} → {Milk, Chicken}의 confidence는 [conf = 3/3], {Beef, Cheese} → {Chicken} 의 confidence는 [conf = 2/3]이다.

- Association rule mining task

transaction set T가 주어지면, association rule mining의 목표는 다음과 같은 모든 규칙을 찾는 것이다.
◼ support ≥ min-support threshold
◼ confidence ≥ min-confidence threshold

Frequent Itemset을 생성하기 위해서는 Brute force 방식을 사용할 수 있다. 가능한 모든 연결 규칙 나열하고, 각 규칙에 대한 지원 및 신뢰도 계산하 후 최소 지원 및 최소 신뢰 임계값에 실패하는 규칙을 제거하는 것이다. 이는 너무 많은 계산 시간을 요구한다. 따라서 우리는 Apriori algorithm과 FP(Freqent Pattern) growth algorithm을 주로 사용한다.

1. Apriori Algorithm

사용자 지정 지원 및 신뢰도를 충족하는 규칙을 생
◼ 1단계: 빈번하게 나타나는 항목 집합(충분한 support이 있는 항목)을 찾는다.
◼ 2단계: 충분한 confidence을 가지고 있는 항목 집합에서 규칙을 생성한다.

For k items...
◼ 최소 지원 기준(임계값)을 설정
◼ 지원 조건을 충족하는 단일 항목 집합 목록을 생성
◼ 단일 항목 집합 목록을 사용하여 지원 기준을 충족하는 두 항목 집합 목록을 생성
◼ 두 항목 집합의 목록을 사용하여 지원 기준을 충족하는 세 항목 집합의 목록을 생성
◼ k-항목 집합까지 반복

k가 증가하면 support가 감소하므로 생성할 후보 항목 집합의 수를 줄일 수 있다.

- Walkthrough example

아래와 같은 transaction dataset에서 association rule을 찾아보겠다. (support threshold=50%, confidence threshold=80%)

one-item candidates

2-item candidates

3-item candidates

confidence test를 적용하여 최종적으로 association rules set을 찾는다.

- Python code

import numpy as np
import matplotlib.pyplot as plt import pandas as pd
from apyori import apriori

# Import the data
movie_data = pd.read_csv(‘C:\Python\MarketBasket\Movie Example\movie_dataset.csv’,header = None) 
num_records = len(movie_data)
print(num_records)

# Data preprocessing
# The Apriori requires the dataset to be in the form of a list of lists.
# Currently, we have data in the form of a Pandas dataframe.
records = []
for i in range(0, num_records): 
	records.append([str(movie_data.values[I,j]) for j in range(0, 20)])

association_rules = apriori(records,min_support=0.0053,min_confidence=0.20, min_lift=3, min_length=2)

# Convert the rules found by the apriori class into a list to make it easier to view.
association_results = list(association_rules)

# Find the total number of rules mined by the apriori class.
print(len(association_results))

# Print the first item in the association_rules list to see the first rule.
print(association_results[0])

2. FP Growth Algorithm

Stage1: FP tree 구축
◼ 1단계: 데이터베이스 정리 및 정렬
◼ 2단계: FP 트리 및 헤더 테이블 구성

Stage2: FP tree 및 conditional FP tree 채굴
◼ 1단계: FP tree를 conditional FP tree로 나눈다.
◼ 2단계: 각 conditional FP tree를 재귀적으로 마이닝한다.

- Advantages of FP-Growth

◼ 분할 및 정복 접근법
◼ 불필요한 후보 생성 및 테스트 회피
◼ 데이터베이스가 FP 트리/조건부 FP 트리로 압축됨

- Apriori vs FP-Growth

Apriori
◼ 후보 품목 세트 생성 및 테스트 필요
◼ support 임계값이 매우 낮은 경우 엄청난 수의 후보 항목 집합을 생성
◼ transaction 데이터베이스를 여러 번 검색
FP
◼ 로컬 자주 사용하는 항목만 사용하여 짧은 패턴에서 긴 패턴을 확장하여 후보 항목 집합의 명시적 생성을 방지
◼ 트랜잭션 데이터베이스를 두 번 검사

FP-Growth의 장점은 minimum support가 낮을때 나타난다.(1.5% 미만)

728x90

저작자표시 비영리 변경금지 (새창열림)

'machine learning' 카테고리의 다른 글

[ML] Reinforcement Learning (2)	2022.12.23
[ML] Recommender Systems (0)	2022.12.21
[ML] Bayesian Classifier and Markov Model (0)	2022.12.20
[ML] Cost Function Optimization – Gradient Descent (0)	2022.12.17
[ML] Cost Function Optimization - Regularization (0)	2022.12.17

쥬코딩

고정 헤더 영역

메뉴 레이어

메뉴 리스트

검색 레이어

검색 영역

상세 컨텐츠

본문 제목

본문

Association Rules

1. Apriori Algorithm

2. FP Growth Algorithm

'machine learning' 카테고리의 다른 글

관련글 더보기

댓글 영역

추가 정보

티스토리툴바