[Python] Easy Topic Modeling with tomotopy


by ∫2tdt=t²+c 2019. 5. 22. 17:06

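As I mentioned in the previous post, I released the Python package version of tomoto, my topic modeling tool, a few days ago. In this post I'd like to share example code that uses it to run topic modeling easily from Python.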

Step 1. Install the tomotopy package

Install tomotopy by entering the following at the command line or in a terminal. (If Python is not installed yet, you will need to install it first. I recommend version 3.5 or later.)

$ pip install --upgrade tomotopy

Step 2. Write the topic modeling code

It's simpler than you might expect. Let's get straight to it.

import tomotopy as tp  # first, import the module

# Create an LDAModel with 20 topics (k), alpha = 0.1 and eta = 0.01.
# Words that appear fewer than 5 times in the whole corpus are removed.
model = tp.LDAModel(k=20, alpha=0.1, eta=0.01, min_cf=5)

# Read input_file.txt line by line and add each line to the model.
with open('input_file.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Split on whitespace and add the words to the model as one document.
        model.add_doc(line.strip().split())
        if i % 10 == 0:
            print('Document #{} has been loaded'.format(i))

# Values such as num_words and num_vocabs are only fixed once training starts,
# so call train(0) to prepare the model without actually training it.
# You can skip this part if you don't care about num_words and num_vocabs.
model.train(0)
print('Total docs:', len(model.docs))
print('Total words:', model.num_words)
print('Vocab size:', model.num_vocabs)

# Train for 200 iterations in total, printing the log-likelihood at each step.
# Alternatively, model.train(200) runs all 200 iterations in a single call.
for i in range(200):
    print('Iteration {}\tLL per word: {}'.format(i, model.ll_per_word))
    model.train(1)

# Print the learned topics.
for i in range(model.k):
    # There are 20 topics, so print the top 10 words of each topic from 0 to 19.
    res = model.get_topic_words(i, top_n=10)
    print('Topic #{}'.format(i), end='\t')
    print(', '.join(w for w, p in res))

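Easy, right? Just be careful about the shape of the input file, input_file.txt. The code assumes that each line of the file holds one document and that the words are separated by whitespace. If that is not the case, the file will not be processed correctly.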
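Before handing a file to the model, it can help to check that it really has this shape. The following is a minimal sketch using only the standard library. The helper names load_docs and corpus_stats and the three-line sample are made up for illustration, and the counts are a rough stand-in for the "Total docs / Total words / Vocab size" values, not a reproduction of how tomotopy computes them.

```python
import os
import tempfile
from collections import Counter


def load_docs(path):
    """Read one document per line and split each line on whitespace."""
    with open(path, encoding='utf-8') as f:
        return [line.strip().split() for line in f]


def corpus_stats(docs, min_cf=1):
    """Return (documents, tokens, vocabulary size), keeping words seen at least min_cf times."""
    cf = Counter(w for doc in docs for w in doc)  # collection frequency of each word
    kept = {w for w, c in cf.items() if c >= min_cf}
    num_words = sum(cf[w] for w in kept)
    return len(docs), num_words, len(kept)


# Hypothetical sample: three documents, one per line, words separated by spaces.
sample = "great hotel great view\nsmall room clean room\ngreat room\n"
path = os.path.join(tempfile.mkdtemp(), 'input_file.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

docs = load_docs(path)
print(docs[0])                       # ['great', 'hotel', 'great', 'view']
print(corpus_stats(docs))            # (3, 10, 6)
print(corpus_stats(docs, min_cf=2))  # (3, 6, 2)
```

With min_cf=2 only "great" and "room" survive, which shows how quickly a frequency cutoff shrinks the vocabulary of a small corpus.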

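Preprocessing is the most important part of natural language processing, yet this code has no preprocessing step. If the input file has not been preprocessed, the results will not look good, so let's add preprocessing code using nltk.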

Step 3. Add preprocessing code (English)

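import re
import nltk.stem, nltk.corpus, nltk.tokenize

# The stopword list and word_tokenize rely on NLTK data
# that has to be downloaded beforehand with nltk.download().
stemmer = nltk.stem.porter.PorterStemmer()  # Porter stemmer: reduces English words to their stems
stopwords = set(nltk.corpus.stopwords.words('english'))  # set of English stopwords

# Regular expression for an ordinary word, used to filter out special characters.
# Only tokens that start with a letter, followed by letters, digits, '-', '_' or '.',
# are treated as words.
rgxWord = re.compile('[a-zA-Z][-_a-zA-Z0-9.]*')

# Define the tokenize function. Given a sentence, it splits it into words,
# removes stopwords, special characters and the like, and returns the stems as a list.
def tokenize(sent):
    def stem(w):
        try:
            return stemmer.stem(w)
        except Exception:
            return w  # fall back to the original word if stemming fails
    return [stem(w) for w in nltk.tokenize.word_tokenize(sent.lower())
            if w not in stopwords and rgxWord.match(w)]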
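To see what the rgxWord filter above lets through, here is a small check that needs only the standard library. The candidate tokens are hand-picked examples, not output from nltk.

```python
import re

# Same pattern as in the tokenize function above.
rgxWord = re.compile('[a-zA-Z][-_a-zA-Z0-9.]*')

# Hand-picked candidate tokens, for illustration only.
candidates = ['hotel', 'wi-fi', 'u.s.', '24h', '...', "'s", 'room!']

# match() only anchors at the start of the token, as in the original code.
kept = [w for w in candidates if rgxWord.match(w)]
print(kept)    # ['hotel', 'wi-fi', 'u.s.', 'room!']

# fullmatch() would require the whole token to fit the pattern.
strict = [w for w in candidates if rgxWord.fullmatch(w)]
print(strict)  # ['hotel', 'wi-fi', 'u.s.']
```

Because match() only checks the beginning of a token, something like 'room!' still passes. Whether that matters depends on what the tokenizer emits; if stray punctuation shows up in your topics, fullmatch() is one option to consider.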

With the preprocessing code in place, we can feed cleaner data into the topic model.

model = tp.LDAModel(k=20, alpha=0.1, eta=0.01, min_cf=5)

with open('input_file.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Pass the output of the tokenize function to add_doc.
        model.add_doc(tokenize(line))
        if i % 10 == 0:
            print('Document #{} has been loaded'.format(i))

model.train(0)
print('Total docs:', len(model.docs))
print('Total words:', model.num_words)
print('Vocab size:', model.num_vocabs)

for i in range(200):
    print('Iteration {}\tLL per word: {}'.format(i, model.ll_per_word))
    model.train(1)

for i in range(model.k):
    res = model.get_topic_words(i, top_n=10)
    print('Topic #{}'.format(i), end='\t')
    print(', '.join(w for w, p in res))

Running it should give you results like the following. The data I used was a set of hotel reviews.

Topic #0 great, hotel, view, stay, nice, beach, locat, street, room, plaza

Topic #1 locat, hotel, great, walk, good, room, staff, close, time, distanc

Topic #2 park, hotel, not, charg, free, internet, no, room, day, fee

Topic #3 room, nice, bed, small, clean, comfort, hotel, good, bathroom, littl

Topic #4 servic, hotel, stay, not, time, wait, minut, food, check, get

Topic #5 hotel, not, check, room, book, charg, told, stay, would, us

Topic #6 room, smell, smoke, not, dirti, like, clean, carpet, bed, stay

Topic #7 good, need, hotel, room, clean, locat, staff, price, updat, nice

Topic #8 hotel, good, close, airport, nice, shuttl, stay, clean, place, breakfast

Topic #9 breakfast, no, room, coffe, not, machin, clean, ice, good, fridg

Topic #10 stay, great, staff, hotel, nice, room, friendli, clean, good, again

Topic #11 room, not, air, hotel, no, condit, night, work, sleep, ac

Topic #12 room, night, door, nois, hotel, could, next, floor, peopl, hear

Topic #13 desk, staff, front, not, hotel, servic, help, friendli, no, check

Topic #14 stay, not, hotel, place, would, get, price, again, night, look

Topic #15 room, desk, front, call, get, us, not, check, back, day

Topic #16 bathroom, shower, room, not, door, toilet, no, need, water, wall

Topic #17 pool, water, not, hot, hotel, no, shower, nice, tub, kid

Topic #18 hotel, not, stay, would, renov, inn, star, disappoint, expect, year

Topic #19 room, bed, not, hotel, book, two, one, us, doubl, king

Step 3b. Add preprocessing code (Korean)

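Preprocessing Korean is trickier than preprocessing English. Particles and endings attach to nouns, adjectives, and verbs in complex ways, and spaces do not always fall on word boundaries. For this reason, Korean text analysis needs a morphological analyzer. Fortunately, several Korean morphological analyzers have already been developed and are easy to use from Python, so Korean analysis can be done without much trouble. Here I'll use the Python version of kiwi, which happens to have been developed by "my past self". (kiwi is also easy to install: pip install --upgrade kiwipiepy)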

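from kiwipiepy import Kiwi

kiwi = Kiwi()
kiwi.prepare()  # required by the kiwipiepy version used for this post; newer releases may not need it

# Define the tokenize function. Given a Korean sentence, it splits it into morphemes,
# removes stopwords, special characters and the like, and returns a list.
def tokenize(sent):
    res, score = kiwi.analyze(sent)[0]  # use the first analysis result
    return [word + ('다' if tag.startswith('V') else '')  # append '다' to verbs
            for word, tag, _, _ in res
            # drop particles (J), endings (E) and symbols (S)
            if not tag.startswith('E') and not tag.startswith('J') and not tag.startswith('S')]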

As before, just pass tokenize(line) to add_doc.

(Addendum) If you want to specify stopwords to remove, you can add one more condition, as shown below.

from kiwipiepy import Kiwi

kiwi = Kiwi()
kiwi.prepare()  # required by the kiwipiepy version used for this post; newer releases may not need it
stopwords = set(["사람", "것"])

# Define the tokenize function. Given a Korean sentence, it splits it into morphemes,
# removes stopwords, special characters and the like, and returns a list.
def tokenize(sent):
    res, score = kiwi.analyze(sent)[0]  # use the first analysis result
    return [word + ('다' if tag.startswith('V') else '')  # append '다' to verbs
            for word, tag, _, _ in res
            # drop particles (J), endings (E), symbols (S) and any word in stopwords
            if not tag.startswith('E') and not tag.startswith('J') and not tag.startswith('S')
            and word not in stopwords]
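The tag-based filtering can be tried out without installing the analyzer. In the sketch below, the (form, tag, start, length) tuples are written by hand for the sentence "사람이 호텔에 갔다." to mimic the tuple shape the code above unpacks. They are not actual Kiwi output, and filter_morphemes is a hypothetical helper name.

```python
def filter_morphemes(morphemes, stopwords=frozenset()):
    """Keep content morphemes, append '다' to verb-like tags, drop stopwords."""
    return [form + ('다' if tag.startswith('V') else '')
            for form, tag, _, _ in morphemes
            if not tag.startswith(('E', 'J', 'S'))  # endings, particles, symbols
            and form not in stopwords]


# Hand-written sample in (form, tag, start, length) shape; NOT real analyzer output.
sample = [
    ('사람', 'NNG', 0, 2),  # noun
    ('이', 'JKS', 2, 1),    # subject particle
    ('호텔', 'NNG', 4, 2),  # noun
    ('에', 'JKB', 6, 1),    # adverbial particle
    ('가', 'VV', 8, 1),     # verb stem
    ('았', 'EP', 8, 1),     # pre-final ending
    ('다', 'EF', 9, 1),     # final ending
    ('.', 'SF', 10, 1),     # sentence-final punctuation
]

print(filter_morphemes(sample))                  # ['사람', '호텔', '가다']
print(filter_morphemes(sample, {'사람', '것'}))  # ['호텔', '가다']
```

Note that the stopword check runs on the raw morpheme form, before '다' is appended, so a verb would have to be listed by its stem to be filtered out.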

Step 4. When removing stopwords is difficult, change the term weighting

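Even after all this, you may be unhappy with the results because the top words of each topic are too general or too frequent. In that case, you can try to improve the results by changing the term weighting scheme, which can be set when the model is created.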

model = tp.LDAModel(k=20, alpha=0.1, eta=0.01, min_cf=5, tw=tp.TermWeight.IDF)
# tw accepts one of the members of tp.TermWeight.
# The default is tp.TermWeight.ONE, which treats every word equally.
# The other options are tp.TermWeight.IDF and tp.TermWeight.PMI.

I covered term weighting schemes in an earlier post, so please refer to it (https://bab2min.tistory.com/605). Applying the IDF weighting scheme to the hotel review data from Step 3 gives the following results.

Topic #0 coffe, breakfast, no, cereal, tv, fridg, egg, juic, shower, cup

Topic #1 need, bathroom, carpet, dirti, shower, old, wall, not, stain, smell

Topic #2 plaza, street, fremont, great, vega, casino, downtown, locat, nice, la

Topic #3 air, water, hot, shower, work, ac, no, heat, condition, condit

Topic #4 beach, great, love, view, pool, nice, beauti, food, restaur, kid

Topic #5 dirti, bed, toilet, sheet, bathroom, bug, towel, stain, hair, carpet

Topic #6 bed, doubl, king, queen, size, room, two, book, us, park

Topic #7 locat, squar, walk, central, great, hotel, good, london, station, excel

Topic #8 peopl, not, get, back, never, go, look, like, around, say

Topic #9 breakfast, good, restaur, great, conveni, food, locat, nice, hotel, comfort

Topic #10 good, need, clean, staff, friendli, great, nice, price, stay, comfort

Topic #11 check, desk, told, call, ask, us, front, said, not, get

Topic #12 nois, hear, loud, thin, wall, door, noisi, sleep, could, night

Topic #13 smoke, smell, cigarett, non-smok, non, like, room, dog, stale, odor

Topic #14 desk, front, call, us, check, told, manag, phone, not, fix

Topic #15 star, price, not, hotel, rate, qualiti, expect, it, stay, better

Topic #16 charg, fee, park, internet, per, pay, extra, day, free, hilton

Topic #17 pool, ice, machin, floor, hot, elev, door, no, water, swim

Topic #18 ever, iron, worst, i, pillow, stay, servic, not, bed, time

Topic #19 shuttl, airport, park, close, flight, walk, driver, taxi, ride, bu

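The input data is the same, yet the results are completely different. The difference arises because frequent words such as hotel and room are given lower weights, while rarely occurring words are given higher weights.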
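To get a feel for why frequent words lose influence, here is a toy calculation with the textbook formula idf(w) = log(N / df(w)). This is only a sketch: the four mini-documents are made up, and tomotopy's internal weighting may differ in its details.

```python
import math
from collections import Counter

# Made-up mini corpus: 'hotel' is in every document, 'room' in most, the rest are rare.
docs = [
    ['hotel', 'room', 'pool'],
    ['hotel', 'room', 'breakfast'],
    ['hotel', 'shuttl'],
    ['hotel', 'room', 'smoke'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in docs for w in set(doc))
# Textbook inverse document frequency (an assumption, not tomotopy's exact formula).
idf = {w: math.log(n_docs / c) for w, c in df.items()}

for w, weight in sorted(idf.items(), key=lambda kv: (kv[1], kv[0])):
    print('{}\t{:.3f}'.format(w, weight))
```

Here hotel gets weight 0 because it appears in every document, room gets a small weight, and the words that appear in a single document get the largest one.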

While we're at it, let's also look at the results with PMI weighting.

Topic #0 view, beach, walk, locat, nice, great, close, restaur, hotel, disneyland

Topic #1 good, locat, breakfast, clean, staff, price, nice, great, hotel, valu

Topic #2 pool, hot, water, cold, no, heat, swim, tub, indoor, con

Topic #3 great, nice, staff, stay, friendli, hotel, famili, good, comfort, clean

Topic #4 plaza, great, casino, fremont, vega, downtown, nice, hotel, street, stay

Topic #5 shower, water, toilet, work, door, bathroom, no, broken, hot, not

Topic #6 food, restaur, order, bar, breakfast, buffet, dinner, servic, tabl, coffe

Topic #7 check, desk, front, rude, us, card, manag, servic, ask, arriv

Topic #8 check, desk, call, front, room, us, told, back, get, went

Topic #9 internet, wifi, breakfast, free, no, access, servic, connect, slow, not

Topic #10 dog, pet, outsid, sign, secur, keep, tabl, light, open, lot

Topic #11 nois, loud, sleep, noisi, night, hear, next, room, air, sound

Topic #12 smell, smoke, room, dirti, carpet, stain, sheet, like, bathroom, cigarett

Topic #13 stay, not, place, money, noth, i, hotel, would, pictur, do

Topic #14 need, old, room, bed, not, bathroom, dirti, updat, air, carpet

Topic #15 locat, airport, hotel, shuttl, walk, great, squar, good, central, conveni

Topic #16 bed, doubl, king, size, room, queen, two, book, small, suit

Topic #17 need, inn, hotel, motel, place, price, stay, ok, holiday, night

Topic #18 no, coffe, fridg, microwav, iron, towel, tv, refriger, maker, soap

Topic #19 park, charg, fee, resort, per, extra, pay, hotel, cost, car
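For intuition about PMI-based weighting, the sketch below weights a word within a document by max(0, log(p(w|d) / p(w))), that is, by how much more often the word occurs in that document than in the corpus as a whole. Both the formula and the two mini-documents are assumptions made for illustration; please check the linked post or the tomotopy documentation for the exact definition the library uses.

```python
import math
from collections import Counter

# Made-up mini corpus for illustration.
docs = [
    ['hotel', 'room', 'pool', 'pool'],
    ['hotel', 'room', 'hotel', 'room'],
]

total = sum(len(d) for d in docs)          # total number of tokens in the corpus
cf = Counter(w for d in docs for w in d)   # corpus-wide count of each word


def pmi_weights(doc):
    """Assumed weighting: max(0, log(p(w|d) / p(w))) for each word in doc."""
    tf = Counter(doc)
    return {w: max(0.0, math.log((tf[w] / len(doc)) / (cf[w] / total)))
            for w in tf}


weights = pmi_weights(docs[0])
for w, weight in sorted(weights.items()):
    print('{}\t{:.3f}'.format(w, weight))
```

In the first document, pool occurs twice as often as its corpus-wide rate and gets weight log 2, while hotel and room are less frequent there than in the corpus overall and are clipped to 0.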

Term weighting schemes can be applied not only to LDAModel but also to all the other topic models that tomotopy provides (DMR, HDP, MGLDA, PA, HPA).

If you want finer control over topic modeling, take a look at the tomotopy Korean API documentation.

That covers how to do topic modeling easily with tomotopy. Next time I'll share how to use tomotopy with a wider variety of data and models.

License: Attribution-NonCommercial-ShareAlike
