
[ν…μŠ€νŠΈλ§ˆμ΄λ‹] 1. ν…μŠ€νŠΈ 뢄석

μž₯μ˜μ€€ 2023. 6. 17. 09:55

μ΄λ²ˆμ— NLP 쀑 ν…μŠ€νŠΈλ§ˆμ΄λ‹μ— κ΄€ν•œ μžλ£Œλ“€μ„ κ°€μ§€κ³  μŠ€ν„°λ””λ₯Ό ν•΄λ³΄κ²Œ λ˜μ—ˆλ‹€.

첫 μ£Όμ°¨μ—λŠ” κ°„λ‹¨ν•˜κ²Œ ν…μŠ€νŠΈ 뢄석에 κ΄€ν•΄ μ•Œμ•„λ³΄κ³ , μ½”λ“œλ₯Ό μž‘μ„±ν•΄λ³΄μ•˜λ‹€. 

μžμ„Έν•œ μ½”λ“œλ“€μ€ κΉƒν—ˆλΈŒ μ°Έκ³ ν•˜κΈΈ λ°”λž€λ‹€.


1. ν…μŠ€νŠΈλ§ˆμ΄λ‹, ν…μŠ€νŠΈ 뢄석, μžμ—°μ–΄ 처리의 차이

μ‹œμž‘ν•˜κΈ°μ— μ•žμ„œ μœ„ μ„Έ μš©μ–΄λ“€μ˜ 차이λ₯Ό μ•Œμ•„λ³΄κ³  μ‹œμž‘ν•˜μž.

  • ν…μŠ€νŠΈλ§ˆμ΄λ‹: ν…μŠ€νŠΈ 데이터λ₯Ό ν™œμš©ν•œ λͺ¨λ“  μž‘μ—…
  • ν…μŠ€νŠΈ 뢄석: 쒁은 의미의 ν…μŠ€νŠΈ(λ¬Έμ„œ)의 νŠΉμ„± νŒŒμ•…ν•˜λŠ” 것
  • μžμ—°μ–΄ 처리: ν…μŠ€νŠΈλ§ˆμ΄λ‹μ„ μœ„ν•œ ν…μŠ€νŠΈ 처리 μž‘μ—…

 

2. ν…μŠ€νŠΈ λΆ„μ„μ˜ μ’…λ₯˜

ν…μŠ€νŠΈ λΆ„μ„μ˜ μ’…λ₯˜λŠ” λ‹€μŒκ³Ό κ°™λ‹€:

  • ν…μŠ€νŠΈ 선별: μ›ν•˜λŠ” 정보λ₯Ό κ°€μ§„ ν…μŠ€νŠΈλ§Œ μΆ”μΆœ
  • ν…μŠ€νŠΈ 정보 μΆ”μΆœ: ν•œ ν…μŠ€νŠΈ λ‚΄μ—μ„œ μ›ν•˜λŠ” 정보λ₯Ό μΆ”μΆœ
  • ν…μŠ€νŠΈ 주제 μ°ΎκΈ°: λΉˆλ„ 뢄석, ν† ν”½ λͺ¨λΈλ§ 등을 ν™œμš©
  • ν…μŠ€νŠΈ λΆ„λ₯˜: Logistic Regression, Deep Learning... 등에 μ‚¬μš©
  • ν…μŠ€νŠΈ λ…Όμ‘° 및 관점:  κ°μ„±λΆ„석, 의미 연결망(Semantic Network) 뢄석
  • ν…μŠ€νŠΈ νŠΉμ„± νŒŒμ•…:  Word Embedding

 

3. ν…μŠ€νŠΈ 뢄석 κ³Όμ •

ν…μŠ€νŠΈ 뢄석 과정은 μˆ˜μ§‘ -> μ „μ²˜λ¦¬ -> 뢄석 -> 평가 순으둜 이루어진닀.

  1. μˆ˜μ§‘: 크둀링, μŠ€ν¬λž˜ν•‘ 등을 ν†΅ν•œ 데이터 μˆ˜μ§‘
  2. μ „μ²˜λ¦¬
    • Cleaning: λΆˆν•„μš”ν•œ 기호 제거
    • Case Conversion: λŒ€μ†Œλ¬Έμž λ³€ν™˜
    • Lemmatizing, Stemming: λ‹¨μ–΄μ˜ μ›ν˜• λ˜λŠ” μ–΄κ°„ μ°ΎκΈ°
      • has -> have / watched -> watch / flies -> fly
      • 예쁜 -> μ˜ˆμ˜λ‹€
    • Text Tokenizing: 단어 λ˜λŠ” 토큰 λ‹¨μœ„λ‘œ 잘라주기
    • Tagging: 단어 ν’ˆμ‚¬ νƒœκ·Έν•˜κΈ°
    • Removing Stopwords: λΆˆμš©μ–΄(Stopword) μ œκ±°ν•˜κΈ°
  3. 뢄석
  4. 평가 
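To make the preprocessing steps concrete before moving on, here is a minimal sketch of the same pipeline on an English sentence, assuming NLTK is installed; the practice code below uses KoNLPy's Okt for Korean instead.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ['punkt', 'averaged_perceptron_tagger', 'stopwords', 'wordnet']:
    nltk.download(pkg, quiet=True)

text = 'He has watched the flies, and THEY flew away!'
cleaned = re.sub(r'[^\s\w]', ' ', text)   # Cleaning: strip symbols
lowered = cleaned.lower()                 # Case conversion
tokens = word_tokenize(lowered)           # Tokenizing
tagged = nltk.pos_tag(tokens)             # POS tagging
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]            # Lemmatizing (as verbs)
filtered = [t for t in lemmas if t not in stopwords.words('english')]  # Removing stopwords
print(tagged)
print(filtered)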

 

4. Practice Code

1. Text Preprocessing

Using the concepts above, I preprocessed the text and built a word cloud.

For the data, I crawled articles from the IT magazine μš”μ¦˜ IT (Yozm IT).

import pandas as pd

path = '/content/drive/MyDrive/text-mining/μš”μ¦˜IT_2023.04.27.csv'
df = pd.read_csv(path); df

Running this displayed the loaded DataFrame (the screenshot is omitted here).

# Count missing values per column
df.isnull().sum()
# Inspect the rows with missing data
df[df['λΆ„λ₯˜'].isnull()]
# Drop the missing rows and reset the index
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)

I then ran the code above to check the data's missing values and removed all of the affected rows.

import re

# Text Cleaning
content_list = []

for k in range(len(df['λ³Έλ¬Έ'])):
    content = df['λ³Έλ¬Έ'][k]
    cleaned_content = re.sub(r'[^\s\w]', ' ', content)  # cleaning: replace every non-word, non-space character
    content_list.append(cleaned_content)

# Store the cleaning results
df['λ³Έλ¬Έ_μ „μ²˜λ¦¬'] = content_list; df

Next, all of the article bodies were cleaned with the re (regular expression) module and saved to a new column of df.

The output looked like the example above (screenshot omitted).

2. ν…μŠ€νŠΈ 토큰화

# Using Okt from KoNLPy
from konlpy.tag import Okt
okt = Okt()

text = df['λ³Έλ¬Έ_μ „μ²˜λ¦¬'][0]

word_list = okt.morphs(text)  # morphs returns every token, without POS information
stem_word_list = okt.morphs(text, stem=True)  # stem=True converts each token to its base form

print(word_list, stem_word_list, sep='\n')

Then, using Okt as above, I extracted both the plain tokens (no POS information) and the tokens converted to their base forms. The results were as follows:

word_list (without base-form conversion)
stem_word_list (with base-form conversion)

# Keep only nouns, verbs, and adjectives
word_list = okt.pos(text, stem=True)
pos_list = ['Noun', 'Verb', 'Adjective']

[word for word in word_list if word[1] in pos_list]

λ‹€μŒμ€ list comprehension κΈ°λ²•μœΌλ‘œ pos_list 내에 μ‘΄μž¬ν•˜λŠ” ν’ˆμ‚¬λ“€λ§ŒμœΌλ‘œ ν•„ν„°λ§ν•΄λ³΄μ•˜λ‹€. 

이후, μ „μ²˜λ¦¬ν–ˆλ˜ 본문듀에 λŒ€ν•΄ μŠ€ν…Œλ° μž‘μ—… ν›„ ν’ˆμ‚¬λ₯Ό νƒœκΉ…ν•˜κ³ , ν’ˆμ‚¬ ν•„ν„°λ₯Ό μ μš©ν•˜μ—¬ nounκ³Ό alpha ν’ˆμ‚¬λ§Œ κ³¨λΌλ‚΄λ³΄μ•˜λ‹€.

ν•΄λ‹Ή 단어듀을 word_list λΌλŠ” ν•˜λ‚˜μ˜ 리슀트둜 μ €μž₯ν–ˆλ‹€.

# Filter words by part of speech
def pos_filtering(word_list):
    pos_list = ['Noun', 'Alpha']
    pos_filtered_word_list = [word[0] for word in word_list if word[1] in pos_list]

    return pos_filtered_word_list

# Stemming + POS tagging
df['λ³Έλ¬Έ_POS'] = df['λ³Έλ¬Έ_μ „μ²˜λ¦¬'].map(lambda x: okt.pos(x, stem=True))

# Apply the POS filter
df['λ³Έλ¬Έ_단어'] = df['λ³Έλ¬Έ_POS'].map(pos_filtering)

# Merge the per-article word lists into one list
word_list = sum(df['λ³Έλ¬Έ_단어'], [])

word_list
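As an aside, sum(..., []) copies the accumulated list at every step, which gets slow at this scale; itertools.chain is an equivalent, linear-time alternative (my suggestion, not part of the original study code).

# Equivalent merge in linear time (suggested alternative)
from itertools import chain

word_list = list(chain.from_iterable(df['λ³Έλ¬Έ_단어']))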

μ΅œμ’…μ μœΌλ‘œ, 총 1261028개의 단어가 μžˆμ—ˆλ‹€.

3. λΉˆλ„λΆ„μ„

λ§ˆμ§€λ§‰μœΌλ‘œλŠ” λ‹¨μ–΄μ˜ λΉˆλ„λ₯Ό λΆ„μ„ν•˜κΈ° μœ„ν•΄ CounterVectorizer 라이브러리λ₯Ό μ‚¬μš©ν–ˆλ‹€.

# λΉˆλ„λΆ„μ„ 
c = Counter(word_list) 
num = 100 

# μƒμœ„ 100개 λ‹¨μ–΄λ§Œ 좜λ ₯ 
print(c.most_common(num))

κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μ•˜λ‹€.

λ„ˆλ¬΄ 많이 λ“±μž₯ν•˜λŠ” λ¬΄μ˜λ―Έν•œ 단어듀을 λΆˆμš©μ–΄λ‘œ μ„€μ •ν•˜κ³ , λ‹€μ‹œ λΉˆλ„λΆ„μ„μ„ ν•˜μ—¬ ν•΄λ‹Ή μ›Œλ“œλ“€λ‘œ μ›Œλ“œν΄λΌμš°λ“œλ₯Ό μƒμ„±ν–ˆλ‹€.

# Stopwords
stopwords = ['수', '것', '이', 'λ•Œ', 'λ“±', '더', 'λ₯Ό', 'κ·Έ', 'μœ„', '경우', '톡해', 'μœ„ν•΄', '일', 'λ‹€λ₯Έ', 'κ°€μ§€', 'λŒ€ν•œ', '의', 'λŒ€ν•΄', '쀑', 'λ‚΄', 'λ•Œλ¬Έ']

# Remove the stopwords, then rerun the frequency analysis
word_list = [word for word in word_list if word not in stopwords]  # word_list holds plain strings, so compare whole words
c = Counter(word_list)  # recount after removing the stopwords
print(c.most_common(num))
!pip install wordcloud
from wordcloud import WordCloud

word_dict = dict(c.most_common(100))  # WordCloud expects a {word: frequency} dictionary

# Configure the word cloud (a Korean font such as NanumGothic is required for Hangul)
wc = WordCloud(font_path='NanumGothic.ttf',
               background_color='white',
               width=3000, height=2000,
               min_font_size=10)

cloud = wc.generate_from_frequencies(word_dict)  # takes the frequency dictionary
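The generated cloud can then be rendered as an image; a common way to display it, assuming matplotlib is available in the notebook, is:

# Render the word cloud (assumes matplotlib)
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()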

μ΅œμ’…μ μΈ κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μ•˜λ‹€.


μ΄λ ‡κ²Œ, ν…μŠ€νŠΈλ§ˆμ΄λ‹μ˜ μ „λ°˜μ μΈ κ°œλ…μ— κ΄€ν•΄ μ•Œμ•„λ³΄κ³ , 'μš”μ¦˜ IT'λΌλŠ” λ§€κ±°μ§„μ—μ„œ ν¬λ‘€λ§ν•œ λ°μ΄ν„°λ‘œ ν…μŠ€νŠΈ μ „μ²˜λ¦¬, 토큰화, λΉˆλ„λΆ„μ„ν•˜μ—¬ μ›Œλ“œν΄λΌμš°λ“œκΉŒμ§€ λ§Œλ“€μ–΄ λ³΄μ•˜λ‹€.