📚 Study/NLP

[Text Mining] 1. Text Analysis

장영준 2023. 6. 17. 09:55

This time, I got to join a study group working through materials on text mining, one area of NLP.

In the first week, we took a quick look at text analysis and wrote some code.

์ž์„ธํ•œ ์ฝ”๋“œ๋“ค์€ ๊นƒํ—ˆ๋ธŒ ์ฐธ๊ณ ํ•˜๊ธธ ๋ฐ”๋ž€๋‹ค.


1. The Difference Between Text Mining, Text Analysis, and Natural Language Processing

Before starting, let's clarify how these three terms differ.

  • Text mining: any work that makes use of text data
  • Text analysis: in a narrower sense, identifying the characteristics of a text (document)
  • Natural language processing: the text-processing work needed for text mining

 

2. Types of Text Analysis

The main types of text analysis are as follows:

  • Text selection: extracting only the texts that contain the desired information
  • Information extraction: pulling the desired information out of a single text
  • Topic discovery: using frequency analysis, topic modeling, and so on
  • Text classification: using models such as Logistic Regression or Deep Learning (a toy sketch follows this list)
  • Tone and viewpoint: sentiment analysis, semantic network analysis
  • Characterizing a text: Word Embedding
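As a quick illustration of the classification case, a minimal scikit-learn sketch might look like the following. The example sentences and labels are made up purely for illustration; this is not the data used in this post.

# A toy text-classification sketch with scikit-learn (hypothetical example data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ['배터리 성능이 크게 개선되었다', '새 프레임워크가 공개되었다',
        '보안 취약점이 발견되었다', '오픈소스 라이선스가 변경되었다']
labels = ['hardware', 'software', 'security', 'software']   # made-up labels

vec = CountVectorizer()                     # bag-of-words features
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)   # logistic regression classifier

print(clf.predict(vec.transform(['새로운 배터리 기술이 개선되었다'])))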

 

3. The Text Analysis Process

The text analysis process runs in the order: collection -> preprocessing -> analysis -> evaluation.

  1. Collection: gathering data through crawling, scraping, etc.
  2. Preprocessing (a small Python sketch of these steps follows this list)
    • Cleaning: removing unnecessary symbols
    • Case Conversion: converting between upper and lower case
    • Lemmatizing, Stemming: finding a word's base form or stem
      • has -> have / watched -> watch / flies -> fly
      • Korean example: 예쁜 -> 예쁘다
    • Text Tokenizing: splitting the text into word or token units
    • Tagging: tagging each word with its part of speech
    • Removing Stopwords: removing stopwords
  3. Analysis
  4. Evaluation

 

4. Practice Code

1. Text Preprocessing

Using the concepts above, I preprocessed the text and generated a word cloud.

๋ฐ์ดํ„ฐ๋กœ๋Š” it ๋งค๊ฑฐ์ง„์ธ ์š”์ฆ˜ IT๋ผ๋Š” ๋งค๊ฑฐ์ง„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํฌ๋กค๋งํ•˜์—ฌ ์‚ฌ์šฉํ–ˆ๋‹ค.

import pandas as pd

path = '/content/drive/MyDrive/text-mining/요즘IT_2023.04.27.csv'
df = pd.read_csv(path); df

Running this code displayed the loaded DataFrame.

# Count the missing values
df.isnull().sum()
# Inspect the rows with missing values
df[df['분류'].isnull()]
# Drop the missing rows
df.dropna(inplace = True)
df.reset_index(inplace = True, drop = True)

Running the code above, I checked for missing values in the data and removed all of the rows that had them.

# Text Cleaning
import re

content_list = []

for k in range(len(df['본문'])):
    content = df['본문'][k]
    cleaned_content = re.sub(r'[^\s\w]', ' ', content)   # cleaning: replace symbols with spaces
    content_list.append(cleaned_content)

# Save the cleaning results to a new column
df['본문_전처리'] = content_list; df

After that, all of the article bodies were cleaned with re (regular expressions) and stored in a new column of df.


2. Text Tokenization

# Using Okt from KoNLPy
from konlpy.tag import Okt

okt = Okt()
text = df['본문_전처리'][0]

word_list = okt.morphs(text)                     # morphs returns every token, without POS tags
stem_word_list = okt.morphs(text, stem = True)   # stem=True returns each token in its base form

print(word_list, stem_word_list, sep = '\n')

Using Okt as above, I extracted both the tokens split without any POS information and the tokens converted to their base forms. The results were as follows.

word_list (surface forms, not stemmed)
stem_word_list (tokens stemmed to their base forms)
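Since the raw outputs are too long to reproduce here, a small self-contained check of the difference between the two calls might look like this; the sample sentence is made up, and the exact token boundaries depend on the Okt version.

from konlpy.tag import Okt

okt = Okt()
sample = '예쁜 꽃들이 피었다'

print(okt.morphs(sample))               # surface forms as they appear in the text
print(okt.morphs(sample, stem = True))  # base forms, e.g. 예쁜 -> 예쁘다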

word_list = okt.pos(text, stem = True)       # (token, POS tag) pairs
pos_list = ['Noun', 'Verb', 'Adjective']

[word for word in word_list if word[1] in pos_list]  # keep only the listed POS tags

๋‹ค์Œ์€ list comprehension ๊ธฐ๋ฒ•์œผ๋กœ pos_list ๋‚ด์— ์กด์žฌํ•˜๋Š” ํ’ˆ์‚ฌ๋“ค๋งŒ์œผ๋กœ ํ•„ํ„ฐ๋งํ•ด๋ณด์•˜๋‹ค. 

Then, for all of the preprocessed article bodies, I ran stemming and POS tagging, and applied a POS filter to keep only the Noun and Alpha tags.

The filtered words were merged into a single list called word_list.

# Filter by POS tag and return only the words
def pos_filtering(word_list):
    pos_list = ['Noun', 'Alpha']
    pos_filtered_word_list = [word[0] for word in word_list if word[1] in pos_list]

    return pos_filtered_word_list

# Stemming + POS tagging
df['본문_POS'] = df['본문_전처리'].map(lambda x: okt.pos(x, stem = True))

# Apply the POS filter
df['본문_단어'] = df['본문_POS'].map(pos_filtering)

# Merge the per-article word lists into one list
word_list = sum(df['본문_단어'], [])

word_list

In total, there were 1,261,028 words.

3. Frequency Analysis

Finally, to analyze word frequencies, I used the Counter class from Python's collections module.

# Frequency analysis
from collections import Counter

c = Counter(word_list)
num = 100

# Print only the top 100 words
print(c.most_common(num))

The results were as follows.

Meaningless words that appeared far too often were set as stopwords, the frequency analysis was run again, and a word cloud was generated from the remaining words.

# Stopwords
stopwords = ['수', '것', '이', '때', '등', '더', '를', '그', '위', '경우', '통해', '위해', '일', '다른', '가지', '대한', '의', '대해', '중', '내', '때문']

# Frequency analysis after removing the stopwords
word_list = [word for word in word_list if word not in stopwords]
c = Counter(word_list)   # re-count after filtering out the stopwords
print(c.most_common(num))

!pip install wordcloud
from wordcloud import WordCloud
word_dict = dict(c.most_common(100))   # convert the top 100 counts to a dictionary

# Configure the word cloud
wc = WordCloud(font_path = 'NanumGothic.ttf',
               background_color= 'white',
               width = 3000, height = 2000,
               min_font_size = 10)

cloud = wc.generate_from_frequencies(word_dict)   # the frequencies must be passed as a dictionary
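The code above only builds the cloud object; to actually display or save the image, something like the following (a minimal sketch using matplotlib) can be added.

import matplotlib.pyplot as plt

plt.figure(figsize = (15, 10))
plt.imshow(cloud, interpolation = 'bilinear')  # render the generated word cloud
plt.axis('off')
plt.show()

cloud.to_file('wordcloud.png')  # or save it straight to a file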

The final result was as follows.


In this way, I went over the general concepts of text mining and, with data crawled from the magazine '요즘IT', carried out text preprocessing, tokenization, and frequency analysis, and finally built a word cloud.