
[DeepLook] 2. AI Work Design Process / Crawling

Jang Youngjoon (장영준) 2023. 6. 20. 15:49

1. Design process

First, treating every celebrity as a possible look-alike match seemed far too broad, so I decided to limit the candidates to the cast of 'The Glory' (더 글로리), a drama that was popular at the time.

The work order I had in mind is as follows:

  1. Crawl photos of the The Glory cast
  2. Preprocess the photos and apply data augmentation
  3. Select a model
  4. Train the model
  5. Test and connect to the web

2. Crawling

There were of course plenty of libraries and open-source crawlers available, but I wanted to implement the crawler myself. (It does not take many resources, after all.)

Since what we need right now are images of the cast, I decided to crawl Google Images (https://www.google.co.kr/imghp?hl=ko).

Here is the code I ended up with:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import urllib.request
import ssl
import os

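# Work around SSL certificate errors by disabling certificate verification.
# Note: this turns off HTTPS verification for every urllib request in the process.
ssl._create_default_https_context = ssl._create_unverified_context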

keyword = input("Enter the name to search for: ")

# Map the Korean name to English initials used in the file names
keyword_to_english = ""
if keyword == "송혜교":
    keyword_to_english = "shg"
elif keyword == "이도현":
    keyword_to_english = "idh"
elif keyword == "임지연":
    keyword_to_english = "ijh"
elif keyword == "신예은":
    keyword_to_english = "she"
elif keyword == "손명오":
    keyword_to_english = "smo"

# selenium option
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=chrome_options
)

# Directory where the images are saved
save_path = "/Users/jang-youngjoon/dev-projects/youtuber-look-alike/crawled-image"


# Scroll to the very bottom -> load more items
def selenium_scroll_option():
    SCROLL_PAUSE_SEC = 2
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(SCROLL_PAUSE_SEC)
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break
        last_height = new_height


driver.get("https://www.google.co.kr/imghp?hl=ko")
elem = driver.find_element(By.NAME, "q")
elem.send_keys(keyword)
elem.send_keys(Keys.RETURN)
time.sleep(1)

# Scroll to the bottom to load more results (the 'Show more' button itself is not clicked)
selenium_scroll_option()

images = driver.find_elements(By.CSS_SELECTOR, ".rg_i.Q4LuWd")
count = 1
images_url_list = []

for image in images:
    if image:
        # image.send_keys(Keys.ENTER)  # earlier attempt: send_keys(Keys.ENTER) instead of .click()
        driver.execute_script("arguments[0].click();", image)
        time.sleep(3)
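        # find_elements returns an empty list when the preview image is missing,
        # whereas find_element would raise NoSuchElementException.
        matches = driver.find_elements(
            By.XPATH,
            '//*[@id="Sva75c"]/div[2]/div/div[2]/div[2]/div[2]/c-wiz/div/div/div/div[3]/div[1]/a/img[1]',
        )
        if not matches:
            continue
        # Prefer src and fall back to data-src; skip the image if neither is set.
        image_src = matches[0].get_attribute("src") or matches[0].get_attribute("data-src")
        if image_src:
            images_url_list.append(image_src)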
    else:
        break

for image_url in images_url_list:
    file_name = f"{keyword_to_english}_{count:06d}.jpg"
    file_place = os.path.join(save_path, file_name)
    urllib.request.urlretrieve(image_url, file_place)
    print(f"Image saved: {count}")
    count += 1

# quit() ends the whole browser session; close() only closes the current window.
driver.quit()

To explain the code: when it starts, it prompts for the name of the person to crawl. Of course, as the list of names grows, the if statement in the 'keyword' section has to be updated by hand, but I wrote it this way for now. (I should think about how to automate that part later.)
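
One way the if chain could be replaced is a dictionary lookup, so that adding a person means adding a single entry. This is only a minimal sketch: the names and initials are copied from the script, while `NAME_TO_INITIALS`, `initials_for`, and the `"unknown"` fallback are hypothetical and not part of the original code.

```python
# Hypothetical lookup table replacing the if/elif chain; entries copied from the script.
NAME_TO_INITIALS = {
    "송혜교": "shg",
    "이도현": "idh",
    "임지연": "ijh",
    "신예은": "she",
    "손명오": "smo",
}


def initials_for(keyword: str, default: str = "unknown") -> str:
    """Return the file-name prefix for a name, or `default` if the name is not registered."""
    # The "unknown" fallback is an assumption; the original script leaves the prefix empty.
    return NAME_TO_INITIALS.get(keyword.strip(), default)


if __name__ == "__main__":
    print(initials_for("송혜교"))
    print(initials_for("someone else"))
```

With this shape, an unregistered name gets a visible prefix instead of producing files that start with a bare underscore.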

Next, it opens Google Images and searches for the entered name. Since I needed a large number of photos, scrolling was essential, and perhaps because I already knew HTML it was easy to build. The site uses infinite scroll, so moving the scroll position to the bottom of the page triggers loading of more results, and the loop stops once the page height no longer grows.
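
The scroll loop can also be written independently of Selenium, which makes the stopping rule easier to see and to test. The sketch below assumes two caller-supplied callables (hypothetical names `get_height` and `scroll_to_bottom`) and adds a `max_rounds` cap that the original script does not have, as a guard against pages that never stop growing.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause_sec: float = 2.0,
    max_rounds: int = 50,  # assumption: safety cap, not in the original script
) -> int:
    """Scroll until the page height stops growing; return how many scrolls were made."""
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_to_bottom()
        rounds += 1
        time.sleep(pause_sec)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom has been reached
        last_height = new_height
    return rounds


# With Selenium, the two callables could be wired up like this (driver is a WebDriver):
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)"),
# )
```

Passing the browser calls in as functions keeps the loop itself free of any Selenium dependency.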

After that, a 'Show more' button appears and has to be clicked to see more photos. By the time that button shows up, however, at least 200 images have already been loaded, so I decided there was no need to implement the click.

The images are then fetched one by one and saved to my computer, each named with the person's initials and an image number.
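
The naming scheme can be separated from the download step. The helper below is a hypothetical sketch (`numbered_file_paths` is not in the original script): it reproduces the `<initials>_<6-digit number>.jpg` pattern and, as an added assumption, skips entries for which no URL was found so the numbering has no gaps.

```python
import os


def numbered_file_paths(urls, initials, save_dir, start=1):
    """Pair each usable URL with a path like <save_dir>/<initials>_000001.jpg."""
    pairs = []
    count = start
    for url in urls:
        if not url:  # skip entries where neither src nor data-src was found
            continue
        file_name = f"{initials}_{count:06d}.jpg"
        pairs.append((url, os.path.join(save_dir, file_name)))
        count += 1
    return pairs
```

Each resulting `(url, path)` pair could then be handed to `urllib.request.urlretrieve(url, path)` as in the script.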

That is how I crawled the cast images with a crawler I implemented myself.

 

3. Improvements

Having built and run the crawler myself, I realized that making it genuinely practical would require automating and improving quite a few things:

  1. How names are entered: if the user lists the names to crawl in a txt file, one per line, and the program reads them line by line and processes them in parallel, the work could be faster and more efficient.
  2. Google's periodically changing HTML elements: when I tried to reuse this crawler later, the HTML elements on the Google Images results page had changed, which made crawling difficult. This could be handled either by checking periodically and updating the element selectors by hand, or by finding a pattern in the site and updating them automatically. (The latter seems like it would take too much time and too many resources, though.)
  3. Certificate problems: occasionally a certificate-related error (apparently SSL) occurs at run time, and I do not yet know how to solve it.
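
The first item could look roughly like the sketch below. It assumes a UTF-8 text file with one name per line and a caller-supplied function `crawl_one(name)` that crawls a single person and returns the number of saved images; `read_names`, `crawl_all`, `crawl_one`, and `max_workers=3` are all hypothetical. Each `crawl_one` call would presumably need its own browser instance, since a single WebDriver should not be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def read_names(path: str) -> List[str]:
    """Read one name per line, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def crawl_all(
    names: List[str],
    crawl_one: Callable[[str], int],
    max_workers: int = 3,  # assumption: a small pool, since each worker opens a browser
) -> Dict[str, int]:
    """Run crawl_one for every name in parallel and map each name to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(crawl_one, names))  # map keeps results in input order
    return dict(zip(names, results))
```

A run might then be `crawl_all(read_names("names.txt"), crawl_one)`, where `names.txt` is a placeholder file name.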

There is a lot to improve, but I was still happy to have built my own crawler and collected a large set of photos. Next, I plan to preprocess those photos.