🧠 Neural Network _ concept

🤫 Neural Network
Although it is nothing more than a perceptron with several additional layers, it can solve problems that a single perceptron could not, such as the XOR problem. A model built from many such neurons is called an artificial neural network, or neural network for short. A neural network is structured as described below, and a learning method that exploits deep hidden layers is called Deep Learning.

Neural network: the function that passes values from the input layer to a hidden layer is a vector-to-vector function.
Each individual node: a vector-to-scalar function (∵ the nodes that make up a layer each act on the vector individually and independently).

input layer: the inputs are the features of the data, so the number of nodes = the number of features.
output layer: the number of nodes = the number of classes to be classified; the class with the highest score among the class scores is selected as the final prediction.
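
To make the layer shapes concrete, here is a minimal NumPy sketch of a forward pass; the sizes (4 features, 5 hidden nodes, 3 classes) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 5 hidden nodes, 3 output classes.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)   # hidden -> output

x = rng.normal(size=4)                    # one sample: node count = feature count
h = np.tanh(x @ W1 + b1)                  # hidden layer: a vector-to-vector map
scores = h @ W2 + b2                      # one score per class
predicted_class = int(np.argmax(scores))  # the class with the highest score wins
```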

 

 

🧠 back propagation (error backpropagation)

🤫 Error backpropagation (back propagation)

In a multilayer perceptron, the process of finding the optimal values uses the back propagation method.

As the name suggests, backpropagation works backwards through the network, in the order output layer, hidden layers, input layer.

Through backpropagation, the weights are corrected based on the error, improving the model so that it performs better.

 



🤫 The back propagation procedure

 1) Initialize the weights
 2) Compute the output via forward propagation
 3) Define the cost function and obtain its first derivative => for the true value t and output z, cost = (t - z)^2 / 2
 4) Compute the first-derivative values via back propagation
 5) Set the learning rate, then update the parameters (weights, biases)
 6) Repeat steps 2 through 5
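
A minimal sketch of this loop, assuming a single sigmoid neuron trained on a made-up OR dataset (the data, learning rate, and epoch count are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: learn OR from two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0           # 1) initialize the weights
lr = 0.5                                 # 5) learning rate

for epoch in range(2000):                # 6) repeat steps 2-5
    for x, t in zip(X, T):
        z = sigmoid(x @ w + b)           # 2) forward propagation
        # 3) cost E = (t - z)^2 / 2, so dE/dz = (z - t)
        # 4) back propagation: chain rule through the sigmoid
        delta = (z - t) * z * (1.0 - z)  # dE/d(pre-activation)
        w -= lr * delta * x              # 5) update the parameters
        b -= lr * delta

print(np.round(sigmoid(X @ w + b), 2))   # ≈ [0, 1, 1, 1] after training
```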

 

 

🧠 activation function

🤫 Activation function
An activation function takes the value computed from the input, weights, and bias and decides whether the corresponding node should be activated.


🤫 Types of activation functions
Step function: with 0 as the threshold, the output takes only two values, 0 and 1.

(Its drawback is that it is not differentiable at 0.)


Sign function: similar to the step function, but with 0 as the threshold it takes three values: -1, 0, and 1.


Sigmoid function: 1 / (1 + exp(-x)), which outputs values between 0 and 1.

   It does have one drawback: the vanishing gradient problem.

   Vanishing gradient problem: as derivatives are multiplied repeatedly during training, the gradient values keep shrinking and may vanish.

   The derivative values approach 0, which can ultimately slow learning down.

 
Hyperbolic tangent (tanh): (exp(x) - exp(-x)) / (exp(x) + exp(-x))

    A transformed version of the sigmoid function (whose range is 0 to 1); tanh has the range -1 to 1.


ReLU (Rectified Linear Unit): max(x, 0); unlike the functions above, it has no upper bound.

Leaky ReLU: max(x, ax) for a small constant a <= 1; typically a = 0.01.

Identity function (linear function): f(x) = x, so the output equals the input.

     Mainly used as the activation of the final output layer in regression problems.

Softmax function: exp(x_i) / Σ_j exp(x_j)

     Mainly used as the activation of the final output layer in classification problems.

     However, using the formula as-is risks overflow, so it is used in the following form:

     exp(x_i + C) / Σ_j exp(x_j + C)

     Here the constant C is usually set to minus the maximum input value, i.e., the maximum is subtracted from every input before exponentiating.

 

     Also, since softmax outputs lie between 0 and 1 (and sum to 1), they can be interpreted as probabilities (a larger input gives a larger probability, so the ordering is preserved).

     In other words, you can read off which class will receive the largest output.
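
The functions above fit in a few lines of NumPy; the softmax below uses the max-subtraction trick just described (the example scores are made up so that the naive formula would overflow):

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)    # two outputs: 0 and 1

def sign(x):
    return np.sign(x)                   # three outputs: -1, 0, 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # range (0, 1)

def relu(x):
    return np.maximum(x, 0.0)           # no upper bound

def leaky_relu(x, a=0.01):
    return np.maximum(x, a * x)         # small slope a for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))           # subtract the max to avoid overflow
    return e / e.sum()                  # outputs lie in (0, 1) and sum to 1

scores = np.array([1010.0, 1000.0, 990.0])  # naive exp() overflows on these
print(softmax(scores))                      # ≈ [1.0, 4.5e-05, 2.1e-09]
```

(tanh is available directly as np.tanh, and the identity function simply returns its input.)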

 

 

 

 

🧠 batch normalization

🤫 What is batch size?

Batch size is the number of data samples used each time the parameters are updated during training.
Using the way a person learns by solving problems as an analogy, the batch size is how many problems you solve in one sitting before grading them.
For example, if there are 100 problems in total and you solve and grade 20 at a time, the batch size is 20.

Using this, a deep learning model takes one batch of data, computes the error between the model's predictions and the true answers (cf. the loss function), and the optimizer updates the parameters accordingly.
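
A minimal sketch of how a dataset is consumed batch by batch, using the 100-problems example above (the array contents are made up):

```python
import numpy as np

X = np.arange(100)               # 100 training samples
batch_size = 20

for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]
    # The forward pass, loss (error) computation, and one optimizer step
    # happen here, so 100 samples at batch size 20 = 5 updates per epoch.
    print(batch.shape)           # (20,)
```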






🤫 batch normalization
A method that changes the distribution of a layer's values by fixing their mean and variance.
Its advantage is that it reduces the vanishing gradient problem and can thereby speed up training.

 

The values are normalized using the mini-batch mean and variance so that they follow a distribution with mean 0 and variance 1: N(0, 1^2).

 

Then, applying an affine transformation with the scale parameter γ and the shift parameter β to the normalized values makes scale and shift possible.
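
A minimal NumPy sketch of the two steps above (normalize with the mini-batch statistics, then scale and shift), assuming a (batch, features) input; the toy input distribution is made up:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: mean 0, variance 1
    return gamma * x_hat + beta            # affine transform: scale γ, shift β

x = np.random.default_rng(0).normal(5.0, 3.0, size=(20, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))  # ≈ 0 and ≈ 1
```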











🤫 Normalization? Standardization? Regularization?

Normalization

 

  • ๊ฐ’์˜ ๋ฒ”์œ„(scale)๋ฅผ 0~1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ
  • ํ•™์Šต ์ „์— scalingํ•˜๋Š” ๊ฒƒ
    • ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ scale์ด ํฐ feature์˜ ์˜ํ–ฅ์ด ๋น„๋Œ€ํ•ด์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
    • ๋”ฅ๋Ÿฌ๋‹์—์„œ Local Minima์— ๋น ์งˆ ์œ„ํ—˜ ๊ฐ์†Œ(ํ•™์Šต ์†๋„ ํ–ฅ์ƒ)
  • scikit-learn์—์„œ MinMaxScaler

Standardization

  • ๊ฐ’์˜ ๋ฒ”์œ„(scale)๋ฅผ ํ‰๊ท  0, ๋ถ„์‚ฐ 1์ด ๋˜๋„๋ก ๋ณ€ํ™˜
  • ํ•™์Šต ์ „์— scalingํ•˜๋Š” ๊ฒƒ
    • ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ scale์ด ํฐ feature์˜ ์˜ํ–ฅ์ด ๋น„๋Œ€ํ•ด์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
    • ๋”ฅ๋Ÿฌ๋‹์—์„œ Local Minima์— ๋น ์งˆ ์œ„ํ—˜ ๊ฐ์†Œ(ํ•™์Šต ์†๋„ ํ–ฅ์ƒ)
  • ์ •๊ทœ๋ถ„ํฌ๋ฅผ ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์Œ
    • Z-score(ํ‘œ์ค€ ์ ์ˆ˜)
    • -1 ~ 1 ์‚ฌ์ด์— 68%๊ฐ€ ์žˆ๊ณ , -2 ~ 2 ์‚ฌ์ด์— 95%๊ฐ€ ์žˆ๊ณ , -3 ~ 3 ์‚ฌ์ด์— 99%๊ฐ€ ์žˆ์Œ
    • -3 ~ 3์˜ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฉด outlier์ผ ํ™•๋ฅ ์ด ๋†’์Œ
  • ํ‘œ์ค€ํ™”๋กœ ๋ฒˆ์—ญํ•˜๊ธฐ๋„ ํ•จ
  • scikit-learn์—์„œ StandardScaler
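
A quick illustration of the two scalers named above, on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # [0.    0.444 1.   ] -> range 0~1
print(StandardScaler().fit_transform(X).ravel())  # Z-scores: mean 0, variance 1
```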



Regularization

  • A technique that places a constraint (penalty) on how the weights are adjusted
  • Used to prevent overfitting
  • Variants include L1 regularization and L2 regularization (see the sketch below)
    • L1: LASSO; diamond-shaped constraint region
    • L2: Ridge; circular constraint region
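
A quick illustration with scikit-learn's Lasso (L1) and Ridge (L2) on made-up data in which the second feature is irrelevant; note how L1 can push a weight exactly to 0 while L2 only shrinks it:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=50)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: the useless weight becomes 0
print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2: weights shrink toward 0
```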

 

 

🧠 Drop Out

🤫 Drop out

A method that does not use every node in the network: some nodes are temporarily removed from the network.

Which nodes to remove temporarily is chosen at random; this reduces the amount of computation and helps prevent overfitting.
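
A minimal sketch of dropout with drop probability p = 0.5; the rescaling by 1/(1 - p) (inverted dropout) is a common implementation detail, added here so that the expected activation stays the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    if not training:
        return h                     # every node is used at inference time
    mask = rng.random(h.shape) >= p  # True = the node is kept this pass
    return h * mask / (1.0 - p)      # rescale so the expected value is unchanged

print(dropout(np.ones(10)))          # about half the nodes are dropped (set to 0)
```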

 

 
