In the neural network tutorial, I introduced the gradient descent algorithm, which is used to train the weights in an artificial neural network. In reality, standard gradient descent is not often used for deep learning and big data tasks. Rather, a variant of gradient descent called *stochastic gradient descent*, and in particular its cousin *mini-batch gradient descent*, is used. That is the focus of this post.

## Gradient descent review

The gradient descent optimisation algorithm aims to minimise some cost/loss function based on that function’s gradient. Successive iterations are employed to progressively approach either a local or global minimum of the cost function. The figure below shows an example of gradient descent operating in a single dimension:
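The same iteration can be sketched in a few lines of Python. The quadratic function below is an illustrative choice (not part of the tutorial's network code); its gradient is simply $2x$, so each step moves $x$ towards the minimum at zero:

```python
def gradient_descent_1d(grad, x0, alpha=0.1, n_iter=100):
    """Minimal 1-D gradient descent: repeatedly step downhill along the gradient."""
    x = x0
    for _ in range(n_iter):
        x -= alpha * grad(x)
    return x

# minimise f(x) = x**2, whose gradient is 2*x, starting from x = 5.0
x_min = gradient_descent_1d(lambda x: 2 * x, x0=5.0)
print(x_min)  # converges towards the minimum at x = 0
```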

When training weights in a neural network, normal *batch* gradient descent usually takes the mean squared error of *all* the training samples when it is updating the weights of the network:

$$W = W - \alpha \nabla J(W,b)$$

where $W$ are the weights, $\alpha$ is the learning rate and $\nabla$ is the gradient of the cost function $J(W,b)$ with respect to changes in the weights. More details can be found in the neural networks tutorial, but in that tutorial the cost function $J$ was defined as:

$$J(W,b) = \frac{1}{m} \sum_{z=0}^{m} J(W, b, x^{(z)}, y^{(z)})$$

As can be observed, the overall cost function (and therefore the gradient) depends on the mean cost function calculated on *all* of the *m* training samples ($x^{(z)}$ and $y^{(z)}$ refer to each training sample pair). Is this the best way of doing things? Batch gradient descent is good because the training progress is nice and smooth – if you plot the average value of the cost function over the number of iterations / epochs it will look something like this:

As you can see, the line is mostly smooth and predictable. However, a problem with batch gradient descent in neural networks is that for every gradient descent update of the weights, you have to cycle through every training sample. For large data sets, e.g. > 50,000 training samples, this can be prohibitively slow. Batch gradient descent also has the following disadvantages:

- It requires the whole dataset to be loaded into memory, which can be problematic for big data sets
- It can't be efficiently parallelised (compared to the techniques about to be presented), because each update of the weight parameters requires a mean calculation of the cost function over *all* the training samples
- The smooth nature of the reducing cost function tends to ensure that the training will get stuck in a local minimum, which makes it less likely that a global minimum of the cost function will be found

Stochastic gradient descent is an algorithm that attempts to address some of these issues.

## Stochastic gradient descent

Stochastic gradient descent updates the weight parameters after evaluating the cost function *for each sample*. That is, rather than summing up the cost function results for all the samples and then taking the mean, stochastic gradient descent (or SGD) updates the weights after every training sample is analysed. Therefore, the updates look like this:

$$W = W - \alpha \nabla J(W,b, x^{(z)}, y^{(z)})$$

Notice that an update to the weights (and bias) is performed after every sample $z$ in $m$. This is easily implemented by a minor variation of the batch gradient descent code in Python, by simply shifting the update component into the sample loop (the original `train_nn` function can be found in the neural networks tutorial):

```python
def train_nn_SGD(nn_structure, X, y, iter_num=3000, alpha=0.25, lamb=0.000):
    W, b = setup_and_init_weights(nn_structure)
    cnt = 0
    m = len(y)
    avg_cost_func = []
    print('Starting gradient descent for {} iterations'.format(iter_num))
    while cnt < iter_num:
        if cnt % 50 == 0:
            print('Iteration {} of {}'.format(cnt, iter_num))
        tri_W, tri_b = init_tri_values(nn_structure)
        avg_cost = 0
        for i in range(len(y)):
            delta = {}
            # perform the feed forward pass and return the stored h and z values,
            # to be used in the gradient descent step
            h, z = feed_forward(X[i, :], W, b)
            # loop from nl-1 to 1 backpropagating the errors
            for l in range(len(nn_structure), 0, -1):
                if l == len(nn_structure):
                    delta[l] = calculate_out_layer_delta(y[i,:], h[l], z[l])
                    avg_cost += np.linalg.norm((y[i,:]-h[l]))
                else:
                    if l > 1:
                        delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
                    # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
                    tri_W[l] = np.dot(delta[l+1][:,np.newaxis],
                                      np.transpose(h[l][:,np.newaxis]))
                    # trib^(l) = trib^(l) + delta^(l+1)
                    tri_b[l] = delta[l+1]
            # perform the gradient descent step for the weights in each layer,
            # once per sample
            for l in range(len(nn_structure) - 1, 0, -1):
                W[l] += -alpha * (tri_W[l] + lamb * W[l])
                b[l] += -alpha * (tri_b[l])
        # complete the average cost calculation
        avg_cost = 1.0/m * avg_cost
        avg_cost_func.append(avg_cost)
        cnt += 1
    return W, b, avg_cost_func
```

In the above function, to implement stochastic gradient descent, the weight update code below was simply moved inside the sample loop `for i in range(len(y)):` (and the averaging over *m* samples removed):

```python
for l in range(len(nn_structure) - 1, 0, -1):
    W[l] += -alpha * (tri_W[l] + lamb * W[l])
    b[l] += -alpha * (tri_b[l])
```

In other words, it is a very easy transition from batch to stochastic gradient descent. So where does the "stochastic" part come in? The stochastic component is the random selection of the training samples. However, if we use the scikit-learn `train_test_split` function, the random shuffling has already occurred, so we can simply iterate through the training samples, which are already in a randomised order.
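As an aside, if you want to re-shuffle the sample order yourself (e.g. once per epoch, rather than relying on an earlier split), a sketch with NumPy looks like this. The small arrays here are illustrative, not from the tutorial's dataset:

```python
import numpy as np

# hypothetical small dataset: 6 samples, 2 features each
X = np.arange(12).reshape(6, 2)
y = np.arange(6)

# draw a random permutation of the sample indices, then apply it to
# both X and y so that sample/label pairs stay aligned
perm = np.random.permutation(len(y))
X_shuffled, y_shuffled = X[perm, :], y[perm]
```

Re-shuffling each epoch is the safer general approach, as it stops SGD from seeing the samples in the same order every pass through the data.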

## Stochastic gradient descent performance

So how does SGD perform? Let’s take a look. The plot below shows the average cost versus the number of training epochs / iterations for batch gradient descent and SGD on the scikit-learn MNIST dataset. Note that both of these are operating off the same optimised learning parameters (i.e. learning rate, regularisation parameter) which were determined according to the methods described in this post.

Some interesting things can be noted from the above figure. First, SGD converges much more rapidly than batch gradient descent. In fact, SGD converges on a minimum *J* after < 20 iterations. Secondly, despite what the average cost function plot says, batch gradient descent after 1000 iterations *outperforms* SGD. On the MNIST test set, the SGD run has an accuracy of 94% compared to a BGD accuracy of 96%. Why is that? Let’s zoom into the SGD run to have a closer look:

As you can see in the figure above, SGD is *noisy*. That is because it responds to the effects of each and every sample, and the samples themselves will no doubt contain an element of noisiness. While this can be a benefit in that it can act to “kick” the gradient descent out of local minimum values of the cost function, it can also prevent it from settling into a good minimum. This is why batch gradient descent eventually outperforms SGD after 1000 iterations. It might be argued that this is a worthwhile pay-off, as the running time of SGD versus BGD is greatly reduced. However, you might ask – is there a middle road, a trade-off?

There is, and it is called mini-batch gradient descent.

## Mini-batch gradient descent

Mini-batch gradient descent is a trade-off between stochastic gradient descent and batch gradient descent. In mini-batch gradient descent, the cost function (and therefore the gradient) is averaged over a small number of samples, typically between around 10 and 500. This is opposed to the SGD batch size of *1* sample, and the BGD size of *all* the training samples. It looks like this:

$$W = W - \alpha \nabla J(W,b, x^{(z:z+bs)}, y^{(z:z+bs)})$$

where $bs$ is the mini-batch size and the cost function is:

$$J(W,b, x^{(z:z+bs)}, y^{(z:z+bs)}) = \frac{1}{bs} \sum_{k=z}^{z+bs-1} J(W, b, x^{(k)}, y^{(k)})$$

What’s the benefit of doing it this way? First, it smooths out some of the noise in SGD, but not all of it, thereby still allowing the “kick” out of local minimums of the cost function. Second, the mini-batch size is still small, thereby keeping the performance benefits of SGD.

To create the mini-batches, we can use the following function:

```python
from numpy import random

def get_mini_batches(X, y, batch_size):
    # shuffle the sample indices, then slice the shuffled data into batches
    random_idxs = random.choice(len(y), len(y), replace=False)
    X_shuffled = X[random_idxs, :]
    y_shuffled = y[random_idxs]
    mini_batches = [(X_shuffled[i:i+batch_size, :], y_shuffled[i:i+batch_size])
                    for i in range(0, len(y), batch_size)]
    return mini_batches
```
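A quick sanity check of what this returns (the function is repeated here, and the all-ones data is made up, so the snippet runs standalone):

```python
import numpy as np

def get_mini_batches(X, y, batch_size):
    # same mini-batching function as above, repeated for a self-contained demo
    random_idxs = np.random.choice(len(y), len(y), replace=False)
    X_shuffled = X[random_idxs, :]
    y_shuffled = y[random_idxs]
    return [(X_shuffled[i:i + batch_size, :], y_shuffled[i:i + batch_size])
            for i in range(0, len(y), batch_size)]

X = np.ones((250, 4))   # 250 hypothetical samples, 4 features
y = np.zeros(250)
batches = get_mini_batches(X, y, batch_size=100)
print([mb[0].shape for mb in batches])  # [(100, 4), (100, 4), (50, 4)]
```

Note that the last mini-batch is smaller whenever the dataset size isn't divisible by the batch size; the training function below divides by `bs` regardless, which is a small approximation for that final batch.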

Then our new neural network training algorithm looks like this:

```python
def train_nn_MBGD(nn_structure, X, y, bs=100, iter_num=3000, alpha=0.25, lamb=0.000):
    W, b = setup_and_init_weights(nn_structure)
    cnt = 0
    m = len(y)
    avg_cost_func = []
    print('Starting gradient descent for {} iterations'.format(iter_num))
    while cnt < iter_num:
        if cnt % 1000 == 0:
            print('Iteration {} of {}'.format(cnt, iter_num))
        avg_cost = 0
        mini_batches = get_mini_batches(X, y, bs)
        for mb in mini_batches:
            X_mb = mb[0]
            y_mb = mb[1]
            # reset the gradient accumulators for each mini-batch
            tri_W, tri_b = init_tri_values(nn_structure)
            for i in range(len(y_mb)):
                delta = {}
                # perform the feed forward pass and return the stored h and z values,
                # to be used in the gradient descent step
                h, z = feed_forward(X_mb[i, :], W, b)
                # loop from nl-1 to 1 backpropagating the errors
                for l in range(len(nn_structure), 0, -1):
                    if l == len(nn_structure):
                        delta[l] = calculate_out_layer_delta(y_mb[i,:], h[l], z[l])
                        avg_cost += np.linalg.norm((y_mb[i,:]-h[l]))
                    else:
                        if l > 1:
                            delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
                        # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
                        tri_W[l] += np.dot(delta[l+1][:,np.newaxis],
                                           np.transpose(h[l][:,np.newaxis]))
                        # trib^(l) = trib^(l) + delta^(l+1)
                        tri_b[l] += delta[l+1]
            # perform the gradient descent step for the weights in each layer,
            # once per mini-batch, averaging the accumulated gradients over bs
            for l in range(len(nn_structure) - 1, 0, -1):
                W[l] += -alpha * (1.0/bs * tri_W[l] + lamb * W[l])
                b[l] += -alpha * (1.0/bs * tri_b[l])
        # complete the average cost calculation
        avg_cost = 1.0/m * avg_cost
        avg_cost_func.append(avg_cost)
        cnt += 1
    return W, b, avg_cost_func
```

Let’s see how it performs with a mini-batch size of 100 samples:

As can be observed in the figure above, mini-batch gradient descent appears to be the superior method of gradient descent to be used in neural network training. The jagged decline in the average cost function is evidence that mini-batch gradient descent is “kicking” the cost function out of local minimum values to reach better, perhaps even the best, minimum. However, it is still able to find a good minimum and stick to it. This is confirmed in the test data – the mini-batch method achieves an accuracy of 98% compared to the next best, batch gradient descent, which has an accuracy of 96%. The great thing is – it gets to these levels of accuracy after only 150 iterations or so.

One final benefit of mini-batch gradient descent is that it can be performed in a distributed manner. That is, each mini-batch can be computed in parallel by “workers” across multiple servers, CPUs and GPUs to achieve significant improvements in training speeds. There are multiple algorithms and architectures to perform this parallel operation, but that is a topic for another day. In the meantime, enjoy trying out mini-batch gradient descent in your neural networks.
