How to get the same word2vec/doc2vec/paragraph vectors in every training run

OK, welcome to our Word Embedding Series. This post is the first story of the series. It is best suited to intermediate or advanced readers who have trained, or at least tried, word2vec or doc2vec/paragraph vectors. But no worries: in the following weeks I will introduce the background, the prerequisites, and how the packages implement them.

I will try my best not to redirect you to other links that ask you to read tedious tutorials until you give up (trust me, I am a victim of the tremendous number of online tutorials :) ). I want you to understand word vectors at the code level together with me.

If you have had the chance to train word vectors yourself, you may have found that the model and the vector representations differ from one training run to the next, even when you feed in the same training data. This is because of the randomness introduced during training. The code can speak for itself, so let's see where the randomness comes from and how to eliminate it completely. I will use DL4J's implementation of paragraph vectors to show the code. If you want to look at another package, see Gensim's doc2vec, which is implemented in the same way.

Where the randomness comes from

The initialization of weights and matrix

We know that before training, the model weights and the vector representations are initialized randomly, and that this randomness is controlled by a seed. Hence, if we fix the seed (say, to 0), we get exactly the same initialization every time. Here is where the seed takes effect.

syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
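To see why a fixed seed pins down the initialization, here is a minimal sketch in plain Java. It uses java.util.Random rather than ND4J, and the class name SeededInit and the method initWeights are hypothetical names made up for illustration. It only mimics the shape of the line above (uniform values, shifted by 0.5 and divided by the vector length).

```java
import java.util.Arrays;
import java.util.Random;

public class SeededInit {
    // Fill a numWords x vectorLength matrix with (U[0,1) - 0.5) / vectorLength.
    // This mirrors the shape of the DL4J line above, but with java.util.Random.
    static double[][] initWeights(int numWords, int vectorLength, long seed) {
        Random rng = new Random(seed);
        double[][] syn0 = new double[numWords][vectorLength];
        for (int i = 0; i < numWords; i++) {
            for (int j = 0; j < vectorLength; j++) {
                syn0[i][j] = (rng.nextDouble() - 0.5) / vectorLength;
            }
        }
        return syn0;
    }

    public static void main(String[] args) {
        double[][] first = initWeights(5, 4, 0L);
        double[][] second = initWeights(5, 4, 0L);
        double[][] other = initWeights(5, 4, 1L);
        System.out.println("same seed, same matrix: " + Arrays.deepEquals(first, second));
        System.out.println("different seed, same matrix: " + Arrays.deepEquals(first, other));
    }
}
```

Two calls with the same seed give identical matrices, while a different seed gives a different one. The seed value itself does not matter, as long as it stays the same between runs.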

PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain the details in the following posts) to train paragraph vectors, then during the iterations it randomly subsamples words from the text window to compute and update the weights. But this randomness is not truly random. Let's look at the code.

// next random is an AtomicLong initialized by thread id
this.nextRandom = new AtomicLong(this.threadId);

And nextRandom is used in

trainSequence(sequence, nextRandom, alpha);

Inside trainSequence, it is updated as follows:

nextRandom.set(nextRandom.get() * 25214903917L + 11);

If we go deeper, we find that nextRandom is generated in exactly the same way every time, so the number depends only on the thread id, where the thread id is 0, 1, 2, 3, ... Hence, it is no longer random.
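The following sketch replays that update outside of DL4J to make the point concrete. The class name ThreadSeededSequence and the method sequenceFor are hypothetical; only the AtomicLong and the multiplier and increment are taken from the snippets above.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadSeededSequence {
    // Produce the first `steps` values of the update shown above,
    // starting from the given thread id.
    static long[] sequenceFor(long threadId, int steps) {
        AtomicLong nextRandom = new AtomicLong(threadId);
        long[] values = new long[steps];
        for (int i = 0; i < steps; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            values[i] = nextRandom.get();
        }
        return values;
    }

    public static void main(String[] args) {
        for (long threadId = 0; threadId < 3; threadId++) {
            System.out.println("thread " + threadId + ": "
                    + Arrays.toString(sequenceFor(threadId, 3)));
        }
    }
}
```

For thread id 0 the first value is 11, and for thread id 1 it is 25214903928, on every run. Nothing here depends on the seed or the clock, so this part of the training is already reproducible as long as the thread ids are the same.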

Parallel tokenization

This option is used to tokenize in parallel. Since processing complex text can take a lot of time, parallel tokenization can improve performance, but consistency between training runs is not guaranteed: the sequences processed by the tokenizer may be fed to the training threads in arbitrary order. As the code shows, if we set allowParallelBuilder to false, the runnable that does the tokenization is awaited until it finishes, so the order is maintained.

if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
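The effect of awaiting each task can be sketched with a plain thread pool. This is not DL4J code: TokenizationOrder, arrivalOrder and the awaitEach flag are hypothetical names, and String.split stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TokenizationOrder {
    // Tokenize every document on a thread pool and record the order in which
    // the results arrive. With awaitEach = true we block on each task before
    // submitting the next one, so arrival order equals input order.
    static List<String> arrivalOrder(List<String> docs, boolean awaitEach) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> arrived = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (String doc : docs) {
                Future<?> task = pool.submit(() -> {
                    String[] tokens = doc.split("\\s+"); // stand-in for a real tokenizer
                    arrived.add(tokens[0]);
                });
                if (awaitEach) {
                    task.get(); // same idea as runnable.awaitDone() above
                } else {
                    pending.add(task);
                }
            }
            for (Future<?> task : pending) {
                task.get(); // wait for the tasks that ran in parallel
            }
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(arrived);
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("d0 some text", "d1 more text", "d2 even more", "d3 the end");
        System.out.println("awaited:  " + arrivalOrder(docs, true));
        System.out.println("parallel: " + arrivalOrder(docs, false));
    }
}
```

With awaitEach set to true the documents always arrive as d0, d1, d2, d3. Without it, the same documents arrive, but their order may change from run to run.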

Queue that provides sequences to every thread to train

This LinkedBlockingQueue gets sequences from the iterator over the training text and provides them to each thread. Since the threads can be scheduled in any order, each thread may receive different sequences to train on in every training run. Let's look at the implementation.

// initialize a sequencer to provide data to threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);

// every thread points to the same sequencer
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}

// the sequencer initializes a LinkedBlockingQueue buffer
// and keeps its size between limitLower and limitUpper
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;

// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every training run. Note, however, that a single thread will slow training down tremendously.
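Here is a small sketch of that behaviour with a shared LinkedBlockingQueue. It is a simplification under stated assumptions: the queue is filled up front instead of being refilled by a sequencer, the poll timeout is shortened, and the names SharedQueueWorkers and consume are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SharedQueueWorkers {
    // Let `workers` threads drain one shared queue and log which worker
    // took which sequence, in the order the log entries were written.
    static List<String> consume(int numSequences, int workers) throws InterruptedException {
        LinkedBlockingQueue<Integer> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < numSequences; i++) {
            buffer.add(i);
        }
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int x = 0; x < workers; x++) {
            final int workerId = x;
            Thread t = new Thread(() -> {
                try {
                    Integer seq;
                    // a short timeout stands in for the 3-second poll above
                    while ((seq = buffer.poll(50L, TimeUnit.MILLISECONDS)) != null) {
                        log.add("worker" + workerId + ":seq" + seq);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 worker:  " + consume(6, 1));
        System.out.println("3 workers: " + consume(6, 3));
    }
}
```

With one worker the log is always worker0:seq0 through worker0:seq5, in that order. With three workers every sequence is still consumed once, but which worker takes which sequence, and in what order, can differ between runs.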

Summary

To summarize, this is what we need to do to remove the randomness completely:
1. Set the seed to a fixed value, such as 0;
2. Set allowParallelTokenization to false;
3. Set the number of workers (threads) to 1.
Then we get exactly the same word vectors and paragraph vectors whenever we feed in the same data.
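Put together, the three settings might look like the configuration fragment below. Treat it as a sketch rather than a verified recipe: it needs the deeplearning4j-nlp dependency on the classpath, the builder method names (seed, allowParallelTokenization, workers, iterate, tokenizerFactory) are my assumption about the ParagraphVectors.Builder API and should be checked against the DL4J version you use, and the class name DeterministicParagraphVectors is hypothetical.

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class DeterministicParagraphVectors {
    // Build and train paragraph vectors with the three settings from the list above.
    // Method names are assumed from the DL4J builder API; verify them for your version.
    static ParagraphVectors train(LabelAwareIterator iterator, TokenizerFactory tokenizerFactory) {
        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .seed(0L)                         // 1. fixed seed for the initialization
                .allowParallelTokenization(false) // 2. tokenize in order
                .workers(1)                       // 3. single training thread
                .iterate(iterator)
                .tokenizerFactory(tokenizerFactory)
                .build();
        vectors.fit();
        return vectors;
    }
}
```

All other hyperparameters (layer size, window, epochs and so on) are left at their defaults here. They also have to stay the same between runs if you want identical vectors.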

Please follow the next stories about word embeddings and language models; I have prepared a feast for you.
