How do you train an AI to generate matching HTML and CSS from design mockups?

Currently, the largest barrier to automating front-end development is computing power. However, we can use current deep learning algorithms, along with synthesized training data, to start exploring artificial front-end automation right now.

In this post, we'll teach a neural network how to code a basic HTML and CSS website based on a picture of a design mockup. Here's a quick overview of the process:

1) Give a design image to the trained neural network

2) The neural network converts the image into HTML markup

3) Rendered output

We'll build the neural network in three iterations.

First, we'll make a bare-minimum version to get the hang of the moving parts. The second version, HTML, will focus on automating all the steps and explaining the neural network layers. In the final version, Bootstrap, we'll create a model that can generalize and explore the LSTM layer.

All the code is prepared on GitHub and FloydHub in Jupyter notebooks. All the FloydHub notebooks are inside the floydhub directory and the local equivalents are under local.

The models are based on Beltramelli's pix2code paper and Jason Brownlee's image caption tutorials. The code is written in Python and Keras, a framework on top of TensorFlow.

If you're new to deep learning, I'd recommend getting a feel for Python, backpropagation, and convolutional neural networks. My three earlier posts on FloydHub's blog will get you started:

My First Weekend Of Deep Learning
Coding The History Of Deep Learning
Colorizing B&W Photos with Neural Networks

Core Logic

Let's recap our goal: we want to build a neural network that will generate HTML/CSS markup that corresponds to a screenshot.

When you train the neural network, you give it several screenshots with matching HTML.

It learns by predicting all the matching HTML markup tags one by one. When it predicts the next markup tag, it receives the screenshot as well as all the correct markup tags until that point.

Here is a simple training data example in a Google Sheet.

Creating a model that predicts word by word is the most common approach today. There are other approaches, but that's the method we'll use throughout this tutorial.

Notice that for each prediction it gets the same screenshot. So if it has to predict 20 words, it will get the same design mockup twenty times. For now, don't worry about how the neural network works. Focus on grasping the input and output of the neural network.

Let's focus on the previous markup. Say we train the network to predict the sentence "I can code". When it receives "I", it predicts "can". Next time it receives "I can" and predicts "code". It receives all the previous words and only has to predict the next word.
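To make the input/output pairing concrete, here is a minimal sketch in plain Python (no Keras) of how one screenshot and one sentence expand into several training samples. The function name `make_training_pairs` and the placeholder string `screenshot.jpg` are illustrative assumptions, not part of the project's code.

```python
def make_training_pairs(screenshot, tokens):
    """Expand one (screenshot, token sequence) into next-token samples."""
    pairs = []
    for i in range(1, len(tokens)):
        previous = tokens[:i]   # every token seen so far
        target = tokens[i]      # the single token to predict
        # The same screenshot is attached to every sample
        pairs.append((screenshot, previous, target))
    return pairs


pairs = make_training_pairs("screenshot.jpg", ["start", "I", "can", "code", "end"])
for screenshot, previous, target in pairs:
    print(screenshot, previous, "->", target)
```

Each printed line is one sample: the screenshot and the previous tokens form the input, and the token after the arrow is the output.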

The neural network creates features from the data. The network builds features to link the input data with the output data. It has to create representations to understand what is in each screenshot and the HTML syntax it has predicted so far. This builds the knowledge to predict the next tag.

Using the trained model in the real world is similar to training it. The text is generated one tag at a time, with the same screenshot each time. Instead of being fed the correct HTML tags, the model receives the markup it has generated so far. Then it predicts the next markup tag. The prediction is initiated with a "start tag" and stops when it predicts an "end tag" or reaches a maximum limit. Here's another example in a Google Sheet.
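The generation loop described above can be sketched without a real model. In this sketch, `fake_predict_next` is a hypothetical stand-in that replays a fixed script; in the real project that role is played by the trained network. The names `generate_markup` and `max_tokens` are also assumptions made for illustration.

```python
def generate_markup(predict_next, screenshot, max_tokens=10):
    """Greedy decoding: feed the model its own output until 'end' appears."""
    generated = ["start"]
    while len(generated) < max_tokens:
        next_token = predict_next(screenshot, generated)
        generated.append(next_token)
        if next_token == "end":
            break
    return generated


# Hypothetical stand-in for a trained model: replays a fixed script
SCRIPT = ["start", "<html>", "Hello", "</html>", "end"]


def fake_predict_next(screenshot, previous_tokens):
    return SCRIPT[len(previous_tokens)]


result = generate_markup(fake_predict_next, "screenshot.jpg")
print(" ".join(result))
```

The loop has two exits, matching the two stop conditions in the text: the model emits the end tag, or the sequence reaches `max_tokens`.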

Hello World Version

Let's build a hello world version. We'll feed a neural network a screenshot of a website displaying "Hello World!" and teach it to generate the markup.

First, the neural network maps the design mockup into a list of pixel values: from 0 to 255 in three channels (red, blue, and green).

To represent the markup in a way that the neural network understands, I use one-hot encoding. Thus, the sentence "I can code" could be mapped as shown below.

In the above graphic, we include the start and end tag. These tags are cues for when the network starts its predictions and when to stop.
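As a rough illustration of one-hot encoding with start and end tags, here is a small plain-Python sketch. The five-word vocabulary and the helper `one_hot` are assumptions chosen to match the "I can code" example, not code from the project.

```python
vocabulary = ["start", "I", "can", "code", "end"]


def one_hot(token, vocabulary):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector


sentence = ["start", "I", "can", "code", "end"]
encoded = [one_hot(token, vocabulary) for token in sentence]
for token, vector in zip(sentence, encoded):
    print(f"{token:>5} {vector}")
```

Every vector has the same length as the vocabulary, which is why the vocabulary size caps the width of each word.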

For the input data, we will use sentences, starting with the first word and then adding each word one by one. The output data is always one word.

Sentences follow the same logic as words. They also need the same input length. Instead of being capped by the vocabulary, they are bound by the maximum sentence length. If a sentence is shorter than the maximum length, you fill it up with empty words, where an empty word is just zeros.

As you see, words are printed from right to left. This forces each word to change position for each training round. This allows the model to learn the sequence instead of memorizing the position of each word.
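The right-to-left layout comes from padding on the left. A minimal sketch, assuming a three-word vocabulary and a maximum sentence length of three as in the hello world version; `pad_left` is a hypothetical helper, not a Keras function:

```python
def pad_left(vectors, max_len, vector_size):
    """Left-pad a list of one-hot vectors with all-zero 'empty words'."""
    empty_word = [0] * vector_size
    padding = [list(empty_word) for _ in range(max_len - len(vectors))]
    # Keep only the most recent max_len words if the sentence is too long
    return padding + vectors[-max_len:]


start, hello = [1, 0, 0], [0, 1, 0]

round_one = pad_left([start], max_len=3, vector_size=3)
round_two = pad_left([start, hello], max_len=3, vector_size=3)
print(round_one)
print(round_two)
```

The start token sits in the last slot in the first round and moves one slot to the left in the second, so its position changes between training rounds.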

In the below graphic there are four predictions. Each row is one prediction. To the left are the images represented in their three color channels: red, green and blue and the previous words. Outside of the brackets are the predictions one by one, ending with a red square to mark the end.

green blocks = start tokens | red block = end token

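from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import LSTM, Dense, Input, RepeatVector, concatenate
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img
import numpy as np

# Length of longest sentence
max_caption_len = 3
# Size of vocabulary
vocab_size = 3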

# Load one screenshot for each word and turn them into digits 
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)

#Turn start tokens into one-hot encoding
html_input = np.array(
            [[[0., 0., 0.], #start
             [0., 0., 0.],
             [1., 0., 0.]],
             [[0., 0., 0.], #start <HTML>Hello World!</HTML>
             [1., 0., 0.],
             [0., 1., 0.]]])

#Turn next word into one-hot encoding
next_words = np.array(
            [[0., 1., 0.], # <HTML>Hello World!</HTML>
             [0., 0., 1.]]) # end

# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)

#Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
# Extract information from the input sequence
language_input = Input(shape=(max_caption_len, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)

# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

In the hello world version, we use three tokens: start, <HTML><center><H1>Hello World!</H1></center></HTML> and end. A token can be anything. It can be a character, word, or sentence. Character versions require a smaller vocabulary but constrain the neural network. Word level tokens tend to perform best.

Here we make the prediction:

# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3)) # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.] # start
sentence[0][2] = start_token # place start in empty sentence
    
# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])
    
# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])
    
# Place the start token and our two predictions in the sentence 
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)
    
# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')

Output

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

Mistakes I made:

Build the first working version before gathering the data. Early on in this project, I managed to get a copy of an old archive of the Geocities hosting website. It had 38 million websites. Blinded by the potential, I ignored the huge workload that would be required to reduce the 100K-sized vocabulary.
Dealing with a terabyte worth of data requires good hardware or a lot of patience. After my Mac ran into several problems, I ended up using a powerful remote server. Expect to rent a rig with 8 modern CPU cores and a 1 Gbps internet connection to have a decent workflow.
Nothing made sense until I understood the input and output data. The input, X, is one screenshot and the previous markup tags. The output, Y, is the next markup tag. When I got this, it became easier to understand everything between them. It also became easier to experiment with different architectures.
Be aware of the rabbit holes. Because this project intersects with a lot of fields in deep learning, I got stuck in plenty of rabbit holes along the way. I spent a week programming RNNs from scratch, got too fascinated by embedding vector spaces, and was seduced by exotic implementations.
Picture-to-code networks are image caption models in disguise. Even when I learned this, I still ignored many of the image caption papers, simply because they were less cool. Once I got some perspective, I accelerated my learning of the problem space.

Running the code on FloydHub

FloydHub is a training platform for deep learning. I came across them when I first started learning deep learning, and I've used them since for training and managing my deep learning experiments. You can install it and run your first model within 10 minutes. It's the best option for running models on cloud GPUs.

If you are new to FloydHub, do their 2-min installation or my 5-minute walkthrough.

Clone the repository

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Log in and initialize the FloydHub command-line tool

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

All the notebooks are prepared inside the floydhub directory. The local equivalents are under local. Once it's running, you can find the first notebook here: floydhub/Helloworld/helloworld.ipynb.

If you want more detailed instructions and an explanation for the flags, check my earlier post.

HTML Version

In this version, we'll automate many of the steps from the Hello World model. This section will focus on creating a scalable implementation and on the moving pieces in the neural network.

This version will not be able to predict HTML from random websites, but it's still a great setup for exploring the dynamics of the problem.

Overview

If we expand the components of the previous graphic it looks like this.

There are two major sections. First, the encoder. This is where we create image features and previous markup features. Features are the building blocks that the network creates to connect the design mockups with the markup. At the end of the encoder, we glue the image features to each word in the previous markup.

The decoder then takes the combined design and markup feature and creates a next tag feature. This feature is run through a fully connected neural network to predict the next tag.

Design mockup features

Since we need to insert one screenshot for each word, this becomes a bottleneck when training the network (example). Instead of using the images, we extract the information we need to generate the markup.

The information is encoded into image features. This is done by using an already pre-trained convolutional neural network (CNN). The model is pre-trained on Imagenet.

We extract the features from the layer before the final classification.

We end up with 1536 eight by eight pixel images known as features. Although they are hard to understand for us, a neural network can extract the objects and position of the elements from these features.

Markup features

In the hello world version, we used one-hot encoding to represent the markup. In this version, we'll use a word embedding for the input and keep the one-hot encoding for the output.

The way we structure each sentence stays the same, but how we map each token is changed. One-hot encoding treats each word as an isolated unit. Instead, we convert each word in the input data to lists of digits. These represent the relationship between the markup tags.

The dimension of this word embedding is eight, but it often varies between 50 and 500 depending on the size of the vocabulary.

The eight digits for each word are weights similar to a vanilla neural network. They are tuned to map how the words relate to each other (Mikolov et al., 2013).
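To show what "a list of eight digits per word" looks like, here is a hedged plain-Python sketch of an embedding lookup. The four-word vocabulary and the random initial values are assumptions; in the real model the Keras Embedding layer owns this table and tunes it during training, which this sketch does not do.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocabulary = ["start", "<div>", "</div>", "end"]
embedding_dim = 8

# One row of eight trainable weights per word, randomly initialised
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in vocabulary
]


def embed(token_ids):
    """Replace each integer token id with its row of weights."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = [vocabulary.index(t) for t in ["start", "<div>", "</div>"]]
embedded = embed(token_ids)
print(len(embedded), "tokens x", len(embedded[0]), "values each")
```

Unlike a one-hot vector, the width of each row is independent of the vocabulary size, and related tags can end up with similar rows once the weights are trained.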

This is how we start developing markup features. Features are what the neural network develops to link the input data with the output data. For now, don't worry about what they are; we'll dig deeper into this in the next section.

The Encoder

We'll take the word embeddings and run them through an LSTM, which returns a sequence of markup features. These are run through a TimeDistributed dense layer; think of it as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened. Regardless of how the digits were structured, they are transformed into one large list of numbers. Then we apply a dense layer on this layer to form a high-level feature. These image features are then concatenated to the markup features.

This can be hard to wrap your mind around, so let's break it down.

Markup features

Here we run the word embeddings through the LSTM layer. In this graphic, all the sentences are padded to reach the maximum size of three tokens.

To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features. TimeDistributed dense is the same as a dense layer, but with multiple inputs and outputs.
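One way to picture a TimeDistributed dense layer is as a single dense layer whose weights are reused at every time step. The following plain-Python sketch uses made-up weights and toy sizes (2 inputs and 3 outputs per step); it illustrates the idea and is not the Keras implementation.

```python
def dense(inputs, weights, biases):
    """A plain dense layer with ReLU: one output per column of weights."""
    outputs = []
    for column, bias in zip(zip(*weights), biases):
        total = sum(x * w for x, w in zip(inputs, column)) + bias
        outputs.append(max(0.0, total))
    return outputs


def time_distributed_dense(sequence, weights, biases):
    """Apply the same dense layer, with shared weights, to every time step."""
    return [dense(step, weights, biases) for step in sequence]


# Toy values: 3 time steps with 2 features each, mapped to 3 outputs per step
markup_features = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.0]]
weights = [[1.0, 0.0, -1.0],   # weights for input feature 0
           [0.0, 1.0, 1.0]]    # weights for input feature 1
biases = [0.0, 0.0, 0.5]

mixed = time_distributed_dense(markup_features, weights, biases)
print(mixed)
```

The output keeps one row per time step, so the sequence length is preserved while the features inside each step are mixed.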

Image features

In parallel, we prepare the images. We take all the mini image features and transform them into one long list. The information is not changed, just reorganized.
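Flattening only reorders the numbers into one list. A small sketch with a toy 2 x 2 x 3 block of features (the toy values are assumptions; the real features are 8 x 8 x 1536):

```python
def flatten(nested):
    """Recursively unroll nested lists into one flat list, keeping order."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# Toy feature block: 2 x 2 positions with 3 channels each
features = [[[1, 2, 3], [4, 5, 6]],
            [[7, 8, 9], [10, 11, 12]]]
flat = flatten(features)
print(flat)

# Length of the flattened list for the 8 x 8 x 1536 features used in this post
flattened_size = 8 * 8 * 1536
print(flattened_size)
```

For the real features, the flattened list holds 98,304 numbers, which is why a dense layer follows to compress them into a smaller high-level feature.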

Again, to mix signals and extract higher level notions, we apply a dense layer. Since we are only dealing with one input value, we can use a normal dense layer. To connect the image features to the markup features, we copy the image features.

In this case, we have three markup features. Thus, we end up with an equal amount of image features and markup features.

Concatenating the image and markup features

All the sentences are padded to create three markup features. Since we have prepared the image features, we can now add one image feature for each markup feature.

After sticking one image feature to each markup feature, we end up with three image-markup features. This is the input we feed into the decoder.
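The copy-and-stick step can be sketched in plain Python as follows. The helper names and the tiny feature sizes (2 image values, 3 markup values per step) are assumptions for illustration; in the model, Keras's RepeatVector and concatenate do this work.

```python
def repeat_vector(vector, times):
    """Copy one feature vector so there is one copy per time step."""
    return [list(vector) for _ in range(times)]


def concatenate_steps(image_steps, markup_steps):
    """Join the image and markup features of each time step end to end."""
    return [img + mark for img, mark in zip(image_steps, markup_steps)]


# One feature vector for the screenshot
image_feature = [0.9, 0.1]
# Three time steps of markup features (the first one is padding)
markup_features = [[0.0, 0.0, 0.0], [1.0, 0.2, 0.3], [0.4, 0.5, 0.6]]

image_steps = repeat_vector(image_feature, len(markup_features))
decoder_input = concatenate_steps(image_steps, markup_features)
for step in decoder_input:
    print(step)
```

Each of the three rows now carries the same image information next to its own markup information, which is the input the decoder receives.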

The Decoder

Here we use the combined image-markup features to predict the next tag.

In the below example, we use three image-markup feature pairs and output one next tag feature.

Notice that the LSTM layer has return_sequences set to False. Instead of returning an output for every step of the input sequence, it predicts only one feature. In our case, it's a feature for the next tag. It contains the information for the final prediction.
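The difference between returning a sequence and returning a single feature can be shown with a toy recurrent layer. This is deliberately not an LSTM: the state here is just a running average, an assumption chosen so the numbers are easy to follow. Only the shape of the output mirrors what the Keras flag controls.

```python
def toy_recurrent_layer(sequence, return_sequences):
    """Stand-in for an LSTM: the state is a running average of the inputs."""
    state = 0.0
    states = []
    for step_number, value in enumerate(sequence, start=1):
        state = state + (value - state) / step_number
        states.append(state)
    # True: one output per time step. False: only the final state.
    return states if return_sequences else states[-1]


inputs = [2.0, 4.0, 6.0]
all_steps = toy_recurrent_layer(inputs, return_sequences=True)
last_only = toy_recurrent_layer(inputs, return_sequences=False)
print(all_steps)
print(last_only)
```

With the flag off, three input steps collapse into one value that summarizes the whole sequence, which is the role the next tag feature plays in the decoder.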

The final prediction

The dense layer works like a traditional feedforward neural network. It connects the 512 digits in the next tag feature with the 4 final predictions. Say we have 4 words in our vocabulary: start, hello, world, and end.

The vocabulary prediction could be [0.1, 0.1, 0.1, 0.7]. The softmax activation in the dense layer distributes a probability from 0 to 1, with the sum of all predictions equal to 1. In this case, it predicts that the 4th word is the next tag. Then you translate the one-hot encoding [0, 0, 0, 1] into the mapped value, say "end".
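The final step, from raw scores to a word, can be sketched with the standard softmax formula. The raw scores below are made up for illustration; in the model they come from the last dense layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["start", "hello", "world", "end"]
raw_scores = [0.2, 0.2, 0.2, 2.1]  # made-up output of the last layer

probabilities = softmax(raw_scores)
best_index = probabilities.index(max(probabilities))
one_hot = [1 if i == best_index else 0 for i in range(len(vocabulary))]

print([round(p, 2) for p in probabilities])
print(one_hot, "->", vocabulary[best_index])
```

Picking the largest probability is the same operation as the `np.argmax` call used in the prediction code.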

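from os import listdir

import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from keras.layers import (LSTM, Dense, Embedding, Flatten, Input, RepeatVector,
                          TimeDistributed, concatenate)
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Load the images and preprocess them for inception-resnet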
images = []
all_filenames = listdir('images/')
all_filenames.sort()
for filename in all_filenames:
    images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
images = np.array(images, dtype=float)
images = preprocess_input(images)

# Run the images through inception-resnet and extract the features without the classification layer
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
features = IR2.predict(images)

# We will cap each input sequence to 100 tokens
max_caption_len = 100
# Initialize the function that will create our vocabulary 
tokenizer = Tokenizer(filters='', split=" ", lower=False)

# Read a document and return a string
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

# Load all the HTML files
X = []
all_filenames = listdir('html/')
all_filenames.sort()
for filename in all_filenames:
    X.append(load_doc('html/'+filename))

# Create the vocabulary from the html files
tokenizer.fit_on_texts(X)

# Add +1 to leave space for empty words
vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
max_length = max(len(s) for s in sequences)

# Initialize our final input to the model
X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
    for i in range(1, len(seq)):
        # Add the entire sequence to the input and only keep the next word for the output
        in_seq, out_seq = seq[:i], seq[i]
        # If the sentence is shorter than max_length, fill it up with empty words
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # Map the output to one-hot encoding
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # Add the image features corresponding to the HTML file
        image_data.append(features[img_no])
        # Cut the input sentence to max_caption_len (100) tokens, and add it to the input data
        X.append(in_seq[-max_caption_len:])
        y.append(out_seq)

X, y, image_data = np.array(X), np.array(y), np.array(image_data)

# Create the encoder
image_features = Input(shape=(8, 8, 1536,))
image_flat = Flatten()(image_features)
image_flat = Dense(128, activation='relu')(image_flat)
ir2_out = RepeatVector(max_caption_len)(image_flat)

language_input = Input(shape=(max_caption_len,))
language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

# Create the decoder
decoder = concatenate([ir2_out, language_model])
decoder = LSTM(512, return_sequences=False)(decoder)
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[image_features, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'START'
    # iterate over the whole length of the sequence
    for i in range(900):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = np.argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # Print the prediction
        print(' ' + word, end='')
        # stop if we predict the end of the sequence
        if word == 'END':
            break
    return

# Load an image, preprocess it for IR2, extract features and generate the HTML
test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
test_image = np.array(test_image, dtype=float)
test_image = preprocess_input(test_image)
test_features = IR2.predict(np.array([test_image]))
generate_desc(model, tokenizer, np.array(test_features), 100)