Hi everyone, I'm back on schedule, this time with an update on recurrent neural networks.
I seem to be updating less and less lately, and I'm still not in great shape. The first-year class evaluation is coming up, and I hope my friends in Class 51 get a good ranking. This post uses an LSTM/RNN to analyze movie reviews. The network is fairly complex, and after a long training run I really appreciate what a GPU does.
I'm also planning to start a machine learning column, though I don't know what everyone thinks of that. There will be a poll later, and I hope you will all vote. Thank you!
The next update will be on transfer learning. It is already in preparation and coming soon!

import tensorflow as tf
tf.__version__

'2.6.0'

tf.test.is_gpu_available()

WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.





True

Introduction to Recurrent Neural Networks (RNN)

Many problems are sequential in nature: natural language processing, video processing, stock trading data, and so on.

For example:

Jupyter successfully joined the Party in 2021, won a national award in 2021, and secured a recommended graduate-school place in 2022. Ha ha, let me dream first.
You will notice that only the first clause, joining the Party, actually names Jupyter as its subject, yet our reading habits attribute all three achievements to Jupyter. That is sequential context at work.

Multilayer fully connected networks and convolutional networks only process the current input, so they do not handle sequential problems well.
(We have already covered fully connected and convolutional networks in earlier posts.)

The structure of a recurrent neural network (RNN) is quite special. The input of the network at the current step is connected to the output of the network at the previous step, so information from the previous step can be passed on to the next one.
However, a plain RNN suffers from vanishing and exploding gradients, because backpropagation through time multiplies the gradient by the recurrent weights and by the derivative of the tanh activation at every step.

When the sequence is too long, the gradient generated at time t vanishes after being propagated back a few steps along the time axis, so it cannot influence the distant past at all.
An RNN therefore forgets information it received long ago and only remembers what appeared recently, which makes it hard for an RNN to process long texts effectively.
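To make the vanishing gradient concrete, here is a small NumPy sketch (not part of the original notebook). It tracks the size of the derivative of the hidden state at step t with respect to the initial hidden state in a tanh RNN. The hidden size, the random inputs, and the rescaling of the recurrent weights to a largest singular value of 0.5 are all assumptions chosen to make the decay easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8

# Hypothetical recurrent weights, rescaled so the largest singular value is 0.5.
W = rng.standard_normal((hidden, hidden))
W *= 0.5 / np.linalg.norm(W, 2)
U = rng.standard_normal((hidden, hidden))  # hypothetical input weights


def gradient_norms(steps):
    """Spectral norm of d h_t / d h_0 for t = 1..steps in a tanh RNN."""
    h = np.zeros(hidden)
    jac = np.eye(hidden)  # d h_0 / d h_0
    norms = []
    for _ in range(steps):
        x = rng.standard_normal(hidden)
        h = np.tanh(W @ h + U @ x)
        # Chain rule: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W
        jac = (np.diag(1.0 - h ** 2) @ W) @ jac
        norms.append(np.linalg.norm(jac, 2))
    return norms


norms = gradient_norms(50)
print(norms[0], norms[9], norms[49])
```

Each step multiplies the gradient by a matrix whose norm is at most 0.5 here, so the printed values shrink geometrically. With larger recurrent weights the same product can grow instead, which is the exploding case.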

Introduction to Long Short-Term Memory Networks (LSTM)


Problems with RNNs:

Exploding gradients
Vanishing gradients

Solutions:
Exploding gradients can usually be handled with a truncation technique such as gradient clipping (if the norm of the gradient exceeds a given threshold, the gradient is scaled down proportionally).
Vanishing gradients are mitigated by improving the RNN structure with LSTM.
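Clipping by norm can be sketched in a few lines of NumPy. The function name `clip_by_norm` and the example values are made up for illustration; this is a minimal sketch of the idea, not the code Keras runs internally.

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    """Scale grad down proportionally if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad


g = np.array([3.0, 4.0])        # L2 norm is 5
clipped = clip_by_norm(g, 1.0)  # rescaled to norm 1, direction unchanged
print(clipped)                  # [0.6 0.8]
```

In Keras the same effect is usually obtained by passing `clipnorm` to the optimizer, for example `tf.keras.optimizers.Adam(clipnorm=1.0)`.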

The hidden unit at each step of a traditional RNN performs only a simple tanh or ReLU operation.
The basic structure of an LSTM is similar to that of an RNN. The main difference is that the LSTM improves the hidden layer: each neuron in an LSTM acts as a memory cell.

Advantages of LSTM over RNN:

It alleviates the vanishing gradient problem
Its gate structure addresses long-range dependencies
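For readers who want to see the gates, below is a minimal NumPy sketch of a single LSTM time step using the standard equations. The parameter layout (four gate blocks stacked in the order input, forget, candidate, output) and the toy sizes are assumptions for this sketch and are not taken from the Keras implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate
    f = sigmoid(z[1 * units:2 * units])  # forget gate
    g = np.tanh(z[2 * units:3 * units])  # candidate cell content
    o = sigmoid(z[3 * units:4 * units])  # output gate
    c = f * c_prev + i * g               # additive update of the memory cell
    h = o * np.tanh(c)                   # hidden state exposed to the next layer
    return h, c


rng = np.random.default_rng(0)
input_dim, units = 32, 32
W = rng.standard_normal((4 * units, input_dim)) * 0.1
U = rng.standard_normal((4 * units, units)) * 0.1
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.standard_normal((5, input_dim)):  # a 5-step toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state is updated by addition (`f * c_prev + i * g`) rather than by passing everything through a squashing function at every step, which is the usual explanation for why gradients survive longer than in a plain RNN.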

1. Building the dataset yourself

This approach is closer to real-world practice.
Main idea:

Get the data and settle on a data format
Tokenize: English text can be split on whitespace, Chinese text can be segmented with jieba
Build a word index table and assign each word a numeric index
Convert each paragraph of text into a vector of word indices
Convert each paragraph of text into a word embedding matrix
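The steps above can be sketched on a toy corpus with plain Python and NumPy. The helper names `build_word_index` and `to_index_vector`, the two toy sentences, and the random 4-dimensional embedding table are all invented for this illustration. The real notebook uses the Keras `Tokenizer` and a trained `Embedding` layer instead.

```python
from collections import Counter

import numpy as np


def build_word_index(texts):
    """Assign each word an index by frequency; index 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def to_index_vector(text, word_index):
    """Map a text to word indices, skipping words that are not in the table."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


docs = ["the movie was good", "the movie was bad bad"]
word_index = build_word_index(docs)
indices = to_index_vector("the unknown movie", word_index)

# Hypothetical embedding table: one random 4-dimensional row per index.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((len(word_index) + 1, 4))
embedded = embeddings[indices]  # shape: (number of known words, 4)
print(indices, embedded.shape)
```

Looking up rows of the table by index is all that an embedding layer does at inference time. The difference in the real model is that the rows are learned during training.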

import os
import tarfile
import urllib.request
import numpy as np
import re
from random import randint

# Dataset URL
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# Local path for the downloaded archive
file_path = 'data/aclImdb_v1.tar.gz'

if not os.path.exists('data'):
  os.mkdir('data')
if not os.path.isfile(file_path):
  print('downloading')
  result = urllib.request.urlretrieve(url,filename=file_path)
  print('ok',result)
else:
  print(file_path,'already exists!')

downloading
ok ('data/aclImdb_v1.tar.gz', <http.client.HTTPMessage object at 0x7f1599fc17d0>)

# Extract the archive
if not os.path.exists('data/aclImdb'):
  tfile = tarfile.open(file_path,'r:gz')
  print('extracting…')
  result = tfile.extractall('data/') # extractall('data/') unpacks the files into the data directory
  print('ok',result)
else:
  print('data/aclImdb already exists')

extracting…
ok None

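# Read the dataset. (Aside: I am not fluent with re yet and need to brush up.)
# Strip unwanted characters from the text, such as the HTML tag <br/>
def remove_tags(text):
  # re.compile builds a Pattern object from the regular expression.
  # Note: as written, '<[>]+>' only matches '<' followed by '>' characters, so tags
  # such as <br /> are left in place; r'<[^>]+>' would match them.
  re_tag = re.compile(r'<[>]+>')
  return re_tag.sub('',text) # replace every match with an empty string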

# Wrap dataset loading in a function
def read_files(file_type):

  # 1) Collect every file path in file_list and count the positive and negative samples
  path = 'data/aclImdb/'
  file_list = []
  positive_file_path = path+file_type+'/pos/'
  for f in os.listdir(positive_file_path):
    file_list.append(positive_file_path+f)
  positive_num = len(file_list)

  negative_file_path = path+file_type+'/neg/'
  for f in os.listdir(negative_file_path):
    file_list.append(negative_file_path+f)
  negative_num = len(file_list) - positive_num

  print('read',file_type,':',len(file_list))
  print('positive_num',positive_num)
  print('negative_num',negative_num)

  # 2) Build the labels ourselves: in this dataset the folder name is the label
  labels = [[1,0]]*positive_num + [[0,1]]*negative_num # list + list concatenates; list * n repeats the contents
  # 3) Read every review text
  features = []
  for fi in file_list:
    with open(fi,'rt',encoding='utf8') as f:
      features.append(remove_tags(f.read()))

  return features,labels

train_x,train_y = read_files('train')
test_x,test_y = read_files('test')
test_y = np.array(test_y)
train_y = np.array(train_y)

read train : 25000
positive_num 12500
negative_num 12500
read test : 25000
positive_num 12500
negative_num 12500

train_x[0] # features

'It started out slow after an excellent animated intro, as the director had a bunch of characters and school setting to develop. Once the bet is on, though, the movie picks up the pace as it\'s a race against time to see if a certain number of worms can be eaten by 7 pm. We had a good opportunity on the way home to discuss some things with our son: bullies, helping others, mind over matter when you don\'t want to do something.<br /><br />Of special note is the girl who played Erica (Erk): Hallie Kate Eisenberg. The director kinda sneaks her in unexpectedly, and when she is on-screen she is captivating. She\'s one of those "Hey, she looks familiar" faces, and then I remembered that she was the little girl that Pepsi featured about 8 years ago. She was also in "Paulie", that movie about the parrot who tries to find his way home.<br /><br />Ms. Eisenberg made many TV and movie appearances in \'99-00, but then was not seen much for the next few years. She\'s now 14 and is growing up to be a beautiful woman. Her smile really warms up the screen. If she can get some more good roles she could have as good a career (or better?) than Haley Joel Osment, another three named kid actor, but hopefully without some of the problems that Osment has been in lately.<br /><br />Anywhozitz, according to my 8 y.o. son, who just finished reading the story, the film did not seem to follow the book all that well, but was entertaining none the less. The ending of the film seemed like a big setup for some sequels (How to Eat Boiled Slugs? Escargot Kid\'s Style?), which might not be such a bad thing. It was nice to take the family to a movie and not have to worry about language, violence or sex scenes.<br /><br />One other good aspect of the movie was the respect/fear engendered by the principal Mr. Burdock (Boilerplate). Movies nowadays tend to show adult authority figures as buffoons. While he has one particular goofy scene, he ruled the school with a firm hand. 
It was also nice to see Andrea Martin getting some work.'

train_y[0] # a positive review

array([1, 0])
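The review printed above still contains `<br />` tags, because the pattern used in `remove_tags` never matches them, so the results reported in this post were obtained with those tags left in the text. Here is a hedged sketch of what a working tag stripper could look like. The name `strip_tags` is made up for this example, and it assumes the intent is simply to drop anything between `<` and `>`.

```python
import re

# '[^>]+' means "one or more characters that are not '>'", so the whole tag matches.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags(text):
    """Remove HTML-like tags such as <br /> from text."""
    return TAG_RE.sub('', text)


sample = 'way home.<br /><br />Ms. Eisenberg made many TV appearances'
print(strip_tags(sample))
# way home.Ms. Eisenberg made many TV appearances

# For comparison, the pattern from the notebook leaves the sample unchanged.
print(re.compile(r'<[>]+>').sub('', sample) == sample)  # True
```

Replacing each tag with a space instead of an empty string may be preferable, since it keeps the words on either side of a tag from running together.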

2. Data processing

1. Build the vocabulary

token = tf.keras.preprocessing.text.Tokenizer(num_words=4000) # num_words=4000 limits the vocabulary to the most frequent words (indices below 4000)

token.fit_on_texts(train_x) # build the vocabulary from train_x

2. Convert the text to lists of numbers (word index vectors)

train_sequences = token.texts_to_sequences(train_x) # map each word to its index, i.e. its frequency rank
test_sequences = token.texts_to_sequences(test_x)

train_x[0]

'It started out slow after an excellent animated intro, as the director had a bunch of characters and school setting to develop. Once the bet is on, though, the movie picks up the pace as it\'s a race against time to see if a certain number of worms can be eaten by 7 pm. We had a good opportunity on the way home to discuss some things with our son: bullies, helping others, mind over matter when you don\'t want to do something.<br /><br />Of special note is the girl who played Erica (Erk): Hallie Kate Eisenberg. The director kinda sneaks her in unexpectedly, and when she is on-screen she is captivating. She\'s one of those "Hey, she looks familiar" faces, and then I remembered that she was the little girl that Pepsi featured about 8 years ago. She was also in "Paulie", that movie about the parrot who tries to find his way home.<br /><br />Ms. Eisenberg made many TV and movie appearances in \'99-00, but then was not seen much for the next few years. She\'s now 14 and is growing up to be a beautiful woman. Her smile really warms up the screen. If she can get some more good roles she could have as good a career (or better?) than Haley Joel Osment, another three named kid actor, but hopefully without some of the problems that Osment has been in lately.<br /><br />Anywhozitz, according to my 8 y.o. son, who just finished reading the story, the film did not seem to follow the book all that well, but was entertaining none the less. The ending of the film seemed like a big setup for some sequels (How to Eat Boiled Slugs? Escargot Kid\'s Style?), which might not be such a bad thing. It was nice to take the family to a movie and not have to worry about language, violence or sex scenes.<br /><br />One other good aspect of the movie was the respect/fear engendered by the principal Mr. Burdock (Boilerplate). Movies nowadays tend to show adult authority figures as buffoons. While he has one particular goofy scene, he ruled the school with a firm hand. 
It was also nice to see Andrea Martin getting some work.'

type(train_sequences[0])

list

3. Pad the number lists to the same length

'''
tf.keras.preprocessing.sequence.pad_sequences(
    train_sequences,    # two-level nested list of ints or floats
    padding='post',     # 'pre' or 'post': pad with zeros at the start or at the end
    truncating='post',  # 'pre' or 'post': truncate from the start or from the end
    maxlen=400)         # None or an int: maximum length; longer sequences are truncated, shorter ones are zero-padded
'''
train_x = tf.keras.preprocessing.sequence.pad_sequences(train_sequences,
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=400)
test_x = tf.keras.preprocessing.sequence.pad_sequences(test_sequences,
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=400)
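To see what `padding='post'` and `truncating='post'` do, here is a small NumPy re-implementation of just that case. The function `pad_post` is a made-up name for illustration and only mimics the behaviour used in this post, not the full Keras function.

```python
import numpy as np


def pad_post(sequences, maxlen):
    """Zero-pad at the end and truncate at the end, like padding='post', truncating='post'."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]          # truncating='post': drop the tail
        out[row, :len(kept)] = kept  # padding='post': zeros fill the end
    return out


padded = pad_post([[5, 3, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[5 3 8 0]
#  [1 2 3 4]]
```

Index 0 never appears in the Tokenizer output, which is why it is safe to use as the padding value.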

train_x[0]

array([   9,  642,   43,  547,  100,   32,  318, 1121,   14,    1,  164,
         66,    3,  758,    4,  102,    2,  392,  953,    5, 2058,  277,
          1, 2130,    6,   20,  148,    1,   17, 2847,   53,    1, 1059,
         14,   42,    3, 1519,  426,   55,    5,   64,   44,    3,  810,
        608,    4,   67,   27,   31,  690,   72,   66,    3,   49, 1429,
         20,    1,   93,  341,    5,   46,  180,   16,  260,  489, 2753,
        405,  327,  117,  548,   51,   22,   89,  178,    5,   78,  139,
          7,    7,    4,  315,  851,    6,    1,  247,   34,  253, 1861,
          1,  164, 1927,   38,    8,    2,   51,   56,    6,   20,  265,
         56,    6, 3712,  438,   28,    4,  145, 1395,   56,  269, 1076,
       1586,    2,   92,   10, 2024,   12,   56,   13,    1,  114,  247,
         12, 2553,   41,  705,  150,  593,   56,   13,   79,    8,   12,
         17,   41,    1,   34,  494,    5,  166,   24,   93,  341,    7,
          7, 1559,   90,  108,  245,    2,   17, 3309,    8,   18,   92,
         13,   21,  107,   73,   15,    1,  372,  168,  150,  438,  147,
       2425,    2,    6, 1784,   53,    5,   27,    3,  304,  252,   38,
       1822,   63,   53,    1,  265,   44,   56,   67,   76,   46,   50,
         49,  552,   56,   97,   25,   14,   49,    3,  609,   39,  125,
         71,  157,  286,  769,  550,  281,   18, 2353,  206,   46,    4,
          1,  709,   12,   45,   74,    8,    7,    7, 1789,    5,   58,
        705, 1600,  489,   34,   40, 1763,  883,    1,   62,    1,   19,
        119,   21,  303,    5,  790,    1,  271,   29,   12,   70,   18,
         13,  439,  597,    1,  326,    1,  274,    4,    1,   19,  465,
         37,    3,  191,   15,   46, 2278,   86,    5, 1893,  402,   60,
        235,   21,   27,  138,    3,   75,  151,    9,   13,  324,    5,
        190,    1,  220,    5,    3,   17,    2,   21,   25,    5, 3230,
         41, 1098,  564,   39,  380,  136,    7,    7,   28,   82,   49,
       1247,    4,    1,   17,   13,    1, 1158, 1088,   31,    1,  440,
         99, 2876, 2345,    5,  120, 1155, 2576,   14,  134,   26,   45,
         28,  840, 2962,  133,   26,    1,  392,   16,    3,  505,    9,
         13,   79,  324,    5,   64, 1588,  394,   46,  154,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0], dtype=int32)

3. Building the model with an LSTM architecture

model = tf.keras.models.Sequential()

# Embedding layer, which also serves as the input layer
'''
model.add(tf.keras.layers.Embedding(output_dim=32,     # dimension of the output word vectors
                                   input_dim=4000,    # size of the input vocabulary (largest word index + 1)
                                   input_length=400)) # length of each input sequence
'''
model.add(tf.keras.layers.Embedding(output_dim=32,
                                   input_dim=4000,
                                   input_length=400))

# Recurrent layer (alternatives are left commented out)
# model.add(tf.keras.layers.SimpleRNN(units=16)) # RNN
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=32))) # LSTM
# model.add(tf.keras.layers.GlobalAveragePooling1D())
# model.add(tf.keras.layers.Flatten())

# Fully connected layer
model.add(tf.keras.layers.Dense(units=256,activation='relu'))
# Dropout layer to reduce overfitting
model.add(tf.keras.layers.Dropout(0.3))
# Output layer
model.add(tf.keras.layers.Dense(units=2,activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 400, 32)           128000    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                16640     
_________________________________________________________________
dense (Dense)                (None, 256)               16640     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514       
=================================================================
Total params: 161,794
Trainable params: 161,794
Non-trainable params: 0
_________________________________________________________________
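The parameter counts in the summary can be checked by hand. The short script below redoes the arithmetic using the standard formulas: an LSTM has four gate blocks, each with input weights, recurrent weights, and a bias. The variable names are mine.

```python
vocab_size, embed_dim = 4000, 32
lstm_units, dense_units, classes = 32, 256, 2

embedding = vocab_size * embed_dim
# Four gates, each with (input + recurrent) weights plus a bias vector.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction
# The bidirectional layer outputs 2 * lstm_units features.
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * classes + classes

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
# 128000 16640 16640 514 161794
```

It is a coincidence of the chosen sizes that the bidirectional layer and the first dense layer both have 16,640 parameters.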

4. Training

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

history = model.fit(train_x,train_y,validation_split=0.2,epochs=10,batch_size=128,verbose=1)

Epoch 1/10
157/157 [==============================] - 32s 154ms/step - loss: 0.5359 - accuracy: 0.7192 - val_loss: 0.5319 - val_accuracy: 0.7444
Epoch 2/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2882 - accuracy: 0.8847 - val_loss: 0.5372 - val_accuracy: 0.7904
Epoch 3/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2302 - accuracy: 0.9119 - val_loss: 0.3840 - val_accuracy: 0.8646
Epoch 4/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2008 - accuracy: 0.9280 - val_loss: 0.4596 - val_accuracy: 0.8344
Epoch 5/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1862 - accuracy: 0.9327 - val_loss: 0.5627 - val_accuracy: 0.7946
Epoch 6/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1749 - accuracy: 0.9380 - val_loss: 0.5431 - val_accuracy: 0.8148
Epoch 7/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1443 - accuracy: 0.9491 - val_loss: 0.4799 - val_accuracy: 0.8632
Epoch 8/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1283 - accuracy: 0.9553 - val_loss: 0.6568 - val_accuracy: 0.8078
Epoch 9/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1087 - accuracy: 0.9632 - val_loss: 0.6196 - val_accuracy: 0.8314
Epoch 10/10
157/157 [==============================] - 23s 149ms/step - loss: 0.0960 - accuracy: 0.9688 - val_loss: 0.4496 - val_accuracy: 0.8698

import matplotlib.pyplot as plt
def show_train_history(train_history,train_metrics,val_metrics):
  plt.plot(train_history[train_metrics])
  plt.plot(train_history[val_metrics])
  plt.title('Train History')
  plt.ylabel(train_metrics)
  plt.xlabel('epoch')
  plt.legend(['train','validation'],loc='upper left')
  plt.show()

show_train_history(history.history,'loss','val_loss')

[Figure: training and validation loss per epoch (output_36_0.png)]

show_train_history(history.history,'accuracy','val_accuracy')

[Figure: training and validation accuracy per epoch (output_37_0.png)]

Validation accuracy and loss fluctuate while training accuracy keeps rising and training loss keeps falling, so we can roughly conclude that the model is overfitting a little.
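The same conclusion can be read from the numbers in the training log. The snippet below copies the per-epoch losses printed above and finds the epoch with the lowest validation loss. It is a quick check on the logged values, not something the original notebook ran.

```python
# Per-epoch values copied from the training log above.
train_loss = [0.5359, 0.2882, 0.2302, 0.2008, 0.1862,
              0.1749, 0.1443, 0.1283, 0.1087, 0.0960]
val_loss = [0.5319, 0.5372, 0.3840, 0.4596, 0.5627,
            0.5431, 0.4799, 0.6568, 0.6196, 0.4496]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
# Gap between validation and training loss at each epoch.
gap = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]

print(best_epoch)       # 3
print(gap[0], gap[-1])  # -0.004 0.3536
```

Training loss falls at every epoch while validation loss is lowest at epoch 3, and the gap between the two widens from about zero to roughly 0.35. One common remedy, which I have not tried here, is the Keras `EarlyStopping` callback with `monitor='val_loss'` and `restore_best_weights=True`.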

5. Evaluation and prediction

model.evaluate(test_x,test_y,verbose=1) # verbose: 0 = silent, 1 = progress bar, 2 = one line per epoch

782/782 [==============================] - 40s 51ms/step - loss: 0.5644 - accuracy: 0.8374





[0.5644006133079529, 0.8374000191688538]

pre = model.predict(test_x)

pre[0],test_y[0]

(array([9.996530e-01, 3.470438e-04], dtype=float32), array([1, 0]))

# Applying the model to a review I wrote myself
x = ["This is really a junk movie. Jupyter doesn't like it. Thank you! It's really bad"]
x = token.texts_to_sequences(x)
x = tf.keras.preprocessing.sequence.pad_sequences(x,
                                                padding='post',
                                                truncating='post',
                                                maxlen=400)
x

array([[  11,    6,   63,    3, 2579,   17,  149,   37,    9, 1289,   22,
          42,   63,   75,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)

y = model.predict(x)
y

array([[0.12796064, 0.8720394 ]], dtype=float32)

state = {0:'pos',1:'neg'}
state[np.argmax(y)]

'neg'
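`np.argmax(y)` without an axis works here because `y` holds a single prediction. For a batch of reviews, the argmax has to be taken per row. The sketch below uses the two probability vectors already shown in this post as stand-in predictions, and the variable names are mine.

```python
import numpy as np

state = {0: 'pos', 1: 'neg'}

# Stand-in predictions: the two softmax outputs printed earlier in this post.
probs = np.array([[0.12796064, 0.8720394],
                  [9.996530e-01, 3.470438e-04]])

# axis=1 picks the most probable class for each row (each review) separately.
labels = [state[i] for i in np.argmax(probs, axis=1)]
print(labels)  # ['neg', 'pos']
```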