I have some manga data; I even wrote an article explaining how to collect this data set (with some modifications), see: https://towardsdatascience.com/scrape-multiple-pages-with-scrapy-ea8edfa4318

For each manga, I have the URL of its picture in the poster column. I will build my image corpus from these URLs.
import requests
from tqdm import tqdm

# Strip "%" characters from the titles
manga["title"] = manga["title"].replace({"%": ""}, regex=True)
# Keep only rows that have a non-empty title and a poster URL
manga = manga[(manga["title"] != "") & (manga["title"].notnull()) & (manga["poster"].notnull())]

keep_in_memory = []
for url, title in tqdm(zip(manga["poster"], manga["title"]), total=len(manga)):
    # Local path where this poster will be stored
    str_name = "/home/jupyter/Untitled Folder/%s.jpg" % title
    str_name = str_name.strip()
    keep_in_memory.append(str_name)
    # Download the poster and write it to disk
    with open(str_name, 'wb') as f:
        f.write(requests.get(url).content)
# Keep the local path of each poster next to its row
manga["pics_ref"] = keep_in_memory
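The download loop uses each title directly as a file name, so a title containing a character such as "/" would point to a folder that does not exist. As a minimal sketch of one possible guard, the helper below replaces such characters before the path is built. The name `safe_filename` and the list of replaced characters are assumptions for illustration; they are not part of the original code.

```python
import re

def safe_filename(title, replacement="_"):
    """Replace characters that are risky in file names and trim whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|%]', replacement, title).strip()
    # Fall back to a placeholder if nothing usable is left (hypothetical choice)
    return cleaned or "untitled"

# Hypothetical titles
print(safe_filename("Fate/stay night"))
print(safe_filename(" 100% Perfect Girl "))
```

If a helper like this is used, the cleaned name should be the one stored in the `pics_ref` column so that the paths still match the files on disk.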
I will use the genre column as the label for my image classifier. It needs a little cleaning first, because it contains ~3000 unique labels; I will clean the labels and reduce their number.
import re
import pandas as pd

manga["genre"] = manga["genre"].str.strip()
# Append a trailing comma so the regex below always finds a closing ","
manga["genre"] = manga["genre"] + " ,"

def clean(genre, string_to_filter):
    # Delete everything between string_to_filter and the last comma
    return re.sub('(?<={})(.*n?)(?=,)'.format(string_to_filter), '', genre)

for keyword in ['Comic', 'Action', 'Josei', 'Shoujo', 'Shounen', 'Horror']:
    manga["genre"] = manga["genre"].apply(lambda x: clean(str(x), keyword))

# Strip the whitespace around each comma-separated label
manga['genre'] = [list(map(str.strip, y)) for y in [x.split(',') for x in manga['genre']]]
manga['genre'] = [','.join(x) for x in manga['genre']]

# Categories kept as final labels
my_cat = ['Action', 'Adventure', 'Comedy', 'Hentai',
          'Harem', 'Fantasy', 'Drama', 'Horror', 'Romance', 'Josei',
          'Seinen', 'Sci-Fi', 'Slice of Life', 'Mecha', 'Yaoi',
          'Yuri', 'Thriller', 'Comic']
# Keep only the categories that appear in each genre string
manga["genre"] = manga["genre"].apply(lambda z: [x for x in my_cat if x in z])
manga['genre'] = [','.join(x) for x in manga['genre']]

# One binary column per label
genre = manga["genre"].str.get_dummies(',')
manga = pd.concat([manga, genre], axis=1)
I now have 18 labels, one-hot encoded into a binary matrix with one column per label.
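To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of binary matrix that `str.get_dummies(',')` builds from comma-separated labels. The function name `multi_hot` and the sample rows are illustrative assumptions, not part of the original code.

```python
def multi_hot(rows, sep=","):
    """Return (sorted label names, binary matrix) for separator-joined labels."""
    # Split every row into its labels, ignoring empty strings
    split_rows = [[lab for lab in row.split(sep) if lab] for row in rows]
    # One column per distinct label, in alphabetical order
    labels = sorted({lab for labs in split_rows for lab in labs})
    matrix = [[1 if lab in labs else 0 for lab in labels] for labs in split_rows]
    return labels, matrix

# Hypothetical genre strings; the last manga has no label left after cleaning
sample = ["Action,Comedy", "Romance", "Action,Romance", ""]
labels, matrix = multi_hot(sample)
print(labels)
for row in matrix:
    print(row)
```

Each row can contain several 1s, which is what makes this a multi-label problem rather than a single-class one.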

I have another problem: my labels are imbalanced, so I define class weights that I will pass to the model later.
label_cnt = [(columns, manga[columns].sum()) for columns in manga.columns[8:]]
tot = sum([x[1] for x in label_cnt])
class_weight = dict(zip([x[0] for x in label_cnt], [x[1]/tot for x in label_cnt]))
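The weights computed above are proportional to each label's frequency, so frequent labels receive the largest weights. A common alternative, sketched here as an assumption rather than as the author's method, is to weight each label by the inverse of its frequency so that rare labels count more. The sketch also keys the dictionary by integer index, since Keras documents `class_weight` as a mapping from class indices to weights. The function name and the counts are hypothetical.

```python
def inverse_frequency_weights(label_counts):
    """Map each label index to total / (n_labels * count); rare labels weigh more.

    Assumes every count is strictly positive.
    """
    total = sum(label_counts.values())
    n_labels = len(label_counts)
    return {
        i: total / (n_labels * count)
        for i, count in enumerate(label_counts.values())
    }

# Hypothetical label counts
counts = {"Action": 600, "Comedy": 300, "Mecha": 100}
weights = inverse_frequency_weights(counts)
print(weights)
```

With these counts, the rarest label ("Mecha", index 2) gets the largest weight and the most frequent one ("Action", index 0) the smallest.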
Next, we build our input matrix.
import numpy as np
from keras.preprocessing import image
from sklearn.model_selection import train_test_split

train_image = []
index_to_drop = []
for i, png in tqdm(list(enumerate(manga["pics_ref"]))):
    try:
        # Load each poster as a 256x256 RGB array scaled to [0, 1]
        img = image.load_img(png, target_size=(256, 256))
        img = image.img_to_array(img)
        img = img / 255
        train_image.append(img)
    except OSError:
        # Remember the rows whose image could not be read
        index_to_drop.append(i)
# Drop those rows so that X and y stay aligned
manga = manga.drop(manga.index[index_to_drop])
X = np.array(train_image)
y = np.array(manga[manga.columns[8:].tolist()])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=777)
Then we define our model. Whatever the architecture, one point matters: the output layer must use a sigmoid activation and not a softmax. A softmax makes the probabilities of all labels sum to 1, whereas here we want an independent probability of belonging to each label.
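The difference between the two activations is easy to see on a small example. The sketch below applies both to the same raw scores using only the standard library; the scores are hypothetical and stand in for the outputs of the last layer for three genres.

```python
import math

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """One independent probability per label."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical raw scores: the first two genres both fit the poster well
logits = [3.0, 2.5, -4.0]
print("softmax:", [round(p, 3) for p in softmax(logits)])
print("sigmoid:", [round(p, 3) for p in sigmoid(logits)])
```

With softmax, the two strong genres have to share a total probability of 1; with sigmoid, both can be above 0.9 at the same time, which is what a multi-label classifier needs.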
And finally, we augment the data with the Keras ImageDataGenerator class, which generates new images by shifting the RGB channels, zooming, flipping the image, and more.
Our model was fitted with very good accuracy, but accuracy is not the ideal metric for this task: with imbalanced multi-label data, most entries of the label matrix are 0, so predicting mostly zeros already scores well. Let's try the model and save it.
import matplotlib.pyplot as plt

# Pick a random poster
idx = np.random.randint(0, manga.shape[0])
poster = manga["pics_ref"].iloc[idx]
img = image.load_img(poster, target_size=(256, 256))
img = image.img_to_array(img)
img = img / 255
classes = np.array(manga.columns[8:])
proba = model.predict(img.reshape(1, 256, 256, 3))
# Indices of the three highest probabilities, in descending order
top_3 = np.argsort(proba[0])[:-4:-1]
for i in range(3):
    print("{} ({:.3})".format(classes[top_3[i]], proba[0][top_3[i]]))
plt.imshow(img)
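The expression `np.argsort(proba[0])[:-4:-1]` sorts the indices by ascending probability and then reads the last three in reverse order, which gives the three most likely labels. The same selection can be written in pure Python, as in this small sketch; the function name `top_k`, the labels, and the probabilities are hypothetical.

```python
def top_k(probabilities, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(labels[i], probabilities[i]) for i in order[:k]]

# Hypothetical labels and predicted probabilities for one poster
labels = ["Action", "Comedy", "Drama", "Horror", "Romance"]
probabilities = [0.10, 0.80, 0.35, 0.05, 0.60]
for label, p in top_k(probabilities, labels):
    print("{} ({:.3})".format(label, p))
```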

from keras.models import model_from_json

# Serialize the model architecture to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# Serialize the weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

# Load the JSON architecture back
with open("model.json", "r") as json_file:
    loaded_model_json = json_file.read()
'''
loaded_model = model_from_json(loaded_model_json)
loaded_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
'''
In any case, we are not interested in the accuracy of our classifier but in the features learned during training, which will allow us to build our recommendations! First, we rebuild a model from our trained model, removing the final prediction layer.
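from keras.models import Model
from sklearn.metrics.pairwise import cosine_similarity

# Feature extractor: same input, output taken from the layer just before the prediction layer
image_features_extractor = Model(inputs=model.input, outputs=model.layers[-2].output)
img_features = image_features_extractor.predict(X)
# Pairwise cosine similarity between the feature vectors of all posters
cosSimilarities = cosine_similarity(img_features)
cos_similarities_df = pd.DataFrame(cosSimilarities,
                                   columns=manga["title"],
                                   index=manga["title"])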
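To see what the similarity matrix encodes, here is a minimal pure-Python sketch of cosine similarity and of ranking items by it. The titles "A" to "D" and their 3-dimensional vectors are hypothetical stand-ins for the much larger feature vectors produced by the network.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(features, query, k=2):
    """Return the k (title, score) pairs closest to `query`, excluding itself."""
    scores = {
        title: cosine(features[query], vec)
        for title, vec in features.items()
        if title != query
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical feature vectors
features = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.9, 0.1, 1.1],
    "C": [0.0, 1.0, 0.0],
    "D": [1.0, 1.0, 1.0],
}
print(most_similar(features, "A"))
```

Here "B" points in almost the same direction as "A" and ranks first, while "C" is orthogonal to "A" and gets a similarity of 0.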
And finally, we retrieve the posters that are most similar to a given poster.
def most_similar_to(given_title, nb_closest_images=5):
    # The similarity DataFrame is indexed by title, so map each title to its
    # poster path (this assumes titles are unique)
    title_to_path = dict(zip(manga["title"], manga["pics_ref"]))
    print("-----------------------------------------------------------------------")
    print("original manga:")
    original = image.load_img(title_to_path[given_title], target_size=(256, 256))
    plt.imshow(original)
    plt.show()
    print("-----------------------------------------------------------------------")
    print("most similar manga:")
    # Sort by descending similarity and skip the first entry (the manga itself)
    closest = cos_similarities_df[given_title].sort_values(ascending=False)[1:nb_closest_images + 1]
    for title, score in closest.items():
        img = image.load_img(title_to_path[title], target_size=(256, 256))
        plt.imshow(img)
        plt.show()
        print("similarity score : ", score)

- recommendations

It looks good! But it could be better!
To conclude
We programmed a convolutional neural network to classify the genres of our manga, and then we reused the features learned during training to build a recommendation system. Several uses for the same model: rather nice, isn't it? Thank you! The code is here: https://github.com/AlexWarembourg/Medium