I- Choosing the dataset
Choosing a suitable dataset -> at least 2 GB, without images, since image processing is about recognition and that's not what we're doing
We searched on Kaggle (more details on that below)
Found an 8 GB dataset on anime, and I love anime, so we went with it: https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset
There are 6 files: anime-dataset-2023.csv, anime-filtered.csv, final_animedataset.csv, user-filtered.csv, users-details-2023.csv, users-score-2023.csv
The filtered and final files are meant to be ready to use more quickly, depending on the analysis you plan to run on them. We prefer to take the anime-dataset-2023, users-score-2023 and user-filtered tables
Here is a description of our tables:
------------------------------------------ "anime-dataset-2023.csv" ----------------------------------------------
anime_id: Unique ID for each anime.
Name: The name of the anime in its original language.
English name: The English name of the anime.
Other name: Native name or title of the anime (can be in Japanese, Chinese or Korean).
Score: The score or rating given to the anime.
Genres: The genres of the anime, separated by commas.
Synopsis: A brief description or summary of the anime's plot.
Type: The type of the anime (e.g., TV series, movie, OVA, etc.).
Episodes: The number of episodes in the anime.
Aired: The dates when the anime was aired.
Premiered: The season and year when the anime premiered.
Status: The status of the anime (e.g., Finished Airing, Currently Airing, etc.).
Producers: The production companies or producers of the anime.
Licensors: The licensors of the anime (e.g., streaming platforms).
Studios: The animation studios that worked on the anime.
Source: The source material of the anime (e.g., manga, light novel, original).
Duration: The duration of each episode.
Rating: The age rating of the anime.
Rank: The rank of the anime based on popularity or other criteria.
Popularity: The popularity rank of the anime.
Favorites: The number of times the anime was marked as a favorite by users.
Scored By: The number of users who scored the anime.
Members: The number of members who have added the anime to their list on the platform.
Image URL: The URL of the anime's image or poster.
The dataset offers valuable information for analyzing and comprehending the characteristics, ratings, popularity, and viewership of various anime shows. By utilizing this dataset, one can conduct a wide range of analyses, including identifying the highest-rated anime, exploring the most popular genres, examining the distribution of ratings, and gaining insights into viewer preferences and trends. Additionally, the dataset facilitates the creation of recommendation systems, time series analysis, and clustering to delve deeper into anime trends and user behavior.
--------------------------------------------- "users-details-2023.csv" ------------------------------------------------
Mal ID: Unique ID for each user.
Username: The username of the user.
Gender: The gender of the user.
Birthday: The birthday of the user (in ISO format).
Location: The location or country of the user.
Joined: The date when the user joined the platform (in ISO format).
Days Watched: The total number of days the user has spent watching anime.
Mean Score: The average score given by the user to the anime they have watched.
Watching: The number of anime currently being watched by the user.
Completed: The number of anime completed by the user.
On Hold: The number of anime on hold by the user.
Dropped: The number of anime dropped by the user.
Plan to Watch: The number of anime the user plans to watch in the future.
Total Entries: The total number of anime entries in the user's list.
Rewatched: The number of anime rewatched by the user.
Episodes Watched: The total number of episodes watched by the user.
The User Details Dataset provides valuable information for analyzing user behavior and preferences on the anime platform. By examining mean scores and anime genres, you can gain insights into user preferences. Users can be segmented into different groups based on their watching behavior, such as active users and casual viewers. Personalized recommendation systems can be built using users' completed and plan-to-watch lists. Location-based analysis reveals anime popularity and user engagement in various countries. Trends in watching behavior, user retention, and gender-based differences in anime preferences can be identified. Additionally, you can explore rewatching habits and perform time series analysis to understand user engagement patterns over time.
---------------------------------------------- "users-score-2023.csv" -------------------------------------------------
user_id: Unique ID for each user.
Username: The username of the user.
anime_id: Unique ID for each anime.
Anime Title: The title of the anime.
rating: The rating given by the user to the anime.
The User Score Dataset enables various analyses and insights into user interactions with anime. By examining user ratings for different anime titles, you can identify highly-rated and popular anime among users. Additionally, you can explore user preferences and watch patterns for specific anime titles. This dataset also forms the foundation for building recommendation systems based on user ratings, helping to suggest anime that align with individual tastes. Furthermore, you can perform collaborative filtering and similarity analysis to discover patterns of similar user interests. Overall, this dataset offers valuable information for understanding user engagement and preferences on the anime platform.
user-filtered contains just the anime ID, the ID of the user who gave the rating, and the rating
To get the dataset onto the VM, I had to install a browser extension that generates a wget command (since Kaggle provides no direct download link); the command was put in a script to make things easier. Note that the signed URL it produces expires (X-Goog-Expires=259200, i.e. 72 hours), so it has to be regenerated if the download is redone later
mkdir dataset
cd dataset
(don't forget to chmod +x the scripts beforehand)
./arch_dl.sh:
wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,/;q=0.8,application/signed-exchange;v=b3;q=0.7" --header="Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6,zh-CN;q=0.5,zh;q=0.4" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-data-sets/3384322/6207733/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240519%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240519T145233Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=1ac8f9216a239f62f3aa19666ce2b09c188d1d34d5199cf254a3677292e1b893eb10d0e2280baf0cbfb1f21d38a2b99f55e3e080beaa4a376d07326750503e15f35e123e2efd21c2c300a82c5bc06c787528bbe5e0d6b7be5a31bc0e6fb458b9a59456233fb852c658827d1dd547ca683890de508dd88940526568357bdd28611409ed5db0e479abf7b6f98855cd942d0cebfae55d463f288640c594bce7e11cd9f460e941cec80a7713e7faa54e69e3e9c4e9e3cd87b11bc35aa74439f96f80c2d592c6a97519353ca099d62e7276bec190a99e9327aee45ab9531d86f8f6be65fb3931148dbd4342712849494a71adcfe0b4eb54051582393fe8a98ebf68bc" -c -O 'archive.zip'
./arch_unzip.sh:
sudo dnf install -y unzip
unzip archive.zip
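Quick sanity check that the download and extraction went through (a sketch):
ls -lh archive.zip   # the archive should be several GB if the download completed
unzip -l archive.zip   # the 6 CSVs listed above should appear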
We must first understand the dataset we chose, which means analyzing it; a few quick looks below
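A first pass to get familiar with the files (a sketch, run from ~/dataset; column names as documented above):
head -n 3 anime-dataset-2023.csv   # header + 2 sample rows
head -n 3 users-score-2023.csv
wc -l users-score-2023.csv   # number of ratings (+1 for the header)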
II- Analysis of the dataset and development of the analysis strategy
III- Data preprocessing
IV- Building the Big Data platform
A- Preparing the VM
Install git and clone the repo to get access to the containers:
sudo dnf -y install wget git
git clone https://gitlab.com/FormationOLB/ensiie.git
We now have access to docker, hive, ...
B- Choice of tools
using Hadoop and Hive (via docker)
visualization with OpenSearch
starting the hadoop and hive containers:
./start-hadoop-ensiie.sh (run it twice as a check)
checking that the containers are up:
docker compose -f ~/ensiie/exo/hadoop/docker-compose-ensiie-v3.yml ps
no issues; some containers (datanode1 & 2) have been up for 2 weeks, but I'd rather not touch them before looking at HDFS
So we have 4 services: hdfs, hive, yarn and spark
WITH THE INSTRUCTOR'S COMMANDS:
We edit the YAML to add a volume and put our files on the container:
~/dataset/data_selected:/data/hdfs/dataset
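For reference, that line goes under the volumes: section of the namenode service in ~/ensiie/exo/hadoop/docker-compose-ensiie-v3.yml; roughly (a sketch; the service name is assumed from the docker exec below, other settings unchanged):
  namenode:
    volumes:
      - ~/dataset/data_selected:/data/hdfs/dataset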
We open the namenode:
docker exec -it namenode bash
We go to the right directory:
mkdir /data/hdfs/dataset
cd /data/hdfs/dataset
We check that it's working:
hdfs dfs -df -h
We create the directory in HDFS:
hdfs dfs -mkdir -p /users/projet
If we need to list it:
hdfs dfs -ls /users/projet
/////////////////
In the end we redo the mount with a volume instead, it'll go faster haha
see the Karnas report -> how the git repo is used
We put everything on HDFS:
hdfs dfs -ls / -> 3 directories
we go to /
hdfs dfs -put /data/hdfs/files/dataset /dataset
hdfs dfs -chown nnn /dataset
verification here:
http://162.19.124.170:9870/
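The same check can also be done from inside the namenode (a sketch):
hdfs dfs -ls /dataset   # the CSVs should show up, owned by nnn
hdfs dfs -du -h /dataset   # sizes should match the local files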
Once the containers are up and the HDFS config is done (basically, once we have access to the files), we move on to the Hive part
C- Construction (aka the heart of the matter)
We switch to Hive:
docker exec -it hive-server bash
We switch to beeline:
/opt/hive/bin/beeline -u jdbc:hive2://hive-server:10000
IF CONNECTION ISSUES:
We check whether HiveServer2 is running:
ps aux | grep HiveServer2
/opt/hive/bin/hiveserver2 &
that doesn't work
We check our port number:
nc -zv hive-server 10000
We retry the connection:
/opt/hive/bin/beeline -u jdbc:hive2://hive-server:10000
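If the connection fails right after launching HiveServer2, it may simply still be starting up; a small wait loop, reusing the nc check from above (a sketch):
until nc -z hive-server 10000; do echo "waiting for HiveServer2..."; sleep 2; done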
Now we need to build the database -> see DB_commands
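The actual statements are in DB_commands; as a minimal sketch of the idea (database and table names here are illustrative, and it assumes each CSV was put in its own HDFS directory, e.g. /dataset/users-score-2023):
beeline -u jdbc:hive2://hive-server:10000 -e "
CREATE DATABASE IF NOT EXISTS anime;
-- external table over the ratings CSV; OpenCSVSerde handles quoted fields
-- (titles contain commas) but types every column as STRING
CREATE EXTERNAL TABLE IF NOT EXISTS anime.users_score (
  user_id STRING, username STRING, anime_id STRING, anime_title STRING, rating STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/dataset/users-score-2023'
TBLPROPERTIES ('skip.header.line.count'='1');"
Ratings can then be cast at query time, e.g. CAST(rating AS INT)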
V- Representative analyses
A- Recap of the strategy