diff --git a/sake.txt b/sake.txt
new file mode 100644
index 0000000000000000000000000000000000000000..837d5390f91c5b16e9825382e0a4f4615b12d23f
--- /dev/null
+++ b/sake.txt
@@ -0,0 +1,140 @@
+I- Choosing the dataset
+
+Criteria for a suitable dataset -> at least 2 GB, and no images, because images would call for recognition processing, which is not what we are doing.
+We searched on Kaggle (details on it below).
+
+We found an 8 GB dataset about anime, and I love anime, so we went with it: https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset
+
+There are 6 files: anime-dataset-2023.csv, anime-filtered.csv, final_animedataset.csv, user-filtered.csv, users-details-2023.csv, users-score-2023.csv
+
+The filtered and final files are meant to be ready to use more quickly, depending on the analysis planned. We prefer to take the unfiltered tables and will set the filtered ones aside.
+Here is a description of our tables:
+------------------------------------------ "anime-dataset-2023.csv" ----------------------------------------------
+anime_id: Unique ID for each anime.
+Name: The name of the anime in its original language.
+English name: The English name of the anime.
+Other name: Native name or title of the anime (can be in Japanese, Chinese, or Korean).
+Score: The score or rating given to the anime.
+Genres: The genres of the anime, separated by commas.
+Synopsis: A brief description or summary of the anime's plot.
+Type: The type of the anime (e.g., TV series, movie, OVA, etc.).
+Episodes: The number of episodes in the anime.
+Aired: The dates when the anime was aired.
+Premiered: The season and year when the anime premiered.
+Status: The status of the anime (e.g., Finished Airing, Currently Airing, etc.).
+Producers: The production companies or producers of the anime.
+Licensors: The licensors of the anime (e.g., streaming platforms).
+Studios: The animation studios that worked on the anime.
+Source: The source material of the anime (e.g., manga, light novel, original).
+Duration: The duration of each episode.
+Rating: The age rating of the anime.
+Rank: The rank of the anime based on popularity or other criteria.
+Popularity: The popularity rank of the anime.
+Favorites: The number of times the anime was marked as a favorite by users.
+Scored By: The number of users who scored the anime.
+Members: The number of members who have added the anime to their list on the platform.
+Image URL: The URL of the anime's image or poster.
+The dataset offers valuable information for analyzing and comprehending the characteristics, ratings, popularity, and viewership of various anime shows. By utilizing this dataset, one can conduct a wide range of analyses, including identifying the highest-rated anime, exploring the most popular genres, examining the distribution of ratings, and gaining insights into viewer preferences and trends. Additionally, the dataset facilitates the creation of recommendation systems, time series analysis, and clustering to delve deeper into anime trends and user behavior.
+
+--------------------------------------------- "users-details-2023.csv" ------------------------------------------------
+Mal ID: Unique ID for each user.
+Username: The username of the user.
+Gender: The gender of the user.
+Birthday: The birthday of the user (in ISO format).
+Location: The location or country of the user.
+Joined: The date when the user joined the platform (in ISO format).
+Days Watched: The total number of days the user has spent watching anime.
+Mean Score: The average score given by the user to the anime they have watched.
+Watching: The number of anime currently being watched by the user.
+Completed: The number of anime completed by the user.
+On Hold: The number of anime on hold by the user.
+Dropped: The number of anime dropped by the user.
+Plan to Watch: The number of anime the user plans to watch in the future.
+Total Entries: The total number of anime entries in the user's list.
+Rewatched: The number of anime rewatched by the user.
+Episodes Watched: The total number of episodes watched by the user.
+The User Details Dataset provides valuable information for analyzing user behavior and preferences on the anime platform. By examining mean scores and anime genres, you can gain insights into user preferences. Users can be segmented into different groups based on their watching behavior, such as active users and casual viewers. Personalized recommendation systems can be built using users' completed and plan-to-watch lists. Location-based analysis reveals anime popularity and user engagement in various countries. Trends in watching behavior, user retention, and gender-based differences in anime preferences can be identified. Additionally, you can explore rewatching habits and perform time series analysis to understand user engagement patterns over time.
+
+---------------------------------------------- "users-score-2023.csv" -------------------------------------------------
+user_id: Unique ID for each user.
+Username: The username of the user.
+anime_id: Unique ID for each anime.
+Anime Title: The title of the anime.
+rating: The rating given by the user to the anime.
+The User Score Dataset enables various analyses and insights into user interactions with anime. By examining user ratings for different anime titles, you can identify highly-rated and popular anime among users. Additionally, you can explore user preferences and watch patterns for specific anime titles. This dataset also forms the foundation for building recommendation systems based on user ratings, helping to suggest anime that align with individual tastes. Furthermore, you can perform collaborative filtering and similarity analysis to discover patterns of similar user interests. Overall, this dataset offers valuable information for understanding user engagement and preferences on the anime platform.
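+
+Once the files are downloaded (see the scripts below), a quick way to sanity-check that the column lists above match the actual CSV headers (a minimal sketch, assuming the files sit in ~/dataset):
+
+cd ~/dataset
+for f in anime-dataset-2023.csv users-details-2023.csv users-score-2023.csv; do
+  echo "== $f =="
+  head -n 1 "$f" | tr ',' '\n'   # print each column name on its own line
+done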
+
+To get the dataset onto the VM, I had to install a browser extension that produces a wget command (since Kaggle does not expose a direct download link), so the command was put in a script to make things easier.
+
+mkdir dataset
+cd dataset
+
+(don't forget to chmod +x the scripts beforehand)
+
+Contents of ./arch_dl.sh:
+wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,/;q=0.8,application/signed-exchange;v=b3;q=0.7" --header="Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6,zh-CN;q=0.5,zh;q=0.4" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-data-sets/3384322/6207733/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240519%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240519T145233Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=1ac8f9216a239f62f3aa19666ce2b09c188d1d34d5199cf254a3677292e1b893eb10d0e2280baf0cbfb1f21d38a2b99f55e3e080beaa4a376d07326750503e15f35e123e2efd21c2c300a82c5bc06c787528bbe5e0d6b7be5a31bc0e6fb458b9a59456233fb852c658827d1dd547ca683890de508dd88940526568357bdd28611409ed5db0e479abf7b6f98855cd942d0cebfae55d463f288640c594bce7e11cd9f460e941cec80a7713e7faa54e69e3e9c4e9e3cd87b11bc35aa74439f96f80c2d592c6a97519353ca099d62e7276bec190a99e9327aee45ab9531d86f8f6be65fb3931148dbd4342712849494a71adcfe0b4eb54051582393fe8a98ebf68bc" -c -O 'archive.zip'
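+
+Note that this signed URL expires (X-Goog-Expires=259200 seconds, i.e. 3 days), so the script has to be regenerated after that. An alternative sketch using the official Kaggle CLI (assumes a Kaggle account and an API token in ~/.kaggle/kaggle.json):
+
+pip install kaggle
+kaggle datasets download -d dbdmobile/myanimelist-dataset -p ~/dataset --unzip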
+
+Contents of ./arch_unzip.sh:
+sudo dnf install -y unzip
+unzip archive.zip
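+
+To double-check the archive and the extraction (a quick sketch):
+
+unzip -l archive.zip   # list the archive contents without extracting
+ls -lh *.csv           # the six CSVs listed above should be present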
+
+
+
+We must first start by understanding the dataset we chose, and therefore analyze it.
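+
+A minimal first look at the data, before the real analysis (a sketch, assuming the CSVs are in ~/dataset):
+
+cd ~/dataset
+wc -l *.csv                      # row count per file
+du -h *.csv                      # size per file
+head -n 3 users-score-2023.csv   # peek at the first records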
+
+II- Dataset analysis and development of the analysis strategy
+
+
+
+III- Data preprocessing
+
+
+IV- Building the Big Data platform
+
+	A- Preparing the VM
+Install git and clone the repo to get access to the containers:
+
+sudo dnf -y install wget git
+git clone https://gitlab.com/FormationOLB/ensiie.git
+
+
+We now have access to docker, hive, ...
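+
+A quick sanity check that the tooling is in place (sketch):
+
+git --version
+docker --version
+docker compose version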
+
+	B- Choice of tools
+
+We use Hadoop and Hive (as docker containers);
+visualization is done with OpenSearch.
+
+
+Start the hadoop and hive containers:
+./start-hadoop-ensiie.sh   (run it a second time to verify everything came up)
+
+Check which containers are running:
+docker compose -f ~/ensiie/exo/hadoop/docker-compose-ensiie-v3.yml ps
+
+No issues; some containers (datanode1 & datanode2) have been up for 2 weeks, but I prefer not to touch them before looking at HDFS.
+
+So we have 4 services: hdfs, hive, yarn, and spark.
+
+We open a shell in the namenode container:
+docker exec -it namenode bash
+
+We move into the right directory:
+mkdir /data/hdfs/dataset
+cd /data/hdfs/dataset 
+We check that HDFS is working:
+hdfs dfs -df -h 
+
+We create the directory in HDFS:
+hdfs dfs -mkdir -p /users/projet
+If we need to list it:
+hdfs dfs -ls /users/projet
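+
+The natural next step is to push a CSV into HDFS (a sketch; paths are hypothetical and assume the file was copied into the container first):
+
+# from the host: copy a file into the namenode container
+docker cp ~/dataset/anime-dataset-2023.csv namenode:/data/hdfs/dataset/
+# inside the container: upload to HDFS and list the result
+hdfs dfs -put /data/hdfs/dataset/anime-dataset-2023.csv /users/projet/
+hdfs dfs -ls /users/projet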
+
+
+Verification in the NameNode web UI:
+http://162.19.124.170:9870/
+	C- Construction (aka the heart of the matter)
+
+
+V- Representative analyses
+	A- Recap of the strategy
+