Vietnam War Bombings Analysis

Description

This project was made to inform people about the Vietnam War, for the following reasons:

Common Knowledge

Although the war is relatively recent, most schools do not teach it, which is why we think young people should learn about it.

History is cyclical

As the Spanish philosopher George Santayana wrote, “Those who cannot remember the past are condemned to repeat it.” That is what we are trying to prevent.

Easy to understand

We thought that the clearest way to show the impact was visually, so our results are presented mostly as charts and maps.

Why do we use big data?

At the time of the project we had 3 datasets with a combined size of 1.5GB. Because they were incomplete, we hoped to find more information in the future. Given that, and given that we had to run multiple operations over the datasets to draw conclusions and build charts, we considered the computational cost too high and in need of optimization, which is why we use big data.

What can we achieve with large-scale processing?

With large-scale processing we can compute more in less time, which allows us to take all the data we can find into account without discarding any information. This will be very helpful if the dataset grows, which is likely given that it is incomplete.

Model Description

As said before, all of our information comes from 3 datasets, which are publicly available on Kaggle. The names of these datasets are:

THOR_Vietnam_Bombing_Operations.csv

THOR_Vietnam_Aircraft_Glossary.csv

THOR_Vietnam_Weapons_Glossary.csv

We used the pyspark.sql module for all the filtering and grouping operations, and to compute statistics such as the mean, standard deviation and variance.

We had to filter out many rows so that some of the libraries (such as Plotly, used for the maps) could work with a smaller amount of data. We hope to find a way to take all the data into account, perhaps with another mapping library optimized for large volumes of data.
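One possible way to shrink the input to a mapping library without discarding records is to aggregate coordinates into grid cells and plot one weighted point per cell. The following is a minimal sketch in plain Python, not something the project does: `bin_points` and `cell_deg` are hypothetical names, and the sample coordinates are made up. The same idea could be expressed in PySpark as a groupBy on rounded coordinates before collecting the result.

```python
import math
from collections import Counter


def bin_points(points, cell_deg=0.5):
    """Count (lat, lon) points per grid cell of size cell_deg degrees.

    Returns a sorted list of (cell_center_lat, cell_center_lon, count).
    Points with a missing coordinate are skipped.
    """
    counts = Counter()
    for lat, lon in points:
        if lat is None or lon is None:
            continue
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return sorted(
        ((i + 0.5) * cell_deg, (j + 0.5) * cell_deg, n)
        for (i, j), n in counts.items()
    )


# Made-up coordinates: the first two fall in the same 0.5-degree cell.
sample = [(16.1, 107.2), (16.3, 107.4), (10.9, 106.6), (None, 106.0)]
binned = bin_points(sample)
```

The count per cell can then drive marker size or color, so the number of plotted points depends on the grid resolution rather than on the number of rows.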

MODEL OF THE SOLUTION

Where to run it: It can be run either locally or on a cluster, but the following steps are required in both cases.

Dependencies: Run the installation script called “install.sh” (it does not install Spark or Python 3).

How to use it: Once all the dependencies are installed, run the script “run.sh”. It shows a menu from which to choose an option.
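The menu step can be pictured as a simple dispatch table. The sketch below is hypothetical: it is written in Python rather than shell, the names `OPTIONS`, `render_menu` and `choose` are invented for illustration, and the option titles are taken from the Results section. The actual “run.sh” may be organized differently.

```python
# Hypothetical menu: option numbers mapped to the chart titles listed under Results.
OPTIONS = {
    1: "Vietnam War Bombings (Monthly totals)",
    2: "Vietnam War Missions (Monthly totals)",
    3: "Vietnam War Bombings (Totals by Country)",
    4: "Vietnam War Missions (Totals by Country)",
    5: "Vietnam War Most affected Countries",
    6: "Vietnam War Type of Missions",
    7: "Vietnam War Bombings Locations By Date",
    8: "Vietnam War Most used Aircraft",
    9: "Vietnam War Mission Type per Aircraft",
    10: "Vietnam War Bombings Locations",
    11: "Vietnam War Bombings per Aircraft",
    12: "Vietnam War Most Common Take-off Locations",
}


def render_menu() -> str:
    """Return the menu as text, one numbered option per line."""
    return "\n".join(f"{number:>2}. {title}" for number, title in OPTIONS.items())


def choose(raw: str) -> str:
    """Validate raw menu input and return the title of the selected option."""
    try:
        option = int(raw.strip())
    except ValueError:
        raise ValueError(f"not a number: {raw!r}") from None
    if option not in OPTIONS:
        raise ValueError(f"option must be between 1 and {len(OPTIONS)}")
    return OPTIONS[option]
```

For example, `choose(" 7 ")` returns the title of option 7, while `choose("13")` raises a `ValueError`.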

Link to the repository!

TOOLS AND INFRASTRUCTURE

TOOLS

PySpark for data manipulation, Matplotlib/Seaborn for data visualization, and Plotly for generating interactive maps.
INFRASTRUCTURE

Hadoop cluster composed of four AWS EC2 machines (AWS EMR), plus a shared S3 bucket to store the dataset, additional data and the results generated in the process.
GIT AND GITHUB

For version control.

Results

The charts produced by each menu option are listed below.

Vietnam War Bombings (Monthly totals)

Option 1

Vietnam War Missions (Monthly totals)

Option 2

Vietnam War Bombings (Totals by Country)

Option 3

Vietnam War Missions (Totals by Country)

Option 4

Vietnam War Most affected Countries

Option 5

Vietnam War Type of Missions

Option 6

Vietnam War Bombings Locations By Date

Option 7

Vietnam War Most used Aircraft

Option 8

Vietnam War Mission Type per Aircraft

Option 9

Vietnam War Bombings Locations

Option 10

Vietnam War Bombings per Aircraft

Option 11

Vietnam War Most Common Take-off Locations

Option 12

PERFORMANCE

Performance Evaluation

Speed-Up
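Speed-up is conventionally defined as the running time of a baseline configuration divided by the running time of the parallel configuration, and efficiency as speed-up divided by the number of workers. A minimal sketch of that calculation follows. The function names are hypothetical, and the numbers in the usage note are placeholders, not measurements from this project.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Return t_baseline / t_parallel; both times must be in the same unit."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("running times must be positive")
    return t_baseline / t_parallel


def efficiency(t_baseline: float, t_parallel: float, workers: int) -> float:
    """Return speed-up divided by the number of workers (1.0 is ideal)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_baseline, t_parallel) / workers
```

With placeholder timings of 120 seconds on one machine and 40 seconds on four, `speedup(120.0, 40.0)` gives 3.0 and `efficiency(120.0, 40.0, 4)` gives 0.75.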

About us

Raquel Pérez González de Ossuna

Robert Farzan Rodríguez

Miguel Robledo Casal

This project was made in 2020 for the class Cloud and Big Data.