By: John Salter-Cid

Published: 12/6/2019
Last Edited: 12/6/2019

View site on GitHub or GitHub Pages

Table of Contents:

Introduction

Part 1a: Load & Wrangle Arms Transfer Dataset

Part 1b: Initial Exploration into Arms Transfer Database

Part 2a: Load & Wrangle Battle-Related Deaths Dataset

Part 2b: Initial Exploration into UCDP Battle-Related Deaths Data

Part 3: Merging the Datasets

Part 4: K-Nearest Neighbors Regression

Part 5: Conclusion

Introduction:

If you are like me, then you may feel overwhelmed at times by the onslaught of news coverage surrounding wars, or the threat thereof, around the world. By simply going to your preferred news source, you can surely find a news anchor outlining the rising military tensions in countries like North Korea and Iran or bemoaning atrocities in war-torn countries like Syria and Yemen. Alongside news stories like these, you will also see discussions between experts about the militarization or demilitarization of these regions that may point to the escalation or de-escalation of the conflict being discussed.

Thus, if you ARE like me, you may also be asking yourself, what does militarization mean?

I decided to use this curiosity as the basis for this project. Here, I will use two datasets, the SIPRI Arms Transfers Database and the UCDP Battle-Related Deaths Database, to try to figure out which countries are buying massive amounts of armaments, who is selling to them, and whether there is a correlation between these armament purchases and the number of fatalities in the wars these governments are involved in. This notebook will take data from these datasets and, walking through the data life cycle, will ultimately aim to answer these questions.

Dataset #1: SIPRI Arms Transfers Database

"The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually". This dataset, and similar datasets also provided by SIPRI are rich in information concerning everything you could possibly need to know about weapons sales across the world. It should be noted that the SIPRI dataset does not use convential currency to value the arms transfers. Instead, SIPRI has developed a unique pricing system to measure the volume of deliveries of major conventional weapons and components using a common unit the "SIPRI trend-indicator value" (TIV). The TIV of an item being delivered is intended to reflect its military capability rather than its financial value. This common unit can be used to measure trends in the flow of arms between particular countries and regions over time—in effect, a military capability price index.

Dataset #2: UCDP Battle-Related Deaths Data

"This dataset contains information on the number of battle-related deaths in the conflicts around the world from the years 1989 to 2018. The Uppsala Conflict Data Program (UCDP) is the world’s main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years". This dataset has everything needed for deep looks into every armed conflict since 1989 and will prove usefule in our comparison with arms sales from the SIPRI dataset.

Part 1a: Load & Wrangle Arms Transfer Dataset

In [91]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
%matplotlib inline

# Widen the notebook and relax the Pandas display limits so dataframes display in full.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
In [92]:
#These files can be found on the GitHub page linked at the top of the notebook.
arms_50_69 = pd.read_csv('arms_transfer_1950-1969.txt', sep=';')
arms_70_84 = pd.read_csv('arms_transfer_1970-1984.txt', sep=';')
arms_85_99 = pd.read_csv('arms_transfer_1985-1999.txt', sep=';')
arms_00_18 = pd.read_csv('arms_transfer_2000-2018.txt', sep=';')
In [93]:
#combine the files into one
dataframes = [arms_50_69, arms_70_84, arms_85_99, arms_00_18]
arms = pd.concat(dataframes)
# arms = arms.sort_values(by= ['Buyer','Delivery year'])
In [94]:
arms.head()
Out[94]:
Deal ID Seller Buyer Designation Description Armament category Order date Order date is estimate Numbers delivered Numbers delivered is estimate Delivery year Delivery year is estimate Status SIPRI estimate TIV deal unit TIV delivery values Local production
0 1929 United States Mexico Cessna-150 Trainer/light ac Aircraft 1967 Yes 2 No 1968 No New 0.10 0.10 0.20 No
1 1943 Soviet Union Algeria T-54 Tank Armoured vehicles 1965 Yes 25 No 1966 No Second hand 1.10 0.44 11.00 No
2 1944 Soviet Union Algeria T-55 Tank Armoured vehicles 1965 Yes 25 No 1966 No New 1.25 1.25 31.25 No
3 1957 United States Morocco DC-3/C-47 Skytrain Transport aircraft Aircraft 1964 Yes 5 No 1966 No Second hand 2.50 1.00 5.00 No
4 1968 Soviet Union Morocco Il-28 Bomber aircraft Aircraft 1960 Yes 2 No 1961 No Second hand 7.20 2.88 5.76 No
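
As a quick sanity check on the TIV bookkeeping (my own check, not something SIPRI states here), the "TIV delivery values" column appears to equal "TIV deal unit" times "Numbers delivered" for the rows shown above. A minimal sketch that verifies this across the whole dataframe:

#my own sanity check: delivery value should equal per-unit TIV times quantity delivered
implied_tiv = arms["TIV deal unit"] * arms["Numbers delivered"]
mismatch = (implied_tiv - arms["TIV delivery values"]).abs() > 0.01
print(mismatch.sum(), "of", len(arms), "rows deviate from unit * quantity")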

As we can see, SIPRI does a very good job of curating and maintaining their data. It doesn't appear that a lot of wrangling will be needed to get this dataset ready for analysis. However, we should still look for null values and make sure the data types are how we want them.

Below you can see in the count row that there are no nulls in this dataset! SIPRI helps us out yet again.

In [95]:
arms.describe(include='all')
Out[95]:
Deal ID Seller Buyer Designation Description Armament category Order date Order date is estimate Numbers delivered Numbers delivered is estimate Delivery year Delivery year is estimate Status SIPRI estimate TIV deal unit TIV delivery values Local production
count 54321.000000 54321 54321 54321 54321 54321 54321.000000 54321 54321.000000 54321 54321.000000 54321 54321 54321.000000 54321.000000 54321.000000 54321
unique NaN 132 257 3827 248 11 NaN 2 NaN 2 NaN 2 3 NaN NaN NaN 2
top NaN United States India M-113 APC Aircraft NaN Yes NaN Yes NaN No New NaN NaN NaN No
freq NaN 16834 2252 439 2891 15991 NaN 31031 NaN 37156 NaN 33436 44682 NaN NaN NaN 46079
mean 25898.554040 NaN NaN NaN NaN NaN 1983.341249 NaN 59.816020 NaN 1986.905285 NaN NaN 7.812927 6.569592 35.408285 NaN
std 17247.522631 NaN NaN NaN NaN NaN 17.974236 NaN 216.611329 NaN 18.349691 NaN NaN 28.270133 23.833243 85.932376 NaN
min 2.000000 NaN NaN NaN NaN NaN 1940.000000 NaN 1.000000 NaN 1950.000000 NaN NaN 0.010000 0.004000 0.020000 NaN
25% 14184.000000 NaN NaN NaN NaN NaN 1970.000000 NaN 2.000000 NaN 1973.000000 NaN NaN 0.250000 0.200000 4.000000 NaN
50% 24108.000000 NaN NaN NaN NaN NaN 1982.000000 NaN 10.000000 NaN 1986.000000 NaN NaN 1.100000 1.000000 10.000000 NaN
75% 36902.000000 NaN NaN NaN NaN NaN 1999.000000 NaN 38.000000 NaN 2003.000000 NaN NaN 5.100000 4.750000 30.000000 NaN
max 62537.000000 NaN NaN NaN NaN NaN 2018.000000 NaN 7500.000000 NaN 2018.000000 NaN NaN 1579.000000 1250.000000 2150.860000 NaN
In [96]:
arms.dtypes
Out[96]:
Deal ID                            int64
Seller                            object
Buyer                             object
Designation                       object
Description                       object
Armament category                 object
Order date                         int64
Order date is estimate            object
Numbers delivered                  int64
Numbers delivered is estimate     object
Delivery year                      int64
Delivery year is estimate         object
Status                            object
SIPRI estimate                   float64
TIV deal unit                    float64
TIV delivery values              float64
Local production                  object
dtype: object

Looking at the data types listed above, we can see that they are almost all where we want them. The only change we need is to "Armament category". It is currently an object and it should be a category data type. A simple "astype" command is all we need to fix that.

In [97]:
arms['Armament category'] = arms['Armament category'].astype('category')

To demonstrate what type of information this dataset can show us, look at the command below. Here we can see the arms deals between Germany and Singapore in the year 2009. It shows that Germany delivered 2 new turbofan engines, 8 new diesel engines, and 12 used tanks, totaling 68.68 million in TIV delivery values (the TIV values are in millions).

In [98]:
arms.loc[(arms["Delivery year"] == 2009) & (arms["Seller"] == 'Germany') & (arms["Buyer"] == 'Singapore')]
Out[98]:
Deal ID Seller Buyer Designation Description Armament category Order date Order date is estimate Numbers delivered Numbers delivered is estimate Delivery year Delivery year is estimate Status SIPRI estimate TIV deal unit TIV delivery values Local production
783 38809 Germany Singapore BR-710 Turbofan Engines 2007 No 2 Yes 2009 No New 2.5 2.50 5.00 No
11319 36197 Germany Singapore MTU-8000 Diesel engine Engines 2000 Yes 8 No 2009 No New 4.0 4.00 32.00 No
11597 31953 Germany Singapore Leopard-2A4 Tank Armoured vehicles 2007 No 12 No 2009 No Second hand but modernized 4.0 2.64 31.68 No
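
We can confirm that total by summing the delivery values of this slice directly (a quick check):

#quick check: total TIV delivered from Germany to Singapore in 2009
de_sg_2009 = arms.loc[(arms["Delivery year"] == 2009) & (arms["Seller"] == 'Germany') & (arms["Buyer"] == 'Singapore')]
print(de_sg_2009["TIV delivery values"].sum())  # 5.00 + 32.00 + 31.68 = 68.68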

Now, the arms dataset is ready to be analyzed! Before we load the other datasets, let's do some exploratory analysis to see what the arms dataframe has to tell us.

Part 1b: Initial Exploration into Arms Transfer Database

Let's start simple. Let's make a lineplot showing the TIV delivery values across time. I prefer Seaborn, so I will be using it for many of the plots in this article.

In [99]:
#TIV delivery values across the globe since 1950 (Seaborn's lineplot plots the yearly mean by default)
sns.lineplot(x='Delivery year', y="TIV delivery values", data=arms)
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c2049448>

After a spike in the early 1950s, the plotted values have fluctuated roughly between 25 and 50. Keep in mind that Seaborn's lineplot aggregates repeated years with the mean by default, so this figure shows the average TIV delivery value per deal in each year rather than the world total.
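
If we want the world total delivered per year instead, a simple groupby gives it; a minimal sketch:

#world total TIV delivered per year (sum over all deals in that year)
world_total_per_year = arms.groupby("Delivery year")["TIV delivery values"].sum()
world_total_per_year.plot.line(title="Total TIV delivery values per year, worldwide")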

To align with our original inquiries about who the top buyers and sellers of armaments across the world are, let's look at the top 10 buyers and sellers as defined by the greatest total TIV delivery values.

Let's do the top 10 sellers first.

In [100]:
#filtering data for the top 10 sellers by total sales since 1950

arms = arms.replace({"Soviet Union" : "Soviet Union/Russia", "Russia" : "Soviet Union/Russia"}) #noticing that the Soviet Union and Russia are separated, I combine them here

sum_seller = arms.groupby(['Seller']).sum() #groupby seller and aggregate by sum
sum_seller = sum_seller.sort_values(by=["TIV delivery values"], ascending=False) #sort the result by TIV delivery values to get the top sellers on top
sum_seller["rank"]= sum_seller["TIV delivery values"].rank(ascending= False) #create a rank column to make it easy to pull a certain slice of the data
top10_seller = sum_seller.loc[sum_seller["rank"]<=10] #pull just the top 10 sellers

top10_seller = top10_seller.reset_index()

top10_seller_list = top10_seller['Seller'].tolist() 
arms_seller = arms.groupby(['Seller',"Delivery year"]).sum()

arms_seller = arms_seller.reset_index()

arms_seller = arms_seller.loc[arms_seller["Seller"].isin(top10_seller_list)]

#plot showing the top 10 sellers and how much they have sold since 1950
sns.set_style("whitegrid")

f, ax = plt.subplots(figsize=(20, 10))
ax.set(xlim=(1950, 2018))
ax.set_title('Top 10 sellers by total TIV delivery values since 1950')
sns.lineplot(x='Delivery year', y="TIV delivery values", hue='Seller', ax=ax, data=arms_seller)
Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c212f548>

Not surprisingly, we can see that the United States and Russia are far and away the leading sellers of armaments in the world. To demonstrate that further, we can plot the percentage of world sales attributed to just the US and Russia each year.

In [101]:
#take USA and Russia per year and then divide total per year and get new column of percentage per year
usa_russia = arms.loc[(arms["Seller"]=="United States")|(arms["Seller"]=="Soviet Union/Russia")]
usa_russia = usa_russia.groupby(["Delivery year"]).sum()
usa_russia = usa_russia["TIV delivery values"]
usa_russia = usa_russia.reset_index()
world = arms.groupby(["Delivery year"]).sum()
world = world["TIV delivery values"]
world = world.reset_index()
In [102]:
usa_russia= pd.merge(usa_russia, world, left_on='Delivery year', right_on='Delivery year')
In [103]:
usa_russia["(USA+Russia) / World"] = (usa_russia["TIV delivery values_x"])/(usa_russia["TIV delivery values_y"])
In [104]:
usa_russia.head(2)
Out[104]:
Delivery year TIV delivery values_x TIV delivery values_y (USA+Russia) / World
0 1950 5762.5456 8467.3282 0.680562
1 1951 8577.0240 12442.6120 0.689327
In [105]:
fig, ax = plt.subplots(figsize=(13,8))

usa_russia.plot(x="Delivery year", y="(USA+Russia) / World", ax=ax)

ax.set_title('Proportion of world armament sales by USA and Russia alone')
ax.set_xlabel("Year")
ax.set_ylabel("Ratio of (USA+Russia) sales to total world sales")
Out[105]:
Text(0, 0.5, 'Ratio of (USA+Russia) sales to total world sales')

At one point (around 1963-64), you can see that the US and Russia alone accounted for over 80% of world armament sales! With the industrialization of other nations having occurred since then, it is not surprising to see a general decrease in this proportion over time.
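
We can pull the exact peak year and share out of the usa_russia dataframe (a quick check):

#year in which the combined US/Russia share of world deliveries peaked
peak_row = usa_russia.loc[usa_russia["(USA+Russia) / World"].idxmax()]
print(int(peak_row["Delivery year"]), round(peak_row["(USA+Russia) / World"], 3))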

Though seeing just how much the US and Russia dominate world arms sales is interesting, to answer our question of how the buying of armaments affects the number of casualties in war, let's look at the top 10 buyers.

In [106]:
#filtering data for the top 10 buyers by total sales since 1950

sum_buyer = arms.groupby(['Buyer']).sum()
sum_buyer = sum_buyer.sort_values(by=["TIV delivery values"], ascending=False)
sum_buyer["rank"] = sum_buyer["TIV delivery values"].rank(ascending= False)
top10_buyer = sum_buyer.loc[sum_buyer["rank"]<=10]

top10_buyer = top10_buyer.reset_index()

top10_buyer_list = top10_buyer['Buyer'].tolist()

arms_buyer = arms.groupby(['Buyer',"Delivery year"]).sum()

arms_buyer = arms_buyer.reset_index()

arms_buyer = arms_buyer.loc[arms_buyer["Buyer"].isin(top10_buyer_list)]

arms_buyer.head()
Out[106]:
Buyer Delivery year Deal ID Order date Numbers delivered SIPRI estimate TIV deal unit TIV delivery values
1237 China 1950 269649 33139 2150 27.27 19.200 2608.368
1238 China 1951 283683 35090 1650 37.09 30.736 2272.350
1239 China 1952 200077 25347 1807 30.99 29.676 3504.600
1240 China 1953 182350 25357 1655 34.41 33.096 3219.100
1241 China 1954 304845 37092 1553 209.47 122.128 3500.444
In [107]:
#plot showing the top 10 buyers and how much they have bought since 1950.
sns.set_style("whitegrid")

f, ax = plt.subplots(figsize=(20, 10))
ax.set_title('Top 10 buyers by total TIV delivery values since 1950')

sns.lineplot(x='Delivery year', y="TIV delivery values", hue='Buyer', palette=sns.color_palette('Paired', n_colors=10), ax=ax, data=arms_buyer)
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c2439208>

Below you can see the top 10 lists of both buyers and sellers in tabular form.

In [111]:
top10_buyer #top 10 buyers all-time (1950-2018)
Out[111]:
Buyer Deal ID Order date Numbers delivered Delivery year SIPRI estimate TIV deal unit TIV delivery values rank
0 India 57887627 4468473 187608 4483608 26054.56 25031.9922 121103.7664 1.0
1 China 30574155 2031451 49224 2039464 12618.17 11023.3800 78690.3560 2.0
2 Saudi Arabia 33467199 2215622 137250 2219039 9032.50 8925.7442 65447.7042 3.0
3 Egypt 31782730 2636362 113208 2640598 11080.65 9076.3044 65303.4236 4.0
4 Japan 44979667 3235545 37685 3248327 12850.80 12542.8300 64616.6400 5.0
5 Turkey 33063476 2634744 89147 2640060 16281.01 12331.0512 56468.6808 6.0
6 Germany 22384407 1774168 111758 1778040 4956.51 4377.0516 55870.5744 7.0
7 Iraq 25869597 2202360 121918 2205350 4591.87 4203.1516 54047.2000 8.0
8 South Korea 32065888 2423808 47588 2429460 10840.70 9403.7300 53230.0040 9.0
9 Iran 19612021 1812797 106839 1816485 5864.09 5382.2480 46671.1400 10.0
In [112]:
top10_seller #top 10 sellers all-time (1950-2018)
Out[112]:
Seller Deal ID Order date Numbers delivered Delivery year SIPRI estimate TIV deal unit TIV delivery values rank
0 United States 419502244 33358573 1095372 33415508 140837.60 111734.3926 682606.7928 1.0
1 Soviet Union/Russia 246221641 22883356 1045335 22922271 81985.82 73158.9280 593358.3480 2.0
2 United Kingdom 85694167 7602670 111228 7617535 39548.38 31624.8866 141384.8036 3.0
3 France 115408766 10100470 390869 10123095 30082.55 28199.4768 122557.5860 4.0
4 Germany 80282433 5376753 186632 5390688 34997.34 32559.2224 86740.1198 5.0
5 China 59288476 3725235 87779 3732447 11183.03 10897.0980 54394.8700 6.0
6 Italy 52317122 3630452 20109 3638111 11759.66 11177.4400 32854.5300 7.0
7 Czechoslovakia 6165839 671462 21344 672885 587.30 565.5860 29328.9300 8.0
8 Netherlands 34332690 2207456 8802 2212278 17475.41 12420.5500 24302.4430 9.0
9 Israel 44022856 2418445 50978 2422255 4811.15 4188.0110 17456.6030 10.0

Before we move on to the other dataset, let's look at something other than TIV delivery values. Personally, I was interested to see how the mix of armament types purchased changed from 1950 to 2018.

In [113]:
fig, ax = plt.subplots(figsize=(14, 8))

armament_count_1950 = arms.loc[arms['Delivery year']==1950]['Armament category'].value_counts()
armament_count_2018 = arms.loc[arms['Delivery year']==2018]['Armament category'].value_counts()

ax.set_title("Armament Categories in 1950")
armament_count_1950.plot.pie(ax=ax, legend=False)
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c38686c8>
In [114]:
fig, ax = plt.subplots(figsize=(14, 8))

ax.set_title("Armament Categories in 2018")

armament_count_2018.plot.pie(ax=ax)
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c45cc108>
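
Pie charts make precise comparison difficult, so as a supplement we can put the 1950 and 2018 category shares side by side in one table (a small sketch reusing the counts computed above):

#share of deliveries by armament category, 1950 vs 2018 (fractions of that year's deliveries)
category_shares = pd.DataFrame({
    "1950": armament_count_1950 / armament_count_1950.sum(),
    "2018": armament_count_2018 / armament_count_2018.sum(),
}).fillna(0)
print(category_shares.sort_values("2018", ascending=False).round(3))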

Introducing the UCDP Battle-Related Deaths Data

Now that we have taken a look at the arms dataset, let's load and look at the UCDP Battle-Related Deaths Data.

Part 2a: Load & Wrangle Battle-Related Deaths Dataset

In [115]:
deaths = pd.read_csv('BattleDeaths_v19_1.csv')
deaths.head()
Out[115]:
conflict_id dyad_id location_inc side_a side_a_id side_a_2nd side_b side_b_id side_b_2nd incompatibility territory_name year bd_best bd_low bd_high type_of_conflict battle_location gwno_a gwno_a_2nd gwno_b gwno_b_2nd gwno_loc gwno_battle region version
0 205 406 Iran Government of Iran 114 NaN KDPI 164 NaN 1 Kurdistan 1990 31 31 335 3 Iran 630 NaN NaN NaN 630 630 2 19.1
1 205 406 Iran Government of Iran 114 NaN KDPI 164 NaN 1 Kurdistan 1993 110 110 110 3 Iran, Iraq 630 NaN NaN NaN 630 630, 645 2 19.1
2 205 406 Iran Government of Iran 114 NaN KDPI 164 NaN 1 Kurdistan 1996 27 27 29 3 Iran, Iraq 630 NaN NaN NaN 630 630, 645 2 19.1
3 205 406 Iran Government of Iran 114 NaN KDPI 164 NaN 1 Kurdistan 2016 30 30 137 3 Iran, Iraq 630 NaN NaN NaN 630 630, 645 2 19.1
4 205 406 Iran Government of Iran 114 NaN KDPI 164 NaN 1 Kurdistan 2018 44 33 78 3 Iran, Iraq 630 NaN NaN NaN 630 630, 645 2 19.1

There are a lot of columns we don't need; let's drop them now.

In [116]:
deaths = deaths.drop(columns=['dyad_id', 'location_inc', 'side_a_id', 'side_a_2nd',
                            'side_b_id', 'side_b_2nd', 'territory_name', 'battle_location',
                            'gwno_a', 'gwno_a_2nd', 'gwno_b', 'gwno_b_2nd', 'gwno_loc', 
                            'gwno_battle', 'version'])
deaths.head()
Out[116]:
conflict_id side_a side_b incompatibility year bd_best bd_low bd_high type_of_conflict region
0 205 Government of Iran KDPI 1 1990 31 31 335 3 2
1 205 Government of Iran KDPI 1 1993 110 110 110 3 2
2 205 Government of Iran KDPI 1 1996 27 27 29 3 2
3 205 Government of Iran KDPI 1 2016 30 30 137 3 2
4 205 Government of Iran KDPI 1 2018 44 33 78 3 2
In [117]:
deaths.describe(include='all')
Out[117]:
conflict_id side_a side_b incompatibility year bd_best bd_low bd_high type_of_conflict region
count 1565.000000 1565 1565 1565.000000 1565.000000 1565.000000 1565.000000 1565.000000 1565.000000 1565
unique NaN 86 349 NaN NaN NaN NaN NaN NaN 7
top NaN Government of India IS NaN NaN NaN NaN NaN NaN 3
freq NaN 161 71 NaN NaN NaN NaN NaN NaN 564
mean 1441.824281 NaN NaN 1.549521 2003.270927 900.244089 786.077955 1156.854952 3.150799 NaN
std 3582.614254 NaN NaN 0.500263 9.082850 3722.826610 2862.735593 4064.040425 0.411170 NaN
min 205.000000 NaN NaN 1.000000 1989.000000 25.000000 1.000000 24.000000 2.000000 NaN
25% 288.000000 NaN NaN 1.000000 1995.000000 43.000000 42.000000 60.000000 3.000000 NaN
50% 333.000000 NaN NaN 2.000000 2003.000000 118.000000 112.000000 173.000000 3.000000 NaN
75% 388.000000 NaN NaN 2.000000 2012.000000 489.000000 454.000000 728.000000 3.000000 NaN
max 14609.000000 NaN NaN 3.000000 2018.000000 68614.000000 50000.000000 68627.000000 4.000000 NaN
In [118]:
deaths.dtypes
Out[118]:
conflict_id          int64
side_a              object
side_b              object
incompatibility      int64
year                 int64
bd_best              int64
bd_low               int64
bd_high              int64
type_of_conflict     int64
region              object
dtype: object

As shown above, this data has no nulls and all the correct data types. However, we will still have to do some wrangling to get this dataset ready for analysis. First, the "side_a" column, which identifies the government side of a given conflict, needs to be trimmed down: every cell starts with "Government of" followed by the country name, and we just want the country name. Also, some conflicts have more than one country cited in the "side_a" column; we just want the first country listed (the main one).

In [119]:
#remove the 'Government of..' from the beginning of every side_a value
deaths['side_a'] = deaths.side_a.apply(lambda x: x[14:])

#the 'side_a' column in a handful of instances has more than one country/government. I split the column and
#just keep the first (main) country
deaths[['side_a','others']] = pd.DataFrame(deaths['side_a'].str.split(',',1).tolist(), 
                                                                    columns = ['side_a','others'])
deaths = deaths.drop(columns=['others'])
In [120]:
deaths.head()
Out[120]:
conflict_id side_a side_b incompatibility year bd_best bd_low bd_high type_of_conflict region
0 205 Iran KDPI 1 1990 31 31 335 3 2
1 205 Iran KDPI 1 1993 110 110 110 3 2
2 205 Iran KDPI 1 1996 27 27 29 3 2
3 205 Iran KDPI 1 2016 30 30 137 3 2
4 205 Iran KDPI 1 2018 44 33 78 3 2

A similar problem occurs in the "region" column: some rows cite more than one region. Let's split the column and do two things: keep the main region (the first one cited) and create a dataframe with the counts of conflicts per region per year (to be used in some interesting exploratory analysis).

In [121]:
#I want to break up the region of the conflict for later analysis (some of the rows have multiple regions divided by commas)
#the first region cited is the main region, so I will keep it in the dataframe and pull the rest out into a separate dataframe.
deaths[['region_main','region2','region3','region4',]] = pd.DataFrame(deaths['region'].str.split(',',4).tolist(), 
                                                                    columns = ['region_main','region2','region3','region4'])

deaths["region1_bool"] = (deaths["region_main"] == "1") | (deaths["region2"] == "1") | (deaths["region3"] == "1") | (deaths["region4"] == "1")
deaths["region2_bool"] = (deaths["region_main"] == "2") | (deaths["region2"] == "2") | (deaths["region3"] == "2") | (deaths["region4"] == "2")
deaths["region3_bool"] = (deaths["region_main"] == "3") | (deaths["region2"] == "3") | (deaths["region3"] == "3") | (deaths["region4"] == "3")
deaths["region4_bool"] = (deaths["region_main"] == "4") | (deaths["region2"] == "4") | (deaths["region3"] == "4") | (deaths["region4"] == "4")
deaths["region5_bool"] = (deaths["region_main"] == "5") | (deaths["region2"] == "5") | (deaths["region3"] == "5") | (deaths["region4"] == "5")

regions =deaths.copy()
deaths = deaths.drop(columns=["region","region2","region3","region4", "region1_bool", "region2_bool", "region3_bool", "region4_bool", "region5_bool"])
deaths.head()
Out[121]:
conflict_id side_a side_b incompatibility year bd_best bd_low bd_high type_of_conflict region_main
0 205 Iran KDPI 1 1990 31 31 335 3 2
1 205 Iran KDPI 1 1993 110 110 110 3 2
2 205 Iran KDPI 1 1996 27 27 29 3 2
3 205 Iran KDPI 1 2016 30 30 137 3 2
4 205 Iran KDPI 1 2018 44 33 78 3 2
In [122]:
#To create our separate dataframe with the counts of conflicts per region, we need to pull out only the "True" values from the dataframe,
#as False is not very useful.

from functools import reduce

regions =regions[["year", "region1_bool", "region2_bool", "region3_bool", "region4_bool", "region5_bool"]]

regions1 = regions.groupby(["year","region1_bool"])
one = regions1.size().unstack().add_prefix('region1')
regions2 = regions.groupby(["year","region2_bool"])
two = regions2.size().unstack().add_prefix('region2')
regions3 = regions.groupby(["year","region3_bool"])
three = regions3.size().unstack().add_prefix('region3')
regions4 = regions.groupby(["year","region4_bool"])
four = regions4.size().unstack().add_prefix('region4')
regions5 = regions.groupby(["year","region5_bool"])
five = regions5.size().unstack().add_prefix('region5')

data_frames = [one,two,three,four,five]


regions = reduce(lambda  left,right: pd.merge(left,right,on=['year'],
                                            how='outer'), data_frames).fillna(0)
In [123]:
#The region codes are laid out in the codebook on my github.
columns = ["region1True","region5True", "region2True","region4True","region3True"]
regions = regions[columns]
regions = regions.rename (columns={
    "region1True" : "Europe",
    "region2True" : "Middle East",
    "region3True" : "Asia",
    "region4True" : "Africa",
    "region5True" : "Americas"
})
regions.head()
Out[123]:
Europe Americas Middle East Africa Asia
year
1989 3 12 7 17 21
1990 3 9 8 18 27
1991 8 7 9 23 20
1992 10 6 6 17 20
1993 11 5 7 15 16

Now we are ready for some exploratory data analysis!

Part 2b: Initial Exploration into UCDP Battle-Related Deaths Data

Let's start with the 'regions' dataframe we just created and plot a stacked area chart.

We see that Africa and Asia are usually the regions with the most conflicts!

In [124]:
fig, ax = plt.subplots(figsize=(12,7))

regions.plot(kind='area', figsize=[16,6], stacked=True, colormap='rainbow', ax=ax)

ax.set_title('Conflicts by region from 1989-2018')
ax.set_ylabel('Number of conflicts')
Out[124]:
Text(0, 0.5, 'Number of conflicts')
In [125]:
deaths['region_main'] = deaths['region_main'].map({
    "1" : "Europe",
    "2" : "Middle East",
    "3" : "Asia",
    "4" : "Africa",
    "5" : "Americas"
})

plot = deaths.groupby(["region_main","year"]).sum().reset_index()

fig, ax = plt.subplots(figsize=(12,7))

sns.lineplot(x='year', y="bd_best", hue='region_main', palette=sns.color_palette('Paired', n_colors=5), ax=ax, data=plot)
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x284c5dfb148>

Interestingly, despite having fewer conflicts in recent years than Africa and Asia, the Middle East clearly has far deadlier battles.
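
To put a rough number on that, we can compare battle deaths per conflict-year by main region over the full 1989-2018 period (a quick sketch):

#average and total best-estimate battle deaths per conflict-year, by main region
deadliness = deaths.groupby("region_main")["bd_best"].agg(["count", "sum", "mean"])
print(deadliness.sort_values("mean", ascending=False).round(1))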

Part 3: Merging the two datasets ('deaths','arms')

To merge the datasets, we will need to edit the spellings of the country names to make sure they match.

In [126]:
# list the country names in the deaths dataset so we can match them against the arms dataset
side_a_list = deaths['side_a'].unique()
np.sort(side_a_list, axis=None)
Out[126]:
array(['Afghanistan', 'Algeria', 'Angola', 'Australia', 'Azerbaijan',
       'Bangladesh', 'Bosnia-Herzegovina', 'Burkina Faso', 'Burundi',
       'Cambodia (Kampuchea)', 'Cameroon', 'Central African Republic',
       'Chad', 'China', 'Colombia', 'Comoros', 'Congo', 'Croatia',
       'DR Congo (Zaire)', 'Djibouti', 'Ecuador', 'Egypt', 'El Salvador',
       'Eritrea', 'Ethiopia', 'Georgia', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Haiti', 'India', 'Indonesia', 'Iran', 'Iraq',
       'Israel', 'Ivory Coast', 'Jordan', 'Kenya', 'Laos', 'Lebanon',
       'Lesotho', 'Liberia', 'Libya', 'Macedonia', 'Malaysia', 'Mali',
       'Mauritania', 'Mexico', 'Moldova', 'Morocco', 'Mozambique',
       'Myanmar (Burma)', 'Nepal', 'Nicaragua', 'Niger', 'Nigeria',
       'Pakistan', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
       'Philippines', 'Rumania', 'Russia (Soviet Union)', 'Rwanda',
       'Senegal', 'Serbia (Yugoslavia)', 'Sierra Leone', 'Somalia',
       'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Syria',
       'Tajikistan', 'Thailand', 'Trinidad and Tobago', 'Tunisia',
       'Turkey', 'Uganda', 'Ukraine', 'United Kingdom',
       'United States of America', 'Uzbekistan', 'Venezuela',
       'Yemen (North Yemen)'], dtype=object)
In [127]:
#edit the deaths dataframe's spellings to match the names used in the arms dataframe
deaths['side_a']= deaths['side_a'].replace({
    "Yemen (North Yemen)" : "Yemen",
    "United States of America" : "United States",
    "Serbia (Yugoslavia)" : "Serbia",
    "Russia (Soviet Union)" : "Soviet Union/Russia",
    "Myanmar (Burma)" : "Myanmar",
    "DR Congo (Zaire)" : "DR Congo",
    "Cambodia (Kampuchea)" : "Cambodia"
})
In [128]:
#the deaths df starts in 1989 so we can just pull the arms transfer data from those years before merging. Then merge!
arms_merge = arms.loc[arms['Delivery year']>=1989]
arms_merge = arms_merge.groupby(['Buyer','Delivery year']).sum().reset_index()

merged = pd.merge(arms_merge, deaths, how='outer', left_on=['Buyer','Delivery year'],
                  right_on=['side_a','year'])
merged.head(2)
Out[128]:
Buyer Delivery year Deal ID Order date Numbers delivered SIPRI estimate TIV deal unit TIV delivery values conflict_id side_a side_b incompatibility year bd_best bd_low bd_high type_of_conflict region_main
0 Afghanistan 1989.0 488573.0 43675.0 2427.0 98.56 69.85 2175.52 333.0 Afghanistan Jam'iyyat-i Islami-yi Afghanistan 2.0 1989.0 1545.0 1374.0 1745.0 4.0 Asia
1 Afghanistan 1989.0 488573.0 43675.0 2427.0 98.56 69.85 2175.52 333.0 Afghanistan Hizb-i Islami-yi Afghanistan 2.0 1989.0 181.0 1.0 181.0 4.0 Asia
In [129]:
#notice the thousands of nulls in the merged dataframe. Time to eliminate all nulls!
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4248 entries, 0 to 4247
Data columns (total 18 columns):
Buyer                  3921 non-null object
Delivery year          3921 non-null float64
Deal ID                3921 non-null float64
Order date             3921 non-null float64
Numbers delivered      3921 non-null float64
SIPRI estimate         3921 non-null float64
TIV deal unit          3921 non-null float64
TIV delivery values    3921 non-null float64
conflict_id            1565 non-null float64
side_a                 1565 non-null object
side_b                 1565 non-null object
incompatibility        1565 non-null float64
year                   1565 non-null float64
bd_best                1565 non-null float64
bd_low                 1565 non-null float64
bd_high                1565 non-null float64
type_of_conflict       1565 non-null float64
region_main            1565 non-null object
dtypes: float64(14), object(4)
memory usage: 630.6+ KB
In [130]:
#Fill all NaNs in the new merged dataset. 

merged['Buyer'] = merged['Buyer'].fillna(merged['side_a']) #Buyer and Side_a are the same 
merged['year'] = merged['year'].fillna(merged['Delivery year']) #Delivery year and year of conflict are the same

#All columns in 'cols' are columns that are zero when null (either no purchase or no conflict).
#Note, '0' in the columns 'incompatibility' and 'type_of_conflict' will become new categories that represent
#"no conflict"
cols = ['Numbers delivered','TIV deal unit','TIV delivery values', 'incompatibility',
       'bd_best', 'bd_low', 'bd_high', 'type_of_conflict']
merged[cols] = merged[cols].replace({np.nan:0})

#For year of no conflict, set 'conflict_id' and 'side_b' to 'no conflict'
merged[['conflict_id','side_b', 'region_main']] = merged[['conflict_id','side_b', 'region_main']].replace({np.nan:"no conflict"})
In [131]:
#drop unnecessary columns
merged = merged.drop(columns=['Delivery year', 'Order date', 'side_a', 'Deal ID', 'SIPRI estimate'])
In [132]:
#No more NaNs! We just have to change 'incompatibility', 'region_main' and 'type_of_conflict' to category data types.

for col in ['incompatibility', "type_of_conflict", 'region_main']:
    merged[col] = merged[col].astype('category')
    
print(merged.info())
    
#Now we're ready to analyze the data.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4248 entries, 0 to 4247
Data columns (total 13 columns):
Buyer                  4248 non-null object
Numbers delivered      4248 non-null float64
TIV deal unit          4248 non-null float64
TIV delivery values    4248 non-null float64
conflict_id            4248 non-null object
side_b                 4248 non-null object
incompatibility        4248 non-null category
year                   4248 non-null float64
bd_best                4248 non-null float64
bd_low                 4248 non-null float64
bd_high                4248 non-null float64
type_of_conflict       4248 non-null category
region_main            4248 non-null category
dtypes: category(3), float64(7), object(3)
memory usage: 378.1+ KB
None
In [133]:
merged.head(2)
Out[133]:
Buyer Numbers delivered TIV deal unit TIV delivery values conflict_id side_b incompatibility year bd_best bd_low bd_high type_of_conflict region_main
0 Afghanistan 2427.0 69.85 2175.52 333 Jam'iyyat-i Islami-yi Afghanistan 2.0 1989.0 1545.0 1374.0 1745.0 4.0 Asia
1 Afghanistan 2427.0 69.85 2175.52 333 Hizb-i Islami-yi Afghanistan 2.0 1989.0 181.0 1.0 181.0 4.0 Asia
In [134]:
#The dataframe has yet to be grouped by buyer/year/conflict
#we can do that now by pulling out the columns that identify conflicts (Buyer->year->type_of_conflict->incompatibility->region_main) and summing the rest
merged = merged.groupby(['Buyer','year','type_of_conflict','incompatibility','region_main']).sum()
merged = merged.loc[merged['bd_best'].notnull()].reset_index()
In [137]:
#drop the null rows and we're done!
merged = merged.dropna()
merged.head()
Out[137]:
Buyer year type_of_conflict incompatibility region_main Numbers delivered TIV deal unit TIV delivery values bd_best bd_low bd_high
0 Afghanistan 1989.0 4.0 2.0 Asia 12135.0 349.25 10877.60 5174.0 1904.0 17051.0
1 Afghanistan 1990.0 3.0 2.0 Asia 11980.0 304.70 9498.35 1478.0 1323.0 2275.0
2 Afghanistan 1991.0 3.0 2.0 Asia 5780.0 91.88 5751.84 3302.0 3156.0 3802.0
3 Afghanistan 1992.0 3.0 2.0 Asia 0.0 0.00 0.00 4276.0 4259.0 5460.0
4 Afghanistan 1993.0 3.0 2.0 Asia 0.0 0.00 0.00 3721.0 3713.0 7805.0

Above is the final dataframe that we will now use for some more advanced analysis.

Part 4: K-Nearest Neighbors Regression

Now that we have successfully loaded, wrangled, and merged our dataframes, we can try to answer the question that brought us here in the first place: is there a correlation between how much a country spends on armaments and the number of casualties in the wars it is fighting in the year the weapons are delivered? In other words, if a country purchased, for example, 20 fighter jets in 2015, do we see an uptick in the number of deaths in the war(s) that the country is involved in?

To accomplish this, we will run a K-Nearest Neighbors regression using 10, 50, and 100 nearest neighbors.

In [147]:
my_dpi = 150

X_train = merged[["TIV delivery values"]]
y_train = merged["bd_best"]

X_new = pd.DataFrame()
X_new["TIV delivery values"] = np.arange(0, 15000, 100)

def get_NN_prediction(x_new, n):
    """Given new observation, returns n-nearest neighbors prediction
    """
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    inds_sorted = dists.sort_values().index[:n]
    return y_train.loc[inds_sorted].mean()

fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi)

plt.scatter(merged['TIV delivery values'], merged['bd_best'], c="black", alpha=.3)
# plt.xscale('log')
# plt.yscale('log')
plt.ylim(0,25000)
plt.ylabel('deaths from conflict')
plt.xlabel('TIV Delivery Values')
plt.title('TIV Delivery Values Per Year Per Country vs Death Toll From Conflict (1989-2018)')

colors=['blue','green','red']

for i,k in enumerate ([10,50,100]):
    y_new_pred = X_new.apply(get_NN_prediction, axis=1, args=(k,))
    y_new_pred.index = X_new["TIV delivery values"]  # index predictions by the TIV values they were made for
    y_new_pred.plot.line(color=colors[i], label=str(k), legend=True)
In [153]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# get the features (in dict format) and the labels
# (do not split into training and validation sets)
features = ["TIV delivery values"]
X_dict = merged[features].to_dict(orient="records")
y = merged["bd_best"]

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()

def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    rmse = np.sqrt((-cross_val_score(
        pipeline, X_dict, y, 
        cv=5, scoring="neg_mean_squared_error")).mean())
    return rmse
    
ks = pd.Series(range(1, 100))
ks.index = range(1, 100)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
print(test_errs.min())
print(test_errs.idxmin())
2476.3939796958857
99

From the KNN fit and the cross-validation error above, it is clear that there is no meaningful correlation between TIV delivery values and the number of deaths that occur in conflicts!
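
As a complementary check on that conclusion, we can also compute simple correlation coefficients between TIV delivery values and battle deaths using scipy.stats (imported at the top); a minimal sketch:

#linear (Pearson) and rank (Spearman) correlations between deliveries and battle deaths
pearson_r, pearson_p = stats.pearsonr(merged["TIV delivery values"], merged["bd_best"])
spearman_rho, spearman_p = stats.spearmanr(merged["TIV delivery values"], merged["bd_best"])
print("Pearson r:", round(pearson_r, 3), "p-value:", round(pearson_p, 4))
print("Spearman rho:", round(spearman_rho, 3), "p-value:", round(spearman_p, 4))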

Part 5: Conclusion

We began this project with the question, "Is increased military spending (increased TIV delivery values) correlated with increased casualties in war?" Using the skills I have acquired in this course, I could not find a correlation between these two variables. Interestingly, it appeared that some of the bloodiest conflicts involved countries that spent the least on weapons. My rationalization for this phenomenon is that the countries that spend the most on weapons are actually using them as war deterrents rather than actually using them in war. Much like how countries will produce nuclear weapons to deter invaders, countries that spend the most on weapons may be looking to avoid war. I will be taking the machine learning course next semester and look forward to revisiting this project with more data analysis skills!