If you are like me, you may at times feel overwhelmed by the onslaught of news coverage surrounding wars, or the threat thereof, around the world. Go to your preferred news source and you will surely find a news anchor outlining the rising military tensions in countries like North Korea and Iran or bemoaning atrocities in war-torn countries like Syria and Yemen. Alongside stories like these, you will also see experts discussing the militarization or demilitarization of these regions, which may point to the escalation or de-escalation of the conflict in question.
Thus, if you ARE like me, you may also be asking yourself: what does militarization mean?
I decided to use this curiosity as the basis for this project. Here, I will use two datasets, the SIPRI Arms Transfers Database and the UCDP Battle-Related Deaths Dataset, to try to figure out which countries are buying large quantities of armaments, who is selling to them, and whether there is a correlation between these purchases and the number of fatalities in the wars these governments are involved in. This notebook walks through the data life cycle with these datasets and ultimately aims to answer these questions.
Dataset #1: SIPRI Arms Transfers Database
"The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually." This dataset, along with related datasets also provided by SIPRI, is rich in information about weapons sales across the world. Note that the SIPRI dataset does not value arms transfers in conventional currency. Instead, SIPRI has developed its own pricing system to measure the volume of deliveries of major conventional weapons and components in a common unit, the "SIPRI trend-indicator value" (TIV). The TIV of a delivered item is intended to reflect its military capability rather than its financial value. This common unit can be used to measure trends in the flow of arms between particular countries and regions over time; in effect, it is a military capability price index.
Dataset #2: UCDP Battle-Related Deaths Data
"This dataset contains information on the number of battle-related deaths in the conflicts around the world from the years 1989 to 2018. The Uppsala Conflict Data Program (UCDP) is the world’s main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years." This dataset supports a detailed look at armed conflicts since 1989 and will prove useful when compared with arms sales from the SIPRI dataset.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
%matplotlib inline
# The next two lines widen the notebook so that wide pandas dataframes display fully.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
#These files can be found on the github page linked at the top of notebook.
arms_50_69 = pd.read_csv('arms_transfer_1950-1969.txt', sep=';')
arms_70_84 = pd.read_csv('arms_transfer_1970-1984.txt', sep=';')
arms_85_99 = pd.read_csv('arms_transfer_1985-1999.txt', sep=';')
arms_00_18 = pd.read_csv('arms_transfer_2000-2018.txt', sep=';')
# Combine the files into one dataframe
dataframes = [arms_50_69, arms_70_84, arms_85_99, arms_00_18]
arms = pd.concat(dataframes)
# arms = arms.sort_values(by= ['Buyer','Delivery year'])
arms.head()
arms.describe(include='all')
arms.dtypes
arms['Armament category'] = arms['Armament category'].astype('category')
arms.loc[(arms["Delivery year"] == 2009) & (arms["Seller"] == 'Germany') & (arms["Buyer"] == 'Singapore')]
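# Total TIV delivery values across the globe per year since 1950
# (sum per year first; sns.lineplot on the raw rows would plot the per-year mean instead)
yearly_total = arms.groupby('Delivery year', as_index=False)['TIV delivery values'].sum()
sns.lineplot(x='Delivery year', y='TIV delivery values', data=yearly_total)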
# Filter the data for the top 10 sellers by total TIV delivered since 1950
arms = arms.replace({"Soviet Union": "Soviet Union/Russia", "Russia": "Soviet Union/Russia"})  # the Soviet Union and Russia are listed separately, so combine them here
sum_seller = arms.groupby(['Seller']).sum(numeric_only=True)  # group by seller and sum the numeric columns
sum_seller = sum_seller.sort_values(by=["TIV delivery values"], ascending=False)  # sort by TIV delivery values so the top sellers come first
sum_seller["rank"] = sum_seller["TIV delivery values"].rank(ascending=False)  # add a rank column to make it easy to pull a slice of the data
top10_seller = sum_seller.loc[sum_seller["rank"] <= 10]  # keep just the top 10 sellers
top10_seller = top10_seller.reset_index()
top10_seller_list = top10_seller['Seller'].tolist()
arms_seller = arms.groupby(['Seller', "Delivery year"]).sum(numeric_only=True)  # yearly totals per seller
arms_seller = arms_seller.reset_index()
arms_seller = arms_seller.loc[arms_seller["Seller"].isin(top10_seller_list)]
#plot showing the top 10 sellers and how much they have sold since 1950
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(20, 10))
ax.set(xlim=(1950, 2018))
ax.set_title('Top 10 sellers by total TIV delivery values since 1950')
sns.lineplot(x='Delivery year', y="TIV delivery values", hue='Seller', ax=ax, data=arms_seller)
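The rank-and-filter steps used to pick the top sellers can also be written with `nlargest`. The sketch below is only an illustration on a made-up table: the seller names and numbers are invented, and only the column names mirror the SIPRI files. Note that `nlargest` returns exactly n rows, whereas a `rank <= 10` filter can behave differently when totals are tied.

```python
import pandas as pd

# Hypothetical deliveries; the values are invented, not SIPRI data.
toy_arms = pd.DataFrame({
    "Seller": ["A", "B", "C", "A", "C"],
    "TIV delivery values": [4.0, 1.0, 3.0, 2.0, 5.0],
})

# Total TIV per seller, then keep the n largest totals (n=2 here, 10 in the notebook).
seller_totals = toy_arms.groupby("Seller")["TIV delivery values"].sum()
top_sellers = seller_totals.nlargest(2).index.tolist()

# Keep only the rows that belong to those sellers.
toy_top = toy_arms[toy_arms["Seller"].isin(top_sellers)]
print(top_sellers)
```

The same pattern applies to buyers by grouping on `Buyer` instead.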
# Sum USA and Soviet Union/Russia sales per year, then divide by the world total per year to get their combined share
usa_russia = arms.loc[arms["Seller"].isin(["United States", "Soviet Union/Russia"])]
usa_russia = usa_russia.groupby(["Delivery year"]).sum(numeric_only=True)
usa_russia = usa_russia["TIV delivery values"]
usa_russia = usa_russia.reset_index()
world = arms.groupby(["Delivery year"]).sum(numeric_only=True)
world = world["TIV delivery values"]
world = world.reset_index()
usa_russia = pd.merge(usa_russia, world, on='Delivery year')  # _x = USA + Russia, _y = world
usa_russia["(USA+Russia) / World"] = usa_russia["TIV delivery values_x"] / usa_russia["TIV delivery values_y"]
usa_russia.head(2)
fig, ax = plt.subplots(figsize=(13,8))
usa_russia.plot(x="Delivery year", y="(USA+Russia) / World", ax=ax)
ax.set_title('Proportion of world armament sales by USA and Russia alone')
ax.set_xlabel("Year")
ax.set_ylabel("Ratio of (USA+Russia) sales to total world sales")
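The share calculation can also be done without a merge, because dividing two pandas Series aligns them on their index. This is a minimal sketch on invented numbers (the table below is hypothetical, not SIPRI data), assuming the same column names as the arms dataframe.

```python
import pandas as pd

# Hypothetical deliveries; the values are invented for illustration.
toy_arms = pd.DataFrame({
    "Seller": ["United States", "Soviet Union/Russia", "France", "United States", "France"],
    "Delivery year": [1990, 1990, 1990, 1991, 1991],
    "TIV delivery values": [6.0, 2.0, 2.0, 3.0, 1.0],
})

big_two = ["United States", "Soviet Union/Russia"]
world_total = toy_arms.groupby("Delivery year")["TIV delivery values"].sum()
big_two_total = (
    toy_arms[toy_arms["Seller"].isin(big_two)]
    .groupby("Delivery year")["TIV delivery values"]
    .sum()
)

# Division aligns on the year index; a year with no big-two sales gets a share of 0.
share = big_two_total.reindex(world_total.index, fill_value=0) / world_total
print(share)
```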
# Filter the data for the top 10 buyers by total TIV received since 1950
sum_buyer = arms.groupby(['Buyer']).sum(numeric_only=True)
sum_buyer = sum_buyer.sort_values(by=["TIV delivery values"], ascending=False)
sum_buyer["rank"] = sum_buyer["TIV delivery values"].rank(ascending=False)
top10_buyer = sum_buyer.loc[sum_buyer["rank"] <= 10]
top10_buyer = top10_buyer.reset_index()
top10_buyer_list = top10_buyer['Buyer'].tolist()
arms_buyer = arms.groupby(['Buyer', "Delivery year"]).sum(numeric_only=True)  # yearly totals per buyer
arms_buyer = arms_buyer.reset_index()
arms_buyer = arms_buyer.loc[arms_buyer["Buyer"].isin(top10_buyer_list)]
arms_buyer.head()
#plot showing the top 10 buyers and how much they have bought since 1950.
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(20, 10))
ax.set_title('Top 10 buyers by total TIV delivery values since 1950')
sns.lineplot(x='Delivery year', y="TIV delivery values", hue='Buyer', palette=sns.color_palette('Paired', n_colors=10), ax=ax, data=arms_buyer)
top10_buyer #top 10 buyers all-time (1950-2018)
top10_seller #top 10 sellers all-time (1950-2018)
fig, ax = plt.subplots(figsize=(14, 8))
armament_count_1950 = arms.loc[arms['Delivery year']==1950]['Armament category'].value_counts()
armament_count_2018 = arms.loc[arms['Delivery year']==2018]['Armament category'].value_counts()
ax.set_title("Armament Categories in 1950")
armament_count_1950.plot.pie(ax=ax, legend=False)
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_title("Armament Categories in 2018")
armament_count_2018.plot.pie(ax=ax)
Now that we have taken a look at the arms dataset, let's load and look at the UCDP Battle-Related Deaths data.
deaths = pd.read_csv('BattleDeaths_v19_1.csv')
deaths.head()
deaths = deaths.drop(columns=['dyad_id', 'location_inc', 'side_a_id', 'side_a_2nd',
                              'side_b_id', 'side_b_2nd', 'territory_name', 'battle_location',
                              'gwno_a', 'gwno_a_2nd', 'gwno_b', 'gwno_b_2nd', 'gwno_loc',
                              'gwno_battle', 'version'])
deaths.head()
deaths.describe(include='all')
deaths.dtypes
# Remove the 'Government of ' prefix (14 characters) from the beginning of every side_a value
deaths['side_a'] = deaths['side_a'].str[14:]
# In a handful of rows, side_a lists more than one country/government, separated by commas.
# Keep only the first (main) country.
deaths['side_a'] = deaths['side_a'].str.split(',', n=1).str[0]
deaths.head()
# Break up the region of each conflict for later analysis (some rows list several region codes separated by commas).
# The first region listed is the main region, so it stays in the dataframe; the per-region flags go into a separate dataframe.
region_codes = deaths['region'].astype(str).str.split(',').apply(lambda codes: [code.strip() for code in codes])
deaths['region_main'] = region_codes.str[0]
region_flags = [f'region{n}_bool' for n in range(1, 6)]
for n, flag in enumerate(region_flags, start=1):
    # True when region code n appears anywhere in the row's list of regions
    deaths[flag] = region_codes.apply(lambda codes, code=str(n): code in codes)
regions = deaths.copy()
deaths = deaths.drop(columns=['region'] + region_flags)
deaths.head()
# To create the separate dataframe with the counts of conflicts per region, we only need the "True" counts
# from each grouping; the "False" counts are not useful.
from functools import reduce
regions =regions[["year", "region1_bool", "region2_bool", "region3_bool", "region4_bool", "region5_bool"]]
regions1 = regions.groupby(["year","region1_bool"])
one = regions1.size().unstack().add_prefix('region1')
regions2 = regions.groupby(["year","region2_bool"])
two = regions2.size().unstack().add_prefix('region2')
regions3 = regions.groupby(["year","region3_bool"])
three = regions3.size().unstack().add_prefix('region3')
regions4 = regions.groupby(["year","region4_bool"])
four = regions4.size().unstack().add_prefix('region4')
regions5 = regions.groupby(["year","region5_bool"])
five = regions5.size().unstack().add_prefix('region5')
data_frames = [one,two,three,four,five]
regions = reduce(lambda left,right: pd.merge(left,right,on=['year'],
how='outer'), data_frames).fillna(0)
#The region codes are laid out in the codebook on my github.
columns = ["region1True","region5True", "region2True","region4True","region3True"]
regions = regions[columns]
regions = regions.rename(columns={
    "region1True": "Europe",
    "region2True": "Middle East",
    "region3True": "Asia",
    "region4True": "Africa",
    "region5True": "Americas"
})
regions.head()
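The per-region counts can also be obtained in fewer steps by splitting the comma-separated region codes into one row per region and cross-tabulating. This is only a sketch on a made-up table: the rows are invented, and it assumes the `region` column holds codes such as "1" or "1, 3" as described in the comments above.

```python
import pandas as pd

# Hypothetical conflict rows; only the column names mirror the deaths dataframe.
toy_deaths = pd.DataFrame({
    "year": [1989, 1989, 1990],
    "region": ["1", "1, 3", "4"],
})

# One row per (conflict, region): split on commas, explode, then strip stray spaces.
exploded = (
    toy_deaths.assign(region=toy_deaths["region"].str.split(","))
    .explode("region")
    .reset_index(drop=True)
)
exploded["region"] = exploded["region"].str.strip()

# Count conflict rows per year and region code.
region_counts = pd.crosstab(exploded["year"], exploded["region"])
print(region_counts)
```

The resulting columns are region codes, so they could be renamed to region names in the same way as in the cell above.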
fig, ax = plt.subplots(figsize=(16, 6))
regions.plot(kind='area', stacked=True, colormap='rainbow', ax=ax)
ax.set_title('Conflicts by region, 1989-2018')
ax.set_ylabel('Number of conflicts')
deaths['region_main'] = deaths['region_main'].map({
"1" : "Europe",
"2" : "Middle East",
"3" : "Asia",
"4" : "Africa",
"5" : "Americas"
})
plot = deaths.groupby(["region_main", "year"]).sum(numeric_only=True).reset_index()  # yearly death totals per main region
fig, ax = plt.subplots(figsize=(12,7))
sns.lineplot(x='year', y="bd_best", hue='region_main', palette=sns.color_palette('Paired', n_colors=5), ax=ax, data=plot)
# edit names of countries in deaths dataset to match names in arms dataset
side_a_list = deaths['side_a'].unique()
np.sort(side_a_list, axis=None)
# Edit the spellings in the deaths dataframe to match the way the countries are named in the arms dataset
deaths['side_a']= deaths['side_a'].replace({
"Yemen (North Yemen)" : "Yemen",
"United States of America" : "United States",
"Serbia (Yugoslavia)" : "Serbia",
"Russia (Soviet Union)" : "Soviet Union/Russia",
"Myanmar (Burma)" : "Myanmar",
"DR Congo (Zaire)" : "DR Congo",
"Cambodia (Kampuchea)" : "Cambodia"
})
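One way to find which names still need harmonizing is a set difference between the two name columns. The sketch below uses two small invented Series standing in for `deaths['side_a']` and `arms['Buyer']`; the names in them are examples only.

```python
import pandas as pd

# Hypothetical stand-ins for deaths['side_a'] and arms['Buyer'].
deaths_names = pd.Series(["Yemen (North Yemen)", "India", "Russia (Soviet Union)", "India"])
arms_names = pd.Series(["Yemen", "India", "Soviet Union/Russia"])

# Names that appear in the deaths data but have no exact match in the arms data.
unmatched = sorted(set(deaths_names.unique()) - set(arms_names.unique()))
print(unmatched)
```

Any name left in `unmatched` either needs an entry in the replacement dictionary or belongs to a country with no recorded purchases.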
#the deaths df starts in 1989 so we can just pull the arms transfer data from those years before merging. Then merge!
arms_merge = arms.loc[arms['Delivery year']>=1989]
arms_merge = arms_merge.groupby(['Buyer', 'Delivery year']).sum(numeric_only=True).reset_index()  # yearly numeric totals per buyer
merged = pd.merge(arms_merge, deaths, how='outer', left_on=['Buyer','Delivery year'],
right_on=['side_a','year'])
merged.head(2)
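When checking an outer merge like this one, the `indicator=True` option of `pd.merge` adds a `_merge` column that records whether each row came from the left table, the right table, or both. A minimal sketch on invented rows (the two toy tables are hypothetical and only reuse the notebook's key column names):

```python
import pandas as pd

# Hypothetical purchases and conflicts; values are invented for illustration.
toy_purchases = pd.DataFrame({
    "Buyer": ["X", "Y"],
    "Delivery year": [1990, 1990],
    "TIV delivery values": [5.0, 1.0],
})
toy_conflicts = pd.DataFrame({
    "side_a": ["X", "Z"],
    "year": [1990, 1990],
    "bd_best": [100, 40],
})

toy_merged = pd.merge(
    toy_purchases, toy_conflicts, how="outer",
    left_on=["Buyer", "Delivery year"], right_on=["side_a", "year"],
    indicator=True,  # adds a '_merge' column: left_only, right_only, or both
)
print(toy_merged["_merge"].value_counts())
```

Here `left_only` rows are purchases in a year without a recorded conflict, and `right_only` rows are conflicts in a year without a recorded purchase, which is where the nulls come from.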
#notice the thousands of nulls in the merged dataframe. Time to eliminate all nulls!
merged.info()
#Fill all NaNs in the new merged dataset.
merged['Buyer'] = merged['Buyer'].fillna(merged['side_a']) #Buyer and Side_a are the same
merged['year'] = merged['year'].fillna(merged['Delivery year']) #Delivery year and year of conflict are the same
#All columns in 'cols' are columns that are zero when null (either no purchase or no conflict).
#Note, '0' in the columns 'incompatibility' and 'type_of_conflict' will become new categories that represent
#"no conflict"
cols = ['Numbers delivered','TIV deal unit','TIV delivery values', 'incompatibility',
'bd_best', 'bd_low', 'bd_high', 'type_of_conflict']
merged[cols] = merged[cols].replace({np.nan:0})
# For years with no conflict, set 'conflict_id', 'side_b' and 'region_main' to 'no conflict'
merged[['conflict_id', 'side_b', 'region_main']] = merged[['conflict_id', 'side_b', 'region_main']].replace({np.nan: "no conflict"})
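# Drop unnecessary columns (errors='ignore' skips any listed column that an earlier aggregation already dropped)
merged = merged.drop(columns=['Delivery year', 'Order date', 'side_a', 'Deal ID', 'SIPRI estimate'], errors='ignore')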
#No more NaNs! We just have to change 'incompatibility', 'region_main' and 'type_of_conflict' to category data types.
for col in ['incompatibility', "type_of_conflict", 'region_main']:
    merged[col] = merged[col].astype('category')
print(merged.info())
#Now we're ready to analyze the data.
merged.head(2)
# The dataframe has not yet been grouped by buyer/year/conflict.
# Group on the columns that identify a conflict (Buyer, year, type_of_conflict, incompatibility, region_main) and sum the rest.
# observed=True keeps only the category combinations that actually occur in the data.
merged = merged.groupby(['Buyer', 'year', 'type_of_conflict', 'incompatibility', 'region_main'], observed=True).sum(numeric_only=True)
merged = merged.loc[merged['bd_best'].notnull()].reset_index()
# Drop any remaining null rows and we're done!
merged = merged.dropna()
merged.head()
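Since the stated goal is to look for a correlation between purchases and fatalities, one simple first check is a correlation coefficient between the two columns. The sketch below is an illustration on invented numbers, not a result from the merged data. It computes Pearson's r and a Spearman rank correlation (Pearson applied to ranks), which is less sensitive to the very large outliers that death tolls tend to have.

```python
import pandas as pd

# Hypothetical buyer-year rows; values are invented, not taken from the merged dataframe.
toy = pd.DataFrame({
    "TIV delivery values": [0.0, 10.0, 20.0, 400.0, 50.0],
    "bd_best": [0, 30, 25, 900, 100],
})

# Pearson correlation measures linear association.
pearson_r = toy["TIV delivery values"].corr(toy["bd_best"])

# Spearman correlation is Pearson correlation computed on the ranks.
spearman_r = toy["TIV delivery values"].rank().corr(toy["bd_best"].rank())

print(round(pearson_r, 3), round(spearman_r, 3))
```

On the real data the same two lines would use `merged` in place of `toy`; a coefficient alone would not show causation in either direction.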
my_dpi = 150
X_train = merged[["TIV delivery values"]]
y_train = merged["bd_best"]
X_new = pd.DataFrame()
X_new["TIV delivery values"] = np.arange(0, 15000, 100)
def get_NN_prediction(x_new, n):
    """Given a new observation, return the n-nearest-neighbors prediction."""
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    inds_sorted = dists.sort_values().index[:n]
    return y_train.loc[inds_sorted].mean()
fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi)
plt.scatter(merged['TIV delivery values'], merged['bd_best'], c="black", alpha=.3)
# plt.xscale('log')
# plt.yscale('log')
plt.ylim(0,25000)
plt.ylabel('deaths from conflict')
plt.xlabel('TIV Delivery Values')
plt.title('TIV Delivery Values Per Year Per Country vs Death Toll From Conflict (1989-2018)')
colors=['blue','green','red']
for i, k in enumerate([10, 50, 100]):
    y_new_pred = X_new.apply(get_NN_prediction, axis=1, args=(k,))
    y_new_pred.index = X_new["TIV delivery values"]  # use the TIV grid as the x-axis
    y_new_pred.plot.line(color=colors[i], label=str(k), legend=True)
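The hand-rolled predictor above averages the death tolls of the n rows whose TIV values are closest to a new TIV value. The same idea can be sketched with NumPy on a tiny invented dataset, which makes it easy to check by hand. The function name `knn_predict` and the numbers are hypothetical.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Mean of y over the k training points closest to x_new (single feature)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.abs(x_train - x_new)
    nearest = np.argsort(distances, kind="stable")[:k]  # indices of the k closest points
    return float(y_train[nearest].mean())

# Invented training data: TIV-like values and death-toll-like values.
x_toy = [0, 100, 200, 1000]
y_toy = [0, 10, 20, 500]

# The two closest points to 120 are 100 and 200, so the prediction is (10 + 20) / 2.
pred = knn_predict(x_toy, y_toy, 120, 2)
print(pred)
```

A small k follows the nearby points closely, while a large k pulls every prediction toward the overall mean, which is the trade-off the three plotted curves illustrate.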
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# get the features (in dict format) and the labels
# (do not split into training and validation sets)
features = ["TIV delivery values"]
X_dict = merged[features].to_dict(orient="records")
y = merged["bd_best"]
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    rmse = np.sqrt((-cross_val_score(
        pipeline, X_dict, y,
        cv=5, scoring="neg_mean_squared_error")).mean())
    return rmse
ks = pd.Series(range(1, 100))
ks.index = range(1, 100)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
test_errs.sort_values()
print(test_errs.min())
print(test_errs.idxmin())
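To make the cross-validation step less of a black box, here is a rough NumPy-only sketch of the same idea: split the rows into folds, predict each held-out fold from the others, and take the square root of the mean fold MSE. It is not the scikit-learn implementation. The helper names are hypothetical, the data is synthetic, the folds are shuffled with a fixed seed (an assumption of this sketch), and feature scaling is skipped because there is a single feature.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Mean of y over the k training points closest to x_new (single feature)."""
    nearest = np.argsort(np.abs(x_train - x_new), kind="stable")[:k]
    return y_train[nearest].mean()

def cv_rmse(x, y, k, n_folds=5, seed=0):
    """Square root of the mean per-fold MSE for k-nearest-neighbors regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, n_folds)
    fold_mse = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = np.array([knn_predict(x[train_idx], y[train_idx], x0, k)
                          for x0 in x[test_idx]])
        fold_mse.append(np.mean((preds - y[test_idx]) ** 2))
    return float(np.sqrt(np.mean(fold_mse)))

# Synthetic data, invented for illustration: y grows linearly with x.
x_toy = np.arange(50, dtype=float)
y_toy = 2.0 * x_toy

errors = {k: cv_rmse(x_toy, y_toy, k) for k in (1, 5, 20)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

The k with the lowest cross-validated RMSE is the one that would be chosen, mirroring `test_errs.idxmin()` above.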