Objective: Create a visual representation of the correlation matrix to highlight potential relationships between different risk tags.
Essential Terminology:
1. Smart Contract: A self-executing program stored on the blockchain that automatically carries out an agreement when certain conditions are met without needing a middleman.
- For Example: Let’s say you want to buy a digital artwork. A smart contract can be set up so that as soon as you send the payment, the artwork is automatically transferred to you. No need for a third party like PayPal or a bank.
- Smart contracts' use cases are ever-growing. They can even be used for something as simple as a pair of friends making a bet on which team will win a sports game.
- Why Are They Important?
2. Risk Tag: Think of risk tags like warning labels on food—they alert you to potential dangers before you interact with a wallet or contract. These risk tags are provided by Webacy, a security platform that helps protect crypto and NFT assets. There are a total of 32 risk tags, each representing a different type of risk.
- What do they detect?
The libraries:
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Scipy
- networkx
The dataset you will be using is a "compiled_risk_dataset". What exactly is in this dataset? This dataset contains 1094 entries of smart contract vulnerabilities. The first 3 columns contain essential information about the smart contract: the project name, the smart contract address, and the chain. The remaining columns are 32 potential risk tags that may be present in any given smart contract. The dataset is essentially a table that lists what specific risks are present in each contract.
Download the dataset and save it into a pandas dataframe. Print the first five rows using the .head()
function.
How will we calculate the correlations?
We will use the Phi coefficient, which is specifically designed for binary data. The Phi Coefficient is a measure of the association between two binary variables. To calculate the Phi coefficient, we first need to establish a function that can handle this calculation. Create a function that can compute correlations of 2 binary variables.
The following code creates a contingency table:
contingency_table = pd.crosstab(x, y)
The following code calculates the Phi coefficient:
chi2 = scipy.stats.chi2_contingency(contingency_table, correction=False)[0]
n = np.sum(np.sum(contingency_table))
phi = np.sqrt(chi2 / n)
phi_matrix = pd.DataFrame(index=risk_df.columns, columns=risk_df.columns)
Use the following code to calculate Phi coefficient for each pair of binary variables:
for var1 in risk_df.columns:
for var2 in risk_df.columns:
phi_matrix.loc[var1, var2] = phi_coefficient(risk_df[var1], risk_df[var2])
Now it is time to visualize our analysis. Use the following code to create a heatmap:
plt.figure(figsize=(12, 10)) # Setting the size of the plot
sns.heatmap(phi_matrix.astype(float), annot=False, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Heatmap of Phi Coefficients Between Risk Tags')
plt.xticks(rotation=45, ha='right')
plt.show()
Your heatmap should look like the following:
Positive correlations:
1. buy_tax and sell_tax
2. encode_packed_parameters and encode_packed_collision
3. encode_packed_collision and is_airdrop_scam
4. illegal_unicode and is_airdrop_scam
5. anti_whale_modifiable and slippage_modifiable
6. is_false_token and is_airdrop_scam
7. shadowing_local and is_airdrop_scam
8. shadowing_local and encode_packed_collision
Deeper insights: