Author: Anderson Hitoshi Uyekita
Project: Wrangle and Analyze Data
Course: Data Science - Foundations II
COD: ND111
Date: 02/01/2019
Version: 2.0
This project aims to give to the student a real case of how to gather, assess, clean, and analyze the data, in other words its englobes the Data Wrangling and Exploratory Data Analysis. The database used as an example is about the @dog_rate user from Twitter, as known as WeRateDogs™, this account has more than 7,572,000 followers, 9,500 tweets, and 141,000 likes.
The Data Gathering process bundled three different tasks, the first one download file from URL and later loading to the Jupyter Notebook, which requires a manual step, the second downloading a file programmatically, and the third gathering data from the Twitter API. This step has also required to save these data in a local machine.
Based on the data gathered, I have assessed the most evident issues (17 issues in total) and documented it to create a record of modifications. Later, in Data Cleaning process I have fixed all identified issues, and I have also merged (the two downloaded files from the Data Gathering process) into one and added some missing values (from the archive downloaded from the Twitter API). The final data frame was stored as twitter_archive_master.csv
.
In the Data Analysis and Visualization, which I have interpreted as Exploratory Analysis, I have posed few questions to guide my analysis, which lead me to found strong evidence of:
Finally, the Project also has other deliverables, which could be accessed by the following links:
This Jupyter Notebook aims to document the process of Gathering, Assessing, and Cleaning of Data Science Foundations II Nanodegree Program Project 02.
I have written this report using the Jupyter Notebook, which will allow anyone to reproduce this document in a local computer. I have made a great effort to document every step to turn this document understandable.
I have performed this project using:
This report has written using four software:
I have used the Atom to push to Github repository and to visualize some files, and nothing more than this.
I kindly ask you to install each of this packages before you run the next steps.
You can access all files of this report in this repository:
For this project I will use some libraries to manage data frames, download file, Twitter API Client code, etc.. Let's import this packages to the Jupyter Notebook environment.
# Importing Library.
# Importing pandas to work with DataFrames.
import pandas as pd
pd.set_option('mode.chained_assignment','raise')
# Importing numpy to general methods.
import numpy as np
# Importing OS package to write and open files.
import os
# Importing the requests to create a object from a given URL.
import requests
# Importing the Client Code to use the Twitter API.
import tweepy
# Importing the json package to easy my task with json files
import json
# Importing the time package to calculate the elapse time to access the Twitter API
import time
# Importing the re package to use the regular expressions.
import re
# Importing the matplotlib to create graphics
import matplotlib.pyplot as plt
%matplotlib inline
# Import seaborn to better the visualization
import seaborn as sns
There are two (2) files hosted in Udacity website that I must have to gather (download).
image-predictions.tsv
, and;twitter-archive-enhanced.csv
.Later, I will access the Twitter API to gather additional data.
First, I will create a folder to store the data and later I will download each file.
# Creating a folder to store the files.
folder = "01-Dataset" # Folder's name.
# Checking if the folder already exist.
if not folder in os.listdir():
os.mkdir(folder) # Creating the folder when 01-Dataset do not exist.
# List of files.
list_url = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv',
'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv']
# Loop to download the two files hosted in Udacity website.
for url in list_url:
if not url.split('/')[-1].replace('-','_') in os.listdir(folder): # Check if the files already exist.
response = requests.get(url) # Creating an object of the given URL.
with open(os.path.join(folder,url.split('/')[-1].replace('-','_')), mode = 'wb') as file: # Defining the file name.
file.write(response.content) # Saving the object.
Now that all file is saved in the local machine (stored in the 01-Dataset
folder), let's load each file into the Jupyter Notebook environment.
# Loading the image precictions
df_img = pd.read_csv('01-Dataset/image_predictions.tsv', sep = '\t') # tsv file need to configure the sep as tabular.
# Loading the archive of WeRateDogs.
df_ach = pd.read_csv('01-Dataset/twitter_archive_enhanced.csv') # regular csv file.
These two files are ready for the next step of Data Wrangling, which is called Data Assessing.
One of the Project requirements is to access the Twitter API to create the tweet_json.txt
completing some missing/wrong values of the tweet-json
file. I will use the tweepy package (a client code) to access the Twitter API.
Please, fill with your own consumer_key
, consumer_secret
, access_token
and access_secret
.
# Complete with your own keys and secretes.
consumer_key = '' # API key
consumer_secret = '' # API secret key
access_token = '' # Access token
access_secret = '' # Access token secret
Based on the given consumer_key
, consumer_secret
, access_token
and access_secret
I will access the Twitter API.
The following chunk of code has a default configuration of the API, bear in mind, the wait_on_rate_limit
and wait_on_rate_limit_notify
, both are set as True
.
- wait_on_rate_limit – Whether or not to automatically wait for rate limits to replenish
- wait_on_rate_limit_notify – Whether or not to print a notification when Tweepy is waiting for rate limits to replenish
Obs.: The excerpt was extracted from the Tweepy website.
# Authentication (default configuration).
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
The reason to use the Twitter API is to gather additional data and missing data from the twitter_archive_enhanced.csv
file. For this reason, the strategy is to find for each tweet_id the missing and the additional data.
First, I need to know the quantity of unique tweets twitter_archive_enhanced.csv
have.
# Unique tweets of twitter-archive-enhanced-2.csv file.
u_twt_id = df_ach.tweet_id.tolist()
# Print the number of unique tweets.
len(df_ach.tweet_id.unique())
2356
The quantity probably will raise/trigger the TweepError
, which require the API use wait some minutes to keep gathering data from Twitter.
I have also configured the tweet_mode
to extended
because I want all messages without the ...
(elipsis), Twitter has increased the character's number from 140 to 280.
For a deeper understanding of _json
dictionary, please read the Appendix.
# Initializing the dicitionary to store the raw data from Twitter API JSON file.
dct_twt_raw = {}
# This is the list of error.
df_error = []
# Current time when the Twitter API starts.
str_loop = time.time()
# Loop to find each tweet_id of u_twt_id list.
for tweet in u_twt_id:
# This is necessary to avoid an interruption if any tweet_id is not find.
try:
# The get_status will find a specific tweet_id and return all the tweet.
twt_status = api.get_status(tweet,
wait_on_rate_limit = True, # Automatically wait for rate limits to replenish.
wait_on_rate_limit_notify = True, # Print the notification "Rate limit reached..."
tweet_mode = 'extended') # I want the full text without "...".
# The data I want is the _json dictionary.
dct_twt_raw[str(tweet)] = twt_status._json # Bear in mind, _json is a dictionary.
# In case of a problem, a message will be printed.
except:
df_error.append(str(tweet)) # Update df_error list.
print("tweet_id did not found: {}".format(str(tweet)))
# Current time when the Twitter API ends.
end_loop = time.time()
tweet_id did not found: 888202515573088257 tweet_id did not found: 873697596434513921 tweet_id did not found: 872668790621863937 tweet_id did not found: 869988702071779329 tweet_id did not found: 866816280283807744 tweet_id did not found: 861769973181624320 tweet_id did not found: 845459076796616705 tweet_id did not found: 842892208864923648 tweet_id did not found: 837012587749474308 tweet_id did not found: 827228250799742977 tweet_id did not found: 812747805718642688 tweet_id did not found: 802247111496568832 tweet_id did not found: 775096608509886464 tweet_id did not found: 770743923962707968
Rate limit reached. Sleeping for: 565
tweet_id did not found: 754011816964026368 tweet_id did not found: 680055455951884288
Rate limit reached. Sleeping for: 566
Unfortunately, some tweet_id was not found in @dog_rates
tweets database, for some reason the tweet disappeared.
I have stored this information in a separated list in case of necessity I can use it later.
For sanity reason, I have recorded the time elapsed to gather data from Twitter API.
To query all of the tweet IDs in the WeRateDogs Twitter archive, 20-30 minutes of running time can be expected. Printing out each tweet ID after it was queried and using a code timer were both helpful for sanity reasons.
# Calculating the elapsed time in minutes.
elapsed_time = (end_loop - str_loop)/60
# Printing the elapsed time of the loop.
print("Elapsed time: {} minutes".format(round(elapsed_time,2)))
Elapsed time: 33.57 minutes
Following the project's instruction, let's create the tweet_json.txt
file.
Bear in mind, the dct_twt_raw
is a dictionary, and I will save it as a JSON file.
# Writing the tweet_json.txt in the 01-Dataset folder.
with open('01-Dataset/tweet_json.txt', mode = 'w') as file:
json.dump(dct_twt_raw, file)
Also according to the project's instructions, I need to read the exported file (tweet_json.txt
).
Take consideration tweet_json.txt
file is a dictionary and I do want a DataFrame, I can use the read_json
from pandas package to import as a DataFrame. I have also defined the orient
as an index
to be in conformity with the definition of tidy dataset preconized by Hadley Wickham.
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
# Loading the tweet_json.txt as a panda DataFrame.
df_twt_raw = pd.read_json('01-Dataset/tweet_json.txt', orient = 'index') # Atention to the orient defined as index.
# Printing the first 2 rows.
df_twt_raw.head(2)
contributors | coordinates | created_at | display_text_range | entities | extended_entities | favorite_count | favorited | full_text | geo | ... | quoted_status | quoted_status_id | quoted_status_id_str | quoted_status_permalink | retweet_count | retweeted | retweeted_status | source | truncated | user | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1998-04-12 22:37:23.555336193 | NaN | NaN | 2017-08-01 16:23:56 | [0, 85] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | 38075 | 0 | This is Phineas. He's a mystical boy. Only eve... | NaN | ... | NaN | NaN | NaN | NaN | 8328 | 0 | NaN | <a href="http://twitter.com/download/iphone" r... | 0 | {'id': 4196983835, 'id_str': '4196983835', 'na... |
1998-04-10 03:03:41.306343426 | NaN | NaN | 2017-08-01 00:17:27 | [0, 138] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | 32680 | 0 | This is Tilly. She's just checking pup on you.... | NaN | ... | NaN | NaN | NaN | NaN | 6151 | 0 | NaN | <a href="http://twitter.com/download/iphone" r... | 0 | {'id': 4196983835, 'id_str': '4196983835', 'na... |
2 rows × 32 columns
Finally, I have gathered all three datasets:
tweet_json.txt
as df_twt_raw
;image_predictions.tsv
as df_img
, and;twitter_archive_enhanced.csv
as df_ach
.The issues could be defined into two types:
Project requirements extracted from the project rubric.
At least eight (8) data quality issues and two (2) tidiness issues are detected, and include the issues to clean to satisfy the Project Motivation. Each issue is documented in one to a few sentences each.
I will perform a straightforward assessment.
So, for each dataset, I will point out all problem I have detected, and later in the Data Cleaning process I will clean and fix these issues.
The twitter_archive_enhanced.csv
file has loaded to Jupyter Notebook environment as df_ach
. As the first contact to this data frame, let's start to investigate this archive using the traditional visual assessment such .head()
and .tail()
methods.
.head()
¶# Print the first 5 rows of the data set.
df_ach.head()
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
.tail()
¶# Print the last 5 rows of the data set.
df_ach.tail()
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
As a result of the .head()
and .tail()
methods, I can point out the following issues recorded in Table 1.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
1 | Quality | Validity | Visual | name | Invalid names or non-standard names. |
2 | Tidiness | - | Visual | source | HTML tags, URL, and content in a single column. |
I have found some weird names, which has required a deeper analysis using the .value_counts()
method to examine the recurrence of this error.
.value_counts()
¶# What is the most commom name to a Dog!?
df_ach.name.value_counts()
None 745 a 55 Charlie 12 Lucy 11 Oliver 11 Cooper 11 Penny 10 Lola 10 Tucker 10 Bo 9 Winston 9 the 8 Sadie 8 Bailey 7 an 7 Daisy 7 Toby 7 Buddy 7 Dave 6 Stanley 6 Rusty 6 Jack 6 Koda 6 Oscar 6 Leo 6 Milo 6 Scout 6 Jax 6 Bella 6 Louis 5 ... such 1 Jeb 1 Fletcher 1 Laela 1 space 1 Monkey 1 Ferg 1 Longfellow 1 Mojo 1 Ricky 1 Vinscent 1 Koko 1 Chloe 1 Shiloh 1 Thor 1 Rumpole 1 Kanu 1 Filup 1 Berkeley 1 Goose 1 Superpup 1 Eazy 1 Comet 1 Tonks 1 Zara 1 Bluebert 1 Lili 1 Caryl 1 Sprout 1 Pepper 1 Name: name, Length: 957, dtype: int64
The .value_counts()
method applied to the name
's columns shows a higher number of weird and probably wrong dogs names.
This list ensures the problem already identified visually.
Let's deep in the rating_denominator
and rating_numerator
.
# What is the most commom denominator. I guess it must be 10.
df_ach.rating_denominator.value_counts()
10 2333 11 3 50 3 80 2 20 2 2 1 16 1 40 1 70 1 15 1 90 1 110 1 120 1 130 1 150 1 170 1 7 1 0 1 Name: rating_denominator, dtype: int64
# In the past @dog_rates used to rate with grades between 0 to 10.
df_ach.rating_numerator.value_counts()
12 558 11 464 10 461 13 351 9 158 8 102 7 55 14 54 5 37 6 32 3 19 4 17 1 9 2 9 420 2 0 2 15 2 75 2 80 1 20 1 24 1 26 1 44 1 50 1 60 1 165 1 84 1 88 1 144 1 182 1 143 1 666 1 960 1 1776 1 17 1 27 1 45 1 99 1 121 1 204 1 Name: rating_numerator, dtype: int64
As you can see, both columns have problems in some observations. Table 2 summarizes the issues.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
1 | Quality | Validity | Visual | name | Invalid names or non-standard names. |
2 | Tidiness | - | Visual | source | HTML tags, URL, and content in a single column. |
3 | Quality | Validity | Programmatic | rating_numerator | Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from int to float . |
4 | Quality | Validity | Programmatic | rating_denominator | Invalid denominator, I expected a fixed base. Data Structure must be converted from int to float . |
.info()
¶# An overview of the df_ach.
df_ach.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2356 entries, 0 to 2355 Data columns (total 17 columns): tweet_id 2356 non-null int64 in_reply_to_status_id 78 non-null float64 in_reply_to_user_id 78 non-null float64 timestamp 2356 non-null object source 2356 non-null object text 2356 non-null object retweeted_status_id 181 non-null float64 retweeted_status_user_id 181 non-null float64 retweeted_status_timestamp 181 non-null object expanded_urls 2297 non-null object rating_numerator 2356 non-null int64 rating_denominator 2356 non-null int64 name 2356 non-null object doggo 2356 non-null object floofer 2356 non-null object pupper 2356 non-null object puppo 2356 non-null object dtypes: float64(4), int64(3), object(10) memory usage: 220.9+ KB
The .info()
method provide a good overview of each column, which allow me to identify latents issues. Table 3 aggregates these new issues.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
1 | Quality | Validity | Visual | name | Invalid names or non-standard names. |
2 | Tidiness | - | Visual | source | HTML tags, URL, and content in a single column. |
3 | Quality | Validity | Programmatic | rating_numerator | Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from int to float . |
4 | Quality | Validity | Programmatic | rating_denominator | Invalid denominator, I expected a fixed base. Data Structure must be converted from int to float . |
5 | Tidiness | - | Programmatic | doggo, floofer, pupper, and puppo | This is a categorical variable, and I can combine these columns into one column. |
6 | Tidiness | - | Programmatic | text | There is two information in a single column. Split the text from the URL. |
7 | Quality | Validity | Programmatic | timestamp | Convert to date. |
8 | Quality | Validity | Programmatic | tweet_id | Following the example of zip code, it must be a string. |
9 | Quality | Accuracy | Programmatic | retweeted_status_id | The same dog could be recorded twice or more in cases of retweets. |
10 | Quality | Accuracy | Programmatic | in_reply_to_status_id | The same dog could be recorded twice or more in cases of reply. |
The retweeted_status_id
, retweeted_status_user_id
, in_reply_to_status_id
, and in_reply_to_user_id
could lead wrong results because it will duplicate the same dog picture in each retweet or reply. For this reason, it's good to remove this tweet_id
's.
An example of this problem is shown below. The dog so-called "Sierra" will appears twice due to the retweet.
# Subsetting the df_ach to find a retweet.
df_ach[df_ach.name == 'Sierra']
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97 | 873337748698140672 | NaN | NaN | 2017-06-10 00:35:19 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: This is Sierra. She's one preci... | 8.732138e+17 | 4.196984e+09 | 2017-06-09 16:22:42 +0000 | https://www.gofundme.com/help-my-baby-sierra-g... | 12 | 10 | Sierra | None | None | pupper | None |
98 | 873213775632977920 | NaN | NaN | 2017-06-09 16:22:42 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Sierra. She's one precious pupper. Abs... | NaN | NaN | NaN | https://www.gofundme.com/help-my-baby-sierra-g... | 12 | 10 | Sierra | None | None | pupper | None |
Let's take a look at the duplicated tweet_id
.
# Duplicated tweet_id.
sum(df_ach.tweet_id.duplicated())
0
Although there are no duplicate tweet_id
, I need to be aware of the retweet, which creates a new tweet_id
for the same dog, resulting in double entries (or more).
# Printing the first 5 rows.
df_img.head()
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
.tail()
¶# Printing the last 5 rows.
df_img.tail()
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
Visually I did not find much, unless the inconsistency in the breed name, some has the first letter capitalized and other not. I will record this issue in Table 4.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
11 | Quality | Consistency | Visual | p1, p2, and p3 | Dog's breed has no standard. Capital letter or lowercase names. |
.info()
¶# An overview of the df_img.
df_img.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2075 entries, 0 to 2074 Data columns (total 12 columns): tweet_id 2075 non-null int64 jpg_url 2075 non-null object img_num 2075 non-null int64 p1 2075 non-null object p1_conf 2075 non-null float64 p1_dog 2075 non-null bool p2 2075 non-null object p2_conf 2075 non-null float64 p2_dog 2075 non-null bool p3 2075 non-null object p3_conf 2075 non-null float64 p3_dog 2075 non-null bool dtypes: bool(3), float64(3), int64(2), object(4) memory usage: 119.6+ KB
One more issue observed is the type of tweet_id variable. Table 5 summarizes the issues found in image_predictions.tsv
.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
11 | Quality | Consistency | Visual | p1, p2, and p3 | Dog's breed has no standard. Capital letter or lowercase names. |
12 | Quality | Validity | Programmatic | tweet_id | Convert to string. |
Let's analyze the duplicated entries.
.duplicated()
¶# Duplicated tweet_id.
sum(df_img.tweet_id.duplicated())
0
# Duplicated jpg_url.
sum(df_img.jpg_url.duplicated())
66
The tweet_id
is a unique key to the tweet and not to the rated dog. For this reason, it is a good idea to filter for a unique picture of the dog, which will also remove the retweets. Table 6 summarizes the issues of the image_predictions.tsv
file.
Issue ID | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|
11 | Quality | Consistency | Visual | p1, p2, and p3 | Dog's breed has no standard. Capital letter or lowercase names. |
12 | Quality | Validity | Programmatic | tweet_id | Convert to string. |
13 | Quality | Validity | Programmatic | jpg_url | It has duplicated images and consequently double entry. |
Based on the tidy data concept, these two tables (twitter_archive_enhanced.csv
and image_predictions.tsv
) could be merged into one, using the tweet_id
as a mapping key.
Summarizing Table 3, Table 6, and this new issue, Table 7 is the preliminary version of the issues on all dataset of this project.
Issue ID | Table | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|---|
1 | df_ach | Quality | Validity | Visual | name | Invalid names or non-standard names. |
2 | df_ach | Tidiness | - | Visual | source | HTML tags, URL, and content in a single column. |
3 | df_ach | Quality | Validity | Programmatic | rating_numerator | Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from int to float . |
4 | df_ach | Quality | Validity | Programmatic | rating_denominator | Invalid denominator, I expected a fixed base. Data Structure must be converted from int to float . |
5 | df_ach | Tidiness | - | Programmatic | doggo, floofer, pupper, and puppo | This is a categorical variable, and I can combine these columns into one column. |
6 | df_ach | Tidiness | - | Programmatic | text | There is two information in a single column. Split the text from the URL. |
7 | df_ach | Quality | Validity | Programmatic | timestamp | Convert to date. |
8 | df_ach | Quality | Validity | Programmatic | tweet_id | Following the example of zip code, it must be a string. |
9 | df_ach | Quality | Accuracy | Programmatic | retweeted_status_id | The same dog could be recorded twice or more in cases of retweets. |
10 | df_ach | Quality | Accuracy | Programmatic | in_reply_to_status_id | The same dog could be recorded twice or more in cases of reply. |
11 | df_img | Quality | Consistency | Visual | p1, p2, and p3 | Dog's breed has no standard. Capital letter or lowercase names. |
12 | df_img | Quality | Validity | Programmatic | tweet_id | Convert to string. |
13 | df_img | Quality | Validity | Programmatic | jpg_url | It has duplicated images and consequently double entry. |
14 | twt_ach_mstr | Tidiness | - | Programmatic | - | Merging these two tables (df_ach and df_img ) into one. |
Where:
df_ach
: twitter_archive_enhanced.csv;df_img
: image_predictions.tsv;twt_ach_mstr
: twitter_archive_master.csv, and;df_twt_raw
: tweet_json.txt.Have in mind the Data Assessing is not an exhaustive step, and neither aims to find all issues of these two tables. Probably, this step will be revisited again to find hidden issues.
# Copying the df_img.
df_img_cln = df_img.copy()
# Copying the df_ach.
df_ach_cln = df_ach.copy()
# Copying the df_twt_raw.
df_twt_cln = df_twt_raw.copy()
There is no correct sequence to fix the issues, for this reason, I have decided to start fixing the problems related to duplicated entries, data type, and later merging the tables df_ach
and df_img
.
Issue: Duplicated images and consequently double entry.
# Vector to indexing the duplicated jpg url.
indexing = df_img.jpg_url.duplicated()
# I want the opposite. All observations except the duplicated.
indexing = np.logical_not(indexing)
# Subsetting and assigning to the df_img cleaned.
df_img_cln = df_img_cln[indexing]
# New dimensions of df_img_cln
print("Before: {} rows.\nAfter: {} rows.".format(df_img_cln.shape[0], df_img.shape[0]))
Before: 2009 rows. After: 2075 rows.
# Testing the code given the defined solution.
print("{} duplicated.".format(sum(df_img_cln.jpg_url.duplicated())))
0 duplicated.
As you can see, I have removed 66 observations (the original observation's number is 2075). To ensure I have printed the number of duplicated items in the jpg_url column, which is now zero.
Due to the same nature of these two issues, I decided to bundle it to avoid repetition.
Issue ID 9: The same dog could be recorded twice or more in cases of retweets.
Issue ID 10: The same dog could be recorded twice or more in cases of reply.
retweeted_status_id
non-null, and;in_reply_to_status_id
non-null.Following the same idea of Issue 14, I will remove any retweet and reply.
# CODE - Issue ID 9
# Creating a vector to select the retweeted_status_id different of Null.
indexing = df_ach_cln.retweeted_status_id.isnull()
# Subsetting the df_ach_cln excluding the retweeted.
df_ach_cln = df_ach_cln[indexing]
# CODE - Issue ID 10
# Creating a vector to select the in_reply_to_status_id different of Null.
indexing = df_ach_cln.in_reply_to_status_id.isnull()
# Subsetting the df_ach_cln excluding the reply.
df_ach_cln = df_ach_cln[indexing]
# Testing: Expect all rows as True when using the .isnull()
print("Number of rows with true in retweeted_status_id:", sum(df_ach_cln.retweeted_status_id.isnull()))
# Testing: Expect all rows as True when using the .isnull()
print("Number of rows with true in in_reply_to_status_id:", sum(df_ach_cln.in_reply_to_status_id.isnull()))
# Number of rows in df_ach_cln.
print("Number of rows of df_ach_cln:",df_ach_cln.shape[0])
Number of rows with true in retweeted_status_id: 2097 Number of rows with true in in_reply_to_status_id: 2097 Number of rows of df_ach_cln: 2097
The test shows both columns, retweeted_status_id and in_reply_to_status_id, do not have any content, which means are two empty columns.
Issue: Following the example of zip code, it must be string.
Following the same idea from this thread of Stack Overflow:
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
tweet_id
to str.# Converting the tweet_id to string.
df_ach_cln.tweet_id = df_ach_cln.tweet_id.astype(str)
# Printing the info to ensure the convertion. Bear in mind, in pandas object is string.
df_ach_cln.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 17 columns): tweet_id 2097 non-null object in_reply_to_status_id 0 non-null float64 in_reply_to_user_id 0 non-null float64 timestamp 2097 non-null object source 2097 non-null object text 2097 non-null object retweeted_status_id 0 non-null float64 retweeted_status_user_id 0 non-null float64 retweeted_status_timestamp 0 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null int64 rating_denominator 2097 non-null int64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: float64(4), int64(2), object(11) memory usage: 204.8+ KB
# Converting the tweet_id to string.
df_img_cln.tweet_id = df_img_cln.tweet_id.astype(str)
# Printing the info to ensure the convertion. Bear in mind, in pandas object is string.
df_img_cln.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2009 entries, 0 to 2074 Data columns (total 12 columns): tweet_id 2009 non-null object jpg_url 2009 non-null object img_num 2009 non-null int64 p1 2009 non-null object p1_conf 2009 non-null float64 p1_dog 2009 non-null bool p2 2009 non-null object p2_conf 2009 non-null float64 p2_dog 2009 non-null bool p3 2009 non-null object p3_conf 2009 non-null float64 p3_dog 2009 non-null bool dtypes: bool(3), float64(3), int64(1), object(5) memory usage: 123.6+ KB
Based on the .info()
the tweet_id
has converted as object
, which is in pandas the same of a string.
Issue: Based on the concept of tidy data preconized by Hadley Wickham, it is a good idea to merge these two tables.
The merging process will use the tweet_id
as a Key.
Obs.: Read more about merging in Pandas Website.
df_ach_cln
and df_img_cln
.# Creating the twitter_archive_master.
twt_ach_mstr = pd.merge(df_ach_cln, df_img_cln, on = 'tweet_id')
# The first 5 rows of the merged table.
twt_ach_mstr.head()
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | ... | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | ... | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | ... | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | ... | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | ... | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | ... | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
5 rows × 28 columns
# Printing the number of columns of df_ach_cln and df_img_cln.
print("df_ach_cln: {} columns\ndf_img_cln: {} columns".format(df_ach_cln.shape[1], df_img_cln.shape[1]))
df_ach_cln: 17 columns df_img_cln: 12 columns
As you can see, the twt_ach_mstr
, data frame result from the merging process of df_ach_cln
and df_img_cln
, has 28 columns, which is in compass to the summation of the columns of df_ach_cln
and df_img_cln
minus the "key" parameter (17 + 12 - 1 = 28).
Issue: Invalid names or non-standard names.
The expression used to track the dog's name (in other words the regular expression) is likely "This is DOG_NAME" when DOG_NAME is changed by "a funny" or "an incredible" or "very something", the expression did not "scrap" the right name. In some cases neither a name the dog has, but the expression with taking whatever it lays in that position.
For this reason, I have changed every non-standard name to "None".
# Initialization of variable.
non_names = []
# Loop to find ordinary word.
for index in twt_ach_mstr.name:
# Will check every name start with lowercase.
if index.islower():
# If yes will append to non_names.
non_names.append(index)
# The list will filter only unique values
non_names = list(set(non_names))
# Printing non-standard names.
non_names
['not', 'this', 'just', 'actually', 'my', 'one', 'infuriating', 'the', 'unacceptable', 'all', 'by', 'his', 'light', 'very', 'quite', 'an', 'officially', 'a', 'space', 'getting', 'incredibly', 'such']
The list of names shown above has all non-standard names from name column. I had replaced it by None
using the .replace()
method.
# Loop to replace each non standard name (non_name).
for index in non_names:
twt_ach_mstr.name.replace(index,
"None",
inplace = True)
# Based on the list of non-standard names. Let's ensure if all have been removed.
sum(twt_ach_mstr.name.isin(non_names))
0
The test ensure there are no more non-standard names.
Issue: HTML tags, URL, and content in a single column.
I will edit it because there are two pieces ofinformations stored in a single column.
Obs.: I have read this thread in Stack Overflow to revise how to use the .concat()
.
href
, and;source
column the content between HTML tags.# Creating auxiliary list to store the source and source_url.
aux_ls_sr = []
aux_ls_sr_url = []
# This loop will remove all HTML tags using .split() and indexing.
for index in twt_ach_mstr.source:
aux_ls_sr.append(index.split('</a')[0].split('<a href="')[1].split('" rel="nofollow">')[1] )
aux_ls_sr_url.append(index.split('</a')[0].split('<a href="')[1].split('" rel="nofollow">')[0] )
# Updating the source column with aux_ls_sr.
twt_ach_mstr.source = aux_ls_sr
# Appending a new column with source_url info.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(aux_ls_sr_url, columns = ['source_url'])], axis = 1)
# Printing the first 5 rows of a subsetted data frame.
twt_ach_mstr[['tweet_id','source','source_url']].head()
tweet_id | source | source_url | |
---|---|---|---|
0 | 892420643555336193 | Twitter for iPhone | http://twitter.com/download/iphone |
1 | 892177421306343426 | Twitter for iPhone | http://twitter.com/download/iphone |
2 | 891815181378084864 | Twitter for iPhone | http://twitter.com/download/iphone |
3 | 891689557279858688 | Twitter for iPhone | http://twitter.com/download/iphone |
4 | 891327558926688256 | Twitter for iPhone | http://twitter.com/download/iphone |
# Printing the categories of source column.
twt_ach_mstr.source.value_counts()
Twitter for iPhone 1932 Twitter Web Client 28 TweetDeck 11 Name: source, dtype: int64
# Printing the new column.
twt_ach_mstr.source_url.value_counts()
http://twitter.com/download/iphone 1932 http://twitter.com 28 https://about.twitter.com/products/tweetdeck 11 Name: source_url, dtype: int64
The source
column now does not show any HTML tag, even more, the href URL from the HTML tag was converted into a new feature stored in source_url
column.
Issue: There are two information in a single column.
In the same way, the text
column (of twt_ach_mstr
) has two pieces of information, the text and the URL at the end of the message. Let's divide this into two separated columns.
text
column, and;# Creating auxiliary list to store the text and text_url.
aux_txt = []
aux_txt_url = []
# This loop will remove all HTML tags using .split() and indexing.
for index in twt_ach_mstr.text:
aux_txt.append(' '.join(index.split(" ")[:-1]))
aux_txt_url.append(index.split(" ")[-1])
# Updating the source column with aux_ls_sr.
twt_ach_mstr.text = aux_txt
# Appending a new column with source_url info.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(aux_txt_url, columns = ['text_url'])], axis = 1)
# Printing the first 5 rows of a subsetted dataframe.
twt_ach_mstr[['tweet_id','text','text_url']].head()
tweet_id | text | text_url | |
---|---|---|---|
0 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | https://t.co/MgUWQ76dJU |
1 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | https://t.co/0Xxu71qeIV |
2 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | https://t.co/wUnZnhtVJB |
3 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | https://t.co/tD36da7qLQ |
4 | 891327558926688256 | This is Franklin. He would like you to stop ca... | https://t.co/AtUZn91f7f |
# Printing the first 5 rows of text column.
twt_ach_mstr.text.tolist()[:5]
["This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10", "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10", 'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10', 'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us', 'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek']
The URL was removed and assigned to a new column (text_url
) and the text
is free of URL at the end of each tweet.
Issue: Invalid ratings. Value varies from 1776 to 0. Data Strucutre must be converted from int to float.
Issue: Invalid denominator, I expected a fixed base. Data Strucutre must be converted from int to float.
First of all, I will convert both columns, rating_numerator
and rating_denominator
, from int
to float
because in some cases the @dog_rates
uses float
rating number.
rating_numerator
and rating_denominator
to float
;rating_numerator
, and;# Converting the rating_numerator and rating_denominator to float.
twt_ach_mstr.rating_numerator = twt_ach_mstr.rating_numerator.astype(float)
twt_ach_mstr.rating_denominator = twt_ach_mstr.rating_denominator.astype(float)
Most of the numerator values are in conformity to what I expected, but few have extreme values such as 1776, 420, 204, and so on.
For this reason, I am going to subset the twt_ach_mstr
data frame to realize what is in common with these observations.
Obs.: I have read this thread in Stack Overflow to learn about .isin()
# Creating the black list
black_list = sorted(twt_ach_mstr.query('rating_numerator > 15').rating_numerator.tolist())
# Let's subset the twt_ach_mstr using the .isin() method.
twt_ach_mstr[twt_ach_mstr.rating_numerator.isin(black_list)][['tweet_id','text']]
tweet_id | text | |
---|---|---|
320 | 820690176645140481 | The floofs have been released I repeat the flo... |
382 | 810984652412424192 | Meet Sam. She smiles 24/7 & secretly aspir... |
499 | 786709082849828864 | This is Logan, the Chow who lived. He solemnly... |
549 | 778027034220126208 | This is Sophie. She's a Jubilant Bush Pupper. ... |
658 | 758467244762497024 | Why does this never happen at my front door...... |
722 | 749981277374128128 | This is Atticus. He's quite simply America af.... |
842 | 731156023742988288 | Say hello to this unbelievably well behaved sq... |
918 | 716439118184652801 | This is Bluebert. He just saw that both #Final... |
939 | 713900603437621249 | Happy Saturday here's 9 puppers on a bench. 99... |
963 | 710658690886586372 | Here's a brigade of puppers. All look very pre... |
981 | 709198395643068416 | From left to right:\nCletus, Jerome, Alejandro... |
1045 | 704054845121142784 | Here is a whole flock of puppers. 60/50 I'll ... |
1120 | 697463031882764288 | Happy Wednesday here's a bucket of pups. 44/40... |
1288 | 684222868335505415 | Someone help the girl is being mugged. Several... |
1359 | 680494726643068929 | Here we have uncovered an entire battalion of ... |
1420 | 677716515794329600 | IT'S PUPPERGEDDON. Total of 144/120 ...I think |
1478 | 675853064436391936 | Here we have an entire platoon of puppers. Tot... |
1696 | 670842764863651840 | After so many requests... here you go.\n\nGood... |
I have found some tweets not related to grading/rating:
And I have found some special ratings:
These three observations will be removed. Although this is not a good practice, I will insert manually the tweet_id
codes to a list.
# Remove those three tweet_id's.
rm_list = ['810984652412424192',
'749981277374128128',
'670842764863651840']
# Creating a vector to subset twt_ach_mstr and remove the tweet_id from the rm_list.
indexing = np.logical_not(twt_ach_mstr.tweet_id.isin(rm_list))
# Updating the twt_ach_mstr data frame.
twt_ach_mstr = twt_ach_mstr[indexing]
I know there are some rating that is not an integer, and for this reason, I will use a .for()
loop to find all float rating values. The only way to find the float ratings values are using the regex (regular expression). The problem identified in the rating_numerator
and rating_denominator
is the original regex used to scrap the rating only take account the digits after the point, which lead an error when the rating is 9.75 becomes 75.
Obs. 1: I have had some trouble to write this chunk of code because I can not update a DataFrame view. I read this document to surpass this issue.
Obs. 2: This is quite difficult to understand the regex, but I found this document and this one to guide me.
# Loop to find any non integer rating.
for index in twt_ach_mstr.tweet_id:
# Extracting the text of each tweet
twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
# This regex will find any "float" rating among the words text.
rating = re.findall('\d+\.\d+\/\d+', twt_txt)
# If any rating were found, the length of rating will be different of zero
if len(rating) > 0:
# rating is a list of one element like this [9.9/10], I want only the first part the 9.9.
numerator = rating[0].split('/')[0]
# This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)
The next observation there is more than one pet, so I decided to calculate the mean.
Now, I am looking for some exceptions, like messages when the pattern XX/XX
appears more than once. It occurs when @dog_rates
gives two or more ratings in a single tweet (or even talks about a specific day). The error identified in rating_numerador
and rating_denominator
is the second rating is totally drop. I will do the average between these two ratings as a final rating of the tweet.
I have identified one special case which the expression 50/50 (fifty/fifty) is used, and the regex parses it as a rating.
# Loop to find any non integer rating.
for index in twt_ach_mstr.tweet_id:
# Extracting the text of each tweet
twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
# This regex will find the pattern NN/NN.
rating = re.findall('\d+\d\/\d+', twt_txt)
# If any rating is a list of two elements.
if len(rating) == 2:
if(rating[0] == '50/50'):
# rating is a list of two elements like this [50/50, 9.9/10], I want only the first part of the second element.
numerator = rating[1].split('/')[0]
denominator = rating[1].split('/')[1]
# This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = float(denominator)
else:
# Average of two ratings.
numerator = (float(rating[0].split('/')[0]) + float(rating[1].split('/')[0])) * 0.5
# This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator']
Finally, in some tweets the user @dog_rates
has an aggregate in a single rating many rates, in other words, If the picture has 7 dogs, the rating going to be 84/70, 12/10 in average.
I will correct these exceptions doing a simple "scale".
# Loop to find any non integer rating.
for index in twt_ach_mstr.query('rating_numerator > 15').tweet_id:
# Extracting the text of each tweet
twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
# This regex will find the pattern NN/NN.
rating = re.findall('\d+\d\/\d+', twt_txt)
# If any rating were found, the length of rating will be different of zero
if len(rating) == 1:
# rating is a list like this [9.9/10], I want only the first part
numerator = float(rating[0].split('/')[0])
denominator = float(rating[0].split('/')[1])
# This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = numerator/(denominator/10)
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = denominator/(denominator/10)
Let's fix the denominator problems.
# Slicing the problematic values from the value_counts.
denominator_problem = twt_ach_mstr.rating_denominator.value_counts().index[1:].tolist()
# Subsetting the dataframe to extrac the tweet_id with problem.
tweet_id_denominator_problem = twt_ach_mstr[twt_ach_mstr.rating_denominator.isin(denominator_problem)].tweet_id
for index in tweet_id_denominator_problem:
# Extracting the text of each tweet
twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
# This regex will find the pattern NN/NN.
rating = re.findall('\d+\d\/\d+', twt_txt)
# Only update if the rating is valid rating NN/NN
if len(rating) == 1:
numerator = rating[0].split('/')[0]
denominator = rating[0].split('/')[1]
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)
twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = float(denominator)
# Printing the values of numerator.
twt_ach_mstr.rating_numerator.value_counts()
12.00 450 10.00 419 11.00 398 13.00 254 9.00 150 8.00 95 7.00 51 14.00 34 6.00 32 5.00 32 3.00 19 4.00 15 2.00 9 1.00 5 9.75 1 0.00 1 11.26 1 13.50 1 11.27 1 Name: rating_numerator, dtype: int64
# Printing the values of denominator
twt_ach_mstr.rating_denominator.value_counts()
10.0 1967 2.0 1 Name: rating_denominator, dtype: int64
Unfortunately, it is not possible to remove all problems in the denominator column (there is still one non-standard value). This happened due to the use of fraction (3 1/2). I did not find a good regex to filter this issue.
Issue: This is a categorical variable and could be combine into one column.
In this tidiness issue, I am going to combine four columns into a single column.
I have realized in same cases the same dog is classified twice:
doggo
and pupper
;doggo
and floofer
, and;doggo
and puppo
.Figure 1 shows the doggo
, pupper
, puppo
, and floofer
definitions.
Obs.: I have used the numpy arrays to "add" vector, and to do so I have read this thread on Stack Overflow.
# Subset the twt_ach_mstr just to pick the "dogtionary columns".
dog_cols = twt_ach_mstr[['doggo','floofer','pupper','puppo']]
# Replace all "None" values to "almost null".
dog_cols = dog_cols.replace('None', '')
# Adding the columns to create a new one.
dogtionary = np.array(dog_cols['doggo']) + np.array(dog_cols['floofer']) + np.array(dog_cols['pupper']) + np.array(dog_cols['puppo'])
# Printing
pd.DataFrame(dogtionary, columns = ['dogtionary']).dogtionary.value_counts()
1665 pupper 201 doggo 63 puppo 22 doggopupper 8 floofer 7 doggofloofer 1 doggopuppo 1 Name: dogtionary, dtype: int64
Let's convert all double classified dog as multiclass
.
# Fixing the dogs with two classifications.
dogtionary[dogtionary == 'doggopupper'] = 'multiclass'
dogtionary[dogtionary == 'doggofloofer'] = 'multiclass'
dogtionary[dogtionary == 'doggopuppo'] = 'multiclass'
# Printing to see the results.
pd.DataFrame(dogtionary, columns = ['dogtionary']).dogtionary.value_counts()
1665 pupper 201 doggo 63 puppo 22 multiclass 10 floofer 7 Name: dogtionary, dtype: int64
Finally, let's append this new column called dogtionary
to the twt_ach_mstr
data frame.
# Reseting the index of twt_ach_mstr.
twt_ach_mstr.reset_index(drop=True, inplace=True)
# Appending dogtionary to twt_ach_mstr.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(dogtionary, columns = ['dogtionary'])], axis = 1)
# Printing the value_counts of the new columns dogtionary.
twt_ach_mstr.dogtionary.value_counts()
1665 pupper 201 doggo 63 puppo 22 multiclass 10 floofer 7 Name: dogtionary, dtype: int64
The new column bundle the 4 columns (pupper, puppo, doggo, and floofer) in one.
Issue: Convert the
timestamp
to a properly data type.
The timestamp
variable is an object
, which means it is a string for pandas. I am going to convert it to a datetime
data type.
Unfortunately, the timestamp notation has some additional info (' +0000') unnecessary, which I need to remove.
Obs.: The numpy Datetimes documentation was essentials to understand how to convert it.
timestamp
to date time variable.# Extracting the timestamp from a separated variable.
timestamp = twt_ach_mstr.timestamp.tolist()
# I want to remove the ' +0000', to do so I will use the split.
timestamp_str = list(map( lambda x : str(x).split(' +0000')[0], timestamp))
# Converting a regular list to a numpy Data Frame.
timestamp_str = pd.DataFrame(timestamp_str, columns = ['timestamp'])
# Converting using the numpy datetime, and assigning to the twt_ach_mstr.
twt_ach_mstr.timestamp = timestamp_str.timestamp.apply(np.datetime64)
Let's print the data type to ensure the conversion of the timestamp
.
# Printing the data type of each variable.
twt_ach_mstr.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1968 entries, 0 to 1967 Data columns (total 31 columns): tweet_id 1968 non-null object in_reply_to_status_id 0 non-null float64 in_reply_to_user_id 0 non-null float64 timestamp 1968 non-null datetime64[ns] source 1968 non-null object text 1968 non-null object retweeted_status_id 0 non-null float64 retweeted_status_user_id 0 non-null float64 retweeted_status_timestamp 0 non-null object expanded_urls 1968 non-null object rating_numerator 1968 non-null float64 rating_denominator 1968 non-null float64 name 1968 non-null object doggo 1968 non-null object floofer 1968 non-null object pupper 1968 non-null object puppo 1968 non-null object jpg_url 1968 non-null object img_num 1968 non-null int64 p1 1968 non-null object p1_conf 1968 non-null float64 p1_dog 1968 non-null bool p2 1968 non-null object p2_conf 1968 non-null float64 p2_dog 1968 non-null bool p3 1968 non-null object p3_conf 1968 non-null float64 p3_dog 1968 non-null bool source_url 1968 non-null object text_url 1968 non-null object dogtionary 1968 non-null object dtypes: bool(3), datetime64[ns](1), float64(9), int64(1), object(17) memory usage: 305.6+ KB
Issue: Dog's breed is not standardized, some is capitalized and other lowercase.
I have written I straightforward function to deal with the non-standardized names.
P1
, P2
and P3
columns.def fix_names(list_names):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This functions aims to convert any letter to lower case, also replace space by underscores. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| list_names data frame A column of a given data frame which contains non-standardized letters |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| list_names data frame Column with 'fixed' content. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Convert any uppercase in lowercase.
list_names = list(map(lambda x : x.lower(), list_names ))
# Change spaces in underscores.
list_names = list(map(lambda x : x.replace(' ','_'), list_names ))
# Change dash to underscores.
list_names = list(map(lambda x : x.replace('-','_'), list_names ))
return list_names # Return the "fixed" names.
Let's use this function to fix p1
, p2
, and p3
.
twt_ach_mstr.p1 = fix_names(twt_ach_mstr.p1)
twt_ach_mstr.p2 = fix_names(twt_ach_mstr.p2)
twt_ach_mstr.p3 = fix_names(twt_ach_mstr.p3)
Let's take a look at the breed's name to ensure the conversion.
# Subsetting the twt_ach_mstr to show only p1, p2, p3, and tweet_id.
twt_ach_mstr[['tweet_id','p1','p2','p3']].head(10)
tweet_id | p1 | p2 | p3 | |
---|---|---|---|---|
0 | 892420643555336193 | orange | bagel | banana |
1 | 892177421306343426 | chihuahua | pekinese | papillon |
2 | 891815181378084864 | chihuahua | malamute | kelpie |
3 | 891689557279858688 | paper_towel | labrador_retriever | spatula |
4 | 891327558926688256 | basset | english_springer | german_short_haired_pointer |
5 | 891087950875897856 | chesapeake_bay_retriever | irish_terrier | indian_elephant |
6 | 890971913173991426 | appenzeller | border_collie | ice_lolly |
7 | 890729181411237888 | pomeranian | eskimo_dog | pembroke |
8 | 890609185150312448 | irish_terrier | irish_setter | chesapeake_bay_retriever |
9 | 890240255349198849 | pembroke | cardigan | chihuahua |
Based on the Project's instructions:
Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.
For this reasons, I have had to aggregate to Table 7, two more issues, 15 and 16. I also need to remove any non-used columns.
Issue ID | Table | Issue Type | Dimension | Method | Column | Description |
---|---|---|---|---|---|---|
15 | df_img | Quality | Completeness | Programmatic | "retweet count" | Gather additional info in tweet_json.txt file. |
16 | df_img | Quality | Completeness | Programmatic | "favorite count" | Gather additional info in tweet_json.txt file. |
17 | twt_ach_mstr | Quality | Validity | Programmatic | "many columns" | Remove in_reply_to_status_id , in_reply_to_user_id , retweeted_status_timestamp , retweeted_status_id , and retweeted_status_user_id . |
# Lesson Learned: DO NOT use chaining indexing.
favorite_count = df_twt_cln.loc[:,('id','favorite_count')]
# Renaming the columns to be apt to merge using tweet_id as key.
favorite_count.columns = ['tweet_id','favorite_count']
# Converting the tweet_if to str.
favorite_count.loc[favorite_count.tweet_id.index,'tweet_id'] = favorite_count.tweet_id.astype(str)
# Merging the subsetted data frame called favorite_count to twt_ach_mstr.
twt_ach_mstr = pd.merge(twt_ach_mstr, favorite_count, on = 'tweet_id', how = 'left')
As you can see, I have configured the .merge()
as left join, because I do not want to drop the observations without this favorite_count
.
# Print the first 5 favorite_count rows. Have in mind, this is a subset with few columns.
twt_ach_mstr.loc[:,('tweet_id','favorite_count')].head()
tweet_id | favorite_count | |
---|---|---|
0 | 892420643555336193 | NaN |
1 | 892177421306343426 | NaN |
2 | 891815181378084864 | 24593.0 |
3 | 891689557279858688 | 41433.0 |
4 | 891327558926688256 | 39613.0 |
Bear in mind, neither all tweet_id
has favorite_count
.
Issue: Gather favorite count info in tweet_json.txt file.
Similarly as done to favorite_count
, the retweet_count
will require the same sequence of code lines with minor changes.
twitter_archive_master.csv
the retweet_count
values from the tweet_json.txt
#favorite_count = df_twt_cln[['tweet_id','favorite_count']]
retweet_count = df_twt_cln.loc[:,('id','retweet_count')]
# Renaming the columns to be apt to merge using tweet_id as key.
retweet_count.columns = ['tweet_id','retweet_count']
# Converting the tweet_if to str.
retweet_count.tweet_id = retweet_count.tweet_id.astype(str)
# Merging the subsetted data frame called favorite_count to twt_ach_mstr.
twt_ach_mstr = pd.merge(twt_ach_mstr, retweet_count, on = 'tweet_id', how = 'left')
I have also configured the .merge()
as left join because I do not want to drop the observations without this retweet_count
.
# Print the first 5 retweet_count rows. Have in mind, this is a subset with few columns.
twt_ach_mstr.loc[:,('tweet_id','retweet_count')].head()
tweet_id | retweet_count | |
---|---|---|
0 | 892420643555336193 | NaN |
1 | 892177421306343426 | NaN |
2 | 891815181378084864 | 4070.0 |
3 | 891689557279858688 | 8471.0 |
4 | 891327558926688256 | 9166.0 |
Bear in mind, neither all tweet_id
has retweet_count
.
Issue: Remove
in_reply_to_status_id
,in_reply_to_user_id
,retweeted_status_timestamp
,retweeted_status_id
, andretweeted_status_user_id
.
in_reply_to_status_id
, in_reply_to_user_id
, retweeted_status_timestamp
, retweeted_status_id
, and retweeted_status_user_id
.First, let's take into account the numbers of columns.
# Calculates the number of columns.
len(twt_ach_mstr.columns.tolist())
33
What is the columns in twt_ach_mstr
?
twt_ach_mstr.columns.tolist()
['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'source_url', 'text_url', 'dogtionary', 'favorite_count', 'retweet_count']
There more 4 columns from dogtionary, which I can drop as well.
# Columns to be removed.
rm_columns = ['in_reply_to_status_id',
'in_reply_to_user_id',
'retweeted_status_timestamp',
'retweeted_status_id',
'retweeted_status_user_id',
'doggo',
'pupper',
'puppo',
'floofer']
# Copy of all column of twt_ach_mstr.
list_columns = twt_ach_mstr.columns.tolist()
# Loop to remove each non desired column.
for rm in rm_columns:
# Remove a specific column.
list_columns.remove(rm)
# Priting the remaining columns.
list_columns
['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'source_url', 'text_url', 'dogtionary', 'favorite_count', 'retweet_count']
Let's subset the twt_ach_mstr
data frame.
# Cleaned twt_ach_mstr.
twt_ach_mstr = twt_ach_mstr[list_columns]
# Printing the first 5 rows.
twt_ach_mstr.head()
tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | jpg_url | img_num | ... | p2_conf | p2_dog | p3 | p3_conf | p3_dog | source_url | text_url | dogtionary | favorite_count | retweet_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 2017-08-01 16:23:56 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve... | https://twitter.com/dog_rates/status/892420643... | 13.0 | 10.0 | Phineas | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | ... | 0.085851 | False | banana | 0.076110 | False | http://twitter.com/download/iphone | https://t.co/MgUWQ76dJU | NaN | NaN | |
1 | 892177421306343426 | 2017-08-01 00:17:27 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | https://twitter.com/dog_rates/status/892177421... | 13.0 | 10.0 | Tilly | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | ... | 0.090647 | True | papillon | 0.068957 | True | http://twitter.com/download/iphone | https://t.co/0Xxu71qeIV | NaN | NaN | |
2 | 891815181378084864 | 2017-07-31 00:18:03 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | https://twitter.com/dog_rates/status/891815181... | 12.0 | 10.0 | Archie | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | ... | 0.078253 | True | kelpie | 0.031379 | True | http://twitter.com/download/iphone | https://t.co/wUnZnhtVJB | 24593.0 | 4070.0 | |
3 | 891689557279858688 | 2017-07-30 15:58:51 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal... | https://twitter.com/dog_rates/status/891689557... | 13.0 | 10.0 | Darla | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | ... | 0.168086 | True | spatula | 0.040836 | False | http://twitter.com/download/iphone | https://t.co/tD36da7qLQ | 41433.0 | 8471.0 | |
4 | 891327558926688256 | 2017-07-29 16:00:24 | Twitter for iPhone | This is Franklin. He would like you to stop ca... | https://twitter.com/dog_rates/status/891327558... | 12.0 | 10.0 | Franklin | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | ... | 0.225770 | True | german_short_haired_pointer | 0.175219 | True | http://twitter.com/download/iphone | https://t.co/AtUZn91f7f | 39613.0 | 9166.0 |
5 rows × 24 columns
Last look to the .info()
to ensure if the columns in_reply_to_status_id
, in_reply_to_user_id
, retweeted_status_timestamp
, retweeted_status_id
, and retweeted_status_user_id
have been removed.
# Printing the info.
twt_ach_mstr.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1968 entries, 0 to 1967 Data columns (total 24 columns): tweet_id 1968 non-null object timestamp 1968 non-null datetime64[ns] source 1968 non-null object text 1968 non-null object expanded_urls 1968 non-null object rating_numerator 1968 non-null float64 rating_denominator 1968 non-null float64 name 1968 non-null object jpg_url 1968 non-null object img_num 1968 non-null int64 p1 1968 non-null object p1_conf 1968 non-null float64 p1_dog 1968 non-null bool p2 1968 non-null object p2_conf 1968 non-null float64 p2_dog 1968 non-null bool p3 1968 non-null object p3_conf 1968 non-null float64 p3_dog 1968 non-null bool source_url 1968 non-null object text_url 1968 non-null object dogtionary 1968 non-null object favorite_count 1280 non-null float64 retweet_count 1280 non-null float64 dtypes: bool(3), datetime64[ns](1), float64(7), int64(1), object(12) memory usage: 251.8+ KB
After identifying 17 issues and solve them, the final data frame (twt_ach_mstr
) has one particularity:
favorite_count
and retweet_count
columns.This particularity will not affect the further studies, because it is not related to the quality of the data.
# Saving the `twt_ach_mstr` as twitter_archive_master.csv.
twt_ach_mstr.to_csv('01-Dataset/twitter_archive_master.csv', index=False, encoding = 'utf-8')
Finally, the Data Cleaning ends here.
Let's load the twitter_archive_master.csv
, this time I will name the loaded data frame as df_viz
.
# Loading the tweet_json.txt as a panda DataFrame.
df_viz = pd.read_csv('01-Dataset/twitter_archive_master.csv')
Although, I have already converted the timestamp
to datetime in Issue ID 7 when I export (to make a backup) and load (the exported file) the timestamp
column will be coerced to a string, which obliges me to convert (again).
# Converting using the numpy datetime, and assigning to the twt_ach_mstr.
df_viz.timestamp = df_viz.timestamp.apply(np.datetime64)
In order to provide a default configuration to all graphics, I will define a chunk to wrap this configuration.
# General configuration for all plots.
font = 14 # X and Y axis fontsize.
font_title = 18 # Graphic's Title fontsize.
transparency = 0.5 # Transparency in cases of scatter plot.
width = 0.7 # Bar width
In the next section, I have tried to perform an Exploratory Data Analysis (EDA). I will pose some silly questions to guide me to walk through the project requirements.
Let's compare the output of each algorithm, plotting a bar graph to analysis visually. To do so, I will write a function to bundle some code lines and to avoid repetition.
# Function to plot a barchart.
def gp_pop_dog(df_popular, gp_title, xlabel, ylabel,font = font, font_title = font_title):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be |
| used. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_popular data frame The entire data frame. |
| |
| gp_title str Graphic title. |
| |
| xlabel str X axis label. |
| |
| ylabel str Y axis label. |
| |
| font str Axis labels font size. |
| |
| font_title str Graphic Title font size. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| graphic - Returns a matplotlib graphic. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Plotting only the breed's dog with more than 20 occurencies.
df_popular[df_popular > 20].plot(kind = 'barh', # Set the kind of graphic.
figsize = [14, 6]); # Graphic size.
plt.title(label = gp_title, # Graphic Title.
fontsize = font_title); # Graphic title fontsize.
plt.xlabel(xlabel = xlabel, # X axis label
fontsize = font); # X axis fontsize
plt.ylabel(ylabel = ylabel, # Y axis label
fontsize = font); # Y axis fontsize
plt.xlim(0, 140) # X axis limit - set default as 140.
Premised on the three (3) algorithms, I will plot three graphics and compare the results.
# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p1_dog].p1.value_counts()
# Plotting the graphic.
gp_pop_dog(dog_pref,
gp_title = 'Graphic 1 - Algorithm 1 - Dog\'s Breed with more than 20 appearances.',
xlabel = 'Occurrences',
ylabel = 'Dog\'s Breed')
# Dog Preference - Algorithm P2
# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p2_dog].p2.value_counts()
# Plotting the graphic.
gp_pop_dog(dog_pref,
gp_title = 'Graphic 2 - Algorithm 2 - Dog\'s Breed with more than 20 appearances.',
xlabel = 'Occurrences',
ylabel = 'Dog\'s Breed')
# Dog Preference - Algorithm P3
# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p3_dog].p3.value_counts()
# Plotting the graphic.
gp_pop_dog(dog_pref,
gp_title = 'Graphic 3 - Algorithm 3 - Dog\'s Breed with more than 20 appearances.',
xlabel = 'Occurrences',
ylabel = 'Dog\'s Breed')
Based on the graphics, the P1 algorithm has fewer breeds and high concentrated bars in some breeds, in the opposite way the P3 algorithm has much more breeds and the dogs are spread in more breeds, in other words, less concentrated bars.
# Calculating the percentages by algorithms.
p1_perc = round(100 * df_viz[df_viz.p1_dog].shape[0]/df_viz.shape[0],2)
p2_perc = round(100 * df_viz[df_viz.p2_dog].shape[0]/df_viz.shape[0],2)
p3_perc = round(100 * df_viz[df_viz.p3_dog].shape[0]/df_viz.shape[0],2)
# Printing a summary.
print("P1: {}%\nP2: {}%\nP3: {}%".format(p1_perc,p2_perc,p3_perc))
P1: 74.29% P2: 75.15% P3: 72.66%
All three algorithm has almost the same performance (percentage of picture classified as a dog):
The third one has a slightly less performance.
# Calculating the number of breeds in each algorithm with more than 20 occurencies.
p1_qty = sum(df_viz[df_viz.p1_dog].p1.value_counts() > 20)
p2_qty = sum(df_viz[df_viz.p2_dog].p2.value_counts() > 20)
p3_qty = sum(df_viz[df_viz.p3_dog].p3.value_counts() > 20)
# Printing a summary of occurencies.
print("P1: {} breeds.\nP2: {} breeds.\nP3: {} breeds.".format(p1_qty,p2_qty,p3_qty))
P1: 14 breeds. P2: 18 breeds. P3: 21 breeds.
I have found a difference in the number of breeds with more than 20 occurrences.
The third algorithm has 33.33% more breeds than the first algorithm.
# Calculating the number of dogs classified using the threshold of 20.
dog_qty_20_p1 = sum(df_viz[df_viz.p1_dog].p1.value_counts()[df_viz[df_viz.p1_dog].p1.value_counts() > 20])
dog_qty_20_p2 = sum(df_viz[df_viz.p2_dog].p2.value_counts()[df_viz[df_viz.p2_dog].p2.value_counts() > 20])
dog_qty_20_p3 = sum(df_viz[df_viz.p3_dog].p3.value_counts()[df_viz[df_viz.p3_dog].p3.value_counts() > 20])
# Printing a summary of occurencies.
print("P1: {} dogs.\nP2: {} dogs.\nP3: {} dogs.".format(dog_qty_20_p1,dog_qty_20_p2,dog_qty_20_p3))
P1: 732 dogs. P2: 713 dogs. P3: 671 dogs.
Conclusion: The algorithm 1 is more skewness than the other two, and tend to concentrate the classification in a few breeds. On the other hand, the third algorithm is more generalist, spreading the dogs in a more variety of breeds.
I want to see if there are a correlation between Favorite and Retweet. I am going to plot a simple scatter plot.
plt.figure(figsize= [14,6]) # Graphic size
plt.scatter(x = df_viz['retweet_count'], # X varibale
y = df_viz['favorite_count'], # Y variable
alpha = transparency) # Transparency
plt.xlabel('Number of Retweets', fontsize = font) # X label
plt.ylabel('Number of Favorite', fontsize = font) # Y label
plt.title('Graphic 4 - Correlation between retweets and favorite tweets.',
fontsize = font_title) # Graphic Title
plt.show() # Plot the graphic
Graphically, it is possible to see a positive trend (correlation) between both variables. Thus, let's calculate a naive rate for each retweet how many favourites it will generate (in average).
# Calculating the rate for each retweet how many favorite will produce in average.
fav_ret_rate = df_viz.favorite_count/df_viz.retweet_count
# Calculating of mean, standard deviation, etc.
fav_ret_rate[np.logical_not(df_viz.favorite_count.isnull())].describe()
count 1280.000000 mean 3.384297 std 1.150137 min 1.107146 25% 2.580705 50% 3.219539 75% 3.955537 max 11.765060 dtype: float64
Conclusion: Seems there is a very positive correlation when the number of favourites raises the numbers of retweets also raises. Using a straightforward analysis for each new retweet the favourites will increase 3 (in average).
Which one is more adorable?
I will calculate the mean and favourite per retweet rate to compare the dogtionary terms.
# Copying the original data frame for a pontual analysis.
df_dogtionary = df_viz.copy()
# Filling the NaN values with None.
df_dogtionary.dogtionary = df_dogtionary.dogtionary.fillna('None')
# Mean calculation.
df_dog_mean = df_dogtionary.groupby(['dogtionary'])['favorite_count','retweet_count'].mean()
# Rate calculation.
df_dog_mean = df_dog_mean.join(pd.DataFrame(df_dog_mean.favorite_count/df_dog_mean.retweet_count, columns = ['rate']))
# Count calculation.
df_dog_mean_count = df_dog_mean.join(df_dogtionary.dogtionary.value_counts())
# Copying the column's name.
col_names = df_dog_mean_count.columns.tolist()
# Editing the columns names.
col_names[3] = 'count'
# Assigning a proper name.
df_dog_mean_count.columns = col_names
# Printing the summary.
df_dog_mean_count
favorite_count | retweet_count | rate | count | |
---|---|---|---|---|
dogtionary | ||||
None | 8062.129870 | 2393.982375 | 3.367665 | 1665 |
doggo | 14410.558824 | 4851.058824 | 2.970601 | 63 |
floofer | 5643.000000 | 2172.000000 | 2.598066 | 7 |
multiclass | 19020.250000 | 6341.375000 | 2.999389 | 10 |
pupper | 6529.219858 | 2122.219858 | 3.076599 | 201 |
puppo | 14684.933333 | 4292.266667 | 3.421254 | 22 |
Conclusion: Based on the summary above, in average, dogs without a dogtionary classification tend to perform better rates of favourite per retweet. Have in mind, the few observations of floofer and puppo.
Still analysing the dogtionary terms, but in a visual manner, I will use Graphic 4 as the base for Graphic 5.
# Copying the original data frame for a pontual analysis.
df_dog_viz = df_viz.copy()
# Filling the NaN values with None.
df_dog_viz.dogtionary = df_dog_viz.dogtionary.fillna('None')
# List of term in dogtionary.
dog_terms = ['None', 'pupper', 'doggo', 'puppo', 'floofer', 'multiclass'] # The order is ascending.
# Graphic Size.
plt.figure(figsize = [14, 10])
# Loop to construct the graphic "layers".
for index in dog_terms:
# Provisory data frame to subset the specific "index" dogtionary term.
prov = df_dog_viz[df_dog_viz['dogtionary'] == index][['favorite_count','retweet_count']]
# Plotting
plt.scatter(x = prov['favorite_count'], # X axis data.
y = prov['retweet_count'] , # Y axis data.
alpha = transparency) # Points transparency.
plt.xlabel('Number of Retweets', # X label.
fontsize = font) # X axis fontsize.
plt.ylabel('Number of Favorite', # Y label.
fontsize = font) # Y axis fontsize.
plt.title('Graphic 5 - Correlation between retweets and favorite tweets by Dogtionary terms.', # Graphic title.
fontsize = font_title) # Graphic Title.
plt.legend(dog_terms) # Add legend.
plt.show() # Plot the graphic.
Conclusion: It is not possible to identify (visually) any pattern or relationship, using the dogtionary variable, although there is a strong and positive correlation between Retweets and Favorite.
How is the behaviour of tweets along the week? Do they have the same average?
Founded on the rating_numerator
and timestamp
, I will plot how the average rating will vary over the time.
# Copying the data frame to a pontual study.
df_time = df_viz.copy()
# Subsetting.
df_time = df_time[['timestamp','rating_numerator']]
# Converting full date to weekdays. 1 = Monday, 2 = Tuesday, ... , 7 = Sunday
df_time.timestamp = list(map(lambda x : x.isoweekday(), df_time.timestamp))
# Calculating the mean of ratings.
df_time_mean = df_time.groupby(['timestamp']).mean()
# Adding a new columns: number of occurencies.
df_time_mean = df_time_mean.join(pd.DataFrame(df_time.timestamp.value_counts()))
# Setting strings as index (preparing to the plot).
df_time_mean.index = ['Monday','Tuesday','Wednesday','Thurday','Friday','Saturday','Sunday']
# Editing the variable name.
df_time_mean.columns = ['Avg Rating','Tweets Number']
# Summary of Average Rating over the weekdays and grouped by the tweets's number.
df_time_mean
Avg Rating | Tweets Number | |
---|---|---|
Monday | 10.342593 | 324 |
Tuesday | 10.602302 | 291 |
Wednesday | 10.453925 | 293 |
Thurday | 10.706250 | 280 |
Friday | 10.692688 | 279 |
Saturday | 10.586694 | 248 |
Sunday | 10.454545 | 253 |
Analysing the above chunk output, it is not clear the behaviour. For this reason, I have written a function to encapsulate a couple of code lines. This will save code lines in the future.
# Function to plot a barplot and line.
def gp_bar_line(df_x, df_y1, df_y2, gp_title, xlabel, ylabel, font = font, font_title = font_title):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be |
| used. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_x data frame The entire data frame. |
| |
| df_y1 data frame The variable to be displayed as bar. |
| |
| df_y2 data frame The variable to be displayed as line. |
| |
| gp_title str Graphic title. |
| |
| xlabel str X axis label. |
| |
| ylabel str Y axis label. |
| |
| font str Axis labels font size. |
| |
| font_title str Graphic Title font size. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| graphic - Returns a matplotlib graphic. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Graphic Settings.
fig, ax1 = plt.subplots(figsize = [14,6])
# First Element: Blue bars.
color = 'tab:blue'
ax1.set_xlabel(xlabel, fontsize = font)
ax1.set_ylabel('Avg Rating', color = color, fontsize = font)
ax1.bar(df_x.index, df_y1, color = color)
ax1.tick_params(axis = 'y', labelcolor = color)
# Second Element: Red line.
ax2 = ax1.twinx() # share axis with ax1 element.
color = 'indianred'
ax2.set_ylabel(ylabel, color = color, fontsize = font) # we already handled the x-label with ax1
ax2.plot(df_x.index, df_y2, color = color, linewidth = 5)
ax2.tick_params(axis = 'y', labelcolor = color)
fig.tight_layout() # otherwise the right y-label is slightly clipped
plt.title(gp_title, # Graphic title.
fontsize = font_title) # Graphic font title.
return plt.show()
After defining the function, I will apply this function in the Average Rating over the weekdays.
# Plotting Average Rating over the weekdays.
gp_bar_line(df_x = df_time_mean,
df_y1 = df_time_mean['Avg Rating'],
df_y2 = df_time_mean['Tweets Number'],
gp_title = 'Graphic 6 - Average Rating and Number of Tweets per Weekday.',
xlabel = 'Weekday',
ylabel = 'Number of Tweets')
Conclusion: Although the number of tweets decreases along the weekend, the average stay (almost) steadily. In average, Monday is the weekday with highest tweets average.
# Copying the data frame to a pontual study.
df_month = df_viz.copy()
# Subsetting
df_month = df_month[['timestamp','rating_numerator']]
# Converting full date to month.
df_month.timestamp = list(map(lambda x : x.month, df_month.timestamp))
# Calculating the mean of ratings.
df_month_mean = df_month.groupby(['timestamp']).mean()
# Adding a columns with value counts.
df_month_mean = df_month_mean.join(pd.DataFrame(df_month.timestamp.value_counts()))
# Renaming the index.
df_month_mean.index = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
# Editing the variable name.
df_month_mean.columns = ['Avg Rating','Tweets Number']
# Printing the summary.
df_month_mean
Avg Rating | Tweets Number | |
---|---|---|
Jan | 10.711207 | 232 |
Feb | 10.889535 | 172 |
Mar | 10.939394 | 165 |
Apr | 11.287234 | 94 |
May | 11.683673 | 98 |
Jun | 11.073171 | 123 |
Jul | 11.533088 | 136 |
Aug | 11.000000 | 61 |
Sep | 11.194762 | 63 |
Oct | 11.673077 | 65 |
Nov | 9.355491 | 346 |
Dec | 9.877143 | 413 |
In this case, it is much clear to see a behaviour when the number of tweet increases the average rating goes down.
Let's print a graphic using the function gp_bar_line
.
# Plotting Average Rating over the weekdays.
gp_bar_line(df_x = df_month_mean,
df_y1 = df_month_mean['Avg Rating'],
df_y2 = df_month_mean['Tweets Number'],
gp_title = 'Graphic 7 - Average Rating and Number of Tweets per Month.',
xlabel = 'Month',
ylabel = 'Number of Tweets')
Conclusion: Although December and November have the highest number of tweets, the average rating goes in the opposite way. These two months have the lowest number of average ratings. There is strong evidence of seasonality.
What is the Breed with more reaction?
Due to the use of an algorithm to determine (automatically) the dog's breed, It is necessary to subset the df_viz
, based on the p1_dog
, for doing so, I have written the df_gp_triple
function to encapsulate this lines codes.
# This functions is to wrap code lines. The result of this function is the input of gp_triple function.
def df_gp_triple(df_viz, algorithm, threshold = 20, sorting = False):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be |
| used. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_viz data frame The entire data frame (imported from twitter_archive_master.csv). |
| |
| algorithm str Could be p1, p2 or p3. |
| |
| threshold int Breed's number to be displayed in the graphic. |
| |
| sorting bool True: ascending, False: descending. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_alg data frame Data frame with 'threshold' rows and 3 columns (favorite_count, |
| retweet_count, and rate). It is indexed by dog's breed. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# According to the algorithm I will subset the df_viz using differents rows.
if algorithm == 'p1':
df_alg = df_viz[['rating_numerator','p1','p1_dog','retweet_count','favorite_count']]
df_alg = df_alg[df_alg.p1_dog]
df_alg = df_alg.groupby(['p1']).sum()
elif algorithm == 'p2':
df_alg = df_viz[['rating_numerator','p2','p2_dog','retweet_count','favorite_count']]
df_alg = df_alg[df_alg.p2_dog]
df_alg = df_alg.groupby(['p2']).sum()
else:
df_alg = df_viz[['rating_numerator','p3','p3_dog','retweet_count','favorite_count']]
df_alg = df_alg[df_alg.p3_dog]
df_alg = df_alg.groupby(['p3']).sum()
# Removing the 'p#', 'p#_dog', and 'rating_numerator' columns.
df_alg = df_alg[['favorite_count','retweet_count']]
# Adding a new columns with the rate (favorite divided by retweet)
df_alg = df_alg.join(pd.DataFrame(df_alg['favorite_count']/df_alg['retweet_count']))
# Renaming columns.
df_alg.columns = ['favorite_count','retweet_count','rate']
# Sorting by favorite_count.
df_alg = df_alg.sort_values(by = 'rate', ascending = sorting)[:threshold]
# return df_alg.
return df_alg
Let's have a try calculating the subsetted data frame to algorithm p1
. The expected output is a table containing all dog's breed with rate and summation of favourite and retweet.
# Subsetting the df_viz using the function df_gp_triple.
df_p1 = df_gp_triple(df_viz,
algorithm = 'p1')
# Printing the first 5 rows.
df_p1.head()
favorite_count | retweet_count | rate | |
---|---|---|---|
p1 | |||
german_short_haired_pointer | 27523.0 | 4455.0 | 6.178002 |
basset | 122353.0 | 23171.0 | 5.280437 |
irish_terrier | 34710.0 | 6583.0 | 5.272672 |
rhodesian_ridgeback | 12956.0 | 2479.0 | 5.226301 |
bloodhound | 25891.0 | 5080.0 | 5.096654 |
Now, I want to plot this table in a graphic with two bars and one line.
# Function to plot the df_gp_triple data frame.
def gp_triple(df_x, df_y1, df_y2, df_y3, algorithm, gp_title, xlabel, ylabel,
font = font, font_title = font_title, width = width):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be |
| used. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_x data frame The entire data frame (imported from twitter_archive_master.csv). |
| |
| df_y1 data frame The variable to be displayed as blue bar 1. |
| |
| df_y2 data frame The variable to be displayed as orange bar 2. |
| |
| df_y3 data frame The variable to be displayed as line. |
| |
| algorithm str Could be 'p1', 'p2' or 'p3'. |
| |
| gp_title str Graphic title. |
| |
| xlabel str X axis label. |
| |
| ylabel str Y axis label. |
| |
| font str Axis labels font size. |
| |
| font_title str Graphic Title font size. |
| |
| width str Barchart width. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| graphic - Returns a matplotlib graphic. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Defining the subplot and Graphic size.
fig, ax1 = plt.subplots(figsize = [14,8])
# First element: Blue Barchart
color = 'tab:blue'
ax1.set_xlabel('Dog\'s Breed', fontsize = font)
ax1.set_ylabel('Number of Retweets\nNumber of Favorite', color=color, fontsize = font)
ax1.bar(df_x.index, df_y1, width, color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.tick_params(axis='x', rotation = 90)
# Second element: Orange Barchart
ax3 = ax1.twiny()
color = 'tab:orange'
ax3.bar(df_x.index, df_y2, width, color=color)
ax3.tick_params(axis='x', labelcolor=color, top = False)
ax3.set_xticklabels(labels = [], axis='x', labelcolor = color, top = False)
# Third element: Red Line
ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis
color = 'indianred'
ax2.set_ylabel('Favorite/Retweet', color=color, fontsize = font) # we already handled the x-label with ax1
ax2.plot(df_x.index, df_y3, color=color, linewidth = 3)
ax2.tick_params(axis='y', labelcolor=color)
# Avoid right y-label to be slightly clipped
fig.tight_layout()
# According to the algorithm the average is different because the subsetting differents rows.
if algorithm == 'p1':
df_alg = df_viz[df_viz.p1_dog]
elif algorithm == 'p2':
df_alg = df_viz[df_viz.p2_dog]
else:
df_alg = df_viz[df_viz.p3_dog]
# Subsetting: removing any NaN or Null rows.
df_alg_fav = df_alg[np.logical_not(df_alg.retweet_count.isnull())].favorite_count
df_alg_ret = df_alg[np.logical_not(df_alg.retweet_count.isnull())].retweet_count
# Calculating the rate = favorites/tweets, this is a vector.
df_alg_rate = df_alg_fav/df_alg_ret
# Calculating the mean, this is a single value.
df_alg_rat_fav_ret = df_alg_rate.mean()
# Plotting a dotted green line as average.
plt.axhline(y = df_alg_rat_fav_ret, color='green', linestyle='--')
# Defining the Graphic Title.
plt.title(gp_title, fontsize = font_title)
# Plot.
return plt.show()
The result of this function will be shown below.
# Plotting a graphic using the output of df_gp_triple as input of gp_triple.
gp_triple(df_x = df_p1,
df_y1 = df_p1['favorite_count'],
df_y2 = df_p1['retweet_count'],
df_y3 = df_p1['rate'],
algorithm = 'p2',
gp_title = 'Graphic 8 - Algortihm 1 - Favorite/Retweets rate for each Dog\'s Breed.',
xlabel = 'Dog\'s Breed',
ylabel = 'Number of Retweets\nNumber of Favorite')
The blue bar means the number of favourites, the orange bar the number of retweets, dotted green line the average of Favorite/Retweets for all dog breeds, and the red line the value of the Favorite/Retweet for each dog breed.
Going a step further, I will bundle these two functions into one.
# This function wrap the df_gp_triple and gp_triple functions.
def wrapper_triple(algorithm, gp_title, xlabel, ylabel,
df_viz = df_viz, font = font, font_title = font_title, threshold = 20, sorting = False):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be |
| used. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_viz data frame The entire data frame (imported from twitter_archive_master.csv). |
| |
| df_x data frame The entire data frame (imported from twitter_archive_master.csv). |
| |
| df_y1 data frame The variable to be displayed as blue bar 1. |
| |
| df_y2 data frame The variable to be displayed as orange bar 2. |
| |
| df_y3 data frame The variable to be displayed as line. |
| |
| algorithm str Could be 'p1', 'p2' or 'p3'. |
| |
| gp_title str Graphic title. |
| |
| xlabel str X axis label. |
| |
| ylabel str Y axis label. |
| |
| font str Axis labels font size. |
| |
| font_title str Graphic Title font size. |
| |
| threshold int Breed's number to be displayed in the graphic. |
| |
| sorting bool True: ascending, False: descending. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| graphic - Returns a matplotlib graphic. |
| |
+----------------------------------------------------------------------------------------------------+
|DEPENDENCIES: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_gp_triple function Generates a Data Frame which is used as input of gp_triple. |
| |
| gp_triple function Create the graphic using the output of df_gp_triple as input. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Subset the df_viz according to the algorthm.
df_alg = df_gp_triple(df_viz,
algorithm = algorithm,
threshold = threshold,
sorting = sorting)
# Plot a graphic according to the df_alg.
gp_triple(df_x = df_alg,
df_y1 = df_alg['favorite_count'],
df_y2 = df_alg['retweet_count'],
df_y3 = df_alg['rate'],
algorithm = algorithm,
gp_title = gp_title,
xlabel = xlabel,
ylabel = ylabel)
Let's test this wrapper on P2 algorithm.
wrapper_triple(algorithm = 'p2',
gp_title = 'Graphic 9 - Algortihm 2 - Favorite/Retweets rate for each Dog\'s Breed.',
xlabel = 'Dog\'s Breed',
ylabel = 'Number of Retweets\nNumber of Favorite')
Finally, I also will plot the same graphic to the P3 algorithm.
wrapper_triple(algorithm = 'p3',
gp_title = 'Graphic 10 - Algortihm 3 - Favorite/Retweets rate for each Dog\'s Breed.',
xlabel = 'Dog\'s Breed',
ylabel = 'Number of Retweets\nNumber of Favorite')
Comparing the three graphics I have realized there is no dog breed which appears in the three graphics (8, 9, and 10), which I can interpret it as a very distinct algorithms calibration.
Something important to point out is to know the procedency of the algorithm and the possible breeds for each algorithm. I assume the list of breeds used to all algorithm is the same, allowing me the comparison.
Conclusion: Unfortunately, it is not possible to determinate which dog breed has the highest impact in respect to the number of retweets and favourite. because for each algorithm a different breed has performed the highest impact. There is no consensus.
In accordance with the Dog Breeds Appeal, let's investigate the output of the algorithm in respect to the Favorite/Retweet rate.
Let's increase the number of breeds analysed by the algorithm to 40 because 20 there are no breeds which appear in the three algorithms.
# Defininf a function to calculate the rate dataframe.
def df_rate(threshold,sorting = False, df_viz = df_viz):
"""
+----------------------------------------------------------------------------------------------------+
|DESCRIPTION: |
| |
| This function creates a data frame with rate (favorit/retweet) of all algorthms, each row is a |
| dog's breed. |
| |
+----------------------------------------------------------------------------------------------------+
|INPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_viz data frame The entire data frame (imported from twitter_archive_master.csv). |
| |
| threshold int Breed's number to be displayed in the graphic. |
| |
| sorting bool True: ascending, False: descending. |
| |
+----------------------------------------------------------------------------------------------------+
|OUTUPUTS: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_thr data frame Returns a data frame with rate calculated of all algorithms. |
| |
+----------------------------------------------------------------------------------------------------+
|DEPENDENCIES: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| df_gp_triple function Generates a Data Frame which is used as input of gp_triple. |
| |
+----------------------------------------------------------------------------------------------------+
"""
# Calculating the new Data Frame for each algorithm.
df_p1_thr = df_gp_triple(df_viz, 'p1', threshold = threshold, sorting = sorting)
df_p2_thr = df_gp_triple(df_viz, 'p2', threshold = threshold, sorting = sorting)
df_p3_thr = df_gp_triple(df_viz, 'p3', threshold = threshold, sorting = sorting)
# Creating df_thr Data Frame to merge df_p1_thr and df_p2_thr.
df_thr = pd.merge(df_p1_thr, df_p2_thr, left_index=True, right_index=True, how='outer')
# Renaming columns.
df_thr.columns = ['favorite_count_p1','retweet_count_p1','rate_p1',
'favorite_count_p2','retweet_count_p2','rate_p2']
# Updating df_thr Data Frame to merging with df_p3_thr.
df_thr = pd.merge(df_thr, df_p3_thr, left_index=True, right_index=True, how='outer')
# Renaming columns.
df_thr.columns = ['favorite_count_p1','retweet_count_p1','rate_p1',
'favorite_count_p2','retweet_count_p2','rate_p2',
'favorite_count_p3','retweet_count_p3','rate_p3']
# Data Cleaning.
# Removing NaN values.
df_thr = df_thr[np.logical_not(df_thr.favorite_count_p1.isnull())]
df_thr = df_thr[np.logical_not(df_thr.favorite_count_p2.isnull())]
df_thr = df_thr[np.logical_not(df_thr.favorite_count_p3.isnull())]
# Subsetting to gather only rates
df_thr = df_thr[['rate_p1', 'rate_p2', 'rate_p3']].sort_values(by = ['rate_p3', 'rate_p2', 'rate_p1'], ascending = False)
return df_thr # Return the df_thr data frame.
# Calculating the rates to threshold equal to 40.
df_rate(threshold = 40)
rate_p1 | rate_p2 | rate_p3 | |
---|---|---|---|
newfoundland | 3.700781 | 5.211597 | 5.745550 |
yorkshire_terrier | 4.204334 | 3.784307 | 3.932011 |
kuvasz | 3.675678 | 4.255575 | 3.885262 |
staffordshire_bullterrier | 4.082564 | 4.182275 | 3.841623 |
Only increasing the threshold (initially defined as 20) to 40 I have found four (4) dog breed.
Based on these results I want to know the correlation between the results of each algorithm in respect to the rate (favorite/retweet).
# Setting: Two graphics side-by-side. Size: [14,8]
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(14, 8))
# Compute the correlation matrix
corr = df_rate(threshold = 40).corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot = True,
square=True, linewidths=.5, ax = ax1,cbar_kws={"shrink": .5})
# Graphic Title to threshold 40
ax1.set_title('Graphic 11a - Correlation Map - Threshold 40', fontsize = font)
# Compute the correlation matrix
corr = df_rate(threshold = 120).corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot = True,
square=True, linewidths=.5, ax = ax2,cbar_kws={"shrink": .5})
# Graphic Title to threshold 100
ax2.set_title('Graphic 11b - Correlation Map - All Breeds', fontsize = font);
Conclusion: There is no correlation between the results of the three algorithms. An analysis oblique using only the 40 breeds with the highest rate could lead us an erroneous conclusion. The Correlation Map using all breeds gave a good measure of (un)similarities between these algorithms.
This interpretation could be biased due to the lack of information about the algorithms.
There are many other questions to be answered:
Subset the df_viz taking only the breed with a rate above of the mean.
Identify the seasons of dog's reproducibility.
This project aims to perform the Data Wrangling and the Exploratory Data Analysis in the WeRateDogs™ Twitter account.
The Data Gathering process englobed three different tasks, the first one download file from URL and later loading to the Jupyter Notebook, which requires a manual step, the second downloading a file programmatically, and the third gathering data from the Twitter API.
Based on the data gathered, I have assessed the most evident issues (17 issues in total) and documented it to create a record of modifications. Later, in Data Cleaning process I have fixed all identified issues to complete, and I have also merged separated data frame into one and added some missing values. The final data frame was stored as twitter_archive_master.csv.
In the Data Analysis and Visualization, which I have interpreted as Exploratory Analysis, I have posed few questions to guide my analysis. I have found strong evidence of:
The Project also has other deliverables, which could be accessed by the following links:
I have consulted this websites to perform this project. Have in mind, the list is not in appearance order.
.isin()
, Stack Overflow, thread;I have indented this fragment of JSON file, just to understand what is the components. Later it will help me to subset/indexing my needs.
See more in Twitter Dev Website.
The object from the tweepy has an outcome a dictionary called _json
, which is what I am looking for.
Status(_api=<tweepy.api.API object at 0x0B288EF0>,
_json={'created_at': 'Wed Dec 26 19:18:12 +0000 2018',
'id': 1078007085848240129,
'id_str': '1078007085848240129',
'text': '@brandimotamedi i hope you love it ☺️',
'truncated': False,
'entities': {'hashtags': [],
'symbols': [],
'user_mentions': [{'screen_name':
'brandimotamedi',
'name': 'Brandi Motamedi',
'id': 344164849,
'id_str': '344164849',
'indices': [0, 15]}],
'urls': []},
'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
'in_reply_to_status_id': 1078006885935067136,
'in_reply_to_status_id_str': '1078006885935067136',
'in_reply_to_user_id': 344164849,
'in_reply_to_user_id_str': '344164849',
'in_reply_to_screen_name': 'brandimotamedi',
'user': {'id': 4196983835,
'id_str': '4196983835',
'name': 'WeRateDogs™',
'screen_name': 'dog_rates',
'location': 'merch ⇨',
'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
'url': 'https://t.co/N7sNNHAEXS',
'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
'expanded_url': 'http://weratedogs.com',
'display_url': 'weratedogs.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 7549717,
'friends_count': 12,
'listed_count': 5661,
'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
'favourites_count': 140650,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 9467,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': False,
'profile_background_color': '000000',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
'profile_link_color': 'F5ABB5',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '000000',
'profile_text_color': '000000',
'profile_use_background_image': False,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'none'},
'geo': None,
'coordinates': None,
'place': None,
'contributors': None,
'is_quote_status': False,
'retweet_count': 0,
'favorite_count': 115,
'favorited': False,
'retweeted': False,
'lang': 'en'},
created_at=datetime.datetime(2018, 12, 26, 19, 18, 12),
id=1078007085848240129,
id_str='1078007085848240129',
text='@brandimotamedi i hope you love it ☺️',
truncated=False,
entities={'hashtags': [],
'symbols': [],
'user_mentions': [{'screen_name': 'brandimotamedi',
'name': 'Brandi Motamedi',
'id': 344164849,
'id_str': '344164849',
'indices': [0, 15]}],
'urls': []},
source='Twitter for iPhone',
source_url='http://twitter.com/download/iphone',
in_reply_to_status_id=1078006885935067136,
in_reply_to_status_id_str='1078006885935067136',
in_reply_to_user_id=344164849,
in_reply_to_user_id_str='344164849',
in_reply_to_screen_name='brandimotamedi',
author=User(_api=<tweepy.api.API object at 0x0B288EF0>,
_json={'id': 4196983835,
'id_str': '4196983835',
'name': 'WeRateDogs™',
'screen_name': 'dog_rates',
'location': 'merch ⇨', 'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
'url': 'https://t.co/N7sNNHAEXS',
'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
'expanded_url': 'http://weratedogs.com',
'display_url': 'weratedogs.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 7549717,
'friends_count': 12,
'listed_count': 5661,
'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
'favourites_count': 140650,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 9467,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': False,
'profile_background_color': '000000',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
'profile_link_color': 'F5ABB5',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '000000',
'profile_text_color': '000000',
'profile_use_background_image': False,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'none'},
id=4196983835,
id_str='4196983835',
name='WeRateDogs™',
screen_name='dog_rates',
location='merch ⇨',
description='Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
url='https://t.co/N7sNNHAEXS',
entities={'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
'expanded_url': 'http://weratedogs.com',
'display_url': 'weratedogs.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
protected=False,
followers_count=7549717,
friends_count=12,
listed_count=5661,
created_at=datetime.datetime(2015, 11, 15, 21, 41, 29),
favourites_count=140650,
utc_offset=None,
time_zone=None,
geo_enabled=True,
verified=True,
statuses_count=9467,
lang='en',
contributors_enabled=False,
is_translator=False,
is_translation_enabled=False,
profile_background_color='000000',
profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png',
profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png',
profile_background_tile=False,
profile_image_url='http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
profile_image_url_https='https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
profile_banner_url='https://pbs.twimg.com/profile_banners/4196983835/1544368760',
profile_link_color='F5ABB5',
profile_sidebar_border_color='000000',
profile_sidebar_fill_color='000000',
profile_text_color='000000',
profile_use_background_image=False,
has_extended_profile=False,
default_profile=False,
default_profile_image=False,
following=False,
follow_request_sent=False,
notifications=False, translator_type='none'),
user=User(_api=<tweepy.api.API object at 0x0B288EF0>,
_json={'id': 4196983835,
'id_str': '4196983835',
'name': 'WeRateDogs™',
'screen_name': 'dog_rates',
'location': 'merch ⇨',
'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
'url': 'https://t.co/N7sNNHAEXS',
'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
'expanded_url': 'http://weratedogs.com',
'display_url': 'weratedogs.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 7549717,
'friends_count': 12,
'listed_count': 5661,
'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
'favourites_count': 140650,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 9467,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': False,
'profile_background_color': '000000',
'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': False,
'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
'profile_link_color': 'F5ABB5',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '000000',
'profile_text_color': '000000',
'profile_use_background_image': False,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'none'},
id=4196983835,
id_str='4196983835',
name='WeRateDogs™',
screen_name='dog_rates',
location='merch ⇨',
description='Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
url='https://t.co/N7sNNHAEXS',
entities={'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
'expanded_url': 'http://weratedogs.com',
'display_url': 'weratedogs.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
protected=False,
followers_count=7549717,
friends_count=12,
listed_count=5661,
created_at=datetime.datetime(2015, 11, 15, 21, 41, 29),
favourites_count=140650,
utc_offset=None,
time_zone=None,
geo_enabled=True,
verified=True,
statuses_count=9467,
lang='en',
contributors_enabled=False,
is_translator=False,
is_translation_enabled=False,
profile_background_color='000000',
profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png',
profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png',
profile_background_tile=False,
profile_image_url='http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
profile_image_url_https='https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
profile_banner_url='https://pbs.twimg.com/profile_banners/4196983835/1544368760',
profile_link_color='F5ABB5',
profile_sidebar_border_color='000000',
profile_sidebar_fill_color='000000',
profile_text_color='000000',
profile_use_background_image=False,
has_extended_profile=False,
default_profile=False,
default_profile_image=False,
following=False,
follow_request_sent=False,
notifications=False,
translator_type='none'),
geo=None,
coordinates=None,
place=None,
contributors=None,
is_quote_status=False,
retweet_count=0,
favorite_count=115,
favorited=False,
retweeted=False,
lang='en')