Project 02 - Wrangle and Analyze Data¶

Student Tags¶

Author: Anderson Hitoshi Uyekita
Project: Wrangle and Analyze Data
Course: Data Science - Foundations II
COD: ND111
Date: 02/01/2019
Version: 2.0

Table of Contents¶

1. Introduction
2. Data Gathering
3. Data Assessing
4. Data Cleaning
5. Data Analysis and Data Visualization
6. Conclusions
7. References
Appendix

Synopsis¶

This project aims to give to the student a real case of how to gather, assess, clean, and analyze the data, in other words its englobes the Data Wrangling and Exploratory Data Analysis. The database used as an example is about the @dog_rate user from Twitter, as known as WeRateDogs™, this account has more than 7,572,000 followers, 9,500 tweets, and 141,000 likes.

The Data Gathering process bundled three different tasks, the first one download file from URL and later loading to the Jupyter Notebook, which requires a manual step, the second downloading a file programmatically, and the third gathering data from the Twitter API. This step has also required to save these data in a local machine.

Based on the data gathered, I have assessed the most evident issues (17 issues in total) and documented it to create a record of modifications. Later, in Data Cleaning process I have fixed all identified issues, and I have also merged (the two downloaded files from the Data Gathering process) into one and added some missing values (from the archive downloaded from the Twitter API). The final data frame was stored as twitter_archive_master.csv.

In the Data Analysis and Visualization, which I have interpreted as Exploratory Analysis, I have posed few questions to guide my analysis, which lead me to found strong evidence of:

Seasonality in the number of tweets along the week and along the year;
A positive correlation between the number of retweets and the number of favourites, and;
No correlation between the algorithms output used to predict the dog breed.

Finally, the Project also has other deliverables, which could be accessed by the following links:

Act Report, and;
Wrangle Report.

1. Introduction ¶

This Jupyter Notebook aims to document the process of Gathering, Assessing, and Cleaning of Data Science Foundations II Nanodegree Program Project 02.

1.1. Reproducibility¶

I have written this report using the Jupyter Notebook, which will allow anyone to reproduce this document in a local computer. I have made a great effort to document every step to turn this document understandable.

Work envinronment¶

I have performed this project using:

Dell Notebook Inspiron 7348;
Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz 2.40GHz;
8.00 GB, and;
Windows 10 Pro 64-bits.

Softwares¶

This report has written using four software:

Python (version 3.7.0);
Jupyter Notebook (version 5.6.0);
Opera (version 57.0.3098.106), and;
Atom (version 1.33.1 ia32).

I have used the Atom to push to Github repository and to visualize some files, and nothing more than this.

Packages¶

I kindly ask you to install each of this packages before you run the next steps.

pandas (version 0.23.4);
requests (version 2.20.1);
re (version 2.2.1);
seaborn (version 0.9.0);
json (version 2.0.9), and;
tweepy (version 3.7.0).

Repository¶

You can access all files of this report in this repository:

https://github.com/AndersonUyekita/ND111_data_science_foundations_02

1.2. Importing Libraries¶

For this project I will use some libraries to manage data frames, download file, Twitter API Client code, etc.. Let's import this packages to the Jupyter Notebook environment.

In [1]:

# Importing Library.

# Importing pandas to work with DataFrames.
import pandas as pd
pd.set_option('mode.chained_assignment','raise')

# Importing numpy to general methods.
import numpy as np

# Importing OS package to write and open files.
import os

# Importing the requests to create a object from a given URL.
import requests

# Importing the Client Code to use the Twitter API.
import tweepy

# Importing the json package to easy my task with json files
import json

# Importing the time package to calculate the elapse time to access the Twitter API
import time

# Importing the re package to use the regular expressions.
import re

# Importing the matplotlib to create graphics
import matplotlib.pyplot as plt
%matplotlib inline

# Import seaborn to better the visualization
import seaborn as sns

2. Data Gathering ¶

There are two (2) files hosted in Udacity website that I must have to gather (download).

image-predictions.tsv, and;
twitter-archive-enhanced.csv.

Later, I will access the Twitter API to gather additional data.

2.2. Downloading Files from Udacity Website¶

First, I will create a folder to store the data and later I will download each file.

In [2]:

# Creating a folder to store the files.
folder = "01-Dataset" # Folder's name.

# Checking if the folder already exist.
if not folder in os.listdir():
    os.mkdir(folder)  # Creating the folder when 01-Dataset do not exist.

# List of files.
list_url = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv',
            'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv']

# Loop to download the two files hosted in Udacity website.
for url in list_url:
    if not url.split('/')[-1].replace('-','_') in os.listdir(folder):                              # Check if the files already exist.
        response = requests.get(url)                                                               # Creating an object of the given URL.
        with open(os.path.join(folder,url.split('/')[-1].replace('-','_')), mode = 'wb') as file:  # Defining the file name.
            file.write(response.content)                                                           # Saving the object.

2.2.1. Loading the files¶

Now that all file is saved in the local machine (stored in the 01-Dataset folder), let's load each file into the Jupyter Notebook environment.

In [3]:

# Loading the image precictions
df_img = pd.read_csv('01-Dataset/image_predictions.tsv', sep = '\t') # tsv file need to configure the sep as tabular.

# Loading the archive of WeRateDogs.
df_ach = pd.read_csv('01-Dataset/twitter_archive_enhanced.csv') # regular csv file.

These two files are ready for the next step of Data Wrangling, which is called Data Assessing.

2.3. Twitter API¶

One of the Project requirements is to access the Twitter API to create the tweet_json.txt completing some missing/wrong values of the tweet-json file. I will use the tweepy package (a client code) to access the Twitter API.

Sensible Data¶

Please, fill with your own consumer_key, consumer_secret, access_token and access_secret.

In [4]:

# Complete with your own keys and secretes.
consumer_key = ''                               # API key
consumer_secret = ''   # API secret key
access_token = ''      # Access token
access_secret = ''          # Access token secret

OAuth Authentication¶

Based on the given consumer_key, consumer_secret, access_token and access_secret I will access the Twitter API.

The following chunk of code has a default configuration of the API, bear in mind, the wait_on_rate_limit and wait_on_rate_limit_notify, both are set as True.

wait_on_rate_limit – Whether or not to automatically wait for rate limits to replenish

wait_on_rate_limit_notify – Whether or not to print a notification when Tweepy is waiting for rate limits to replenish

Obs.: The excerpt was extracted from the Tweepy website.

In [5]:

# Authentication (default configuration).
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

The reason to use the Twitter API is to gather additional data and missing data from the twitter_archive_enhanced.csv file. For this reason, the strategy is to find for each tweet_id the missing and the additional data.

First, I need to know the quantity of unique tweets twitter_archive_enhanced.csv have.

In [6]:

# Unique tweets of twitter-archive-enhanced-2.csv file.
u_twt_id = df_ach.tweet_id.tolist()

# Print the number of unique tweets.
len(df_ach.tweet_id.unique())

Out[6]:

The quantity probably will raise/trigger the TweepError, which require the API use wait some minutes to keep gathering data from Twitter.

I have also configured the tweet_mode to extended because I want all messages without the ... (elipsis), Twitter has increased the character's number from 140 to 280.

For a deeper understanding of _json dictionary, please read the Appendix.

In [7]:

# Initializing the dicitionary to store the raw data from Twitter API JSON file.
dct_twt_raw = {}

# This is the list of error.
df_error = []

# Current time when the Twitter API starts.
str_loop = time.time()

# Loop to find each tweet_id of u_twt_id list.
for tweet in u_twt_id:
    # This is necessary to avoid an interruption if any tweet_id is not find.
    try:
        # The get_status will find a specific tweet_id and return all the tweet.
        twt_status = api.get_status(tweet,
                                    wait_on_rate_limit = True,         # Automatically wait for rate limits to replenish.
                                    wait_on_rate_limit_notify = True,  # Print the notification "Rate limit reached..."
                                    tweet_mode = 'extended')           # I want the full text without "...".
        # The data I want is the _json dictionary.
        dct_twt_raw[str(tweet)] = twt_status._json                     # Bear in mind, _json is a dictionary.
    # In case of a problem, a message will be printed.
    except:
        df_error.append(str(tweet))                                    # Update df_error list.
        print("tweet_id did not found: {}".format(str(tweet)))

# Current time when the Twitter API ends.
end_loop = time.time()

tweet_id did not found: 888202515573088257
tweet_id did not found: 873697596434513921
tweet_id did not found: 872668790621863937
tweet_id did not found: 869988702071779329
tweet_id did not found: 866816280283807744
tweet_id did not found: 861769973181624320
tweet_id did not found: 845459076796616705
tweet_id did not found: 842892208864923648
tweet_id did not found: 837012587749474308
tweet_id did not found: 827228250799742977
tweet_id did not found: 812747805718642688
tweet_id did not found: 802247111496568832
tweet_id did not found: 775096608509886464
tweet_id did not found: 770743923962707968

Rate limit reached. Sleeping for: 565

tweet_id did not found: 754011816964026368
tweet_id did not found: 680055455951884288

Rate limit reached. Sleeping for: 566

Unfortunately, some tweet_id was not found in @dog_rates tweets database, for some reason the tweet disappeared.

I have stored this information in a separated list in case of necessity I can use it later.

For sanity reason, I have recorded the time elapsed to gather data from Twitter API.

To query all of the tweet IDs in the WeRateDogs Twitter archive, 20-30 minutes of running time can be expected. Printing out each tweet ID after it was queried and using a code timer were both helpful for sanity reasons.

In [8]:

# Calculating the elapsed time in minutes.
elapsed_time = (end_loop - str_loop)/60

# Printing the elapsed time of the loop.
print("Elapsed time: {} minutes".format(round(elapsed_time,2)))

Elapsed time: 33.57 minutes

Following the project's instruction, let's create the tweet_json.txt file.

Bear in mind, the dct_twt_raw is a dictionary, and I will save it as a JSON file.

In [9]:

# Writing the tweet_json.txt in the 01-Dataset folder.
with open('01-Dataset/tweet_json.txt', mode = 'w') as file:
    json.dump(dct_twt_raw, file)

Also according to the project's instructions, I need to read the exported file (tweet_json.txt).

Take consideration tweet_json.txt file is a dictionary and I do want a DataFrame, I can use the read_json from pandas package to import as a DataFrame. I have also defined the orient as an index to be in conformity with the definition of tidy dataset preconized by Hadley Wickham.

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table.

In [10]:

# Loading the tweet_json.txt as a panda DataFrame.
df_twt_raw = pd.read_json('01-Dataset/tweet_json.txt', orient = 'index') # Atention to the orient defined as index.

# Printing the first 2 rows.
df_twt_raw.head(2)

Out[10]:

	contributors	coordinates	created_at	display_text_range	entities	extended_entities	favorite_count	favorited	full_text	geo	...	quoted_status	quoted_status_id	quoted_status_id_str	quoted_status_permalink	retweet_count	retweeted	retweeted_status	source	truncated	user
1998-04-12 22:37:23.555336193	NaN	NaN	2017-08-01 16:23:56	[0, 85]	{'hashtags': [], 'symbols': [], 'user_mentions...	{'media': [{'id': 892420639486877696, 'id_str'...	38075	0	This is Phineas. He's a mystical boy. Only eve...	NaN	...	NaN	NaN	NaN	NaN	8328	0	NaN	<a href="http://twitter.com/download/iphone" r...	0	{'id': 4196983835, 'id_str': '4196983835', 'na...
1998-04-10 03:03:41.306343426	NaN	NaN	2017-08-01 00:17:27	[0, 138]	{'hashtags': [], 'symbols': [], 'user_mentions...	{'media': [{'id': 892177413194625024, 'id_str'...	32680	0	This is Tilly. She's just checking pup on you....	NaN	...	NaN	NaN	NaN	NaN	6151	0	NaN	<a href="http://twitter.com/download/iphone" r...	0	{'id': 4196983835, 'id_str': '4196983835', 'na...

2 rows × 32 columns

Finally, I have gathered all three datasets:

tweet_json.txt as df_twt_raw;
image_predictions.tsv as df_img, and;
twitter_archive_enhanced.csv as df_ach.

3. Data Assessing ¶

The issues could be defined into two types:

Quality issues: Dirty data;
- Missing, duplicated, or incorrect data;
Lack of tidiness: Also known as messy data;
- Structural issues.

3.1. Objectivies¶

Project requirements extracted from the project rubric.

At least eight (8) data quality issues and two (2) tidiness issues are detected, and include the issues to clean to satisfy the Project Motivation. Each issue is documented in one to a few sentences each.

3.2. Methodology¶

I will perform a straightforward assessment.

Detect, and;
Document.

So, for each dataset, I will point out all problem I have detected, and later in the Data Cleaning process I will clean and fix these issues.

3.3. twitter_archive_enhanced.csv¶

The twitter_archive_enhanced.csv file has loaded to Jupyter Notebook environment as df_ach. As the first contact to this data frame, let's start to investigate this archive using the traditional visual assessment such .head() and .tail() methods.

`.head()`¶

In [11]:

# Print the first 5 rows of the data set.
df_ach.head()

Out[11]:

	tweet_id	in_reply_to_status_id	in_reply_to_user_id	timestamp	source	text	retweeted_status_id	retweeted_status_user_id	retweeted_status_timestamp	expanded_urls	rating_numerator	rating_denominator	name	doggo	floofer	pupper	puppo
0	892420643555336193	NaN	NaN	2017-08-01 16:23:56 +0000	<a href="http://twitter.com/download/iphone" r...	This is Phineas. He's a mystical boy. Only eve...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/892420643...	13	10	Phineas	None	None	None	None
1	892177421306343426	NaN	NaN	2017-08-01 00:17:27 +0000	<a href="http://twitter.com/download/iphone" r...	This is Tilly. She's just checking pup on you....	NaN	NaN	NaN	https://twitter.com/dog_rates/status/892177421...	13	10	Tilly	None	None	None	None
2	891815181378084864	NaN	NaN	2017-07-31 00:18:03 +0000	<a href="http://twitter.com/download/iphone" r...	This is Archie. He is a rare Norwegian Pouncin...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891815181...	12	10	Archie	None	None	None	None
3	891689557279858688	NaN	NaN	2017-07-30 15:58:51 +0000	<a href="http://twitter.com/download/iphone" r...	This is Darla. She commenced a snooze mid meal...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891689557...	13	10	Darla	None	None	None	None
4	891327558926688256	NaN	NaN	2017-07-29 16:00:24 +0000	<a href="http://twitter.com/download/iphone" r...	This is Franklin. He would like you to stop ca...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891327558...	12	10	Franklin	None	None	None	None

`.tail()`¶

In [12]:

# Print the last 5 rows of the data set.
df_ach.tail()

Out[12]:

	tweet_id	in_reply_to_status_id	in_reply_to_user_id	timestamp	source	text	retweeted_status_id	retweeted_status_user_id	retweeted_status_timestamp	expanded_urls	rating_numerator	rating_denominator	name	doggo	floofer	pupper	puppo
2351	666049248165822465	NaN	NaN	2015-11-16 00:24:50 +0000	<a href="http://twitter.com/download/iphone" r...	Here we have a 1949 1st generation vulpix. Enj...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/666049248...	5	10	None	None	None	None	None
2352	666044226329800704	NaN	NaN	2015-11-16 00:04:52 +0000	<a href="http://twitter.com/download/iphone" r...	This is a purebred Piers Morgan. Loves to Netf...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/666044226...	6	10	a	None	None	None	None
2353	666033412701032449	NaN	NaN	2015-11-15 23:21:54 +0000	<a href="http://twitter.com/download/iphone" r...	Here is a very happy pup. Big fan of well-main...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/666033412...	9	10	a	None	None	None	None
2354	666029285002620928	NaN	NaN	2015-11-15 23:05:30 +0000	<a href="http://twitter.com/download/iphone" r...	This is a western brown Mitsubishi terrier. Up...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/666029285...	7	10	a	None	None	None	None
2355	666020888022790149	NaN	NaN	2015-11-15 22:32:08 +0000	<a href="http://twitter.com/download/iphone" r...	Here we have a Japanese Irish Setter. Lost eye...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/666020888...	8	10	None	None	None	None	None

As a result of the .head() and .tail() methods, I can point out the following issues recorded in Table 1.

Table 1 - Issues identified visually in twitter_archive_enhanced.csv.

Issue ID	Issue Type	Dimension	Method	Column	Description
1	Quality	Validity	Visual	name	Invalid names or non-standard names.
2	Tidiness	-	Visual	source	HTML tags, URL, and content in a single column.

I have found some weird names, which has required a deeper analysis using the .value_counts() method to examine the recurrence of this error.

`.value_counts()`¶

In [13]:

# What is the most commom name to a Dog!?
df_ach.name.value_counts()

Out[13]:

None          745
a              55
Charlie        12
Lucy           11
Oliver         11
Cooper         11
Penny          10
Lola           10
Tucker         10
Bo              9
Winston         9
the             8
Sadie           8
Bailey          7
an              7
Daisy           7
Toby            7
Buddy           7
Dave            6
Stanley         6
Rusty           6
Jack            6
Koda            6
Oscar           6
Leo             6
Milo            6
Scout           6
Jax             6
Bella           6
Louis           5
             ... 
such            1
Jeb             1
Fletcher        1
Laela           1
space           1
Monkey          1
Ferg            1
Longfellow      1
Mojo            1
Ricky           1
Vinscent        1
Koko            1
Chloe           1
Shiloh          1
Thor            1
Rumpole         1
Kanu            1
Filup           1
Berkeley        1
Goose           1
Superpup        1
Eazy            1
Comet           1
Tonks           1
Zara            1
Bluebert        1
Lili            1
Caryl           1
Sprout          1
Pepper          1
Name: name, Length: 957, dtype: int64

The .value_counts() method applied to the name's columns shows a higher number of weird and probably wrong dogs names.

55 occurrences of a;
8 occurrences of the;
7 occurrences of an, etc.

This list ensures the problem already identified visually.

Let's deep in the rating_denominator and rating_numerator.

In [14]:

# What is the most commom denominator. I guess it must be 10.
df_ach.rating_denominator.value_counts()

Out[14]:

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [15]:

# In the past @dog_rates used to rate with grades between 0 to 10.
df_ach.rating_numerator.value_counts()

Out[15]:

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

As you can see, both columns have problems in some observations. Table 2 summarizes the issues.

Table 2 - Issues identified visually and programmatically in twitter_archive_enhanced.csv.

Issue ID	Issue Type	Dimension	Method	Column	Description
1	Quality	Validity	Visual	name	Invalid names or non-standard names.
2	Tidiness	-	Visual	source	HTML tags, URL, and content in a single column.
3	Quality	Validity	Programmatic	rating_numerator	Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from `int` to `float`.
4	Quality	Validity	Programmatic	rating_denominator	Invalid denominator, I expected a fixed base. Data Structure must be converted from `int` to `float`.

`.info()`¶

In [16]:

# An overview of the df_ach.
df_ach.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 220.9+ KB

The .info() method provide a good overview of each column, which allow me to identify latents issues. Table 3 aggregates these new issues.

Table 3 - Issues identified visually and programmatically in twitter_archive_enhanced.csv.

Issue ID	Issue Type	Dimension	Method	Column	Description
1	Quality	Validity	Visual	name	Invalid names or non-standard names.
2	Tidiness	-	Visual	source	HTML tags, URL, and content in a single column.
3	Quality	Validity	Programmatic	rating_numerator	Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from `int` to `float`.
4	Quality	Validity	Programmatic	rating_denominator	Invalid denominator, I expected a fixed base. Data Structure must be converted from `int` to `float`.
5	Tidiness	-	Programmatic	doggo, floofer, pupper, and puppo	This is a categorical variable, and I can combine these columns into one column.
6	Tidiness	-	Programmatic	text	There is two information in a single column. Split the text from the URL.
7	Quality	Validity	Programmatic	timestamp	Convert to date.
8	Quality	Validity	Programmatic	tweet_id	Following the example of zip code, it must be a string.
9	Quality	Accuracy	Programmatic	retweeted_status_id	The same dog could be recorded twice or more in cases of retweets.
10	Quality	Accuracy	Programmatic	in_reply_to_status_id	The same dog could be recorded twice or more in cases of reply.

The retweeted_status_id, retweeted_status_user_id, in_reply_to_status_id, and in_reply_to_user_id could lead wrong results because it will duplicate the same dog picture in each retweet or reply. For this reason, it's good to remove this tweet_id's.

An example of this problem is shown below. The dog so-called "Sierra" will appears twice due to the retweet.

In [17]:

# Subsetting the df_ach to find a retweet.
df_ach[df_ach.name == 'Sierra']

Out[17]:

	tweet_id	in_reply_to_status_id	in_reply_to_user_id	timestamp	source	text	retweeted_status_id	retweeted_status_user_id	retweeted_status_timestamp	expanded_urls	rating_numerator	rating_denominator	name	doggo	floofer	pupper	puppo
97	873337748698140672	NaN	NaN	2017-06-10 00:35:19 +0000	<a href="http://twitter.com/download/iphone" r...	RT @dog_rates: This is Sierra. She's one preci...	8.732138e+17	4.196984e+09	2017-06-09 16:22:42 +0000	https://www.gofundme.com/help-my-baby-sierra-g...	12	10	Sierra	None	None	pupper	None
98	873213775632977920	NaN	NaN	2017-06-09 16:22:42 +0000	<a href="http://twitter.com/download/iphone" r...	This is Sierra. She's one precious pupper. Abs...	NaN	NaN	NaN	https://www.gofundme.com/help-my-baby-sierra-g...	12	10	Sierra	None	None	pupper	None

Let's take a look at the duplicated tweet_id.

In [18]:

# Duplicated tweet_id.
sum(df_ach.tweet_id.duplicated())

Out[18]:

Although there are no duplicate tweet_id, I need to be aware of the retweet, which creates a new tweet_id for the same dog, resulting in double entries (or more).

3.4. image_predictions.tsv¶

Analogous to the 3.3. item, I will perform the same process to the image_predictions.tsv file, I have stored this file in the df_img variable.

`.head()`¶

In [19]:

# Printing the first 5 rows.
df_img.head()

Out[19]:

	tweet_id	jpg_url	img_num	p1	p1_conf	p1_dog	p2	p2_conf	p2_dog	p3	p3_conf	p3_dog
0	666020888022790149	https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg	1	Welsh_springer_spaniel	0.465074	True	collie	0.156665	True	Shetland_sheepdog	0.061428	True
1	666029285002620928	https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg	1	redbone	0.506826	True	miniature_pinscher	0.074192	True	Rhodesian_ridgeback	0.072010	True
2	666033412701032449	https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg	1	German_shepherd	0.596461	True	malinois	0.138584	True	bloodhound	0.116197	True
3	666044226329800704	https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg	1	Rhodesian_ridgeback	0.408143	True	redbone	0.360687	True	miniature_pinscher	0.222752	True
4	666049248165822465	https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg	1	miniature_pinscher	0.560311	True	Rottweiler	0.243682	True	Doberman	0.154629	True

`.tail()`¶

In [20]:

# Printing the last 5 rows.
df_img.tail()

Out[20]:

	tweet_id	jpg_url	img_num	p1	p1_conf	p1_dog	p2	p2_conf	p2_dog	p3	p3_conf	p3_dog
2070	891327558926688256	https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg	2	basset	0.555712	True	English_springer	0.225770	True	German_short-haired_pointer	0.175219	True
2071	891689557279858688	https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg	1	paper_towel	0.170278	False	Labrador_retriever	0.168086	True	spatula	0.040836	False
2072	891815181378084864	https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg	1	Chihuahua	0.716012	True	malamute	0.078253	True	kelpie	0.031379	True
2073	892177421306343426	https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg	1	Chihuahua	0.323581	True	Pekinese	0.090647	True	papillon	0.068957	True
2074	892420643555336193	https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg	1	orange	0.097049	False	bagel	0.085851	False	banana	0.076110	False

Visually I did not find much, unless the inconsistency in the breed name, some has the first letter capitalized and other not. I will record this issue in Table 4.

Table 4 - Issues identified visually in image_predictions.tsv.

Issue ID	Issue Type	Dimension	Method	Column	Description
11	Quality	Consistency	Visual	p1, p2, and p3	Dog's breed has no standard. Capital letter or lowercase names.

`.info()`¶

In [21]:

# An overview of the df_img.
df_img.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 119.6+ KB

One more issue observed is the type of tweet_id variable. Table 5 summarizes the issues found in image_predictions.tsv.

Table 5 - Issues identified visually and programmatically in image_predictions.tsv.

Issue ID	Issue Type	Dimension	Method	Column	Description
11	Quality	Consistency	Visual	p1, p2, and p3	Dog's breed has no standard. Capital letter or lowercase names.
12	Quality	Validity	Programmatic	tweet_id	Convert to string.

Let's analyze the duplicated entries.

`.duplicated()`¶

In [22]:

# Duplicated tweet_id.
sum(df_img.tweet_id.duplicated())

Out[22]:

In [23]:

# Duplicated jpg_url.
sum(df_img.jpg_url.duplicated())

Out[23]:

The tweet_id is a unique key to the tweet and not to the rated dog. For this reason, it is a good idea to filter for a unique picture of the dog, which will also remove the retweets. Table 6 summarizes the issues of the image_predictions.tsv file.

Table 6 - Issues identified visually and programmatically in image_predictions.tsv.

Issue ID	Issue Type	Dimension	Method	Column	Description
11	Quality	Consistency	Visual	p1, p2, and p3	Dog's breed has no standard. Capital letter or lowercase names.
12	Quality	Validity	Programmatic	tweet_id	Convert to string.
13	Quality	Validity	Programmatic	jpg_url	It has duplicated images and consequently double entry.

3.5. Conjunctural Analysis¶

Based on the tidy data concept, these two tables (twitter_archive_enhanced.csv and image_predictions.tsv) could be merged into one, using the tweet_id as a mapping key. Summarizing Table 3, Table 6, and this new issue, Table 7 is the preliminary version of the issues on all dataset of this project.

Table 7 - Summary of Issues Identified.

Issue ID	Table	Issue Type	Dimension	Method	Column	Description
1	df_ach	Quality	Validity	Visual	name	Invalid names or non-standard names.
2	df_ach	Tidiness	-	Visual	source	HTML tags, URL, and content in a single column.
3	df_ach	Quality	Validity	Programmatic	rating_numerator	Invalid ratings. Value varies from 1776 to 0. Data Structure must be converted from `int` to `float`.
4	df_ach	Quality	Validity	Programmatic	rating_denominator	Invalid denominator, I expected a fixed base. Data Structure must be converted from `int` to `float`.
5	df_ach	Tidiness	-	Programmatic	doggo, floofer, pupper, and puppo	This is a categorical variable, and I can combine these columns into one column.
6	df_ach	Tidiness	-	Programmatic	text	There is two information in a single column. Split the text from the URL.
7	df_ach	Quality	Validity	Programmatic	timestamp	Convert to date.
8	df_ach	Quality	Validity	Programmatic	tweet_id	Following the example of zip code, it must be a string.
9	df_ach	Quality	Accuracy	Programmatic	retweeted_status_id	The same dog could be recorded twice or more in cases of retweets.
10	df_ach	Quality	Accuracy	Programmatic	in_reply_to_status_id	The same dog could be recorded twice or more in cases of reply.
11	df_img	Quality	Consistency	Visual	p1, p2, and p3	Dog's breed has no standard. Capital letter or lowercase names.
12	df_img	Quality	Validity	Programmatic	tweet_id	Convert to string.
13	df_img	Quality	Validity	Programmatic	jpg_url	It has duplicated images and consequently double entry.
14	twt_ach_mstr	Tidiness	-	Programmatic	-	Merging these two tables (`df_ach` and `df_img`) into one.

Where:

df_ach: twitter_archive_enhanced.csv;
df_img: image_predictions.tsv;
twt_ach_mstr: twitter_archive_master.csv, and;
df_twt_raw: tweet_json.txt.

3.6. Iterative Process¶

Have in mind the Data Assessing is not an exhaustive step, and neither aims to find all issues of these two tables. Probably, this step will be revisited again to find hidden issues.

4. Data Cleaning ¶

Founded on Table 7, let's start to fix each issue pointed out.

4.1. Copying¶

Before any edition or data modification, I will make copies of each table to keep the original without any edition.

In [24]:

# Copying the df_img.
df_img_cln = df_img.copy()

# Copying the df_ach.
df_ach_cln = df_ach.copy()

# Copying the df_twt_raw. 
df_twt_cln = df_twt_raw.copy()

There is no correct sequence to fix the issues, for this reason, I have decided to start fixing the problems related to duplicated entries, data type, and later merging the tables df_ach and df_img.

4.2. Issue ID 13¶

Issue: Duplicated images and consequently double entry.

Define¶

Remove all duplicated images (based on the URL).

Code¶

In [25]:

# Vector to indexing the duplicated jpg url.
indexing = df_img.jpg_url.duplicated()

# I want the opposite. All observations except the duplicated.
indexing = np.logical_not(indexing)

# Subsetting and assigning to the df_img cleaned.
df_img_cln = df_img_cln[indexing]

# New dimensions of df_img_cln
print("Before: {} rows.\nAfter: {} rows.".format(df_img_cln.shape[0], df_img.shape[0]))

Before: 2009 rows.
After: 2075 rows.

Test¶

In [26]:

# Testing the code given the defined solution.
print("{} duplicated.".format(sum(df_img_cln.jpg_url.duplicated())))

0 duplicated.

As you can see, I have removed 66 observations (the original observation's number is 2075). To ensure I have printed the number of duplicated items in the jpg_url column, which is now zero.

4.3. Issue ID 9 and 10¶

Due to the same nature of these two issues, I decided to bundle it to avoid repetition.

Issue ID 9: The same dog could be recorded twice or more in cases of retweets.

Issue ID 10: The same dog could be recorded twice or more in cases of reply.

Define¶

Remove any tweet with retweeted_status_id non-null, and;
Remove any tweet with in_reply_to_status_id non-null.

Code¶

Following the same idea of Issue 14, I will remove any retweet and reply.

In [27]:

# CODE - Issue ID 9
# Creating a vector to select the retweeted_status_id different of Null.
indexing = df_ach_cln.retweeted_status_id.isnull()

# Subsetting the df_ach_cln excluding the retweeted. 
df_ach_cln = df_ach_cln[indexing]


# CODE - Issue ID 10
# Creating a vector to select the in_reply_to_status_id different of Null.
indexing = df_ach_cln.in_reply_to_status_id.isnull()

# Subsetting the df_ach_cln excluding the reply.
df_ach_cln = df_ach_cln[indexing]

Test¶

In [28]:

# Testing: Expect all rows as True when using the .isnull()  
print("Number of rows with true in retweeted_status_id:", sum(df_ach_cln.retweeted_status_id.isnull()))

# Testing: Expect all rows as True when using the .isnull()  
print("Number of rows with true in in_reply_to_status_id:", sum(df_ach_cln.in_reply_to_status_id.isnull()))

# Number of rows in df_ach_cln.
print("Number of rows of df_ach_cln:",df_ach_cln.shape[0])

Number of rows with true in retweeted_status_id: 2097
Number of rows with true in in_reply_to_status_id: 2097
Number of rows of df_ach_cln: 2097

The test shows both columns, retweeted_status_id and in_reply_to_status_id, do not have any content, which means are two empty columns.

4.4. Issue ID 8¶

Issue: Following the example of zip code, it must be string.

Following the same idea from this thread of Stack Overflow:

Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.

Define¶

Convert tweet_id to str.

Code¶

In [29]:

# Converting the tweet_id to string.
df_ach_cln.tweet_id = df_ach_cln.tweet_id.astype(str)

Test¶

In [30]:

# Printing the info to ensure the convertion. Bear in mind, in pandas object is string. 
df_ach_cln.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2097 non-null object
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     2097 non-null object
source                        2097 non-null object
text                          2097 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2094 non-null object
rating_numerator              2097 non-null int64
rating_denominator            2097 non-null int64
name                          2097 non-null object
doggo                         2097 non-null object
floofer                       2097 non-null object
pupper                        2097 non-null object
puppo                         2097 non-null object
dtypes: float64(4), int64(2), object(11)
memory usage: 204.8+ KB

Based on the .info() the tweet_id has converted as object, which is in pandas the same of a string.

4.5. Issue ID 12¶

Issue: Convert tweet_id to string to be capable to use it as key in a merging process.

In common with Issue ID 8, I will do the same.

Define¶

Convert tweet_id to str.

Code¶

In [31]:

# Converting the tweet_id to string.
df_img_cln.tweet_id = df_img_cln.tweet_id.astype(str)

Test¶

In [32]:

# Printing the info to ensure the convertion. Bear in mind, in pandas object is string. 
df_img_cln.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2009 non-null object
jpg_url     2009 non-null object
img_num     2009 non-null int64
p1          2009 non-null object
p1_conf     2009 non-null float64
p1_dog      2009 non-null bool
p2          2009 non-null object
p2_conf     2009 non-null float64
p2_dog      2009 non-null bool
p3          2009 non-null object
p3_conf     2009 non-null float64
p3_dog      2009 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 123.6+ KB

Based on the .info() the tweet_id has converted as object, which is in pandas the same of a string.

4.6. Issue ID 14¶

Issue: Based on the concept of tidy data preconized by Hadley Wickham, it is a good idea to merge these two tables.

The merging process will use the tweet_id as a Key.

Obs.: Read more about merging in Pandas Website.

Define¶

Merge df_ach_cln and df_img_cln.

Code¶

In [33]:

# Creating the twitter_archive_master.
twt_ach_mstr = pd.merge(df_ach_cln, df_img_cln, on = 'tweet_id')

Test¶

In [34]:

# The first 5 rows of the merged table.
twt_ach_mstr.head()

Out[34]:

	tweet_id	in_reply_to_status_id	in_reply_to_user_id	timestamp	source	text	retweeted_status_id	retweeted_status_user_id	retweeted_status_timestamp	expanded_urls	...	img_num	p1	p1_conf	p1_dog	p2	p2_conf	p2_dog	p3	p3_conf	p3_dog
0	892420643555336193	NaN	NaN	2017-08-01 16:23:56 +0000	<a href="http://twitter.com/download/iphone" r...	This is Phineas. He's a mystical boy. Only eve...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/892420643...	...	1	orange	0.097049	False	bagel	0.085851	False	banana	0.076110	False
1	892177421306343426	NaN	NaN	2017-08-01 00:17:27 +0000	<a href="http://twitter.com/download/iphone" r...	This is Tilly. She's just checking pup on you....	NaN	NaN	NaN	https://twitter.com/dog_rates/status/892177421...	...	1	Chihuahua	0.323581	True	Pekinese	0.090647	True	papillon	0.068957	True
2	891815181378084864	NaN	NaN	2017-07-31 00:18:03 +0000	<a href="http://twitter.com/download/iphone" r...	This is Archie. He is a rare Norwegian Pouncin...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891815181...	...	1	Chihuahua	0.716012	True	malamute	0.078253	True	kelpie	0.031379	True
3	891689557279858688	NaN	NaN	2017-07-30 15:58:51 +0000	<a href="http://twitter.com/download/iphone" r...	This is Darla. She commenced a snooze mid meal...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891689557...	...	1	paper_towel	0.170278	False	Labrador_retriever	0.168086	True	spatula	0.040836	False
4	891327558926688256	NaN	NaN	2017-07-29 16:00:24 +0000	<a href="http://twitter.com/download/iphone" r...	This is Franklin. He would like you to stop ca...	NaN	NaN	NaN	https://twitter.com/dog_rates/status/891327558...	...	2	basset	0.555712	True	English_springer	0.225770	True	German_short-haired_pointer	0.175219	True

5 rows × 28 columns

In [35]:

# Printing the number of columns of df_ach_cln and df_img_cln.
print("df_ach_cln: {} columns\ndf_img_cln: {} columns".format(df_ach_cln.shape[1], df_img_cln.shape[1]))

df_ach_cln: 17 columns
df_img_cln: 12 columns

As you can see, the twt_ach_mstr, data frame result from the merging process of df_ach_cln and df_img_cln, has 28 columns, which is in compass to the summation of the columns of df_ach_cln and df_img_cln minus the "key" parameter (17 + 12 - 1 = 28).

4.7. Issue ID 1¶

Issue: Invalid names or non-standard names.

The expression used to track the dog's name (in other words the regular expression) is likely "This is DOG_NAME" when DOG_NAME is changed by "a funny" or "an incredible" or "very something", the expression did not "scrap" the right name. In some cases neither a name the dog has, but the expression with taking whatever it lays in that position.

For this reason, I have changed every non-standard name to "None".

Define¶

Rename non-standard names to "None".

Code¶

In [36]:

# Initialization of variable.
non_names = []

# Loop to find ordinary word.
for index in twt_ach_mstr.name:
    # Will check every name start with lowercase.
    if index.islower():
        # If yes will append to non_names.
        non_names.append(index)

# The list will filter only unique values
non_names = list(set(non_names))

# Printing non-standard names.
non_names

Out[36]:

['not',
 'this',
 'just',
 'actually',
 'my',
 'one',
 'infuriating',
 'the',
 'unacceptable',
 'all',
 'by',
 'his',
 'light',
 'very',
 'quite',
 'an',
 'officially',
 'a',
 'space',
 'getting',
 'incredibly',
 'such']

The list of names shown above has all non-standard names from name column. I had replaced it by None using the .replace() method.

In [37]:

# Loop to replace each non standard name (non_name).
for index in non_names:
    twt_ach_mstr.name.replace(index,
                              "None",
                              inplace = True)

Test¶

In [38]:

# Based on the list of non-standard names. Let's ensure if all have been removed.
sum(twt_ach_mstr.name.isin(non_names))

Out[38]:

The test ensure there are no more non-standard names.

4.8. Issue ID 2¶

Issue: HTML tags, URL, and content in a single column.

I will edit it because there are two pieces ofinformations stored in a single column.

The source name;
The URL who redirect to the source (href).

Obs.: I have read this thread in Stack Overflow to revise how to use the .concat().

Define¶

Remove HTML tags;
Create a new column to store the href, and;
Assign to the source column the content between HTML tags.

Code¶

In [39]:

# Creating auxiliary list to store the source and source_url.
aux_ls_sr = []
aux_ls_sr_url = [] 

# This loop will remove all HTML tags using .split() and indexing.
for index in twt_ach_mstr.source:
    aux_ls_sr.append(index.split('</a')[0].split('<a href="')[1].split('" rel="nofollow">')[1] )
    aux_ls_sr_url.append(index.split('</a')[0].split('<a href="')[1].split('" rel="nofollow">')[0] )

# Updating the source column with aux_ls_sr.
twt_ach_mstr.source = aux_ls_sr

# Appending a new column with source_url info.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(aux_ls_sr_url, columns = ['source_url'])], axis = 1)

Test¶

In [40]:

# Printing the first 5 rows of a subsetted data frame.
twt_ach_mstr[['tweet_id','source','source_url']].head()

Out[40]:

	tweet_id	source	source_url
0	892420643555336193	Twitter for iPhone	http://twitter.com/download/iphone
1	892177421306343426	Twitter for iPhone	http://twitter.com/download/iphone
2	891815181378084864	Twitter for iPhone	http://twitter.com/download/iphone
3	891689557279858688	Twitter for iPhone	http://twitter.com/download/iphone
4	891327558926688256	Twitter for iPhone	http://twitter.com/download/iphone

In [41]:

# Printing the categories of source column.
twt_ach_mstr.source.value_counts()

Out[41]:

Twitter for iPhone    1932
Twitter Web Client      28
TweetDeck               11
Name: source, dtype: int64

In [42]:

# Printing the new column.
twt_ach_mstr.source_url.value_counts()

Out[42]:

http://twitter.com/download/iphone              1932
http://twitter.com                                28
https://about.twitter.com/products/tweetdeck      11
Name: source_url, dtype: int64

The source column now does not show any HTML tag, even more, the href URL from the HTML tag was converted into a new feature stored in source_url column.

4.9. Issue ID 6¶

Issue: There are two information in a single column.

In the same way, the text column (of twt_ach_mstr) has two pieces of information, the text and the URL at the end of the message. Let's divide this into two separated columns.

Define¶

Remove the URL in the end of the text column, and;
Create a new column to store the URL in the end of each tweet, and;

Code¶

In [43]:

# Creating auxiliary list to store the text and text_url.
aux_txt = []
aux_txt_url = [] 

# This loop will remove all HTML tags using .split() and indexing.
for index in twt_ach_mstr.text:
    aux_txt.append(' '.join(index.split(" ")[:-1]))
    aux_txt_url.append(index.split(" ")[-1])

# Updating the source column with aux_ls_sr.
twt_ach_mstr.text = aux_txt

# Appending a new column with source_url info.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(aux_txt_url, columns = ['text_url'])], axis = 1)

Test¶

In [44]:

# Printing the first 5 rows of a subsetted dataframe. 
twt_ach_mstr[['tweet_id','text','text_url']].head()

Out[44]:

	tweet_id	text	text_url
0	892420643555336193	This is Phineas. He's a mystical boy. Only eve...	https://t.co/MgUWQ76dJU
1	892177421306343426	This is Tilly. She's just checking pup on you....	https://t.co/0Xxu71qeIV
2	891815181378084864	This is Archie. He is a rare Norwegian Pouncin...	https://t.co/wUnZnhtVJB
3	891689557279858688	This is Darla. She commenced a snooze mid meal...	https://t.co/tD36da7qLQ
4	891327558926688256	This is Franklin. He would like you to stop ca...	https://t.co/AtUZn91f7f

In [45]:

# Printing the first 5 rows of text column.
twt_ach_mstr.text.tolist()[:5]

Out[45]:

["This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10",
 "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10",
 'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10',
 'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us',
 'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek']

The URL was removed and assigned to a new column (text_url) and the text is free of URL at the end of each tweet.

4.10. Issue ID 3 and 4¶

Issue: Invalid ratings. Value varies from 1776 to 0. Data Strucutre must be converted from int to float.

Issue: Invalid denominator, I expected a fixed base. Data Strucutre must be converted from int to float.

First of all, I will convert both columns, rating_numerator and rating_denominator, from int to float because in some cases the @dog_rates uses float rating number.

Define¶

Convert rating_numerator and rating_denominator to float;
Remove the extreme values (1776, 420, etc.) of rating_numerator, and;
Remove non expected value of denominator, anything different of 10.

Code¶

In [46]:

# Converting the rating_numerator and rating_denominator to float.
twt_ach_mstr.rating_numerator = twt_ach_mstr.rating_numerator.astype(float)
twt_ach_mstr.rating_denominator = twt_ach_mstr.rating_denominator.astype(float)

Most of the numerator values are in conformity to what I expected, but few have extreme values such as 1776, 420, 204, and so on.

For this reason, I am going to subset the twt_ach_mstr data frame to realize what is in common with these observations.

Obs.: I have read this thread in Stack Overflow to learn about .isin()

In [47]:

# Creating the black list
black_list = sorted(twt_ach_mstr.query('rating_numerator > 15').rating_numerator.tolist())

# Let's subset the twt_ach_mstr using the .isin() method.
twt_ach_mstr[twt_ach_mstr.rating_numerator.isin(black_list)][['tweet_id','text']]

Out[47]:

	tweet_id	text
320	820690176645140481	The floofs have been released I repeat the flo...
382	810984652412424192	Meet Sam. She smiles 24/7 & secretly aspir...
499	786709082849828864	This is Logan, the Chow who lived. He solemnly...
549	778027034220126208	This is Sophie. She's a Jubilant Bush Pupper. ...
658	758467244762497024	Why does this never happen at my front door......
722	749981277374128128	This is Atticus. He's quite simply America af....
842	731156023742988288	Say hello to this unbelievably well behaved sq...
918	716439118184652801	This is Bluebert. He just saw that both #Final...
939	713900603437621249	Happy Saturday here's 9 puppers on a bench. 99...
963	710658690886586372	Here's a brigade of puppers. All look very pre...
981	709198395643068416	From left to right:\nCletus, Jerome, Alejandro...
1045	704054845121142784	Here is a whole flock of puppers. 60/50 I'll ...
1120	697463031882764288	Happy Wednesday here's a bucket of pups. 44/40...
1288	684222868335505415	Someone help the girl is being mugged. Several...
1359	680494726643068929	Here we have uncovered an entire battalion of ...
1420	677716515794329600	IT'S PUPPERGEDDON. Total of 144/120 ...I think
1478	675853064436391936	Here we have an entire platoon of puppers. Tot...
1696	670842764863651840	After so many requests... here you go.\n\nGood...

I have found some tweets not related to grading/rating:

810984652412424192: This is a crowdfund to a specific dog;

And I have found some special ratings:

749981277374128128: This is an outlier, due to the commemorative independency day the rating is outstanding high, and;
670842764863651840: I really do not understand why this rating is so high.

These three observations will be removed. Although this is not a good practice, I will insert manually the tweet_id codes to a list.

In [48]:

# Remove those three tweet_id's.
rm_list = ['810984652412424192',
           '749981277374128128',
           '670842764863651840']

# Creating a vector to subset twt_ach_mstr and remove the tweet_id from the rm_list.
indexing = np.logical_not(twt_ach_mstr.tweet_id.isin(rm_list))

# Updating the twt_ach_mstr data frame.
twt_ach_mstr = twt_ach_mstr[indexing]

I know there are some rating that is not an integer, and for this reason, I will use a .for() loop to find all float rating values. The only way to find the float ratings values are using the regex (regular expression). The problem identified in the rating_numerator and rating_denominator is the original regex used to scrap the rating only take account the digits after the point, which lead an error when the rating is 9.75 becomes 75.

Obs. 1: I have had some trouble to write this chunk of code because I can not update a DataFrame view. I read this document to surpass this issue.

Obs. 2: This is quite difficult to understand the regex, but I found this document and this one to guide me.

In [49]:

# Loop to find any non integer rating.
for index in twt_ach_mstr.tweet_id:
    # Extracting the text of each tweet
    twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
    # This regex will find any "float" rating among the words text.
    rating = re.findall('\d+\.\d+\/\d+', twt_txt)
    # If any rating were found, the length of rating will be different of zero
    if len(rating) > 0:
        # rating is a list of one element like this [9.9/10], I want only the first part the 9.9.
        numerator = rating[0].split('/')[0]
        # This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
        twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)

The next observation there is more than one pet, so I decided to calculate the mean.

Now, I am looking for some exceptions, like messages when the pattern XX/XX appears more than once. It occurs when @dog_rates gives two or more ratings in a single tweet (or even talks about a specific day). The error identified in rating_numerador and rating_denominator is the second rating is totally drop. I will do the average between these two ratings as a final rating of the tweet.

I have identified one special case which the expression 50/50 (fifty/fifty) is used, and the regex parses it as a rating.

In [50]:

# Loop to find any non integer rating.
for index in twt_ach_mstr.tweet_id:
    # Extracting the text of each tweet
    twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
    # This regex will find the pattern NN/NN.
    rating = re.findall('\d+\d\/\d+', twt_txt)
    # If any rating is a list of two elements.
    if len(rating) == 2:
        if(rating[0] == '50/50'):
            # rating is a list of two elements like this [50/50, 9.9/10], I want only the first part of the second element.
            numerator = rating[1].split('/')[0]
            denominator = rating[1].split('/')[1]
            # This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
            twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)
            twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = float(denominator)
        else:
            # Average of two ratings.
            numerator = (float(rating[0].split('/')[0]) + float(rating[1].split('/')[0])) * 0.5
            # This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
            twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator']

Finally, in some tweets the user @dog_rates has an aggregate in a single rating many rates, in other words, If the picture has 7 dogs, the rating going to be 84/70, 12/10 in average.

I will correct these exceptions doing a simple "scale".

In [51]:

# Loop to find any non integer rating.
for index in twt_ach_mstr.query('rating_numerator > 15').tweet_id:
    # Extracting the text of each tweet
    twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
    # This regex will find the pattern NN/NN.
    rating = re.findall('\d+\d\/\d+', twt_txt)
    # If any rating were found, the length of rating will be different of zero
    if len(rating) == 1:
        # rating is a list like this [9.9/10], I want only the first part
        numerator = float(rating[0].split('/')[0])
        denominator = float(rating[0].split('/')[1])
        # This is the hard part. I must use the .loc() to update the data alocated in twt_ach_mstr Data Frame.
        twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = numerator/(denominator/10)
        twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = denominator/(denominator/10)

Let's fix the denominator problems.

In [52]:

# Slicing the problematic values from the value_counts. 
denominator_problem = twt_ach_mstr.rating_denominator.value_counts().index[1:].tolist()

# Subsetting the dataframe to extrac the tweet_id with problem. 
tweet_id_denominator_problem = twt_ach_mstr[twt_ach_mstr.rating_denominator.isin(denominator_problem)].tweet_id

for index in tweet_id_denominator_problem:
    # Extracting the text of each tweet
    twt_txt = twt_ach_mstr[twt_ach_mstr.tweet_id == index].text.tolist()[0]
    # This regex will find the pattern NN/NN.
    rating = re.findall('\d+\d\/\d+', twt_txt)
    # Only update if the rating is valid rating NN/NN
    if len(rating) == 1:
        numerator = rating[0].split('/')[0]
        denominator = rating[0].split('/')[1]
        twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_numerator'] = float(numerator)
        twt_ach_mstr.loc[twt_ach_mstr[twt_ach_mstr['tweet_id'] == index].index,'rating_denominator'] = float(denominator)

Test¶

In [53]:

# Printing the values of numerator.
twt_ach_mstr.rating_numerator.value_counts()

Out[53]:

12.00    450
10.00    419
11.00    398
13.00    254
9.00     150
8.00      95
7.00      51
14.00     34
6.00      32
5.00      32
3.00      19
4.00      15
2.00       9
1.00       5
9.75       1
0.00       1
11.26      1
13.50      1
11.27      1
Name: rating_numerator, dtype: int64

In [54]:

# Printing the values of denominator
twt_ach_mstr.rating_denominator.value_counts()

Out[54]:

10.0    1967
2.0        1
Name: rating_denominator, dtype: int64

Unfortunately, it is not possible to remove all problems in the denominator column (there is still one non-standard value). This happened due to the use of fraction (3 1/2). I did not find a good regex to filter this issue.

4.11. Issue ID 5¶

Issue: This is a categorical variable and could be combine into one column.

In this tidiness issue, I am going to combine four columns into a single column.

I have realized in same cases the same dog is classified twice:

8 cases: doggo and pupper;
1 case: doggo and floofer, and;
1 case: doggo and puppo.

Figure 1 shows the doggo, pupper, puppo, and floofer definitions.

Figure 1 - The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the #WeRateDogs book on Amazon)

Obs.: I have used the numpy arrays to "add" vector, and to do so I have read this thread on Stack Overflow.

Define¶

Merge doggo, pupper, puppo, and floofer in one column.

Code¶

In [55]:

# Subset the twt_ach_mstr just to pick the "dogtionary columns".
dog_cols = twt_ach_mstr[['doggo','floofer','pupper','puppo']]

# Replace all "None" values to "almost null".
dog_cols = dog_cols.replace('None', '') 

# Adding the columns to create a new one.
dogtionary = np.array(dog_cols['doggo']) + np.array(dog_cols['floofer']) + np.array(dog_cols['pupper']) + np.array(dog_cols['puppo'])

# Printing
pd.DataFrame(dogtionary, columns = ['dogtionary']).dogtionary.value_counts()

Out[55]:

                1665
pupper           201
doggo             63
puppo             22
doggopupper        8
floofer            7
doggofloofer       1
doggopuppo         1
Name: dogtionary, dtype: int64

Let's convert all double classified dog as multiclass.

In [56]:

# Fixing the dogs with two classifications.
dogtionary[dogtionary == 'doggopupper'] = 'multiclass'
dogtionary[dogtionary == 'doggofloofer'] = 'multiclass'
dogtionary[dogtionary == 'doggopuppo'] = 'multiclass'

# Printing to see the results.
pd.DataFrame(dogtionary, columns = ['dogtionary']).dogtionary.value_counts()

Out[56]:

              1665
pupper         201
doggo           63
puppo           22
multiclass      10
floofer          7
Name: dogtionary, dtype: int64

Finally, let's append this new column called dogtionary to the twt_ach_mstr data frame.

In [57]:

# Reseting the index of twt_ach_mstr.
twt_ach_mstr.reset_index(drop=True, inplace=True)

# Appending dogtionary to twt_ach_mstr.
twt_ach_mstr = pd.concat([twt_ach_mstr, pd.DataFrame(dogtionary, columns = ['dogtionary'])], axis = 1)

Test¶

In [58]:

# Printing the value_counts of the new columns dogtionary.
twt_ach_mstr.dogtionary.value_counts()

Out[58]:

              1665
pupper         201
doggo           63
puppo           22
multiclass      10
floofer          7
Name: dogtionary, dtype: int64

The new column bundle the 4 columns (pupper, puppo, doggo, and floofer) in one.

4.12. Issue ID 7¶

Issue: Convert the timestamp to a properly data type.

The timestamp variable is an object, which means it is a string for pandas. I am going to convert it to a datetime data type.

Unfortunately, the timestamp notation has some additional info (' +0000') unnecessary, which I need to remove.

Obs.: The numpy Datetimes documentation was essentials to understand how to convert it.

Define¶

Convert timestamp to date time variable.

Code¶

In [59]:

# Extracting the timestamp from a separated variable.
timestamp = twt_ach_mstr.timestamp.tolist()

# I want to remove the ' +0000', to do so I will use the split.
timestamp_str = list(map( lambda x : str(x).split(' +0000')[0], timestamp))

# Converting a regular list to a numpy Data Frame.
timestamp_str = pd.DataFrame(timestamp_str, columns = ['timestamp'])

# Converting using the numpy datetime, and assigning to the twt_ach_mstr.
twt_ach_mstr.timestamp = timestamp_str.timestamp.apply(np.datetime64)

Test¶

Let's print the data type to ensure the conversion of the timestamp.

In [60]:

# Printing the data type of each variable.
twt_ach_mstr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1968 entries, 0 to 1967
Data columns (total 31 columns):
tweet_id                      1968 non-null object
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     1968 non-null datetime64[ns]
source                        1968 non-null object
text                          1968 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1968 non-null object
rating_numerator              1968 non-null float64
rating_denominator            1968 non-null float64
name                          1968 non-null object
doggo                         1968 non-null object
floofer                       1968 non-null object
pupper                        1968 non-null object
puppo                         1968 non-null object
jpg_url                       1968 non-null object
img_num                       1968 non-null int64
p1                            1968 non-null object
p1_conf                       1968 non-null float64
p1_dog                        1968 non-null bool
p2                            1968 non-null object
p2_conf                       1968 non-null float64
p2_dog                        1968 non-null bool
p3                            1968 non-null object
p3_conf                       1968 non-null float64
p3_dog                        1968 non-null bool
source_url                    1968 non-null object
text_url                      1968 non-null object
dogtionary                    1968 non-null object
dtypes: bool(3), datetime64[ns](1), float64(9), int64(1), object(17)
memory usage: 305.6+ KB

4.13. Issue ID 11¶

Issue: Dog's breed is not standardized, some is capitalized and other lowercase.

I have written I straightforward function to deal with the non-standardized names.

Converting all names to lower cases;
Converting all spaces into an underscore, and;
Converting all dash to underscore.

Define¶

Standardization of the dogs' breeds names in P1, P2 and P3 columns.

Code¶

In [61]:

def fix_names(list_names):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This functions aims to convert any letter to lower case, also replace space by underscores.      |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE     TYPE         DESCRIPTION                                                            |
    |                                                                                                    |
    |   list_names   data frame   A column of a given data frame which contains non-standardized letters |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE         DESCRIPTION                                                            |
    |                                                                                                    |
    |   list_names   data frame   Column with 'fixed' content.                                           |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Convert any uppercase in lowercase.
    list_names = list(map(lambda x : x.lower(), list_names ))
    # Change spaces in underscores.
    list_names = list(map(lambda x : x.replace(' ','_'), list_names ))
    # Change dash to underscores.
    list_names = list(map(lambda x : x.replace('-','_'), list_names ))
    
    return list_names # Return the "fixed" names.

Let's use this function to fix p1, p2, and p3.

In [62]:

twt_ach_mstr.p1 = fix_names(twt_ach_mstr.p1)
twt_ach_mstr.p2 = fix_names(twt_ach_mstr.p2)
twt_ach_mstr.p3 = fix_names(twt_ach_mstr.p3)

Test¶

Let's take a look at the breed's name to ensure the conversion.

In [63]:

# Subsetting the twt_ach_mstr to show only p1, p2, p3, and tweet_id.
twt_ach_mstr[['tweet_id','p1','p2','p3']].head(10)

Out[63]:

	tweet_id	p1	p2	p3
0	892420643555336193	orange	bagel	banana
1	892177421306343426	chihuahua	pekinese	papillon
2	891815181378084864	chihuahua	malamute	kelpie
3	891689557279858688	paper_towel	labrador_retriever	spatula
4	891327558926688256	basset	english_springer	german_short_haired_pointer
5	891087950875897856	chesapeake_bay_retriever	irish_terrier	indian_elephant
6	890971913173991426	appenzeller	border_collie	ice_lolly
7	890729181411237888	pomeranian	eskimo_dog	pembroke
8	890609185150312448	irish_terrier	irish_setter	chesapeake_bay_retriever
9	890240255349198849	pembroke	cardigan	chihuahua

4.14. Additional Issues¶

Based on the Project's instructions:

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.

For this reasons, I have had to aggregate to Table 7, two more issues, 15 and 16. I also need to remove any non-used columns.

Table 8 - Summary of Issues Identified.

Issue ID	Table	Issue Type	Dimension	Method	Column	Description
15	df_img	Quality	Completeness	Programmatic	"retweet count"	Gather additional info in `tweet_json.txt` file.
16	df_img	Quality	Completeness	Programmatic	"favorite count"	Gather additional info in `tweet_json.txt` file.
17	twt_ach_mstr	Quality	Validity	Programmatic	"many columns"	Remove `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_timestamp`, `retweeted_status_id`, and `retweeted_status_user_id`.

4.15. Issue ID 15¶

Issue: Gather retweet count info in tweet_json.txt file.

As a requirement of the instruction, I have appended a new column called favorite_count.

Define¶

Add to the twitter_archive_master.csv the favorite_count values from the tweet_json.txt

Code¶

In [64]:

# Lesson Learned: DO NOT use chaining indexing.
favorite_count = df_twt_cln.loc[:,('id','favorite_count')]

# Renaming the columns to be apt to merge using tweet_id as key.
favorite_count.columns = ['tweet_id','favorite_count']

# Converting the tweet_if to str.
favorite_count.loc[favorite_count.tweet_id.index,'tweet_id'] = favorite_count.tweet_id.astype(str)

# Merging the subsetted data frame called favorite_count to twt_ach_mstr.
twt_ach_mstr = pd.merge(twt_ach_mstr, favorite_count, on = 'tweet_id', how = 'left')

As you can see, I have configured the .merge() as left join, because I do not want to drop the observations without this favorite_count.

Test¶

In [65]:

# Print the first 5 favorite_count rows. Have in mind, this is a subset with few columns.
twt_ach_mstr.loc[:,('tweet_id','favorite_count')].head()

Out[65]:

	tweet_id	favorite_count
0	892420643555336193	NaN
1	892177421306343426	NaN
2	891815181378084864	24593.0
3	891689557279858688	41433.0
4	891327558926688256	39613.0

Bear in mind, neither all tweet_id has favorite_count.

4.16. Issue ID 16¶

Issue: Gather favorite count info in tweet_json.txt file.

Similarly as done to favorite_count, the retweet_count will require the same sequence of code lines with minor changes.

Define¶

Add to the twitter_archive_master.csv the retweet_count values from the tweet_json.txt

Code¶

In [66]:

#favorite_count = df_twt_cln[['tweet_id','favorite_count']]
retweet_count = df_twt_cln.loc[:,('id','retweet_count')]

# Renaming the columns to be apt to merge using tweet_id as key.
retweet_count.columns = ['tweet_id','retweet_count']

# Converting the tweet_if to str.
retweet_count.tweet_id = retweet_count.tweet_id.astype(str)

# Merging the subsetted data frame called favorite_count to twt_ach_mstr.
twt_ach_mstr = pd.merge(twt_ach_mstr, retweet_count, on = 'tweet_id', how = 'left')

Test¶

I have also configured the .merge() as left join because I do not want to drop the observations without this retweet_count.

In [67]:

# Print the first 5 retweet_count rows. Have in mind, this is a subset with few columns.
twt_ach_mstr.loc[:,('tweet_id','retweet_count')].head()

Out[67]:

	tweet_id	retweet_count
0	892420643555336193	NaN
1	892177421306343426	NaN
2	891815181378084864	4070.0
3	891689557279858688	8471.0
4	891327558926688256	9166.0

Bear in mind, neither all tweet_id has retweet_count.

4.17. Issue ID 17¶

Issue: Remove in_reply_to_status_id, in_reply_to_user_id, retweeted_status_timestamp, retweeted_status_id, and retweeted_status_user_id.

Define¶

Drop the unused variables/columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_timestamp, retweeted_status_id, and retweeted_status_user_id.

Code¶

First, let's take into account the numbers of columns.

In [68]:

# Calculates the number of columns.
len(twt_ach_mstr.columns.tolist())

Out[68]:

What is the columns in twt_ach_mstr?

In [69]:

twt_ach_mstr.columns.tolist()

Out[69]:

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'source_url',
 'text_url',
 'dogtionary',
 'favorite_count',
 'retweet_count']

There more 4 columns from dogtionary, which I can drop as well.

In [70]:

# Columns to be removed.
rm_columns = ['in_reply_to_status_id',
              'in_reply_to_user_id',
              'retweeted_status_timestamp',
              'retweeted_status_id',
              'retweeted_status_user_id',
              'doggo',
              'pupper',
              'puppo',
              'floofer']

# Copy of all column of twt_ach_mstr. 
list_columns = twt_ach_mstr.columns.tolist()

# Loop to remove each non desired column.
for rm in rm_columns:
    # Remove a specific column.
    list_columns.remove(rm)

# Priting the remaining columns.
list_columns

Out[70]:

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'source_url',
 'text_url',
 'dogtionary',
 'favorite_count',
 'retweet_count']

Let's subset the twt_ach_mstr data frame.

In [71]:

# Cleaned twt_ach_mstr.
twt_ach_mstr = twt_ach_mstr[list_columns]

# Printing the first 5 rows.
twt_ach_mstr.head()

Out[71]:

	tweet_id	timestamp	source	text	expanded_urls	rating_numerator	rating_denominator	name	jpg_url	img_num	...	p2_conf	p2_dog	p3	p3_conf	p3_dog	source_url	text_url	favorite_count	retweet_count
0	892420643555336193	2017-08-01 16:23:56	Twitter for iPhone	This is Phineas. He's a mystical boy. Only eve...	https://twitter.com/dog_rates/status/892420643...	13.0	10.0	Phineas	https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg	1	...	0.085851	False	banana	0.076110	False	http://twitter.com/download/iphone	https://t.co/MgUWQ76dJU	NaN	NaN
1	892177421306343426	2017-08-01 00:17:27	Twitter for iPhone	This is Tilly. She's just checking pup on you....	https://twitter.com/dog_rates/status/892177421...	13.0	10.0	Tilly	https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg	1	...	0.090647	True	papillon	0.068957	True	http://twitter.com/download/iphone	https://t.co/0Xxu71qeIV	NaN	NaN
2	891815181378084864	2017-07-31 00:18:03	Twitter for iPhone	This is Archie. He is a rare Norwegian Pouncin...	https://twitter.com/dog_rates/status/891815181...	12.0	10.0	Archie	https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg	1	...	0.078253	True	kelpie	0.031379	True	http://twitter.com/download/iphone	https://t.co/wUnZnhtVJB	24593.0	4070.0
3	891689557279858688	2017-07-30 15:58:51	Twitter for iPhone	This is Darla. She commenced a snooze mid meal...	https://twitter.com/dog_rates/status/891689557...	13.0	10.0	Darla	https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg	1	...	0.168086	True	spatula	0.040836	False	http://twitter.com/download/iphone	https://t.co/tD36da7qLQ	41433.0	8471.0
4	891327558926688256	2017-07-29 16:00:24	Twitter for iPhone	This is Franklin. He would like you to stop ca...	https://twitter.com/dog_rates/status/891327558...	12.0	10.0	Franklin	https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg	2	...	0.225770	True	german_short_haired_pointer	0.175219	True	http://twitter.com/download/iphone	https://t.co/AtUZn91f7f	39613.0	9166.0

5 rows × 24 columns

Test¶

Last look to the .info() to ensure if the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_timestamp, retweeted_status_id, and retweeted_status_user_id have been removed.

In [72]:

# Printing the info.
twt_ach_mstr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1968 entries, 0 to 1967
Data columns (total 24 columns):
tweet_id              1968 non-null object
timestamp             1968 non-null datetime64[ns]
source                1968 non-null object
text                  1968 non-null object
expanded_urls         1968 non-null object
rating_numerator      1968 non-null float64
rating_denominator    1968 non-null float64
name                  1968 non-null object
jpg_url               1968 non-null object
img_num               1968 non-null int64
p1                    1968 non-null object
p1_conf               1968 non-null float64
p1_dog                1968 non-null bool
p2                    1968 non-null object
p2_conf               1968 non-null float64
p2_dog                1968 non-null bool
p3                    1968 non-null object
p3_conf               1968 non-null float64
p3_dog                1968 non-null bool
source_url            1968 non-null object
text_url              1968 non-null object
dogtionary            1968 non-null object
favorite_count        1280 non-null float64
retweet_count         1280 non-null float64
dtypes: bool(3), datetime64[ns](1), float64(7), int64(1), object(12)
memory usage: 251.8+ KB

4.18. Data Storing¶

After identifying 17 issues and solve them, the final data frame (twt_ach_mstr) has one particularity:

Presence of NaN values in favorite_count and retweet_count columns.

This particularity will not affect the further studies, because it is not related to the quality of the data.

In [73]:

# Saving the `twt_ach_mstr` as twitter_archive_master.csv.
twt_ach_mstr.to_csv('01-Dataset/twitter_archive_master.csv', index=False, encoding = 'utf-8')

Finally, the Data Cleaning ends here.

5. Data Analysis and Visualization ¶

Let's load the twitter_archive_master.csv, this time I will name the loaded data frame as df_viz.

In [74]:

# Loading the tweet_json.txt as a panda DataFrame.
df_viz = pd.read_csv('01-Dataset/twitter_archive_master.csv')

Adjustments¶

Although, I have already converted the timestamp to datetime in Issue ID 7 when I export (to make a backup) and load (the exported file) the timestamp column will be coerced to a string, which obliges me to convert (again).

In [75]:

# Converting using the numpy datetime, and assigning to the twt_ach_mstr.
df_viz.timestamp = df_viz.timestamp.apply(np.datetime64)

Graphic's General Configuration¶

In order to provide a default configuration to all graphics, I will define a chunk to wrap this configuration.

In [76]:

# General configuration for all plots.
font = 14          # X and Y axis fontsize.
font_title = 18    # Graphic's Title fontsize.
transparency = 0.5 # Transparency in cases of scatter plot.
width = 0.7        # Bar width

Exploratory Data Analysis¶

In the next section, I have tried to perform an Exploratory Data Analysis (EDA). I will pose some silly questions to guide me to walk through the project requirements.

5.1. Algorithm Comparison¶

Let's compare the output of each algorithm, plotting a bar graph to analysis visually. To do so, I will write a function to bundle some code lines and to avoid repetition.

In [77]:

# Function to plot a barchart.
def gp_pop_dog(df_popular, gp_title, xlabel, ylabel,font = font, font_title = font_title):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be   |
    |   used.                                                                                            |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_popular data frame   The entire data frame.                                                   |
    |                                                                                                    |
    |   gp_title   str          Graphic title.                                                           |
    |                                                                                                    |
    |   xlabel     str          X axis label.                                                            |
    |                                                                                                    |
    |   ylabel     str          Y axis label.                                                            |
    |                                                                                                    |
    |   font       str          Axis labels font size.                                                   |
    |                                                                                                    |
    |   font_title str          Graphic Title font size.                                                 |    
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE   DESCRIPTION                                                                  |
    |                                                                                                    |
    |   graphic      -      Returns a matplotlib graphic.                                                |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Plotting only the breed's dog with more than 20 occurencies.
    df_popular[df_popular > 20].plot(kind = 'barh',      # Set the kind of graphic.
                                 figsize = [14, 6]); # Graphic size.
    plt.title(label = gp_title,         # Graphic Title.
              fontsize = font_title);   # Graphic title fontsize.
    plt.xlabel(xlabel = xlabel,         # X axis label
               fontsize = font);        # X axis fontsize
    plt.ylabel(ylabel = ylabel,         # Y axis label
               fontsize = font);        # Y axis fontsize
    plt.xlim(0, 140)                    # X axis limit - set default as 140.

Premised on the three (3) algorithms, I will plot three graphics and compare the results.

P1 Algorithm.¶

In [78]:

# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p1_dog].p1.value_counts()

# Plotting the graphic.
gp_pop_dog(dog_pref,
           gp_title = 'Graphic 1 - Algorithm 1 - Dog\'s Breed with more than 20 appearances.',
           xlabel = 'Occurrences',
           ylabel = 'Dog\'s Breed')

P2 algorithm¶

In [79]:

# Dog Preference - Algorithm P2

# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p2_dog].p2.value_counts()

# Plotting the graphic.
gp_pop_dog(dog_pref,
           gp_title = 'Graphic 2 - Algorithm 2 - Dog\'s Breed with more than 20 appearances.',
           xlabel = 'Occurrences',
           ylabel = 'Dog\'s Breed')

P3 Algorithm¶

In [80]:

# Dog Preference - Algorithm P3

# Value counts to build a histogram.
dog_pref = df_viz[df_viz.p3_dog].p3.value_counts()

# Plotting the graphic.
gp_pop_dog(dog_pref,
           gp_title = 'Graphic 3 - Algorithm 3 - Dog\'s Breed with more than 20 appearances.',
           xlabel = 'Occurrences',
           ylabel = 'Dog\'s Breed')

Based on the graphics, the P1 algorithm has fewer breeds and high concentrated bars in some breeds, in the opposite way the P3 algorithm has much more breeds and the dogs are spread in more breeds, in other words, less concentrated bars.

In [81]:

# Calculating the percentages by algorithms.
p1_perc = round(100 * df_viz[df_viz.p1_dog].shape[0]/df_viz.shape[0],2)
p2_perc = round(100 * df_viz[df_viz.p2_dog].shape[0]/df_viz.shape[0],2)
p3_perc = round(100 * df_viz[df_viz.p3_dog].shape[0]/df_viz.shape[0],2)

# Printing a summary.
print("P1: {}%\nP2: {}%\nP3: {}%".format(p1_perc,p2_perc,p3_perc))

P1: 74.29%
P2: 75.15%
P3: 72.66%

All three algorithm has almost the same performance (percentage of picture classified as a dog):

P1: 1462/1968 = 74.29%
P2: 1479/1968 = 75.15%
P3: 1420/1968 = 72.66%

The third one has a slightly less performance.

In [82]:

# Calculating the number of breeds in each algorithm with more than 20 occurencies.
p1_qty = sum(df_viz[df_viz.p1_dog].p1.value_counts() > 20)
p2_qty = sum(df_viz[df_viz.p2_dog].p2.value_counts() > 20)
p3_qty = sum(df_viz[df_viz.p3_dog].p3.value_counts() > 20)

# Printing a summary of occurencies.
print("P1: {} breeds.\nP2: {} breeds.\nP3: {} breeds.".format(p1_qty,p2_qty,p3_qty))

P1: 14 breeds.
P2: 18 breeds.
P3: 21 breeds.

I have found a difference in the number of breeds with more than 20 occurrences.

P1: 14 breeds
P2: 18 breeds
P3: 21 breeds

The third algorithm has 33.33% more breeds than the first algorithm.

In [83]:

# Calculating the number of dogs classified using the threshold of 20.
dog_qty_20_p1 = sum(df_viz[df_viz.p1_dog].p1.value_counts()[df_viz[df_viz.p1_dog].p1.value_counts() > 20])
dog_qty_20_p2 = sum(df_viz[df_viz.p2_dog].p2.value_counts()[df_viz[df_viz.p2_dog].p2.value_counts() > 20])
dog_qty_20_p3 = sum(df_viz[df_viz.p3_dog].p3.value_counts()[df_viz[df_viz.p3_dog].p3.value_counts() > 20])

# Printing a summary of occurencies.
print("P1: {} dogs.\nP2: {} dogs.\nP3: {} dogs.".format(dog_qty_20_p1,dog_qty_20_p2,dog_qty_20_p3))

P1: 732 dogs.
P2: 713 dogs.
P3: 671 dogs.

Conclusion: The algorithm 1 is more skewness than the other two, and tend to concentrate the classification in a few breeds. On the other hand, the third algorithm is more generalist, spreading the dogs in a more variety of breeds.

5.2. Tweets Correlation: Favorite and Retweet¶

I want to see if there are a correlation between Favorite and Retweet. I am going to plot a simple scatter plot.

In [84]:

plt.figure(figsize= [14,6])                               # Graphic size
plt.scatter(x = df_viz['retweet_count'],                  # X varibale
            y = df_viz['favorite_count'],                 # Y variable
            alpha = transparency)                         # Transparency
plt.xlabel('Number of Retweets', fontsize = font)         # X label
plt.ylabel('Number of Favorite', fontsize = font)         # Y label
plt.title('Graphic 4 - Correlation between retweets and favorite tweets.',
          fontsize = font_title)     # Graphic Title
plt.show()                                                # Plot the graphic

Graphically, it is possible to see a positive trend (correlation) between both variables. Thus, let's calculate a naive rate for each retweet how many favourites it will generate (in average).

In [85]:

# Calculating the rate for each retweet how many favorite will produce in average.
fav_ret_rate = df_viz.favorite_count/df_viz.retweet_count

# Calculating of mean, standard deviation, etc.
fav_ret_rate[np.logical_not(df_viz.favorite_count.isnull())].describe()

Out[85]:

count    1280.000000
mean        3.384297
std         1.150137
min         1.107146
25%         2.580705
50%         3.219539
75%         3.955537
max        11.765060
dtype: float64

Conclusion: Seems there is a very positive correlation when the number of favourites raises the numbers of retweets also raises. Using a straightforward analysis for each new retweet the favourites will increase 3 (in average).

5.3. Doggo, Puppo, Pupper or Floofer¶

Which one is more adorable?

I will calculate the mean and favourite per retweet rate to compare the dogtionary terms.

In [86]:

# Copying the original data frame for a pontual analysis.
df_dogtionary = df_viz.copy()

# Filling the NaN values with None.
df_dogtionary.dogtionary = df_dogtionary.dogtionary.fillna('None')

# Mean calculation.
df_dog_mean = df_dogtionary.groupby(['dogtionary'])['favorite_count','retweet_count'].mean()

# Rate calculation.
df_dog_mean = df_dog_mean.join(pd.DataFrame(df_dog_mean.favorite_count/df_dog_mean.retweet_count, columns = ['rate']))

# Count calculation.
df_dog_mean_count = df_dog_mean.join(df_dogtionary.dogtionary.value_counts())

# Copying the column's name.
col_names = df_dog_mean_count.columns.tolist()

# Editing the columns names.
col_names[3] = 'count'

# Assigning a proper name.
df_dog_mean_count.columns = col_names

# Printing the summary.
df_dog_mean_count

Out[86]:

	favorite_count	retweet_count	rate	count
dogtionary
None	8062.129870	2393.982375	3.367665	1665
doggo	14410.558824	4851.058824	2.970601	63
floofer	5643.000000	2172.000000	2.598066	7
multiclass	19020.250000	6341.375000	2.999389	10
pupper	6529.219858	2122.219858	3.076599	201
puppo	14684.933333	4292.266667	3.421254	22

Conclusion: Based on the summary above, in average, dogs without a dogtionary classification tend to perform better rates of favourite per retweet. Have in mind, the few observations of floofer and puppo.

Still analysing the dogtionary terms, but in a visual manner, I will use Graphic 4 as the base for Graphic 5.

In [102]:

# Copying the original data frame for a pontual analysis.
df_dog_viz = df_viz.copy()

# Filling the NaN values with None.
df_dog_viz.dogtionary = df_dog_viz.dogtionary.fillna('None')

# List of term in dogtionary.
dog_terms = ['None', 'pupper', 'doggo', 'puppo', 'floofer', 'multiclass'] # The order is ascending.

# Graphic Size.
plt.figure(figsize = [14, 10])                               

# Loop to construct the graphic "layers".
for index in dog_terms:
    # Provisory data frame to subset the specific "index" dogtionary term.
    prov = df_dog_viz[df_dog_viz['dogtionary'] == index][['favorite_count','retweet_count']]
    # Plotting
    plt.scatter(x = prov['favorite_count'],  # X axis data.
                y = prov['retweet_count'] ,  # Y axis data.
                alpha = transparency)        # Points transparency.
plt.xlabel('Number of Retweets',     # X label.
           fontsize = font)          # X axis fontsize.
plt.ylabel('Number of Favorite',     # Y label.
           fontsize = font)          # Y axis fontsize.
plt.title('Graphic 5 - Correlation between retweets and favorite tweets by Dogtionary terms.', # Graphic title.
          fontsize = font_title)     # Graphic Title.
plt.legend(dog_terms)                # Add legend.
plt.show()                           # Plot the graphic.

Conclusion: It is not possible to identify (visually) any pattern or relationship, using the dogtionary variable, although there is a strong and positive correlation between Retweets and Favorite.

5.4. Tweet's Profile over the time¶

How is the behaviour of tweets along the week? Do they have the same average?

Founded on the rating_numerator and timestamp, I will plot how the average rating will vary over the time.

In [88]:

# Copying the data frame to a pontual study.
df_time = df_viz.copy()

# Subsetting.
df_time = df_time[['timestamp','rating_numerator']]

# Converting full date to weekdays. 1 = Monday, 2 = Tuesday, ... , 7 = Sunday
df_time.timestamp = list(map(lambda x : x.isoweekday(), df_time.timestamp))

# Calculating the mean of ratings.
df_time_mean = df_time.groupby(['timestamp']).mean()

# Adding a new columns: number of occurencies.
df_time_mean = df_time_mean.join(pd.DataFrame(df_time.timestamp.value_counts()))

# Setting strings as index (preparing to the plot).
df_time_mean.index = ['Monday','Tuesday','Wednesday','Thurday','Friday','Saturday','Sunday']

# Editing the variable name.
df_time_mean.columns = ['Avg Rating','Tweets Number']

# Summary of Average Rating over the weekdays and grouped by the tweets's number.
df_time_mean

Out[88]:

	Avg Rating	Tweets Number
Monday	10.342593	324
Tuesday	10.602302	291
Wednesday	10.453925	293
Thurday	10.706250	280
Friday	10.692688	279
Saturday	10.586694	248
Sunday	10.454545	253

Analysing the above chunk output, it is not clear the behaviour. For this reason, I have written a function to encapsulate a couple of code lines. This will save code lines in the future.

In [89]:

# Function to plot a barplot and line.
def gp_bar_line(df_x, df_y1, df_y2, gp_title, xlabel, ylabel, font = font, font_title = font_title):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be   |
    |   used.                                                                                            |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_x       data frame   The entire data frame.                                                   |
    |                                                                                                    |
    |   df_y1      data frame   The variable to be displayed as bar.                                     |
    |                                                                                                    |
    |   df_y2      data frame   The variable to be displayed as line.                                    |
    |                                                                                                    |
    |   gp_title   str          Graphic title.                                                           |
    |                                                                                                    |
    |   xlabel     str          X axis label.                                                            |
    |                                                                                                    |
    |   ylabel     str          Y axis label.                                                            |
    |                                                                                                    |
    |   font       str          Axis labels font size.                                                   |
    |                                                                                                    |
    |   font_title str          Graphic Title font size.                                                 |    
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE   DESCRIPTION                                                                  |
    |                                                                                                    |
    |   graphic      -      Returns a matplotlib graphic.                                                |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Graphic Settings.
    fig, ax1 = plt.subplots(figsize = [14,6])

    # First Element: Blue bars.
    color = 'tab:blue'
    ax1.set_xlabel(xlabel, fontsize = font)
    ax1.set_ylabel('Avg Rating', color = color, fontsize = font)
    ax1.bar(df_x.index, df_y1, color = color)
    ax1.tick_params(axis = 'y', labelcolor = color)

    # Second Element: Red line.
    ax2 = ax1.twinx()  # share axis with ax1 element.
    color = 'indianred'
    ax2.set_ylabel(ylabel, color = color, fontsize = font)  # we already handled the x-label with ax1
    ax2.plot(df_x.index, df_y2, color = color, linewidth = 5)
    ax2.tick_params(axis = 'y', labelcolor = color)

    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.title(gp_title,                # Graphic title.
              fontsize = font_title)   # Graphic font title.
    return plt.show()

After defining the function, I will apply this function in the Average Rating over the weekdays.

In [90]:

# Plotting Average Rating over the weekdays.
gp_bar_line(df_x = df_time_mean,
            df_y1 = df_time_mean['Avg Rating'],
            df_y2 = df_time_mean['Tweets Number'],
            gp_title = 'Graphic 6 - Average Rating and Number of Tweets per Weekday.',
            xlabel = 'Weekday',
            ylabel = 'Number of Tweets')

Conclusion: Although the number of tweets decreases along the weekend, the average stay (almost) steadily. In average, Monday is the weekday with highest tweets average.

In [91]:

# Copying the data frame to a pontual study.
df_month = df_viz.copy()

# Subsetting
df_month = df_month[['timestamp','rating_numerator']]

# Converting full date to month.
df_month.timestamp = list(map(lambda x : x.month, df_month.timestamp))

# Calculating the mean of ratings.
df_month_mean = df_month.groupby(['timestamp']).mean()

# Adding a columns with value counts.
df_month_mean = df_month_mean.join(pd.DataFrame(df_month.timestamp.value_counts()))

# Renaming the index.
df_month_mean.index = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Editing the variable name.
df_month_mean.columns = ['Avg Rating','Tweets Number']

# Printing the summary.
df_month_mean

Out[91]:

	Avg Rating	Tweets Number
Jan	10.711207	232
Feb	10.889535	172
Mar	10.939394	165
Apr	11.287234	94
May	11.683673	98
Jun	11.073171	123
Jul	11.533088	136
Aug	11.000000	61
Sep	11.194762	63
Oct	11.673077	65
Nov	9.355491	346
Dec	9.877143	413

In this case, it is much clear to see a behaviour when the number of tweet increases the average rating goes down.

Let's print a graphic using the function gp_bar_line.

In [92]:

# Plotting Average Rating over the weekdays.
gp_bar_line(df_x = df_month_mean,
            df_y1 = df_month_mean['Avg Rating'],
            df_y2 = df_month_mean['Tweets Number'],
            gp_title = 'Graphic 7 - Average Rating and Number of Tweets per Month.',
            xlabel = 'Month',
            ylabel = 'Number of Tweets')

Conclusion: Although December and November have the highest number of tweets, the average rating goes in the opposite way. These two months have the lowest number of average ratings. There is strong evidence of seasonality.

5.5. Dog Breeds Appeal¶

What is the Breed with more reaction?

Due to the use of an algorithm to determine (automatically) the dog's breed, It is necessary to subset the df_viz, based on the p1_dog, for doing so, I have written the df_gp_triple function to encapsulate this lines codes.

In [93]:

# This functions is to wrap code lines. The result of this function is the input of gp_triple function.
def df_gp_triple(df_viz, algorithm, threshold = 20, sorting = False):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be   |
    |   used.                                                                                            |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_viz     data frame   The entire data frame (imported from twitter_archive_master.csv).        |
    |                                                                                                    |
    |   algorithm  str          Could be p1, p2 or p3.                                                   |
    |                                                                                                    |
    |   threshold  int          Breed's number to be displayed in the graphic.                           |
    |                                                                                                    |
    |   sorting    bool         True: ascending, False: descending.                                      |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_alg     data frame   Data frame with 'threshold' rows and 3 columns (favorite_count,          |
    |                           retweet_count, and rate). It is indexed by dog's breed.                  |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # According to the algorithm I will subset the df_viz using differents rows.
    if algorithm == 'p1':
        df_alg = df_viz[['rating_numerator','p1','p1_dog','retweet_count','favorite_count']]
        df_alg = df_alg[df_alg.p1_dog]
        df_alg = df_alg.groupby(['p1']).sum()
    elif algorithm == 'p2':
        df_alg = df_viz[['rating_numerator','p2','p2_dog','retweet_count','favorite_count']]
        df_alg = df_alg[df_alg.p2_dog]
        df_alg = df_alg.groupby(['p2']).sum()
    else:
        df_alg = df_viz[['rating_numerator','p3','p3_dog','retweet_count','favorite_count']]
        df_alg = df_alg[df_alg.p3_dog]
        df_alg = df_alg.groupby(['p3']).sum()

    # Removing the 'p#', 'p#_dog', and 'rating_numerator' columns.
    df_alg = df_alg[['favorite_count','retweet_count']]

    # Adding a new columns with the rate (favorite divided by retweet)
    df_alg = df_alg.join(pd.DataFrame(df_alg['favorite_count']/df_alg['retweet_count']))

    # Renaming columns.
    df_alg.columns = ['favorite_count','retweet_count','rate']

    # Sorting by favorite_count.
    df_alg = df_alg.sort_values(by = 'rate', ascending = sorting)[:threshold]
    
    # return df_alg.
    return df_alg

Let's have a try calculating the subsetted data frame to algorithm p1. The expected output is a table containing all dog's breed with rate and summation of favourite and retweet.

In [94]:

# Subsetting the df_viz using the function df_gp_triple.
df_p1 = df_gp_triple(df_viz,
                     algorithm = 'p1')

# Printing the first 5 rows.
df_p1.head()

Out[94]:

	favorite_count	retweet_count	rate
p1
german_short_haired_pointer	27523.0	4455.0	6.178002
basset	122353.0	23171.0	5.280437
irish_terrier	34710.0	6583.0	5.272672
rhodesian_ridgeback	12956.0	2479.0	5.226301
bloodhound	25891.0	5080.0	5.096654

Now, I want to plot this table in a graphic with two bars and one line.

In [95]:

# Function to plot the df_gp_triple data frame.
def gp_triple(df_x, df_y1, df_y2, df_y3, algorithm, gp_title, xlabel, ylabel,
              font = font, font_title = font_title, width = width):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be   |
    |   used.                                                                                            |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_x       data frame   The entire data frame (imported from twitter_archive_master.csv).        |
    |                                                                                                    |
    |   df_y1      data frame   The variable to be displayed as blue bar 1.                              |
    |                                                                                                    |
    |   df_y2      data frame   The variable to be displayed as orange bar 2.                            |
    |                                                                                                    |
    |   df_y3      data frame   The variable to be displayed as line.                                    |
    |                                                                                                    |
    |   algorithm  str          Could be 'p1', 'p2' or 'p3'.                                             |
    |                                                                                                    |
    |   gp_title   str          Graphic title.                                                           |
    |                                                                                                    |
    |   xlabel     str          X axis label.                                                            |
    |                                                                                                    |
    |   ylabel     str          Y axis label.                                                            |
    |                                                                                                    |
    |   font       str          Axis labels font size.                                                   |
    |                                                                                                    |
    |   font_title str          Graphic Title font size.                                                 |    
    |                                                                                                    |
    |   width      str          Barchart width.                                                          |    
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE   DESCRIPTION                                                                  |
    |                                                                                                    |
    |   graphic      -      Returns a matplotlib graphic.                                                |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Defining the subplot and Graphic size.
    fig, ax1 = plt.subplots(figsize = [14,8])

    # First element: Blue Barchart
    color = 'tab:blue'
    ax1.set_xlabel('Dog\'s Breed', fontsize = font)
    ax1.set_ylabel('Number of Retweets\nNumber of Favorite', color=color, fontsize = font)
    ax1.bar(df_x.index, df_y1, width, color=color)
    ax1.tick_params(axis='y', labelcolor=color)
    ax1.tick_params(axis='x', rotation = 90)

    # Second element: Orange Barchart
    ax3 = ax1.twiny() 
    color = 'tab:orange'
    ax3.bar(df_x.index, df_y2, width, color=color)
    ax3.tick_params(axis='x', labelcolor=color, top = False)
    ax3.set_xticklabels(labels = [], axis='x', labelcolor = color, top = False)

    # Third element: Red Line
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    color = 'indianred'
    ax2.set_ylabel('Favorite/Retweet', color=color, fontsize = font)  # we already handled the x-label with ax1
    ax2.plot(df_x.index, df_y3, color=color, linewidth = 3)
    ax2.tick_params(axis='y', labelcolor=color)

    # Avoid right y-label to be slightly clipped
    fig.tight_layout() 
    
    # According to the algorithm the average is different because the subsetting differents rows. 
    if algorithm == 'p1':
        df_alg = df_viz[df_viz.p1_dog]
    elif algorithm == 'p2':
        df_alg = df_viz[df_viz.p2_dog]
    else:
        df_alg = df_viz[df_viz.p3_dog]
    
    # Subsetting: removing any NaN or Null rows.
    df_alg_fav = df_alg[np.logical_not(df_alg.retweet_count.isnull())].favorite_count
    df_alg_ret = df_alg[np.logical_not(df_alg.retweet_count.isnull())].retweet_count
    
    # Calculating the rate = favorites/tweets, this is a vector.
    df_alg_rate = df_alg_fav/df_alg_ret
    
    # Calculating the mean, this is a single value.
    df_alg_rat_fav_ret = df_alg_rate.mean()
    
    # Plotting a dotted green line as average.
    plt.axhline(y = df_alg_rat_fav_ret, color='green', linestyle='--')

    # Defining the Graphic Title.
    plt.title(gp_title, fontsize = font_title)
    
    # Plot.
    return plt.show()

The result of this function will be shown below.

In [96]:

# Plotting a graphic using the output of df_gp_triple as input of gp_triple.
gp_triple(df_x = df_p1,
          df_y1 = df_p1['favorite_count'],
          df_y2 = df_p1['retweet_count'],
          df_y3 = df_p1['rate'],
          algorithm = 'p2',
          gp_title = 'Graphic 8 - Algortihm 1 - Favorite/Retweets rate for each Dog\'s Breed.',
          xlabel = 'Dog\'s Breed',
          ylabel = 'Number of Retweets\nNumber of Favorite')

The blue bar means the number of favourites, the orange bar the number of retweets, dotted green line the average of Favorite/Retweets for all dog breeds, and the red line the value of the Favorite/Retweet for each dog breed.

Going a step further, I will bundle these two functions into one.

In [97]:

# This function wrap the df_gp_triple and gp_triple functions.
def wrapper_triple(algorithm, gp_title, xlabel, ylabel,
                   df_viz = df_viz, font = font, font_title = font_title, threshold = 20, sorting = False):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function aims to plot a graphic using barchart and lines. Both side of the Y axis will be   |
    |   used.                                                                                            |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_viz     data frame   The entire data frame (imported from twitter_archive_master.csv).        |
    |                                                                                                    |
    |   df_x       data frame   The entire data frame (imported from twitter_archive_master.csv).        |
    |                                                                                                    |
    |   df_y1      data frame   The variable to be displayed as blue bar 1.                              |
    |                                                                                                    |
    |   df_y2      data frame   The variable to be displayed as orange bar 2.                            |
    |                                                                                                    |
    |   df_y3      data frame   The variable to be displayed as line.                                    |
    |                                                                                                    |
    |   algorithm  str          Could be 'p1', 'p2' or 'p3'.                                             |
    |                                                                                                    |
    |   gp_title   str          Graphic title.                                                           |
    |                                                                                                    |
    |   xlabel     str          X axis label.                                                            |
    |                                                                                                    |
    |   ylabel     str          Y axis label.                                                            |
    |                                                                                                    |
    |   font       str          Axis labels font size.                                                   |
    |                                                                                                    |
    |   font_title str          Graphic Title font size.                                                 |
    |                                                                                                    |
    |   threshold  int          Breed's number to be displayed in the graphic.                           |
    |                                                                                                    |
    |   sorting    bool         True: ascending, False: descending.                                      |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE   DESCRIPTION                                                                  |
    |                                                                                                    |
    |   graphic      -      Returns a matplotlib graphic.                                                |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |DEPENDENCIES:                                                                                       |
    |                                                                                                    |
    |   VARIABLE         TYPE        DESCRIPTION                                                         |
    |                                                                                                    |
    |   df_gp_triple     function    Generates a Data Frame which is used as input of gp_triple.         |
    |                                                                                                    |
    |   gp_triple        function    Create the graphic using the output of df_gp_triple as input.       |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Subset the df_viz according to the algorthm.
    df_alg = df_gp_triple(df_viz,
                          algorithm = algorithm,
                          threshold = threshold,
                          sorting = sorting)
    
    # Plot a graphic according to the df_alg.
    gp_triple(df_x = df_alg,
          df_y1 = df_alg['favorite_count'],
          df_y2 = df_alg['retweet_count'],
          df_y3 = df_alg['rate'],
          algorithm = algorithm,
          gp_title = gp_title,
          xlabel = xlabel,
          ylabel = ylabel)

Let's test this wrapper on P2 algorithm.

In [98]:

wrapper_triple(algorithm = 'p2',
               gp_title = 'Graphic 9 - Algortihm 2 - Favorite/Retweets rate for each Dog\'s Breed.',
               xlabel = 'Dog\'s Breed',
               ylabel = 'Number of Retweets\nNumber of Favorite')

Finally, I also will plot the same graphic to the P3 algorithm.

In [99]:

wrapper_triple(algorithm = 'p3',
               gp_title = 'Graphic 10 - Algortihm 3 - Favorite/Retweets rate for each Dog\'s Breed.',
               xlabel = 'Dog\'s Breed',
               ylabel = 'Number of Retweets\nNumber of Favorite')

Comparing the three graphics I have realized there is no dog breed which appears in the three graphics (8, 9, and 10), which I can interpret it as a very distinct algorithms calibration.

Something important to point out is to know the procedency of the algorithm and the possible breeds for each algorithm. I assume the list of breeds used to all algorithm is the same, allowing me the comparison.

Conclusion: Unfortunately, it is not possible to determinate which dog breed has the highest impact in respect to the number of retweets and favourite. because for each algorithm a different breed has performed the highest impact. There is no consensus.

5.6. Algorithm Correlation¶

In accordance with the Dog Breeds Appeal, let's investigate the output of the algorithm in respect to the Favorite/Retweet rate.

Let's increase the number of breeds analysed by the algorithm to 40 because 20 there are no breeds which appear in the three algorithms.

In [100]:

# Defininf a function to calculate the rate dataframe.
def df_rate(threshold,sorting = False, df_viz = df_viz):
    """
    +----------------------------------------------------------------------------------------------------+
    |DESCRIPTION:                                                                                        |
    |                                                                                                    |
    |   This function creates a data frame with rate (favorit/retweet) of all algorthms, each row is a   |
    |   dog's breed.                                                                                     |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |INPUTS:                                                                                             |
    |                                                                                                    |
    |   VARIABLE   TYPE         DESCRIPTION                                                              |
    |                                                                                                    |
    |   df_viz     data frame   The entire data frame (imported from twitter_archive_master.csv).        |
    |                                                                                                    |
    |   threshold  int          Breed's number to be displayed in the graphic.                           |
    |                                                                                                    |
    |   sorting    bool         True: ascending, False: descending.                                      |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |OUTUPUTS:                                                                                           |
    |                                                                                                    |
    |   VARIABLE     TYPE          DESCRIPTION                                                           |
    |                                                                                                    |
    |   df_thr       data frame    Returns a data frame with rate calculated of all algorithms.          |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    |DEPENDENCIES:                                                                                       |
    |                                                                                                    |
    |   VARIABLE         TYPE        DESCRIPTION                                                         |
    |                                                                                                    |
    |   df_gp_triple     function    Generates a Data Frame which is used as input of gp_triple.         |
    |                                                                                                    |
    +----------------------------------------------------------------------------------------------------+
    """
    # Calculating the new Data Frame for each algorithm.
    df_p1_thr = df_gp_triple(df_viz, 'p1', threshold = threshold, sorting = sorting)
    df_p2_thr = df_gp_triple(df_viz, 'p2', threshold = threshold, sorting = sorting)
    df_p3_thr = df_gp_triple(df_viz, 'p3', threshold = threshold, sorting = sorting)

    # Creating df_thr Data Frame to merge df_p1_thr and df_p2_thr.
    df_thr = pd.merge(df_p1_thr, df_p2_thr, left_index=True, right_index=True, how='outer')

    # Renaming columns.
    df_thr.columns = ['favorite_count_p1','retweet_count_p1','rate_p1',
                      'favorite_count_p2','retweet_count_p2','rate_p2']

    # Updating df_thr Data Frame to merging with df_p3_thr.
    df_thr = pd.merge(df_thr, df_p3_thr, left_index=True, right_index=True, how='outer')

    # Renaming columns.
    df_thr.columns = ['favorite_count_p1','retweet_count_p1','rate_p1',
                      'favorite_count_p2','retweet_count_p2','rate_p2',
                      'favorite_count_p3','retweet_count_p3','rate_p3']

    # Data Cleaning.
    # Removing NaN values.
    df_thr = df_thr[np.logical_not(df_thr.favorite_count_p1.isnull())]
    df_thr = df_thr[np.logical_not(df_thr.favorite_count_p2.isnull())]
    df_thr = df_thr[np.logical_not(df_thr.favorite_count_p3.isnull())]

    # Subsetting to gather only rates
    df_thr = df_thr[['rate_p1', 'rate_p2', 'rate_p3']].sort_values(by = ['rate_p3', 'rate_p2', 'rate_p1'], ascending = False)
    
    return df_thr # Return the df_thr data frame.


# Calculating the rates to threshold equal to 40.
df_rate(threshold = 40)

Out[100]:

	rate_p1	rate_p2	rate_p3
newfoundland	3.700781	5.211597	5.745550
yorkshire_terrier	4.204334	3.784307	3.932011
kuvasz	3.675678	4.255575	3.885262
staffordshire_bullterrier	4.082564	4.182275	3.841623

Only increasing the threshold (initially defined as 20) to 40 I have found four (4) dog breed.

New found land;
Yorkshire Terries;
Kuvasz, and;
Staffordshire Bullterrier.

Based on these results I want to know the correlation between the results of each algorithm in respect to the rate (favorite/retweet).

In [101]:

# Setting: Two graphics side-by-side. Size: [14,8]
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(14, 8))

# Compute the correlation matrix
corr = df_rate(threshold = 40).corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot = True,
            square=True, linewidths=.5, ax = ax1,cbar_kws={"shrink": .5})

# Graphic Title to threshold 40
ax1.set_title('Graphic 11a - Correlation Map - Threshold 40', fontsize = font)

# Compute the correlation matrix
corr = df_rate(threshold = 120).corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot = True,
            square=True, linewidths=.5, ax = ax2,cbar_kws={"shrink": .5})

# Graphic Title to threshold 100
ax2.set_title('Graphic 11b - Correlation Map - All Breeds', fontsize = font);

Conclusion: There is no correlation between the results of the three algorithms. An analysis oblique using only the 40 breeds with the highest rate could lead us an erroneous conclusion. The Correlation Map using all breeds gave a good measure of (un)similarities between these algorithms.

This interpretation could be biased due to the lack of information about the algorithms.

All algorithm has the same list of breeds?

5.7. Further Analysis¶

There are many other questions to be answered:

What lead some Breed has better performance than other?

Subset the df_viz taking only the breed with a rate above of the mean.

When the babies arrive? Is there any correlation between dogtionary term over the year?

Identify the seasons of dog's reproducibility.

6. Conclusions ¶

This project aims to perform the Data Wrangling and the Exploratory Data Analysis in the WeRateDogs™ Twitter account.

The Data Gathering process englobed three different tasks, the first one download file from URL and later loading to the Jupyter Notebook, which requires a manual step, the second downloading a file programmatically, and the third gathering data from the Twitter API.

Based on the data gathered, I have assessed the most evident issues (17 issues in total) and documented it to create a record of modifications. Later, in Data Cleaning process I have fixed all identified issues to complete, and I have also merged separated data frame into one and added some missing values. The final data frame was stored as twitter_archive_master.csv.

In the Data Analysis and Visualization, which I have interpreted as Exploratory Analysis, I have posed few questions to guide my analysis. I have found strong evidence of:

Seasonality in the number of tweets along the week and along the year;
A positive correlation between the number of retweets and the number of favourites, and;
No correlation between the algorithms output used to predict the dog breed.

Additional Information¶

The Project also has other deliverables, which could be accessed by the following links:

Act Report, and;
Wrangle Report.

7. References ¶

I have consulted this websites to perform this project. Have in mind, the list is not in appearance order.

ND111 - Data Science Foundations II, Udacity, Nano degree, website;
Roesslein, J.; tweepy, Version 3.7.0, website;
Wickham, H.; Tidy Data, The Journal of Statistical Software, article;
Twitter API, Documentation, website;
Merging Tables, Pandas Package, website;
logical_not, Numpy Package, Stack Overflow, thread;
Reason to use zip code as string, Stack Overflow, thread;
Data frames Concatenation, Pandas Package, Stack Overflow, thread;
.isin(), Stack Overflow, thread;
Indexing View vs Copy, Pandas Documentation, website;
Regular Expression in Python, Guru99, website;
Concise Vector Adding, Stack Overflow, thread;
How to fill NaN, Pandas Documentation, website;
Get weekday in Timestamp, Pandas Documentation, website;
Correlation Map, Seaborn Documentation, website;
SettingWithCopyWarning, Pandas Documentation, website.

Appendix ¶

A.1. JSON file¶

I have indented this fragment of JSON file, just to understand what is the components. Later it will help me to subset/indexing my needs.

See more in Twitter Dev Website.

The object from the tweepy has an outcome a dictionary called _json, which is what I am looking for.

Status(_api=<tweepy.api.API object at 0x0B288EF0>,
_json={'created_at': 'Wed Dec 26 19:18:12 +0000 2018',
       'id': 1078007085848240129,
       'id_str': '1078007085848240129',
       'text': '@brandimotamedi i hope you love it ☺️',
       'truncated': False,
       'entities': {'hashtags': [],
                    'symbols': [],
                    'user_mentions': [{'screen_name':
                                       'brandimotamedi',
                                       'name': 'Brandi Motamedi',
                                       'id': 344164849,
                                       'id_str': '344164849',
                                       'indices': [0, 15]}],
                    'urls': []},
       'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       'in_reply_to_status_id': 1078006885935067136,
       'in_reply_to_status_id_str': '1078006885935067136',
       'in_reply_to_user_id': 344164849,
       'in_reply_to_user_id_str': '344164849',
       'in_reply_to_screen_name': 'brandimotamedi',
       'user': {'id': 4196983835,
                'id_str': '4196983835',
                'name': 'WeRateDogs™',
                'screen_name': 'dog_rates',
                'location': 'merch  ⇨',
                'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
                'url': 'https://t.co/N7sNNHAEXS',
                'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
                                                      'expanded_url': 'http://weratedogs.com',
                                                      'display_url': 'weratedogs.com',
                                                      'indices': [0, 23]}]},
                                    'description': {'urls': []}},
                'protected': False,
                'followers_count': 7549717,
                'friends_count': 12,
                'listed_count': 5661,
                'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
                'favourites_count': 140650,
                'utc_offset': None,
                'time_zone': None,
                'geo_enabled': True,
                'verified': True,
                'statuses_count': 9467,
                'lang': 'en',
                'contributors_enabled': False,
                'is_translator': False,
                'is_translation_enabled': False,
                'profile_background_color': '000000',
                'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
                'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
                'profile_background_tile': False,
                'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
                'profile_link_color': 'F5ABB5',
                'profile_sidebar_border_color': '000000',
                'profile_sidebar_fill_color': '000000',
                'profile_text_color': '000000',
                'profile_use_background_image': False,
                'has_extended_profile': False,
                'default_profile': False,
                'default_profile_image': False,
                'following': False,
                'follow_request_sent': False,
                'notifications': False,
                'translator_type': 'none'},
       'geo': None,
       'coordinates': None,
       'place': None,
       'contributors': None,
       'is_quote_status': False,
       'retweet_count': 0,
       'favorite_count': 115,
       'favorited': False,
       'retweeted': False,
       'lang': 'en'},
created_at=datetime.datetime(2018, 12, 26, 19, 18, 12),
id=1078007085848240129,
id_str='1078007085848240129',
text='@brandimotamedi i hope you love it ☺️',
truncated=False,
entities={'hashtags': [],
          'symbols': [],
          'user_mentions': [{'screen_name': 'brandimotamedi',
                             'name': 'Brandi Motamedi',
                             'id': 344164849,
                             'id_str': '344164849',
                             'indices': [0, 15]}],
          'urls': []},
source='Twitter for iPhone',
source_url='http://twitter.com/download/iphone',
in_reply_to_status_id=1078006885935067136,
in_reply_to_status_id_str='1078006885935067136',
in_reply_to_user_id=344164849,
in_reply_to_user_id_str='344164849',
in_reply_to_screen_name='brandimotamedi',
author=User(_api=<tweepy.api.API object at 0x0B288EF0>,
            _json={'id': 4196983835,
                   'id_str': '4196983835',
                   'name': 'WeRateDogs™',
                   'screen_name': 'dog_rates',
                   'location': 'merch  ⇨', 'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
                   'url': 'https://t.co/N7sNNHAEXS',
                   'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
                                                         'expanded_url': 'http://weratedogs.com',
                                                         'display_url': 'weratedogs.com',
                                                         'indices': [0, 23]}]},
                                       'description': {'urls': []}},
                   'protected': False,
                   'followers_count': 7549717,
                   'friends_count': 12,
                   'listed_count': 5661,
                   'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
                   'favourites_count': 140650,
                   'utc_offset': None,
                   'time_zone': None,
                   'geo_enabled': True,
                   'verified': True,
                   'statuses_count': 9467,
                   'lang': 'en',
                   'contributors_enabled': False,
                   'is_translator': False,
                   'is_translation_enabled': False,
                   'profile_background_color': '000000',
                   'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
                   'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
                   'profile_background_tile': False,
                   'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                   'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                   'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
                   'profile_link_color': 'F5ABB5',
                   'profile_sidebar_border_color': '000000',
                   'profile_sidebar_fill_color': '000000',
                   'profile_text_color': '000000',
                   'profile_use_background_image': False,
                   'has_extended_profile': False,
                   'default_profile': False,
                   'default_profile_image': False,
                   'following': False,
                   'follow_request_sent': False,
                   'notifications': False,
                   'translator_type': 'none'},
            id=4196983835,
            id_str='4196983835',
            name='WeRateDogs™',
            screen_name='dog_rates',
            location='merch  ⇨',
            description='Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
            url='https://t.co/N7sNNHAEXS',
            entities={'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
                                               'expanded_url': 'http://weratedogs.com',
                                               'display_url': 'weratedogs.com',
                                               'indices': [0, 23]}]},
                             'description': {'urls': []}},
            protected=False,
            followers_count=7549717,
            friends_count=12,
            listed_count=5661,
            created_at=datetime.datetime(2015, 11, 15, 21, 41, 29),
            favourites_count=140650,
            utc_offset=None,
            time_zone=None,
            geo_enabled=True,
            verified=True,
            statuses_count=9467,
            lang='en',
            contributors_enabled=False,
            is_translator=False,
            is_translation_enabled=False,
            profile_background_color='000000',
            profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png',
            profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png',
            profile_background_tile=False,
            profile_image_url='http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
            profile_image_url_https='https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
            profile_banner_url='https://pbs.twimg.com/profile_banners/4196983835/1544368760',
            profile_link_color='F5ABB5',
            profile_sidebar_border_color='000000',
            profile_sidebar_fill_color='000000',
            profile_text_color='000000',
            profile_use_background_image=False,
            has_extended_profile=False,
            default_profile=False,
            default_profile_image=False,
            following=False,
            follow_request_sent=False,
            notifications=False, translator_type='none'),
user=User(_api=<tweepy.api.API object at 0x0B288EF0>,
          _json={'id': 4196983835,
                 'id_str': '4196983835',
                 'name': 'WeRateDogs™',
                 'screen_name': 'dog_rates',
                 'location': 'merch  ⇨',
                 'description': 'Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
                 'url': 'https://t.co/N7sNNHAEXS',
                 'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
                                                'expanded_url': 'http://weratedogs.com',
                                                'display_url': 'weratedogs.com',
                                                'indices': [0, 23]}]},
                                     'description': {'urls': []}},
                             'protected': False,
                             'followers_count': 7549717,
                             'friends_count': 12,
                             'listed_count': 5661,
                             'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
                             'favourites_count': 140650,
                             'utc_offset': None,
                             'time_zone': None,
                             'geo_enabled': True,
                             'verified': True,
                             'statuses_count': 9467,
                             'lang': 'en',
                             'contributors_enabled': False,
                             'is_translator': False,
                             'is_translation_enabled': False,
                             'profile_background_color': '000000',
                             'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
                             'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
                             'profile_background_tile': False,
                             'profile_image_url': 'http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                             'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
                             'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1544368760',
                             'profile_link_color': 'F5ABB5',
                             'profile_sidebar_border_color': '000000',
                             'profile_sidebar_fill_color': '000000',
                             'profile_text_color': '000000',
                             'profile_use_background_image': False,
                             'has_extended_profile': False,
                             'default_profile': False,
                             'default_profile_image': False,
                             'following': False,
                             'follow_request_sent': False,
                             'notifications': False,
                             'translator_type': 'none'},
          id=4196983835,
          id_str='4196983835',
          name='WeRateDogs™',
          screen_name='dog_rates',
          location='merch  ⇨',
          description='Your Only Source For Professional Dog Ratings ⠀ ⠀IG, FB, Snapchat ⇨ WeRateDogs ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Business: dogratingtwitter@gmail.com',
          url='https://t.co/N7sNNHAEXS',
          entities={'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
                                             'expanded_url': 'http://weratedogs.com',
                                             'display_url': 'weratedogs.com',
                                             'indices': [0, 23]}]},
                            'description': {'urls': []}},
          protected=False,
          followers_count=7549717,
          friends_count=12,
          listed_count=5661,
          created_at=datetime.datetime(2015, 11, 15, 21, 41, 29),
          favourites_count=140650,
          utc_offset=None,
          time_zone=None,
          geo_enabled=True,
          verified=True,
          statuses_count=9467,
          lang='en',
          contributors_enabled=False,
          is_translator=False,
          is_translation_enabled=False,
          profile_background_color='000000',
          profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png',
          profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png',
          profile_background_tile=False,
          profile_image_url='http://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
          profile_image_url_https='https://pbs.twimg.com/profile_images/1072659905235042304/nm3HWlPG_normal.jpg',
          profile_banner_url='https://pbs.twimg.com/profile_banners/4196983835/1544368760',
          profile_link_color='F5ABB5',
          profile_sidebar_border_color='000000',
          profile_sidebar_fill_color='000000',
          profile_text_color='000000',
          profile_use_background_image=False,
          has_extended_profile=False,
          default_profile=False,
          default_profile_image=False,
          following=False,
          follow_request_sent=False,
          notifications=False,
          translator_type='none'),
geo=None,
coordinates=None,
place=None,
contributors=None,
is_quote_status=False,
retweet_count=0,
favorite_count=115,
favorited=False,
retweeted=False,
lang='en')

Project 02 - Wrangle and Analyze Data¶

Student Tags¶

Table of Contents¶

Synopsis¶

1. Introduction ¶

1.1. Reproducibility¶

Work envinronment¶

Softwares¶

Packages¶

Repository¶

1.2. Importing Libraries¶

2. Data Gathering ¶

2.2. Downloading Files from Udacity Website¶

2.2.1. Loading the files¶

2.3. Twitter API¶

Sensible Data¶

OAuth Authentication¶

3. Data Assessing ¶

3.1. Objectivies¶

3.2. Methodology¶

3.3. twitter_archive_enhanced.csv¶

.head()¶

.tail()¶

.value_counts()¶

.info()¶

3.4. image_predictions.tsv¶

.head()¶

.tail()¶

.info()¶

.duplicated()¶

3.5. Conjunctural Analysis¶

3.6. Iterative Process¶

4. Data Cleaning ¶

4.1. Copying¶

4.2. Issue ID 13¶

Define¶

Code¶

Test¶

4.3. Issue ID 9 and 10¶

Define¶

Code¶

Test¶

4.4. Issue ID 8¶

Define¶

Code¶

Test¶

4.5. Issue ID 12¶

Define¶

Code¶

Test¶

4.6. Issue ID 14¶

Define¶

Code¶

Test¶

4.7. Issue ID 1¶

Define¶

Code¶

Test¶

4.8. Issue ID 2¶

Define¶

Code¶

Test¶

4.9. Issue ID 6¶

Define¶

Code¶

Test¶

4.10. Issue ID 3 and 4¶

Define¶

Code¶

Test¶

4.11. Issue ID 5¶

Define¶

Code¶

Test¶

4.12. Issue ID 7¶

Define¶

Code¶

Test¶

4.13. Issue ID 11¶

Define¶

`.head()`¶

`.tail()`¶

`.value_counts()`¶

`.info()`¶

`.head()`¶

`.tail()`¶

`.info()`¶

`.duplicated()`¶