GitHub - jonkeegan/behind-this-website: Checklist for investigating the provenance and ownership of websites.

Who’s behind this website? A Checklist.

By Priyanjana Bengani (@acookiecrumbles) and Jon Keegan (@jonkeegan)
Originally presented at IRE NICAR Conference - March 4, 2022 - Updated March 2024
Slides: English | Russian (earlier version)

Thank you to Svetlana Borodina at Harriman Institute for the Russian translation!

What is this?

This checklist is a reporting tool to help journalists and researchers find out who published a website. It is meant to be used in conjunction with offline reporting techniques.

Following this checklist does not guarantee that you can unmask the owner of a website that does not want to be found, but it can help surface crucial clues and connections that can act as leads for further reporting.

🌟 Strong recommendation: while running through this checklist, create a data diary — it can be a TextEdit doc, a Google Doc, just the Notes app, whatever. It is important to be able to retrace your steps.

Documenting and monitoring

Maintain a data diary with detailed notes about what you’ve looked at and how you got there
Try to create a timeline of the website and how it’s evolving over the course of your investigation
Use Hunchly or screen recordings to keep track of everything you’re doing
Set up Klaxon Cloud or VisualPing to be notified of any changes to a site
Use GitHub Actions and ShotScraper for automated screenshots over time
Archive sites consistently, and in some cases, use multiple archival services (archive.org, archive.is)
For public records or social media posts, take screenshots — some of them might not be archivable
Download videos in case they get taken down. For YouTube videos, use yt-dlp.
Take screenshots with timestamps so you can monitor changes and gather receipts (GoFullPage).
Capturing the full browser window with the URL field helps strengthen your evidence
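
The archiving step can be scripted. Below is a minimal sketch that only builds the request URLs for the Wayback Machine's availability lookup and its "Save Page Now" endpoint; it does not send anything. The function names and the `yourwebsite.com` address are placeholders for illustration, and you should check archive.org's current documentation and rate limits before automating requests.

```python
from urllib.parse import quote, urlencode

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"
WAYBACK_SAVE = "https://web.archive.org/save/"


def availability_url(page_url, timestamp=None):
    """Build a query asking whether the Wayback Machine holds a capture of page_url."""
    params = {"url": page_url}
    if timestamp:
        # Optional YYYYMMDD[hhmmss] hint: the API returns the closest capture.
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABLE + "?" + urlencode(params)


def save_url(page_url):
    """Build the URL that asks the Wayback Machine to capture page_url now."""
    # Keep the characters that make up a normal URL unescaped.
    return WAYBACK_SAVE + quote(page_url, safe=":/?&=")


# Placeholder address, not a real target.
print(availability_url("yourwebsite.com/about", timestamp="20220304"))
print(save_url("https://yourwebsite.com/about"))
```

Pasting either URL into a browser works too; logging the URLs in your data diary alongside the time you requested them keeps the trail reproducible.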

Site Content

Text

Check text fragments from articles, about-us pages, and privacy policies to see if they are unique to the site or duplicated [Use exact string matching on multiple search engines]
Run article text through several detection tools to see if the text might be AI-generated, but note that these tools produce many false positives [GPTZero, OpenAI’s Text Classifier, ContentScale’s AI Detector, CopyLeaks]
Browse site for any names (including bylines), email addresses, phone numbers, addresses, social media handles, and company names
✍️ Are there any authors listed?
- If the site is WordPress, try this wildcard search on Google to reveal the author list: "https://yourwebsite.com/author/*/"
- If the site is WordPress, active(!) and you are allowed to access it(!!), you can try getting the list of users by requesting https://yourwebsite.com/wp-json/wp/v2/users . The list shows names, slugs and Gravatar hashes (MD5 of the email address) for the authors of the site.
- If the site is WordPress, use wpscan to see the theme the site uses as well as the authors, or use builtwith.com for a technology profile.
- Use Bellingcat’s Name Variant Tool to find possible variations on any names.
📫 Are there any e-mail addresses, phone numbers or contact information?
- If there are e-mail addresses, do those share the domain with the website?
- Use tools like Epieos and haveibeenpwned.com to reverse lookup emails and phone numbers: both will show you other services and platforms on which the email address or phone number might exist. TrueCaller also serves as a reverse yellow pages.
- Check to see if there is a Gravatar associated with that address:
  - https://en.gravatar.com/site/check/XXXXX@gmail.com
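
A Gravatar hash seen on a site (for example in a WordPress author avatar URL) can be compared against email addresses you already suspect. This is a rough sketch, assuming the hash is the MD5 of the trimmed, lowercased address as described above; some newer setups may use a different hash, so treat a non-match as inconclusive. The addresses and function names are made up for illustration.

```python
import hashlib


def gravatar_hash(email):
    """Return the MD5 hex digest of a trimmed, lowercased email address."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def first_match(candidates, observed_hash):
    """Return the first candidate email whose hash equals observed_hash, else None."""
    target = observed_hash.strip().lower()
    for email in candidates:
        if gravatar_hash(email) == target:
            return email
    return None


# Hypothetical example: pretend this hash was copied from an avatar URL.
observed = gravatar_hash("jane.doe@example.com")
guesses = ["jdoe@example.com", "Jane.Doe@example.com "]
print(first_match(guesses, observed))
```

The hash cannot be reversed directly; this only confirms or rules out guesses you build from names, bylines and domains found elsewhere on the site.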
🏢 Are there any companies listed?
- If you find company names, use OpenCorporates or LinkedIn to see whether any personnel information is available. OpenCorporates also lets you search by addresses — so you can find who else shares the same office location!
🕑 What’s the server’s local time?
- Look at the datetime attribute of <time> elements on WordPress sites. The timestamp's GMT offset can reveal the time zone the site is configured for: <time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>
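
To check many pages at once, the offsets can be pulled out of saved HTML. The sketch below uses a simple regular expression rather than a full HTML parser, which is an assumption that holds for markup like the example above but may miss unusual attribute quoting. `utc_offsets` is an illustrative name, not part of any tool.

```python
import re
from datetime import datetime

# Matches the datetime attribute of <time> tags written with double quotes.
TIME_TAG = re.compile(r'<time[^>]*\bdatetime="([^"]+)"', re.IGNORECASE)


def utc_offsets(html):
    """Return the UTC offset of every parseable <time datetime="..."> value."""
    offsets = []
    for value in TIME_TAG.findall(html):
        try:
            stamp = datetime.fromisoformat(value)
        except ValueError:
            continue  # Not an ISO 8601 timestamp this parser understands.
        if stamp.utcoffset() is not None:
            offsets.append(stamp.utcoffset())
    return offsets


sample = '<time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>'
for offset in utc_offsets(sample):
    print(offset.total_seconds() / 3600, "hours from UTC")
```

An offset narrows the possibilities but does not identify a location: several regions share each offset, and the site's time zone setting can be changed by whoever administers it.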
🕶 Does the website have a privacy policy or terms and conditions that mentions an LLC, or what regional laws apply?
📡 Does the website have an RSS feed?
- Does the RSS feed give any additional information about authors / stories that aren't visible on the site?
- You can pull RSS article links into Google Sheets using IMPORTFEED
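
Author names in a feed can also be tallied with a few lines of code. This is a sketch under the assumption that the feed is RSS 2.0 and names authors in `<dc:creator>` or `<author>` elements, which is common but not universal. The sample feed and the `feed_authors` name are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace commonly used for <dc:creator> in RSS feeds.
DC = "{http://purl.org/dc/elements/1.1/}"


def feed_authors(rss_xml):
    """Count author names found in the <item> elements of an RSS document."""
    root = ET.fromstring(rss_xml)  # Pass bytes if the feed declares an encoding.
    names = Counter()
    for item in root.iter("item"):
        for tag in (DC + "creator", "author"):
            for node in item.findall(tag):
                if node.text and node.text.strip():
                    names[node.text.strip()] += 1
    return names


SAMPLE = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item><title>Story one</title><dc:creator>Jane Doe</dc:creator></item>
    <item><title>Story two</title><dc:creator>Jane Doe</dc:creator></item>
    <item><title>Story three</title><author>editor@example.com</author></item>
  </channel>
</rss>"""

for name, count in feed_authors(SAMPLE).most_common():
    print(name, count)
```

Compare the names that come back with the bylines shown on the site; a name that appears only in the feed is worth a note in the data diary.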

Features and functionality

🗞 Does the website have a newsletter?
- Check for the physical postal address — required by the CAN-SPAM Act in the US
- If you are allowed to access the site(!) you can try subscribing to the newsletter and examining the headers of the email you receive. Those headers can reveal the IP address of the sending server
💸 Does the website collect donations?
🛒 Does the website have an e-commerce store? Or, does it sell products?
- Try walking through the checkout process (without paying). Sometimes the real payee name is revealed just before you confirm the payment.
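
For the newsletter header check, the relay IP addresses can be listed from a saved raw message. The following is a minimal sketch using Python's standard `email` module; the message below is fabricated and uses documentation-only IP ranges, and `received_ips` is an illustrative name. It only looks for IPv4 addresses.

```python
import re
from email import message_from_string

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def received_ips(raw_email):
    """Return the IPv4 addresses in each Received header, in header order."""
    msg = message_from_string(raw_email)
    return [IPV4.findall(header) for header in msg.get_all("Received", [])]


# Fabricated message; Received headers are listed newest hop first.
RAW = (
    "Received: from mx.example.net (mx.example.net [198.51.100.7])\n"
    "\tby inbox.example.org with ESMTP; Fri, 4 Mar 2022 10:21:45 +0000\n"
    "Received: from web01.example.com (unknown [203.0.113.25])\n"
    "\tby mx.example.net with ESMTP; Fri, 4 Mar 2022 10:21:41 +0000\n"
    "From: newsletter@example.com\n"
    "Subject: Welcome\n"
    "\n"
    "Thanks for subscribing.\n"
)

for hop in received_ips(RAW):
    print(hop)
```

The last Received header in the list is usually the earliest hop. Keep in mind that many newsletters are sent through third-party mailing services, in which case the address belongs to that provider rather than to the site's own server.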

Links

🔗 What domains does the website link to most?
- Use photon or urlscan.io to gather the outbound URLs (the URLs a site links to), as well as some high-level “intel”: who’s the site linking to the most?
- Analyze outbound links, especially those to merch stores, for affiliate links: who’s the affiliate? (Especially useful for health and wellness scams)
❤️ Who links to the domain most often?
- Google search operator: "link:yourwebsite.com"
- Find who’s linking to the website of interest consistently by using a backlink checker (ahrefs.com, Moz) — what’s the relationship between the sites?
Do the links have UTM codes?
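
For a single saved page, the outbound-domain tally and the UTM check can be approximated with the standard library alone. This is a sketch rather than a substitute for photon or urlscan.io: it reads one HTML document, ignores links to the site's own hostname, and treats any query parameter starting with `utm_` as a UTM code. The page content and names below are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def summarize_links(html, base_url):
    """Return (outbound domain counts, list of (link, utm parameters))."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlsplit(base_url).hostname
    domains, utm = Counter(), []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme not in ("http", "https") or parts.hostname == own_host:
            continue  # Skip mailto:, javascript:, and internal links.
        domains[parts.hostname] += 1
        tags = {k: v[0] for k, v in parse_qs(parts.query).items() if k.startswith("utm_")}
        if tags:
            utm.append((link, tags))
    return domains, utm


PAGE = """
<a href="/about">About</a>
<a href="https://shop.example.net/item?utm_source=yourwebsite&amp;utm_medium=banner">Buy</a>
<a href="https://shop.example.net/cart">Cart</a>
<a href="https://social.example.org/profile">Follow</a>
"""
domains, utm = summarize_links(PAGE, "https://yourwebsite.com/")
print(domains.most_common())
print(utm)
```

UTM values such as `utm_source` or `utm_campaign` sometimes carry a campaign or partner name that is reused across otherwise unrelated sites, which makes them worth searching for verbatim.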

Photos, images and documents

📸 Are there author photos?
- Use reverse image search to see if the same images appear elsewhere
- Check sensity.ai to see if the image is GAN-generated
- Read more about spotting GAN-generated images here.
🔎 Do the images have EXIF data?
- Instructions here.
👀 Do the images have any other identifying information?
- Run through the list here
🪣 Where are the images hosted?
- If on AWS S3, the bucket name can be revealing — or you might find the bucket isn’t secure.
📄 Are there PDFs hosted on the site?
- On a search engine, "filetype:pdf site:<yourwebsite.com>"
- If you find some, check the metadata with "Get Info" in your PDF viewer.
🕛 Are there old archived images on Wayback Machine?
- The "URLs" page may surface deleted images; filter on "image/" to narrow the search.
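
The same image filter can be expressed as a query against the Wayback Machine's CDX server, which is convenient when a site has thousands of captures. The sketch below only builds the query URL and a playback URL; the parameter choices (`matchType=domain`, collapsing on `urlkey`, a limit of 50) are assumptions to adjust, and the function names are illustrative.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def archived_images_query(domain, limit=50):
    """Build a CDX query listing archived captures whose MIME type is an image."""
    params = [
        ("url", domain),
        ("matchType", "domain"),            # Include subdomains.
        ("filter", "mimetype:image/.*"),    # Regex on the mimetype field.
        ("collapse", "urlkey"),             # One row per distinct URL.
        ("fl", "timestamp,original,mimetype"),
        ("output", "json"),
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def playback_url(timestamp, original):
    """Build the Wayback Machine URL for one capture returned by the query."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


# Placeholder domain and capture values.
print(archived_images_query("yourwebsite.com"))
print(playback_url("20220304102140", "https://yourwebsite.com/images/team.jpg"))
```

Each row of the JSON response supplies a timestamp and original URL that can be passed to `playback_url` to view the archived image, including ones no longer present on the live site.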

Social Media

If there are any social media profiles mentioned on the site, they are worth investigating.

👤 Are there any social media accounts in the <meta> section of the HTML?
📅 When were the individual accounts created? Does it line up with the site history?
📊 What platform has the biggest reach?
📣 Is the messaging different across platforms?
📇 Do they have completely distinct account names across social media platforms or are they more-or-less the same?
- Note: just because you find the same account name across platforms doesn’t necessarily mean they belong to the same person!
🔎 Use tools like sherlock and Blackbird that will scan multiple platforms to see if the same handle appears elsewhere — you’ll still have to confirm that it’s the same user and not just the same handle.
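
Social accounts declared in a page's <meta> tags can be extracted from saved HTML. This is a small sketch; the set of keys it looks for (Twitter card and Open Graph style names) is an assumption about what a given site might use, so extend it as needed. The sample markup and handle are invented.

```python
from html.parser import HTMLParser

# Meta names/properties that commonly point at social accounts (not exhaustive).
SOCIAL_KEYS = {
    "twitter:site", "twitter:creator",
    "article:publisher", "article:author",
    "fb:app_id", "fb:pages",
}


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags whose name or property is social-related."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("name") or attr.get("property") or "").lower()
        if key in SOCIAL_KEYS and attr.get("content"):
            self.found.setdefault(key, []).append(attr["content"])


def social_meta(html):
    """Return a dict mapping social meta keys to the values found in html."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.found


HEAD = """<head>
<meta name="twitter:site" content="@example_handle">
<meta property="article:publisher" content="https://www.facebook.com/examplepage" />
<meta name="description" content="An example site">
</head>"""
print(social_meta(HEAD))
```

Identifiers such as `fb:app_id` are worth searching for across other sites, since a shared value can suggest a common operator; confirm any such link through other evidence.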

Facebook

On the Facebook profile, go to Page Transparency:

☎️ Is there an address and phone number for the page?
⏪ Does the page history reveal a different name?
- Has the page shifted topics?
🐣 When was the Facebook page created?
Is the page running any groups?
🗳 Has the page run any ads? Has the page run political ads?
🤖 Does Facebook flag any ‘related pages’ for the given page? Rely on Facebook’s algorithms to find connections!

Other platforms

Don't forget to check whether the site has accounts on other platforms such as YouTube, Instagram, Reddit, and GitHub.

Infrastructure

Resources & Tools

Books

Open Source Intelligence Techniques - Michael Bazzell https://inteltechniques.com/book1.html

Verification Handbook - edited by Craig Silverman https://datajournalism.com/read/handbook/verification-3

Website Infrastructure

Blacklight: The Markup's real-time website privacy inspector.
builtwith.com: gives you the infrastructure of the site, including IP addresses, analytics codes, tech stack, etc. Freemium model.
DNSDBScout: lets you run standard and ‘flexible’ searches of passive DNS data, including IP <-> domain mapping.
Dnslytics: offers a range of tools including reverse Analytics and reverse DNS lookups, as well as WHOIS data. Freemium.
RiskIQ: a ‘threat intelligence’ tool that allows you to get reverse IP, reverse analytics, WHOIS, SSL, subdomains, etc.
Whoxy: a tool that lets you see historical WHOIS registrations. Free.
The Internet Archive browser extension.

Social Media Accounts

Sensity AI: check if an image is GAN-generated or not. Freemium.
whotwi.com: create a profile-at-a-glance for any account on Twitter. Free.
