/*
the twitter api is stupid. it is stupid and bad and expensive. hence, this.

Just paste this into the JS console on the bookmarks tab and the script will automatically scroll to the bottom of your bookmarks, keeping track of them as it goes.
When finished, it downloads a JSON file containing the raw text content of every bookmark.

For now it stores just the text inside the tweet itself, but if you're reading this, why don't you go ahead and try to also store the other information (author, tweetLink, pictures, everything)? come on. do it. please?
*/
let tweets = []; // Holds the text of every tweet seen so far
const scrollInterval = 1000; // Milliseconds between scroll steps
const scrollStep = 5000; // Pixels to scroll on each step
let previousTweetCount = 0;
let unchangedCount = 0;

const scrollToEndIntervalID = setInterval(() => {
  window.scrollBy(0, scrollStep);
  const currentTweetCount = tweets.length;
  if (currentTweetCount === previousTweetCount) {
    unchangedCount++;
    if (unchangedCount >= 2) { // Stop once the count has been unchanged for 2 consecutive checks
      console.log('Scraping complete');
      console.log('Total tweets scraped: ', tweets.length);
      console.log('Downloading tweets as JSON...');
      clearInterval(scrollToEndIntervalID); // Stop scrolling
      observer.disconnect(); // Stop observing DOM changes
      downloadTweetsAsJson(tweets); // Download the tweets list as a JSON file
    }
  } else {
    unchangedCount = 0; // Reset the counter if new tweets were added
  }
  previousTweetCount = currentTweetCount; // Update the previous count for the next check
}, scrollInterval);

function updateTweets() {
  document.querySelectorAll('[data-testid="tweetText"]').forEach(tweetElement => {
    const tweetText = tweetElement.innerText; // Extract text content
    if (!tweets.includes(tweetText)) { // Skip tweets whose text is already in the array
      tweets.push(tweetText); // Add the new tweet's text to the array
      console.log("tweets scraped: ", tweets.length);
    }
  });
}

// Initially populate the tweets array
updateTweets();

// Create a MutationObserver to observe changes in the DOM
const observer = new MutationObserver(mutations => {
  mutations.forEach(mutation => {
    if (mutation.addedNodes.length) {
      updateTweets(); // Call updateTweets whenever new nodes are added to the DOM
    }
  });
});

// Start observing the document body for child list changes
observer.observe(document.body, { childList: true, subtree: true });

function downloadTweetsAsJson(tweetsArray) {
  const jsonData = JSON.stringify(tweetsArray); // Convert the array to JSON
  const blob = new Blob([jsonData], { type: 'application/json' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = 'tweets.json'; // Specify the file name
  document.body.appendChild(link); // Append the link to the document
  link.click(); // Programmatically click the link to trigger the download
  document.body.removeChild(link); // Clean up and remove the link
}
I made some updates to the X scraping script to improve interaction data extraction and accuracy. A future update could include extracting the URL of images or videos within the tweet.
1. Comprehensive Interaction Data Extraction: The script now captures detailed interaction metrics for each tweet, including replies, reposts, likes, bookmarks, and views, by filtering for the relevant aria-label attributes.
2. Robust Parsing Logic: Introduced a new function, extractInteractionDataFromString, which uses regular expressions to parse interaction metrics from the aria-label text, ensuring accurate numeric conversion of interaction data.
3. Keyword-Based Numeric Extraction: Added an auxiliary function, extractNumberForKeyword, to extract the numerical value associated with each specific interaction term, enhancing the script's parsing capability.
4. Refined Tweet Uniqueness Check: Improved the logic for identifying new tweets, reducing duplicates and ensuring a cleaner dataset.
5. Maintained JSON Export Functionality: Preserved the functionality to compile and download the scraped data as a JSON file, facilitating easy data export and analysis.
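To give an idea of what points 2 and 3 describe, here is a minimal sketch of the two helpers. The exact aria-label wording ("3 replies, 52 likes, 12,345 views" and so on) is an assumption about X's current markup, which can change at any time, so treat the keywords and regex as illustrative rather than as the actual functions from the linked gist:

```javascript
// Hedged sketch: pull a comma-grouped number that appears right before a
// keyword, e.g. "1,024 reposts" -> 1024. Returns 0 if the keyword is absent.
function extractNumberForKeyword(text, keyword) {
  const match = text.match(new RegExp(`([\\d,]+)\\s+${keyword}`));
  return match ? parseInt(match[1].replace(/,/g, ''), 10) : 0;
}

// Hedged sketch: map one aria-label string to an interaction-metrics object.
// Keyword prefixes ("repl" matches both "reply" and "replies") are assumptions.
function extractInteractionDataFromString(ariaLabel) {
  return {
    replies:   extractNumberForKeyword(ariaLabel, 'repl'),
    reposts:   extractNumberForKeyword(ariaLabel, 'repost'),
    likes:     extractNumberForKeyword(ariaLabel, 'like'),
    bookmarks: extractNumberForKeyword(ariaLabel, 'bookmark'),
    views:     extractNumberForKeyword(ariaLabel, 'view'),
  };
}
```

Parsing the aria-label is attractive because X renders abbreviated counts ("1.2K") in the visible buttons but keeps full numbers in the accessibility text.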
A huge thanks to @gd3kr and @premrajnarkhede for your insights and contributions that inspired these enhancements!
https://gist.github.com/luighifeodrippe/6e67c74bf5123ee0763cc0dd78b9411d
Is it possible to export the links of the tweets as well?
could you make a tutorial how to use it?
Is it possible to export the links of the tweets as well?
@typicallyze you can add the following to the updateTweets function in @premrajnarkhede 's script and it'll get you the tweet link as well:
const link = tweetElement.querySelector('a[href*="/status/"]').href;
Then, add link to the object that is being pushed to tweets (under replies) and you'll be good to go.
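One caveat about that one-liner: querySelector returns null when a tweet card has no /status/ link, so chaining .href directly will throw and abort the whole forEach pass. A defensive variant (the helper name getTweetLink is my own, not from the script above) could look like:

```javascript
// Hedged sketch: return the tweet's permalink, or null if the card has no
// /status/ anchor, instead of throwing a TypeError on querySelector's null.
function getTweetLink(tweetElement) {
  return tweetElement.querySelector('a[href*="/status/"]')?.href ?? null;
}
```

Using optional chaining here means a single malformed card is skipped rather than silently ending the scrape for every tweet after it.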
could you make a tutorial how to use it?
It's pretty simple @TarasShu .
- Go to your bookmarks page on Twitter.
- Press Command + Option + C to open the Inspector.
- Switch to the Console tab near the top of the Inspector.
- Paste in the full script from above and press Enter.
- You'll get tweets.json downloaded onto your machine.
Is it possible to export the links of the tweets as well?
Hello! With the help of ChatGPT, I combined the code in this repository with the comments so that it also collects links. I hit some errors along the way and fed those back into ChatGPT too; here is the final code that works. Some tweets were still not collected, and I don't know why, but the code may be useful for development.
I don't fully understand this code myself, and I would be glad if someone could tell me how to collect all tweets and verify that 100% of them were collected.
let tweets = []; // Holds one object per scraped tweet
const scrollInterval = 1000; // Milliseconds between scroll steps
const scrollStep = 5000; // Pixels to scroll on each step
let previousTweetCount = 0;
let unchangedCount = 0;

const scrollToEndIntervalID = setInterval(() => {
  window.scrollBy(0, scrollStep);
  const currentTweetCount = tweets.length;
  if (currentTweetCount === previousTweetCount) {
    unchangedCount++;
    if (unchangedCount >= 2) { // Stop once the count has been unchanged for 2 consecutive checks
      console.log('Scraping complete');
      console.log('Total tweets scraped: ', tweets.length);
      console.log('Downloading tweets as JSON...');
      clearInterval(scrollToEndIntervalID); // Stop scrolling
      observer.disconnect(); // Stop observing DOM changes
      downloadTweetsAsJson(tweets); // Download the tweets list as a JSON file
    }
  } else {
    unchangedCount = 0; // Reset the counter if new tweets were added
  }
  previousTweetCount = currentTweetCount; // Update the previous count for the next check
}, scrollInterval);

function updateTweets() {
  document.querySelectorAll('article[data-testid="tweet"]').forEach(tweetElement => {
    const authorNameElement = tweetElement.querySelector('[data-testid="User-Name"]');
    const handleElement = tweetElement.querySelector('[role="link"]');
    const tweetTextElement = tweetElement.querySelector('[data-testid="tweetText"]');
    const timeElement = tweetElement.querySelector('time');
    const retweetsElement = tweetElement.querySelector('[data-testid="retweet"]');
    const likesElement = tweetElement.querySelector('[data-testid="like"]');
    const repliesElement = tweetElement.querySelector('[data-testid="reply"]');
    const linkElement = tweetElement.querySelector('a[href*="/status/"]');

    // Skip tweets that are missing any of the required elements
    if (authorNameElement && handleElement && tweetTextElement && timeElement && retweetsElement && likesElement && repliesElement && linkElement) {
      const authorName = authorNameElement.innerText;
      const handle = handleElement.href.split('/').pop();
      const tweetText = tweetTextElement.innerText;
      const time = timeElement.getAttribute('datetime');
      const retweets = retweetsElement.innerText;
      const likes = likesElement.innerText;
      const replies = repliesElement.innerText;
      const link = linkElement.href;

      const isTweetNew = !tweets.some(tweet => tweet.tweetText === tweetText);
      if (isTweetNew) {
        tweets.push({
          authorName,
          handle,
          tweetText,
          time,
          retweets,
          likes,
          replies,
          link
        });
        console.log("tweets scraped: ", tweets.length);
      }
    }
  });
}

// Initially populate the tweets array
updateTweets();

// Create a MutationObserver to observe changes in the DOM
const observer = new MutationObserver(mutations => {
  mutations.forEach(mutation => {
    if (mutation.addedNodes.length) {
      updateTweets(); // Call updateTweets whenever new nodes are added to the DOM
    }
  });
});

// Start observing the document body for child list changes
observer.observe(document.body, { childList: true, subtree: true });

function downloadTweetsAsJson(tweetsArray) {
  const jsonData = JSON.stringify(tweetsArray); // Convert the array to JSON
  const blob = new Blob([jsonData], { type: 'application/json' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = 'tweets.json'; // Specify the file name
  document.body.appendChild(link); // Append the link to the document
  link.click(); // Programmatically click the link to trigger the download
  document.body.removeChild(link); // Clean up and remove the link
}
Use this to get all the other details. Thanks @gd3kr for the amazingly simple code!
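One possible cause of the missed tweets mentioned above: the combined script deduplicates by tweetText, so two distinct bookmarks with identical text keep only the first. A sketch of deduplicating by the numeric ID in the /status/ URL instead (the helper names here are my own, not from any of the scripts above):

```javascript
// Hedged sketch: track tweets by their unique status ID rather than by text.
const seenIds = new Set();

// Extract the numeric ID from a permalink like https://x.com/user/status/123
function tweetIdFromLink(link) {
  const match = link.match(/\/status\/(\d+)/);
  return match ? match[1] : null;
}

// True exactly once per unique tweet ID; false for repeats or missing IDs.
function isNewTweet(link) {
  const id = tweetIdFromLink(link);
  if (id === null || seenIds.has(id)) return false;
  seenIds.add(id);
  return true;
}
```

Swapping the tweets.some(...) check for isNewTweet(link) would keep identical-text bookmarks while still filtering the re-renders that the MutationObserver delivers.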