Can't push to GitHub because of large file which I already deleted
You added a big dataset to your repository, realised it was too large, deleted it, committed the deletion, and tried to push again, but GitHub still rejects the push. This is one of the most common Git surprises. Deleting a file in a new commit does not remove it from the commits you already made, and a push sends every commit GitHub does not yet have, so the large blob still travels to GitHub and gets rejected.
The push fails with a message like this:
$ git push origin master
Enumerating objects: 68, done.
Counting objects: 100% (68/68), done.
Delta compression using up to 8 threads
Compressing objects: 100% (58/58), done.
Writing objects: 100% (58/58), 296.89 MiB | 3.92 MiB/s, done.
Total 58 (delta 24), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (24/24), completed with 7 local objects.
remote: warning: File BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Testing.csv is 88.01 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: See https://gh.io/lfs for more information.
remote: error: File BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv is 352.00 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File lish-moa/train_features.csv is 149.10 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File split_nn.hdf5 is 111.26 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
error: failed to push some refs to 'https://github.com/devharsh/...'
Why deleting the file did not help
Git stores every version of every file as a snapshot in your commit history. When you delete a file and commit, you only add a new commit that says the file is gone from that point forward. Every earlier commit still contains the full file. Because a push transfers all of those commits, GitHub still receives the oversized blob and refuses it. To fix the push for good, you must rewrite history so the large file was never there.
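The effect is easy to reproduce in a throwaway repository. The sketch below is only an illustration: the file name big.csv, its 1 MB size, and the temporary directory are assumptions standing in for a real dataset.

```shell
#!/bin/sh
# Illustrative demo: a file deleted in a later commit is still reachable from history.
# big.csv and the temporary repository are hypothetical stand-ins for a real dataset.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a 1 MB file, then delete it in a second commit.
head -c 1048576 /dev/zero > big.csv
git add big.csv
git commit -q -m "Add dataset"
git rm -q big.csv
git commit -q -m "Delete dataset"

# The working tree no longer contains the file, but the blob is still listed
# among the objects reachable from the branch history.
still_in_history=$(git rev-list --objects --all | grep -c ' big.csv$' || true)
echo "objects in history named big.csv: $still_in_history"
```

The count should come out as 1 even though the file is gone from the working tree, and that surviving blob is what a push would still upload.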
Important: rewriting history changes commit hashes. Do this on a repository you control, coordinate with collaborators, and make a backup or a fresh clone before you start.
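One way to take that backup, sketched here with built-in Git commands only, is to write every branch and tag into a single bundle file. The function name backup_repo and the bundle path are illustrative choices, not part of any tool.

```shell
#!/bin/sh
# Sketch: snapshot all refs of the current repository into one bundle file
# before rewriting history. backup_repo is a hypothetical helper name.
backup_repo() {
  # "$1" is the path of the bundle file to create.
  git bundle create "$1" --all && git bundle verify "$1"
}
```

Run it from inside the repository, for example `backup_repo ../your-repo-backup.bundle`. If the rewrite goes wrong, `git clone ../your-repo-backup.bundle restored-repo` should give back the original history.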
Solution 1: git filter-repo (recommended)
git filter-repo is the tool the Git project itself now recommends for rewriting history. Install it first:
pip install git-filter-repo
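Remove a specific file from the entire history. By default filter-repo refuses to rewrite anything that does not look like a fresh clone; because your unpushed commits exist only in this working copy, take a backup and add --force if it aborts: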
git filter-repo --path BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv --invert-paths
You can pass several --path options to drop more than one file. Or remove every blob over a size threshold in one go:
git filter-repo --strip-blobs-bigger-than 100M
git filter-repo deliberately removes the origin remote after a rewrite as a safety check, so add it back and force-push:
git remote add origin https://github.com/devharsh/your-repo.git
git push --force --all
git push --force --tags
Solution 2: BFG Repo-Cleaner
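BFG is a faster, simpler alternative to filter-branch. Download bfg.jar (it needs Java). BFG's usual recipe starts from a mirror clone of the GitHub repository, but here the oversized commits were never accepted by GitHub and exist only on your machine, so run BFG against a backed-up local repository instead. BFG leaves the contents of your latest commit untouched, which is fine because the deletion is already committed:
java -jar bfg.jar --strip-blobs-bigger-than 100M your-repo
# or target files by name:
java -jar bfg.jar --delete-files split_nn.hdf5 your-repo
cd your-repo
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push origin --force --all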
Solution 3: git filter-branch (built in, no extra tools)
If you cannot install anything, Git ships with filter-branch. It is slow and officially discouraged, but it works:
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv" \
  --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push origin --force --all
After cleaning: force-push and warn collaborators
Because every rewritten commit gets a new hash, anyone who already has the old commits must re-clone the repository or carefully rebase their work onto the new history. If they simply pull and push again, they merge the old history back in and reintroduce the large blobs. The safest message to a team is: stop pushing, let me rewrite, then everyone deletes their local copy and clones fresh.
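Before force-pushing, it can help to confirm that no oversized blob is still reachable from any branch or tag. The helper below is a sketch built from standard Git plumbing; the function name blobs_over is made up for this example, and the threshold is whatever byte count you pass in.

```shell
#!/bin/sh
# Sketch: print "<size-in-bytes> <path>" for every blob in history larger than
# the given number of bytes, biggest first. blobs_over is a hypothetical name.
blobs_over() {
  # rev-list emits "<object-id> <path>"; cat-file looks up each object's type
  # and size and echoes the path back through %(rest).
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$1" '$1 == "blob" && $2 > limit { sub(/^blob /, ""); print }' |
    sort -rn
}
```

Running `blobs_over 100000000` inside the cleaned repository uses a deliberately conservative reading of the 100 MB limit. If it prints nothing, no blob above that size remains in the history a push would send.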
If you genuinely need the large files: Git LFS
Datasets and model files sometimes have to live with the code. Git Large File Storage keeps a small pointer in Git and stores the real file on a separate server. Note that LFS does not retroactively fix history, so you still have to clean the old blobs first using one of the methods above, then:
git lfs install
git lfs track "*.csv"
git lfs track "*.hdf5"
git add .gitattributes
git add BOT-IoT/ lish-moa/ split_nn.hdf5
git commit -m "Track large datasets with Git LFS"
git push origin master
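git lfs track records each pattern in .gitattributes with a filter=lfs attribute, so you can check which paths will go through LFS before committing. The sketch below uses only git check-attr; the helper name is_lfs_tracked is an assumption made for this example.

```shell
#!/bin/sh
# Sketch: succeed if .gitattributes routes the given path through the LFS filter.
# is_lfs_tracked is a hypothetical helper name; run it inside the repository.
is_lfs_tracked() {
  # check-attr prints "<path>: filter: <value>"; the value is "lfs" for
  # patterns added by `git lfs track`.
  git check-attr filter -- "$1" | grep -q ': filter: lfs$'
}
```

For instance, `is_lfs_tracked lish-moa/train_features.csv && echo "goes through LFS"` should print the message once the "*.csv" pattern is tracked, while a path such as README.md should not match.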
Prevent it next time
Add data directories and large artifacts to .gitignore before your first commit, and check what you are about to commit. A quick guard is to list the biggest files in your working tree:
echo "*.csv" >> .gitignore
echo "*.hdf5" >> .gitignore
echo "data/" >> .gitignore
# show the 10 largest files and directories under the current folder
du -ah . | sort -rh | head -n 10
GitHub blocks any single file over 100 MB and warns above 50 MB, so keeping datasets out of Git from the start saves you from rewriting history later.
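A local pre-commit hook can catch an oversized file before it ever enters history. The function below is a minimal sketch, not a standard tool: the name check_staged_sizes, the MAX_BYTES variable, and the 50 MB default are all assumptions you would adjust.

```shell
#!/bin/sh
# Sketch of a pre-commit check: fail if any staged file exceeds a size limit.
# check_staged_sizes and MAX_BYTES are hypothetical names; the default is 50 MB.
# Limitation: paths that Git quotes in its output (unusual characters) are skipped.
check_staged_sizes() {
  max_bytes="${MAX_BYTES:-52428800}"
  offenders=$(
    # Staged paths that were added, copied, or modified.
    git diff --cached --name-only --diff-filter=ACM |
    while IFS= read -r path; do
      # Size of the staged blob, not of the working-tree file.
      size=$(git cat-file -s ":$path" 2>/dev/null || echo 0)
      if [ "$size" -gt "$max_bytes" ]; then
        echo "$path ($size bytes)"
      fi
    done
  )
  if [ -n "$offenders" ]; then
    echo "error: staged files exceed $max_bytes bytes:" >&2
    echo "$offenders" >&2
    return 1
  fi
}
```

To use it, place the function in .git/hooks/pre-commit, add a final line that calls check_staged_sizes, and make the file executable with chmod +x. Hooks are local to each clone, so every contributor has to install it separately.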
Reference: the original discussion on Stack Overflow: Can't push to GitHub because of large file which I already deleted.



