Can't push to GitHub because of large file which I already deleted

You added a big dataset to your repository, realised it was too large, deleted it, committed the deletion, and tried to push again, but GitHub still rejects the push. This is one of the most common Git surprises. Deleting a file in a new commit does not remove it from the commits you already made, and a push sends every commit the remote does not yet have, so the large blob still travels to GitHub and gets rejected.

The push fails with a message like this:

$ git push origin master
Enumerating objects: 68, done.
Counting objects: 100% (68/68), done.
Delta compression using up to 8 threads
Compressing objects: 100% (58/58), done.
Writing objects: 100% (58/58), 296.89 MiB | 3.92 MiB/s, done.
Total 58 (delta 24), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (24/24), completed with 7 local objects.
remote: warning: File BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Testing.csv is 88.01 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: See https://gh.io/lfs for more information.
remote: error: File BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv is 352.00 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File lish-moa/train_features.csv is 149.10 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File split_nn.hdf5 is 111.26 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
error: failed to push some refs to 'https://github.com/devharsh/...'

Why deleting the file did not help

Git stores each commit as a snapshot of the whole tree, so every version of every file you committed stays in history. When you delete a file and commit, you only add a new commit that says the file is gone from that point forward. Every earlier commit still contains the full file. Because a push transfers every commit GitHub does not already have, GitHub still receives the oversized blob and refuses it. To fix the push for good, you must rewrite history so that the large file was never committed.
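
To see this behaviour for yourself, here is a minimal sketch that builds a throwaway repository in a temporary directory, commits a 1 MB stand-in for the dataset, and deletes it in a second commit. The file name big.csv and the demo identity are made up for illustration; nothing here touches your real repository.

```shell
# Scratch repository; every name below is hypothetical.
demo=$(mktemp -d)
git init -q "$demo"
git -C "$demo" config user.email "demo@example.com"
git -C "$demo" config user.name "Demo"

# Commit a 1 MB stand-in for the dataset, then delete it in a second commit.
head -c 1048576 /dev/zero > "$demo/big.csv"
git -C "$demo" add big.csv
git -C "$demo" commit -q -m "Add dataset"
git -C "$demo" rm -q big.csv
git -C "$demo" commit -q -m "Remove dataset"

# The working tree no longer contains the file...
ls "$demo"

# ...but the blob is still reachable from the first commit,
# so a push would still have to send it.
git -C "$demo" rev-list --objects --all | grep big.csv
```

The last command still prints a line for big.csv, which is the object GitHub would reject if it were over the limit.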

Important: rewriting history changes commit hashes. Do this on a repository you control, coordinate with collaborators, and make a backup or a fresh clone before you start.
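
Before picking a tool, it can help to list which blobs anywhere in history are oversized, since a file deleted long ago will not show up in the working tree. The sketch below is one way to do that with plain Git plumbing. The function name list_large_blobs and its arguments are my own, not part of Git, and the output shows only blob sizes in bytes and the paths they were committed under.

```shell
# list_large_blobs MIN_BYTES [REPO_DIR]
# Prints "<size-in-bytes> <path>" for every blob in history at least MIN_BYTES big,
# largest first. (Hypothetical helper, not a Git command.)
list_large_blobs() {
  min_bytes=$1
  repo=${2:-.}
  # rev-list emits "<object-id> <path>"; cat-file adds the type and size,
  # and %(rest) carries the path through unchanged.
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '$1 == "blob" && $2 >= min { sub(/^blob /, ""); print }' |
    sort -rn
}
```

For example, list_large_blobs 50000000 lists everything of about 50 MB or more in the current repository. The same check is handy after a rewrite: if it prints nothing, no blob of that size remains in any branch or tag.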

Solution 1: git filter-repo (recommended)

git filter-repo is the tool the Git project itself now recommends for rewriting history. Install it first:

pip install git-filter-repo

Remove a specific file from the entire history:

git filter-repo --path BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv --invert-paths

You can pass several --path options to drop more than one file. Or remove every blob over a size threshold in one go:

git filter-repo --strip-blobs-bigger-than 100M

git filter-repo deliberately removes the origin remote after a rewrite as a safety check, so add it back and force-push:

git remote add origin https://github.com/devharsh/your-repo.git
git push origin --force --all
git push origin --force --tags

Solution 2: BFG Repo-Cleaner

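BFG is a faster, simpler alternative to filter-branch. By default it leaves the contents of your latest commit untouched, so the large files must already be deleted there, which they are in this scenario. Download bfg.jar (it needs Java), then work on a mirror clone: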

git clone --mirror https://github.com/devharsh/your-repo.git
java -jar bfg.jar --strip-blobs-bigger-than 100M your-repo.git
# or target files by name:
java -jar bfg.jar --delete-files split_nn.hdf5 your-repo.git
cd your-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push

Solution 3: git filter-branch (built in, no extra tools)

If you cannot install anything, Git ships with filter-branch. It is slow and officially discouraged, but it works:

git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch BOT-IoT/UNSW_2018_IoT_Botnet_Final_10_best_Training.csv" \
  --prune-empty --tag-name-filter cat -- --all

git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now --aggressive
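git push origin --force --all
# tags were rewritten by --tag-name-filter cat, so push them as well
git push origin --force --tags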

After cleaning: force-push and warn collaborators

Because every commit hash changes, anyone who already cloned the repository must re-clone it or carefully rebase their work onto the new history. If they simply pull, Git merges the old history back in, and their next push reintroduces the large blobs. The safest message to a team is: stop pushing, let me rewrite, then everyone deletes their local copy and clones fresh.

If you genuinely need the large files: Git LFS

Datasets and model files sometimes have to live with the code. Git Large File Storage keeps a small pointer in Git and stores the real file on a separate server. Note that LFS does not retroactively fix history, so you still have to clean the old blobs first using one of the methods above, then:

git lfs install
git lfs track "*.csv"
git lfs track "*.hdf5"
git add .gitattributes
git add BOT-IoT/ lish-moa/ split_nn.hdf5
git commit -m "Track large datasets with Git LFS"
git push origin master
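
Before staging, it may be worth confirming that the tracking patterns match the files you expect, because a file added before its pattern is in .gitattributes goes into Git as a normal blob. The sketch below uses git check-attr for that. It runs in a scratch repository and writes by hand the line that git lfs track "*.csv" appends to .gitattributes, so it works even without Git LFS installed; notes.txt is a made-up file name.

```shell
# Scratch repository so nothing in your real project is touched.
scratch=$(mktemp -d)
git init -q "$scratch"

# This is the line that `git lfs track "*.csv"` appends to .gitattributes.
echo '*.csv filter=lfs diff=lfs merge=lfs -text' > "$scratch/.gitattributes"

# A path matching the pattern reports the lfs filter...
git -C "$scratch" check-attr filter -- train_features.csv

# ...while a path that does not match reports "unspecified".
git -C "$scratch" check-attr filter -- notes.txt
```

In your own repository the equivalent check would be git check-attr filter -- split_nn.hdf5, run after the git lfs track commands and before git add.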

Prevent it next time

Add data directories and large artifacts to .gitignore before your first commit, and check what you are about to commit. A quick guard is to list the biggest files in your working tree:

echo "*.csv"  >> .gitignore
echo "*.hdf5" >> .gitignore
echo "data/"  >> .gitignore

# show the 10 largest files under the current folder (skipping .git and directory totals)
find . -path ./.git -prune -o -type f -exec du -h {} + | sort -rh | head -n 10

GitHub blocks any single file over 100 MB and warns above 50 MB, so keeping datasets out of Git from the start saves you from rewriting history later.
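
A stricter guard is to have Git refuse the commit in the first place. The following is a minimal sketch of a size check that could run from a pre-commit hook. The function name check_staged_sizes and its byte threshold are assumptions of mine, not something Git or GitHub provides, and it does not handle file names containing newlines or characters that Git quotes in its output.

```shell
# check_staged_sizes MAX_BYTES
# Returns non-zero if any staged (added, copied, or modified) file exceeds MAX_BYTES.
# (Hypothetical helper; run it from inside the repository.)
check_staged_sizes() {
  max_bytes=$1
  status=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # ":path" names the staged version of the file in the index.
    size=$(git cat-file -s ":$path" 2>/dev/null) || continue
    if [ "$size" -gt "$max_bytes" ]; then
      echo "too large: $path ($size bytes)" >&2
      status=1
    fi
  done <<EOF
$(git -c core.quotePath=false diff --cached --name-only --diff-filter=ACM)
EOF
  return $status
}
```

Saved as .git/hooks/pre-commit with a #!/bin/sh first line, made executable, and ending in a call such as check_staged_sizes 50000000, this should abort any commit that stages a file above roughly 50 MB. Hooks are local to each clone, so every collaborator would need to install it separately.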

Reference: the original discussion on Stack Overflow: Can't push to GitHub because of large file which I already deleted.
